4 - Do simple processing operations on the raw data to improve its quality

In most cases, particularly if you’re sequencing short, single-end reads, the quality of your raw data is good enough to continue without any preprocessing. In fact, if you outsource your sequencing to an external facility, they often do these checks and filtering for you, and you receive “clean” sequences in the end. Nonetheless, it is always better to check before proceeding.

Sometimes things can go wrong, and you may need to do something about it. Some types of problems, such as the presence of contaminants or some instances of positional bias, will require you to go back and redo the experiments. Other issues can be minimized.

4.1 - Remove low quality bases from your reads

As you may have noticed before, reads tend to lose quality towards their end, where there is a higher probability of erroneous bases being called.

QUESTION: If every base called by a fictitious machine had Q=20 (1% probability of error), what would be the probability that a 100bp read from that machine is completely correct?

ANSWER: Each base is correct with probability 0.99, so, assuming errors are independent, P(correct) = (0.99)^100 ≈ 36.6%.

This illustrates that many reads from current sequencing machines are likely to contain at least one incorrect base.
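The same calculation can be repeated for other quality values. The snippet below is a minimal sketch that assumes every base in the read has the same Phred quality and that errors are independent; the function name prob_error_free is only illustrative.

```python
def prob_error_free(read_length, phred_q):
    """Probability that every base in a read is correct.

    Assumes all bases share the same Phred quality and that errors
    at different positions are independent.
    """
    # Phred definition: Q = -10 * log10(P_error)
    p_error = 10 ** (-phred_q / 10)
    return (1 - p_error) ** read_length


# Compare a few quality values for a 100bp read
for q in (20, 30, 40):
    print(f"Q={q}: {prob_error_free(100, q):.1%} of reads expected to be error-free")
```

Real reads do not have uniform quality, so this is only a rough way to build intuition about how quickly errors accumulate along a read.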

To avoid problems in subsequent analysis, you should remove regions of poor quality from your reads, usually by trimming them from the end of the reads using tools such as Trimmomatic. Like FastQC, Trimmomatic is a Java program, so you can use it on any operating system (such as Windows and Mac), although unlike FastQC it can only be run from the command line.

Like Trimmomatic, most software for the analysis of HTS data is freely available to users. Nonetheless, these tools often require the use of the command line (frequently only in a Unix-like environment). User-friendly desktop software such as CLC or Ugene is available, but given the quick pace of development in this area, such software quickly becomes outdated. Moreover, even with better algorithms, HTS analysis must often be run on external servers due to the heavy computational requirements. One popular tool is Galaxy, which allows even non-expert users to execute many different HTS analysis programs through a simple web interface. There are public instances of Galaxy where you can run your bioinformatics analysis (eg. https://usegalaxy.org, https://usegalaxy.eu). For the purpose of this course, we will run Galaxy instances installed locally on the classroom workstations. These contain only the tools necessary to run the exercises for this course, but otherwise work very much like any other Galaxy installation. After the course, you can use the public servers, or we can set up a machine for you with Galaxy and the necessary tools.

TASK: Let’s use Galaxy to run Trimmomatic. Open the web browser (eg. Firefox). Type localhost:8080 in the address bar (where you enter web addresses). This means that you are accessing a Galaxy instance that is running on your local machine. You should see the Galaxy interface in your web browser. The available tools are listed in the left panel, and you can search for tools by name. Search for trimmomatic in the tool search bar. Click on the tool Trimmomatic to see the options for running it.

QUESTION: What different operations can you perform on the reads with Trimmomatic?

ANSWER: You can perform the following operations with Trimmomatic (either in isolation or in combination):


   ILLUMINACLIP: Cut adapter and other Illumina-specific sequences from the read
   SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold
   MINLEN: Drop the read if it is below a specified length
   LEADING: Cut bases off the start of a read, if below a threshold quality
   TRAILING: Cut bases off the end of a read, if below a threshold quality
   CROP: Cut the read to a specified length
   HEADCROP: Cut the specified number of bases from the start of the read
   AVGQUAL: Drop the read if the average quality is below a specified value
   MAXINFO: Trim reads adaptively, balancing read length and error rate to maximise the value of each read
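To make the SLIDINGWINDOW operation more concrete, here is a simplified sketch of the idea: scan the read from the start and cut as soon as the average quality inside a window drops below the threshold. This is an illustration only and not Trimmomatic's actual implementation, which differs in details such as where exactly the cut is placed. The function names and the example read are made up, and qualities are assumed to be Phred+33 encoded.

```python
def decode_phred33(quality_string):
    """Convert a Phred+33 encoded quality string into a list of integers."""
    return [ord(character) - 33 for character in quality_string]


def sliding_window_cut(qualities, window=4, threshold=20):
    """Return how many bases to keep from the start of the read.

    Simplified rule (an assumption of this sketch): cut at the start of
    the first window whose average quality is below the threshold.
    Reads shorter than the window are kept whole.
    """
    for start in range(len(qualities) - window + 1):
        window_qualities = qualities[start:start + window]
        if sum(window_qualities) / window < threshold:
            return start
    return len(qualities)


# Made-up read: six good bases (Q=40, 'I') followed by four bad ones (Q=2, '#')
sequence = "ACGTACGTAC"
quality = "IIIIII####"

keep = sliding_window_cut(decode_phred33(quality), window=4, threshold=20)
print(sequence[:keep], quality[:keep])
```

Note that a single bad base does not necessarily trigger a cut: the window average smooths over isolated low-quality positions, which is the reason for using a window rather than a per-base threshold.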

TASK: Upload into Galaxy the fastq files from the fastq_examples folder (click on the upload icon at the top left of the interface). After uploading, you should see them in your history in the right panel. You can view their content by clicking the view data icon (the eye icon). In Galaxy, use Trimmomatic to remove low quality bases from sample_quality_and_adaptors using the default method (a 4bp window average, with a threshold of Q=20). Finally, look at the impact by running FastQC (also in Galaxy) on the trimmed reads.

Hint: When uploading, Galaxy will try to guess the type of your files, but you can also specify the file type explicitly. For example, by default Galaxy will uncompress files when uploading, but if you specify that your files are fastqsanger.gz, it will keep them compressed, saving disk space.

QUESTION: What was the impact of running Trimmomatic?

ANSWER: The base quality improved significantly. Nonetheless, several sequences became shorter due to the trimming. Some became very short, and it may even be impossible to use them in the remainder of the analysis. Therefore, it is common to remove sequences that fall below a certain length (eg. 36bp).
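The length filter described here (the MINLEN operation in Trimmomatic) is conceptually very simple. The sketch below reads FASTQ records and drops those shorter than 36bp. It assumes the common layout of exactly four lines per record, and the helper names and example reads are invented for illustration.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from a FASTQ stream.

    Assumes exactly four lines per record (no wrapped sequences).
    """
    while True:
        header = handle.readline()
        if not header:  # an empty string means end of file
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header.rstrip("\n"), sequence, quality


def drop_short_reads(records, min_length=36):
    """Keep only the records whose sequence has at least min_length bases."""
    return [record for record in records if len(record[1]) >= min_length]


# Two made-up reads: one of 40 bases, one trimmed down to 10 bases
example = io.StringIO(
    "@read_long\n" + "A" * 40 + "\n+\n" + "I" * 40 + "\n"
    + "@read_short\n" + "C" * 10 + "\n+\n" + "I" * 10 + "\n"
)

kept = drop_short_reads(read_fastq(example), min_length=36)
print([header for header, _, _ in kept])
```

For a real compressed file you could pass a handle opened with gzip.open(path, "rt") instead of the in-memory example.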

TASK: Although Galaxy is much simpler to use, it hides some details from you (sometimes important ones), so if you can, you should invest in learning the detailed command line usage of the tool, particularly if you use it very often. On the command line, use Trimmomatic to remove low quality bases from sample_quality_and_adaptors. Type the following command:

TrimmomaticSE -phred33 sample_quality_and_adaptors.fastq.gz sample_quality_and_adaptors.trimmed.fastq.gz SLIDINGWINDOW:4:20

You should now have a new file with the trimmed reads. Confirm the effect of Trimmomatic by running FastQC on the file with the trimmed reads.

4.2 - Remove adaptors and other artefactual sequences from your reads

Sequencing machines often require that you add specific sequences (adaptors) to your DNA so that it can be sequenced. Although sequencing facilities will generally remove these from the reads, for many different reasons, such sequences may end up in your reads, and you will need to remove them yourself. Moreover, cDNAs may contain parts of the non-genomic polyA tails that are part of mature mRNAs. Since these sequences are not part of the genome, they may prevent proper alignment and need to be removed before proceeding.

To remove these unwanted sequences, you not only have to look for the sequence in the reads, but also allow for sequencing errors, as well as for the presence of incomplete sequences. Another issue when removing adaptors is that you need to know which ones were used in your data. Since Illumina is used most of the time, its adaptors are already integrated in tools like Trimmomatic, which also take into consideration issues like reverse complements.
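The following sketch illustrates the two difficulties mentioned above: tolerating sequencing errors (mismatches) and detecting an adaptor that is only partially present at the end of a read. It is a deliberately naive illustration and not the algorithm Trimmomatic uses. The function name, the mismatch rate, the minimum overlap and the example reads are all assumptions chosen for the example; the adaptor string is the start of a commonly used Illumina adaptor sequence.

```python
def clip_adapter(read, adapter, max_mismatch_rate=0.1, min_overlap=3):
    """Return the read truncated where the adaptor starts.

    At each position, the adaptor (or only its beginning, if the read
    ends first) is compared with the read. The position is accepted if
    the fraction of mismatching bases is within max_mismatch_rate.
    """
    for start in range(len(read)):
        overlap = min(len(adapter), len(read) - start)
        if overlap < min_overlap:
            break  # too few bases left to call an adaptor reliably
        mismatches = sum(
            1
            for read_base, adapter_base in zip(read[start:start + overlap], adapter)
            if read_base != adapter_base
        )
        if mismatches <= max_mismatch_rate * overlap:
            return read[:start]
    return read


adapter = "AGATCGGAAGAGC"
insert = "ACGTACGTACGT"  # made-up genuine sequence

print(clip_adapter(insert + adapter, adapter))          # full adaptor
print(clip_adapter(insert + "AGATCGGTAGAGC", adapter))  # adaptor with one error
print(clip_adapter(insert + "AGATCG", adapter))         # incomplete adaptor
print(clip_adapter(insert, adapter))                    # no adaptor present
```

All four calls print the same 12 bases. Lowering min_overlap removes more adaptor remnants but also increases the chance of clipping genuine sequence that matches the first bases of the adaptor by chance.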

TASK: In Galaxy, use Trimmomatic to remove adaptors from sample_adaptors.fastq.gz using TruSeq3 adaptors (for this you need to select to perform an initial Illumina clip, then select the appropriate database of adaptors) and use FastQC to see the impact. Note: although TruSeq3 mentions paired-end, you can also use these adaptors for single-end data.

QUESTION: What was the impact of running Trimmomatic?

ANSWER: There was no effect on the quality of the sequences, since the original per base quality was already very good. Most reads are now only 36bp long, and the adaptor sequences are no longer present.

Hint: In Trimmomatic you can also use your own sequences as custom adaptors, for example if you use uncommon adaptors or if you want to remove polyA tails or other artefactual sequences.

TASK: As you noticed, you can use Trimmomatic to do both quality and adaptor trimming. In Galaxy, use Trimmomatic to remove low quality bases from sample_quality_and_adaptors.fastq, as well as the remnants of Illumina Nextera adaptors that are still left in some of the reads.

QUESTION: What was the impact of running Trimmomatic?

ANSWER: The base quality of the sequences improved, and the few adaptor sequences were also removed.

Paired-end data need to be handled with special care. Reads may be removed entirely if their quality is very bad (eg. if you use the MINLEN parameter in Trimmomatic). This can result in pairing information being lost, if the other member of the pair is not also removed (or placed in a special set of unpaired sequences). Software such as Trimmomatic can also take paired data as input, and handle them properly.
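The bookkeeping involved can be sketched as follows. When run on paired data, Trimmomatic writes four output files: paired and unpaired reads for each of R1 and R2. The snippet mimics that split for a length filter only; the function name split_pairs and the example reads are made up, and the two input lists are assumed to hold the mates in the same order.

```python
def split_pairs(r1_reads, r2_reads, min_length=36):
    """Split mates into paired and unpaired sets after a length filter.

    r1_reads[i] and r2_reads[i] are assumed to be mates of the same
    fragment, already trimmed.
    """
    result = {"paired_r1": [], "paired_r2": [], "unpaired_r1": [], "unpaired_r2": []}
    for read1, read2 in zip(r1_reads, r2_reads):
        keep1 = len(read1) >= min_length
        keep2 = len(read2) >= min_length
        if keep1 and keep2:
            # Both mates survive: they stay together, in the same order
            result["paired_r1"].append(read1)
            result["paired_r2"].append(read2)
        elif keep1:
            result["unpaired_r1"].append(read1)
        elif keep2:
            result["unpaired_r2"].append(read2)
        # If neither mate survives, the pair is dropped entirely
    return result


# Three made-up pairs: both fine, R2 too short, R1 too short
r1 = ["A" * 50, "A" * 50, "A" * 20]
r2 = ["C" * 50, "C" * 10, "C" * 40]

outputs = split_pairs(r1, r2, min_length=36)
for name, reads in outputs.items():
    print(name, len(reads))
```

The key property is that the two paired outputs always have the same number of reads in the same order, which is what downstream aligners rely on.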

TASK: Use Trimmomatic with the 20150821.A-2_BGVR_P218 paired-end example RNA-Seq data with the default SLIDINGWINDOW parameter, as well as a MINLEN of 36. Also remove adaptors (use TruSeq3 paired-end adaptors). Use FastQC to evaluate the impact of the procedure. If you run Trimmomatic on each file individually, you will lose the pairing information. Therefore, you need to provide the data to Trimmomatic as a pair. Notice that, besides the paired fastq files, you also obtain unpaired reads that lost their pair.

QUESTION: Which one has more reads - unpaired R1, or unpaired R2? Why is that?

ANSWER: The unpaired R1 file has many more reads than the unpaired R2 file. This is because R2 reads are usually of lower quality and are therefore removed more often, leaving their R1 mates without a pair.

NOTE: Assess how well you achieved the learning outcome. For this, see how well you responded to the different questions during the activities and also ask yourself the following questions.

Do you understand the process of quality trimming of a fastQ file?
Could you use Trimmomatic to improve the base quality of a fastQ file?
Do you understand the process of removing adaptors and other artefactual sequences from a fastQ file?
Could you use Trimmomatic to remove adaptors from a fastQ file?
Do you understand the potential issues of quality filtering and adaptor trimming in paired-end data?
