7.1 - The process of generating gene counts from genome aligments

To perform differential expression analysis we need to count, for each sample, how many times a different transcript/gene is read. If we align directly against the transcriptome, we just need to count the number of alignments per gene/transcript. However, if there are many alternative transcripts, aligning will become difficult. One solution may be to use just one representative transcript, or the union of all transcripts to represent the gene, although this also has issues.

What is most often done is to align against the genome, and compare the alignments (SAM/BAM) against the gene annotation (as GTF or BED). We could consider that a read counts to a gene if it overlaps with any part of the gene, but in large mammalian genomes, genes can have large introns, and it is not rare that genes overlap with each other. Moreover, the presence of DNA contamination and immature RNAs may also influence the counts. It is usually preferable that a read will count for a gene only if it overlaps to at least some part corresponding to a valid mRNA transcribed from that gene. Then, if we have strand information, we should use it to resolve other possible ambiguities.

There are stil other factors to take in consideration. What to do if a read maps equally well to multiple genome regions? This will now depends a bit on the behavior on the alignment software. Usually, these cases are marked as having a low mapping quality, so we can simply ignore them by excluding alignments with a low mapping quality. But by ignoring these cases we’re losing information, and in the case of large genomes with a lot of large duplicated regions, this can be problematic. Again, if we want to use this information, we need to take into consideration what the aligner software will do. For example, bwa randomly attributes a read to one of the sites, while hisat outputs all alignmens (up to a given limit of k equally good ones).

Some counting tools will actually use the information that a read aligns to different places to estimate the likelihood that a read belongs to one or the other, depending on the local (unique) coverage. This is in fact the type of approach Salmon uses to attribute reads to transcripts. Salmon does not output an exact number of reads per transcript, but the sum of the likelihoods of reads belonging to it (eg. a read may have 60% likelihood of belonging to a transcript, and thus will count not as 1, but as 0.6).

Finally, how to avoid PCR artifacts? To be as safe as possible, we would remove duplicates to avoid PCR artifacts, and this frequently needs to be done before the counting process. Nonetheless, given that duplicates can be frequent in RNA-Seq, usually we do not remove them. Assuming that PCR artifacts occur randomly, then we should not have the same artifact in different biological replicates. In any case, for genes that are very important to us, we should always also visually check the alignments using software such as IGV.



7.2 - Use tools such as featureCounts to generate tables of gene counts

Featurecounts is a tool to generate gene counts from SAM/BAM alignments and GFF/GTF gene annotations. Its default behavior is to generate counts at the gene level. It assigns a read to a gene if it unambiguously overlaps at least one part of a cDNA produced by the gene (namely, exons). It ignores reads mapping equally well to multiple positions.

TASK: In galaxy, use featureCounts with all samples of the guilgur dataset (which are unstranded) and the sample gtf file (not the complete one you downloaded) as the annotation (leave all the other parameters as default, although take note of the options you have available).

QUESTION: What are the read counts for gene Rpn12R (Fbgn0036465) in all the guilgur samples?

Click Here to see the answer

  • WT Lib1: 0
  • WT Lib2: 3
  • mut Lib1: 672
  • mut Lib2: 734


TASK: In the commandline, run the command: featureCounts -a Drosophila_melanogaster.BDGP6.85.sample.gtf -o mut_lib1_R1.feature.counts mut_lib1_R1.bam. Run featurecounts -h to see all the options.


TASK: Use featureCounts to generate tables of counts for the Trapnell dataset (now using the full annotation).



7.3 - Use Salmon to generate counts with only the transcriptome

As mentioned previously, Salmon directly matches the raw reads against a fasta with the known transcriptome, directly generating a table of “counts”. Since it assigns reads to transcripts probabilistically, the result is sometimes not an integer, but a fractional number.


TASK:Like for the other aligners, Salmon also needs to create an index. In the terminal, run the command salmon index --transcripts Drosophila_melanogaster.BGP6.88.sample.cdna.fa --index Drosophila_melanogaster.BGP6.88.sample.cdna.salmon. Next, run the alignment using the command salmon quant --index Drosophila_melanogaster.BGP6.88.sample.cdna.salmon -l A -r mut_lib1_R1.fq.gz -o mut_lib1_R1.salmon.counts.

QUESTION: What is the result you obtain? (There should be a folder called mut_lib1_R1.salmon.counts. Inside the folder, there should be a file “quant.sf”. Open that file with a text editor or spreadsheet.)

Click Here to see the answer

You obtain a table of counts, but for each transcript. The counts are fractional numbers. You also have normalized counts (per million reads), and information on the "real" transcript length and an "effective" length that can be used for normalization, which takes into account several biases.


TASK: In Galaxy, run Salmon with the guilgur data against the sample transcriptome (Drosophila_melanogaster.BGP6.88.sample.cdna.fa). To obtain gene level counts, you also need to use a table converting transcripts to genes (Drosophila_melanogaster.BDGP6.88.sample.cdna.tr_to_gene.tab). Salmon will then obtain gene counts by merging transcript counts. Notice that no SAM/BAM is generated.

QUESTION: What are the salmon read counts for gene Rpn12R (Fbgn0036465) in all the guilgur samples?

Click Here to see the answer

  • WT Lib1: 0
  • WT Lib2: 2
  • mut Lib1: 687
  • mut Lib2: 776



NOTE: Assess how well you achieved the learning outcome. For this, see how well you responded to the different questions during the activities and also make the following questions to yourself.

  • Do you understand the concept of obtaining a table of read counts per gene

  • Do you understand the different choices you have when counting?

  • Could you use featureCounts to generate a table of counts?

  • Could you use Salmon to produce a table of counts by using the fastq files against the transcripts?


Back

Back to main page.