Broadly describe the High Throughput Sequencing Workflow
A list of genes of “interest” produced by an omics experiment such as RNA-seq is mostly meaningless on its own: gene identifiers are opaque, and we’re generally interested in understanding phenomena at the cellular and/or organismal level, rather than the gene level. To do this, we must abstract from the genes to their functions, or whatever other aspect we’re interested in studying (e.g., chromosome locations, transcription regulation networks). In order to abstract to the functional level, we need functional descriptions of the genes in a consistent manner, i.e., we need all functional aspects to be described in the same manner for all genes that have those aspects. In other words, we need genes to be annotated using a functional classification scheme. Moreover, this scheme should typically be organized hierarchically: the fine-grained gene functions generally occur in only a few genes, so further abstraction is required in order to enable integration of our gene set.
There are several suitable functional classification schemes available for genes, covering different aspects and/or levels of gene function. For instance, the Enzyme Commission (EC) classification covers individual enzymatic function, whereas KEGG covers metabolic pathways. The Gene Ontology (GO) is the broadest, most detailed, and most widely used classification scheme for gene function, and consequently also the most commonly used for the analysis of differential expression results.
GO is divided into three major functional aspects, called GO types: molecular function, which covers individual gene functions; biological process, which covers how gene functions integrate into cellular and/or organismal processes; and cellular component, which covers where gene functions take place. Each of these aspects is organized as a directed acyclic graph, which is essentially a relaxed hierarchy with multi-parenting. In addition to subclass (‘is a’) relation, GO includes other relations such as ‘part of’, ‘occurs in’, and ‘regulates’. While the three aspects of GO are ‘is a’ orthogonal, molecular functions can be ‘part of’ biological processes, and both can ‘occur’ in cellular components.
Genes can be (directly) annotated to multiple GO terms, even within the same aspect. Furthermore, according to the true path rule, a gene annotated to a GO term is implicitly annotated to all ancestors of that term. For instance, a gene annotated with ‘cytochrome c oxidase activity’ is inherently annotated with ‘catalytic activity’, ‘electron carrier activity’, ‘transporter activity’, and all other GO terms in the path between them.
GO annotations of genes are available on an individual basis in most genetic databases, as well as in dedicate GO browsers such as AmiGO and QuickGO. They can also be downloaded on a genome-wide scale from GO’s annotation repository or BioMart. Viewing the annotations of your gene set on an individual gene basis is unfeasible and insufficient: there are too many genes to analyze manually and integrate, and even if you could, this doesn’t tell you how significant the patterns you find are.
Task: Go to the Gene Ontology download page and download the GO in OBO format (right-click + save, otherwise the file opens in your browser). Download also the GO annotation file for Drosophila melanogaster.
Task: Go to BioMart and download the GO annotations for Drosophila melanogaster and Mus musculus. Select “ENSEMBL Genes” database, then “Mouse Genes”, then under Attributes, select “Gene stable ID” (remove the Transcript ID that is selected by default), and from the “External” section, “GO term accession”.
Having gene functions allows us to find patterns in our gene set of interest, but we need to assess how statistically significant such patterns are in order to substantiate any inference of meaning. We can do this using enrichment analysis.
Enrichment analysis is the application of statistical tests (usually the one-tailed Fisher’s exact test) to ascertain whether a sample set of entities is enriched in relation to the overall population with respect to particular features of interest. By enriched, we mean that the sample frequency of the feature is greater than would be expected by chance given the population frequency.
The one-tailed Fisher’s exact test, a.k.a hypergeometric test for over-representation, is based on the hypergeometric distribution, which is used to determine probabilities in sampling events without replacement. More concretely, the hypergeometric distribution measures the probability of obtaining k successes in a sample of n elements, given a population with K successes in N total elements. For example, we could use it to determine the probability of drawing 4 Aces in a Poker hand of 5 cards (4 successes in a sample of 5, given 4 successes in a population of 52) which is a mere 0.002%. For Fisher’s test, we want to measure the probability of getting at least k successes, meaning we have to sum hypergeometric probabilities from x=k up to x=min(n,K).
Functional enrichment analysis is the application of enrichment analysis to ‘omics gene lists, which can be considered samples of the genome (or the genes covered by the experiment), with the features of interest being gene functional categories (such as GO terms). It enables us to assess how meaningful statistically are the functional patterns we observe when going from the gene level to the functional level.
Some care is needed when defining the sample and the population sets of genes:
- The sample can be all differentially expressed genes, only the under-expressed, or only the over-expressed, depending on the biological question being addressed. It may make sense to perform enrichment analysis with all three options, as each gives you a different insight into your dataset. The genes we consider differentially expressed can simply be those with significant p-values in the differential expression test, but we can also be more strict and enforce at least 2-fold expression differences.
- The population should only include genes observed in the experiment, for which presence or absence from the sample could be accurately determined. In RNA-seq experiments, it should not be the whole genome, but only the set of genes with meaningful expression observed in any of the conditions being compared. How we define “meaningful” expression is also subject to interpretation: we can simply exclude genes with no counts, or be stricter and exclude those with very small counts as well.
- The sample must be a subset of the population, so we should first apply whatever criteria we deem fit to select the population, then select the sample from within that selection.
Some care is also needed when counting the frequencies and sizes of the sample and population:
- The sample and population frequencies (k and K) should be the number of genes in the respective set annotated with the GO term being tested or any of its descendants (according to the true path rule).
- The sample and population sizes (n and N) should be the total number of genes in the respective set that have at least one GO annotation under the GO type of the term being tested. That is to say, we should only count genes that have determined functions, as otherwise we cannot be certain whether they should be counted as successes for the test.
Enrichment analysis can only be used to compare a sample with the overall population whence it was extracted. It cannot be used to compare different samples directly. However, you can compare the functional enrichment results of different samples, and thus gauge the differences between them at the functional level. To assess whether two samples are statistically functionally different, you would need a different kind of statistical test, but that is beyond the scope of this course.
Aside: correcting for multiple testing
Most of our statistical tests − including Fisher’s exact test − rely on controlling type I errors. When we accept an event/observation as significant because it has a p-value of 0.001, we are accepting that statistically, one time in a thousand, we’ll be wrong − the event/observation in question will be the product of chance alone. This is a problem when we perform multiple related tests, as the chance of getting a statistical “extreme” in at least one of them will be greater the more tests we perform. Because GO enrichment analysis relies on performing hundreds (or sometimes thousands) of Fisher’s tests, we must correct the statistics for multiple testing.
There are two families of multiple test corrections: the family-wise error rate (FWER) and the false discovery rate (FDR). In the former, we control the probability of making at least one false discovery, which is a conservative but “safe” approach that produces corrected p-values. In the latter, we control the ratio of false discoveries, which is a more powerful but less “safe” approach. It produces q-values, which indicate the ratio of false discoveries you are accepting if you reject the null hypothesis.
GO functional enrichment analysis tools
There are several GO enrichment analysis tools available, for instance:
- Webtools: GOrilla, GO’s own tool
- Galaxy/Command Line tools: GOEnrichment (IGC Galaxy); Ontologizer (Galaxy test toolbox)
- R tools: gsea, GOstats, topGO
For some of these tools, you have to provide the version of GO and the GO annotations you want to use, in addition to the population and sample sets of genes. For other, you only need to provide the latter, and identify the organism in question. The first set of tools should be preferred, as they give you greater control over what you’re testing with. In either case, the sample and population frequencies will be computed automatically by the tool. The tool should exclude from both sets the genes that don’t have GO annotations (of the GO type being tested) and perform the tests independently for each GO type, but not all tools do this.
Task: Perform functional enrichment analysis using the GOEnrichment tool in Galaxry, on the differential expression results from the Trapnell dataset. Use the sample file and Drosophila annotation file in the functional_enrichment folder and the GO OBO file you downloaded earlier. For now, do not define a background population set. Run the tool with “summarize output” off and otherwise default options.
Question: What did you obtain when running the GOEnrichment tool?
Click here to see the answer
You obtain one table and one graph for each of the 3 sub-ontologies of GO (Molecular Function, Biological Process and Celular Component). The table contains the GO terms that occur in the sample and their Fisher's test p-value and FDR correction, ordered by adjusted p-value. It also contains the genes from the study set that are annotated with the terms. Finally, the graph provides a visual summary of this table, contextualized with GO relationships, with a color scale where intensity represents the degree of enrichment.
Question: Are there significantly enriched terms at 0.01 significance without multiple test corrections? And with the correction?
Click here to see the answer
Yes, we see several enriched terms in all categories, even after correction.
Question: Aren’t the differentially expressed genes in Trapnell supposed to be random? How do you explain that you obtained enriched functional terms, even after correction?
Click here to see the answer
In the Trapnell study they used as reference a real expression dataset to make their in silico expression experiment. Therefore, the genes were chosen randomly from a set of expressed genes in a given real experimental setting, and NOT from the full set of genes in the Drosophila genome. This result demonstrates the importance of choosing an appropriate background.
Task: Repeat the GOEnrichment run, but now use the population set provided, which contains only genes that have a numeric adjusted FDR (ie, not ‘NA’) or non-zero base expression. You should now have very few or no enriched functional terms.
NOTE: Assess how well you achieved the learning outcome by asking yourself the following questions:
Do you understand the concept of enrichment analysis and the underlying statistical test?
Could you define a population set and a sample set of genes from an RNA-seq experiment, after differential expression tests have been applied?
Why do we need to correct for multiple tests?
What is the difference between a p-value, a corrected p-value, and a q-value?
What we can get out of functional enrichment analysis results hinges heavily on what we put into them, i.e., on the biological context of our experiment, and the biological question(s) we are seeking to address. The clearer the question, the more straightforward it should be to interpret the results. In general, functional enrichment analysis can be used for:
- Validation (e.g., of a protocol for extracting membrane proteins)
- Characterization (e.g., of the effects of a stress in an organism)
- Elucidation (e.g., of the functions impacted by the knock-out of a transcription factor)
With respect to the results, it is essential to keep in mind that statistically significant does not mean biologically meaningful. On the one hand, we can have functional enrichment of functional aspects that seem too broad to derive any relevant conclusion, or that appear to be unrelated to the experiment in question. You should look at these with a critical eye − there may be some underlying meaning that is not readily apparent, even in the case of very generic terms such as “binding” or “single organism process”. On the other hand, aspects that are too specific may not be very interesting. In the extreme case of a biological process associated with a single gene in a given organism, if that gene appears in the study set, it is likely to be statistically enriched (if the study set is relatively small in comparison with the population set), but that doesn’t give us any insight into the study set as a whole. In general, we’re interested in GO terms that are sufficiently generic to integrate a significant part of our dataset, but sufficiently specific to give us some conclusive insights. Despite multiple test corrections, sporadic outliers may occur. After all, we’re making a statistical test (of enrichment) on top of another (of differential expression) which in turn is preceded by a statistical normalization. Even though we’re comfortable with the assumptions and p-values in each individual step, the likelihood of error propagates across the steps, and even fine differences in each step can affect the final results. You should also keep in mind that enrichment analysis is qualitative, rather than quantitative: you are treating genes as either “on” or “off” (be “on” differentially expressed, overexpressed, or underexpressed) and consequently only assessing which functional aspects are statistically affected, rather than by how much they are affected.
The p-values (or q-values) are mostly meaningless for interpreting the results—their role is just to identify enriched terms. The sample frequency and the semantics of the term (its definition, its place in the ontology’s structure) are the aspects you should draw upon for biological interpretation. While the former can be checked on results tables, the latter requires graph views of the results. Because of GO’s hierarchical structure, we may get related enriched terms with different levels of specificity, and we should consider them together as a cluster when drawing conclusions. These clusters may not be readily apparent from a results table, but are easy to detect in a graph view of the results.
The size and complexity of GO often result in huge results tables and graphs (particularly for biological process), which be quite overwhelming to analyze. We can reduce the number of tests perfomed by excluding unnecessary tests:
- Singletons, i.e., terms that occur in a single sample gene, may be enriched but are not very useful for understanding our dataset in an integrated manner, and thus can generally not be tested.
- Testing a superclass when its sample frequency is the same as its subclass’s is redundant and unnecessary. The superclass can only be enriched if the subclass also is, and we gain neither specificity nor integration by testing the superclass. Reducing the number of tests performed has the added benefit of reducing the multiple-test correction factor, which in the case of redundant tests would be artificially inflated (since the tests are fully dependent).
We can also reduce the number of tests performed by using GO slims, which are “trimmed” version of GO with only broad terms, instead of using the full GO. Different GO slims are available for different taxa. Using GO slims will greatly simplify the results. However, it will also lead to substantial loss in specificity, and the enriched terms may be more generic than we would like. Furthermore, using GO slims requires first converting the GO annotations from full GO to GO slim (which you can do using GOSlimmer in Galaxy).
Instead of using GO slims, we may simplify/summarize the enrichment analysis results a posteriori, using one of three strategies:
- The family-based clustering algorithm integrated into GOEnrichment which reduces complexity while keeping branch information, but loses some specificity -The semantic similarity-based REVIGO tool, which not only loses specificity but may merge branches
- An ad hoc filter, as long as it is not based on the enrichment analysis p-values, but rather on fixed features of GO (such as vertical or horizontal cuts of the ontology). Here, the rationale is that your initial enrichment analysis is exploratory, aimed at discovering what features are in your dataset, and you are refining it to focus on the feature(s) that matter to you.
Task: Pick up the differential expression results file for mouse brain vs. heart. Create a population file and two sample files, one for overexpressed genes and another for underexpressed ones.
Task: Run GOEnrichment in Galaxy using both sample files, as you did previously, using the Mouse annotation file you got from BioMart earlier. Run the program with the “Summarize Output” parameter set to off (and otherwise default parameters), then analyze the biological process results tables and graph files.
Question: Which was the order of the tissues in the original differential expression test (was it brain-heart or heart-brain)?
Click here to see the answer
The enrichment of terms like "modulation of synaptic transmission" in the underexpressed genes is a dead givaway for brain tissue, and means that the order of the differential expression test was heart-brain.
Task: Repeat the previous run with the underexpressed genes, but this time with “Summarize Output” set to on. Then download the generic GO Slim, use the GOSlimmer tool in Galaxy to convert the mouse annotations from GO to GO slim, and repeat the GOEnrichment run using GO Slim and the corresponding annotations (set “Summarize Output” to off again, even though it shouldn’t make much difference). Compare the BP graph files between these two alternatives and with the non-summarized results.
Question: Could you still identify that this sample corresponds to brain tissue with both simplification alternatives?
Click here to see the answer
Despite the low number of enriched terms with the GO Slim, the enrichment of "nervous system process" is sufficient to identify this sample as brain tissue.
Question: How do the results compare in terms of simplicity and specificity?
Click here to see the answer
While the GO Slim would enable validation of the type of tissue, the results are clearly too vague to make any deeper interpretation. The summarized results using the full ontology offer a good trade-off between simplicity and specificity: the BP graph is small enough to analyze as a whole but rich enough to allow interpretation beyond validation.
Task: Analyze the GO enrichment result graphs for some single cell clusters, which are included in the functional_enrichment/single_cell folder. Look at both the BP and the CC graph files for each cluster.
Question: Can you match the cell types to the GO enrichment results? The list of possible cell types in the dataset is: B cell, Dendritic cell, Macrophage, Neutrophil, NK cell, Stromal cell, T cell. Feel free to browse the wikipedia for information on these cell types if you are not familiarized with them.
Click here to see the answer
F = T cell; Z = Macrophage; X = B cell; Q = Stromal cell; E = Dendritic cell; T = Neutrophil; N = NK cell
(In F, "T Cell" specific BP and CC terms are a giveaway for T Cells; in Z, "phagocytosis" clearly suggests macrophages; in X, despite the misleading "positive regulation of T Cell activation" in the BP graph, the "B cell receptor complex" in the CC graph clearly identifies B Cells; in Q, the abundance of terms related to collagen and cell adhesion is a giveaway for stromal cells; in E, "positive regulation of adaptive immune response" and "response to external biotic stimulus" point towards dendritic cells; in T, several "neutrophil" BP terms leave no room for doubt that these are neutrophils; in N, the "natural killer cell" specific BP term clearly identifies these as NK cells).
Back to main page.