9 - Perform simple functional enrichment analysis and understand the concepts involved

9.1 - How to extract meaning from a list of genes

A list of genes of “interest” produced by an omics experiment such as RNA-seq is mostly meaningless on its own: gene identifiers are opaque, and we’re generally interested in understanding phenomena at the cellular and/or organismal level, rather than the gene level. To do this, we must abstract from the genes to their functions, or whatever other aspect we’re interested in studying (e.g., chromosome locations, transcription regulation networks). In order to abstract to the functional level, we need functional descriptions of the genes in a consistent manner, i.e., we need all functional aspects to be described in the same manner for all genes that have those aspects. In other words, we need genes to be annotated using a functional classification scheme. Moreover, this scheme should typically be organized hierarchically: the fine-grained gene functions generally occur in only a few genes, so further abstraction is required in order to enable integration of our gene set.

There are several suitable functional classification schemes available for genes, covering different aspects and/or levels of gene function. For instance, the Enzyme Commission (EC) classification covers individual enzymatic function, whereas KEGG covers metabolic pathways. The Gene Ontology (GO) is the broadest, most detailed, and most widely used classification scheme for gene function, and consequently also the most commonly used for the analysis of differential expression results.

GO is divided into three major functional aspects, called GO types: molecular function, which covers individual gene functions; biological process, which covers how gene functions integrate into cellular and/or organismal processes; and cellular component, which covers where gene functions take place. Each of these aspects is organized as a directed acyclic graph, which is essentially a relaxed hierarchy with multi-parenting. In addition to subclass (‘is a’) relation, GO includes other relations such as ‘part of’, ‘occurs in’, and ‘regulates’. While the three aspects of GO are ‘is a’ orthogonal, molecular functions can be ‘part of’ biological processes, and both can ‘occur’ in cellular components.

Genes can be (directly) annotated to multiple GO terms, even within the same aspect. Furthermore, according to the true path rule, a gene annotated to a GO term is implicitly annotated to all ancestors of that term. For instance, a gene annotated with ‘cytochrome c oxidase activity’ is inherently annotated with ‘catalytic activity’, ‘electron carrier activity’, ‘transporter activity’, and all other GO terms in the path between them.

GO annotations of genes are available on an individual basis in most genetic databases, as well as in dedicate GO browsers such as AmiGO and QuickGO. They can also be downloaded on a genome-wide scale from GO’s annotation repository or BioMart. Viewing the annotations of your gene set on an individual gene basis is unfeasible and insufficient: there are too many genes to analyze manually and integrate, and even if you could, this doesn’t tell you how significant the patterns you find are.

Task: Go to the Gene Ontology download page and download the GO in OBO format (right-click + save, otherwise the file opens in your browser). Download also the GO annotation file for Drosophila melanogaster.

Task: Go to BioMart and download the GO annotations for Drosophila melanogaster and Mus musculus. Select “ENSEMBL Genes” database, then “Mouse Genes”, then under Attributes, select “Gene stable ID” (remove the Transcript ID that is selected by default), and from the “External” section, “GO term accession”.

9.2 - Understand the concept of functional enrichment analysis, and the statistics involved

Enrichment analysis is the application of statistical tests (usually the one-tailed Fisher’s exact test) to ascertain whether a sample set of entities is enriched in relation to the overall population with respect to particular features of interest. By enriched, we mean that the sample frequency of the feature is greater than would be expected by chance given the population frequency.

The one-tailed Fisher’s exact test, a.k.a hypergeometric test for over-representation, is based on the hypergeometric distribution, which is used to determine probabilities in sampling events without replacement. More concretely, the hypergeometric distribution measures the probability of obtaining k successes in a sample of n elements, given a population with K successes in N total elements. For example, we could use it to determine the probability of drawing 4 Aces in a Poker hand of 5 cards (4 successes in a sample of 5, given 4 successes in a population of 52) which is a mere 0.002%. For Fisher’s test, we want to measure the probability of getting at least k successes, meaning we have to sum hypergeometric probabilities from x=k up to x=min(n,K).

Functional enrichment analysis is the application of enrichment analysis to ‘omics gene lists, which can be considered samples of the genome (or the genes covered by the experiment), with the features of interest being gene functional categories (such as GO terms). It enables us to assess how meaningful statistically are the functional patterns we observe when going from the gene level to the functional level.

Some care is needed when defining the sample and the population sets of genes:

The sample can be all differentially expressed genes, only the under-expressed, or only the over-expressed, depending on the biological question being addressed. It may make sense to perform enrichment analysis with all three options, as each gives you a different insight into your dataset. The genes we consider differentially expressed can simply be those with significant p-values in the differential expression test, but we can also be more strict and enforce at least 2-fold expression differences.
The population should only include genes observed in the experiment, for which presence or absence from the sample could be accurately determined. In RNA-seq experiments, it should not be the whole genome, but only the set of genes with meaningful expression observed in either of the conditions being compared. How we define “meaningful” expression is also subject to interpretation: we can simply exclude genes with no counts, or be stricter and exclude those with very small counts as well.
The sample must be a subset of the population, so we should first apply whatever criteria we deem fit to select the population, then select the sample from within that selection.

Some care is also needed when counting the frequencies and sizes of the sample and population:

The sample and population frequencies (k and K) should be the number of genes in the respective set annotated with the GO term being tested or any of its descendants (according to the true path rule).
The sample and population sizes (n and N) should be the total number of genes in the respective set that have at least one GO annotation under the GO type of the term being tested. That is to say, we should only count genes that have determined functions, as otherwise we cannot be certain whether they should be counted as successes for the test.

Most of our statistical tests − including Fisher’s exact test − rely on controlling type I errors. When we accept an event/observation as significant because it has a p-value of 0.001, we are accepting that statistically, one time in a thousand, we’ll be wrong − the event/observation in question will be the product of chance alone. This is a problem when we perform multiple related tests, as the chance of getting a statistical “extreme” in at least one of them will be greater the more tests we perform. Because GO enrichment analysis relies on performing hundreds (or sometimes thousands) of Fisher’s tests, we must correct the statistics for multiple testing.

There are two families of multiple test corrections: the family-wise error rate (FWER) and the false discovery rate (FDR). In the former, we control the probability of making at least one false discovery, which is a conservative but “safe” approach that produces corrected p-values. In the latter, we control the ratio of false discoveries, which is a more powerful but less “safe” approach. It produces q-values, which indicate the ratio of false discoveries you are accepting if you reject the null hypothesis.

There are several GO enrichment analysis tools available, for instance:

Webtools: GOrilla, GO’s own tool
Galaxy/Command Line tools: GOEnrichment (IGC Galaxy); Ontologizer (Galaxy test toolbox)
R tools: gsea, GOstats, topGO

For some of these tools, you have to provide the version of GO and the GO annotations you want to use, in addition to the population and sample sets of genes. For other, you only need to provide the latter, and identify the organism in question. The first set of tools should be preferred, as they give you greater control over what you’re testing with. In either case, the sample and population frequencies will be computed automatically by the tool. The tool should exclude from both sets the genes that don’t have GO annotations (of the GO type being tested) and perform the tests independently for each GO type, but not all tools do this.

Task: Picking up the differential expression results from the Trapnell et al example with 300 random differentially expressed Drosophila melanogaster genes, define the sample set and population set of genes for performing functional enrichment analysis. You can use your own EdgeR or DESeq results for this dataset instead of the published differential expression results. You can do the filtering and selection of genes either in a spreadsheet or in Galaxy.

Task: Using the sample and population files you generated in the previous task, as well as the GO file and Drosophila melanogaster GO annotation file you downloaded earlier, perform functional enrichment analsysis using the GOEnrichment tool in Galaxy, with “Summarize Output” set to off and otherwise default options. Are there significantly enriched terms at 0.01 significance without multiple test corrections? And with the correction?

NOTE: Assess how well you achieved the learning outcome by asking yourself the following questions:

Do you understand the concept of enrichment analysis and the underlying statistical test?
Could you define a population set and a sample set of genes from an RNA-seq experiment, after differential expression tests have been applied?
Why do we need to correct for multiple tests?
What is the difference between a p-value, a corrected p-value, and a q-value?

9.3 - Interpreting the results of functional enrichment analysis

What we can get out of functional enrichment analysis results hinges heavily on what we put into them, i.e., on the biological context of our experiment, and the biological question(s) we are seeking to address. The clearer the question, the more straightforward it should be to interpret the results. In general, functional enrichment analysis can be used for:

Validation (e.g., of a protocol for extracting membrane proteins)
Characterization (e.g., of the effects of a stress in an organism)
Elucidation (e.g., of the functions impacted by the knock-out of a transcription factor)

With respect to the results, it is essential to keep in mind that statistically significant does not mean biologically meaningful. On the one hand, we can have functional enrichment of functional aspects that seem too broad to derive any relevant conclusion, or that appear to be unrelated to the experiment in question. You should look at these with a critical eye − there may be some underlying meaning that is not readily apparent, even in the case of very generic terms such as “binding” or “single organism process”. On the other hand, aspects that are too specific may not be very interesting. In the extreme case of a biological process associated with a single gene in a given organism, if that gene appears in the study set, it is likely to be statistically enriched (if the study set is relatively small in comparison with the population set), but that doesn’t give us any insight into the study set as a whole. In general, we’re interested in GO terms that are sufficiently generic to integrate a significant part of our dataset, but sufficiently specific to give us some conclusive insights. Despite multiple test corrections, sporadic outliers may occur. After all, we’re making a statistical test (of enrichment) on top of another (of differential expression) which in turn is preceded by a statistical normalization. Even though we’re comfortable with the assumptions and p-values in each individual step, the likelihood of error propagates across the steps, and even fine differences in each step can affect the final results. You should also keep in mind that enrichment analysis is qualitative, rather than quantitative: you are treating genes as either “on” or “off” (be “on” differentially expressed, overexpressed, or underexpressed) and consequently only assessing which functional aspects are statistically affected, rather than by how much they are affected.

The p-values (or q-values) are mostly meaningless for interpreting the results—their role is just to identify enriched terms. The sample frequency and the semantics of the term (its definition, its place in the ontology’s structure) are the aspects you should draw upon for biological interpretation. While the former can be checked on results tables, the latter requires graph views of the results. Because of GO’s hierarchical structure, we may get related enriched terms with different levels of specificity, and we should consider them together as a cluster when drawing conclusions. These clusters may not be readily apparent from a results table, but are easy to detect in a graph view of the results.

The size and complexity of GO often result in huge results tables and graphs (particularly for biological process), which be quite overwhelming to analyze. We can reduce the number of tests perfomed by excluding unnecessary tests:

Singletons, i.e., terms that occur in a single sample gene, may be enriched but are not very useful for understanding our dataset in an integrated manner, and thus can generally not be tested.
Testing a superclass when its sample frequency is the same as its subclass’s is redundant and unnecessary. The superclass can only be enriched if the subclass also is, and we gain neither specificity nor integration by testing the superclass. Reducing the number of tests performed has the added benefit of reducing the multiple-test correction factor, which in the case of redundant tests would be artificially inflated (since the tests are fully dependent).

We can also reduce the number of tests performed by using GO slims, which are “trimmed” version of GO with only broad terms, instead of using the full GO. Different GO slims are available for different taxa. Using GO slims will greatly simplify the results. However, it will also lead to substantial loss in specificity, and the enriched terms may be more generic than we would like. Furthermore, using GO slims requires first converting the GO annotations from full GO to GO slim (which you can do using GOSlimmer in Galaxy).

Instead of using GO slims, we may simplify/summarize the enrichment analysis results a posteriori, using one of three strategies:

The family-based clustering algorithm integrated into GOEnrichment which reduces complexity while keeping branch information, but loses some specificity -The semantic similarity-based REVIGO tool, which not only loses specificity but may merge branches
An ad hoc filter, as long as it is not based on the enrichment analysis p-values, but rather on fixed features of GO (such as vertical or horizontal cuts of the ontology). Here the rationale is that your initial enrichment analysis is exploratory, aimed at discovering what features there are in your dataset, and you are refining it to focus on the feature that matter to you.

Task: Pick up the differential expression results file for mouse brain vs. heart. Create a population file and two sample files, one for overexpressed genes and another for underexpressed ones

Task: Run GOEnrichment in Galaxy using both sample files, as you did previously, using the Mouse annotation file you got from BioMart earlier. Run the program with the “Summarize Output” parameter set to off (and otherwise default parameters), then analyze the biological process results tables and graph files.

Task: Repeat the previous runs, but this time with “Summarize Output” set to on. Compare the results tables and graph files.

Task: Download the generic GO Slim. Use the GOSlimmer tool in Galaxy to convert the mouse annotations from GO to GO slim. Then repeat the GOEnrichment runs using GO Slim and the corresponding annotations. How do the results compare in terms of simplicity and specificity?

Back

Back to main page.