4. Taxonomic profiling
When doing a taxonomic classification, we want to find out what kind of organisms the sample contains. Who is there?
You will continue to work with the data that you have preprocessed in Module2. The DNA was isolated from the gut of premature human infants. The microbial diversity in the gut of premature infants is therefore very low.
The aim of this assignment is to perform taxonomic classification of organisms present in the dataset using several different methods/tools.
In this exercise you will:
- Taxonomic classification using a reference database of protein sequences using Kaiju
- Taxonomic classification using a K-mer based method using Kraken
- Taxonomic classification using clade specific marker genes using MetaPhlAn2
- Visalisation of taxonomic profiles with MEGAN
1. Taxonomic classification using a reference database of protein sequences using Kaiju
💡Kaiju is a software tool for the taxonomic classification of high-throughput sequencing reads, e.g., Illumina or Roche/454, from whole-genome sequencing of metagenomic DNA. Reads are directly assigned to taxa using the NCBI taxonomy and a reference database of protein sequences from microbial and viral genomes.
You can read more about the tool here.
As true for most taxonomic classification tools, your data is screened against a database of known sequences in order to classify your data. Before you can perform taxonomic classification of your reads, Kaiju’s database index needs to be built from the reference protein database. Even a relative small database such as the RefSeq generates large files when they are indexed (The indexing of the content is absolutely required in order to use a database). When RefSeq is indexed properly, it requires 32 GB of space.
The complete Reference Genomes from NCBI RefSeq was downloaded and index like this:
makeDB.sh -r
The indexing took about 2 hours to perform. This database contain only completely assembled and annotated reference genomes of Archaea and Bacteria from the NCBI RefSeq database. You can read more about RefSeq here.
Unfortunately, running Kaiju against RefSeq requires 21GB RAM. Your VM only have 16GB RAM, so we have indexed a smaller database called proGenomes. This database generally covers a broader phylogenetic range compared to the RefSeq dataset, and is therefore recommended, especially for environmental samples. As of February 2018, this database contains ca. 19M protein sequences, which amounts to a requirement of 13GB RAM for running Kaiju.
I] In ~/practical copy the data for this exercise from the shared disk:
cp -R /net/software/practical/4 .
II] Change the permissions of this directory:
chmod -R a+rwx 4
III] In ~/practical/4/ create a directory named kaiju
and enter this directory:
mkdir kaiju
cd kaiju
IV] Make symbolic links to the preprocessed FASTQ files from the 3. Trimming & filtering
(the paired FASTQ from repair). You might have named the after you own choice - which is ok. In the following examples, we call these input files sample1_R1_pair.fastq.gz
and sample1_R2_pair.fastq.gz
ln -s ~/practical/3/repair/sample1_R* .
In order to run Kaiju you need activate the correct Conda environment.
V] View the Kaiju run options (or arguments), type:
kaiju -h
VI] Run Kaiju against the ProGenomes database:
kaiju -t /net/software/databases/proGenomes/nodes.dmp -f /net/software/databases/proGenomes/kaiju_db.fmi -i <(gunzip -c sample1_R1_pair.fastq.gz) -j <(gunzip -c sample1_R2_pair.fastq.gz) -z 4 -x -s 75 -a greedy -e 5 -o sample1_proGenomes_kaiju.out -v
The different arguments are
- -t = Path to the pre-indexed database
- -f = Path to the pre-indexed database
- -z = Set the number of parallell threads to 4
- -x = Enable SEG low complexity filter
- -s = Set the minimum match score in ‘Greedy mode’ to 75
- -a greedy = Run Kaiju in ‘Greedy mode’ and allow 5 mismatches
- -e = Number of allowed substitutions
- -o = Name the output
sample1_proGenomes_kaiju.out
- -v = Enable the verbose output to get a rich output format
Running Kaiju on this dataset will take 17 minutes on a 8 GB RAM, 16 CPU machine. If you don’t want to wait for the tool to finish, we have provided the Kaiju output here: ~/practical/4/prerun_kaiju/
Kaiju will print one line for each read or read pair. The verbose output contains tab-separated columns with information about the length or score of the best match used for classification, NCBI taxon identifier of the assigned taxon, accession numbers, etc:
- either C or U, indicating whether the read is classified or unclassified.
- name of the read
- NCBI taxon identifier of the assigned taxon
- the length or score of the best match used for classification
- the taxon identifiers of all database sequences with the best match
- the accession numbers of all database sequences with the best match
- matching fragment sequence(s)
VII] View the top of the Kaiju output:
head sample1_proGenomes_kaiju.out
❓Which NCBI taxon does the first sequence classified (C)
read represent?
Solution - Click to expand
Look up the taxon ID in the [NCBI taxonomy browser](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi).
💡 You can use Krona to visualise the results Kaiju. Krona allows hierarchical data to be explored with zooming, multi-layered pie charts. Krona charts can be created using an Excel template or KronaTools, which includes support for several bioinformatics tools and raw data formats. The interactive charts are self-contained and can be viewed with any modern web browser.
In order to visualise the results in Krona, you first need to convert the Kaiju results into a file ready for import in Krona Tools.
Krona Tools require an input file with the full taxon path, like shown below:
91803 root cellular organisms Bacteria FCB group Bacteroidetes/Chlorobi group Bacteroidetes Bacteroidia Bacteroidales Bacteroidaceae Bacteroides Bacteroides ovatus
87964 root cellular organisms Bacteria FCB group Bacteroidetes/Chlorobi group Bacteroidetes Bacteroidia Bacteroidales Bacteroidaceae Bacteroides Bacteroides thetaiotaomicron
VIII] Convert the Kaiju output file into the proper Krona Tools format using kaiju2krona:
kaiju2krona -t /net/software/databases/proGenomes/nodes.dmp -n /net/software/databases/proGenomes/names.dmp -i sample1_proGenomes_kaiju.out -o sample1_proGenomes_kaiju.krona
IX] The file sample1_proGenomes_kaiju.krona
can then be imported into Krona and converted into an HTML file using Krona’s ktImportText program. Activate the Conda Krona
environment first!
ktImportText -o sample1_proGenomes_kaiju.html sample1_proGenomes_kaiju.krona
X] Open the HTML file you just generated in a web browser.
❓How many reads did not get assign to bacteria?
Solution - Click to expand
2841 reads
❓What seems to be the most abundant genera?
❓How many percent of the reads have mapped to this genera?
Solution - Click to expand
77%
❓Locate the genus Escherichia coli. How many of the reads in your sample originate from this specie and how many percent does this species represent from the total reads (Hint: you can type the name in the search field)?
Solution - Click to expand
27 reads, 0,001%
The program kaiju2table can convert Kaiju’s tab-separated output file into a summary report file for a given taxonomic rank, e.g., species.
In order to run kaiju2table you need activate the correct Conda environment.
XI] Print a report at species level with species that comprise at least 1 percent of the total reads. Name the output file containing the report sample1_proGenomes_kaiju_species.summary
:
kaiju2table -t /net/software/databases/proGenomes/nodes.dmp -n /net/software/databases/proGenomes/names.dmp -r species -m 0.1 -o sample1_proGenomes_species.summary sample1_proGenomes_kaiju.out
❓ What does the -r
and -m
flag do?
Solution - Click to expand
-r = Report taxonomic rank, -m = minimum required percentage for the taxon to be reported
❓How many percent of the reads belongs to the most abundant species in the sample?
❓ How many of the reads in your sample originate from Lactobacillus rhamnosus?
Solution - Click to expand
9086 reads
❓ Look at the last line in the report, how many percent of the reads could not be classified (does not match anything in the database)?
Solution - Click to expand
28%
❓ How many percent of the classified reads could not be assign at species level?
Solution - Click to expand
16,5%
XII] Create a report on genus level. Name the output file containing the report sample1_proGenomes_genus.summary
❓How many percent of the classified reads could not be assign at genus level?
Solution - Click to expand
1,766%
❓ What is the most abundant genera, and how many percent of the reads have mapped to this genera?
Solution - Click to expand
Bifidobacterium, 55,6%
Optional exercise
In the folder ~/practical/4/optional_refseq_kaiju/ we have run the same rawdata files against the complete RefSeq database. Running Kaiju against this database require at least 21GB RAM.
XIII] Move to this folder and generate summary reports at both genus and species level. NB! The path to the RefSeq database is: /net/software/databases/refseq_kaiju/
❓ What are the major differences when comparing the same data against two different databases (if any)?
Solution - Click to expand
More reads are classified using RefSeq, but less reads are can be assigned to species and genus level. Also, much fewer taxa are identified using RefSeq. This could be due to that RefSeq assign reads more accurately, hence assigning the reads to a lower taxonomic level.
❓ Compare the number of reads assigned to the species Bifidobacterium longum? Can you explain the differences?
Solution - Click to expand
RefSeq has more data from this specie and fewer subspecies, hence more reads match this specie using RefSeq. The ProGenome database has more subspecies, and assign reads (accurately?) to these subspecies and not to Bifidobacterium longum
In the folder ~/practical/4/optional_e_kaiju/ we have run the same rawdata files against the complete non-redundant protein database nr. This database took 256 GB of diskspace, therefore we could not put it on the VM.
The Kaiju output file is named sample1_e_kaiju.out
, the Kaiju report on species level sample1_e_kaiju_species.summary
and the Kaiju report on genus level sample1_e_kaiju_genus.summary
XIV] Move to this folder and open the summary reports.
❓ What are the major differences when comparing the same data against RefSeq (if any)?
Solution - Click to expand
Less reads can be assigned to species and genus level when using RefSeq. Probably due to the larger database
XV] Generate a Krona plot of the sample1_e_kaiju.out
file like you did previously.
❓ Can you see the differences you just described between the analysis against RefSeq and the complete nr in the plot?
Solution - Click to expand
Yes...
Progress tracker2. Taxonomic classification using a K-mer based method
💡Kraken is a system for ultrafast assignment of taxonomic labels to short DNA sequences using exact alignments of k-mers and a novel classification algorithm.
Kraken’s default database contains just under 6 billion (6e9) distinct k-mers, and requires at least 160 GB of disk space. We will therefore use MiniKraken - Kraken run with a reduced database. The database is 4 GB and contains about 2.7% of kmers from the original database. You can read more here.
I] Create a directory named kraken
in ~/practical/4/ and make symbolic links to the preprocessed FASTQ files from the 3. Trimming & filtering
- the paired FASTQ from repair (sample1_R1_pair.fastq.gz
and sample1_R2_pair.fastq.gz
).
II] Activate the correct Conda environment and view the Kraken run options (or arguments), type:
kraken -h
In order to classify reads with Kraken, you need to tell the program where the database is locate. We have downloaded the reduced database, formatted it and placed it here: /net/software/databases/minikraken/.
II] Run Kraken on the two FASTQ files and name the output sample1_kraken.out
. Remember that it is a paired end sample:
kraken --db /net/software/databases/minikraken/ --paired sample1_R1_pair.fastq.gz sample1_R2_pair.fastq.gz --output sample1_kraken.out --threads 4
❓ What does the --paired flag do?
Solution - Click to expand
Specify that the input is PE
III] Look at the first lines in output file (head sample1_kraken.out
).
Each sequence classified by Kraken results in a single line of output. Output lines contain five tab-delimited fields; from left to right, they are:
- “C”/“U”: one letter code indicating that the sequence was either classified or unclassified.
- The sequence ID, obtained from the FASTA/FASTQ header.
- The taxonomy ID Kraken used to label the sequence; this is 0 if the sequence is unclassified.
- The length of the sequence in bp.
- A space-delimited list indicating the LCA mapping of each k-mer in the sequence. For example, “562:13 561:4 A:31 0:1 562:3” would indicate that:
- the first 13 k-mers mapped to taxonomy ID #562
- the next 4 k-mers mapped to taxonomy ID #561
- the next 31 k-mers contained an ambiguous nucleotide
- the next k-mer was not in the database
- the last 3 k-mers mapped to taxonomy ID #562
IV] In order to make a Krona chart of the Kraken output you need to remove some of the columns in the sample1_kraken.out
file:
cut -f2,3 sample1_kraken.out > sample1_kraken.krona.in
V] Make a Krona chart from the file sample1_kraken.krona.in
using Krona’s ktImportText program (activate the Krona environment):
ktImportTaxonomy sample1_kraken.krona.in -o sample1_kraken.html
VI] Open the Krona chart from the Kraken analysis.
❓How many percent of the reads belongs to the most abundant species in the sample?
❓What is the most abundant genera, and how of the reads have mapped to this genera?
Solution - Click to expand
*Bifidobacterium longum*, 1728338 reads
❓Does any of the reads map to the specie Escherichia coli? How many of the reads in your sample originate from this specie?
Solution - Click to expand
39 reads
VII] Generate a taxonomic report of the Kraken output. NB! It is important that the database used must be the same as the one used to generate the output file (activate the Kraken environment):
kraken-report --db /net/software/databases/minikraken/ sample1_kraken.out > sample1_kraken.rpt
The output of kraken-report is tab-delimited, with one line per taxon. The fields of the output, from left-to-right, are as follows:
- Percentage of reads covered by the clade rooted at this taxon
- Number of reads covered by the clade rooted at this taxon
- Number of reads assigned directly to this taxon
- A rank code, indicating U = unclassified, D = domain, K = kingdom, P = phylum, C = class, O = order, F = family, G = genus, or S = species. All other ranks are simply ‘-’.
- NCBI taxonomy ID
- indented scientific name
VIII] Open the Kraken report in a text editor (e.g. Gedit)
❓How many percent of the reads were unclassified?
Solution - Click to expand
24.18%
❓How many percent of the reads belongs to the most abundant species in the sample?
Like for Kaiju, we have prerun the same data against a larger database - the complete Kraken database. The database is 505 GB in size, therefore we could not put it on the VM. The output is ~/practical/4/stor_kraken/
The Kaiju output file is named sample1_stor_kraken.out
, the Kraken report is called sample1_stor_kraken.rpt
.
XI] Compare this report with the report from the run against the smaller database (minikraken).
❓Against which database does the analysis produce least unclassified reads and how many reads are unclassified?
Solution - Click to expand
8.43% using the complete **Kraken** database
❓Are there any major differences in the taxa that are found using the two different databases?
Solution - Click to expand
The analysis using the complete **Kraken** database assign even more reads to *Bifidobacterium*, and detects more subspecies in many other genus
X] Make a Krona chart from the sample1_stor_kraken.out
❓ Can you see any taxonomic differences between the analysis against Minikraken and the complete Kraken database in the Krona plot?
Solution - Click to expand
Yes...
Progress tracker3. Taxonomic classification using clade specific markes
💡MetaPhlAn3 is part of bioBakery which is a collection of tools and workflows for executing common microbial community analyses.
MetaPhlAn3 can be used for profiling the composition of microbial communities from metagenomic shotgun sequencing data. MetaPhlAn3 relies on unique clade-specific marker genes identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic).
You can read more about MetaPhlAn3 here
I] Create a directory named metaphlan
in ~/practical/4/ and make symbolic links to the preprocessed FASTQ files from the sample1_R1_pair.fastq.gz
and sample1_R2_pair.fastq.gz
from the exercise 3. Trimming & filtering
- the paired FASTQ from repair
In order to run MetaPhlAn3 you need activate the correct Conda environment.
II] View the MetaPhlAn3 run options (or arguments), type:
metaphlan2.py -h
III] Run MetaPhlAn3 on the two FASTQ files and name the output sample1_metaphlan.txt
. Remember that it is a paired end sample:
metaphlan sample1_R1_pair.fastq.gz,sample1_R2_pair.fastq.gz --bowtie2db /opt/databases/metaphlan3_db --bowtie2out sample1.bz2 --nproc 16 --input_type fastq -o sample1_metaphlan.txt
# --bowtie2db, location of the database
MetaPhlAn3 will take around 8 minutes to process this data, so it is possible to stretch your legs now:).
❓ What does the --nproc flag do?
Solution - Click to expand
Specify the number of CPUs
MetaPhlAn3 has created two output files. The first file sample1.bz2
is a compressed file and contains the intermediate mapping results to unique sequence markers. Alignments are listed one per line in tab-separated columns of read and reference marker.
IV] Look at the start of this file:
bzcat sample1.bz2 | head
The second file contains the final computed organism abundances. Organism abundances are listed one clade per line, tab-separated from the clade’s percent abundance.
V] Look at the first lines in output file:
head sample1_metaphlan.txt
After three lines with information on how the output was generated, the abundance profile should look similar to this - a text file with four columns:
The first column lists clades, ranging from taxonomic kingdoms (Bacteria, Archaea, etc.) through species. The taxonomic level of each clade is prefixed to indicate its level:
Kingdom: k__, Phylum: p__, Class: c__, Order: o__, Family: f__, Genus: g__, Species: s__
The second column list the NCBI taxonomy of the taxa
The third column list the relative abundances of the identified taxa. Since sequence-based profiling is relative and does not provide absolute cellular abundance measures, clades are hierarchically summed. Each level will sum to 100%; that is, the sum of all kingdom-level clades is 100%, the sum of all genus-level clades (including unclassified) is also 100%, and so forth.
In order to make a Krona chart of the MetaPhlAn3 output you need to convert it using the script metaphlan2krona.py. However, this script was written for MetaPhlAn2 which generate an abundance profile output with only two columns:
In order for metaphlan2krona.py to work properly, you first need to do some manipulations to the MetaPhlAn3 output to simulate a MetaPhlAn2 output
Activate the Metaphlan conda environment
V] Remove column 2 and 4 from the sample1_metaphlan.txt
:
cut -f1,3 sample1_metaphlan.txt > sample1_metaphlan_ok.txt
Activate the Krona conda environment
VI] Convert the output to Krona format:
metaphlan2krona.py -p sample1_metaphlan_ok.txt -k sample1_metaphlan.krona
VII] Then make the HTML plot in the correct conda environment:
ktImportText -o sample1_metaphlan.html sample1_metaphlan.krona
❓ Can you see any major taxonomic differences between this taxonomic profile, and the profiles you got from the analysis using Kaiju and Kraken?
Solution - Click to expand
Yes, fewer taxa identified
❓How many of the reads in your sample originate from Escherichia coli?
Solution - Click to expand
0
Progress tracker4. Visualisation of taxonomic profiles with MEGAN
💡MEGAN is a comprehensive toolbox for analysing microbiome data and comes with a user-friendly interface. MEGAN can perform both taxonomic and functional analysis and visualise the results in various plots and charts.
You can read more about the tool here.
The main application of MEGAN is to parse and analyse the result of a BLAST comparison of a set of reads against one or more reference databases, typically using BLASTN to compare against NCBI-NT. MEGAN also supports import of data from other programs such as Kaiju or Kraken in a delimiter-separated format (using comma’s or tabs). An example of such file is:
Taxon |
#reads |
Specie A |
100 |
Specie B |
400 |
Specie C |
50 |
There are several ways to create a tab-separated taxon summary file. In this exercise you will use the taxonomic profile you generated using Kaiju. This output file include all taxonomic levels.
I] Create a directory named megan
in ~/practical/4/ and make symbolic links to the Kaiju output against RefSeq (sample1_refseq_kaiju.out
):
-ln -s ~/practical/4/optional_refseq_kaiju/sample1_refseq_kaiju.out
II] Look at the first part of the file sample1_refseq_kaiju.out
:
head sample1_refseq_kaiju.out
The taxonomic annotation is in column 8 and is a NCBI ID number. You can read more about NCBI taxonomy here.
III] Add taxonomy labels (names not just numbers) to each assigned read using kaiju-addTaxonNames
, but do not specify the taxonomic rank, since some reads have been assigned to species level, while other have been assign to lower taxonomic levels.
In order to run kaiju-addTaxonNames you need activate the Kaiju Conda environment.
kaiju-addTaxonNames -t /net/software/databases/refseq_kaiju/nodes.dmp -n /net/software/databases/refseq_kaiju/names.dmp -i sample1_refseq_kaiju.out -o sample1_refseq_kaiju.out.taxa
OUTPUT: sample1_refseq_kaiju.out.taxa
IV] Look at the first part of the file sample1_refseq_kaiju.out.taxa
.
❓ Do you spot any changes in this file relative to sample1_refseq_kaiju.out
?
Solution - Click to expand
The NCBI taxonomic label (number) has been replaced by the name of the taxa names in the last column
As mentioned MEGAN needs a tab-delimited input file with one taxa per row and the number of reads matching this taxa. The Kaiju output has one reads per row (2817723 rows), and multiple reads may originate from the same taxa. You need to summarize the number of reads from each individual taxa before importing the taxonomic profile into MEGAN
V] Summarise all taxa in
sample1_refseq_kaiju.out.taxa
using this command:
awk -F '\t' 'BEGIN{FS=OFS="\t"}{print $8}' sample1_refseq_kaiju.out.taxa | sort | uniq -c | sed -E 's/^ *//; s/ /\t/' | awk 'BEGIN {FS="\t"; OFS="\t"} {print $2, $1}' > sample1_refseq_kaiju_megan.txt
# awk remove all columns except column 8 containing the taxonomic label
# sort will sort the taxa
# uniq will count the number of occurrences of each taxa, collapse identical taxa into one and write a second row with the number of occurrences for each taxa
# sed will change the delimiter to tab
# awk will change order of column 1 (counts) and column 2 (taxa)
OUTPUT: sample1_refseq_kaiju_megan.txt
❓ How many unique taxa are there in the sample?
Solution - Click to expand
1543 (including non-assigned reads)
VI] Open the file sample1_refseq_kaiju_megan.txt
in a text editor. The first lines should look something like this:
513462
Acaryochloris marina MBIC11017 1
Acetoanaerobium sticklandii 6
Acetobacteraceae 1
The first column displaying taxon, and the second showing how many sequence reads that mapped to this taxon. Note that the taxa are on various taxonomic levels (genus, species, etc.). The first row contain the reads with no taxonomic labels (did not match anything in RefSeq).
VII] Add “NA” to all the unassigned reads in the first row, second column in a text editor:
NA 513462
Acaryochloris marina MBIC11017 1
Acetoanaerobium sticklandii 6
Acetobacteraceae 1
Reduce number of taxa. It is very normal to reduce the number of taxa in a taxonomic profile, by filtering out the least abundant taxa. This can be done for example by removing all taxa with less than 1 % abundance, or by removing all taxa with less reads than a certain number we set as a threshold. Reducing the number of taxa will also make viewing in MEGAN easier.
VIII] Sort sample1_refseq_kaiju_megan.txt
so that the most abundant taxa are at the top:
4. Taxonomic profiling
When doing a taxonomic classification, we want to find out what kind of organisms the sample contains. Who is there?
You will continue to work with the data that you have preprocessed in Module2. The DNA was isolated from the gut of premature human infants. The microbial diversity in the gut of premature infants is therefore very low.
The aim of this assignment is to perform taxonomic classification of organisms present in the dataset using several different methods/tools.
In this exercise you will:
1. Taxonomic classification using a reference database of protein sequences using Kaiju
💡Kaiju is a software tool for the taxonomic classification of high-throughput sequencing reads, e.g., Illumina or Roche/454, from whole-genome sequencing of metagenomic DNA. Reads are directly assigned to taxa using the NCBI taxonomy and a reference database of protein sequences from microbial and viral genomes.
You can read more about the tool here.
As true for most taxonomic classification tools, your data is screened against a database of known sequences in order to classify your data. Before you can perform taxonomic classification of your reads, Kaiju’s database index needs to be built from the reference protein database. Even a relative small database such as the RefSeq generates large files when they are indexed (The indexing of the content is absolutely required in order to use a database). When RefSeq is indexed properly, it requires 32 GB of space.
The complete Reference Genomes from NCBI RefSeq was downloaded and index like this:
The indexing took about 2 hours to perform. This database contain only completely assembled and annotated reference genomes of Archaea and Bacteria from the NCBI RefSeq database. You can read more about RefSeq here.
Unfortunately, running Kaiju against RefSeq requires 21GB RAM. Your VM only have 16GB RAM, so we have indexed a smaller database called proGenomes. This database generally covers a broader phylogenetic range compared to the RefSeq dataset, and is therefore recommended, especially for environmental samples. As of February 2018, this database contains ca. 19M protein sequences, which amounts to a requirement of 13GB RAM for running Kaiju.
I] In ~/practical copy the data for this exercise from the shared disk:
II] Change the permissions of this directory:
III] In ~/practical/4/ create a directory named
kaiju
and enter this directory:IV] Make symbolic links to the preprocessed FASTQ files from the
3. Trimming & filtering
(the paired FASTQ from repair). You might have named the after you own choice - which is ok. In the following examples, we call these input filessample1_R1_pair.fastq.gz
andsample1_R2_pair.fastq.gz
In order to run Kaiju you need activate the correct Conda environment.
V] View the Kaiju run options (or arguments), type:
VI] Run Kaiju against the ProGenomes database:
The different arguments are
sample1_proGenomes_kaiju.out
Running Kaiju on this dataset will take 17 minutes on a 8 GB RAM, 16 CPU machine. If you don’t want to wait for the tool to finish, we have provided the Kaiju output here: ~/practical/4/prerun_kaiju/
Kaiju will print one line for each read or read pair. The verbose output contains tab-separated columns with information about the length or score of the best match used for classification, NCBI taxon identifier of the assigned taxon, accession numbers, etc:
VII] View the top of the Kaiju output:
❓Which NCBI taxon does the first sequence classified
(C)
read represent?Solution - Click to expand
Fill in the answer in the shared Google docs table
💡 You can use Krona to visualise the results Kaiju. Krona allows hierarchical data to be explored with zooming, multi-layered pie charts. Krona charts can be created using an Excel template or KronaTools, which includes support for several bioinformatics tools and raw data formats. The interactive charts are self-contained and can be viewed with any modern web browser.
In order to visualise the results in Krona, you first need to convert the Kaiju results into a file ready for import in Krona Tools.
Krona Tools require an input file with the full taxon path, like shown below:
VIII] Convert the Kaiju output file into the proper Krona Tools format using kaiju2krona:
IX] The file
sample1_proGenomes_kaiju.krona
can then be imported into Krona and converted into an HTML file using Krona’s ktImportText program. Activate the CondaKrona
environment first!X] Open the HTML file you just generated in a web browser.
❓How many reads did not get assign to bacteria?
Solution - Click to expand
❓What seems to be the most abundant genera?
Fill in the answer in the shared Google docs table
❓How many percent of the reads have mapped to this genera?
Solution - Click to expand
❓Locate the genus Escherichia coli. How many of the reads in your sample originate from this specie and how many percent does this species represent from the total reads (Hint: you can type the name in the search field)?
Solution - Click to expand
The program kaiju2table can convert Kaiju’s tab-separated output file into a summary report file for a given taxonomic rank, e.g., species.
In order to run kaiju2table you need activate the correct Conda environment.
XI] Print a report at species level with species that comprise at least 1 percent of the total reads. Name the output file containing the report
sample1_proGenomes_kaiju_species.summary
:❓ What does the
-r
and-m
flag do?Solution - Click to expand
❓How many percent of the reads belongs to the most abundant species in the sample?
Fill in the answer in the shared Google docs table
❓ How many of the reads in your sample originate from Lactobacillus rhamnosus?
Solution - Click to expand
❓ Look at the last line in the report, how many percent of the reads could not be classified (does not match anything in the database)?
Solution - Click to expand
❓ How many percent of the classified reads could not be assign at species level?
Solution - Click to expand
XII] Create a report on genus level. Name the output file containing the report
sample1_proGenomes_genus.summary
❓How many percent of the classified reads could not be assign at genus level?
Solution - Click to expand
❓ What is the most abundant genera, and how many percent of the reads have mapped to this genera?
Solution - Click to expand
Optional exercise
In the folder ~/practical/4/optional_refseq_kaiju/ we have run the same rawdata files against the complete RefSeq database. Running Kaiju against this database require at least 21GB RAM.
XIII] Move to this folder and generate summary reports at both genus and species level. NB! The path to the RefSeq database is:
/net/software/databases/refseq_kaiju/
❓ What are the major differences when comparing the same data against two different databases (if any)?
Solution - Click to expand
❓ Compare the number of reads assigned to the species Bifidobacterium longum? Can you explain the differences?
Solution - Click to expand
In the folder ~/practical/4/optional_e_kaiju/ we have run the same rawdata files against the complete non-redundant protein database nr. This database took 256 GB of diskspace, therefore we could not put it on the VM.
The Kaiju output file is named
sample1_e_kaiju.out
, the Kaiju report on species levelsample1_e_kaiju_species.summary
and the Kaiju report on genus levelsample1_e_kaiju_genus.summary
XIV] Move to this folder and open the summary reports.
❓ What are the major differences when comparing the same data against RefSeq (if any)?
Solution - Click to expand
XV] Generate a Krona plot of the
sample1_e_kaiju.out
file like you did previously.❓ Can you see the differences you just described between the analysis against RefSeq and the complete nr in the plot?
Solution - Click to expand
2. Taxonomic classification using a K-mer based method
💡Kraken is a system for ultrafast assignment of taxonomic labels to short DNA sequences using exact alignments of k-mers and a novel classification algorithm.
Kraken’s default database contains just under 6 billion (6e9) distinct k-mers, and requires at least 160 GB of disk space. We will therefore use MiniKraken - Kraken run with a reduced database. The database is 4 GB and contains about 2.7% of kmers from the original database. You can read more here.
I] Create a directory named
kraken
in ~/practical/4/ and make symbolic links to the preprocessed FASTQ files from the3. Trimming & filtering
- the paired FASTQ from repair (sample1_R1_pair.fastq.gz
andsample1_R2_pair.fastq.gz
).II] Activate the correct Conda environment and view the Kraken run options (or arguments), type:
In order to classify reads with Kraken, you need to tell the program where the database is locate. We have downloaded the reduced database, formatted it and placed it here: /net/software/databases/minikraken/.
II] Run Kraken on the two FASTQ files and name the output
sample1_kraken.out
. Remember that it is a paired end sample:❓ What does the --paired flag do?
Solution - Click to expand
III] Look at the first lines in output file (
head sample1_kraken.out
).Each sequence classified by Kraken results in a single line of output. Output lines contain five tab-delimited fields; from left to right, they are:
IV] In order to make a Krona chart of the Kraken output you need to remove some of the columns in the
sample1_kraken.out
file:V] Make a Krona chart from the file
sample1_kraken.krona.in
using Krona’s ktImportText program (activate the Krona environment):VI] Open the Krona chart from the Kraken analysis.
❓How many percent of the reads belongs to the most abundant species in the sample?
Fill in the answer in the shared Google docs table
❓What is the most abundant genera, and how of the reads have mapped to this genera?
Solution - Click to expand
❓Does any of the reads map to the specie Escherichia coli? How many of the reads in your sample originate from this specie?
Solution - Click to expand
VII] Generate a taxonomic report of the Kraken output. NB! It is important that the database used must be the same as the one used to generate the output file (activate the Kraken environment):
The output of kraken-report is tab-delimited, with one line per taxon. The fields of the output, from left-to-right, are as follows:
VIII] Open the Kraken report in a text editor (e.g. Gedit)
❓How many percent of the reads were unclassified?
Solution - Click to expand
❓How many percent of the reads belongs to the most abundant species in the sample?
Fill in the answer in the shared Google docs table
Like for Kaiju, we have prerun the same data against a larger database - the complete Kraken database. The database is 505 GB in size, therefore we could not put it on the VM. The output is ~/practical/4/stor_kraken/
The Kaiju output file is named
sample1_stor_kraken.out
, the Kraken report is calledsample1_stor_kraken.rpt
.XI] Compare this report with the report from the run against the smaller database (minikraken).
❓Against which database does the analysis produce least unclassified reads and how many reads are unclassified?
Solution - Click to expand
❓Are there any major differences in the taxa that are found using the two different databases?
Solution - Click to expand
X] Make a Krona chart from the
sample1_stor_kraken.out
❓ Can you see any taxonomic differences between the analysis against Minikraken and the complete Kraken database in the Krona plot?
Solution - Click to expand
3. Taxonomic classification using clade specific markes
💡MetaPhlAn3 is part of bioBakery which is a collection of tools and workflows for executing common microbial community analyses.
MetaPhlAn3 can be used for profiling the composition of microbial communities from metagenomic shotgun sequencing data. MetaPhlAn3 relies on unique clade-specific marker genes identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic).
You can read more about MetaPhlAn3 here
I] Create a directory named
metaphlan
in ~/practical/4/ and make symbolic links to the preprocessed FASTQ files from thesample1_R1_pair.fastq.gz
andsample1_R2_pair.fastq.gz
from the exercise3. Trimming & filtering
- the paired FASTQ from repairIn order to run MetaPhlAn3 you need activate the correct Conda environment.
II] View the MetaPhlAn3 run options (or arguments), type:
III] Run MetaPhlAn3 on the two FASTQ files and name the output
sample1_metaphlan.txt
. Remember that it is a paired end sample:MetaPhlAn3 will take around 8 minutes to process this data, so it is possible to stretch your legs now:).
❓ What does the --nproc flag do?
Solution - Click to expand
MetaPhlAn3 has created two output files. The first file
sample1.bz2
is a compressed file and contains the intermediate mapping results to unique sequence markers. Alignments are listed one per line in tab-separated columns of read and reference marker.IV] Look at the start of this file:
The second file contains the final computed organism abundances. Organism abundances are listed one clade per line, tab-separated from the clade’s percent abundance.
V] Look at the first lines in output file:
After three lines with information on how the output was generated, the abundance profile should look similar to this - a text file with four columns:
The first column lists clades, ranging from taxonomic kingdoms (Bacteria, Archaea, etc.) through species. The taxonomic level of each clade is prefixed to indicate its level:
Kingdom: k__, Phylum: p__, Class: c__, Order: o__, Family: f__, Genus: g__, Species: s__
The second column list the NCBI taxonomy of the taxa
The third column list the relative abundances of the identified taxa. Since sequence-based profiling is relative and does not provide absolute cellular abundance measures, clades are hierarchically summed. Each level will sum to 100%; that is, the sum of all kingdom-level clades is 100%, the sum of all genus-level clades (including unclassified) is also 100%, and so forth.
In order to make a Krona chart of the MetaPhlAn3 output you need to convert it using the script metaphlan2krona.py. However, this script was written for MetaPhlAn2 which generate an abundance profile output with only two columns:
In order for metaphlan2krona.py to work properly, you first need to do some manipulations to the MetaPhlAn3 output to simulate a MetaPhlAn2 output
Activate the Metaphlan conda environment
V] Remove column 2 and 4 from the
sample1_metaphlan.txt
:Activate the Krona conda environment
VI] Convert the output to Krona format:
VII] Then make the HTML plot in the correct conda environment:
❓ Can you see any major taxonomic differences between this taxonomic profile, and the profiles you got from the analysis using Kaiju and Kraken?
Solution - Click to expand
❓How many of the reads in your sample originate from Escherichia coli?
Solution - Click to expand
4. Visualisation of taxonomic profiles with MEGAN
💡MEGAN is a comprehensive toolbox for analysing microbiome data and comes with a user-friendly interface. MEGAN can perform both taxonomic and functional analysis and visualise the results in various plots and charts.
You can read more about the tool here.
The main application of MEGAN is to parse and analyse the result of a BLAST comparison of a set of reads against one or more reference databases, typically using BLASTN to compare against NCBI-NT. MEGAN also supports import of data from other programs such as Kaiju or Kraken in a delimiter-separated format (using comma’s or tabs). An example of such file is:
There are several ways to create a tab-separated taxon summary file. In this exercise you will use the taxonomic profile you generated using Kaiju. This output file include all taxonomic levels.
I] Create a directory named
megan
in ~/practical/4/ and make symbolic links to the Kaiju output against RefSeq (sample1_refseq_kaiju.out
):II] Look at the first part of the file
sample1_refseq_kaiju.out
:The taxonomic annotation is in column 8 and is a NCBI ID number. You can read more about NCBI taxonomy here.
III] Add taxonomy labels (names not just numbers) to each assigned read using
kaiju-addTaxonNames
, but do not specify the taxonomic rank, since some reads have been assigned to species level, while other have been assign to lower taxonomic levels.In order to run kaiju-addTaxonNames you need activate the Kaiju Conda environment.
IV] Look at the first part of the file
sample1_refseq_kaiju.out.taxa
.❓ Do you spot any changes in this file relative to
sample1_refseq_kaiju.out
?Solution - Click to expand
As mentioned MEGAN needs a tab-delimited input file with one taxa per row and the number of reads matching this taxa. The Kaiju output has one reads per row (2817723 rows), and multiple reads may originate from the same taxa. You need to summarize the number of reads from each individual taxa before importing the taxonomic profile into MEGAN
V] Summarise all taxa in
sample1_refseq_kaiju.out.taxa
using this command:❓ How many unique taxa are there in the sample?
Solution - Click to expand
VI] Open the file
sample1_refseq_kaiju_megan.txt
in a text editor. The first lines should look something like this:The first column displaying taxon, and the second showing how many sequence reads that mapped to this taxon. Note that the taxa are on various taxonomic levels (genus, species, etc.). The first row contain the reads with no taxonomic labels (did not match anything in RefSeq).
VII] Add “NA” to all the unassigned reads in the first row, second column in a text editor:
Reduce number of taxa. It is very normal to reduce the number of taxa in a taxonomic profile, by filtering out the least abundant taxa. This can be done for example by removing all taxa with less than 1 % abundance, or by removing all taxa with less reads than a certain number we set as a threshold. Reducing the number of taxa will also make viewing in MEGAN easier.
VIII] Sort
sample1_refseq_kaiju_megan.txt
so that the most abundant taxa are at the top: