4. Taxonomic profiling

When doing a taxonomic classification, we want to find out what kind of organisms the sample contains. Who is there?

You will continue to work with the data that you have preprocessed in Module2. The DNA was isolated from the gut of premature human infants. The microbial diversity in the gut of premature infants is therefore very low.

The aim of this assignment is to perform taxonomic classification of organisms present in the dataset using several different methods/tools.

In this exercise you will:

Taxonomic classification using a reference database of protein sequences using Kaiju
Taxonomic classification using a K-mer based method using Kraken
Taxonomic classification using clade specific marker genes using MetaPhlAn2
Visalisation of taxonomic profiles with MEGAN

1. Taxonomic classification using a reference database of protein sequences using Kaiju

💡Kaiju is a software tool for the taxonomic classification of high-throughput sequencing reads, e.g., Illumina or Roche/454, from whole-genome sequencing of metagenomic DNA. Reads are directly assigned to taxa using the NCBI taxonomy and a reference database of protein sequences from microbial and viral genomes.

You can read more about the tool here.

As true for most taxonomic classification tools, your data is screened against a database of known sequences in order to classify your data. Before you can perform taxonomic classification of your reads, Kaiju’s database index needs to be built from the reference protein database. Even a relative small database such as the RefSeq generates large files when they are indexed (The indexing of the content is absolutely required in order to use a database). When RefSeq is indexed properly, it requires 32 GB of space.

The complete Reference Genomes from NCBI RefSeq was downloaded and index like this:

makeDB.sh -r

The indexing took about 2 hours to perform. This database contain only completely assembled and annotated reference genomes of Archaea and Bacteria from the NCBI RefSeq database. You can read more about RefSeq here.

Unfortunately, running Kaiju against RefSeq requires 21GB RAM. Your VM only have 16GB RAM, so we have indexed a smaller database called proGenomes. This database generally covers a broader phylogenetic range compared to the RefSeq dataset, and is therefore recommended, especially for environmental samples. As of February 2018, this database contains ca. 19M protein sequences, which amounts to a requirement of 13GB RAM for running Kaiju.

I] In ~/practical copy the data for this exercise from the shared disk:

cp -R /net/software/practical/4 .

II] Change the permissions of this directory:

chmod -R a+rwx 4

III] In ~/practical/4/ create a directory named kaiju and enter this directory:

mkdir kaiju
cd kaiju

IV] Make symbolic links to the preprocessed FASTQ files from the 3. Trimming & filtering (the paired FASTQ from repair). You might have named the after you own choice - which is ok. In the following examples, we call these input files sample1_R1_pair.fastq.gz and sample1_R2_pair.fastq.gz

ln -s ~/practical/3/repair/sample1_R* .

In order to run Kaiju you need activate the correct Conda environment.

V] View the Kaiju run options (or arguments), type:

kaiju -h

VI] Run Kaiju against the ProGenomes database:

kaiju -t /net/software/databases/proGenomes/nodes.dmp -f /net/software/databases/proGenomes/kaiju_db.fmi -i <(gunzip -c sample1_R1_pair.fastq.gz) -j <(gunzip -c sample1_R2_pair.fastq.gz) -z 4 -x -s 75 -a greedy -e 5 -o sample1_proGenomes_kaiju.out -v

The different arguments are

-t = Path to the pre-indexed database
-f = Path to the pre-indexed database
-z = Set the number of parallell threads to 4
-x = Enable SEG low complexity filter
-s = Set the minimum match score in ‘Greedy mode’ to 75
-a greedy = Run Kaiju in ‘Greedy mode’ and allow 5 mismatches
-e = Number of allowed substitutions
-o = Name the output sample1_proGenomes_kaiju.out
-v = Enable the verbose output to get a rich output format

Running Kaiju on this dataset will take 17 minutes on a 8 GB RAM, 16 CPU machine. If you don’t want to wait for the tool to finish, we have provided the Kaiju output here: ~/practical/4/prerun_kaiju/

Kaiju will print one line for each read or read pair. The verbose output contains tab-separated columns with information about the length or score of the best match used for classification, NCBI taxon identifier of the assigned taxon, accession numbers, etc:

either C or U, indicating whether the read is classified or unclassified.
name of the read
NCBI taxon identifier of the assigned taxon
the length or score of the best match used for classification
the taxon identifiers of all database sequences with the best match
the accession numbers of all database sequences with the best match
matching fragment sequence(s)

VII] View the top of the Kaiju output:

head sample1_proGenomes_kaiju.out

❓Which NCBI taxon does the first sequence classified (C) read represent?

Solution - Click to expand

Look up the taxon ID in the [NCBI taxonomy browser](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi).

Fill in the answer in the shared Google docs table

💡 You can use Krona to visualise the results Kaiju. Krona allows hierarchical data to be explored with zooming, multi-layered pie charts. Krona charts can be created using an Excel template or KronaTools, which includes support for several bioinformatics tools and raw data formats. The interactive charts are self-contained and can be viewed with any modern web browser.

In order to visualise the results in Krona, you first need to convert the Kaiju results into a file ready for import in Krona Tools.

Krona Tools require an input file with the full taxon path, like shown below:

91803	root	cellular organisms	Bacteria	FCB group	Bacteroidetes/Chlorobi group	Bacteroidetes	Bacteroidia	Bacteroidales	Bacteroidaceae	Bacteroides	Bacteroides ovatus
87964	root	cellular organisms	Bacteria	FCB group	Bacteroidetes/Chlorobi group	Bacteroidetes	Bacteroidia	Bacteroidales	Bacteroidaceae	Bacteroides	Bacteroides thetaiotaomicron

VIII] Convert the Kaiju output file into the proper Krona Tools format using kaiju2krona:

kaiju2krona -t /net/software/databases/proGenomes/nodes.dmp -n /net/software/databases/proGenomes/names.dmp -i sample1_proGenomes_kaiju.out -o sample1_proGenomes_kaiju.krona

IX] The file sample1_proGenomes_kaiju.krona can then be imported into Krona and converted into an HTML file using Krona’s ktImportText program. Activate the Conda Krona environment first!

ktImportText -o sample1_proGenomes_kaiju.html sample1_proGenomes_kaiju.krona

X] Open the HTML file you just generated in a web browser.

❓How many reads did not get assign to bacteria?

Solution - Click to expand

2841 reads

❓What seems to be the most abundant genera?

Fill in the answer in the shared Google docs table

❓How many percent of the reads have mapped to this genera?

Solution - Click to expand

77%

❓Locate the genus Escherichia coli. How many of the reads in your sample originate from this specie and how many percent does this species represent from the total reads (Hint: you can type the name in the search field)?

Solution - Click to expand

27 reads, 0,001%

The program kaiju2table can convert Kaiju’s tab-separated output file into a summary report file for a given taxonomic rank, e.g., species.

In order to run kaiju2table you need activate the correct Conda environment.

XI] Print a report at species level with species that comprise at least 1 percent of the total reads. Name the output file containing the report sample1_proGenomes_kaiju_species.summary:

kaiju2table -t /net/software/databases/proGenomes/nodes.dmp -n /net/software/databases/proGenomes/names.dmp -r species -m 0.1 -o sample1_proGenomes_species.summary sample1_proGenomes_kaiju.out

❓ What does the -r and -m flag do?

Solution - Click to expand

-r = Report taxonomic rank, -m = minimum required percentage for the taxon to be reported

❓How many percent of the reads belongs to the most abundant species in the sample?

Fill in the answer in the shared Google docs table

❓ How many of the reads in your sample originate from Lactobacillus rhamnosus?

Solution - Click to expand

9086 reads

❓ Look at the last line in the report, how many percent of the reads could not be classified (does not match anything in the database)?

Solution - Click to expand

28%

❓ How many percent of the classified reads could not be assign at species level?

Solution - Click to expand

16,5%

XII] Create a report on genus level. Name the output file containing the report sample1_proGenomes_genus.summary

❓How many percent of the classified reads could not be assign at genus level?

Solution - Click to expand

1,766%

❓ What is the most abundant genera, and how many percent of the reads have mapped to this genera?

Solution - Click to expand

Bifidobacterium, 55,6%

Optional exercise

In the folder ~/practical/4/optional_refseq_kaiju/ we have run the same rawdata files against the complete RefSeq database. Running Kaiju against this database require at least 21GB RAM.

XIII] Move to this folder and generate summary reports at both genus and species level. NB! The path to the RefSeq database is: /net/software/databases/refseq_kaiju/

❓ What are the major differences when comparing the same data against two different databases (if any)?

Solution - Click to expand

More reads are classified using RefSeq, but less reads are can be assigned to species and genus level. Also, much fewer taxa are identified using RefSeq. This could be due to that RefSeq assign reads more accurately, hence assigning the reads to a lower taxonomic level.

❓ Compare the number of reads assigned to the species Bifidobacterium longum? Can you explain the differences?

Solution - Click to expand

RefSeq has more data from this specie and fewer subspecies, hence more reads match this specie using RefSeq. The ProGenome database has more subspecies, and assign reads (accurately?) to these subspecies and not to Bifidobacterium longum

In the folder ~/practical/4/optional_e_kaiju/ we have run the same rawdata files against the complete non-redundant protein database nr. This database took 256 GB of diskspace, therefore we could not put it on the VM.

The Kaiju output file is named sample1_e_kaiju.out, the Kaiju report on species level sample1_e_kaiju_species.summary and the Kaiju report on genus level sample1_e_kaiju_genus.summary

XIV] Move to this folder and open the summary reports.

❓ What are the major differences when comparing the same data against RefSeq (if any)?

Solution - Click to expand

Less reads can be assigned to species and genus level when using RefSeq. Probably due to the larger database

XV] Generate a Krona plot of the sample1_e_kaiju.out file like you did previously.

❓ Can you see the differences you just described between the analysis against RefSeq and the complete nr in the plot?

Solution - Click to expand

Yes...

Progress tracker

Part 1 finished

2. Taxonomic classification using a K-mer based method

💡Kraken is a system for ultrafast assignment of taxonomic labels to short DNA sequences using exact alignments of k-mers and a novel classification algorithm.

Kraken’s default database contains just under 6 billion (6e9) distinct k-mers, and requires at least 160 GB of disk space. We will therefore use MiniKraken - Kraken run with a reduced database. The database is 4 GB and contains about 2.7% of kmers from the original database. You can read more here.

I] Create a directory named kraken in ~/practical/4/ and make symbolic links to the preprocessed FASTQ files from the 3. Trimming & filtering - the paired FASTQ from repair (sample1_R1_pair.fastq.gz and sample1_R2_pair.fastq.gz).

II] Activate the correct Conda environment and view the Kraken run options (or arguments), type:

kraken -h

In order to classify reads with Kraken, you need to tell the program where the database is locate. We have downloaded the reduced database, formatted it and placed it here: /net/software/databases/minikraken/.

II] Run Kraken on the two FASTQ files and name the output sample1_kraken.out. Remember that it is a paired end sample:

kraken --db /net/software/databases/minikraken/ --paired sample1_R1_pair.fastq.gz sample1_R2_pair.fastq.gz --output sample1_kraken.out --threads 4

❓ What does the --paired flag do?

Solution - Click to expand

Specify that the input is PE

III] Look at the first lines in output file (head sample1_kraken.out).

Each sequence classified by Kraken results in a single line of output. Output lines contain five tab-delimited fields; from left to right, they are:

“C”/“U”: one letter code indicating that the sequence was either classified or unclassified.
The sequence ID, obtained from the FASTA/FASTQ header.
The taxonomy ID Kraken used to label the sequence; this is 0 if the sequence is unclassified.
The length of the sequence in bp.
A space-delimited list indicating the LCA mapping of each k-mer in the sequence. For example, “562:13 561:4 A:31 0:1 562:3” would indicate that:
- the first 13 k-mers mapped to taxonomy ID #562
- the next 4 k-mers mapped to taxonomy ID #561
- the next 31 k-mers contained an ambiguous nucleotide
- the next k-mer was not in the database
- the last 3 k-mers mapped to taxonomy ID #562

IV] In order to make a Krona chart of the Kraken output you need to remove some of the columns in the sample1_kraken.out file:

cut -f2,3 sample1_kraken.out > sample1_kraken.krona.in

V] Make a Krona chart from the file sample1_kraken.krona.in using Krona’s ktImportText program (activate the Krona environment):

ktImportTaxonomy sample1_kraken.krona.in -o sample1_kraken.html

VI] Open the Krona chart from the Kraken analysis.

❓How many percent of the reads belongs to the most abundant species in the sample?

Fill in the answer in the shared Google docs table

❓What is the most abundant genera, and how of the reads have mapped to this genera?

Solution - Click to expand

*Bifidobacterium longum*, 1728338 reads

❓Does any of the reads map to the specie Escherichia coli? How many of the reads in your sample originate from this specie?

Solution - Click to expand

39 reads

VII] Generate a taxonomic report of the Kraken output. NB! It is important that the database used must be the same as the one used to generate the output file (activate the Kraken environment):

kraken-report --db /net/software/databases/minikraken/ sample1_kraken.out > sample1_kraken.rpt

The output of kraken-report is tab-delimited, with one line per taxon. The fields of the output, from left-to-right, are as follows:

Percentage of reads covered by the clade rooted at this taxon
Number of reads covered by the clade rooted at this taxon
Number of reads assigned directly to this taxon
A rank code, indicating U = unclassified, D = domain, K = kingdom, P = phylum, C = class, O = order, F = family, G = genus, or S = species. All other ranks are simply ‘-’.
NCBI taxonomy ID
indented scientific name

VIII] Open the Kraken report in a text editor (e.g. Gedit)

❓How many percent of the reads were unclassified?

Solution - Click to expand

24.18%

❓How many percent of the reads belongs to the most abundant species in the sample?

Fill in the answer in the shared Google docs table

Like for Kaiju, we have prerun the same data against a larger database - the complete Kraken database. The database is 505 GB in size, therefore we could not put it on the VM. The output is ~/practical/4/stor_kraken/

The Kaiju output file is named sample1_stor_kraken.out, the Kraken report is called sample1_stor_kraken.rpt.

XI] Compare this report with the report from the run against the smaller database (minikraken).

❓Against which database does the analysis produce least unclassified reads and how many reads are unclassified?

Solution - Click to expand

8.43% using the complete **Kraken** database

❓Are there any major differences in the taxa that are found using the two different databases?

Solution - Click to expand

The analysis using the complete **Kraken** database assign even more reads to *Bifidobacterium*, and detects more subspecies in many other genus

X] Make a Krona chart from the sample1_stor_kraken.out

❓ Can you see any taxonomic differences between the analysis against Minikraken and the complete Kraken database in the Krona plot?

Solution - Click to expand

Yes...

Progress tracker

Part 2 finished

3. Taxonomic classification using clade specific markes

💡MetaPhlAn3 is part of bioBakery which is a collection of tools and workflows for executing common microbial community analyses.

MetaPhlAn3 can be used for profiling the composition of microbial communities from metagenomic shotgun sequencing data. MetaPhlAn3 relies on unique clade-specific marker genes identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic).

You can read more about MetaPhlAn3 here

I] Create a directory named metaphlan in ~/practical/4/ and make symbolic links to the preprocessed FASTQ files from the sample1_R1_pair.fastq.gz and sample1_R2_pair.fastq.gz from the exercise 3. Trimming & filtering - the paired FASTQ from repair

In order to run MetaPhlAn3 you need activate the correct Conda environment.

II] View the MetaPhlAn3 run options (or arguments), type:

metaphlan2.py -h

III] Run MetaPhlAn3 on the two FASTQ files and name the output sample1_metaphlan.txt. Remember that it is a paired end sample:

metaphlan sample1_R1_pair.fastq.gz,sample1_R2_pair.fastq.gz --bowtie2db /opt/databases/metaphlan3_db --bowtie2out sample1.bz2 --nproc 16 --input_type fastq -o sample1_metaphlan.txt

# --bowtie2db, location of the database

MetaPhlAn3 will take around 8 minutes to process this data, so it is possible to stretch your legs now:).

❓ What does the --nproc flag do?

Solution - Click to expand

Specify the number of CPUs

MetaPhlAn3 has created two output files. The first file sample1.bz2 is a compressed file and contains the intermediate mapping results to unique sequence markers. Alignments are listed one per line in tab-separated columns of read and reference marker.

IV] Look at the start of this file:

bzcat sample1.bz2 | head

The second file contains the final computed organism abundances. Organism abundances are listed one clade per line, tab-separated from the clade’s percent abundance.

V] Look at the first lines in output file:

head sample1_metaphlan.txt

After three lines with information on how the output was generated, the abundance profile should look similar to this - a text file with four columns:

The first column lists clades, ranging from taxonomic kingdoms (Bacteria, Archaea, etc.) through species. The taxonomic level of each clade is prefixed to indicate its level:

Kingdom: k__, Phylum: p__, Class: c__, Order: o__, Family: f__, Genus: g__, Species: s__

The second column list the NCBI taxonomy of the taxa

The third column list the relative abundances of the identified taxa. Since sequence-based profiling is relative and does not provide absolute cellular abundance measures, clades are hierarchically summed. Each level will sum to 100%; that is, the sum of all kingdom-level clades is 100%, the sum of all genus-level clades (including unclassified) is also 100%, and so forth.

In order to make a Krona chart of the MetaPhlAn3 output you need to convert it using the script metaphlan2krona.py. However, this script was written for MetaPhlAn2 which generate an abundance profile output with only two columns:

In order for metaphlan2krona.py to work properly, you first need to do some manipulations to the MetaPhlAn3 output to simulate a MetaPhlAn2 output

Activate the Metaphlan conda environment

V] Remove column 2 and 4 from the sample1_metaphlan.txt:

cut -f1,3 sample1_metaphlan.txt > sample1_metaphlan_ok.txt

Activate the Krona conda environment

VI] Convert the output to Krona format:

metaphlan2krona.py -p sample1_metaphlan_ok.txt -k sample1_metaphlan.krona

VII] Then make the HTML plot in the correct conda environment:

ktImportText -o sample1_metaphlan.html sample1_metaphlan.krona

❓ Can you see any major taxonomic differences between this taxonomic profile, and the profiles you got from the analysis using Kaiju and Kraken?

Solution - Click to expand

Yes, fewer taxa identified

❓How many of the reads in your sample originate from Escherichia coli?

Solution - Click to expand

Progress tracker

Part 3 finished

4. Visualisation of taxonomic profiles with MEGAN

💡MEGAN is a comprehensive toolbox for analysing microbiome data and comes with a user-friendly interface. MEGAN can perform both taxonomic and functional analysis and visualise the results in various plots and charts.

You can read more about the tool here.

The main application of MEGAN is to parse and analyse the result of a BLAST comparison of a set of reads against one or more reference databases, typically using BLASTN to compare against NCBI-NT. MEGAN also supports import of data from other programs such as Kaiju or Kraken in a delimiter-separated format (using comma’s or tabs). An example of such file is:

Taxon	#reads
Specie A	100
Specie B	400
Specie C	50

There are several ways to create a tab-separated taxon summary file. In this exercise you will use the taxonomic profile you generated using Kaiju. This output file include all taxonomic levels.

I] Create a directory named megan in ~/practical/4/ and make symbolic links to the Kaiju output against RefSeq (sample1_refseq_kaiju.out):

-ln -s ~/practical/4/optional_refseq_kaiju/sample1_refseq_kaiju.out

II] Look at the first part of the file sample1_refseq_kaiju.out:

head sample1_refseq_kaiju.out

The taxonomic annotation is in column 8 and is a NCBI ID number. You can read more about NCBI taxonomy here.

III] Add taxonomy labels (names not just numbers) to each assigned read using kaiju-addTaxonNames, but do not specify the taxonomic rank, since some reads have been assigned to species level, while other have been assign to lower taxonomic levels.

In order to run kaiju-addTaxonNames you need activate the Kaiju Conda environment.

kaiju-addTaxonNames -t /net/software/databases/refseq_kaiju/nodes.dmp -n /net/software/databases/refseq_kaiju/names.dmp -i sample1_refseq_kaiju.out -o sample1_refseq_kaiju.out.taxa

OUTPUT: sample1_refseq_kaiju.out.taxa

IV] Look at the first part of the file sample1_refseq_kaiju.out.taxa.

❓ Do you spot any changes in this file relative to sample1_refseq_kaiju.out?

Solution - Click to expand

The NCBI taxonomic label (number) has been replaced by the name of the taxa names in the last column

As mentioned MEGAN needs a tab-delimited input file with one taxa per row and the number of reads matching this taxa. The Kaiju output has one reads per row (2817723 rows), and multiple reads may originate from the same taxa. You need to summarize the number of reads from each individual taxa before importing the taxonomic profile into MEGAN

V] Summarise all taxa in
sample1_refseq_kaiju.out.taxa using this command:

awk -F '\t' 'BEGIN{FS=OFS="\t"}{print $8}' sample1_refseq_kaiju.out.taxa | sort | uniq -c | sed -E 's/^ *//; s/ /\t/' | awk 'BEGIN {FS="\t"; OFS="\t"} {print $2, $1}' > sample1_refseq_kaiju_megan.txt

# awk remove all columns except column 8 containing the taxonomic label
# sort will sort the taxa
# uniq will count the number of occurrences of each taxa, collapse identical taxa into one and write a second row with the number of occurrences for each taxa
# sed will change the delimiter to tab
# awk will change order of column 1 (counts) and column 2 (taxa)

OUTPUT: sample1_refseq_kaiju_megan.txt

❓ How many unique taxa are there in the sample?

Solution - Click to expand

1543 (including non-assigned reads)

VI] Open the file sample1_refseq_kaiju_megan.txt in a text editor. The first lines should look something like this:

	513462
Acaryochloris marina MBIC11017	1
Acetoanaerobium sticklandii	6
Acetobacteraceae	1

The first column displaying taxon, and the second showing how many sequence reads that mapped to this taxon. Note that the taxa are on various taxonomic levels (genus, species, etc.). The first row contain the reads with no taxonomic labels (did not match anything in RefSeq).

VII] Add “NA” to all the unassigned reads in the first row, second column in a text editor:

NA	513462
Acaryochloris marina MBIC11017	1
Acetoanaerobium sticklandii	6
Acetobacteraceae	1

Reduce number of taxa. It is very normal to reduce the number of taxa in a taxonomic profile, by filtering out the least abundant taxa. This can be done for example by removing all taxa with less than 1 % abundance, or by removing all taxa with less reads than a certain number we set as a threshold. Reducing the number of taxa will also make viewing in MEGAN easier.

VIII] Sort sample1_refseq_kaiju_megan.txt so that the most abundant taxa are at the top:

sort -t \t' -k2 -nr sample1_refseq_kaiju_megan.txt > sample1_refseq_kaiju_megan_sorted.txt

# -t specify tab as delimiter
# -k which column to sort by
# -nr numerical values with the most abundant at the top

IX] Open sample1_refseq_kaiju_megan_sorted.txt in a text editor and remove all taxa with less than 100 counts (sequence reads). Save the new file as sample1_refseq_kaiju_megan_100.txt

❓How many taxa are left?

Fill in the answer in the shared Google docs table

This file can now be visualised in MEGAN👍

X] Open MEGAN6:

MEGAN

XII] Press ‘File’, select ‘Import’ and ‘CSV Format’:

XIII] Select the sample1_refseq_kaiju_megan_100.txt to load the data into MEGAN:

XIV] Select ‘Class Count’, ‘Tab’ and ‘Taxonomy’. Press ‘Apply’:

A new window with the taxonomic assignments on a hierarchical tree structure will appear.

As you can see, MEGAN does not recognise “root” and “NA” as NCBI taxa. Instead, MEGAN display the sum of the reads from these categories as “Not assigned”.

In addition there is some information about the file you just imported in the “Message window”:

XV] Click on the circle representing Bacteria. Two numbers will appear:

1. Assigned means how many reads that were classified to the Bacteria taxa - that is on domain level

2. Summed means how many reads in total in the sample that are assigned to this AND deeper hierarchical levels of this domain

❓How many reads are assigned to the Bacteria level and not placed further down in the taxonomic hierarchy (eg. at family or genus level)?

Solution - Click to expand

2 325 reads. These sequence reads do not map to any deeper taxa and therefore is not possible to assign more precise taxonomy to.

XV] Open sample1_refseq_kaiju_megan_100.txt.

❓How many reads are assigned to the Bacteria level by Kaiju? Is it the same number as in MEGAN?

Solution - Click to expand

2 145 reads. Not the same number!!

Be aware that the NCBI taxonomy in MEGAN and Kaiju may not always be in sync. E.g. the specie Marinifilaceae bacterium SPP2 with 180 read counts from the taxonomic profile from Kaiju is not recognized by MEGAN. Instead, MEGAN has “counted” these 180 read counts as Bacteria.

The NCBI taxonomy is constantly increasing as new taxa are described and added, hence these discrapencies are often a result of the tools not being in sync

On the top of the hierarchical tree structure window, there are many options for what you can do with the data. You can mouse over all the symbols at the top to see their functions.

XVI] Press the “Rank” button, and expand the tree to the Family level.

❓What does the different sizes of the circles represent?

Solution - Click to expand

Relative abundance. Larger circle = higher abundance of the taxon

❓How many reads are in total assigned to the family Bifidobacteriaceae?

Fill in the answer in the shared Google docs table

Tip: If you have problems reading the information when pointing the mouse over the taxa, you can select the taxa -> select “Options” form the top menu -> select “List Summary…”

XVII] Click on the “Rank” button and select Species. MEGAN will try to expand the taxonomic tree all the way to species level.

The tree will expand to species level, and taxa assigned to Species level will be highlighted. Notice that some taxa are not highlighted, eg. Campylobacter.

❓ Why do you think there are no leaves on species level for the genus Campylobacter?

Solution - Click to expand

Because **Kaiju** could not assign taxonomy on species level to any of the reads belonging to this genus.

MEGAN cannot display subspecies, but collapse subspecies reads to species level.

XVIII] Click on the circle representing Bifidobacterium longum.

❓ How many reads are in total were assigned at subspecies level?

Solution - Click to expand

Summed (subspecies + species level) = 1440617 - Assigned (species level) = 539411 = 901206 at subspecies level

You can double check this by looking into your sample1_refseq_kaiju_megan_100.txt file

MEGAN has an interactive GUI, so you can select several nodes in the window by pressing Shift + mouse button. Alternatively, go to the “Select” menu.

XIX] Either way you choose, select all leaves at Phylum level and click on the “Draw data as heatmaps”.

❓ What does the darker green colour mean?

Solution - Click to expand

Higher abundance of this taxa.

XX] Click on the symbol “Show charts” and select “Show bar charts”. In the new window showing the distribution of reads at the level of the sample that appears, click on the “Classes” tab in the left part of the window (next to “Series”).

XXI] Try to change the appearance of the taxonomic profile by selecting some of the other chart views on the top, for example “Bar chart”

❓ Approximately how many percent of the reads belong to Actinobacteria? Hint: Press the % sign at the top.

Solution - Click to expand

~90%.

XXII] Make a word cloud of the taxonomic profile before you close this MEGAN.

MEGAN is very useful for visualisation and comparison of multiple samples. We will use the tool more extensively later in the course.

Progress tracker

Complete

That was the end of this practical - Good job 👍