6. Genome assembly from metagenomes

Sequence assembly is the computational reconstruction of a longer sequence from smaller sequence reads. Assembling genomes out of heterogeneous samples is an extremely challenging problem and one that remains unsolved.

Strain-level resolution and whole-genome profiling of specific microbes from metagenomic data sets require computationally demanding metagenomic assembly on pooled metagenomic samples. This approach does not generalise to studies with thousands of samples or low-abundance organisms, and it can produce chimeric (combine multiple strains) and fragmentary genome reconstructions.

In this Module you will be working with the data that you worked with in Module 2; shotgun metagenomic data of DNA isolated from premature baby stool. The main objectives of the practical is to assemble the preprocessed data from this metagenomic sample, and validate the assembly.

In this exercise you will:

Perform metagenomic assembly of the data using Megahit
Explore metagenomic assembly from metaSPAdes
Quality assessment using MetaQUAST

Good luck with the exercises and have fun👍😃👍

1. Perform metagenomic assembly of the data using Megahit

💡We will use the assembler Megahit to assemble the trimmed data. Megahit is an ultra fast assembly tool written for the assembly of metagenomic data. You can read more about the assembler here.

Assembling metagenomic data can be very resource demanding. This data set was tested on a virtual machine with 32 GB RAM and 8 cores and took 22 minutes to assemble!!. You can choose to start the assembly now, and continue on the last part of this exercise in the practical.

In order to run Megahit you need to activate the correct Conda environment.

I] In ~/practical/ and copy the data for this exercise from the shared disk:

cp -r /net/software/practical/6 .

Remember to give yourself write permissions to the directory (and sub directories)

II] Go to the directory ~/practical/6/megahit_prepro/ and make symbolic links to the preprocessed data from Exercise 3:

ln -s ~/practical/3/merge/sample1_unmerged_R1.fastq.gz .
ln -s ~/practical/3/merge/sample1_unmerged_R2.fastq.gz .
ln -s ~/practical/3/merge/sample1_merged.fastq.gz .

IV] Type megahit in the terminal and view the parameter setting options.

❓Why should you choose the -r flag and not the --12 flag to specify the input file with the merged reads?

Solution - Click to expand

Interleaved files are when the R1 and R2 reads are combined in one file, so that for each read pair, the R1 read in the file comes immediately before the R2 read:

read1_R1
read1_R2
read2_R1
read2_R2
read3_R1
read3_R2

V] Run assembly with Megahit:

megahit -1 sample1_unmerged_R1.fastq.gz -2 sample1_unmerged_R2.fastq.gz -r sample1_merged.fastq.gz -o sample_megahit_prepro

Assembling metagenomic data can be very resource demanding and take long time to perform.

You should start the Megahit assembly now, and continue on the second part of this exercise in the practical, and return to this part when the assembly is done (10 minutes)!

VI] When Megahit is done, go the the directory where the assembled contigs are and view the log file

❓How many CPU threads and memory was allocated to Megahit for performing the assembly?

Solution - Click to expand

Look at the top of the log file

❓How much time did it take for Megahit to finish the assembly?

Solution - Click to expand

Look at the bottom of the log file

❓What is the total length of all the contigs in the assembly?

Fill in the answer in the shared Google docs table

VII] Count the number of contigs in the FASTA file:

grep -c ">" final.contigs.fa

❓How many contigs did Megahit produce?

Fill in the answer in the shared Google docs table

❓ What is the size of the largest contig?

Fill in the answer in the shared Google docs table

❓What is overall GC content of the assembly?

Fill in the answer in the shared Google docs table

❓How is the Megahit assembly compared to the metaSPAdes assembly relative to total length, total GC content, number of contigs and size of largest contig?

Solution - Click to expand

You can compare the stats you filled in the Google docs and do the evaluation

VIII] Since you will be working with several assemblies, rename the FASTA file with the contigs to sample1_megahit_prepro.fasta:

mv final.contigs.fa sample1_megahit_prepro.fasta

Progress tracker

Part 1 finished

2. Metagenomic assembly of the data using metaSPAdes

metaSPAdes is part of the SPAdes (St. Petersburg genome assembler) - an assembly toolkit containing various assembly pipelines. metaSPAdes supports only a single library which has to be paired-end. In addition, you can provide long reads (e.g. using --pacbio or --nanopore options), but hybrid assembly for metagenomes remains an experimental pipeline and optimal performance is not guaranteed. It does not support careful mode

The current version of SPAdes works with Illumina or IonTorrent reads and is capable of providing hybrid assemblies using PacBio, Oxford Nanopore and Sanger reads. You can also provide additional contigs that will be used as long reads.

You can read more about the assembler here.

NOTE:

If adapter and/or quality trimming software has been used prior to assembly, files with the orphan reads can be provided as “single read files” -s for the corresponding read-pair library.
If you have merged some of the reads from your paired-end library (using tools s.a. BBMerge, you should provide the file with resulting reads as a “merged read file” --merged for the corresponding library.

Because of the compute time have prerun metaSPAdes on Sample 1:

The assembly of the preprocessed data using metaSPAdes was performed with the following parameter settings:

metaspades.py -1 sample1_unmerged_R1.fastq.gz -2 sample1_unmerged_R2.fastq.gz --merged sample1_merged.fastq.gz -o sample1_metaspades_prepro

I] Go to the directory /practical/6/metaspades_prepro/sample1_metaspades_prepro/ and view the content

II] Open the spades.log file

❓How many CPU threads and memory was allocated to metaSPAdes for performing the assembly?

Solution - Click to expand

Look toward the top of the log file

❓How much time did it take for metaSPAdes to finish the assembly with these compute resources?

Solution - Click to expand

Look at the bottom of the log file

❓Are you able to find the total length of all the contigs in the assembly in the spades.log file?

Solution - Click to expand

The log file contains some of the basic assembly statistics. You can also obtain this with some one-liners in the terminal. Let’s practise.

III] Count the total number of basepairs in the multiple FASTA file:

grep -v ">" contigs.fasta | wc | awk '{print $3-$1}'

❓What is the total length of all the contigs in the assembly?

Fill in the answer in the shared Google docs table

IV] Count the number of contigs in the FASTA file:

grep -c "^>" contigs.fasta

❓What is number of contigs in the assembly?

Fill in the answer in the shared Google docs table

❓What is the size of the largest contig? HINT: Sort the contigs by size using Seqkit (in the proper Conda environment)

Hint - Click to expand

`seqkit sort --by-length --reverse contigs.fasta > contigs_sorted.fasta`

Fill in the answer in the shared Google docs table

V] Generate a GC distribution of the contigs using SeqKit fx2tab:

seqkit fx2tab --name --gc your_input_file > contiglistGC.txt

❓What is the GC content of the largest contig?

Solution - Click to expand

60.47%

VI] Sort the GC distribution of the contigs using sort:

sort -k 2,2 contiglistGC.txt > sorted_contiglistGC.txt

❓What is the GC content of the contig with highest GC content?

Solution - Click to expand

100%

❓Is this very likely a true contig?

Solution - Click to expand

Probably a contig of read(s) that should have been removed in the preprocessing step

The default settings of metaSPAdes is to keep all contigs. It is important to evaluate the assembly results before you continue working with downstream analysis as error in the assembly may produce downstream errors.

VII] Remove all contigs > 200 bp:

seqkit seq -m 200 contigs.fasta > contigs200.fasta

❓What is size of the smallest contigs in the assembly now?

Fill in the answer in the shared Google docs table

❓How many contigs were > 200 bp?

Solution - Click to expand

1561 contigs

VIII] You can produce general statistics of an assembly in many ways. Try with SeqKit and stats.sh (stats.sh is part of the Bbtools package)

seqkit stats contigs200.fasta
stats.sh contigs200.fasta

❓What is overall GC content of the assembly?

Fill in the answer in the shared Google docs table

IX] Since you will be working with several assemblies, rename the FASTA file with the contigs to sample1_megahit_prepro.fasta:

mv contigs200.fasta sample1_metaspades_prepro.fasta

Progress tracker

Part 2 finished

Return to the Megahit practical, exercise 4, and complete this part of this exercise from there.

3. Quality assessment using MetaQUAST

💡 MetaQUAST evaluates and compares metagenome assemblies based on alignments to close references. It is based on QUAST genome quality assessment tool, but addresses features specific for metagenome datasets. The tool will divide all contigs into groups aligned to each reference genome they align to.

You will find more information about MetaQUAST here.

We have pre-run the assembly of the same data you have been working on, but directly on the raw data and on the data after removing host contamination before merging overlapping reads. The contig files are located in directory ~/practical/6/validation

I] First, copy the Megahit and metaSPAdes assemblies of the preprocessed data to the directory named /validation.

II] Check that all six assemblies (FASTA files) are there:

cd ../path_to/validation
ll

In this directory, there is a directory named reference_genomes, which contains the complete genomes sequences of the species we know are in the sample from the taxonomic profile we have performed prior to the course:). To save compute time, you can provide this directory of reference genomes together with the input files.

We prerun MetaQUAST to download the reference genomes by providing a text file with the species names of species found in the sample like this:

metaquast -o metaquast_out -t 16 *.fasta

# -t, Maximum number of threads

MetaQUAST identified all 16S rRNAs present, ran a taxonomic profile on these 16S rRNAs and if available downloaded the complete genomes of the from NCBI. In the next step MetaQUAST will align contigs against these genomes.

Alternatively, you can let MetaQUAST identify which species are present by searching for 16S rRNA sequences in the contigs and produce a list of references from this. The reference genomes will be downloaded from NCBI and the contigs will be aligned against these genomes. This will make the run time longer.

III] We suggest that you run MetaQUAST comparing the six assemblies and include the list of known species:

metaquast -R reference_genomes -o metaquast_results *.fasta

This will take around 5 minutes to run. You can stretch your legs now!!!

IV] When MetaQUAST is finished, open the MetaQUAST report in a web browser. There is a lot of information about your assemblies here. Tip: if you mouse over any of the text descriptions in the table (or plot) more detailed information about the particular descriptor. Use the report to evaluate which performs best/worst.

❓How many species is recognised in your sample and what is the total length of the reference genomes ?

Solution - Click to expand

12 species, 32 389 978 bp

❓Which of the assemblies is the largest (in bp)?

Fill in the answer in the shared Google docs table

❓Which of the assemblies produces the longest contig?

Solution - Click to expand

Megahit raw

❓Which of the assemblies produces most fully unaligned contigs relative to the reference genomes?

Solution - Click to expand

Both assemblies with the raw data

❓Which of the assemblies produces most mismatches relative to the reference genomes?

Solution - Click to expand

The Megahit assemblies

❓Bifidobacterium longum subsp. infantis is one of the species identified in the sample. How much of his genome is covered by the sample1_metaspades_prepro assembly (fraction)?

Solution - Click to expand

98,3%

Optional exercise

Have a look at the contig versus reference plot.

❓To which specie does the largest number of contigs align to and what is the genome size of the genome and how many contigs from sample1_metaspades_prepro align with this specie?

Solution - Click to expand

*Lactobacillus rhamnosusHN001*, 2 874 106 bp, 429 contigs

❓Which specie has the largest fraction of the total genome covered by all assemblies, and how much of the genome is approximately covered? HINT: Change plot to “Genome frac.” view

Solution - Click to expand

*Bifidobacterium longum subsp. infantis*, 98%

Select the Bifidobacterium longum subsp. infantis genome from the list of reference genomes at the bottom of the webpage. A new page will open with only mapping information against this specie.

❓Which assembler produces the largest contig that align to this reference genome and how long is it?

Solution - Click to expand

`sample1_meghit_unmerged` and `sample1_meghit_prepro`

❓ Which of the assemblies produces most mismatches against Bifidobacterium longum subsp. infantis?

Solution - Click to expand

`sample1_meghit_prepro`

❓What is the average GC content of Bifidobacterium longum subsp. infantis and what is the average GC content of the contigs mapping to this reference from the sample1_metaspades_prepro assembly?

Solution - Click to expand

59.86 and 59.82

❓ Based on the MetaQUAST results, which assembly would you continue working with?

Optional exercise II

Icarus generates contig size viewer and one or more contig alignment viewers. Contig size viewer contigs ordered from longest to shortest. Click on the “View in Icarus contig browser”.

Select to view Bifidobacterium longum subsp. infantis in more detail

❓How many contigs are tagged as unaligned?

Solution - Click to expand

Two

Select the unaligned contigs in the sample1_megahit_raw.fasta assembly

❓What could be the problem with this contig?

Solution - Click to expand

It should maybe be split in two contigs, but is wrongly joined.

Have a look at some of the other reference genomes too

Progress tracker

Complete

6. Genome assembly from metagenomes

In this exercise you will:

Good luck with the exercises and have fun👍😃👍

1. Perform metagenomic assembly of the data using Megahit

2. Metagenomic assembly of the data using metaSPAdes

NOTE:

3. Quality assessment using MetaQUAST

Optional exercise

Optional exercise II

That was the end of this practical - Good job 👍