Day 1a - Introduction
Learning objectives
In this exercise you learn how to
- find toy examples to work with,
- construct graphs using
vg construct
, - visualize graphs using
vg view
andBandage
, - convert graphs using
vg view
.
Getting started
Make sure you have vg
installed. It is already available on the course workstations. If you want to build it on your laptop, follow the instructions at the vg homepage (note that building vg
and all submodules from source can take ~1h). In this exercise, you will use small toy examples from the test
directory. So make sure you have checked out vg
repository:
cd ~
git clone https://github.com/vgteam/vg.git
Now create a directory to work on for this tutorial:
mkdir day1_intro
cd day1_intro
ln -s ~/vg/test/tiny
Constructing and viewing your first graphs
Like many other toolkits, vg
is comes with many different subcommands. First we will use vg construct
to build our first graph. Run it without parameters to get information on its usage:
vg construct
Let’s construct a graph from just one sequence in file tiny/tiny.fa
, which looks like this:
>x
CAAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTG
To construct a graph, run
vg construct -r tiny/tiny.fa -m 32 > tiny.ref.vg
This will create a graph that just consists of a linear chain of nodes, each with 32 characters.
The switch -m
tells vg
to put at most 32 characters into each graph node (you might want to run it with different values and observe the different results.) To visualize a graph, you can use vg view
. Per default, vg view
will output a graph in GFA format. By adding -j
or -d
, you can generate JSON or DOT output.
vg view tiny.ref.vg
vg view -j tiny.ref.vg
vg view -d tiny.ref.vg
To work with the JSON output the tool jq comes in handy. To get all sequences in the graph, for instance, try
vg view -j tiny.ref.vg | jq '.node[].sequence'
Next, we use graphviz to layout the graph representation in DOT format
vg view -d tiny.ref.vg | dot -Tpdf -o tiny.ref.pdf
View the PDF and compare it to the input sequence. Now vary the parameter passed to -m
of vg construct
and visualize the result.
Now let’s build a new graph that has some variants built into it. First, take a look at at tiny/tiny.vcf.gz
, which contains variants in (gzipped) VCF format.
vg construct -r tiny/tiny.fa -v tiny/tiny.vcf.gz -m 32 > tiny.vg
Are there any warnings in the output? If yes, what do they indicate? Visualize the resulting graph.
Ok, that’s nice, but you might wonder which sequence of nodes actually corresponds to the sequence (tiny.fa
) you started from? To keep track of that, vg
adds a path to the graph. Let’s add this path to the visualization.
vg view -dp tiny.vg | dot -Tpdf -o tiny.pdf
You find the output too crowded? Option -S
removes the sequence labels and only plots node IDs.
vg view -dpS tiny.vg | dot -Tpdf -o tiny.pdf
Another tool that comes with the graphviz
package is Neato
. It creates force-directed layouts of a graph.
vg view -dpS tiny.vg | neato -Tpdf -o tiny.pdf
For these small graphs, the difference it not that big, but for more involved cases, these layouts can be much easier to read.
It’s also possible to read the graph in different formats. For instance, we can write the graph in GFA, modify it using text processing tools, and read it back in:
vg view tiny.vg > tiny.gfa
The GFA file has three elements. S
records represent nodes, L
records represent edges, and P
records represent paths.
We can use grep to remove the path lines:
cat tiny.gfa | grep -v ^P | vg view -dp - | dot -Tpdf -o tiny.no_path.pdf
Try to remove the nodes and/or the edges from the GFA file and visualize them. What happens?
Another tool for visualizing (not too big) graphs is Bandage. It supports graphs in GFA format. Try to visualize the graph
Bandage load tiny.gfa
Let’s step up to a slightly bigger example.
ln -s ~/vg/test/1mb1kgp
This directory contains 1Mbp of 1000 Genomes data for chr20:1000000-2000000
. As for the tiny example, let’s’ build one linear graph that only contains the reference sequence and one graph that additionally encodes the known sequence variation. The reference sequence is contained in 1mb1kgp/z.fa
, and the variation is contained in 1mb1kgp/z.vcf.gz
. Make a reference-only graph named ref.vg
, and a graph with variation named z.vg
. Look at the previous examples to figure out the command.
Take a look at the length of the graph, and the number of nodes/edges using vg stats
.
You might be tempted to visualize these graphs (and of course you are welcome to try), but they are sufficiently big already that neato
can run out of memory and crash. Try to load the graph in Bandage
too (it will be slow). Try to generate a PNG image with Bandage
(take a look at the Bandage -h
output).
Spoiler: build graphs from sequence alignments
A set of pairwise alignments implies a variation graph, so pangenome graphs can be obtained from alignments too. Use minimap2
and seqwish
to build graphs from the HLA gene haplotypes
ln -s ~/vg/test/GRCh38_alts/FASTA/HLA/
minimap2 -c -x asm20 -X -t 8 HLA/DRB1-3123.fa HLA/DRB1-3123.fa > DRB1-3123.paf
seqwish -s HLA/DRB1-3123.fa -p DRB1-3123.paf -g DRB1-3123.gfa
odgi build -g DRB1-3123.gfa -o - | odgi sort -i - -o DRB1-3123.og
odgi viz -i DRB1-3123.og -o DRB1-3123.png -x 2000
Use Bandage
to visualize graphs.
Back
Back to main page.