Day 1b - PGGB

Learning objectives

In this exercise you learn how to

build pangenome graphs using pggb,
explore pggb’s results,
understand how parameters affect the built pangenome graphs.

Getting started

Make sure you have pggb and its tools installed. They are already available on the course workstations. If you want to build everything on your laptop, follow the instructions on the pggb homepage (guix, docker, singularity, and conda alternatives are available). In either case, make sure you have checked out the pggb repository, since this tutorial uses its example data:

cd ~
git clone https://github.com/pangenome/pggb.git

Also check out the wfmash repository (we need one of its scripts):

cd ~
git clone https://github.com/waveygang/wfmash.git

Now create a working directory for this tutorial and link the pggb example data into it:

mkdir day1_pggb
cd day1_pggb
ln -s ~/pggb/data

Build HLA pangenome graphs

The human leukocyte antigen (HLA) system is a complex of genes on chromosome 6 in humans which encode cell-surface proteins responsible for the regulation of the immune system.

Let’s build a pangenome graph from a collection of sequences of the DRB1-3123 gene:

pggb -i data/HLA/DRB1-3123.fa.gz -n 12 -t 8 -o out_DRB1_3123

Run pggb without parameters to get information on the meaning of each parameter:

pggb

Take a look at the files in the out_DRB1_3123 folder. Visualize the graph with Bandage.

Why did we specify -n 12?

How many alignments were executed during the pairwise alignment (take a look at the PAF output)? Visualize the alignments:

cd out_DRB1_3123
~/wfmash/scripts/paf2dotplot png large *paf
cd ..
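To answer the question about the number of alignments, it can help to count the records in the PAF file directly. The following is a minimal sketch, assuming the standard PAF layout (one alignment per line, query name in column 1, target name in column 6); paf_summary is a helper name invented for this tutorial, not part of pggb or wfmash.

```shell
# Summarize a PAF file: every line with at least the 12 mandatory
# PAF columns is counted as one alignment record.
# Column 1 is the query name, column 6 is the target name.
paf_summary() {
    awk -F '\t' '
        NF >= 12 { n++; pairs[$1 SUBSEP $6] = 1 }
        END {
            np = 0
            for (k in pairs) np++
            printf "alignments\t%d\nsequence_pairs\t%d\n", n, np
        }
    ' "$1"
}
```

For example, from the day1_pggb directory:

paf_summary out_DRB1_3123/*.paf

The two numbers can differ, because a query/target pair may be covered by more than one alignment record.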

Use odgi stats to obtain the graph length, and the number of nodes, edges, and paths. Do you think the resulting pangenome graph represents the input sequences well? Check the length and the number of the input sequences to answer this question.
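One way to check the number and length of the input sequences is sketched below. It assumes a gzip-compressed FASTA file and only needs gzip and awk; fasta_summary is a hypothetical helper name used for illustration.

```shell
# Print the number of sequences in a gzipped FASTA file together with
# the shortest, longest, and total sequence length.
fasta_summary() {
    gzip -dc "$1" | awk '
        # Fold the sequence that just ended into the running statistics.
        function record() {
            if (n == 0) return
            if (seen == 0 || len < min) min = len
            if (len > max) max = len
            total += len
            seen = 1
        }
        /^>/ { record(); n++; len = 0; next }   # header line: start a new sequence
        { len += length($0) }                    # sequence line: add its bases
        END {
            record()
            printf "sequences\t%d\nmin_length\t%d\nmax_length\t%d\ntotal_length\t%d\n", n, min, max, total
        }
    '
}
```

For example:

fasta_summary data/HLA/DRB1-3123.fa.gz

Comparing the graph length reported by odgi stats with min_length, max_length, and total_length gives a rough idea of how much sequence the graph has collapsed.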

How many blocks were selected and ‘smoothed’ during the two rounds of graph normalization (take a look at the *.log file to answer this question)?

Try building the same pangenome graph by specifying a lower percent identity (-p 95 by default):

pggb -i data/HLA/DRB1-3123.fa.gz -p 90 -n 12 -t 8 -o out2_DRB1_3123

Check the graph statistics. Does this pangenome graph represent the input sequences better or worse than the previously produced graph?

Try decreasing the number of mappings to retain for each segment:

pggb -i data/HLA/DRB1-3123.fa.gz -p 90 -n 6 -t 8 -o out3_DRB1_3123

How does it affect the graph?

Try to increase the target sequence length for the partial order alignment (POA) problem (-G 4001,4507 by default):

pggb -i data/HLA/DRB1-3123.fa.gz -p 90 -n 12 -t 8 -G 12000,13000 -o out4_DRB1_3123

How does this change the runtime and the memory usage? How does it affect the graph statistics? How many blocks were selected and ‘smoothed’ during the two rounds of graph normalization?

Try 1, 3, or 4 rounds of normalization (for example, by specifying -G 4001, -G 4001,4507,4547, or -G 4001,4507,4547,4999). How does this affect the graph statistics?

Take the second pggb run and try to increase the segment length (-s 10000 by default):

pggb -i data/HLA/DRB1-3123.fa.gz -s 20000 -p 90 -n 12 -t 8 -o out5_DRB1_3123

How is this affecting graph statistics? Why?

pggb produces intermediate graphs during the process. Let’s keep all of them:

pggb -i data/HLA/DRB1-3123.fa.gz -p 90 -n 12 -t 8 --keep-temp-files -o out2_DRB1_3123_keep_intermediate_graphs

What does the file whose name ends with .seqwish.gfa contain? And what about the file whose name ends with .smooth.1.gfa? Take a look at the graph statistics of all the GFA files in the out2_DRB1_3123_keep_intermediate_graphs folder.
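To compare several GFA files side by side, the basic statistics can also be computed straight from the GFA text. This is a rough sketch, assuming GFA version 1 files in which S lines carry the node sequence in the third column, L lines are edges, and P lines are paths; gfa_stats is a helper name made up for this tutorial, and odgi stats remains the reference tool.

```shell
# Print one tab-separated row of statistics per GFA file:
# graph length (sum of node sequence lengths), nodes, edges, paths.
gfa_stats() {
    printf 'file\tlength\tnodes\tedges\tpaths\n'
    for gfa in "$@"; do
        awk -F '\t' -v name="$gfa" '
            $1 == "S" { nodes++; len += length($3) }   # segment = node
            $1 == "L" { edges++ }                       # link = edge
            $1 == "P" { paths++ }                       # path
            END { printf "%s\t%d\t%d\t%d\t%d\n", name, len, nodes, edges, paths }
        ' "$gfa"
    done
}
```

For example:

gfa_stats out2_DRB1_3123_keep_intermediate_graphs/*.gfa

The table makes it easy to see how length and node count change from one stage of the pipeline to the next.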

Choose another HLA gene from the data folder and explore how the statistics of the resulting graph change as s, p, and n change. Produce scatter plots with the tested values of one pggb parameter (s, p, or n) on the x-axis and one graph statistic (length, number of nodes, or number of edges) on the y-axis. You can do that using the final graph and/or the intermediate ones.
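A parameter sweep like this is easier to reproduce if the commands are generated rather than typed. The sketch below only prints the pggb commands, one per tested value, so they can be inspected before running. It uses the options already seen in this tutorial (-i, -s, -p, -n, -t, -o); the helper name pggb_sweep, the sweep_ output folder names, and the baseline values (s=10000, p=90, n=12) are assumptions to adapt to the gene you chose. It also assumes the input path contains no spaces.

```shell
# Print one pggb command per tested value of a single parameter.
# $1: input FASTA, $2: parameter to vary (s, p, or n), rest: values to test.
pggb_sweep() {
    input=$1
    param=$2
    shift 2
    for value in "$@"; do
        # Baseline values (assumed); only the swept parameter is replaced.
        s=10000
        p=90
        n=12
        case $param in
            s) s=$value ;;
            p) p=$value ;;
            n) n=$value ;;
            *) echo "unknown parameter: $param" >&2; return 1 ;;
        esac
        printf 'pggb -i %s -s %s -p %s -n %s -t 8 -o sweep_%s%s\n' \
            "$input" "$s" "$p" "$n" "$param" "$value"
    done
}
```

For example, to print the commands for three values of p:

pggb_sweep data/HLA/DRB1-3123.fa.gz p 85 90 95

Once the commands look right, append | sh to the line above to execute them one after another. Each output folder then holds one graph whose statistics become one point in the scatter plot.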

Build LPA pangenome graphs

Lipoprotein(a) (LPA) is a low-density lipoprotein variant containing a protein called apolipoprotein(a). Genetic and epidemiological studies have identified lipoprotein(a) as a risk factor for atherosclerosis and related diseases, such as coronary heart disease and stroke.

Try to make LPA pangenome graphs. The input sequences are in data/LPA/LPA.fa.gz. Sequences in this locus have a peculiarity: which one? Hint: visualize the alignments and take a look at the graph layout (with Bandage and/or in the .draw_multiqc.png files).
