Day 3a - Pangenome (sub)graphs

Learning objectives

In this exercise you learn how to

extract subgraphs representing loci of interest,
visualize graph annotation,
untangle the pangenome graph.

Getting started

Make sure you have pggb and its tools installed.

Check out odgi repository (we need one of its example):

cd ~
git clone https://github.com/pangenome/odgi.git

Now create a directory to work on for this tutorial:

cd ~
mkdir day2_subgraphs
cd day2_subgraphs
ln -s ~/odgi/test

Lipoprotein A graph

Construct an odgi file from the LPA dataset in GFA format:

odgi build -g test/LPA.gfa -o LPA.og

The command creates a file called LPA.og, which contains the input graph in odgi format. This graph contains 13 contigs from 7 haploid human genome assemblies from 6 individuals plus the chm13 cell line.

Use odgi paths to display the sequence names, that is the path names.

To see the variants for the two contigs of the HG02572 sample, execute:

zgrep -v '^##' test/LPA.chm13__LPA__tig00000001.vcf.gz | head -n 9 | cut -f 1-9,16,17 | column -t

In the LPA.chm13__LPA__tig00000001.vcf.gz file, the variants were called with respect to the chm13__LPA__tig00000001 contig, which was used as the reference path. The ID field in the VCF lists the nodes involved in the variant. A > means that the node is visited in forward orientation, a < means that the node is visited in reverse orientation.

The insertion at position 1050 (G > GT) is present only in one of the HG02572’s contig (HG02572__LPA__tig00000001). To extract the sub-graph where this insertion falls, execute:

odgi extract -i LPA.og -n 23 -c 1 -o LPA.21_23_G_GT.og

The instruction extracts:

the node with ID 23 (-n 23),
the nodes reachable from this node following a single edge (-c 1) in the graph topology,
the edges connecting all the extracted nodes, and
the paths traversing all the extracted nodes.

How many nodes/edges the extracted graph contain? How many paths ? Which are their names?

Use Bandage to visualize the graph. Display node IDs. What can you see between nodes 21 and 23?

Human chromosome 6

Download the pangenome graph of the Human chromosome 6 in GFA format, decompress it, and convert it to a graph in odgi format.

wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/scratch/2021_11_16_pggb_wgg.88/chroms/chr6.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa.gz
gunzip chr6.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa.gz

odgi build -g chr6.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa -o chr6.pan.og -t 8 -P

This graph contains contigs of 88 haploid, phased human genome assemblies from 44 individuals, plus the chm13 and grch38 reference genomes.

Extraction

The major histocompatibility complex (MHC) is a large locus in vertebrate DNA containing a set of closely linked polymorphic genes that code for cell surface proteins essential for the adaptive immune system. In humans, the MHC region occurs on chromosome 6. The human MHC is also called the HLA (human leukocyte antigen) complex (often just the HLA).

To see the coordinates of some HLA genes, execute:

head test/chr6.HLA_genes.bed -n 5

The coordinates are expressed with respect to the grch38 reference genome.

To extract the subgraph containing all the HLA genes annotated in the chr6.HLA_genes.bed file, execute:

odgi extract -i chr6.pan.og -o chr6.pan.MHC.og -b test/chr6.HLA_genes.bed -c 0 -E -t 8 -P

The instruction extracts:

the nodes belonging to the grch38#chr6 path ranges specified in the the chr6.HLA_genes.bed file via -b,
all nodes between the min and max positions touched by the given path ranges, also if they belong to other paths (-E),
the edges connecting all the extracted nodes, and
the paths traversing all the extracted nodes.

How many paths are present in the extracted subgraph? With 90 haplotypes (44 diploid samples plus 2 haploid reference genomes), how many paths would you expect in the subgraph if the MHC locus were solved with a single contig per haplotype?

To visualize the graph, execute:

odgi sort -i chr6.pan.MHC.og -o - -O | odgi viz -i - -o chr6.pan.MHC.png -s '#' -a 20

Why are we using odgi sort before visualizing the graph?

Are there haplotypes where the MHC locus is not resolved with a single contig? If so, which ones? Counts the number of contigs for each haplotype.

Generate the graph layout with odgi layout, asking to generate both the lay and the TSV files (remember to specify the number of threads). Visualize the layout with odgi draw and gfaestus. Specify -P to get information on the progress.

The MHC locus includes the complement component 4 (C4) region, which encodes proteins involved in the complement system. In humans, the C4 gene exists as 2 functionally distinct genes, C4A and C4B, which both vary in structure and copy number (Sekar et al., 2016). Moreover, C4A and C4B genes segregate in both long and short genomic forms, distinguished by the presence or absence of a human endogenous retroviral (HERV) sequence.

Find C4 coordinates:

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz
zgrep 'gene_id "C4A"\|gene_id "C4B"' hg38.ncbiRefSeq.gtf.gz |
  awk '$1 == "chr6"' | cut -f 1,4,5 |
  bedtools sort | bedtools merge -d 15000 | bedtools slop -l 10000 -r 20000 -g hg38.chrom.sizes |
  sed 's/chr6/grch38#chr6/g' > hg38.ncbiRefSeq.C4.coordinates.bed

Extract the C4 locus:

odgi extract -i chr6.pan.og -b hg38.ncbiRefSeq.C4.coordinates.bed -o - -c 0 -E -t 8 -P |
  odgi explode -i - --biggest 1 --sorting-criteria P --optimize -p chr6.pan.C4
odgi sort -i chr6.pan.C4.0.og -o chr6.pan.C4.sorted.og -p Ygs -x 100 -t 8 -P

What odgi explode is doing?

Regarding the odgi viz visualization, select the haplotypes to visualize

odgi paths -i chr6.pan.C4.sorted.og  -L | grep 'chr6\|HG00438\|HG0107\|HG01952' > chr6.selected_paths.txt

and visualize them

# odgi viz: default (binned) mode
odgi viz -i chr6.pan.C4.sorted.og -o chr6.pan.C4.sorted.png -c 12 -w 100 -y 50 -p chr6.selected_paths.txt

# odgi viz: color by strand
odgi viz -i chr6.pan.C4.sorted.og -o chr6.pan.C4.sorted.z.png -c 12 -w 100 -y 50 -p chr6.selected_paths.txt -z

# odgi viz: color by position
odgi viz -i chr6.pan.C4.sorted.og -o chr6.pan.C4.sorted.du.png -c 12 -w 100 -y 50 -p chr6.selected_paths.txt -du

# odgi viz: color by depth
odgi viz -i chr6.pan.C4.sorted.og -o chr6.pan.C4.sorted.m.png -c 12 -w 100 -y 50 -p chr6.selected_paths.txt -m -B Spectral:4

For the chr6.pan.C4.sorted.m.png image we used the Spectra color palette with 4 levels of node depths, so white indicates no depth, while grey, red, and yellow indicate depth 1, 2, and greater than or equal to 3, respectively. What information does this image provide us about the state of the C4 region in the selected haplotypes?

Visualize all haplotypes with odgi viz, coloring by depth. How many haplotypes have three copies of the C4 region? How many haplotypes are missing the HERV sequence? What is the copy number state of the chm13 and grch38 reference genomes?

Use odgi layout and odgi draw to compute and visualize the layout of the C4 locus. Visualize it in gfaestus too, displaying also the following annotation in the /home/participant/odgi/test/chr6.C4.bed file. The HERV sequence may be present or absent in the C4 regions across haplotypes: how does this reflect on the structure of the graph?

Untangling

To obtain another overview of a collapsed locus, we can apply odgi untangle to segment paths into linear segments by breaking these segments where the paths loop back on themselves. In this way, we can obtain information on the copy number status of the sequences in the locus.

To untangle the C4 graph, execute:

(echo query.name query.start query.end ref.name ref.start ref.end score inv self.cov n.th |
  tr ' ' '\t'; odgi untangle -i chr6.pan.C4.sorted.og -r $(odgi paths -i chr6.pan.C4.sorted.og -L | grep grch38) -t 8 -m 256 -P |
  bedtools sort -i - ) | awk '$8 == "-" { x=$6; $6=$5; $5=x; } { print }' |
  tr ' ' '\t'   > chr6.pan.C4.sorted.untangle.bed

Take a look at the chr6.pan.C4.sorted.untangle.bed file. For each segment in the query (query.name, query.start, and query.end columns), the best match on the reference is reported (ref.name, ref.start, and ref.end), with information about the quality of the match (score), the strand (inv), the copy number status (self.cov), and its rank over all possible matches (n.th).

Try to visualize the results with ggplot2 in R (hint: the intervals in the BED file can be displayed with geom_segment). Compare such a visualization with the visualization obtained with the odgi viz coloring by depth.

A pangenome graph represents the alignment of many genome sequences. By embedding gene annotations into the graph as paths, we “align” them with all other paths.

Injection

A pangenome graph represents the alignment of many genome sequences. By embedding gene annotations into the graph as paths, we “align” them with all other paths.

We start with gene annotations against the GRCh38 reference. Our annotations are against the full grch38#chr6, in test/chr6.C4.bed. Take a look at the first column in the annotation file

head test/chr6.C4.bed

However, the C4 locus graph chr6.c4.gfa is over the reference range, that is grch38#chr6:31972046-32055647. With odgi paths we can take a look at the names of the paths in the graph

odgi paths -i test/chr6.C4.gfa -L | grep grc

So, we must adjust the annotations to match the subgraph to ensure that path names and coordinates exactly correspond between the BED and GFA. We do so using odgi procbed, which cuts BED records to fit within a given subgraph:

odgi procbed -i test/chr6.C4.gfa -b test/chr6.C4.bed > chr6.C4.adj.bed

The coordinate space now matches that of the C4 subgraph. Now, we can inject these annotations into the graph:

odgi inject -i test/chr6.C4.gfa -b chr6.C4.adj.bed -o chr6.C4.genes.og

Use odgi viz to visualize the new subgraph with the injected paths.

We now use the gene names and the gggenes output format from odgi untangle to obtain a gene arrow map. We specify the injected paths as target paths:

odgi paths -i chr6.C4.genes.og -L | tail -4 > chr6.C4.gene.names.txt

odgi untangle -R chr6.C4.gene.names.txt -i chr6.C4.genes.og -j 0.5 -t 8 -g > chr6.C4.gene.gggenes.tsv

We use -j 0.5 to filter out low-quality matches.

We can then load this into R for plotting with gggenes:

require(ggplot2)
require(gggenes)
x <- read.delim('chr6.C4.gene.gggenes.tsv')
ggplot(x, aes(xmin=start, xmax=end, y=molecule, fill=gene, forward=strand)) + geom_gene_arrow()
ggsave('c4.gggenes.subset.png', height=1.5, width=15)

The plot will look a bit odd because some of the paths are in reverse complement orientation relative to the annotations. We can clean this up by using odgi flip, which flips paths around if they tend to be in the reverse complement orientation relative to the graph:

odgi flip -i chr6.C4.genes.og -o - -t 8 | odgi view -i - -g > chr6.C4.genes.flip.gfa

odgi untangle -R chr6.C4.gene.names.txt -i chr6.C4.genes.flip.gfa -j 0.5 -t 8 -g > chr6.C4.gene.gggenes.tsv

Back

Back to main page.