Day 2a - Homo sapiens pangenome graphs
Learning objectives
In this exercise you learn how to
- collect and preprocess de novo human assemblies,
- partition the assembly contigs by chromosome,
- built chromosome-specific pangenome graphs.
Getting started
Make sure you have pggb
, its tools, and gfaestus
installed.
Now create a directory to work on for this tutorial:
cd ~
mkdir day2_hsapiens
cd day2_hsapiens
Get the URLs of the assemblies:
mkdir -p ~/day2_hsapiens/assemblies
cd ~/day2_hsapiens/assemblies
wget https://raw.githubusercontent.com/human-pangenomics/HPP_Year1_Assemblies/main/assembly_index/Year1_assemblies_v2_genbank.index
grep 'chm13\|h38' Year1_assemblies_v2_genbank.index | awk '{ print $2 }' | sed 's%s3://human-pangenomics/working/%https://s3-us-west-2.amazonaws.com/human-pangenomics/working/%g' > refs.urls
grep 'chm13\|h38' -v Year1_assemblies_v2_genbank.index | awk '{ print $2; print $3 }' | sed 's%s3://human-pangenomics/working/%https://s3-us-west-2.amazonaws.com/human-pangenomics/working/%g' > samples.urls
Download the 2 haplotypes of the HG01978
diploid sample and the references:
cat refs.urls <(grep 'HG01978' samples.urls | head -n 6) | parallel -j 4 'wget -q {} && echo got {}'
Unpack the assemblies:
ls *.f1_assembly_v2_genbank.fa.gz | while read f; do echo $f; gunzip $f && samtools faidx $(basename $f .gz); done
Pangenome Sequence Naming
We follow the PanSN-spec naming to simplify the identification of samples and haplotypes in pangenomes. We do that by adding a prefix to the reference sequences:
fastix -p 'grch38#1#' <(zcat GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz) | bgzip -c -@ 8 > grch38_full.fa.gz && samtools faidx grch38_full.fa.gz
fastix -p 'chm13#1#' <(zcat chm13.draft_v1.1.fasta.gz) | bgzip -c -@ 8 > chm13.fa.gz && samtools faidx chm13.fa.gz
# Remove unplaced contigs from grch38 that are (hopefully) represented in chm13
samtools faidx grch38_full.fa.gz $(cat grch38_full.fa.gz.fai | cut -f 1 | grep -v _ ) | bgzip -@ 8 -c > grch38.fa.gz
# Cleaning
rm chm13.draft_v1.1.fasta.gz GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz grch38_full.fa.gz.*
Take a look at how sequence names are changed in the FASTA files.
Sequence partitioning
Put the two references together:
zcat chm13.fa.gz grch38.fa.gz | bgzip -c -@ 8 > chm13+grch38.fa.gz && samtools faidx chm13+grch38.fa.gz
Why are we using two reference genomes? Does the old grch38
reference have chromosomes that the chm13
reference does not have? If so, which ones?
We partition contigs by chromosome by mapping each assembly against the scaffolded references:
PATH_REFERENCE_FA_GZ=~/day2_hsapiens/assemblies/chm13+grch38.fa.gz
mkdir -p ~/day2_hsapiens/assemblies/partitioning
ls *.f1_assembly_v2_genbank.fa | while read FASTA; do
NAME=$(basename $FASTA .fa);
echo $NAME
PATH_PAF=~/day2_hsapiens/assemblies/partitioning/$NAME.vs.ref.paf
wfmash $PATH_REFERENCE_FA_GZ ~/day2_hsapiens/assemblies/$FASTA -s 50k -p 90 -N -m -t 8 > $PATH_PAF
done
Run wfmash
without parameters to get information on the meaning of each parameter. What does -N
mean?
For each haplotype (for each FASTA file), count how many contigs were partitioned in each reference chromosome. Which of the two references (grch38
and chm13
) do more contigs map to? If there is a clear winner, why?
What is the chromosome against which more contigs map? Make a plot displaying the number of contigs (on the y-axis) with respect to the chromosome (on the x-axis) to which they map. Consider grch38
and chm13
in the same way: for example, merge the number of contigs mapping against grch38#1#chr1
and chm13#1#chr1
in a single count for chr1
.
Collect unmapped contigs and remap them in split mode:
ls *.f1_assembly_v2_genbank.fa | while read FASTA; do
NAME=$(basename $FASTA .fa);
echo $NAME
PATH_PAF=~/day2_hsapiens/assemblies/partitioning/$NAME.vs.ref.paf
comm -23 <(cut -f 1 $FASTA.fai | sort) <(cut -f 1 $PATH_PAF | sort) > $NAME.unaligned.txt
wc -l $NAME.unaligned.txt
if [[ $(wc -l $NAME.unaligned.txt | cut -f 1 -d\ ) != 0 ]];
then
samtools faidx $FASTA $(tr '\n' ' ' < $NAME.unaligned.txt) > $NAME.unaligned.fa
samtools faidx $NAME.unaligned.fa
PATH_NO_SPLIT_PAF=~/day2_hsapiens/assemblies/partitioning/$NAME.vs.ref.no_split.paf
wfmash $PATH_REFERENCE_FA_GZ ~/day2_hsapiens/assemblies/$NAME.unaligned.fa -s 50k -p 90 -m -t 8 > $PATH_NO_SPLIT_PAF
fi
done
Collect our best mapping for each of our attempted split rescues:
cd ~/day2_hsapiens/assemblies/partitioning
ls *.vs.ref.no_split.paf | while read PAF; do
cat $PAF | awk -v OFS='\t' '{ print $1,$11,$0 }' | sort -n -r -k 1,2 | \
awk -v OFS='\t' '$1 != last { print($3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15); last = $1; }'
done > rescues.paf
Subset by chromosome, including the references:
cd ~/day2_hsapiens/assemblies
DIR_PARTITIONING=~/day2_hsapiens/assemblies/partitioning
mkdir -p parts
( seq 22; echo X; echo Y; echo M ) | while read i; do
awk '$6 ~ "chr'$i'$"' $(ls $DIR_PARTITIONING/*.vs.ref.paf | \
grep -v unaligned | sort; echo $DIR_PARTITIONING/rescues.paf) | cut -f 1 | sort | uniq \
> parts/chr$i.contigs;
done
( seq 22; echo X; echo Y; echo M ) | while read i; do
echo chr$i
samtools faidx chm13+grch38.fa.gz chm13#1#chr$i grch38#1#chr$i > parts/chr$i.pan.fa
ls *.f1_assembly_v2_genbank.fa | while read FASTA; do
NAME=$(basename $FASTA .fa);
echo chr$i $NAME
samtools faidx $FASTA $( comm -12 <(cut -f 1 $FASTA.fai | sort) <(sort parts/chr$i.contigs) ) >> parts/chr$i.pan.fa
done
bgzip -@ 8 parts/chr$i.pan.fa && samtools faidx parts/chr$i.pan.fa.gz
done
Pangenome graph building
Build the pangenome graph for chromosome 20. Since human presents a low sequence divergence, we will set the mapping identity (-p
parameter) in pggb
to 98
.
mkdir -p ~/day2_hsapiens/graphs
pggb -i ~/day2_hsapiens/assemblies/parts/chr20.pan.fa.gz -o ~/day2_hsapiens/graphs/chr20.pan -t 8 -p 98 -s 5000 -n 4 -k 311 -G 4000 -O 0.03 -T 8
Run pggb
without parameters to get information on the meaning of each parameter:
pggb
Why did we specify -n 4
?
The -k
parameter is used to filter exact matches shorter than 311 bps. Graph induction with seqwish
often works better when we filter very short matches out of the input alignments. In practice, these often occur in regions of low alignment quality, which are typical of areas with large indels and structural variations in the wfmash
alignments. This underalignment is then resolved in the final smoothxg
step. Removing short matches can simplify the graph and remove spurious relationships caused by short repeated homologies.
The -O
parameter specifies to expand a little bit the partial order alignment (POA) blocks at both sides. This help in better resolving POA blocks boundaries and avoiding wrong alignments due to not optimal block selection.
Try this at home: build the pangenome graph for all chromosomes.
Use odgi stats
to obtain the graph length, and the number of nodes, edges, and paths. Do you think the resulting pangenome graph represents the input sequences well? Check the length and the number of the input sequences to answer this question.
Take a look at the PNG files in the chr20.pan
folder. Is the layout of the graph roughly linear? Are there unaligned regions?
Generate another odgi viz
visualization with
cd ~/day2_hsapiens/graphs/chr20.pan
odgi paths -i chr20.pan.fa.gz.d8a8cb6.04f1c29.5486fb6.smooth.final.og -L | cut -f 1,2 -d '#' | uniq > prefixes.txt
odgi viz -i chr20.pan.fa.gz.d8a8cb6.04f1c29.5486fb6.smooth.final.og -o chr20.pan.fa.gz.d8a8cb6.04f1c29.5486fb6.smooth.final.og.viz_multiqc.2.png -x 1500 -y 500 -a 10 -I Consensus_ -M prefixes.txt
What do you think is different between the chr20.pan.fa.gz.d8a8cb6.04f1c29.5486fb6.smooth.final.og.viz_multiqc.png
image and the newly generated image (chr20.pan.fa.gz.d8a8cb6.04f1c29.5486fb6.smooth.final.og.viz_multiqc.2.png
)?
Go to the folder where the gfaestus
repository is checked out and open the chart layout with it:
cd ~/gfaestus
./target/release/gfaestus ~/day2_hsapiens/graphs/chr20.pan/chr20.pan.fa.gz.*.smooth.final.gfa ~/day2_hsapiens/graphs/chr20.pan.fa.gz.*.smooth.final.og.lay.tsv
Download the coordinates of the p and q amrs on chm13
in BED format:
wget https://raw.githubusercontent.com/pangenome/pggb-paper/main/data/chm13.pq_arms.bed
and load the file in gfaestus
: Tools
-> BED Label Wizard
-> Go to the folder where the BED file is -> Double click on the file
(do not click Accept
, it is buggy).
Produce full gene annotations and visualize them:
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.annotation.gff3.gz
zcat gencode.v40.annotation.gff3.gz | grep ^chr20 | awk '$3 == "gene"' | cut -f 1,4,5,9 | tr ';' '\t' | grep gene_name= | cut -f 1-3,7 | sed s/gene_name=// | sed s/^chr20/grch38#1#chr20/ > grch38#1#chr20.gene.bed
Back
Back to main page.