Day 4a - Saccharomyces cerevisiae pangenome graphs
Learning objectives
In this exercise you learn how to
- partition sequences.
Getting started
Make sure you have pggb and its tools installed. In particular, clone the pggb repository into your home directory:
cd ~
git clone https://github.com/pangenome/pggb.git
Next, go to the directory created in the previous activity on Saccharomyces cerevisiae:
cd ~/day3_yeast
Sequence partitioning
We can’t really expect to map all sequences against each other and obtain well-separated connected components. More likely we get one giant connected component, and probably a few smaller ones, because of incorrect mappings or false homologies. This unnecessarily increases the computational burden and complicates the downstream analyses. It is therefore recommended to split the input sequences into communities that reflect the latent structure of their mutual relationships.
We need to obtain the mutual relationship between the input assemblies in order to detect the underlying communities. To compute the pairwise mappings with wfmash, execute:
cd assemblies
wfmash scerevisiae8.fasta.gz -p 90 -n 7 -t 8 -m > scerevisiae8.mapping.paf
Why did we specify -n 7?
To project the PAF mappings into a network format (an edge list), execute:
python3 ~/pggb/scripts/paf2net.py -p scerevisiae8.mapping.paf
The paf2net.py script creates 3 files:
- scerevisiae8.mapping.paf.edges.list.txt is the edge list representing the pairs of sequences mapped in the PAF;
- scerevisiae8.mapping.paf.edges.weights.txt is a list of edge weights (long and high estimated identity mappings have greater weight);
- scerevisiae8.mapping.paf.vertices.id2name.txt is the ‘id to sequence name’ map.
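To make the idea of projecting mappings into a network more concrete, here is a minimal Python sketch. It is not the code of paf2net.py: the function name `paf_to_edges` is hypothetical, and the edge weight used here (the sum of column 10 of the PAF, the number of matching bases) is an assumption chosen only because it grows with both mapping length and identity. The example sequence names are invented.

```python
from collections import defaultdict


def paf_to_edges(paf_lines):
    """Collapse PAF records into weighted, undirected edges between sequences.

    Returns (name_to_id, weights), where weights maps a pair of vertex ids
    to the summed number of matching bases (an assumed, illustrative weight).
    """
    name_to_id = {}
    weights = defaultdict(int)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip empty or malformed lines
        query, target = fields[0], fields[5]  # PAF columns 1 and 6
        if query == target:
            continue  # self-mappings carry no information for the network
        matches = int(fields[9])  # PAF column 10: number of matching bases
        for name in (query, target):
            # Assign consecutive integer ids in order of first appearance.
            name_to_id.setdefault(name, len(name_to_id))
        edge = tuple(sorted((name_to_id[query], name_to_id[target])))
        weights[edge] += matches
    return name_to_id, dict(weights)


# Two invented mappings between the same pair of sequences.
example = [
    "A#1#chrI\t1000\t0\t1000\t+\tB#1#chrI\t1000\t0\t1000\t950\t1000\t60",
    "B#1#chrI\t1000\t0\t500\t+\tA#1#chrI\t1000\t0\t500\t450\t500\t60",
]
names, edges = paf_to_edges(example)
print(names)
print(edges)
```

Both mappings end up on the same undirected edge, so the network has one vertex per sequence and one weighted edge per mapped pair, regardless of how many PAF lines connect them.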
To identify the communities, execute:
python3 ~/pggb/scripts/net2communities.py \
-e scerevisiae8.mapping.paf.edges.list.txt \
-w scerevisiae8.mapping.paf.edges.weights.txt \
-n scerevisiae8.mapping.paf.vertices.id2name.txt
How many communities were detected?
The net2communities.py script creates a set of *.community.*.txt files, one for each of the communities detected. Each txt file lists the sequences that belong to the same community.
Are there communities that contain multiple chromosomes? Which ones?
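If you prefer to check this programmatically rather than by eye, the following Python sketch may help. It assumes that sequence names follow a `sample#haplotype#chromosome` pattern, so that the chromosome is the last `#`-separated field; if your names differ, adapt `chromosomes_in_community`. All function names are hypothetical helpers, not part of pggb.

```python
import glob


def chromosomes_in_community(sequence_names):
    """Return the set of chromosome labels found in a list of sequence names.

    Assumes names look like 'sample#haplotype#chromosome'.
    """
    return {
        name.strip().split("#")[-1]
        for name in sequence_names
        if name.strip()  # ignore blank lines
    }


def multi_chromosome_communities(communities):
    """Keep only the communities that contain more than one chromosome.

    communities maps a label (for example a file name) to its sequence names.
    Returns label -> sorted list of chromosomes.
    """
    result = {}
    for label, names in communities.items():
        chroms = chromosomes_in_community(names)
        if len(chroms) > 1:
            result[label] = sorted(chroms)
    return result


def load_communities(pattern="*.community.*.txt"):
    """Read every community file matching the pattern into a dictionary."""
    communities = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            communities[path] = handle.readlines()
    return communities
```

From the assemblies directory, `multi_chromosome_communities(load_communities())` would return the community files that mix chromosomes, together with the chromosomes they contain.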
Identify the communities again, but this time add the --plot option to visualize them too:
python3 ~/pggb/scripts/net2communities.py \
-e scerevisiae8.mapping.paf.edges.list.txt \
-w scerevisiae8.mapping.paf.edges.weights.txt \
-n scerevisiae8.mapping.paf.vertices.id2name.txt \
--plot
Take a look at the scerevisiae8.mapping.paf.edges.list.txt.communities.pdf file.
Write a small script that takes the *.community.*.txt files as input and creates the corresponding FASTA files, ready to be used as input to pggb. Run pggb on the communities with multiple chromosomes and compare the results (layout and variants) with those from the previous activities.
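As one possible starting point for the script, here is a Python sketch that uses only the standard library. It assumes each community file holds one sequence name per line and that these names match the first word of the FASTA headers. The function names and the output naming scheme (`<community file>.fa`) are hypothetical choices, not something pggb prescribes.

```python
import gzip


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA text lines."""
    name, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_by_community(records, communities):
    """Group FASTA records by community.

    communities maps a label to the sequence names it contains.
    Sequences that belong to no community are dropped.
    """
    label_of = {n: label for label, names in communities.items() for n in names}
    grouped = {label: [] for label in communities}
    for name, seq in records:
        label = label_of.get(name)
        if label is not None:
            grouped[label].append((name, seq))
    return grouped


def write_fasta(records, path, width=80):
    """Write (name, sequence) pairs to a FASTA file, wrapping sequence lines."""
    with open(path, "w") as out:
        for name, seq in records:
            out.write(f">{name}\n")
            for start in range(0, len(seq), width):
                out.write(seq[start:start + width] + "\n")


def run(fasta_gz, community_paths):
    """Create one '<community file>.fa' per community file (assumed naming)."""
    communities = {}
    for path in community_paths:
        with open(path) as handle:
            communities[path] = [l.strip() for l in handle if l.strip()]
    with gzip.open(fasta_gz, "rt") as handle:
        grouped = split_by_community(read_fasta(handle), communities)
    for path, records in grouped.items():
        write_fasta(records, path + ".fa")
```

For example, `run("scerevisiae8.fasta.gz", glob.glob("*.community.*.txt"))` (after `import glob`) would write one FASTA file next to each community file. pggb needs an indexed FASTA, so you will probably want to compress each output with bgzip and index it with samtools faidx before running pggb.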