## Parse the mapped read-ends

__running time__: < 5 min

In this step of the pipeline we read each mapped read-end in the map files and extract the ones that are uniquely mapped.

For this step we use the `tadbit parse` tool together with the FASTA file that was used to generate the GEM index:

In [1]:
%%bash

tadbit parse -w ../results/PSC_rep1 --genome ../refGenome/mm39_chr3.fa

Writing log to ../results/PSC_rep1/process.log
Parsing chr3
saving genome in cache
Searching and mapping RE sites to the reference genome
Found 391730 RE sites
Loading read1
loading file: ../results/PSC_rep1/01_mapped_r1/PSC_HiC_rep1_subset_chr3_1_frag_1-end_e06c0841c2.map
loading file: ../results/PSC_rep1/01_mapped_r1/PSC_HiC_rep1_subset_chr3_1_full_1-end_e06c0841c2.map
Merge sort...........
Getting Multiple contacts
Loading read2
loading file: ../results/PSC_rep1/01_mapped_r2/PSC_HiC_rep1_subset_chr3_2_frag_1-end_dbbd3e8e31.map
loading file: ../results/PSC_rep1/01_mapped_r2/PSC_HiC_rep1_subset_chr3_2_full_1-end_dbbd3e8e31.map
Merge sort...........
Getting Multiple contacts
,-------.
| PATHs |
,----.-------.-------------------------------------------------------------------.--------------.
| Id | JOBid |                                                              Path |         Type |
|----+-------+-------------------------------------------------------------------+--------------|
| 

Writing versions of TADbit and dependencies
parsing genomic sequence
parsing reads in PSC_rep1 project


The output of the tool gives us some statistics for each read-end: the number of uniquely mapped reads, and the number of times the fragment-based mapping strategy split an initial sequenced read into parts that map to multiple regions of the genome.

In [2]:
%%bash

tadbit describe -w ../results/PSC_rep1 -t 5

,----------------.
| PARSED_OUTPUTs |
,----.--------.-----------------------.---------------.
| Id | PATHid | Total_uniquely_mapped |     Multiples |
|----+--------+-----------------------+---------------|
|  1 |     12 |            10,867,370 | 1:491280,2:19 |
|  2 |     14 |            10,869,408 | 1:494699,2:17 |
'----^--------^-----------------------^---------------'
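The `Multiples` column packs several counts into one string. As a minimal sketch, the cell can be unpacked into a dictionary for further inspection. The helper name `parse_multiples` is hypothetical and not part of TADbit, and reading the key as "number of additional mapped parts per read" is an assumption based on the description above.

```python
def parse_multiples(field):
    """Turn a 'Multiples' cell such as '1:491280,2:19' into {1: 491280, 2: 19}.

    Hypothetical helper (not a TADbit function). The key is assumed to be the
    number of additional parts found for a read, the value how many reads
    fall in that class.
    """
    counts = {}
    for item in field.split(","):
        key, value = item.split(":")
        counts[int(key.strip())] = int(value.strip())
    return counts


# Values copied from the first row of the PARSED_OUTPUTs table above.
multiples_r1 = parse_multiples("1:491280,2:19")
reads_with_multiples = sum(multiples_r1.values())
print(multiples_r1, reads_with_multiples)
```

Keeping the counts as integers makes it easy to compare the two read-ends or to follow how the classes change between replicates.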


The result of the command is stored in two separate tab-separated values (TSV) files, one per read-end, in the `02_parsed_reads` subfolder of the working directory. These files contain the essential information for each read-end:

In [3]:
%%bash

ls ../results/PSC_rep1/02_parsed_reads/

PSC_rep1_r1_ff973af7de.tsv
PSC_rep1_r2_ff973af7de.tsv


In [4]:
%%bash

head ../results/PSC_rep1/02_parsed_reads/PSC_rep1_r1_ff973af7de.tsv

# Chromosome lengths (order matters):
# CRM chr3	159745316
# Mapped	reads count by iteration
# MAPPED 1 1535674
# MAPPED 2 9331696
SRR5344969.sra.100	chr3	132766027	1	75	132765971	132766190
SRR5344969.sra.100000059	chr3	49132935	0	75	49132861	49133024
SRR5344969.sra.100000088	chr3	151540391	1	75	151539956	151540546
SRR5344969.sra.100000099	chr3	120434307	1	75	120433512	120434337
SRR5344969.sra.100000104	chr3	109140190	0	75	109139933	109140375
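The commented header lines carry the chromosome lengths and the mapped-read counts per iteration. The sketch below reads them from a list of lines. The function name `parse_header` is hypothetical, and the header layout is assumed to be exactly the one printed above. For this file the two `MAPPED` counts add up to the `Total_uniquely_mapped` value reported for the first read-end (10,867,370).

```python
def parse_header(lines):
    """Collect chromosome lengths and per-iteration mapped counts.

    Hypothetical helper: it assumes '# CRM <name> <length>' and
    '# MAPPED <iteration> <count>' lines, as shown in the output above,
    and stops at the first line that is not a comment.
    """
    chrom_lengths = {}
    mapped_by_iteration = {}
    for line in lines:
        if not line.startswith("#"):
            break  # header is over, read records start here
        fields = line[1:].split()
        if fields[:1] == ["CRM"]:
            chrom_lengths[fields[1]] = int(fields[2])
        elif fields[:1] == ["MAPPED"]:
            mapped_by_iteration[int(fields[1])] = int(fields[2])
    return chrom_lengths, mapped_by_iteration


# Header lines copied from the `head` output above.
header = [
    "# Chromosome lengths (order matters):",
    "# CRM chr3\t159745316",
    "# Mapped\treads count by iteration",
    "# MAPPED 1 1535674",
    "# MAPPED 2 9331696",
    "SRR5344969.sra.100\tchr3\t132766027\t1\t75\t132765971\t132766190",
]
chrom_lengths, mapped = parse_header(header)
total_mapped = sum(mapped.values())
print(chrom_lengths, mapped, total_mapped)
```

With a real file, the same function can be given the open file handle, since iterating over it yields one line at a time.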


In [5]:
%%bash

ls ../results/PSC_rep1/

01_mapped_r1
01_mapped_r2
02_parsed_reads
PSC_HiC_rep1_subset_chr3_1.fastq.gz_MboI_6389769ac5.png
PSC_HiC_rep1_subset_chr3_1.fastq.gz_MboI_e06c0841c2.png
PSC_HiC_rep1_subset_chr3_2.fastq.gz_MboI_dbbd3e8e31.png
TADbit_and_dependencies_versions.log
process.log
trace.db
trace.log


In [6]:
%%bash

tadbit describe -w ../results/PSC_rep1/

,-------.
| PATHs |
,----.-------.-------------------------------------------------------------------.--------------.
| Id | JOBid |                                                              Path |         Type |
|----+-------+-------------------------------------------------------------------+--------------|
|  1 |     1 |                          /home/3DAROC21/3DAROC21/results/PSC_rep1 |      WORKDIR |
|  2 |     1 |              ../../FASTQs/PSC/PSC_HiC_rep1_subset_chr3_1.fastq.gz | MAPPED_FASTQ |
|  3 |     1 |                                     ../../refGenome/mm39_chr3.gem |        INDEX |
|  4 |     1 |           PSC_HiC_rep1_subset_chr3_1.fastq.gz_MboI_6389769ac5.png |       FIGURE |
|  5 |     2 |           PSC_HiC_rep1_subset_chr3_1.fastq.gz_MboI_e06c0841c2.png |       FIGURE |
|  6 |     2 | 01_mapped_r1/PSC_HiC_rep1_subset_chr3_1_full_1-end_e06c0841c2.map |      SAM/MAP |
|  7 |     2 | 01_mapped_r1/PSC_HiC_rep1_subset_chr3_1_frag_1-end_e06c0841c2.map |      SAM/MAP |


Each tsv file contains:

1. Extended sequence identifier of the read. The extended identifier coincides with the identifier in the FASTQ file if the sequence maps to a single region of the genome. As a result of the fragment-based mapping strategy, a sequence in the FASTQ file may be split so that each part maps to a different region of the genome. In those cases the extended identifier is composed of the original identifier and a running number that distinguishes each split fragment.
2. Chromosome.
3. Start position of the mapped read.
4. Strand.
5. Length of the sequence.
6. Position of the left flanking restriction enzyme site.
7. Position of the right flanking restriction enzyme site.

  
This information will be used to filter the reads and, finally, to construct the interaction matrix.
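To make the seven columns concrete, here is a minimal sketch that turns one record line into a named tuple. The class `ReadEnd`, its field names and the function `parse_read_end` are hypothetical and chosen only for illustration; they are not TADbit objects. The sample line is the first record shown in the `head` output.

```python
from typing import NamedTuple


class ReadEnd(NamedTuple):
    """One record of a parsed-reads TSV file (field names are illustrative)."""
    read_id: str   # extended sequence identifier
    chrom: str     # chromosome
    pos: int       # start position of the mapped read
    strand: int    # strand flag as stored in the file
    length: int    # length of the sequence
    re_left: int   # left flanking restriction enzyme site
    re_right: int  # right flanking restriction enzyme site


def parse_read_end(line):
    """Split a tab-separated record and convert the numeric columns."""
    rid, chrom, pos, strand, length, left, right = line.rstrip("\n").split("\t")
    return ReadEnd(rid, chrom, int(pos), int(strand), int(length),
                   int(left), int(right))


sample = "SRR5344969.sra.100\tchr3\t132766027\t1\t75\t132765971\t132766190"
read = parse_read_end(sample)

# Distance between the two flanking restriction sites, i.e. the size of the
# restriction fragment that contains this read-end.
fragment_size = read.re_right - read.re_left
print(read, fragment_size)
```

In the sample record the start position lies between the two flanking restriction sites, which is the relationship between columns 3, 6 and 7 that later filtering steps can rely on.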

### Questions

- Does the parsing step of the Hi-C pipeline remove spurious reads caused by experimental errors?
- Could one re-obtain the initial .fastq file from the .tsv files obtained after parsing?
- Would you be comfortable describing the parsing step of the pipeline to a colleague?