{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Iterative vs fragment-based mapping"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__running time__: < 15 min\n",
"\n",
"Iterative mapping first proposed by [(Imakaev et al., 2012)](#cite-Imakaev2012a), allows to map usually a high number of reads. However other methodologies, less \"brute-force\" can be used to take into account the chimeric nature of the Hi-C reads.\n",
"\n",
"A simple alternative is to allow split mapping.\n",
"\n",
"Another way consists in _pre-truncating_ [(Ay and Noble, 2015)](#cite-Ay2015a) reads that contain a ligation site and map only the longest part of the read [(Wingett et al., 2015)](#cite-Wingett2015).\n",
"\n",
"Finally, an intermediate approach, _fragment-based_, consists in mapping full length reads first, and than splitting unmapped reads at the ligation sites [(Serra et al. 2017)](#cite-Serra2017)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![mapping strategies](images/mapping.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Advantages of iterative mapping"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- It's the only solution when no restriction enzyme has been used (i.e. micro-C)\n",
"- Can be faster when few windows (2 or 3) are used"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Advantages of fragment-based mapping "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Generally faster\n",
"- Safer: mapped reads are generally larger than 25-30 nt (the largest window used in iterative mapping). Less reads are mapped, but the difference is usually canceled or reversed when looking for \"valid-pairs\"."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mapping"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`tadbit map` tool can be used to perform either iterative or fragment-based mapping, or a combination of both.\n",
"\n",
"In the course we will use fragment-based mapping which is generally more efficient for a conventional Hi-C experiment. If we choose to use the iterative mapping strategy the `tadbit map` command will be exactly the same but adding the `--iterative` option: "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fragment-based mapping"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fragment-based mapping strategy works in 2 steps: \n",
"1. The read-ends are mapped entirely, assuming that no ligation occurred in them. \n",
"2. For unmapped read-ends, the function searches for a ligation site (e.g. in the case of MboI this would correspond to `GATCGATC` and in the case of HindIII to `AAGCTAGCTT`). The read-end is split accordingly replacing the ligation site by two RE sites:\n",
"\n",
"```\n",
" read-end-part-one---AAGCTAGCTT----read-end-part-two\n",
" will be split in:\n",
" read-end-part-one---AAGCTT\n",
" and:\n",
" AAGCTT----read-end-part-two\n",
"```\n",
"\n",
"_Note: __if no ligation site is found__, step two will be repeated using digested RE site as split point (`AAGCT` in the case of HindIII). This is done in order to be protected against sequencing errors. When this path is followed the digested RE site is removed, but not replaced. __If digested RE sites are not found either, the read will be classified as unmapped__._\n",
"\n",
"_Note: __both mapping strategies can be combined__, for example defining the windows as previously (iterative mapping), but also give a RE name_ `r_enz=MboI` _and setting_ `frag_map=True` _like this if a read has not been mapped in any window, TADbit will also try to apply the fragment-based strategy._"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Writing log to ../results/PSC_rep1/process.log\n",
"Preparing FASTQ file\n",
" - conversion to MAP format\n",
"Mapping reads in window 1-end_e06c0841c2...\n",
"TO GEM 3 /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r1_e06c0841c2/PSC_HiC_rep1_subset_chr3_1_c1j0cpu2\n",
"/home/3DAROC21/miniconda2/envs/tadbit/bin/gem-mapper -I /home/3DAROC21/3DAROC21/refGenome/mm39_chr3.gem -t 6 -F SAM -o /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r1_e06c0841c2/PSC_HiC_rep1_subset_chr3_1_c1j0cpu2_full_1-end_e06c0841c2.map -i /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r1_e06c0841c2/PSC_HiC_rep1_subset_chr3_1_c1j0cpu2\n",
"Parsing result...\n",
" x removing GEM input /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r1_e06c0841c2/PSC_HiC_rep1_subset_chr3_1_c1j0cpu2\n",
" x removing map /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r1_e06c0841c2/PSC_HiC_rep1_subset_chr3_1_c1j0cpu2_full_1-end_e06c0841c2.map\n",
" - splitting into restriction enzyme (RE) fragments using ligation sites\n",
" - ligation sites are replaced by RE sites to match the reference genome\n",
" * enzymes: MboI & MboI, ligation site: GATCGATC, RE site: GATC & GATC\n",
"Preparing MAP file\n",
" x removing pre-GEM input /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r1_e06c0841c2/PSC_HiC_rep1_subset_chr3_1_c1j0cpu2_filt_1-end_e06c0841c2.map\n",
"Mapping fragments of remaining reads...\n",
"TO GEM 3 /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r1_e06c0841c2/PSC_HiC_rep1_subset_chr3_1_hgvvsuag\n",
"/home/3DAROC21/miniconda2/envs/tadbit/bin/gem-mapper -I /home/3DAROC21/3DAROC21/refGenome/mm39_chr3.gem -t 6 -F SAM -o /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r1_e06c0841c2/PSC_HiC_rep1_subset_chr3_1_hgvvsuag_frag_1-end_e06c0841c2.map -i /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r1_e06c0841c2/PSC_HiC_rep1_subset_chr3_1_hgvvsuag\n",
"Parsing result...\n",
" x removing GEM input /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r1_e06c0841c2/PSC_HiC_rep1_subset_chr3_1_hgvvsuag\n",
" x removing failed to map /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r1_e06c0841c2/PSC_HiC_rep1_subset_chr3_1_c1j0cpu2_fail_e06c0841c2.map\n",
" x removing tmp mapped /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r1_e06c0841c2/PSC_HiC_rep1_subset_chr3_1_hgvvsuag_frag_1-end_e06c0841c2.map\n",
",---------------.\n",
"| MAPPED_INPUTs |\n",
",----.--------.------------.------.------.------.--------.---------------.-------------------.----------.-----------------.---------.\n",
"| Id | PATHid | Entries | Trim | Frag | Read | Enzyme | Dangling_Ends | Ligation_Sites | WRKDIRid | MAPPED_OUTPUTid | INDEXid |\n",
"|----+--------+------------+------+------+------+--------+---------------+-------------------+----------+-----------------+---------|\n",
"| 1 | 2 | 10,410,291 | None | full | 1 | MboI | MboI:1.340% | MboI-MboI:35.204% | 1 | 6 | 3 |\n",
"| 2 | 2 | 1,078,595 | None | frag | 1 | MboI | MboI:1.340% | MboI-MboI:35.204% | 1 | 7 | 3 |\n",
"'----^--------^------------^------^------^------^--------^---------------^-------------------^----------^-----------------^---------'\n",
",-------.\n",
"| PATHs |\n",
",----.-------.-------------------------------------------------------------------.--------------.\n",
"| Id | JOBid | Path | Type |\n",
"|----+-------+-------------------------------------------------------------------+--------------|\n",
"| 1 | 1 | /home/3DAROC21/3DAROC21/results/PSC_rep1 | WORKDIR |\n",
"| 2 | 1 | ../../FASTQs/PSC/PSC_HiC_rep1_subset_chr3_1.fastq.gz | MAPPED_FASTQ |\n",
"| 3 | 1 | ../../refGenome/mm39_chr3.gem | INDEX |\n",
"| 4 | 1 | PSC_HiC_rep1_subset_chr3_1.fastq.gz_MboI_6389769ac5.png | FIGURE |\n",
"| 5 | 2 | PSC_HiC_rep1_subset_chr3_1.fastq.gz_MboI_e06c0841c2.png | FIGURE |\n",
"| 6 | 2 | 01_mapped_r1/PSC_HiC_rep1_subset_chr3_1_full_1-end_e06c0841c2.map | SAM/MAP |\n",
"| 7 | 2 | 01_mapped_r1/PSC_HiC_rep1_subset_chr3_1_frag_1-end_e06c0841c2.map | SAM/MAP |\n",
"'----^-------^-------------------------------------------------------------------^--------------'\n",
",------.\n",
"| JOBs |\n",
",----.----------------------------------------------------------------------------------------------------------------------------------------------------------------------------.---------------------.---------------------.------.----------------.\n",
"| Id | Parameters | Launch_time | Finish_time | Type | Parameters_md5 |\n",
"|----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+---------------------+------+----------------|\n",
"| 1 | skip_mapping:1 read:1 chr_name:[] noX:0 fast_fragment:0 cpus:6 mapper:gem mapper_binary:/home/3DAROC21/miniconda2/envs/tadbit/bin/gem-mapper mapper_param:{} gem_version:3 | 04/10/2021 14:01:11 | 04/10/2021 14:01:15 | Map | 6389769ac5 |\n",
"| 2 | skip_mapping:0 read:1 chr_name:[] noX:0 fast_fragment:0 cpus:6 mapper:gem mapper_binary:/home/3DAROC21/miniconda2/envs/tadbit/bin/gem-mapper mapper_param:{} gem_version:3 | 04/10/2021 14:03:07 | 04/10/2021 14:07:50 | Map | e06c0841c2 |\n",
"'----^----------------------------------------------------------------------------------------------------------------------------------------------------------------------------^---------------------^---------------------^------^----------------'\n",
",---------------.\n",
"| MAPPED_INPUTs |\n",
",----.--------.------------.------.------.------.--------.---------------.-------------------.----------.-----------------.---------.\n",
"| Id | PATHid | Entries | Trim | Frag | Read | Enzyme | Dangling_Ends | Ligation_Sites | WRKDIRid | MAPPED_OUTPUTid | INDEXid |\n",
"|----+--------+------------+------+------+------+--------+---------------+-------------------+----------+-----------------+---------|\n",
"| 1 | 2 | 10,410,291 | None | full | 1 | MboI | MboI:1.340% | MboI-MboI:35.204% | 1 | 6 | 3 |\n",
"| 2 | 2 | 1,078,595 | None | frag | 1 | MboI | MboI:1.340% | MboI-MboI:35.204% | 1 | 7 | 3 |\n",
"'----^--------^------------^------^------^------^--------^---------------^-------------------^----------^-----------------^---------'\n",
",-------.\n",
"| PATHs |\n",
",----.-------.-------------------------------------------------------------------.--------------.\n",
"| Id | JOBid | Path | Type |\n",
"|----+-------+-------------------------------------------------------------------+--------------|\n",
"| 1 | 1 | /home/3DAROC21/3DAROC21/results/PSC_rep1 | WORKDIR |\n",
"| 2 | 1 | ../../FASTQs/PSC/PSC_HiC_rep1_subset_chr3_1.fastq.gz | MAPPED_FASTQ |\n",
"| 3 | 1 | ../../refGenome/mm39_chr3.gem | INDEX |\n",
"| 4 | 1 | PSC_HiC_rep1_subset_chr3_1.fastq.gz_MboI_6389769ac5.png | FIGURE |\n",
"| 5 | 2 | PSC_HiC_rep1_subset_chr3_1.fastq.gz_MboI_e06c0841c2.png | FIGURE |\n",
"| 6 | 2 | 01_mapped_r1/PSC_HiC_rep1_subset_chr3_1_full_1-end_e06c0841c2.map | SAM/MAP |\n",
"| 7 | 2 | 01_mapped_r1/PSC_HiC_rep1_subset_chr3_1_frag_1-end_e06c0841c2.map | SAM/MAP |\n",
"'----^-------^-------------------------------------------------------------------^--------------'\n",
",------.\n",
"| JOBs |\n",
",----.----------------------------------------------------------------------------------------------------------------------------------------------------------------------------.---------------------.---------------------.------.----------------.\n",
"| Id | Parameters | Launch_time | Finish_time | Type | Parameters_md5 |\n",
"|----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+---------------------+------+----------------|\n",
"| 1 | skip_mapping:1 read:1 chr_name:[] noX:0 fast_fragment:0 cpus:6 mapper:gem mapper_binary:/home/3DAROC21/miniconda2/envs/tadbit/bin/gem-mapper mapper_param:{} gem_version:3 | 04/10/2021 14:01:11 | 04/10/2021 14:01:15 | Map | 6389769ac5 |\n",
"| 2 | skip_mapping:0 read:1 chr_name:[] noX:0 fast_fragment:0 cpus:6 mapper:gem mapper_binary:/home/3DAROC21/miniconda2/envs/tadbit/bin/gem-mapper mapper_param:{} gem_version:3 | 04/10/2021 14:03:07 | 04/10/2021 14:07:50 | Map | e06c0841c2 |\n",
"'----^----------------------------------------------------------------------------------------------------------------------------------------------------------------------------^---------------------^---------------------^------^----------------'\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Writing versions of TADbit and dependencies\n",
"Generating Hi-C QC plot\n",
" - Dangling-ends (sensu-stricto): 1.340%\n",
" - Ligation sites: 35.204%\n",
"mapping ../FASTQs/PSC/PSC_HiC_rep1_subset_chr3_1.fastq.gz read 1 to ../results/PSC_rep1\n",
"/home/3DAROC21/miniconda2/envs/tadbit/lib/python3.7/site-packages/pytadbit/mapping/full_mapper.py:601: UserWarning: WARNING: only 145 Gb left on tmp_dir: /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r1_e06c0841c2\n",
"\n",
" warn('WARNING: only %d Gb left on tmp_dir: %s\\n' % (fspace, temp_dir))\n",
"cleaning temporary files\n"
]
}
],
"source": [
"%%bash \n",
"\n",
"tadbit map -w ../results/PSC_rep1 \\\n",
" --fastq ../FASTQs/PSC/PSC_HiC_rep1_subset_chr3_1.fastq.gz \\\n",
" --read 1 \\\n",
" --index ../refGenome/mm39_chr3.gem \\\n",
" --renz MboI \\\n",
" --cpus 12"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And for the other read-end we should change the `--fastq` file and the `--read` parameter to `2`"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Writing log to ../results/PSC_rep1/process.log\n",
"Preparing FASTQ file\n",
" - conversion to MAP format\n",
"Mapping reads in window 1-end_dbbd3e8e31...\n",
"TO GEM 3 /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r2_dbbd3e8e31/PSC_HiC_rep1_subset_chr3_2_2sbavg3r\n",
"/home/3DAROC21/miniconda2/envs/tadbit/bin/gem-mapper -I /home/3DAROC21/3DAROC21/refGenome/mm39_chr3.gem -t 6 -F SAM -o /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r2_dbbd3e8e31/PSC_HiC_rep1_subset_chr3_2_2sbavg3r_full_1-end_dbbd3e8e31.map -i /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r2_dbbd3e8e31/PSC_HiC_rep1_subset_chr3_2_2sbavg3r\n",
"Parsing result...\n",
" x removing GEM input /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r2_dbbd3e8e31/PSC_HiC_rep1_subset_chr3_2_2sbavg3r\n",
" x removing map /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r2_dbbd3e8e31/PSC_HiC_rep1_subset_chr3_2_2sbavg3r_full_1-end_dbbd3e8e31.map\n",
" - splitting into restriction enzyme (RE) fragments using ligation sites\n",
" - ligation sites are replaced by RE sites to match the reference genome\n",
" * enzymes: MboI & MboI, ligation site: GATCGATC, RE site: GATC & GATC\n",
"Preparing MAP file\n",
" x removing pre-GEM input /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r2_dbbd3e8e31/PSC_HiC_rep1_subset_chr3_2_2sbavg3r_filt_1-end_dbbd3e8e31.map\n",
"Mapping fragments of remaining reads...\n",
"TO GEM 3 /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r2_dbbd3e8e31/PSC_HiC_rep1_subset_chr3_2_tcn500zr\n",
"/home/3DAROC21/miniconda2/envs/tadbit/bin/gem-mapper -I /home/3DAROC21/3DAROC21/refGenome/mm39_chr3.gem -t 6 -F SAM -o /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r2_dbbd3e8e31/PSC_HiC_rep1_subset_chr3_2_tcn500zr_frag_1-end_dbbd3e8e31.map -i /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r2_dbbd3e8e31/PSC_HiC_rep1_subset_chr3_2_tcn500zr\n",
"Parsing result...\n",
" x removing GEM input /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r2_dbbd3e8e31/PSC_HiC_rep1_subset_chr3_2_tcn500zr\n",
" x removing failed to map /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r2_dbbd3e8e31/PSC_HiC_rep1_subset_chr3_2_2sbavg3r_fail_dbbd3e8e31.map\n",
" x removing tmp mapped /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r2_dbbd3e8e31/PSC_HiC_rep1_subset_chr3_2_tcn500zr_frag_1-end_dbbd3e8e31.map\n",
",---------------.\n",
"| MAPPED_INPUTs |\n",
",----.--------.------------.------.------.------.--------.---------------.-------------------.----------.-----------------.---------.\n",
"| Id | PATHid | Entries | Trim | Frag | Read | Enzyme | Dangling_Ends | Ligation_Sites | WRKDIRid | MAPPED_OUTPUTid | INDEXid |\n",
"|----+--------+------------+------+------+------+--------+---------------+-------------------+----------+-----------------+---------|\n",
"| 1 | 2 | 10,410,291 | None | full | 1 | MboI | MboI:1.340% | MboI-MboI:35.204% | 1 | 6 | 3 |\n",
"| 2 | 2 | 1,078,595 | None | frag | 1 | MboI | MboI:1.340% | MboI-MboI:35.204% | 1 | 7 | 3 |\n",
"| 3 | 8 | 10,410,291 | None | full | 2 | MboI | MboI:1.434% | MboI-MboI:35.818% | 1 | 10 | 3 |\n",
"| 4 | 8 | 1,087,028 | None | frag | 2 | MboI | MboI:1.434% | MboI-MboI:35.818% | 1 | 11 | 3 |\n",
"'----^--------^------------^------^------^------^--------^---------------^-------------------^----------^-----------------^---------'\n",
",-------.\n",
"| PATHs |\n",
",----.-------.-------------------------------------------------------------------.--------------.\n",
"| Id | JOBid | Path | Type |\n",
"|----+-------+-------------------------------------------------------------------+--------------|\n",
"| 1 | 1 | /home/3DAROC21/3DAROC21/results/PSC_rep1 | WORKDIR |\n",
"| 2 | 1 | ../../FASTQs/PSC/PSC_HiC_rep1_subset_chr3_1.fastq.gz | MAPPED_FASTQ |\n",
"| 3 | 1 | ../../refGenome/mm39_chr3.gem | INDEX |\n",
"| 4 | 1 | PSC_HiC_rep1_subset_chr3_1.fastq.gz_MboI_6389769ac5.png | FIGURE |\n",
"| 5 | 2 | PSC_HiC_rep1_subset_chr3_1.fastq.gz_MboI_e06c0841c2.png | FIGURE |\n",
"| 6 | 2 | 01_mapped_r1/PSC_HiC_rep1_subset_chr3_1_full_1-end_e06c0841c2.map | SAM/MAP |\n",
"| 7 | 2 | 01_mapped_r1/PSC_HiC_rep1_subset_chr3_1_frag_1-end_e06c0841c2.map | SAM/MAP |\n",
"| 8 | 3 | ../../FASTQs/PSC/PSC_HiC_rep1_subset_chr3_2.fastq.gz | MAPPED_FASTQ |\n",
"| 9 | 3 | PSC_HiC_rep1_subset_chr3_2.fastq.gz_MboI_dbbd3e8e31.png | FIGURE |\n",
"| 10 | 3 | 01_mapped_r2/PSC_HiC_rep1_subset_chr3_2_full_1-end_dbbd3e8e31.map | SAM/MAP |\n",
"| 11 | 3 | 01_mapped_r2/PSC_HiC_rep1_subset_chr3_2_frag_1-end_dbbd3e8e31.map | SAM/MAP |\n",
"'----^-------^-------------------------------------------------------------------^--------------'\n",
",------.\n",
"| JOBs |\n",
",----.----------------------------------------------------------------------------------------------------------------------------------------------------------------------------.---------------------.---------------------.------.----------------.\n",
"| Id | Parameters | Launch_time | Finish_time | Type | Parameters_md5 |\n",
"|----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+---------------------+------+----------------|\n",
"| 1 | skip_mapping:1 read:1 chr_name:[] noX:0 fast_fragment:0 cpus:6 mapper:gem mapper_binary:/home/3DAROC21/miniconda2/envs/tadbit/bin/gem-mapper mapper_param:{} gem_version:3 | 04/10/2021 14:01:11 | 04/10/2021 14:01:15 | Map | 6389769ac5 |\n",
"| 2 | skip_mapping:0 read:1 chr_name:[] noX:0 fast_fragment:0 cpus:6 mapper:gem mapper_binary:/home/3DAROC21/miniconda2/envs/tadbit/bin/gem-mapper mapper_param:{} gem_version:3 | 04/10/2021 14:03:07 | 04/10/2021 14:07:50 | Map | e06c0841c2 |\n",
"| 3 | skip_mapping:0 read:2 chr_name:[] noX:0 fast_fragment:0 cpus:6 mapper:gem mapper_binary:/home/3DAROC21/miniconda2/envs/tadbit/bin/gem-mapper mapper_param:{} gem_version:3 | 04/10/2021 14:07:52 | 04/10/2021 14:12:15 | Map | dbbd3e8e31 |\n",
"'----^----------------------------------------------------------------------------------------------------------------------------------------------------------------------------^---------------------^---------------------^------^----------------'\n",
",---------------.\n",
"| MAPPED_INPUTs |\n",
",----.--------.------------.------.------.------.--------.---------------.-------------------.----------.-----------------.---------.\n",
"| Id | PATHid | Entries | Trim | Frag | Read | Enzyme | Dangling_Ends | Ligation_Sites | WRKDIRid | MAPPED_OUTPUTid | INDEXid |\n",
"|----+--------+------------+------+------+------+--------+---------------+-------------------+----------+-----------------+---------|\n",
"| 1 | 2 | 10,410,291 | None | full | 1 | MboI | MboI:1.340% | MboI-MboI:35.204% | 1 | 6 | 3 |\n",
"| 2 | 2 | 1,078,595 | None | frag | 1 | MboI | MboI:1.340% | MboI-MboI:35.204% | 1 | 7 | 3 |\n",
"| 3 | 8 | 10,410,291 | None | full | 2 | MboI | MboI:1.434% | MboI-MboI:35.818% | 1 | 10 | 3 |\n",
"| 4 | 8 | 1,087,028 | None | frag | 2 | MboI | MboI:1.434% | MboI-MboI:35.818% | 1 | 11 | 3 |\n",
"'----^--------^------------^------^------^------^--------^---------------^-------------------^----------^-----------------^---------'\n",
",-------.\n",
"| PATHs |\n",
",----.-------.-------------------------------------------------------------------.--------------.\n",
"| Id | JOBid | Path | Type |\n",
"|----+-------+-------------------------------------------------------------------+--------------|\n",
"| 1 | 1 | /home/3DAROC21/3DAROC21/results/PSC_rep1 | WORKDIR |\n",
"| 2 | 1 | ../../FASTQs/PSC/PSC_HiC_rep1_subset_chr3_1.fastq.gz | MAPPED_FASTQ |\n",
"| 3 | 1 | ../../refGenome/mm39_chr3.gem | INDEX |\n",
"| 4 | 1 | PSC_HiC_rep1_subset_chr3_1.fastq.gz_MboI_6389769ac5.png | FIGURE |\n",
"| 5 | 2 | PSC_HiC_rep1_subset_chr3_1.fastq.gz_MboI_e06c0841c2.png | FIGURE |\n",
"| 6 | 2 | 01_mapped_r1/PSC_HiC_rep1_subset_chr3_1_full_1-end_e06c0841c2.map | SAM/MAP |\n",
"| 7 | 2 | 01_mapped_r1/PSC_HiC_rep1_subset_chr3_1_frag_1-end_e06c0841c2.map | SAM/MAP |\n",
"| 8 | 3 | ../../FASTQs/PSC/PSC_HiC_rep1_subset_chr3_2.fastq.gz | MAPPED_FASTQ |\n",
"| 9 | 3 | PSC_HiC_rep1_subset_chr3_2.fastq.gz_MboI_dbbd3e8e31.png | FIGURE |\n",
"| 10 | 3 | 01_mapped_r2/PSC_HiC_rep1_subset_chr3_2_full_1-end_dbbd3e8e31.map | SAM/MAP |\n",
"| 11 | 3 | 01_mapped_r2/PSC_HiC_rep1_subset_chr3_2_frag_1-end_dbbd3e8e31.map | SAM/MAP |\n",
"'----^-------^-------------------------------------------------------------------^--------------'\n",
",------.\n",
"| JOBs |\n",
",----.----------------------------------------------------------------------------------------------------------------------------------------------------------------------------.---------------------.---------------------.------.----------------.\n",
"| Id | Parameters | Launch_time | Finish_time | Type | Parameters_md5 |\n",
"|----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+---------------------+------+----------------|\n",
"| 1 | skip_mapping:1 read:1 chr_name:[] noX:0 fast_fragment:0 cpus:6 mapper:gem mapper_binary:/home/3DAROC21/miniconda2/envs/tadbit/bin/gem-mapper mapper_param:{} gem_version:3 | 04/10/2021 14:01:11 | 04/10/2021 14:01:15 | Map | 6389769ac5 |\n",
"| 2 | skip_mapping:0 read:1 chr_name:[] noX:0 fast_fragment:0 cpus:6 mapper:gem mapper_binary:/home/3DAROC21/miniconda2/envs/tadbit/bin/gem-mapper mapper_param:{} gem_version:3 | 04/10/2021 14:03:07 | 04/10/2021 14:07:50 | Map | e06c0841c2 |\n",
"| 3 | skip_mapping:0 read:2 chr_name:[] noX:0 fast_fragment:0 cpus:6 mapper:gem mapper_binary:/home/3DAROC21/miniconda2/envs/tadbit/bin/gem-mapper mapper_param:{} gem_version:3 | 04/10/2021 14:07:52 | 04/10/2021 14:12:15 | Map | dbbd3e8e31 |\n",
"'----^----------------------------------------------------------------------------------------------------------------------------------------------------------------------------^---------------------^---------------------^------^----------------'\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Writing versions of TADbit and dependencies\n",
"Generating Hi-C QC plot\n",
" - Dangling-ends (sensu-stricto): 1.434%\n",
" - Ligation sites: 35.818%\n",
"mapping ../FASTQs/PSC/PSC_HiC_rep1_subset_chr3_2.fastq.gz read 2 to ../results/PSC_rep1\n",
"/home/3DAROC21/miniconda2/envs/tadbit/lib/python3.7/site-packages/pytadbit/mapping/full_mapper.py:601: UserWarning: WARNING: only 143 Gb left on tmp_dir: /home/3DAROC21/3DAROC21/results/PSC_rep1_tmp_r2_dbbd3e8e31\n",
"\n",
" warn('WARNING: only %d Gb left on tmp_dir: %s\\n' % (fspace, temp_dir))\n",
"cleaning temporary files\n"
]
}
],
"source": [
"%%bash \n",
"\n",
"tadbit map -w ../results/PSC_rep1 \\\n",
" --fastq ../FASTQs/PSC/PSC_HiC_rep1_subset_chr3_2.fastq.gz \\\n",
" --read 2 \\\n",
" --index ../refGenome/mm39_chr3.gem \\\n",
" --renz MboI \\\n",
" --cpus 12"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Map files logs and format"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TADbit tools uses a SQLite database file where all the commands executed by the user as well as all relevant statistics of the analysis are stored. An output of the database is shown at the end of each command.\n",
"\n",
"In this case we can see useful statistics of the mapping step, for example how many reads are mapped at full length, after splitting and the percentages of dangling-ends and ligation sites:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
",---------------.\n",
"| MAPPED_INPUTs |\n",
",----.--------.------------.------.------.------.--------.---------------.-------------------.----------.-----------------.---------.\n",
"| Id | PATHid | Entries | Trim | Frag | Read | Enzyme | Dangling_Ends | Ligation_Sites | WRKDIRid | MAPPED_OUTPUTid | INDEXid |\n",
"|----+--------+------------+------+------+------+--------+---------------+-------------------+----------+-----------------+---------|\n",
"| 1 | 2 | 10,410,291 | None | full | 1 | MboI | MboI:1.340% | MboI-MboI:35.204% | 1 | 6 | 3 |\n",
"| 2 | 2 | 1,078,595 | None | frag | 1 | MboI | MboI:1.340% | MboI-MboI:35.204% | 1 | 7 | 3 |\n",
"| 3 | 8 | 10,410,291 | None | full | 2 | MboI | MboI:1.434% | MboI-MboI:35.818% | 1 | 10 | 3 |\n",
"| 4 | 8 | 1,087,028 | None | frag | 2 | MboI | MboI:1.434% | MboI-MboI:35.818% | 1 | 11 | 3 |\n",
"'----^--------^------------^------^------^------^--------^---------------^-------------------^----------^-----------------^---------'\n"
]
}
],
"source": [
"%%bash\n",
"\n",
"tadbit describe -w ../results/PSC_rep1 -t 4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can list the generated map files in the working directory in the subfolder `01_mapped_r1` and `01_mapped_r2` and inspect one of the files to see its structure:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"PSC_HiC_rep1_subset_chr3_1_frag_1-end_e06c0841c2.map\n",
"PSC_HiC_rep1_subset_chr3_1_full_1-end_e06c0841c2.map\n",
"PSC_HiC_rep1_subset_chr3_2_frag_1-end_dbbd3e8e31.map\n",
"PSC_HiC_rep1_subset_chr3_2_full_1-end_dbbd3e8e31.map\n"
]
}
],
"source": [
"%%bash\n",
"\n",
"ls ../results/PSC_rep1/01_mapped_r1/\n",
"ls ../results/PSC_rep1/01_mapped_r2/"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SRR5344969.sra.180~2~\tGATCATATCTACAGTCATTTTTTAAAATA\tHHHHGGGGGGGGGGGGGGGGGGGGCCCCC\t1\tchr3:+:5690818:29\n",
"SRR5344969.sra.578\tGATCCACAAAGACAAATTTGGTATGT\tHHHHGGGGGGGGGGGGGGGGGCCCCC\t1\tchr3:-:29696661:26\n"
]
}
],
"source": [
"%%bash\n",
"\n",
"head ../results/PSC_rep1/01_mapped_r1/PSC_HiC_rep1_subset_chr3_1_frag_1-end_e06c0841c2.map -n 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A map file is a tsv file storing information of a sequence mapped to a reference genome. It is composed by the following columns:\n",
"\n",
"1. Sequence identifier of the read in the FASTQ.\n",
"2. The sequence.\n",
"3. The base call quality scores, Phred +33 encoded using ASCII characters.\n",
"4. Reserved for mapper information.\n",
"5. The information of the region where the read was mapped as chromosome:strand:position:length\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Questions\n",
"\n",
"- True or False? In case it is false try to correct the\tsentence:\n",
" 1. The iterative mapping works only if a restriction enzyme has been used in the experiment.\n",
" 2. Iterative and fragment-based mapping are incompatible between each other. \n",
" 3. If a read has been mapped, the .map files contains all the information in a .fastq file for this read.\n",
" 4. If a sequence is not mapped, it doesn't appear in a .map file. \n",
"- Would you be confortable to explain to your collegues the iterative mapping strategy? What about the fragment-based mapping?\n",
"- Would you be confortable to explain the difference between a read and restriction-enzyme fragment? Are they the same thing?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### References\n",
"\n",
"[^ref1](#ref-1) Imakaev, Maxim V and Fudenberg, Geoffrey and McCord, Rachel Patton and Naumova, Natalia and Goloborodko, Anton and Lajoie, Bryan R and Dekker, Job and Mirny, Leonid A. 2012. _Iterative correction of Hi-C data reveals hallmarks of chromosome organization._. [URL](http://www.ncbi.nlm.nih.gov/pubmed/22941365)\n",
"\n",
"[^ref2](#ref-2) Ay, Ferhat and Noble, William Stafford. 2015. _Analysis methods for studying the 3D architecture of the genome_. [URL](http://genomebiology.com/2015/16/1/183)\n",
"\n",
"[^ref3](#ref-3) Wingett, Steven and Ewels, Philip and Furlan-Magaril, Mayra and Nagano, Takashi and Schoenfelder, Stefan and Fraser, Peter and Andrews, Simon. 2015. _HiCUP: pipeline for mapping and processing Hi-C data._. [URL](http://f1000research.com/articles/4-1310/v1)\n",
"\n",
"[^ref4](#ref-4) François Serra, Davide Baù, Mike Goodstadt, David Castillo, Guillaume Filion and Marc A. Marti-Renom. 2017. _Automatic analysis and 3D-modelling of Hi-C data using TADbit reveals structural features of the fly chromatin colors_. [URL](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005665)\n",
"\n",
"[^ref5](#ref-5) Marco-Sola, Santiago and Sammeth, Michael and Guigó, Roderic and Ribeca, Paolo. 2012. _The GEM mapper: fast, accurate and versatile alignment by filtration_.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"hide_input": false,
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": "block",
"toc_window_display": true
}
},
"nbformat": 4,
"nbformat_minor": 1
}