## *Mus musculus* reference genome sequence

__running time__: < 2 min

We will map the Hi-C reads to the GRCm39/mm39 reference genome assembly (https://genome.ucsc.edu/goldenPath/releaseLog.html#mm39), which can be downloaded from https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/

Because we subsampled our FASTQ files to contain only reads from chromosome 3 (chr3), we can also use a reduced reference genome containing only chr3. To extract chr3 we used `samtools faidx`.
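
The extraction step itself is simple enough to sketch in plain Python, which may help to see what it does. This is not the command used to build the course files (those were made with `samtools faidx`, where the call would be along the lines of `samtools faidx mm39.fa chr3 > mm39_chr3.fa`, with `mm39.fa` an assumed name for the uncompressed download). The function name `extract_sequence` is hypothetical, and the sketch streams the file line by line instead of using a `.fai` index:

```python
def extract_sequence(fasta_in, name, fasta_out):
    """Copy the record called `name` from a multi-FASTA file into a new file.

    Returns the number of bases written (0 if the record was not found).
    """
    n_bases = 0
    keep = False
    with open(fasta_in) as src, open(fasta_out, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # The record name is the first word after '>', so that
                # "chr3" does not also match "chr30" or "chr3_random".
                fields = line[1:].split()
                keep = bool(fields) and fields[0] == name
                if keep:
                    dst.write(line)
            elif keep:
                # Sequence line belonging to the requested record
                dst.write(line)
                n_bases += len(line.strip())
    return n_bases
```

Under these assumptions, `extract_sequence("mm39.fa", "chr3", "../refGenome/mm39_chr3.fa")` would write a chr3-only FASTA file and return its length in bases.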

The resulting FASTA file can be found in the _refGenome_ folder:

In [1]:
%%bash

ls ../refGenome

genmap
mm39_chr3.fa
mm39_chr3.fa_genome.TADbit
mm39_chr3.gem
mm39_chr3.info


## Creation of an index file for GEM mapper

Indexing a genome is very similar to indexing a book. When you read a book and want to know on which page a chapter begins, it is much faster to look it up in a pre-built index than to go through every page until you find it. The same goes for alignments: if you want to find the location of a specific sequence, it is much more efficient to look it up in an index than to scan the entire genome each time. Indices allow the aligner to narrow down the potential origin of a query sequence within the genome, saving both time and memory. To learn more about the idea of indexing, you may have a look at the YouTube channel of Ben Langmead, creator of the bowtie and bowtie2 alignment software (https://www.youtube.com/user/BenLangmead).
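
The book analogy can be made concrete with a toy example. The sketch below is not how GEM works internally (the `gem-indexer` log further down shows that it builds a BWT-based index); it uses a much simpler hash table of k-mers, and all function names are hypothetical. It only illustrates the principle: build the index once, then use it to jump straight to candidate positions instead of scanning the whole sequence for every query.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer of `genome` to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_with_index(index, genome, query, k):
    """Look up the first k-mer of `query`, then verify each candidate position."""
    if len(query) < k:
        raise ValueError("query must be at least k bases long")
    hits = []
    for pos in index.get(query[:k], []):
        # Only the few candidate positions from the index are compared in full
        if genome.startswith(query, pos):
            hits.append(pos)
    return hits


def find_by_scanning(genome, query):
    """Reference implementation: scan the whole genome for every query."""
    hits = []
    pos = genome.find(query)
    while pos != -1:
        hits.append(pos)
        pos = genome.find(query, pos + 1)
    return hits


genome = "ACGTACGTTAGCACGT"
index = build_kmer_index(genome, k=4)            # built once, reused for every query
print(find_with_index(index, genome, "ACGT", 4))    # [0, 4, 12]
print(find_with_index(index, genome, "CGTTAG", 4))  # [5]
```

Both search functions return the same positions; the difference is that the indexed lookup touches only the candidate positions stored for the seed k-mer, which is what makes mapping millions of reads feasible.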

TADbit supports three different mappers: [GEM](#cite-gem), [bowtie2](#cite-bowtie2), and [hisat2](#cite-hisat2).

For the course we will use the GEM version 3 mapper, which is the default in TADbit.

Every mapper uses its own index file to map the reads in the FASTQ file efficiently, in terms of both time and accuracy. The GEM index file is generated using the `gem-indexer` command:

In [2]:
%%bash

gem-indexer --input ../refGenome/mm39_chr3.fa --output ../refGenome/mm39_chr3

2022/11/6 11:07:03 -- [Inspecting MultiFASTA]
2022/11/6 11:07:03 --  100% ... done [0.433 s]
2022/11/6 11:07:03 -- Inspected text 319490637 characters (index_complement=yes). Requesting 304 MB (encoded text)
2022/11/6 11:07:03 -- [Reading MultiFASTA]
2022/11/6 11:07:04 --  100000000 bases parsed
2022/11/6 11:07:04 -- Total 162407738 bases parsed ...done [1.074 s]
2022/11/6 11:07:04 -- [Generating Text (explicit Reverse-Complement)]
2022/11/6 11:07:05 --  100% ... done [0.454 s]
2022/11/6 11:07:05 -- [Generating BWT Forward-Text]
2022/11/6 11:07:05 -- [Building-BWT::Counting K-mers]
2022/11/6 11:07:06 --  100% ... done [0.962 s]
2022/11/6 11:07:06 -- [Building-BWT::Generating SA-Positions]
2022/11/6 11:07:06 --    3% 
2022/11/6 11:07:06 --    6% 
2022/11/6 11:07:06 --    9% 
2022/11/6 11:07:06 --   12% 
2022/11/6 11:07:06 --   15% 
2022/11/6 11:07:06 --   19% 
2022/11/6 11:07:06 --   22% 
2022/11/6 11:07:12 --   25% 
2022/11/6 11:07:12 --   28% 
2022/11/6 11:07:12 --   31% 
2022/11/6 11

The indexer should produce a `.gem` file in the _refGenome_ folder:

In [3]:
%%bash

ls ../refGenome/

genmap
mm39_chr3.fa
mm39_chr3.fa_genome.TADbit
mm39_chr3.gem
mm39_chr3.info
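
Since the index only needs to be built once per reference, it can be convenient to skip the indexing step when an up-to-date `.gem` file is already present. The following is a minimal sketch of such a guard, not part of TADbit: the helper names are hypothetical, and it assumes that `gem-indexer` writes its index to the `--output` prefix plus `.gem`, as in the listing above. Only the `--input` and `--output` options shown earlier are used.

```python
import os
import subprocess


def gem_index_path(prefix):
    """Path of the index file, assuming gem-indexer appends '.gem' to the prefix."""
    return prefix + ".gem"


def needs_indexing(fasta, prefix):
    """True if the index is missing or older than the FASTA file."""
    index = gem_index_path(prefix)
    if not os.path.exists(index):
        return True
    return os.path.getmtime(index) < os.path.getmtime(fasta)


def build_gem_index(fasta, prefix):
    """Run gem-indexer only when needed and return the index path."""
    if needs_indexing(fasta, prefix):
        # Same command as in the bash cell above; raises if gem-indexer fails
        subprocess.run(
            ["gem-indexer", "--input", fasta, "--output", prefix],
            check=True,
        )
    return gem_index_path(prefix)
```

With the paths used in this notebook, the call would be `build_gem_index("../refGenome/mm39_chr3.fa", "../refGenome/mm39_chr3")`.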


### References

<a name="cite-gem"/><sup>[^GEM](#ref-1) </sup>Marco-Sola, S., Sammeth, M., Guigó, R. et al. The GEM mapper: fast, accurate and versatile alignment by filtration. Nat Methods 9, 1185–1188 (2012). https://doi.org/10.1038/nmeth.2221

<a name="cite-bowtie2"/><sup>[^Bowtie2](#ref-2) </sup>Langmead, B., Salzberg, S. Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357–359 (2012). https://doi.org/10.1038/nmeth.1923

<a name="cite-hisat2"/><sup>[^Hisat2](#ref-3) </sup>Zhang, Y., Park, C., Bennett, C., Thornton, M., Kim, D. Rapid and accurate alignment of nucleotide conversion sequencing reads with HISAT-3N.