Background
- In the context of viruses, molecular epidemiology is used to describe how we can make inferences about the transmission, distribution, etiology, and prevention of viral infections within a population
- Given the decreasing cost of sequencing, the use of molecular epidemiology to understand viruses is increasing
Questions
- There are many questions that can be addressed with sequence data
- When did an epidemic start?
- How did an epidemic spread spatially, and between different risk groups?
- What are the dynamics of transmission over time
- These questions involve reconstructing phylogenetic trees from sequence data
Transmission and phylogenies
Ideally, we would like to know about the transmission history of a pathogen, but even in ideal cases, there is a loss of information in the phylogeny.
- No information before the common ancestor of the samples
- Direction of transmission is lost
- No information on individuals who have died/recovered before sampling
- We do not know which host is infected by which virus (except sampled individuals)
- Not all infection events are 'observed'
Workflow
- A typical molecular epidemiology workflow is rather linear
- Obtain sequences
- Align
- Screen for recombination, if necessary
- Reconstruct phylogeny
- Integrate phylogeny with other data (e.g. time, country)
- Visualise results
Problems
- There are many steps
- There are many software packages available to perform even a single part of this workflow
Why R?
- Free, general purpose statistical software
- Many libraries (>4000), including those for sequence analysis
- Can call external programs
- Literate programming
Why RStudio
- R runs in a terminal
- RStudio sits on top of R, and offers a number of additional features
- Multiple windows for editing and running of code
- Workspace browser
- File browser
- Integrated help
- Graphics window
What you'll (hopefully) learn
- Retrieving sequence data
- Obtaining sequence metadata
- Processing and altering sequence names
- Developing simple pipelines
What you won't learn
- Processing of next-generation sequencing data
- BEAST
Course structure
Next steps
Next, we'll go through a minimal example of generating a phylogeny. We'll do much more, both in terms of upstream analysis (data processing) and in downstream analysis (visualisation and interpretation) later.