For this practical, we will use R (and RStudio) and call fastsimcoal2 directly from R.
We will then visualize the results in R. The practical requires the following R packages:
You can install these packages with the following commands in R.
install.packages("plotrix")
install.packages("diagram")
install.packages("RColorBrewer")
The SFS (also called the allele frequency spectrum, AFS) is a summary of genome-wide data that describes the number of single nucleotide polymorphisms (SNPs) with a given frequency in our sample.
Consider the following genotypic data matrix where each entry codes for the genotypes as 0 (homozygote for reference allele), 1 (heterozygote), 2 (homozygote for alternative allele). In this matrix each column corresponds to a site, and each row corresponds to an individual. This is similar to what we can obtain from a variant call format (VCF) file.
snp1 | snp2 | snp3 | snp4 | snp5 | snp6 | snp7 | snp8 | snp9 | snp10 | snp11 | snp12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ind1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
ind2 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
ind3 | 0 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
ind4 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
ind5 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 1 | 0 | 1 | 0 | 1 |
Exercise 1.1.1 (10 min).
Use a piece of paper and a pen to compute the observed SFS from the above matrix.
Before starting, answer the following questions:
- What is the x-axis of the SFS? What is the range of possible values for the x-axis?
- What is the y-axis of the SFS? What is the range of possible values for the y-axis?
First, you need to compute the absolute allele frequency for each SNP.
Second, you need to count the number of sites with a given allele frequency.
Exercise 1.1.2.
Compute and plot the derived site frequency spectrum from the dataset in file “./Data/example.GT” with R.
Use the R code given below. First, read the file into a genotype matrix (the code below uses the scan() function). Second, compute the allele frequency for each SNP using the function colSums(). Third, count the number of sites with a given allele frequency using the function table(). Fourth, plot the SFS with the barplot() function.
This dataset was simulated from a single stationary population with a constant effective size, consisting of a sample of 5 diploid individuals genotyped at 10,000 sites.
IMPORTANT: In this exercise we assume that the ancestral/derived state of each allele can be determined (e.g. comparing with data from a related species), and that the reference allele corresponds to the ancestral state, and the alternative allele to the derived state.
# set the folder Data as the working directory. Please change in your computer accordingly.
path_to_datafile <- "./Data/"
# number of SNPs
nsnps <- 10000
# number of individuals
nind <- 5
# read the genotype data as a matrix
genomat <- matrix(scan(file=paste(path_to_datafile,"example.GT", sep="")), nrow=nind, ncol=nsnps, byrow=T)
# Exercise: get the observed SFS with the colSums() and table() functions.
all_freq <- colSums(genomat)
obs.sfs <- table(all_freq)
# to plot use the barplot() function.
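# For example, a minimal call could be (the axis labels are just suggestions):
barplot(obs.sfs, xlab="derived allele frequency", ylab="number of sites")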
Results.
You should obtain something like the following barplot on the left.
The plot on the left shows the SFS including the monomorphic sites, whereas the right plot shows the relative SFS, including only the polymorphic sites. The solid line on the right corresponds to the theoretical expectation for a stationary population of constant size.
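If you want to reproduce something like the plot on the right, the sketch below (building on obs.sfs and nind computed above) removes the monomorphic class, rescales the SFS to proportions, and overlays the classical expectation for a stationary constant-size population, in which the expected number of sites with derived allele frequency i is proportional to 1/i. Treat the plotting details as suggestions.
# keep only polymorphic entries (derived allele frequencies 1 to 2*nind - 1)
freqs <- as.numeric(names(obs.sfs))
sfs_poly <- obs.sfs[freqs > 0 & freqs < 2*nind]
# relative SFS: proportion of SNPs in each frequency class
rel_sfs <- sfs_poly/sum(sfs_poly)
# theoretical expectation under a stationary constant-size model: proportional to 1/i
i <- as.numeric(names(rel_sfs))
exp_rel <- (1/i)/sum(1/i)
# barplot of the relative SFS with the expectation overlaid as a line
mids <- barplot(rel_sfs, xlab="derived allele frequency", ylab="proportion of SNPs")
lines(mids, exp_rel, type="b", pch=16)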
TO DISCUSS
The SFS is an efficient way to summarize genome-wide variation in a sample.
So far we have been looking at 1D SFS. But this can be extended to any number of populations. For instance, if we had sampled data from two populations, the 2D-SFS corresponds to a matrix where the entry (i,j) corresponds to the count of sites with frequency i in population 1, and frequency j in population 2.
Consider the following genotype matrix for 12 SNPs genotyped in 5 individuals from 2 populations: 2 diploid individuals from Pop1 and 3 diploid individuals from Pop2. Each row corresponds to one individual and each column to one SNP. Genotypes are coded as 0, 1, 2 as in Exercise 1.1.
snp1 | snp2 | snp3 | snp4 | snp5 | snp6 | snp7 | snp8 | snp9 | snp10 | snp11 | snp12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Pop1 ind1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Pop1 ind2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 2 | 0 |
snp1 | snp2 | snp3 | snp4 | snp5 | snp6 | snp7 | snp8 | snp9 | snp10 | snp11 | snp12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Pop2 ind1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
Pop2 ind2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
Pop2 ind3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 2 |
Exercise 1.2.1 (10 min).
Use a piece of paper and a pen to compute the observed 2 dimensional SFS from the above matrices with the genotypes for each population.
The 2D-SFS can be seen as a matrix. Before starting, answer the following questions:
- What do the rows and columns of the 2D-SFS represent? What is the range of possible values for each?
- What does each entry of the matrix contain? What is the range of possible values for the entries?
First, you need to compute the absolute allele frequency for each SNP in each population.
Second, you need to count the number of sites with a given allele frequency in each population.
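If you want to check your paper-and-pen result, the minimal R sketch below uses the genotype matrices from the tables above and the same colSums()/table() logic as in Exercise 1.1.2; factor() is used so that empty frequency classes are kept.
# genotype matrices from the tables above (individuals in rows, SNPs in columns)
geno_pop1 <- matrix(c(1,0,0,1,0,0,0,0,1,0,0,0,
                      0,0,1,0,0,1,0,0,0,1,2,0), nrow=2, byrow=TRUE)
geno_pop2 <- matrix(c(1,1,0,0,0,0,1,0,1,1,1,1,
                      0,0,0,0,0,1,0,0,0,1,0,1,
                      0,0,0,0,0,0,0,1,1,0,1,2), nrow=3, byrow=TRUE)
# derived allele count per SNP in each population
freq_pop1 <- colSums(geno_pop1)
freq_pop2 <- colSums(geno_pop2)
# cross-tabulate the two frequencies, keeping empty classes with factor()
sfs2d <- table(factor(freq_pop1, levels=0:(2*nrow(geno_pop1))),
               factor(freq_pop2, levels=0:(2*nrow(geno_pop2))))
# entry (i+1, j+1) is the number of sites with frequency i in Pop1 and j in Pop2
sfs2d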
The following matrix contains the 2D-SFS of 50,000 simulated sites from two populations, with a sample size of 2 diploids for pop1 and 3 diploids for pop2. Hence, the matrix has 5 rows (frequencies 0 to 4 in pop1) and 7 columns (frequencies 0 to 6 in pop2). Each entry (i,j) contains the number of SNPs with a derived allele frequency of i in pop1 and j in pop2. The sum of the entries in this matrix is the total number of sites, which in this case is 50,000.
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
0 | 23718 | 4770 | 1804 | 683 | 288 | 85 | 22 |
1 | 3428 | 1919 | 1370 | 826 | 495 | 214 | 93 |
2 | 873 | 866 | 853 | 783 | 584 | 428 | 219 |
3 | 225 | 329 | 450 | 533 | 541 | 557 | 471 |
4 | 23 | 72 | 155 | 251 | 413 | 649 | 1010 |
Exercise 1.2.2.
Based on the above matrix, answer the following questions.
We will analyse the models simulated according to the following scenarios.
The input files with the observed SFS are in the folder: “./FscInputFiles/”. There are two files:
Exercise 2.1.2.
Have a look at the input files.
You can do this using a text editor (e.g. RStudio editor, gedit, TextPad, Notepad). You can also read them with R using the read.table() function using the code below.
# read the observed SFS
# 1D from a population that underwent an expansion
pop1d <- read.table("./FscInputFiles/1PopExpInst20Mb_DAFpop0.obs", skip=1, stringsAsFactors = F, header=T)
# 2D from two populations with different Ne that diverged some time ago
pop2d <- read.table("./FscInputFiles/2PopDiv20Mb_jointDAFpop1_0.obs", skip=1, stringsAsFactors = F, header=T, row.names = 1)
For the sample from a single population, the observed derived SFS is:
d0_0 | d0_1 | d0_2 | d0_3 | d0_4 | d0_5 | d0_6 | d0_7 | d0_8 | d0_9 | d0_10 |
---|---|---|---|---|---|---|---|---|---|---|
19988389 | 10543 | 355 | 163 | 127 | 128 | 78 | 82 | 84 | 51 | 0 |
For the sample from two populations, the observed derived SFS is:
d0_0 | d0_1 | d0_2 | d0_3 | d0_4 | d0_5 | d0_6 | |
---|---|---|---|---|---|---|---|
d1_0 | 19969298 | 8235 | 2294 | 945 | 368 | 152 | 49 |
d1_1 | 4188 | 1615 | 1142 | 703 | 471 | 234 | 112 |
d1_2 | 1204 | 853 | 815 | 674 | 532 | 370 | 221 |
d1_3 | 415 | 429 | 539 | 556 | 535 | 471 | 363 |
d1_4 | 94 | 199 | 296 | 390 | 543 | 695 | 0 |
The SFS is the only data used to infer parameters with fastsimcoal2: you do not need to provide anything else. All the genome-wide data is summarized by the SFS, and from it we can estimate the parameters and compute the likelihood.
TO DISCUSS:
Can the observed SFS include the monomorphic sites?
Can the observed SFS include linked sites?
Exercise 2.1.3.
Have a look at the TPL and EST files for the models shown above.
Each model is defined in a TPL file. The parameters we want to infer are given a parameter name tag (e.g. NPOP, TDIV). There are two TPL files in the folder “FscInputFiles”, corresponding to the two models.
The search range for each parameter is defined in the EST file. For each parameter we specify the search range using the corresponding parameter name tag. There are two EST files in the folder “FscInputFiles”, corresponding to the two models.
In sum, there are two steps: (i) define the model and the parameter name tags in the TPL file, and (ii) define the search range of each parameter in the EST file.
For further information on the definition of models, check the manual and the material given during the lecture. NOTE: the TPL and EST files need to have the same filename but different extensions: [filename].est and [filename].tpl.
TO DISCUSS
What is the mutation rate assumed? Do we need a mutation rate?
To run fastsimcoal2 we need to specify the following options:
The following scripts will run a single fastsimcoal2 optimization, starting from random initial parameter values. When dealing with real data, one needs to perform at least 20 to 100 independent runs (i.e. runs with different initial starting values) and then select the run that maximizes the likelihood.
Exercise 2.2.1.
- Run fastsimcoal2 to estimate the parameters of the single population model.
Follow the R commands below. Try to understand what is done at each step of the code.
# load functions to run fastsimcoal2 and to process the output
source("utilFscFunctions.r")
# create a list that saves all the required settings to run fastsimcoal2
settings <- list()
# path to fastsimcoal2 executable file
settings$pathToFsc <- "./fastsimcoal2/fsc26_linux64/"
# path to input files (working directory)
settings$pathTo_InputFile <- "./FscInputFiles/"
# path to TPL and EST file (tags)
settings$TPL_EST_file_tag <- "1PopExpInst20Mb"
# number of coalescent simulations
settings$n_coalsims <- 100000
# number of optimization cycles
settings$n_cycles <- 20
# IF YOU WANT TO RUN FASTSIMCOAL2 FROM THE COMMAND LINE
# get the fastsimcoal2 command line
cmd <- get_fsc2_commandline(settings)
# IF YOU WANT TO RUN FASTSIMCOAL2 BY CALLING IT FROM R
# run fastsimcoal2
run_fsc2(settings)
Running fastsimcoal2 should take less than 20 seconds. You will get a print of the optimization procedure, with the likelihood at each iteration. For instance, I obtained the following in one of the runs:
Estimation of parameters by conditional maximization via Brent algorithm (initial lhood = -111487)
Iter 1 Curr best params: 4224 55 74883 lhood=-48455.9
Iter 2 Curr best params: 4224 55 74883 lhood=-48455.9
Iter 3 Curr best params: 4224 55 74883 lhood=-48455.9
Iter 4 Curr best params: 4219 114 2852 lhood=-47560.3
Iter 5 Curr best params: 82085 344 2223 lhood=-45117.9
… Iter 20 Curr best params: 566385 532 2046 lhood=-45045.8
The likelihood is given in log10 units. Hence, the closer to zero, the higher the likelihood. We can see that the likelihood increased from -111487 to -45045.8 over 20 cycles of optimization. We can also see that after the first iteration the parameters were NPOP = 4224, NANC = 55, TCHANGE = 74883, whereas at iteration 20 they were NPOP = 566385, NANC = 532, TCHANGE = 2046.
How to interpret fastsimcoal2 results?
Fastsimcoal2 will output several files, which contain important information. The most relevant information for most users is:
Exercise 2.2.2
Obtain parameter estimates and maximum likelihood.
Follow the code below that will read the output files from fastsimcoal2 to obtain the parameter estimates.
# read the file with the maximum likelihood estimates
maxlhoodEstimates <- read.table(paste(settings$pathTo_InputFile, "/",
settings$TPL_EST_file_tag, "/",
settings$TPL_EST_file_tag, ".bestlhoods",
sep=""), header=T)
# look at the parameter estimates
maxlhoodEstimates
## NPOP NANC TCHANGE MaxEstLhood MaxObsLhood
## 1 525242 540 2037 -45045.87 -45043.21
maxlhoodEstimates contains a table in which each column corresponds to a parameter name tag (plus the MaxEstLhood and MaxObsLhood columns) and the row holds the corresponding estimates. This is the result of a single run. In the analysis of real data sets we need to perform several independent runs (with different starting values) and select the parameter estimates from the run attaining the highest likelihood (MaxEstLhood).
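A sketch of how several independent runs could be chained from R is given below. It simply repeats run_fsc2() with the settings defined above and stores the .bestlhoods output of each run before it is overwritten by the next one; treat it as a sketch rather than a ready-made pipeline.
# run several independent optimizations and keep the best one (sketch)
n_runs <- 5   # for real data use 20 to 100 runs
all_runs <- list()
for(r in 1:n_runs) {
  run_fsc2(settings)
  # read the .bestlhoods file produced by this run
  all_runs[[r]] <- read.table(paste(settings$pathTo_InputFile, "/",
                                    settings$TPL_EST_file_tag, "/",
                                    settings$TPL_EST_file_tag, ".bestlhoods",
                                    sep=""), header=T)
}
all_runs <- do.call(rbind, all_runs)
# select the run with the highest MaxEstLhood
best_run <- all_runs[which.max(all_runs$MaxEstLhood), ]
best_run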
TO DISCUSS:
Difference between MaxObsLhood and MaxEstLhood
MaxObsLhood: is the maximum possible value for the likelihood if there was a perfect fit of the expected to the observed SFS, i.e. if the expected SFS was the relative observed SFS.
MaxEstLhood: is the maximum likelihood estimated according to the model parameters.
The better the fit, the smaller the difference between MaxObsLhood and MaxEstLhood.
So, in your case, all students should have the same MaxObsLhood because all students are looking at the same OBSERVED SFS, but each one of you will have a different MaxEstLhood because you have run optimization cycles starting from different initial values.
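To make this concrete, the sketch below computes a multinomial composite log10-likelihood for the 1D observed SFS shown earlier, using the relative observed SFS itself as the expected SFS (i.e. assuming a perfect fit). This is only an illustration of the principle behind MaxObsLhood, not the exact fastsimcoal2 implementation.
# observed 1D SFS (counts per derived allele frequency, monomorphic class included)
obs <- c(19988389, 10543, 355, 163, 127, 128, 78, 82, 84, 51, 0)
# "expected" proportions set equal to the relative observed SFS (perfect fit)
p <- obs/sum(obs)
# composite log10-likelihood, ignoring the constant multinomial coefficient and
# skipping classes with an expected probability of zero
keep <- p > 0
sum(obs[keep] * log10(p[keep]))   # close to the MaxObsLhood of about -45043 seen above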
Exercise 2.2.3
- Assess the fit of the expected SFS under the model to the observed SFS.
Follow the code below, which reads the expected SFS produced by fastsimcoal2 and compares it with the observed SFS.
# Assess the fit of the expected SFS to the observed SFS
# Read the expected SFS
# 1D from a population that underwent an expansion
obsfileend <- "_DAFpop0"
pop1d <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, obsfileend, ".obs",
sep=""), skip=1, stringsAsFactors = F, header=T)
expectedSFS <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, "/",
settings$TPL_EST_file_tag, obsfileend, ".txt",
sep=""), header=T)
# discard the monomorphic sites from the observed SFS to compare with the expected SFS
observedSFS <- pop1d
# observed SFS just with SNPs (polymorphic sites)
observedSFS[1] <- 0
# Plot the observed vs expected SFS
plot_fitSFS_1d(observedSFS, expectedSFS, c(1,max(observedSFS)*1.1), "pop1")
Copy the file with the parameters that maximized the likelihood for your run, “*.bestlhoods" into a shared folder, such that we have a set of all the runs to analyse.
DISCUSSION
Exercise 2.3. (optional) - Run for the model with two populations by modifying the code above.
The folder “./Henn_et_al_data” contains exome data from two human populations (Namibian San, SAN; https://en.wikipedia.org/wiki/San_people, and Mexican Mayan, MAYA, https://en.wikipedia.org/wiki/Maya_peoples) published in Henn et al. (2015) PNAS (http://www.pnas.org/content/113/4/E440.abstract). This data was kindly provided by Stephan Peischl (http://www.bioinformatics.unibe.ch/about_us/staff/dr_peischl_stephan/index_eng.html), and is a small subset of the original data. As such, it cannot be used outside the scope of this course. The original data is available at NCBI Sequence Read Archive (accession no. SRP036155).
The folder “./Henn_et_al_data” contains the following files:
- 2D-SFS for the two populations, assuming that we sequenced a total of 44Mb
- Original genotypic matrix
- TPL and EST files for a model without gene flow
In this case pop1 is Maya with 8 diploid individuals sampled (16 gene copies), and pop2 is San with 6 diploid individuals sampled (12 gene copies).
Exercise 3.1.1. Have a look at the input files and answer the following questions:
If the monomorphic sites are not included, we cannot use the mutation rate as a clock for inference. When we do not have the number of monomorphic sites, all parameters must be specified relative to a given fixed parameter (e.g. an effective size, or the time of an event). Here I obtained the number of monomorphic sites assuming that a total of 44 Mb of exome was sequenced and that all sites that are not SNPs are monomorphic, i.e. #monomorphic = 44e6 - #SNPs.
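As a small sketch of this bookkeeping (with a toy 2D-SFS of SNP counts only; the numbers are made up):
# toy 2D-SFS containing only SNP counts (rows: pop1 frequency, columns: pop2 frequency)
sfs2d <- matrix(c(0, 40, 10,
                  30, 25, 15,
                  5, 20, 35), nrow=3, byrow=TRUE)
# assumed total length of sequence surveyed (44 Mb) and number of SNPs
total_sites <- 44e6
n_snps <- sum(sfs2d)
# the number of monomorphic sites goes into the [0,0] entry of the 2D-SFS
sfs2d[1, 1] <- total_sites - n_snps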
Exercise 3.1.2.
Infer the time of divergence between two human populations, San from Africa and Maya from America using fastsimcoal2
Use the script below for the model without gene flow.
# load functions to run fastsimcoal2 and to process the output
source("utilFscFunctions.r")
source("ParFileInterpreter-VS1.r")
# create a list that saves all the required settings to run fastsimcoal2
settings <- list()
# path to fastsimcoal2 executable file
settings$pathToFsc <- "./fastsimcoal2/fsc26_linux64/"
# path to input files (working directory)
settings$pathTo_InputFile <- "./Henn_et_al_data/"
# path to TPL and EST file (tags)
settings$TPL_EST_file_tag <- "NoMig_San_Maya"
# number of coalescent simulations
settings$n_coalsims <- 100000
# number of optimization cycles
settings$n_cycles <- 20
# IF YOU WANT TO RUN FASTSIMCOAL2 FROM THE COMMAND LINE
# get the fastsimcoal2 command line
cmd <- get_fsc2_commandline(settings)
# IF YOU WANT TO RUN FASTSIMCOAL2 BY CALLING IT FROM R
# run fastsimcoal2
run_fsc2(settings)
# read the file with the maximum likelihood estimates
maxlhoodEstimates <- read.table(paste(settings$pathTo_InputFile, "/",
settings$TPL_EST_file_tag, "/",
settings$TPL_EST_file_tag, ".bestlhoods",
sep=""), header=T)
# look at the parameter estimates
maxlhoodEstimates
# Read the observed SFS
obsfileend <- "_jointDAFpop1_0"
# 2D from two human populations: San and Maya
maya_san_2dsfs <- read.table(paste(settings$pathTo_InputFile, "/", settings$TPL_EST_file_tag, obsfileend,".obs",sep=""), skip=1, stringsAsFactors = F, header=T, row.names = 1)
# Read the expected SFS
expectedSFS <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, "/",
settings$TPL_EST_file_tag, obsfileend, ".txt",
sep=""), header=T)
# discard the monomorphic sites from the observed SFS
pol_obs_2d <- as.matrix(maya_san_2dsfs)
# observed SFS just with SNPs (polymorphic sites)
pol_obs_2d[1,1] <- 0
# get the marginal 1d sfs for pop 1
marginal1d_sfs <- rowSums(pol_obs_2d)
exp.marginal1d_sfs <- rowSums(expectedSFS)
# Plot the observed vs expected SFS
plot_fitSFS_1d(marginal1d_sfs, exp.marginal1d_sfs, c(1,max(marginal1d_sfs)*1.1), "Maya")
# get the marginal 1d sfs for pop 2
marginal1d_sfs <- colSums(pol_obs_2d)
exp.marginal1d_sfs <- colSums(expectedSFS)
# Plot the observed vs expected SFS
plot_fitSFS_1d(marginal1d_sfs, exp.marginal1d_sfs, c(1,max(marginal1d_sfs)*1.1), "San")
# Plot the 2D-SFS fit
plot2dSFS(maya_san_2dsfs, expectedSFS, xtag="San", ytag="Maya", minentry=1)
plot_relDiff2dSFS(maya_san_2dsfs, expectedSFS, xtag="San", ytag="Maya", minentry=1)
# Draw a scenario of the model parameter estimates assuming a generation time of 29 years
# pdf(file=paste(settings$TPL_EST_file_tag,"_model.pdf", sep=""), width=8, height=10)
# path to the maxL.par file, i.e. a par file with the parameter values that maximize the likelihood
path_to_maxL_file <- paste(settings$pathTo_InputFile, settings$TPL_EST_file_tag, "/",settings$TPL_EST_file_tag, "_maxL", sep="")
parFileInterpreter(args=path_to_maxL_file, pop.names=c("Maya","San"), gentime=29, printPDF=FALSE)
# dev.off()
Modify the script above to run the model with gene flow between San and Maya.
DISCUSSION
To obtain the SFS in the previous exercises we assumed we could determine the ancestral and derived state of each allele. This is usually done by comparing our data with outgroup species. We further assumed that genotypes were coded according to the derived allele (i.e., the reference allele corresponded to the ancestral state, and the alternative allele to the derived state). The resulting SFS is thus called the “derived SFS” or “unfolded SFS”. However, in many cases, especially for non-model organisms, we do not have data from a closely related species (outgroup) to determine the ancestral state of mutations. In such cases we can instead summarize the data using the “minor allele frequency” SFS, also called the “folded SFS”.
EXERCISE 4.1.
Compute the folded (minor allele frequency) SFS for the following genotypic matrix.
Follow the steps below. The “minor allele frequency” refers to the frequency of the less frequent allele at a particular site. Consider the following genotype matrix:
snp1 | snp2 | snp3 | snp4 | snp5 | snp6 | snp7 | snp8 | snp9 | snp10 | snp11 | snp12 |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 2 |
1 | 0 | 1 | 1 | 2 | 0 | 2 | 0 | 1 | 0 | 0 | 2 |
0 | 1 | 1 | 0 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
1 | 0 | 2 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 1 | 1 |
0 | 0 | 2 | 2 | 2 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
Assuming that alleles were arbitrarily assigned as reference (the allele in the reference genome) or alternative (the allele different from the reference), and that genotypes are coded as 0 (homozygote for the reference allele), 1 (heterozygote), 2 (homozygote for the alternative allele), we can obtain the minor allele frequency spectrum as follows. First, compute the allele frequency of the reference and the alternative allele at each site.
The corresponding allele frequencies are:
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
allele1 | 8 | 9 | 2 | 7 | 1 | 9 | 3 | 9 | 8 | 9 | 7 | 3 |
allele2 | 2 | 1 | 8 | 3 | 9 | 1 | 7 | 1 | 2 | 1 | 3 | 7 |
Now, for each site we need to determine which is the minor allele, i.e. the allele with a frequency below 50%. In this example allele1 is the minor allele for sites 3, 5, 7 and 12, whereas allele2 is the minor allele for sites 1, 2, 4, 6, 8, 9, 10 and 11.
Next, we compute the “minor allele SFS” by counting how many sites have a given minor allele frequency.
To compute the SFS entry for the minor allele frequency of 1, we count the number of sites with a minor allele frequency of 1.
SFS entry 1: 5 (snps 2, 5, 6, 8, 10)
SFS entry 2: 3 (snps 1, 3, 9)
SFS entry 3: 4 (snps 4, 7, 11, 12)
SFS entries from 4 to 10: 0
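If you want to check these counts in R, a minimal sketch (using the genotype matrix above; pmin() returns the smaller of the two allele counts at each site) is:
# genotype matrix from the table above (5 individuals in rows, 12 SNPs in columns)
geno <- matrix(c(0,0,2,0,1,0,1,0,0,0,1,2,
                 1,0,1,1,2,0,2,0,1,0,0,2,
                 0,1,1,0,2,1,1,0,1,0,0,1,
                 1,0,2,0,2,0,2,0,0,0,1,1,
                 0,0,2,2,2,0,1,1,0,1,1,1), nrow=5, byrow=TRUE)
# counts of the alternative and reference allele at each site
alt_count <- colSums(geno)
ref_count <- 2*nrow(geno) - alt_count
# minor allele count at each site, and the folded (MAF) SFS
maf_count <- pmin(alt_count, ref_count)
table(maf_count)   # should give 5 sites with MAF 1, 3 with MAF 2, 4 with MAF 3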
Another way to compute the minor allele SFS is to first obtain the SFS assuming that the alternative allele is the derived allele, and then add the entries corresponding to the same “minor allele frequency”. Note that sites with a derived allele count of 1 or 9 (out of 10) correspond to a minor allele frequency of 1, sites with a derived allele count of 2 or 8 (out of 10) correspond to a minor allele frequency of 2, and so on. Hence, the minor allele (or folded) SFS can be computed by adding the entries of a derived SFS that lead to the same minor allele frequency, as if we were folding the derived SFS at the 50% allele frequency.
EXERCISE 4.2.
Compute the MAF SFS for the genotypic matrix in the file “./Data/example.GT”.
Follow the steps below. Here we will use the SFS computed assuming that the reference allele is the ancestral allele, and then add the entries with the same “minor allele frequency”.
Another way to see this is that we need to add the entries shown with the same color in the barplot produced by the code below to obtain the “minor allele frequency SFS (MAF SFS)”, or “folded SFS”.
library(RColorBrewer)
par(mfrow=c(1,2))
colvs <- brewer.pal(6, "Set1")
barplot(obs.sfs, xlab="derived allele frequency", ylab="number of sites", col=colvs[c(1:6,5:1)])
# set the maf sfs the same as obs.sfs
maf.obs.sfs <- as.numeric(obs.sfs)
# set to zero all entries with a derived allele frequency greater than nind (i.e. above 50%)
maf.obs.sfs[c((nind+2):((2*nind)+1))] <- 0
# add entry 1 (frequency 0) to entry (2*nind)+1 (frequency 2*nind), entry 2 to entry 2*nind, and so on
maf.obs.sfs[c(1:(nind))] <- obs.sfs[c(1:(nind))]+obs.sfs[c(((2*nind)+1):(nind+2))]
# str(maf.obs.sfs)
barplot(maf.obs.sfs, xlab="minor allele frequency", ylab="number of sites", col=colvs, main="Minor allele frequency (MAF) SFS", names.arg =c(0:(nind*2)))
Below I provide you with an R function to compute the MAF (folded) from an unfolded (derived) SFS.
# GETMAF
# Given an unfolded SFS get the minor allele frequency spectrum (folded SFS)
# INPUT
# unfolded.sfs : numeric vector of size (nindividuals*2)+1 with the unfolded SFS
# NOTE! Only works for unfolded.sfs with an odd number of entries!!
getMaf <- function(unfolded.sfs) {
# initialize the maf sfs as a copy of the unfolded sfs
maf <- as.numeric(unfolded.sfs)
# define the length of the SFS and half the SFS
n <- length(maf)
nhalf <- ceiling(n/2)
if(n %% 2 == 0) {
stop("This getMaf function only works for SFS with odd number of entries (for data from diploid individuals.)")
}
# set to zero all entries with a derived allele frequency above 50% (entries nhalf+1 to n)
maf[(nhalf+1):n] <- 0
# add the entry 1 to n, 2 to n-1, and so on...
maf[1:(nhalf-1)] <- unfolded.sfs[1:(nhalf-1)]+
unfolded.sfs[n:(nhalf+1)]
maf
}
# Apply function to get maf.obs.sfs
maf.obs.sfs <- getMaf(obs.sfs)
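As a quick sanity check, you can also apply getMaf() to the unfolded SFS of the small matrix from Exercise 4.1 (treating the alternative allele as the derived allele):
# unfolded SFS from Exercise 4.1, with frequency classes 0 to 10
unfolded41 <- c(0, 4, 2, 2, 0, 0, 0, 2, 1, 1, 0)
getMaf(unfolded41)   # entries for minor allele frequencies 1, 2 and 3 should be 5, 3 and 4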
In most cases, datasets contain missing data because we could not determine the genotype of a given individual at a particular site. Missing data pose some problems when computing the observed SFS, as the x-axis of the SFS corresponds to the absolute allele frequency (a count, not a relative frequency). A genotype matrix with missing data could look as follows:
snp1 | snp2 | snp3 | snp4 | snp5 | snp6 | snp7 | snp8 | snp9 | snp10 | snp11 | snp12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ind1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
ind2 | NA | 0 | 1 | 1 | 1 | NA | 1 | 0 | 1 | 0 | 1 | 0 |
ind3 | 0 | 1 | 2 | NA | 1 | 1 | 0 | NA | 0 | 0 | 0 | 0 |
ind4 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | NA | 1 | 0 | 1 | 1 |
ind5 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 1 | 0 | 1 | 0 | 1 |
Consider sites snp1 and snp2, which both have one individual heterozygous for the derived allele, all remaining individuals carrying only the ancestral allele. (Note: here I assume we know the ancestral state.) If there was no missing data, both SNPs would contribute to entry 1 of the SFS. However, snp2 has no missing data, and hence the derived allele has a frequency of 1 out of 10 gene copies (5 diploid individuals), whereas for snp1 the frequency is 1 out of 8 gene copies (4 diploid individuals) because one individual has missing data. Which entry should snp1 contribute to? It is not easy to answer this question!
To deal with missing data we have three options:
In the script below I show how one could obtain the SFS by resampling genotypes such that we keep min_ss individuals with data at every site. The required input is a genotypic matrix.
# number of SNPs
nsnps <- 10000
# number of individuals
nind <- 5
# read the genotype matrix with missing data
md_geno <- matrix(scan(file="./Data/example_missing_data.GT"), nrow=nind, ncol=nsnps, byrow=T)
# find minimum sample size
md_persite <- colSums(is.na(md_geno))
min_ss <- nind-max(md_persite)
# Go through each site, sample min_ss genotypes for each site without missing data
no_ms_geno <- apply(md_geno, 2, function(each_site) {sample(each_site[!is.na(each_site)], size=min_ss, replace = F)})
# Check that there is no missing data
if(sum(is.na(no_ms_geno))>0) {stop("Error: there is still missing data after resampling!")}
# Get the SFS for the dataset without missing data
obs.sfs <- table(colSums(no_ms_geno))
# Plot the unfolded and folded SFS
par(mfrow=c(1,2))
# unfolded
barplot(obs.sfs, xlab="derived allele frequency", ylab="number of sites")
# folded
maf.obs.sfs <- getMaf(obs.sfs)
barplot(maf.obs.sfs, xlab="minor allele frequency", ylab="number of sites", main="Minor allele frequency (MAF) SFS", names.arg =c(0:(min_ss*2)))