Markov chain Monte Carlo sampling of gene genealogies conditional on unphased SNP genotype data

Kelly M. Burkett; Brad McNeney; Jinko Graham

doi:10.1515/sagmb-2012-0011

Published by De Gruyter August 17, 2013

Markov chain Monte Carlo sampling of gene genealogies conditional on unphased SNP genotype data

Kelly M. Burkett , Brad McNeney and Jinko Graham

From the journal Statistical Applications in Genetics and Molecular Biology

https://doi.org/10.1515/sagmb-2012-0011

Showing a limited preview of this publication:

Abstract

The gene genealogy is a tree describing the ancestral relationships among genes sampled from unrelated individuals. Knowledge of the tree is useful for inference of population-genetic parameters and has potential application in gene-mapping. Markov chain Monte Carlo approaches that sample genealogies conditional on observed genetic data typically assume that haplotype data are observed even though commonly-used genotyping technologies provide only unphased genotype data. We have extended our haplotype-based genealogy sampler, sampletrees, to handle unphased genotype data. We use the sampled haplotype configurations as a diagnostic for adequate sampling of the tree space based on the reasoning that if haplotype sampling is restricted, sampling from the tree space will also be restricted. We compare the distributions of sampled haplotypes across multiple runs of sampletrees, and to those estimated by the phase inference program, PHASE. Performance was excellent for the majority of individuals as shown by the consistency of results across multiple runs. However, for some individuals in some datasets, sampletrees had problems sampling haplotype configurations; longer run lengths would be required for these datasets. For many datasets though, we expect that sampletrees will be useful for sampling from the posterior distribution of gene genealogies given unphased genotype data.

Keywords: coalescent model; gene genealogy; haplotype uncertainty; Markov chain Monte Carlo

Corresponding author: Kelly M. Burkett, Department of Statistics and Actuarial Science, Simon Fraser University, 8888 University Drive, Burnaby, BC, V5A 1S6, Canada, e-mail: kburkett@sfu.ca

We thank two anonymous reviewers for constructive comments that greatly improved the presentation of the manuscript. This work was supported by the Natural Sciences and Engineering Research Council of Canada and the Mathematics of Information Technology and Complex Systems Networks of Centres of Excellence. Portions of this work were undertaken while KB held a Canadian Institutes of Health Research Doctoral Research Award and a Michael Smith Foundation for Health Research (MSFHR) Senior Graduate Trainee Award, and while JG was a MSFHR Scholar.

Appendix A: Similarity scores in the dictionary rephaser

The dictionary rephaser uses two different scores to (1) sample a new haplotype pair from the list of candidate configurations and (2) select new sibling nodes for the tip nodes whose haplotypes were updated. Both scores use the mutation model to weight the selection.

The first score quantifies the similarity between a particular haplotype pair in the list of candidate configurations and the haplotype dictionary, which consists of the tip haplotypes. Each haplotype pair in the list of candidate configurations is linked to one individual. Suppose the haplotype pair being scored is the j^th possible phasing of the i^th individual. If this pair is individual i’s current phasing, it receives a score of zero; otherwise the score is computed as follows:

Compare the first haplotype of configuration j to the haplotype dictionary and store the maximum number of identical alleles between the haplotype and a haplotype in the dictionary. Call this max₁; it has maximum value L, the total number of loci, when the first haplotype of configuration j is identical to a haplotype in the dictionary.
Compare the second haplotype of configuration j to the dictionary and compute max₂.
max₁ and max₂ correspond to the similarity for the two most similar haplotypes in the dictionary. There are d=2L–max₁–max₂ differences between configuration j and this closest pair. Assuming the differences between the two sequences correspond to mutation events, the similarity score accounts for the probability of the mutation events. Recall that mutations occur according to a Poisson process with rate θ/2 per SNP per unit of coalescence time. For two tip nodes having first common ancestor t time units ago, the total branch time separating the two nodes will be 2t and, since mutation events on the two branches are independent, the number of mutations occurring on the branches separating the two nodes will therefore be Poisson distributed with mean θ/2×2t. However, since the j^th configuration does not yet correspond to an actual node in the tree, we let t=1, which is the average coalescence time for a sample of two sequences under the neutral coalescent model. The score then is
where the current value of θ is used to compute the score.
Calculate the score for all candidate configurations as described above and rescale all the scores so that they sum to 1. This defines the proposal distribution for sampling a new haplotype configuration.

The second similarity score is used to choose the new sibling node K_n for the target node K_i at the tips of the tree. The new sibling node could be any node other than K_i’s current parental node and is chosen based on its sequence’s similarity to the new haplotype assigned to node K_i. We desire a similarity score that strongly favours coalescence events between identical or very similar sequences in order to increase the probability of accepting the update. As with the score for sampling a new configuration, this score weights the number of differences between the portion of the sequence that each could have inherited from their parental node by the probability of mutation events; this probability is derived from the Poisson mutation model described in Section 2.1. We assume that no recombination events occurred along the branches b_i and b_n so that K_i and K_n both inherit all tracked sequence from their parental node. We also assume that the allelic differences in the sequence that both inherited from their parental node, between locus 1 and min(z_i, z_n)–1, are due to distinct mutation events occurring somewhere over the combined branch time separating the two nodes, t=b_i+b_n. As K_i and K_n aren’t currently sibling nodes, we do not know b_i and b_n. We therefore let their hypothetical parental node’s time be the mean of its proposal distribution when computing the probability under the Poisson mutation model of the sequence similarity between K_n and K_i.

Appendix B: Additional simulations

B.1 Performance with larger datasets

We performed additional simulations to evaluate sampletrees performance with larger sample and region sizes. As previously described, we simulated datasets with the MS program (Hudson, 2002) using mutation and recombination rates of 10^–⁸ and an effective population size of 10,000. Two sets of simulations were performed, the first to examine how run time was affected by larger sample sizes and by increasing the number of SNPs and the second to examine how convergence was affected.

To examine the run time of larger datasets, we simulated 100 datasets of 500 individuals and a 100 kb region. Random subsets composed of all combinations of 25, 50, 100 and 150 individuals and SNPs were selected from the full datasets. For the selection of SNPs, a location in the 100 kb region was randomly sampled and the resulting dataset consisted of the closest SNPs to that location. Therefore, the average SNP density of all the datasets remains the same regardless of the number of SNPs that are sampled. By contrast, the average length of the region increases with the number of SNPs sampled.

Sampletrees was run with the settings previously described (Section 3) for 1 million iterations. Figure 7 shows run times by sample size. For each value of the number of SNPs, the run time increases linearly with sample size; similar results are seen with plots of run times by number of SNPs (not shown).

Although running sampletrees with 100 or more SNPs would be extremely time consuming, we would not necessarily expect improved performance with larger region sizes because recombination events limit the amount of sequence that is tracked on the tree. Depending on linkage disequilibrium, the additional SNPs may only contribute information about the genealogy of the focal point near the tips of the tree, before being removed due to recombination.

We ran limited simulations to evaluate the effects on convergence of additional SNPs and samples. We simulated 10 datasets using the same procedure as described for evaluating timing. We then ran 100 million iterations on datasets of 25 and 50 individuals and 25 and 50 SNPs as previously described. Figure 8 shows histograms of the maximum absolute difference between the three estimates of the posterior probability of the true haplotype configuration for individuals in the 10 datasets for each of the 4 combinations of number of SNPs and individuals.

Figure 8

Histogram of the maximum absolute difference between estimates of posterior probability of true configuration for various numbers of individuals (N) and number of SNPs (#SNPs). Note that the y-axis of each histogram has been scaled in order to better compare the frequencies in the 0.1–1 range. However, this scaling leads to distortion and in particular the first bar of all histograms is much higher than it appears in the plots.

When the number of SNPs is doubled, we see an increase in the maximum absolute difference between the three estimates; when the sample size is 25 the proportion of samples with maximum absolute difference >0.05 is 0.01 for 25 SNPs and 0.08 for 50 SNPs (Figure 8A vs. B). With the increased number of SNPs there are more individuals with more extreme differences in estimates. There are 10 individuals with estimates having a >0.1 difference when there are 50 SNPs; however, seven of these individuals belong to the same dataset. As with the previous analyses (Section 3.2.1), we would expect that an increase in the number of MCMC samples would likely lower the differences between estimates.

When the sample size is doubled, we see only a small increase in the maximum difference between the estimates (Figure 8A vs. C); the proportion of samples with maximum absolute difference >0.05 is 0.01 for 25 SNPs and 0.03 for 50 SNPs. We would also expect a small increase in the number of MCMC samples would lower the differences between estimates.

Finally, when both the sample size and number of SNPs is doubled, the results are similar to when only the number of SNPs is doubled (Figure 8B vs. D). Therefore, increasing the number of SNPs included in the dataset has a larger effect on convergence than increasing the sample size; we would have to run the chain longer for datasets of more SNPs. This observation may be due to the increase in the number of haplotypes that are possible with increased number of SNPs. This in turn increases the size of the tree space to sample over. However, the maximum number of possible haplotypes is always 2^l–1, where l is the number of SNPs, and so the size of the haplotype space is not larger when the sample size is increased.

B.2 Performance with populations undergoing exponential growth

Human genetic sequence data were previously found to be consistent with population genetic models having a constant population size. However, sequencing studies with larger sample sizes than previously used have recently discovered an excess of rare variants (for example, Nelson et al., 2012), which are compatible with exponentially growing populations (Keinan and Clark, 2012). Exponential growth changes the shape of the ancestral trees; in particular, the tip branches are proportionally longer relative to the full height of the tree. Mutation events on these long tip branches will only be present in a few of the tip haplotypes, leading to many more low frequency variants.

We therefore also simulated 10 datasets assuming recent exponential growth using model parameters similar to those estimated by Gravel et al. (2011). MS (Hudson, 2002) was again used to simulate gene genealogies for a sample of 50 sequences of length 20 kb. We assumed that the present day population size was 1 million and that the growth rate was 0.45% for 1000 generations (approximately 25,000 years), at which point the population size was assumed to be constant. We used a 10% MAF cutoff, as in our other simulations and, because exponential growth will cause an increase in rare variants, we also examined the impact of removing only singleton SNPs. Lower frequency variants provide information about the ancestry near the tips of the tree and we therefore expect that using higher MAF cutoffs will sacrifice some information about the relationships at the tips of the tree. Due to run time constraints, we also limited the total number of SNPs per dataset to 30 by randomly sampling from the set of SNPs meeting the MAF threshold; this restriction was not required for the datasets with the 10% MAF threshold.

For each dataset, we ran sampletrees three times for 100 million iterations. Figure 9 shows scatterplots, by pair of runs, of the estimated posterior probabilities of each individual’s true haplotype configuration for all individuals in the 10 datasets at the two MAF cutoffs. As in Figure 4, the majority of individuals have similar posterior probability estimates across the three runs. In addition, the results are quite concordant for both MAF cutoffs.

Figure 9

Scatterplots of the estimated posterior probabilities of the true haplotype configuration for individuals in all datasets. Each plot in the matrix compares estimates for two of three runs. The red line indicates identical estimates in the two runs being compared. (A) Only remove singletons (B) MAF cutoff of 10%.

Figure 10 shows the maximum absolute difference in posterior probability estimates across the three runs for the two MAF criteria. The difference in estimates, a diagnostic for convergence, shows a similar pattern to the simulations without population growth. Without growth and with a 10% MAF cutoff, the proportion of differences >0.1 was 4% (Section 3.2.1). With growth, the proportion of differences >0.1 was 4% when only singletons were removed. However, the proportion of differences >0.1 was 8% for the 10% MAF cutoff, indicating slower convergence and the need for longer run lengths. The simulations with growth do not have more individuals with extremely high differences between runs; with the analysis with no singletons the maximum absolute difference was <0.22 and for the analysis with a 10% MAF cutoff, only one difference was >0.25. Therefore, under exponential growth we see only minor evidence of slower convergence relative to the simulations with a constant population size. Interestingly, the results with the 10% MAF cutoff are worse than those with the singleton cutoff. One possible explanation is that the low frequency SNPs provide additional information about the tip haplotypes and the genealogy. In our simulations with roughly similar numbers of SNPs, more heterozygosity and hence haplotype uncertainty would be expected when SNPs are ascertained with a 10% MAF cutoff than with a singleton cutoff.

Figure 10

Histogram of the maximum absolute difference between estimates of posterior probability of true configuration (A) when only singleton SNPs are removed from the dataset (B) when the MAF cutoff is set to 10%. Note that the y-axis of each histogram has again been scaled in order to better compare the frequencies in the 0.1–1 range.

We also compared results to PHASE, and found that the differences between sampletrees and PHASE estimates are higher under simulations with exponential growth. For the simulations without growth, 7% of individuals had a difference in estimates >0.1 (Section 3.2.2); for the simulations with growth the proportion of individuals was 16% and 22% for the analysis excluding singletons and the 10% cut-off, respectively. As with the results under constant population size described in Section 3.2.2, the larger difference between estimates can be explained by the increased haplotype diversity. Sampletrees is visiting many more haplotype configurations in the simulations with growth than in the simulations without growth. The increased haplotype configurations are reflected by the size of the set of possible haplotypes, which grows when the allele swap update visits new haplotype configurations. Under growth, and a MAF cut-off of 10%, the median factor by which the size of the set of possible haplotypes increases is 13. For comparison, under no growth, the median factor is 7 for datasets of approximately the same size (25 SNPs and 25 individuals).

To summarize, increasing the number of SNPs and/or number of individuals in the dataset, or including population growth can result in larger differences between posterior probability estimates than with smaller datasets involving fewer SNPs and/or fewer individuals sampled from a population of constant size. However, the differences in estimates are not that large and increasing the number of MCMC samples should be sufficient to improve results, as seen in Section 3.2.1. We also found that increasing the number of SNPs has a larger effect on convergence than increasing the sample size. This is reassuring since typical datasets have larger sample sizes than the ones examined here and since we do not expect improved performance with larger regions because recombination events limit the amount of sequence that is tracked on the tree. Finally, including SNPs with lower minor allele frequencies also improved mixing in our simulations, as shown by the increased concordance in posterior probability estimates across the three runs for the exponential growth scenario. In our simulations, the lower MAF SNPs are expected to provide more information about the tips of the genealogic tree and the phasing of haplotypes than the higher MAF SNPs. Therefore, the inclusion of lower MAF SNPs may provide a helpful restriction on the set of trees that are likely given the genotype data.

References

Andrés, A. M., A. G. Clark, L. Shimmin, E. Boerwinkle, C. F. Sing and J. E. Hixson (2007): “Understanding the accuracy of statistical haplotype inference with sequence data of known phase,” Genetic. Epidemiol., 31, 659–671.Search in Google Scholar

Beaumont, M. (2010): “Approximate Bayesian computation in evolution and ecology,” Annu. Rev. Ecol. Evol. Syst., 41, 379–406.Search in Google Scholar

Burkett, K. M. (2011): Markov chain Monte Carlo sampling of gene genealogies conditional on observed genetic data, Ph.D. thesis, Simon Fraser University, Burnaby, Canada.Search in Google Scholar

Burkett, K. M., B. McNeney and J. Graham (2013): A Markov chain Monte Carlo sampler for gene genealogies conditional on haplotype data. In: Chaubney, Y. (Ed.), Some Recent Advances in Mathematics and Statistics, Proceedings of Statistics 2011 Canada/IMST 2011-FIM XX, Montreal, July 2011. Singapore: World Scientific Publishing, pp. 29–44.10.1142/9789814417983_0003Search in Google Scholar

Clark, A. G. (1990): “Inference of haplotypes from PCR-amplified samples of diploid populations,” Mol. Biol. Evol., 7, 111–122.Search in Google Scholar

Geyer, C. J. and E. A. Thompson (1995): “Annealing Markov chain Monte Carlo with applications to ancestral inference,” J. Am. Stat. Assoc., 90, 909–920.Search in Google Scholar

Gravel, S., B. M. Henn, R. N. Gutenkunst, A. R. Indap, G. T. Marth, A. G. Clark, F. Yu, T. Gibbs, Richard A. and C. D. Bustamante (2011): “Demographic history and rare allele sharing among human populations,” Proc. Natl. Acad. Sci., 108, 11983–11988.Search in Google Scholar

Griffiths, R. C. and P. Marjoram (1997): “An ancestral recombination graph.” In: Donnelly, P. and S. Tavaré (Eds.). Progress in Population Genetics and Human Evolution, Heidelberg, Germany: Springer-Verlag, pp. 257–270.10.1007/978-1-4757-2609-1_16Search in Google Scholar

Hubert, L. and P. Arabie (1985): “Comparing partitions,” J. Classif., 2, 193–218.Search in Google Scholar

Hudson, R. R. (1983): “Properties of a neutral allele model with intragenic recombination,” Theor. Popul. Biol., 23, 183–201.Search in Google Scholar

Hudson, R. R. (2002): “Generating samples under a Wright-Fisher neutral model of genetic variation,” Bioinformatics, 18, 337–338.10.1093/bioinformatics/18.2.337Search in Google Scholar PubMed

Igo Jr, R. P., J. Li and K. A. Goddard (2009): “Association mapping by generalized linear regression with density-based haplotype clustering,” Genet. Epidemiol., 33, 16–26.Search in Google Scholar

Keinan, A. and A. G. Clark (2012): “Recent explosive human population growth has resulted in an excess of rare genetic variants,” Science, 336, 740–743.10.1126/science.1217283Search in Google Scholar PubMed PubMed Central

Kingman, J. F. C. (1982): “The coalescent,” Stoch. Proc. Appl., 13, 235–248.Search in Google Scholar

Kuhner, M. K. (2006): “LAMARC 2.0: Maximum likelihood and Bayesian estimation of population parameters,” Bioinformatics, 22, 768–770.Search in Google Scholar

Kuhner, M. K. (2009): “Coalescent genealogy samplers: windows into population history,” Trends Ecol. Evol., 24, 86–93.Search in Google Scholar

Kuhner, M. K. and J. Felsenstein (2000): “Sampling among haplotype resolutions in a coalescent-based genealogy sampler,” Genet. Epidemiol., 19(Suppl 1), S15–S21.Search in Google Scholar

Lin, D. Y. and B. E. Huang (2007): “The use of inferred haplotypes in downstream analyses,” Am. J. Hum. Genet., 80, 577–579.Search in Google Scholar

Marchini, J., D. Cutler, N. Patterson, M. Stephens, E. Eskin, E. Halperin, S. Lin, Z. S. Qin, H. M. Munro, G. R. Abecasis and P. Donnelly (2006): “A comparison of phasing algorithms for trios and unrelated individuals,” Am. J. Hum. Genet., 78, 437–450.Search in Google Scholar

McVean, G. A. and N. J. Cardin (2005): “Approximating the coalescent with recombination,” Philos. Trans. R. Soc. Lond., B, Biol. Sci., 360, 1387–1393.Search in Google Scholar

McVean, G. A. T., S. R. Myers, S. Hunt, P. Deloukas, D. R. Bentley and P. Donnelly (2004): “The fine-scale structure of recombination rate variation in the human genome,” Science 304, 581–584.10.1126/science.1092500Search in Google Scholar PubMed

Mensah, F. K., M. S. Gilthorpe, C. F. Davies, L. J. Keen, P. J. Adamson, E. Roman, G. J. Morgan, J. L. Bidwell and G. R. Law (2007): “Haplotype uncertainty in association studies,” Genet. Epidemiol., 31, 348–357.Search in Google Scholar

Minichiello, M. J. and R. Durbin (2006): “Mapping trait loci by use of inferred ancestral recombination graphs,” Am. J. Hum. Genet., 79, 910–922.Search in Google Scholar

Morris, A. P., J. C. Whittaker and D. J. Balding (2004): “Little loss of information due to unknown phase for fine-scale linkage-disequilibrium mapping with single-nucleotide-polymorphism genotype data,” Am. J. Hum. Genet., 74, 945–953.10.1086/420773Search in Google Scholar PubMed PubMed Central

Nelson, M. R., D. Wegmann, M. G. Ehm, D. Kessner, P. St. Jean, C. Verzilli, J. Shen, Z. Tang, S.-A. Bacanu, D. Fraser, L. Warren, J. Aponte, M. Zawistowski, X. Liu, H. Zhang, Y. Zhang, J. Li, Y. Li, L. Li, P. Woollard, S. Topp, M. D. Hall, K. Nangle, J. Wang, G. Abecasis, L. R. Cardon, S. Zllner, J. C. Whittaker, S. L. Chissoe, J. Novembre and V. Mooser (2012): “An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people,” Science, 337, 100–104.10.1126/science.1217876Search in Google Scholar PubMed PubMed Central

Paradis, E., J. Claude and K. Strimmer (2004): “APE: analyses of phylogenetics and evolution in R language,” Bioinformatics, 20, 289–290.10.1093/bioinformatics/btg412Search in Google Scholar PubMed

Robinson, D. and L. Foulds (1981): “Comparison of phylogenetic trees,” Math. Biosci., 53, 131–147.Search in Google Scholar

Stephens, M. (2003): “Inference under the coalescent.” In: D. J. Balding, M. Bishop, and C. Cannings, (Eds.), Handbook of Statistical Genetics, Chichester, England: John Wiley and Sons, 2nd edition, pp. 636–661.Search in Google Scholar

Stephens, M., N. J. Smith and P. Donnelly (2001): “A new statistical method for haplotype reconstruction from population data,” Am. J. Hum. Genet., 68, 978–989.Search in Google Scholar

Zöllner, S. and J. K. Pritchard (2005): “Coalescent-based association mapping and fine mapping of complex trait loci,” Genetics, 169, 1071–1092.10.1534/genetics.104.031799Search in Google Scholar PubMed PubMed Central

Published Online: 2013-08-17

Published in Print: 2013-10-01

Markov chain Monte Carlo sampling of gene genealogies conditional on unphased SNP genotype data

Abstract

Appendix A: Similarity scores in the dictionary rephaser

Appendix B: Additional simulations

B.1 Performance with larger datasets

B.2 Performance with populations undergoing exponential growth

References

Journal and Issue

Articles in the same Issue