A sequentially Markov conditional sampling distribution for structured populations with migration and recombination
Introduction
Under a given population genetic model, the conditional sampling distribution (CSD), also called a copying model by some authors, describes the probability that an additionally sampled haplotype is of a certain type, given that a collection of haplotypes has already been observed. As described below, various applications in population genomics make use of the CSD. Although the CSD is of much importance, no exact closed-form expressions are known in the situations to which it has been applied, and so a number of approximations have been proposed.
Following the seminal work of Stephens and Donnelly (2000) and Fearnhead and Donnelly (2001), Li and Stephens (2003) proposed a widely used CSD that models the additionally observed haplotype as an imperfect mosaic of the haplotypes already observed. The underlying model can be cast as a hidden Markov model (HMM), thus admitting efficient implementation. In their paper, Li and Stephens used the CSD in a pseudo-likelihood framework to estimate fine-scale recombination rates, and subsequently the Li and Stephens CSD and its extensions have been used in numerous other population genetic applications, including estimating gene-conversion parameters (Gay et al., 2007, Yin et al., 2009), phasing genotype sequence data into haplotype sequence data, and imputing missing data (Stephens and Scheet, 2005, Li and Abecasis, 2006, Li et al., 2010, Marchini et al., 2007, Howie et al., 2009).
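To make the HMM formulation concrete, the following is a minimal sketch of a forward-algorithm evaluation for a copying model in the spirit of Li and Stephens (2003). It is a simplification, not their exact parameterization: the hidden state at each locus is the index of the copied haplotype, and the two parameters `switch_prob` (probability of re-drawing the copied haplotype uniformly between adjacent loci) and `mismatch_prob` (probability that a bi-allelic copied allele is altered) are illustrative assumptions standing in for the recombination and mutation terms.

```python
import numpy as np


def copying_model_log_prob(new_hap, panel, switch_prob, mismatch_prob):
    """Log-probability of `new_hap` under a simplified copying HMM.

    new_hap: sequence of 0/1 alleles, one per locus.
    panel:   2-D array-like of 0/1 alleles, one row per observed haplotype.
    switch_prob, mismatch_prob: illustrative parameters (assumptions).
    """
    new_hap = np.asarray(new_hap)
    panel = np.asarray(panel)
    if panel.ndim != 2 or new_hap.shape != (panel.shape[1],):
        raise ValueError("panel must be (n, L) and new_hap must have length L")
    n, num_loci = panel.shape

    def emission(j):
        # Probability of the observed allele given each possible copied haplotype.
        return np.where(panel[:, j] == new_hap[j], 1.0 - mismatch_prob, mismatch_prob)

    # Uniform initial distribution over which haplotype is copied at locus 0.
    forward = emission(0) / n
    log_prob = 0.0
    for j in range(1, num_loci):
        # Rescale to avoid underflow, accumulating the scale in log space.
        scale = forward.sum()
        log_prob += np.log(scale)
        forward = forward / scale
        # Keep copying the same haplotype, or re-draw one uniformly at random.
        forward = ((1.0 - switch_prob) * forward + switch_prob / n) * emission(j)
    log_prob += np.log(forward.sum())
    return float(log_prob)
```

Because the transition has this "stay or re-draw uniformly" form, each forward step costs time linear in the number of observed haplotypes rather than quadratic, which is what makes mosaic models of this kind efficient for long sequences.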
Another important application of the CSD that has received much attention is the inference of population structure and demography. Hellenthal et al. (2008) employed the Li and Stephens CSD to model human colonization history as a sequence of founder events and estimated the order of the founding events, as well as the relative contribution of different founding populations during the events. To estimate the splitting time of two populations, Davison et al. (2009) modified this CSD to incorporate the split into the copying model, and used the same pseudo-likelihood framework as Li and Stephens (2003) to estimate the time of splitting. In a more recent study, Lawson et al. (2012) applied it to a sample of DNA sequences and used properties of the inferred mosaic pattern to reveal structure in the underlying population.
To handle admixture, a modification to the Li and Stephens CSD was introduced by Price et al. (2009), who assumed that the previously observed haplotypes in the CSD are from two distinct ancestral populations (e.g., African and European). In modeling the mosaic pattern for a haplotype sampled from the admixed population (e.g., African American), it is then assumed more likely that adjacent segments originate from the same ancestral population than from two different ancestral populations. Price et al. applied this modified copying model to detect chromosomal segments of distinct ancestry in admixed individuals and estimated admixture fractions in recently admixed populations. The same model was applied by Wegmann et al. (2011), who used the inferred ancestry switch-points to estimate relative recombination rates between different populations.
As discussed above, the CSD of Li and Stephens (2003) is very useful and has a variety of applications, but although it was motivated by principles underlying the coalescent process, it was not derived from them. To derive CSDs in a principled way, De Iorio and Griffiths (2004a) introduced a general approximation technique based on the diffusion process dual to the coalescent; this work was first presented in the case of a single locus and a panmictic population, but in a companion paper (De Iorio and Griffiths, 2004b) the authors applied the method to the case of a subdivided population with migration. Griffiths et al. (2008) extended the diffusion approximation technique to handle recombination in the special case of two loci with parent-independent mutation at each locus, and Paul and Song (2010) later generalized the framework to an arbitrary number of loci and an arbitrary finite-alleles mutation model.
Although more accurate than the CSDs developed by Fearnhead and Donnelly (2001) and by Li and Stephens (2003), the CSD proposed by Paul and Song (2010) is not amenable to efficient evaluation. More precisely, it can be computed by solving a recursion that becomes intractable for a large number of loci. However, utilizing ideas related to the sequentially Markov coalescent (SMC) (Wiuf and Hein, 1999, McVean and Cardin, 2005, Marjoram and Wall, 2006), which is a simplified genealogical process that captures the essential features of the full coalescent model with recombination, we (Paul et al., 2011) recently developed an approximation to the Paul and Song CSD that could be cast as an HMM with a continuous hidden state space. Furthermore, upon discretizing this continuous state space, we obtained an accurate approximation with computational efficiency comparable to the CSDs of Fearnhead and Donnelly (2001) and Li and Stephens (2003).
In this paper, we extend our previous work on the sequentially Markov CSD to incorporate subdivided population structure with migration. Following Paul and Song (2010), we describe a genealogical process for an additionally sampled haplotype conditioned on the genealogy of already observed haplotypes. We present a recursion that can be used to compute the probability of the additionally sampled haplotype, but, as in Paul and Song (2010), solving this recursion is tractable only for a small number of loci. As in Paul et al. (2011), we apply the sequentially Markov framework to the conditional genealogical process with migration and recombination, and obtain an accurate approximation that facilitates computation for a large number of loci. As a concrete application, we demonstrate empirically that our new CSD can be employed in various pseudo-likelihoods to produce accurate estimation of a wide range of migration rates.
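One familiar way to build a pseudo-likelihood from a CSD is the product of approximate conditionals used by Li and Stephens (2003): the haplotypes are taken in some order, and each is scored by the CSD given those before it. The sketch below illustrates that construction together with a simple grid search over a parameter. The function names are hypothetical, and the toy single-locus CSD is only a stand-in to make the example runnable: it is the standard two-allele parent-independent-mutation formula, its parameter is a mutation rate rather than a migration rate, and it is not the CSD developed in this paper.

```python
import math


def pac_log_likelihood(haplotypes, csd_log_prob, theta):
    """Sum over k of log CSD(h_k | h_1, ..., h_{k-1}; theta).

    `csd_log_prob(hap, previous, theta)` is any user-supplied CSD
    (a hypothetical interface chosen for this sketch).
    """
    total = 0.0
    for k, hap in enumerate(haplotypes):
        total += csd_log_prob(hap, haplotypes[:k], theta)
    return total


def grid_argmax(haplotypes, csd_log_prob, grid):
    """Return the grid value that maximizes the pseudo-likelihood."""
    return max(grid, key=lambda theta: pac_log_likelihood(haplotypes, csd_log_prob, theta))


def toy_csd_log_prob(allele, previous, theta):
    """Toy stand-in CSD: one bi-allelic locus, parent-independent mutation.

    Probability of `allele` given `previous` is (n_same + theta/2) / (n + theta),
    which requires theta > 0.
    """
    n = len(previous)
    n_same = sum(1 for a in previous if a == allele)
    return math.log((n_same + theta / 2.0) / (n + theta))
```

Since the value of such a pseudo-likelihood generally depends on the order in which haplotypes are conditioned upon, implementations commonly average over several orderings; the sketch omits this for brevity.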
The remainder of this paper is organized as follows. In Section 2, we introduce the notation adopted throughout the paper and describe the relevant population genetic model, the coalescent with recombination and migration. We then describe the genealogical interpretation of our CSD in Section 3 and introduce several approximations in Section 4 to obtain a CSD for which computation is tractable. In Section 5, we demonstrate the applicability of our CSD by applying it to the estimation of migration rates from simulated data. Finally, we conclude in Section 6 with a discussion of further applications and extensions of the CSD developed herein to estimate demographic parameters in more complex scenarios.
Background
In this section, we briefly describe how migration is integrated into the coalescent with recombination, and recall the CSD proposed by Paul and Song (2010), which we extend to incorporate migration in the following section. We begin by defining some general notation that will be used throughout.
A new CSD for structured populations with recombination and migration
We now introduce an approximate CSD by extending the genealogical process of Section 2.3 to a general structured population with . Suppose that conditioned on having already observed a structured sample configuration , we wish to sample additional haplotypes with of them in each deme . As before, given the true fully-specified genealogy for the conditional configuration , including migration events, it is possible to sample a conditional genealogy for the additional
An efficiently computable CSD as an approximation of
As described above, the recursion for the CSD introduced in the previous section becomes computationally intractable for even modest datasets. In what follows, we adopt a set of approximations to obtain a CSD that admits efficient implementation while retaining the accuracy of the original.
Application: estimating migration rates
To demonstrate the utility of our approximate CSD , we considered estimating migration rates for data simulated under the full coalescent with recombination and migration. In particular, we simulated data for bi-allelic loci. For simplicity, we set and for all , and for all . We considered a structured population with two demes (i.e., ), and set and . For each value of , we generated
Discussion
Numerous applications in population genomics make use of the conditional sampling distribution, so developing accurate, efficiently computable CSDs for various population genetic models is of much interest. Recently, we proposed an accurate sequentially Markov CSD that follows from approximating the diffusion process dual to the coalescent with recombination for a single panmictic population. In this paper, we have extended that approach to incorporate subdivided population structure with
Acknowledgments
We thank John Kamm for many stimulating and fruitful discussions. This research is supported in part by a DFG Research Fellowship STE2011/1-1 to MS; an NIH National Research Service Award Trainee appointment on T32-HG00047 to JSP; and an NIH grant R01-GM094402, and a Packard Fellowship for Science and Engineering to YSS.
References
- et al. An approximate likelihood for genetic data under a model with recombination and population splitting. Theor. Popul. Biol. (2009)
- et al. Can one learn history from the allelic spectrum? Theor. Popul. Biol. (2008)
- et al. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am. J. Hum. Genet. (2005)
- et al. Recombination as a point process along sequences. Theor. Popul. Biol. (1999)
- et al. Measures of divergence between populations and the effect of forces that reduce variability. Mol. Biol. Evol. (1998)
- et al. Importance sampling on coalescent histories. I. Adv. in Appl. Probab. (2004)
- et al. Importance sampling on coalescent histories. II: subdivided population models. Adv. in Appl. Probab. (2004)
- et al. Estimating recombination rates from population genetic data. Genetics (2001)
- et al. Estimating meiotic gene conversion rates from population genetic data. Genetics (2007)
- Demographic history and rare allele sharing among human populations. Proc. Natl. Acad. Sci.
- Importance sampling and the two-locus model with subdivided population structure. Adv. in Appl. Probab.
- An ancestral recombination graph.
- Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet.
- Inferring human colonization history using a copying model. PLoS Genet.
- The structured coalescent.
- A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet.