Theoretical Population Biology

Volume 87, August 2013, Pages 51-61
A sequentially Markov conditional sampling distribution for structured populations with migration and recombination

https://doi.org/10.1016/j.tpb.2012.08.004

Abstract

Conditional sampling distributions (CSDs), sometimes referred to as copying models, underlie numerous practical tools in population genomic analyses. Though an important application that has received much attention is the inference of population structure, the explicit exchange of migrants at specified rates has not hitherto been incorporated into the CSD in a principled framework. Recently, in the case of a single panmictic population, a sequentially Markov CSD has been developed as an accurate, efficient approximation to a principled CSD derived from the diffusion process dual to the coalescent with recombination. In this paper, the sequentially Markov CSD framework is extended to incorporate subdivided population structure, thus providing an efficiently computable CSD that admits a genealogical interpretation related to the structured coalescent with migration and recombination. As a concrete application, it is demonstrated empirically that the CSD developed here can be employed to yield accurate estimation of a wide range of migration rates.

Introduction

Under a given population genetic model, the conditional sampling distribution (CSD), also called a copying model by some authors, describes the probability that an additionally sampled haplotype is of a certain type, given that a collection of haplotypes has already been observed. As described below, various applications in population genomics make use of the CSD. Although the CSD is of much importance, no exact closed-form expressions are known in the situations to which it has been applied, and so a number of approximations have been proposed.
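The definition above can be written out as a ratio of sampling probabilities. The notation is illustrative and not taken from the paper: $q(\cdot)$ is assumed to denote the probability of an ordered sample of haplotypes $h_1, \dots, h_n$ under the model.

```latex
% CSD of one additional haplotype, by the definition of conditional probability
\pi(h_{n+1} \mid h_1, \dots, h_n)
  = \frac{q(h_1, \dots, h_n, h_{n+1})}{q(h_1, \dots, h_n)}

% Chain rule: the full sampling probability factors into a product of CSDs
q(h_1, \dots, h_n)
  = \prod_{k=1}^{n} \pi(h_k \mid h_1, \dots, h_{k-1})
```

Both identities are exact. The difficulty is that $q$ itself is generally unavailable in closed form, which is why each factor is replaced by an approximate CSD in practice.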

Following the seminal work of Stephens and Donnelly (2000) and Fearnhead and Donnelly (2001), Li and Stephens (2003) proposed a widely used CSD, denoted $\hat{\pi}_{\mathrm{LS}}$, which models the additionally observed haplotype as an imperfect mosaic of the haplotypes already observed. The model underlying $\hat{\pi}_{\mathrm{LS}}$ can be cast as a hidden Markov model (HMM), thus admitting efficient implementation. In their paper, Li and Stephens used the CSD $\hat{\pi}_{\mathrm{LS}}$ in a pseudo-likelihood framework to estimate fine-scale recombination rates, and subsequently $\hat{\pi}_{\mathrm{LS}}$ and its extensions have been used in numerous other population genetic applications, including estimating gene-conversion parameters (Gay et al., 2007; Yin et al., 2009), and phasing genotype sequence data into haplotype sequence data and imputing missing data (Stephens and Scheet, 2005; Li and Abecasis, 2006; Li et al., 2010; Marchini et al., 2007; Howie et al., 2009).
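The mosaic idea can be illustrated with a forward-algorithm sketch. This is a simplified copying model and not the exact parameterization of Li and Stephens (2003): the function name and the arguments `switch_prob` and `miscopy_prob` are hypothetical, loci are assumed bi-allelic (coded 0/1), and the per-locus switch and miscopy probabilities are passed in directly rather than derived from recombination and mutation rates.

```python
def li_stephens_forward(panel, new_hap, switch_prob, miscopy_prob):
    """Probability of new_hap as an imperfect mosaic of the haplotypes in panel.

    panel        : list of observed haplotypes (lists of 0/1 alleles)
    new_hap      : the additionally sampled haplotype (list of 0/1 alleles)
    switch_prob  : assumed per-locus probability of re-choosing the copied haplotype
    miscopy_prob : assumed per-locus probability of copying the wrong allele
    """
    n = len(panel)

    def emission(hap, locus):
        # Copy the allele faithfully unless a miscopy occurs.
        if hap[locus] == new_hap[locus]:
            return 1.0 - miscopy_prob
        return miscopy_prob

    # Hidden state: which panel haplotype is copied. Start uniformly.
    forward = [emission(hap, 0) / n for hap in panel]

    for locus in range(1, len(new_hap)):
        total = sum(forward)
        # Keep copying the same haplotype, or switch and re-choose uniformly.
        forward = [
            ((1.0 - switch_prob) * f + switch_prob * total / n) * emission(hap, locus)
            for f, hap in zip(forward, panel)
        ]

    return sum(forward)
```

Because the switch step re-chooses uniformly among all panel haplotypes, each locus costs time linear in the panel size, which is what makes HMM-based copying models efficient.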

Another important application of the CSD that has received much attention is the inference of population structure and demography. Hellenthal et al. (2008) employed $\hat{\pi}_{\mathrm{LS}}$ to model human colonization history as a sequence of founder events and estimated the order of the founding events, as well as the relative contribution of different founding populations during the events. To estimate the splitting time of two populations, Davison et al. (2009) modified $\hat{\pi}_{\mathrm{LS}}$ to incorporate the split into the copying model, and used the same pseudo-likelihood framework as Li and Stephens (2003). In a more recent study, Lawson et al. (2012) applied $\hat{\pi}_{\mathrm{LS}}$ to a sample of DNA sequences and used properties of the inferred mosaic pattern to reveal structure in the underlying population.

To handle admixture, a modification to $\hat{\pi}_{\mathrm{LS}}$ was introduced by Price et al. (2009), who assumed that the previously observed haplotypes in the CSD are from two distinct ancestral populations (e.g., African and European). In modeling the mosaic pattern for a haplotype sampled from the admixed population (e.g., African American), it is then assumed more likely that adjacent segments originate from the same ancestral population, rather than from two different ancestral populations. Price et al. applied this modified copying model to detect chromosomal segments of distinct ancestry in admixed individuals and estimated admixture fractions in recently admixed populations. The same model was applied by Wegmann et al. (2011), who used the inferred ancestry switch-points to estimate relative recombination rates between different populations.

As discussed above, $\hat{\pi}_{\mathrm{LS}}$ is a very useful CSD with a variety of applications, but it was not derived from, though it was certainly motivated by, principles underlying the coalescent process. To derive CSDs in a principled way, De Iorio and Griffiths (2004a) introduced a general approximation technique based on the diffusion process dual to the coalescent; this work was first presented in the case of a single locus and a panmictic population, but in a companion paper (De Iorio and Griffiths, 2004b) the authors applied the method to the case of a subdivided population with migration. Griffiths et al. (2008) extended the diffusion approximation technique to handle recombination in the special case of two loci with parent-independent mutation at each locus, and Paul and Song (2010) later generalized the framework to an arbitrary number of loci and an arbitrary finite-alleles mutation model.

Although more accurate than the CSDs developed by Fearnhead and Donnelly (2001) and by Li and Stephens (2003), the CSD $\hat{\pi}_{\mathrm{PS}}$ proposed by Paul and Song (2010) is not amenable to efficient evaluation. More precisely, $\hat{\pi}_{\mathrm{PS}}$ can be computed by solving a recursion that becomes intractable for a large number of loci. However, utilizing ideas related to the sequentially Markov coalescent (SMC) (Wiuf and Hein, 1999; McVean and Cardin, 2005; Marjoram and Wall, 2006), which is a simplified genealogical process that captures the essential features of the full coalescent model with recombination, we (Paul et al., 2011) recently developed an approximation to $\hat{\pi}_{\mathrm{PS}}$ that could be cast as an HMM with continuous hidden state space. Furthermore, upon discretizing this continuous state space, we obtained an accurate approximation with computational efficiency comparable to the CSDs of Fearnhead and Donnelly (2001) and Li and Stephens (2003).
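To give a feel for what discretizing a continuous hidden state space involves, the sketch below splits a time axis into intervals of equal probability mass. It is an illustration only and not necessarily the scheme used in Paul et al. (2011): it assumes the hidden state is a time with an exponential prior, and the function names and the `rate` parameter are hypothetical.

```python
import math


def equal_mass_breakpoints(num_intervals, rate=1.0):
    """Breakpoints 0 = t_0 < t_1 < ... < t_d = inf with equal Exp(rate) mass.

    Assumption: the continuous hidden state is a time with an Exp(rate) prior.
    The j-th breakpoint is the (j / d)-quantile of that distribution.
    """
    interior = [
        -math.log(1.0 - j / num_intervals) / rate
        for j in range(1, num_intervals)
    ]
    return [0.0] + interior + [math.inf]


def interval_masses(points, rate=1.0):
    """Prior probability that the hidden time falls in each interval."""
    def cdf(t):
        return 1.0 - math.exp(-rate * t)

    return [cdf(hi) - cdf(lo) for lo, hi in zip(points, points[1:])]
```

Each interval then acts as one state of a finite HMM, so the number of intervals trades accuracy against running time.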

In this paper, we extend our previous work on the sequentially Markov CSD to incorporate subdivided population structure with migration. Following Paul and Song (2010), we describe a genealogical process for an additionally sampled haplotype conditioned on the genealogy of already observed haplotypes. We present a recursion that can be used to compute the probability of the additionally sampled haplotype, but, as in Paul and Song (2010), solving this recursion is tractable only for a small number of loci. As in Paul et al. (2011), we apply the sequentially Markov framework to the conditional genealogical process with migration and recombination, and obtain an accurate approximation that facilitates computation for a large number of loci. As a concrete application, we demonstrate empirically that our new CSD can be employed in various pseudo-likelihoods to produce accurate estimation of a wide range of migration rates.

The remainder of this paper is organized as follows. In Section 2, we introduce the notation adopted throughout the paper and describe the relevant population genetic model, the coalescent with recombination and migration. We then describe the genealogical interpretation of our CSD in Section 3 and introduce several approximations in Section 4 to obtain a CSD for which computation is tractable. In Section 5, we demonstrate the applicability of our CSD by applying it to the estimation of migration rates from simulated data. Finally, we conclude in Section 6 with a discussion of further applications and extensions of the CSD developed herein to estimate demographic parameters in more complex scenarios.

Background

In this section, we briefly describe how migration is integrated into the coalescent with recombination, and recall the CSD $\hat{\pi}_{\mathrm{PS}}$ proposed by Paul and Song (2010), which we extend to incorporate migration in the following section. We begin by defining some general notation that will be used throughout.

A new CSD $\hat{\pi}_{\mathrm{Mig}}$ for structured populations with recombination and migration

We now introduce an approximate CSD $\hat{\pi}_{\mathrm{Mig}}$ by extending the genealogical process of Section 2.3 to a general structured population with $|\Gamma| \geq 1$. Suppose that conditioned on having already observed a structured sample configuration $n$, we wish to sample $c$ additional haplotypes with $c_\gamma$ of them in each deme $\gamma$. As before, given the true fully-specified genealogy $A_n$ for the conditional configuration $n$, including migration events, it is possible to sample a conditional genealogy $C$ for the $c$ additional

An efficiently computable CSD as an approximation of $\hat{\pi}_{\mathrm{Mig}}$

As described above, the recursion for $\hat{\pi}_{\mathrm{Mig}}(c \mid n)$ becomes computationally intractable for even modest datasets. In what follows, we adopt a set of approximations to obtain a CSD that admits efficient implementation, while retaining the accuracy of $\hat{\pi}_{\mathrm{Mig}}$.

Application: estimating migration rates

To demonstrate the utility of our approximate CSD $\hat{\pi}_{\text{MigSMC-AOD}}$, we considered estimating migration rates for data simulated under the full coalescent with recombination and migration. In particular, we simulated data for $k = 10^4$ bi-allelic loci. For simplicity, we set $\theta_\ell = 5 \times 10^{-2}$ and $P^{(\ell)} = \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}$ for all $\ell \in L$, and $\rho_b = 5 \times 10^{-2}$ for all $b \in B$. We considered a structured population with two demes (i.e., $\Gamma = \{1, 2\}$), and set $\kappa_1 = \kappa_2 = 0.5$ and $m_{12} = m_{21} = m$. For each value of $m \in \{0.01, 0.10, 1.00, 10.0\}$, we generated
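To see why this grid of migration rates spans very different regimes, consider a single ancestral lineage in the symmetric two-deme setting. The sketch below assumes the lineage changes deme at a constant rate `switch_rate` in either direction, so that its deme follows a symmetric two-state continuous-time Markov chain. How the paper's $m$ maps onto that per-lineage rate depends on a scaling convention not given in this excerpt, so equating the two is an assumption made purely for illustration. The names `MIGRATION_GRID` and `same_deme_probability` are hypothetical.

```python
import math

# Migration rates considered in the simulation study.
MIGRATION_GRID = [0.01, 0.10, 1.00, 10.0]


def same_deme_probability(switch_rate, elapsed):
    """P(lineage is in its starting deme after `elapsed` time units).

    Assumption: the lineage switches between the two demes at rate
    `switch_rate` in each direction. For such a symmetric two-state chain
    the transition probability is (1 + exp(-2 * rate * t)) / 2.
    """
    return 0.5 * (1.0 + math.exp(-2.0 * switch_rate * elapsed))


if __name__ == "__main__":
    for m in MIGRATION_GRID:
        # One unit of scaled time is used as an arbitrary reference point.
        print(m, same_deme_probability(m, 1.0))
```

Under this assumption, a lineage at the lowest rate almost never leaves its deme within one time unit, while at the highest rate its location is close to uniform over the two demes.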

Discussion

Numerous applications in population genomics make use of the conditional sampling distribution, so developing accurate, efficiently computable CSDs for various population genetic models is of much interest. Recently, we proposed an accurate sequentially Markov CSD that follows from approximating the diffusion process dual to the coalescent with recombination for a single panmictic population. In this paper, we have extended that approach to incorporate subdivided population structure with

Acknowledgments

We thank John Kamm for many stimulating and fruitful discussions. This research is supported in part by a DFG Research Fellowship STE2011/1-1 to MS; an NIH National Research Service Award Trainee appointment on T32-HG00047 to JSP; and an NIH grant R01-GM094402 and a Packard Fellowship for Science and Engineering to YSS.

References (34)

  • S. Gravel et al. Demographic history and rare allele sharing among human populations. Proc. Natl. Acad. Sci. (2011)
  • R.C. Griffiths et al. Importance sampling and the two-locus model with subdivided population structure. Adv. in Appl. Probab. (2008)
  • R.C. Griffiths et al. An ancestral recombination graph.
  • R.N. Gutenkunst et al. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. (2009)
  • G. Hellenthal et al. Inferring human colonization history using a copying model. PLoS Genet. (2008)
  • H.M. Herbots. The structured coalescent.
  • B.N. Howie et al. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. (2009)