A non-zero variance of Tajima’s estimator for two sequences even for infinitely many unlinked loci

https://doi.org/10.1016/j.tpb.2017.03.002Get rights and content

Abstract

The population-scaled mutation rate, θ, is informative on the effective population size and is thus widely used in population genetics. We show that for two sequences and n unlinked loci, the variance of Tajima’s estimator (θˆ), which is the average number of pairwise differences, does not vanish even as n. The non-zero variance of θˆ results from a (weak) correlation between coalescence times even at unlinked loci, which, in turn, is due to the underlying fixed pedigree shared by gene genealogies at all loci. We derive the correlation coefficient under a diploid, discrete-time, Wright–Fisher model, and we also derive a simple, closed-form lower bound. We also obtain empirical estimates of the correlation of coalescence times under demographic models inspired by large-scale human genealogies. While the effect we describe is small (Varθˆθ2ONe1), it is important to recognize this feature of statistical population genetics, which runs counter to commonly held notions about unlinked loci.

Introduction

The population-scaled mutation rate, θ, is defined as 4Neμ, where Ne is the effective population size and μ is the mutation rate per locus per generation (Wakeley, 2009). Two classic estimators were developed for θ, Watterson’s (based on the number of segregating sites Watterson, 1975) and Tajima’s (based on the average number of pairwise differences Tajima (1983), Tajima (1989)). For a single pair of sequences, both estimators are identical (denoted here as θˆ) and equal to the number of differences between the sequences.

Increasing the number of sampled individuals has limited ability to improve these estimates of θ, because shared ancestry reduces the number of independent branches on which mutations can arise (Rosenberg and Nordborg, 2002). Felsenstein (2006) showed that the variance of maximum likelihood estimates of θ decreases approximately logarithmically with the number of individuals sampled. In contrast, the variance decreases inversely with the number of independent loci. Thus, to increase the accuracy of estimates of θ, it is generally more effective to increase the number of independent loci than the sample size at each locus (see also e.g., Edwards “bibausep Beerli (2000), Pluzhnikov “bibausep Donnelly (1996) and references within).

Consider a set of n unlinked loci located on different (non-homologous) chromosomes. We show here that even as n, the variance of the resulting estimate of θ does not converge to zero, in contrast to what we may have naïvely assumed. This behavior results from the fact that coalescence times, even at unlinked loci, are in fact weakly correlated, due to the sharing of the same fixed underlying pedigree across all loci (Wakeley et al., 2012). By conditioning on the number of shared genealogical common ancestors, we derive a simple approximate lower bound, as a function of Ne, on the variance of θˆ (Sections 2 The relation of the variance of, 3 Modeling the effect of the shared pedigree).

Unlinked loci may also be sampled from the same chromosome, separated by an infinitely high recombination rate. The correlation of coalescence times in such a case is higher, as the two loci may travel together for the first few generations. Therefore, the extent of the correlation, and thereby, the variance of θˆ, also depend on the sampling configuration. In Section 4, we derive the correlation coefficient analytically, as a function of the configuration and the effective population size, using a diploid discrete time Wright–Fisher model (DDTWF). This model is an extension of the haploid DTWF model, previously advocated by Bhaskar et al. (2014) for the study of large samples from finite populations.

Our results for the variance of θˆ were obtained under the Wright–Fisher demographic model. To shed light on the variance of θˆ under more realistic demographic models, in Section 5 we run simulations based on real, large-scale human genealogical data (Kaplanis et al., 2017). The pedigrees inspired by different human populations differ from each other and from the Wright–Fisher pedigrees in a number of ways, for example in the variance of the relatedness of any two randomly chosen individuals. These differences lead to differences in the variance of θˆ for each population, even if they have the same effective population size. Finally, we study some properties of linked sites in Section 6.

We note that the dependence of gene genealogies at unlinked loci has been previously recognized, most recently in the context of matching probabilities. Specifically, the probability of the genotypes of two individuals to match at two or more loci was computed under the Wright–Fisherand other models, and shown to differ from the product of the corresponding one-locus probabilities Laurie “bibausep Weir (2003), Song “bibausep Slatkin (2007), Bhaskar “bibausep Song (2009). In earlier literature, this effect was demonstrated in the context of identity-by-descent probabilities at unlinked loci (Weir and Cockerham, 1969) and implicitly in results on linkage disequilibrium (Ohta and Kimura, 1969). However, the treatment of this effect in the context of effective population size estimation is to our knowledge new.

Section snippets

The relation of the variance of θˆ to the correlation of the coalescence times

For a sample of size two (haploids) at n loci, the estimator of θ can be expressed as θˆ(n)=1ni=1nθˆi,where θˆi is the number of differences at locus i. If we assume the loci are exchangeable, we have: Varθˆ(n)=Varθˆin+n1nCovθˆi,θˆj;ij.

Under the standard coalescent model (Wakeley, 2009), θiˆ is Poisson distributed with mean 2μTi, where Ti is the time until coalescence at locus i in generations and μ is the mutation rate per locus per generation. Using the law of total

Modeling the effect of the shared pedigree

In this section, we study the role of the shared underlying pedigree in the non-zero variance of θˆ. We first provide a formal derivation of the statistical inconsistency of θˆ, followed by an intuitive derivation of an approximate lower bound. Exact calculations appear in Section 4.

Exact results for the correlation of the coalescence times at unlinked loci

In this section, we provide an exact derivation of the correlation of coalescence times at unlinked loci under a diploid, discrete-time, Wright–Fisher model. Further, we consider multiple sampling configurations for those loci, as explained below.

Wright–Fisher simulations

In this section, we use simulation of the 2-sex diploid, discrete-time Wright–Fisher model to support our analytical results from Section 3.2. To estimate the correlation coefficient of the coalescence times at two loci, we first simulate many Wright–Fisher pedigrees. We then sample, for each pedigree, two individuals from the current generation. We set the population size Ne to be the same in every generation, with equal numbers of males and females. We then consider two loci on non-homologous

Linked sites and model comparisons

We have so far only studied unlinked sites; however, our analytical results for the DDTWF models can be relatively easily extended to the case of linked loci. Such an extension is important, since, for example, the covariance of coalescence times at two loci is directly related to the r2 measure of linkage disequilibrium (McVean, 2002). Quantifying the behavior of different models in terms of the covariance of coalescence times can provide insight into the importance of certain modeling

Summary and discussion

Previous studies of estimators of θ using data from a single locus have revealed properties rather different from classical statistical results for independent samples, due to the non-independence of samples exerted by their shared gene genealogy. In particular, Tajima (1983) demonstrated that the average number of pairwise differences is an inconsistent estimator of θ as the sample size at the locus tends to infinity, and Joyce (1999) showed that there is no linear unbiased estimator using

References (31)

  • Bittles, A.H., Black, M.L., 2015. Global patterns and tables of consanguinity. URL...
  • ChangJ.T.

    Recent common ancestors of all present-day individuals

    Adv. Appl. Probab.

    (1999)
  • EdwardsS.V. et al.

    Perspective: gene divergence, population divergence, and the variance in coalescence time in phylogeographic studies

    Evolution

    (2000)
  • FelsensteinJ.

    Accuracy of coalescent likelihood estimates: do we need more sites, more sequences, or more loci?

    Mol. Biol. Evol.

    (2006)
  • GriffithsR. et al.

    An ancestral recombination graph

  • Cited by (17)

    View all citing articles on Scopus
    View full text