A non-zero variance of Tajima’s estimator for two sequences even for infinitely many unlinked loci
Introduction
The population-scaled mutation rate, $\theta$, is defined as $4N_e\mu$, where $N_e$ is the effective population size and $\mu$ is the mutation rate per locus per generation (Wakeley, 2009). Two classic estimators were developed for $\theta$: Watterson’s (based on the number of segregating sites; Watterson, 1975) and Tajima’s (based on the average number of pairwise differences; Tajima, 1983, 1989). For a single pair of sequences, the two estimators are identical (denoted here as $\hat\theta$) and equal to the number of differences between the sequences.
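The agreement of the two estimators for a single pair can be checked with a minimal sketch. The function names and the toy sequences below are hypothetical and chosen only for illustration; the formulas are the standard ones (segregating sites divided by the harmonic sum $a_n = \sum_{i=1}^{n-1} 1/i$ for Watterson's estimator, and the mean number of pairwise differences for Tajima's).

```python
from itertools import combinations

def hamming(a, b):
    """Number of sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def tajima_theta(seqs):
    """Tajima's estimator: average number of pairwise differences."""
    pairs = list(combinations(seqs, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)

def watterson_theta(seqs):
    """Watterson's estimator: segregating sites divided by a_n."""
    n = len(seqs)
    segregating = sum(len(set(column)) > 1 for column in zip(*seqs))
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating / a_n

# Toy data (hypothetical): two sequences that differ at two sites.
pair = ["ACGTACGTAC", "ACGAACGTTC"]
# A larger toy sample, where the two estimators no longer coincide.
quad = pair + ["ACGAACGTAC", "ACGTACGTAC"]

print("pair:", tajima_theta(pair), watterson_theta(pair))
print("quad:", tajima_theta(quad), watterson_theta(quad))
```

For two sequences, $a_2 = 1$ and there is only one pair, so both functions reduce to the raw count of differences; for larger samples they generally differ.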
Increasing the number of sampled individuals has limited ability to improve these estimates of $\theta$, because shared ancestry reduces the number of independent branches on which mutations can arise (Rosenberg and Nordborg, 2002). Felsenstein (2006) showed that the variance of maximum likelihood estimates of $\theta$ decreases approximately logarithmically with the number of individuals sampled. In contrast, the variance decreases inversely with the number of independent loci. Thus, to increase the accuracy of estimates of $\theta$, it is generally more effective to increase the number of independent loci than the sample size at each locus (see also, e.g., Edwards and Beerli (2000), Pluzhnikov and Donnelly (1996), and references therein).
Consider a set of $n$ unlinked loci located on different (non-homologous) chromosomes. We show here that even as $n \to \infty$, the variance of the resulting estimate of $\theta$ does not converge to zero, contrary to what one might naïvely assume. This behavior arises because coalescence times, even at unlinked loci, are weakly correlated, since all loci share the same fixed underlying pedigree (Wakeley et al., 2012). By conditioning on the number of shared genealogical common ancestors, we derive a simple approximate lower bound, as a function of $N_e$, on the variance of $\hat\theta$ (Sections 2 and 3).
Unlinked loci may also be sampled from the same chromosome, separated by an infinitely high recombination rate. The correlation of coalescence times in such a case is higher, as the two loci may travel together for the first few generations. Therefore, the extent of the correlation, and thereby the variance of $\hat\theta$, also depends on the sampling configuration. In Section 4, we derive the correlation coefficient analytically, as a function of the configuration and the effective population size, using a diploid discrete-time Wright–Fisher model (DDTWF). This model is an extension of the haploid DTWF model, previously advocated by Bhaskar et al. (2014) for the study of large samples from finite populations.
Our results for the variance of $\hat\theta$ were obtained under the Wright–Fisher demographic model. To shed light on the variance of $\hat\theta$ under more realistic demographic models, in Section 5 we run simulations based on real, large-scale human genealogical data (Kaplanis et al., 2017). The pedigrees inspired by different human populations differ from each other and from the Wright–Fisher pedigrees in a number of ways, for example in the variance of the relatedness of any two randomly chosen individuals. These differences lead to differences in the variance of $\hat\theta$ among populations, even when they have the same effective population size. Finally, we study some properties of linked sites in Section 6.
We note that the dependence of gene genealogies at unlinked loci has been previously recognized, most recently in the context of matching probabilities. Specifically, the probability that the genotypes of two individuals match at two or more loci was computed under the Wright–Fisher and other models, and shown to differ from the product of the corresponding one-locus probabilities (Laurie and Weir, 2003; Song and Slatkin, 2007; Bhaskar and Song, 2009). In earlier literature, this effect was demonstrated in the context of identity-by-descent probabilities at unlinked loci (Weir and Cockerham, 1969) and implicitly in results on linkage disequilibrium (Ohta and Kimura, 1969). However, the treatment of this effect in the context of effective population size estimation is, to our knowledge, new.
Section snippets
The relation of the variance of $\hat\theta$ to the correlation of the coalescence times
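For a sample of size two (haploids) at $n$ loci, the estimator of $\theta$ can be expressed as $\hat\theta = \frac{1}{n}\sum_{i=1}^{n} S_i$, where $S_i$ is the number of differences at locus $i$. If we assume the loci are exchangeable, we have:
$$\mathrm{Var}[\hat\theta] = \frac{1}{n}\,\mathrm{Var}[S_1] + \frac{n-1}{n}\,\mathrm{Cov}[S_1, S_2].$$
Under the standard coalescent model (Wakeley, 2009), $S_i$ is Poisson distributed with mean $2\mu T_i$, where $T_i$ is the time until coalescence at locus $i$ in generations and $\mu$ is the mutation rate per locus per generation. Using the law of total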
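The conditioning step for a Poisson number of differences can be sketched as follows. The notation is an assumption, since the original symbols are not preserved in this snippet: $S_i$ denotes the number of differences and $T_i$ the coalescence time at locus $i$, with $S_i \mid T_i \sim \mathrm{Poisson}(2\mu T_i)$. The sketch further assumes that mutations at different loci are conditionally independent given the coalescence times, and it uses the fact that the variance of an average of $n$ exchangeable terms tends to their pairwise covariance.

```latex
\begin{align}
\mathrm{Var}[S_1]
  &= E\big[\mathrm{Var}[S_1 \mid T_1]\big] + \mathrm{Var}\big[E[S_1 \mid T_1]\big]
   = 2\mu\,E[T_1] + 4\mu^2\,\mathrm{Var}[T_1], \\
\mathrm{Cov}[S_1, S_2]
  &= E\big[\mathrm{Cov}[S_1, S_2 \mid T_1, T_2]\big]
   + \mathrm{Cov}\big[E[S_1 \mid T_1],\, E[S_2 \mid T_2]\big]
   = 4\mu^2\,\mathrm{Cov}[T_1, T_2], \\
\lim_{n \to \infty} \mathrm{Var}[\hat\theta]
  &= \mathrm{Cov}[S_1, S_2] = 4\mu^2\,\mathrm{Cov}[T_1, T_2].
\end{align}
```

Under these assumptions, the limiting variance vanishes only if the coalescence times at distinct loci are uncorrelated.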
Modeling the effect of the shared pedigree
In this section, we study the role of the shared underlying pedigree in the non-zero variance of $\hat\theta$. We first provide a formal derivation of the statistical inconsistency of $\hat\theta$, followed by an intuitive derivation of an approximate lower bound. Exact calculations appear in Section 4.
Exact results for the correlation of the coalescence times at unlinked loci
In this section, we provide an exact derivation of the correlation of coalescence times at unlinked loci under a diploid, discrete-time, Wright–Fisher model. Further, we consider multiple sampling configurations for those loci, as explained below.
Wright–Fisher simulations
In this section, we use simulations of the two-sex diploid, discrete-time Wright–Fisher model to support our analytical results from Section 3.2. To estimate the correlation coefficient of the coalescence times at two loci, we first simulate many Wright–Fisher pedigrees. We then sample, for each pedigree, two individuals from the current generation. We set the population size to be the same in every generation, with equal numbers of males and females. We then consider two loci on non-homologous
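A simulation of this kind might be sketched as below. This is not the authors' code: the function name `coalescence_times`, the population size, the number of replicates, and the sampling configuration (one randomly chosen gene copy from each of two distinct individuals at every locus) are all assumptions made for illustration. Each generation draws one mother and one father for every individual, so both loci share the same pedigree, while Mendelian segregation is drawn independently per locus.

```python
import numpy as np

def coalescence_times(N, rng, n_loci=2):
    """Simulate one pedigree; return the pairwise coalescence time per locus."""
    half = N // 2  # individuals 0..half-1 are females, half..N-1 are males
    # Sample two distinct individuals from the current generation.
    sampled = rng.choice(N, size=2, replace=False)
    # ind[l, k]: individual carrying lineage k at locus l.
    # copy[l, k]: 0 = maternally inherited copy, 1 = paternally inherited copy.
    ind = np.tile(sampled, (n_loci, 1))
    copy = rng.integers(0, 2, size=(n_loci, 2))
    times = np.zeros(n_loci, dtype=int)
    generation = 0
    while np.any(times == 0):
        generation += 1
        # One step of the shared pedigree: a mother and a father per individual.
        mothers = rng.integers(0, half, size=N)
        fathers = rng.integers(half, N, size=N)
        # Each lineage moves to the parent it was inherited from ...
        ind = np.where(copy == 0, mothers[ind], fathers[ind])
        # ... and is either of that parent's two copies with probability 1/2,
        # independently across loci (non-homologous chromosomes).
        copy = rng.integers(0, 2, size=(n_loci, 2))
        coalesced = (ind[:, 0] == ind[:, 1]) & (copy[:, 0] == copy[:, 1])
        times[(times == 0) & coalesced] = generation
    return times

rng = np.random.default_rng(1)
N = 20             # population size per generation (arbitrary, must be even)
replicates = 1000  # number of independent pedigrees (arbitrary)
times = np.array([coalescence_times(N, rng) for _ in range(replicates)])
rho = np.corrcoef(times[:, 0], times[:, 1])[0, 1]
print("mean coalescence time:", times.mean())
print("estimated correlation:", rho)
```

Because the correlation is expected to be weak, on the order of the inverse population size, far more replicates than used here would be needed to distinguish it reliably from zero.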
Linked sites and model comparisons
We have so far only studied unlinked sites; however, our analytical results for the DDTWF models can be relatively easily extended to the case of linked loci. Such an extension is important, since, for example, the covariance of coalescence times at two loci is directly related to the measure of linkage disequilibrium (McVean, 2002). Quantifying the behavior of different models in terms of the covariance of coalescence times can provide insight into the importance of certain modeling
Summary and discussion
Previous studies of estimators of $\theta$ using data from a single locus have revealed properties rather different from classical statistical results for independent samples, due to the non-independence of samples induced by their shared gene genealogy. In particular, Tajima (1983) demonstrated that the average number of pairwise differences is an inconsistent estimator of $\theta$ as the sample size at the locus tends to infinity, and Joyce (1999) showed that there is no linear unbiased estimator using
References (31)
- et al. On the genealogy of a population of biparental individuals. J. Theor. Biol. (2000)
- et al. Sequential Markov coalescent algorithms for population models with demographic structure. Theor. Popul. Biol. (2009)
- The coalescent. Stochastic Process. Appl. (1982)
- et al. Dependency effects in multi-locus match probabilities. Theor. Popul. Biol. (2003)
- et al. A Markov chain model of coalescence with recombination. Theor. Popul. Biol. (1997)
- et al. A graphical approach to multi-locus match probability computation: revisiting the product rule. Theor. Popul. Biol. (2007)
- On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. (1975)
- et al. Recombination as a point process along sequences. Theor. Popul. Biol. (1999)
- et al. Distortion of genealogical properties when the sample is very large. Proc. Natl. Acad. Sci. USA (2014)
- et al. Multi-locus match probability in a finite population: a fundamental difference between the Moran and Wright-Fisher models. Bioinformatics (2009)
- Recent common ancestors of all present-day individuals. Adv. Appl. Probab.
- Perspective: gene divergence, population divergence, and the variance in coalescence time in phylogeographic studies. Evolution
- Accuracy of coalescent likelihood estimates: do we need more sites, more sequences, or more loci? Mol. Biol. Evol.
- An ancestral recombination graph
Cited by (17)
- Variance and limiting distribution of coalescence times in a diploid model of a consanguineous population. 2021, Theoretical Population Biology
- Introduction to the Paul Joyce special issue. 2018, Theoretical Population Biology
- The Expected Behaviors of Posterior Predictive Tests and Their Unexpected Interpretation. 2024, Molecular Biology and Evolution
- Practical application of the linkage disequilibrium method for estimating contemporary effective population size: A review. 2024, Molecular Ecology Resources