A non-zero variance of Tajima’s estimator for two sequences even for infinitely many unlinked loci

doi:10.1016/j.tpb.2017.03.002

Theoretical Population Biology

Volume 122, July 2018, Pages 22-29

https://doi.org/10.1016/j.tpb.2017.03.002 Get rights and content

Abstract

The population-scaled mutation rate, $θ$ , is informative on the effective population size and is thus widely used in population genetics. We show that for two sequences and $n$ unlinked loci, the variance of Tajima’s estimator ( $\hat{θ}$ ), which is the average number of pairwise differences, does not vanish even as $n \to \infty$ . The non-zero variance of $\hat{θ}$ results from a (weak) correlation between coalescence times even at unlinked loci, which, in turn, is due to the underlying fixed pedigree shared by gene genealogies at all loci. We derive the correlation coefficient under a diploid, discrete-time, Wright–Fisher model, and we also derive a simple, closed-form lower bound. We also obtain empirical estimates of the correlation of coalescence times under demographic models inspired by large-scale human genealogies. While the effect we describe is small ( $Var [\hat{θ}] ∕ θ^{2} \approx O (N_{e}^{- 1})$ ), it is important to recognize this feature of statistical population genetics, which runs counter to commonly held notions about unlinked loci.

Introduction

The population-scaled mutation rate, $θ$ , is defined as $4 N_{e} μ$ , where $N_{e}$ is the effective population size and $μ$ is the mutation rate per locus per generation (Wakeley, 2009). Two classic estimators were developed for $θ$ , Watterson’s (based on the number of segregating sites Watterson, 1975) and Tajima’s (based on the average number of pairwise differences Tajima (1983), Tajima (1989)). For a single pair of sequences, both estimators are identical (denoted here as $\hat{θ}$ ) and equal to the number of differences between the sequences.

Increasing the number of sampled individuals has limited ability to improve these estimates of $θ$ , because shared ancestry reduces the number of independent branches on which mutations can arise (Rosenberg and Nordborg, 2002). Felsenstein (2006) showed that the variance of maximum likelihood estimates of $θ$ decreases approximately logarithmically with the number of individuals sampled. In contrast, the variance decreases inversely with the number of independent loci. Thus, to increase the accuracy of estimates of $θ$ , it is generally more effective to increase the number of independent loci than the sample size at each locus (see also e.g., Edwards “bibausep Beerli (2000), Pluzhnikov “bibausep Donnelly (1996) and references within).

Consider a set of $n$ unlinked loci located on different (non-homologous) chromosomes. We show here that even as $n \to \infty$ , the variance of the resulting estimate of $θ$ does not converge to zero, in contrast to what we may have naïvely assumed. This behavior results from the fact that coalescence times, even at unlinked loci, are in fact weakly correlated, due to the sharing of the same fixed underlying pedigree across all loci (Wakeley et al., 2012). By conditioning on the number of shared genealogical common ancestors, we derive a simple approximate lower bound, as a function of $N_{e}$ , on the variance of $\hat{θ}$ (Sections 2 The relation of the variance of, 3 Modeling the effect of the shared pedigree).

Unlinked loci may also be sampled from the same chromosome, separated by an infinitely high recombination rate. The correlation of coalescence times in such a case is higher, as the two loci may travel together for the first few generations. Therefore, the extent of the correlation, and thereby, the variance of $\hat{θ}$ , also depend on the sampling configuration. In Section 4, we derive the correlation coefficient analytically, as a function of the configuration and the effective population size, using a diploid discrete time Wright–Fisher model (DDTWF). This model is an extension of the haploid DTWF model, previously advocated by Bhaskar et al. (2014) for the study of large samples from finite populations.

Our results for the variance of $\hat{θ}$ were obtained under the Wright–Fisher demographic model. To shed light on the variance of $\hat{θ}$ under more realistic demographic models, in Section 5 we run simulations based on real, large-scale human genealogical data (Kaplanis et al., 2017). The pedigrees inspired by different human populations differ from each other and from the Wright–Fisher pedigrees in a number of ways, for example in the variance of the relatedness of any two randomly chosen individuals. These differences lead to differences in the variance of $\hat{θ}$ for each population, even if they have the same effective population size. Finally, we study some properties of linked sites in Section 6.

We note that the dependence of gene genealogies at unlinked loci has been previously recognized, most recently in the context of matching probabilities. Specifically, the probability of the genotypes of two individuals to match at two or more loci was computed under the Wright–Fisherand other models, and shown to differ from the product of the corresponding one-locus probabilities Laurie “bibausep Weir (2003), Song “bibausep Slatkin (2007), Bhaskar “bibausep Song (2009). In earlier literature, this effect was demonstrated in the context of identity-by-descent probabilities at unlinked loci (Weir and Cockerham, 1969) and implicitly in results on linkage disequilibrium (Ohta and Kimura, 1969). However, the treatment of this effect in the context of effective population size estimation is to our knowledge new.

Section snippets

The relation of the variance of $\hat{θ}$ to the correlation of the coalescence times

For a sample of size two (haploids) at $n$ loci, the estimator of $θ$ can be expressed as ${\hat{θ}}_{(n)} = \frac{1}{n} \sum_{i = 1}^{n} {\hat{θ}}_{i},$ where ${\hat{θ}}_{i}$ is the number of differences at locus $i$ . If we assume the loci are exchangeable, we have: $Var [{\hat{θ}}_{(n)}] = \frac{Var [{\hat{θ}}_{i}]}{n} + \frac{n - 1}{n} Cov [{\hat{θ}}_{i}, {\hat{θ}}_{j}]; i \neq j .$

Under the standard coalescent model (Wakeley, 2009), $\hat{θ_{i}}$ is Poisson distributed with mean $2 μ T_{i}$ , where $T_{i}$ is the time until coalescence at locus $i$ in generations and $μ$ is the mutation rate per locus per generation. Using the law of total

Modeling the effect of the shared pedigree

In this section, we study the role of the shared underlying pedigree in the non-zero variance of $\hat{θ}$ . We first provide a formal derivation of the statistical inconsistency of $\hat{θ}$ , followed by an intuitive derivation of an approximate lower bound. Exact calculations appear in Section 4.

Exact results for the correlation of the coalescence times at unlinked loci

In this section, we provide an exact derivation of the correlation of coalescence times at unlinked loci under a diploid, discrete-time, Wright–Fisher model. Further, we consider multiple sampling configurations for those loci, as explained below.

Wright–Fisher simulations

In this section, we use simulation of the 2-sex diploid, discrete-time Wright–Fisher model to support our analytical results from Section 3.2. To estimate the correlation coefficient of the coalescence times at two loci, we first simulate many Wright–Fisher pedigrees. We then sample, for each pedigree, two individuals from the current generation. We set the population size $N_{e}$ to be the same in every generation, with equal numbers of males and females. We then consider two loci on non-homologous

Linked sites and model comparisons

We have so far only studied unlinked sites; however, our analytical results for the DDTWF models can be relatively easily extended to the case of linked loci. Such an extension is important, since, for example, the covariance of coalescence times at two loci is directly related to the $r^{2}$ measure of linkage disequilibrium (McVean, 2002). Quantifying the behavior of different models in terms of the covariance of coalescence times can provide insight into the importance of certain modeling

Summary and discussion

Previous studies of estimators of $θ$ using data from a single locus have revealed properties rather different from classical statistical results for independent samples, due to the non-independence of samples exerted by their shared gene genealogy. In particular, Tajima (1983) demonstrated that the average number of pairwise differences is an inconsistent estimator of $θ$ as the sample size at the locus tends to infinity, and Joyce (1999) showed that there is no linear unbiased estimator using

References (31)

DerridaB. et al.
On the genealogy of a population of biparental individuals
J. Theor. Biol.
(2000)
ErikssonA. et al.
Sequential Markov coalescent algorithms for population models with demographic structure
Theor. Popul. Biol.
(2009)
KingmanJ.F.C.
The coalescent
Stochastic Process. Appl.
(1982)
LaurieC. et al.
Dependency effects in multi-locus match probabilities
Theor. Popul. Biol.
(2003)
SimonsenK.L. et al.
A Markov chain model of coalescence with recombination
Theor. Popul. Biol.
(1997)
SongY.S. et al.
A graphical approach to multi-locus match probability computation: revisiting the product rule
Theor. Popul. Biol.
(2007)
WattersonG.A.
On the number of segregating sites in genetical models without recombination
Theor. Popul. Biol.
(1975)
WiufC. et al.
Recombination as a point process along sequences
Theor. Popul. Biol.
(1999)
BhaskarA. et al.
Distortion of genealogical properties when the sample is very large
Proc. Natl. Acad. Sci. USA
(2014)
BhaskarA. et al.
Multi-locus match probability in a finite population: a fundamental difference between the Moran and Wright-Fisher models
Bioinformatics
(2009)

Bittles, A.H., Black, M.L., 2015. Global patterns and tables of consanguinity. URL...

ChangJ.T.

Recent common ancestors of all present-day individuals

Adv. Appl. Probab.

(1999)

EdwardsS.V. et al.

Perspective: gene divergence, population divergence, and the variance in coalescence time in phylogeographic studies

Evolution

(2000)

FelsensteinJ.

Accuracy of coalescent likelihood estimates: do we need more sites, more sequences, or more loci?

Mol. Biol. Evol.

(2006)

GriffithsR. et al.

An ancestral recombination graph

Cited by (17)

Variance and limiting distribution of coalescence times in a diploid model of a consanguineous population
2021, Theoretical Population Biology
Recent modeling studies interested in runs of homozygosity (ROH) and identity by descent (IBD) have sought to connect these properties of genomic sharing to pairwise coalescence times. Here, we examine a variety of features of pairwise coalescence times in models that consider consanguinity. In particular, we extend a recent diploid analysis of mean coalescence times for lineage pairs within and between individuals in a consanguineous population to derive the variance of coalescence times, studying its dependence on the frequency of consanguinity and the kinship coefficient of consanguineous relationships. We also introduce a separation-of-time-scales approach that treats consanguinity models analogously to mathematically similar phenomena such as partial selfing, using this approach to obtain coalescence-time distributions. This approach shows that the consanguinity model behaves similarly to a standard coalescent, scaling population size by a factor $1 - 3 c$ , where $c$ represents the kinship coefficient of a randomly chosen mating pair. It provides the explanation for an earlier result describing mean coalescence time in the consanguinity model in terms of $c$ . The results extend the potential to make predictions about ROH and IBD in relation to demographic parameters of diploid populations.
Introduction to the Paul Joyce special issue
2018, Theoretical Population Biology
The Expected Behaviors of Posterior Predictive Tests and Their Unexpected Interpretation
2024, Molecular Biology and Evolution
Practical application of the linkage disequilibrium method for estimating contemporary effective population size: A review
2024, Molecular Ecology Resources
Correlation of coalescence times in a diploid Wright-Fisher model with recombination and selfing
2023, arXiv
On the number of genealogical ancestors tracing to the source groups of an admixed population
2023, Genetics

View all citing articles on Scopus

View full text

A non-zero variance of Tajima’s estimator for two sequences even for infinitely many unlinked loci

Abstract

Introduction

Section snippets

The relation of the variance of θˆ to the correlation of the coalescence times

Modeling the effect of the shared pedigree

Exact results for the correlation of the coalescence times at unlinked loci

Wright–Fisher simulations

Linked sites and model comparisons

Summary and discussion

J. Theor. Biol.

Theor. Popul. Biol.

Stochastic Process. Appl.

Theor. Popul. Biol.

Theor. Popul. Biol.

Theor. Popul. Biol.

Theor. Popul. Biol.

Theor. Popul. Biol.

Distortion of genealogical properties when the sample is very large

Proc. Natl. Acad. Sci. USA

Multi-locus match probability in a finite population: a fundamental difference between the Moran and Wright-Fisher models

Bioinformatics

Recent common ancestors of all present-day individuals

Adv. Appl. Probab.

Perspective: gene divergence, population divergence, and the variance in coalescence time in phylogeographic studies

Evolution

Accuracy of coalescent likelihood estimates: do we need more sites, more sequences, or more loci?

Mol. Biol. Evol.

An ancestral recombination graph

The relation of the variance of $\hat{θ}$ to the correlation of the coalescence times