Approximate Bayesian computation with functional statistics

Samuel Soubeyrand; Florence Carpentier; François Guiton; Etienne K. Klein

doi:10.1515/sagmb-2012-0014

Published by De Gruyter March 26, 2013

From the journal Statistical Applications in Genetics and Molecular Biology

Abstract

Functional statistics are commonly used to characterize spatial patterns in general and spatial genetic structures in population genetics in particular. Such functional statistics also enable the estimation of parameters of spatially explicit (and genetic) models. Recently, Approximate Bayesian Computation (ABC) has been proposed to estimate model parameters from functional statistics. However, applying ABC with functional statistics may be cumbersome because of the high dimension of the set of statistics and the dependences among them. To tackle this difficulty, we propose an ABC procedure which relies on an optimized weighted distance between observed and simulated functional statistics. We applied this procedure to a simple step model, a spatial point process characterized by its pair correlation function and a pollen dispersal model characterized by genetic differentiation as a function of distance. These applications showed how the optimized weighted distance improved estimation accuracy. In the discussion, we consider the application of the proposed ABC procedure to functional statistics characterizing non-spatial processes.

Keywords: dispersal model; marked point process; pairwise genetic distance; parameter estimation; spatial model; TwoGener

Corresponding author: Samuel Soubeyrand, INRA, UR546 Biostatistics and Spatial Processes, F-84914 Avignon, France

We thank the reviewers for their useful suggestions and comments. This research was supported by the French research agency ANR (EMILE project).

7 Appendix

A Approximation of the posterior distribution of θ given S

This appendix shows, in a simple case, that p_ε(θ|S) given by Eq. (1) approximates the posterior distribution conditional on the statistics, p(θ|S), given by Eq. (2). This simple case relies on regularity assumptions about the conditional probability distribution function of S given θ. These assumptions cannot be checked in usual applications of ABC, where this distribution is generally analytically intractable.

Here, (i) θ and S are fixed, (ii) the space of statistics is ℝ, (iii) the conditional probability distribution function of S given θ is three times differentiable over ℝ, (iv) the absolute values of its second and third derivatives are π-integrable over Θ, (v) its third derivative is a Lipschitz function.

From Taylor’s theorem,

Using assumption (v), the absolute value of the remainder term R(ε, S, θ) is bounded from above by:

Let r(ε, S, θ)=R(ε, S, θ)/ε³. Using assumption (iv), the upper bound of |R(ε, S, θ)| is π-integrable over Θ as well as θ↦r(ε, S, θ) and, consequently,

Therefore, when ε tends to zero, p_ε(θ|S) approximates the posterior distribution conditional on the statistics p(θ|S).
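The convergence stated above can be checked numerically on a toy example. The sketch below is not taken from the paper: it assumes that the approximate posterior is proportional to π(θ) times the probability that a simulated statistic falls within ε of the observed one, and it uses a hypothetical model (S given θ is N(θ, 1), prior θ ~ N(0, 2²), observed value 0.7) for which both posteriors can be evaluated on a grid.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normalise(weights, step):
    """Rescale grid weights so that they integrate to one (rectangle rule)."""
    total = sum(weights) * step
    return [w / total for w in weights]

def l1_gap(eps, s_obs=0.7, lo=-8.0, hi=8.0, n=1601):
    """L1 distance between the tolerance-eps ABC posterior and the exact posterior."""
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    # Hypothetical prior: theta ~ N(0, 2^2).
    prior = [normal_pdf(t / 2.0) / 2.0 for t in grid]
    # Exact posterior: prior times the density of S at s_obs (toy model S | theta ~ N(theta, 1)).
    exact = normalise([p * normal_pdf(s_obs - t) for p, t in zip(prior, grid)], step)
    # Assumed ABC posterior: prior times P(|S - s_obs| <= eps | theta).
    abc = normalise(
        [p * (normal_cdf(s_obs + eps - t) - normal_cdf(s_obs - eps - t))
         for p, t in zip(prior, grid)],
        step,
    )
    return sum(abs(a - e) for a, e in zip(abc, exact)) * step

# The gap between the two posteriors shrinks as the tolerance decreases.
gaps = [l1_gap(eps) for eps in (1.0, 0.5, 0.25, 0.125)]
print(gaps)
```

In this toy setting the gap decreases with ε, which is the behaviour the appendix establishes under assumptions (i) to (v); the Gaussian model itself is only an illustration.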

B Implementation of algorithms A2-ME, TS and PLS and comparison criteria

In Subsection 5.5, algorithm A2 was applied to three different sets of summary statistics: (i) a subset of the raw statistics obtained with the minimum entropy approach of Nunes and Balding (2010), (ii) a subset of the raw statistics obtained with the two-stage procedure of Nunes and Balding (2010) and (iii) a subset of the axes obtained after a PLS regression between the parameters and the statistics, as in Wegmann et al. (2009). The three algorithms are denoted A2-ME, A2-TS and A2-PLS. Note that these approaches can be carried out with the abctools R package proposed by Nunes and Prangle and the ABCtoolbox software of Wegmann et al. (2010).

For the ME (resp. TS) selection of the statistics, an exhaustive search for the subset of statistics minimizing the entropy (resp. the mean square root of the sum of squared errors, denoted MRSSE) over all possible subsets was not feasible (for 91 statistics there are almost 2.5×10²⁷ possible subsets). Therefore, we replaced the exhaustive search by an iterative forward search based on the entropy (resp. MRSSE). At iteration one, the subset consists of the single statistic which leads to the lowest entropy (resp. MRSSE). Then, at each subsequent iteration, the current subset is extended with the statistic which, when added to the current subset, leads to the lowest entropy (resp. MRSSE). The search stops when adding any of the remaining statistics no longer decreases the entropy (resp. MRSSE). For A2-ME, 5 (resp. 7) statistics were selected with τ=10^–4 (resp. τ=10^–3). For A2-TS, 11 (resp. 14) statistics were selected with τ=10^–4 (resp. τ=10^–3).
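The forward search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `forward_select` and `toy_criterion` are hypothetical names, and the criterion stands in for the entropy or the MRSSE, which in practice would be computed from ABC runs.

```python
def forward_select(n_stats, criterion):
    """Greedy forward search.

    criterion maps a tuple of statistic indices to a score (lower is better).
    """
    selected = []
    best_score = float("inf")
    remaining = set(range(n_stats))
    while remaining:
        # Score every one-statistic extension of the current subset.
        scored = [(criterion(tuple(selected + [j])), j) for j in sorted(remaining)]
        score, j = min(scored)
        if score >= best_score:
            break  # no remaining statistic lowers the criterion
        selected.append(j)
        remaining.remove(j)
        best_score = score
    return selected, best_score

def toy_criterion(subset):
    """Hypothetical stand-in for entropy or MRSSE: statistics 2 and 5 are informative."""
    informative = {2, 5}
    chosen = set(subset)
    return len(informative - chosen) + 0.1 * len(chosen - informative)

subset, score = forward_select(91, toy_criterion)
print(subset, score)

# Size of the exhaustive search space for 91 statistics (almost 2.5e27 subsets).
n_subsets = 2 ** 91
print(f"{n_subsets:.3e}")
```

With 91 statistics the forward search evaluates at most 91 + 90 + ... + 1 = 4186 candidate subsets, compared with 2⁹¹ for the exhaustive search, at the cost of possibly missing the globally optimal subset.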

Note that, in A2-TS, the MRSSE that we used includes a standardization to rescale the components of the parameter vector (as suggested by Nunes and Balding, 2010) and satisfies:

where is the set of 100 PODS selected in stage one of the two-stage procedure, is the posterior sample (set of accepted parameter vectors) obtained for each PODS in obtained after stage one of the two-stage procedure. The size of the posterior sample is n_acc=Iτ=10⁶τ where τ is the acceptance threshold (10^–4 or 10^–3).

For A2-PLS, we fitted the PLS regression to 10⁴ simulations and kept the minimum number of axes explaining at least 99% of the variance of the parameters. This led us to keep 24 axes among 91 possible axes.
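The rule for choosing the number of axes can be sketched independently of the PLS fit itself. The snippet below assumes that the fraction of parameter variance explained by each successive axis is already available (for example from a PLS routine); the function name and the toy fractions are hypothetical.

```python
def n_axes_for_variance(explained_fractions, threshold=0.99):
    """Smallest number of leading axes whose cumulative explained variance reaches threshold."""
    cumulative = 0.0
    for n_axes, fraction in enumerate(explained_fractions, start=1):
        cumulative += fraction
        if cumulative >= threshold:
            return n_axes
    raise ValueError("threshold not reached with the available axes")

# Toy fractions (hypothetical): cumulative sums are 0.6, 0.9, 0.98, 0.995, 1.0.
toy_fractions = [0.6, 0.3, 0.08, 0.015, 0.005]
n_kept = n_axes_for_variance(toy_fractions)
print(n_kept)
```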

We computed marginal criteria measuring the accuracy of the estimation for each parameter component. The marginal BMSE was computed with 1000 new pseudo-observed data sets (PODS) not used in the implementation of the compared algorithms. Among the 1000 new PODS, only 266 were used to compute the marginal PMSE, the marginal coverage of the 95%-posterior intervals of δ and b and the marginal mean square root of the sum of squared errors [MRSSE; criterion used in the two-stage approach of Nunes and Balding (2010)]. The 266 PODS were obtained as follows: for each value of τ (10^–4 and 10^–3), we selected the 250 PODS whose summary statistics, reduced by the ME approach, are the closest to the observed summary statistics reduced by the ME approach (closeness being quantified with the Euclidean distance). Then we merged the two sets of PODS and obtained a set of 266 different PODS. This selection of PODS is similar to stage one in the TS procedure of Nunes and Balding (2010).
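The selection of PODS described above amounts to two nearest-neighbour selections followed by a union. The sketch below illustrates this with hypothetical helper names and toy data; in the paper's setting each selection would contain 250 PODS, and a union of 266 PODS implies that the two selections shared 250 + 250 − 266 = 234 PODS.

```python
import math

def closest_pods(pods_stats, observed_stats, k):
    """Indices of the k PODS whose statistics are closest (Euclidean) to the observed ones."""
    order = sorted(
        range(len(pods_stats)),
        key=lambda j: math.dist(pods_stats[j], observed_stats),
    )
    return set(order[:k])

def merge_selections(selections):
    """Union of several index sets, returned as a sorted list of distinct indices."""
    return sorted(set().union(*selections))

# Toy data (hypothetical): the same 4 PODS summarised by two different reduced statistic sets,
# standing in for the ME reductions obtained with the two values of tau.
pods_a = [(0.0,), (1.0,), (2.0,), (3.0,)]
pods_b = [(5.0, 0.0), (0.0, 0.0), (0.2, 0.1), (9.0, 9.0)]
sel_a = closest_pods(pods_a, (0.1,), k=2)
sel_b = closest_pods(pods_b, (0.0, 0.0), k=2)
merged = merge_selections([sel_a, sel_b])
print(sel_a, sel_b, merged)
```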

Let , j=1,…,10³ denote the 1000 new PODS. The marginal BMSE satisfies:

where are the marginal posterior medians of the K components of , the marginal posterior medians being obtained with either A4, A2-ME, A2-TS or A2-PLS.

Let denote the set of 266 selected PODS. The marginal PMSE satisfies:

The marginal coverage of any parameter by the corresponding marginal 95%-posterior interval is:

where and are the posterior quantiles of order 0.025 and 0.975 of the k-th component of θ^*_j, the posterior quantiles being obtained with either A4, A2-ME, A2-TS or A2-PLS, and 1 is the indicator function [1(E)=1 if event E holds, zero otherwise]. The marginal MRSSE satisfies:

where denotes the posterior sample (set of accepted parameter vectors) obtained by applying either A4, A2-ME, A2-TS or A2-PLS to the j-th PODS (to infer θ^*_j). The size of the posterior sample is n_acc=Iτ=10⁶τ where in algorithm A4 and τ=10^–4 or 10^–3 in algorithms A2-ME, TS and PLS.
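Three of the marginal criteria can be sketched for a single parameter component as follows. The displayed formulas are not available here, so the exact normalisations are assumptions: the BMSE is taken as the mean squared difference between posterior medians and true values, the coverage as the fraction of PODS whose true value lies between the 0.025 and 0.975 posterior quantiles, and the MRSSE as the mean over PODS of the root mean squared deviation of the accepted values from the true value. All function names and the toy data are hypothetical.

```python
import math
import statistics

def marginal_bmse(true_values, posterior_samples):
    """Mean over PODS of the squared error of the posterior median (assumed form)."""
    errors = [(statistics.median(sample) - truth) ** 2
              for truth, sample in zip(true_values, posterior_samples)]
    return statistics.fmean(errors)

def marginal_coverage(true_values, posterior_samples):
    """Fraction of PODS whose true value lies in the central 95% posterior interval."""
    hits = 0
    for truth, sample in zip(true_values, posterior_samples):
        cuts = statistics.quantiles(sample, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
        hits += cuts[0] <= truth <= cuts[-1]
    return hits / len(true_values)

def marginal_mrsse(true_values, posterior_samples):
    """Mean over PODS of the root mean squared deviation from the true value (assumed form)."""
    rsse = [math.sqrt(statistics.fmean([(x - truth) ** 2 for x in sample]))
            for truth, sample in zip(true_values, posterior_samples)]
    return statistics.fmean(rsse)

# Toy example (hypothetical): two PODS, one well estimated and one badly estimated.
truths = [0.0, 1.0]
samples = [
    [i / 10 for i in range(-10, 11)],      # accepted values centred on the true value 0.0
    [5 + i / 10 for i in range(-10, 11)],  # accepted values far from the true value 1.0
]
bmse = marginal_bmse(truths, samples)
coverage = marginal_coverage(truths, samples)
mrsse = marginal_mrsse(truths, samples)
print(bmse, coverage, mrsse)
```

The PMSE is left out of this sketch because its definition cannot be inferred from the surrounding text.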

References

Austerlitz, F. and P. E. Smouse (2002) “Two-generation analysis of pollen flow across a landscape. IV. Estimating the dispersal parameter,” Genetics, 161, 355. doi:10.1093/genetics/161.1.355

Austerlitz, F., C. W. Dick, C. Dutech, E. K. Klein, S. Oddou-Muratorio, P. E. Smouse and V. L. Sork (2004) “Using genetic markers to estimate the pollen dispersal curve,” Mol. Ecol., 13, 937–954. doi:10.1111/j.1365-294X.2004.02100.x

Barnes, C. P., S. Filippi, M. P. H. Stumpf and T. Thorne (2012) “Considerate approaches to constructing summary statistics for ABC model selection,” Stat. Comput., 22, 1181–1197. doi:10.1007/s11222-012-9335-7

Beaumont, M. A. (2010) “Approximate Bayesian computation in evolution and ecology,” Annu. Rev. Ecol. Evol. Syst., 41, 379–406. doi:10.1146/annurev-ecolsys-102209-144621

Beaumont, M. A., W. Zhang and D. J. Balding (2002) “Approximate Bayesian computation in population genetics,” Genetics, 162, 2025–2035. doi:10.1093/genetics/162.4.2025

Beaumont, M. A., J.-M. Cornuet, J.-M. Marin and C. Robert (2009) “Adaptivity for ABC algorithms: the ABC-PMC scheme,” Biometrika, 96, 983–990. doi:10.1093/biomet/asp052

Blum, M. G. B. (2010a) “Approximate Bayesian computation: a nonparametric perspective,” J. Am. Stat. Assoc., 105, 1178–1187. doi:10.1198/jasa.2010.tm09448

Blum, M. G. B. (2010b) Choosing the summary statistics and the acceptance rate in approximate Bayesian computation. In: Lechevallier, Y., Saporta, G. (Eds.), Proceedings of COMPSTAT’2010. Physica-Verlag, pp. 47–56. doi:10.1007/978-3-7908-2604-3_4

Blum, M. G. B. and O. François (2010) “Non-linear regression models for approximate Bayesian computation,” Stat. Comput., 20, 63–73. doi:10.1007/s11222-009-9116-0

Blum, M. G. B., M. A. Nunes, D. Prangle and S. A. Sisson (2012) “A comparative review of dimension reduction methods in approximate Bayesian computation,” arXiv preprint arXiv:1202.3819.

Carpentier, F. (2010) Modélisations de la dispersion du pollen et estimation à partir de marqueurs génétiques, Ph.D. thesis, Université Montpellier 2.

Chilès, J.-P. and P. Delfiner (1999) Geostatistics. Modeling Spatial Uncertainty. New York: Wiley. doi:10.1002/9780470316993

Cressie, N. A. C. (1991) Statistics for Spatial Data. New York: Wiley.

Csilléry, K., M. G. B. Blum, O. E. Gaggiotti and O. François (2010) “Approximate Bayesian computation (ABC) in practice,” Trends Ecol. Evol., 25, 410–418. doi:10.1016/j.tree.2010.04.001

Csilléry, K., O. François and M. Blum (2011) “Abc: an R package for Approximate Bayesian Computation (ABC),” arXiv preprint arXiv:1106.2793.

Fearnhead, P. and D. Prangle (2012) “Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation,” J. R. Stat. Soc. B, 74, 419–474. doi:10.1111/j.1467-9868.2011.01010.x

Haario, H., E. Saksman and J. Tamminen (2001) “An adaptive Metropolis algorithm,” Bernoulli, 7, 223–242. doi:10.2307/3318737

Haon-Lasportes, E., F. Carpentier, O. Martin, E. K. Klein and S. Soubeyrand (2011) Conditioning on parameter point estimates in approximate Bayesian computation. Research Report. INRA, Biostatistics and Spatial Processes Research Unit.

Hardy, O. J. (2003) “Estimation of pairwise relatedness between individuals and characterization of isolation-by-distance processes using dominant genetic markers,” Mol. Ecol., 12, 1577–1588. doi:10.1046/j.1365-294X.2003.01835.x

Illian, J., A. Penttinen, H. Stoyan and D. Stoyan (2008) Statistical Analysis and Modelling of Spatial Point Patterns. New York: Wiley. doi:10.1002/9780470725160

Joyce, P. and P. Marjoram (2008) “Approximately sufficient statistics and Bayesian computation,” Stat. Appl. Genet. Mol. Biol., 7, 1–16. doi:10.2202/1544-6115.1389

Jung, H. and P. Marjoram (2011) “Choice of summary statistic weights in approximate Bayesian computation,” Stat. Appl. Genet. Mol. Biol., 10, 1–23. doi:10.2202/1544-6115.1586

Kirkpatrick, S., C. D. Gelatt Jr. and M. P. Vecchi (1983) “Optimization by simulated annealing,” Science, 220, 671–680. doi:10.1126/science.220.4598.671

Leuenberger, C. and D. Wegmann (2010) “Bayesian computation and model selection without likelihoods,” Genetics, 184, 243–252. doi:10.1534/genetics.109.109058

Marin, J. M., P. Pudlo, C. P. Robert and R. Ryder (2011) “Approximate Bayesian computational methods,” Stat. Comput., 22, 1167–1180. doi:10.1007/s11222-011-9288-2

Marjoram, P., V. Plagnol and S. Tavaré (2003) “Markov chain Monte Carlo without likelihoods,” PNAS, 100, 15324–15328. doi:10.1073/pnas.0306899100

McCulloch, C. E. and S. R. Searle (2001) Generalized, Linear, and Mixed Models. New York: Wiley. doi:10.1002/9780470057339.vag009

Nelder, J. A. and R. Mead (1965) “A simplex method for function minimization,” Comput. J., 7, 308–313. doi:10.1093/comjnl/7.4.308

Nunes, M. A. and D. J. Balding (2010) “On optimal selection of summary statistics for approximate Bayesian computation,” Stat. Appl. Genet. Mol. Biol., 9, 1–14. doi:10.2202/1544-6115.1576

Oddou-Muratorio, S., E. K. Klein and F. Austerlitz (2005) “Pollen flow in the wild service tree, Sorbus torminalis (L.) Crantz. II. Pollen dispersal and heterogeneity in mating success inferred from parent–offspring analysis,” Mol. Ecol., 14, 4441–4452. doi:10.1111/j.1365-294X.2005.02720.x

Pritchard, J. K., M. T. Seielstad, A. Perez-Lezaun and M. W. Feldman (1999) “Population growth of human Y chromosomes: a study of Y chromosome microsatellites,” Mol. Biol. Evol., 16, 1791–1798. doi:10.1093/oxfordjournals.molbev.a026091

Robledo-Arnuncio, J. J. and F. Austerlitz (2006) “Pollen dispersal in spatially aggregated populations,” The American Naturalist, 168, 500–511. doi:10.1086/507881

Robledo-Arnuncio, J. J., F. Austerlitz and P. E. Smouse (2006) “A new method of estimating the pollen dispersal curve independently of effective density,” Genetics, 173, 1033–1045. doi:10.1534/genetics.105.052035

Rohatgi, V. K. (2003) Statistical Inference. Mineola, NY: Dover Publications.

Rousset, F. (1997) “Genetic differentiation and estimation of gene flow from F-statistics under isolation by distance,” Genetics, 145, 1219. doi:10.1093/genetics/145.4.1219

Rousset, F. (2000) “Genetic differentiation between individuals,” J. Evol. Biol., 13, 58–62. doi:10.1046/j.1420-9101.2000.00137.x

Rousset, F. and R. Leblois (2007) “Likelihood and approximate likelihood analyses of genetic structure in a linear habitat: performance and robustness to model mis-specification,” Mol. Biol. Evol., 24, 2730–2745. doi:10.1093/molbev/msm206

Rubin, D. B. (1984) “Bayesianly justifiable and relevant frequency calculations for the applied statistician,” Ann. Stat., 12, 1151–1172. doi:10.1214/aos/1176346785

Ruppert, D., M. P. Wand and R. J. Carroll (2003) Semiparametric Regression. Cambridge: Cambridge University Press. doi:10.1017/CBO9780511755453

Smouse, P. E., R. J. Dyer, R. D. Westfall and V. L. Sork (2001) “Two-generation analysis of pollen flow across a landscape. I. Male gamete heterogeneity among females,” Evolution, 55, 260–271. doi:10.1111/j.0014-3820.2001.tb01291.x

Stoyan, D. and H. Stoyan (1994) Fractals, Random Shapes and Point Fields: Methods of Geometrical Statistics. New York: Wiley.

Wegmann, D., C. Leuenberger and L. Excoffier (2009) “Efficient Approximate Bayesian Computation coupled with Markov chain Monte Carlo without likelihood,” Genetics, 182, 1207–1218. doi:10.1534/genetics.109.102509

Wegmann, D., C. Leuenberger, S. Neuenschwander and L. Excoffier (2010) “ABCtoolbox: a versatile toolkit for approximate Bayesian computations,” BMC Bioinformatics, 11, 116. doi:10.1186/1471-2105-11-116

Wilson, D. J., E. Gabriel, A. J. H. Leatherbarrow, J. Cheesbrough, S. Gee, E. Bolton, A. Fox, C. A. Hart, P. J. Diggle and P. Fearnhead (2009) “Rapid evolution and the importance of recombination to the gastroenteric pathogen Campylobacter jejuni,” Mol. Biol. Evol., 26, 385–397. doi:10.1093/molbev/msn264

Published Online: 2013-03-26
