Bayesian clustering of DNA sequences using Markov chains and a stochastic partition model

Väinö Jääskinen; Ville Parkkinen; Lu Cheng; Jukka Corander

doi:10.1515/sagmb-2013-0031

Published by De Gruyter November 19, 2013

From the journal Statistical Applications in Genetics and Molecular Biology

Abstract

In many biological applications it is necessary to cluster DNA sequences into groups that represent underlying organismal units, such as named species or genera. In metagenomics this grouping typically has to be achieved on the basis of relatively short sequences that contain different types of errors, which makes a statistical modeling approach desirable. Here we introduce a novel method for this purpose by developing a stochastic partition model that clusters Markov chains of a given order. The model is based on a Dirichlet process prior, and we use conjugate priors for the Markov chain parameters, which yields an analytical expression for comparing the marginal likelihoods of any two partitions. To find a good candidate for the posterior mode in the partition space, we use a hybrid computational approach that combines the EM algorithm with a greedy search. This approach is demonstrated to be faster than earlier suggested clustering methods for the metagenomics application and to yield highly accurate results. Our model is fairly generic and could also be used for clustering other types of sequence data for which Markov chains provide a reasonable way to compress information, as illustrated by experiments on shotgun-sequence-type data from an Escherichia coli strain.
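To make the role of the conjugate priors concrete, the following is a minimal Python sketch of how two partitions can be scored in closed form. It is not the authors' implementation. It assumes a symmetric Dirichlet prior (hypothetical pseudo-count `pseudo`) on each row of the transition matrix, conditions on the first `order` bases of every sequence, and uses the Chinese restaurant process form of the Dirichlet process prior with a hypothetical concentration `alpha`. The function names and default values are illustrative only, and the EM and greedy search steps are not shown.

```python
from collections import Counter
from itertools import product
from math import lgamma, log

ALPHABET = "ACGT"


def transition_counts(seqs, order):
    """Count (context, next base) pairs over all sequences in one cluster."""
    counts = Counter()
    for s in seqs:
        for i in range(order, len(s)):
            counts[(s[i - order:i], s[i])] += 1
    return counts


def log_marginal(seqs, order=1, pseudo=0.25):
    """Log marginal likelihood of one cluster (Dirichlet-multinomial per context).

    Assumption: the first `order` bases of each sequence are conditioned on,
    and transitions involving symbols outside ALPHABET are ignored.
    """
    counts = transition_counts(seqs, order)
    a0 = len(ALPHABET) * pseudo
    total = 0.0
    for ctx in map("".join, product(ALPHABET, repeat=order)):
        n = [counts[(ctx, b)] for b in ALPHABET]
        total += lgamma(a0) - lgamma(a0 + sum(n))
        total += sum(lgamma(pseudo + x) - lgamma(pseudo) for x in n)
    return total


def log_crp_prior(partition, alpha=1.0):
    """Log probability of a partition under a Chinese restaurant process."""
    n = sum(len(cluster) for cluster in partition)
    k = len(partition)
    return (k * log(alpha)
            + sum(lgamma(len(cluster)) for cluster in partition)
            + lgamma(alpha) - lgamma(alpha + n))


def log_score(partition, order=1, pseudo=0.25, alpha=1.0):
    """Unnormalized log posterior of a partition (a list of lists of sequences)."""
    return log_crp_prior(partition, alpha) + sum(
        log_marginal(cluster, order, pseudo) for cluster in partition
    )
```

Two candidate partitions of the same sequences can then be compared by the difference of their `log_score` values; a search procedure would accept a move such as merging two clusters only when that difference is positive.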

Keywords: Clustering; DNA sequences; Markov chains; metagenomics

Corresponding author: Väinö Jääskinen, Department of Mathematics and Statistics, University of Helsinki, FI-00014, Finland, Tel.: +358505354346, e-mail: vaino.jaaskinen@helsinki.fi

Published Online: 2013-11-19

Published in Print: 2014-02-01
