Abstract
Taxonomic classification of genomic sequences is usually based on evolutionary distance obtained by alignment. In this work we introduce a novel alignment-free classification approach based on probabilistic topic modeling. Using a k-mer decomposition of DNA sequences (k-mers are short fragments of length k) and the Latent Dirichlet Allocation algorithm, we built a classifier for bacterial 16S rRNA gene sequences. We tested our method with a tenfold cross-validation procedure on a dataset of 3000 bacterial sequences belonging to the most numerous bacterial phyla: Actinobacteria, Firmicutes and Proteobacteria. To test the robustness of the proposed methodology, experiments were carried out using both complete 16S sequences and 400 bp long fragments. Precision scores, across different numbers of topics, range from 100 % at class level to 77 % at genus level, for both full-length and 400 bp sequences, using k-mers of length 8. These results demonstrate the effectiveness of the proposed approach.
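As an illustration of the k-mer decomposition mentioned in the abstract, the following minimal sketch counts the k-mers of a DNA sequence so that each sequence can be treated as a bag-of-words "document" for a topic model such as Latent Dirichlet Allocation. It is not the authors' implementation: the function name `kmer_counts`, the overlapping sliding window with step 1, and the skipping of windows containing ambiguous bases are all assumptions made for this example.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int = 8) -> Counter:
    """Count overlapping k-mers in a DNA sequence.

    Assumptions of this sketch (not taken from the paper):
    - the window slides one base at a time, so k-mers overlap;
    - windows containing anything other than A, C, G, T are skipped.
    """
    seq = sequence.upper()
    valid_bases = set("ACGT")
    counts = Counter()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if set(kmer) <= valid_bases:
            counts[kmer] += 1
    return counts


# Toy example with k = 4 for readability; the abstract reports k = 8.
example = kmer_counts("ACGTACGT", k=4)
print(example)
```

Each resulting count vector plays the role of a word-frequency vector for one document; stacking the vectors of all sequences gives the document-term matrix that a topic model takes as input.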
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
La Rosa, M., Fiannaca, A., Rizzo, R., Urso, A. (2014). Genomic Sequence Classification Using Probabilistic Topic Modeling. In: Formenti, E., Tagliaferri, R., Wit, E. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2013. Lecture Notes in Computer Science, vol 8452. Springer, Cham. https://doi.org/10.1007/978-3-319-09042-9_4
DOI: https://doi.org/10.1007/978-3-319-09042-9_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09041-2
Online ISBN: 978-3-319-09042-9
eBook Packages: Computer Science (R0)