Genomic Sequence Classification Using Probabilistic Topic Modeling

  • Conference paper
Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2013)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 8452)

Abstract

Taxonomic classification of genomic sequences is usually based on evolutionary distance obtained by alignment. In this work we introduce a novel alignment-free classification approach based on probabilistic topic modeling. Using a k-mer decomposition of DNA sequences (a k-mer being a short fragment of length k) and the Latent Dirichlet Allocation algorithm, we built a classifier for 16S rRNA bacterial gene sequences. We tested our method with a tenfold cross-validation procedure on a bacterial dataset of 3000 elements belonging to the most numerous bacterial phyla: Actinobacteria, Firmicutes and Proteobacteria. Experiments were carried out using both complete and 400 bp long 16S sequences, in order to test the robustness of the proposed methodology. Our results, in terms of precision scores and for different numbers of topics, range from 100 % at class level to 77 % at genus level, for both full-length and 400 bp sequences, considering k-mers of length 8. These results demonstrate the effectiveness of the proposed approach.
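The k-mer decomposition mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `kmer_counts` and `to_document` are hypothetical, and the use of overlapping windows with a stride of one and the skipping of k-mers that contain ambiguous bases are assumptions that the abstract does not confirm. The sketch only shows how a DNA sequence might be turned into a bag of k-mer "words", which is the kind of input a topic model such as Latent Dirichlet Allocation works on.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def kmer_counts(sequence, k=8):
    """Count the k-mers of a DNA sequence.

    Assumption: overlapping windows with stride 1; k-mers containing
    characters other than A, C, G, T (e.g. N) are skipped.
    """
    seq = sequence.upper()
    counts = Counter()
    # A sequence of length n yields n - k + 1 windows (none if n < k).
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= VALID_BASES:
            counts[kmer] += 1
    return counts


def to_document(sequence, k=8):
    """Render a sequence as a space-separated 'document' of k-mer words.

    Hypothetical helper: each k-mer is repeated as many times as it
    occurs, so a bag-of-words vectorizer would recover the same counts.
    """
    counts = kmer_counts(sequence, k)
    words = []
    for kmer in sorted(counts):
        words.extend([kmer] * counts[kmer])
    return " ".join(words)


if __name__ == "__main__":
    # Toy fragment, not real 16S data; k=4 keeps the output readable.
    print(kmer_counts("ACGTACGTTG", k=4))
    print(to_document("ACGTACGTTG", k=4))
```

With k = 8, as in the reported experiments, the vocabulary holds at most 4^8 = 65,536 distinct k-mers, so each sequence becomes a sparse count vector over that vocabulary.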

Author information

Corresponding author

Correspondence to Massimo La Rosa.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

La Rosa, M., Fiannaca, A., Rizzo, R., Urso, A. (2014). Genomic Sequence Classification Using Probabilistic Topic Modeling. In: Formenti, E., Tagliaferri, R., Wit, E. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2013. Lecture Notes in Computer Science, vol 8452. Springer, Cham. https://doi.org/10.1007/978-3-319-09042-9_4

  • DOI: https://doi.org/10.1007/978-3-319-09042-9_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09041-2

  • Online ISBN: 978-3-319-09042-9

  • eBook Packages: Computer Science, Computer Science (R0)
