Abstract
Read classification is a fundamental problem in metagenomics. With the development of next-generation sequencing (NGS), metagenome samples can be generated at much lower cost and in much less time. However, the short reads produced by NGS make read classification much more difficult than before. None of the existing tools can accurately assign NGS short reads to each genome, which limits their use in real applications. Fortunately, in many applications it is unnecessary to separate all the species in a metagenome sample from each other, because we usually focus on only a few specified species categories in the sample and do not care about the others. No existing tool is designed specifically for obtaining specified species from short metagenomic reads generated by NGS. In this paper, we propose a tool named MetaObtainer that obtains the specified species from NGS short reads. The tool combines several recent techniques for processing short reads, which allows it to perform better than other tools. It can (1) handle NGS reads shorter than 100 bp with very high accuracy (both precision and recall exceed 90%); (2) find unknown species using the reference genomes of species that are similar to them; (3) perform well when the dataset contains very few reads of the specified species; (4) handle genomes of similar abundance levels as well as different abundance levels (1:10); and (5) obtain multiple species categories from a metagenome sample.
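The abstract reports precision and recall without defining them. As a minimal sketch, assuming the standard definitions applied to sets of read IDs, the two figures for a single specified species could be computed as below. The function name, the read IDs, and the toy counts are hypothetical illustrations and are not taken from MetaObtainer or its evaluation.

```python
def precision_recall(obtained, truth):
    """Return (precision, recall) for one specified species.

    obtained: read IDs the tool reported as belonging to the species.
    truth:    read IDs that really originate from the species.
    Assumes the standard definitions; the paper's exact evaluation
    protocol is not stated in the abstract.
    """
    obtained, truth = set(obtained), set(truth)
    true_positives = len(obtained & truth)
    precision = true_positives / len(obtained) if obtained else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy sample with a 1:10 abundance ratio:
# 100 reads from the specified species, 1000 reads from everything else.
truth = {f"target_{i}" for i in range(100)}
background = {f"other_{i}" for i in range(1000)}

# Suppose a tool returns 95 correct reads plus 5 background reads.
obtained = {f"target_{i}" for i in range(95)} | {f"other_{i}" for i in range(5)}

precision, recall = precision_recall(obtained, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case both values are 0.95. Note that when the specified species is rare, a small number of misassigned background reads lowers precision quickly, which is why the low-abundance setting is a demanding test.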
Acknowledgments
We thank Jiaoyun Yang, Pengyu Nie, and Xingxing Zhang, who provided many helpful suggestions for our article. Constructive comments from the reviewers are also appreciated. This work is supported by the National Natural Science Foundation of China (No. 61033009 and No. 60970085) and Foreign Scholars in University Research and Teaching Programs (B07033).
About this article
Cite this article
Pan, W., Chen, B. & Xu, Y. MetaObtainer: A Tool for Obtaining Specified Species from Metagenomic Reads of Next-generation Sequencing. Interdiscip Sci Comput Life Sci 7, 405–413 (2015). https://doi.org/10.1007/s12539-015-0281-x
DOI: https://doi.org/10.1007/s12539-015-0281-x