Scalable and Versatile k-mer Indexing for High-Throughput Sequencing Data

Välimäki, Niko; Rivals, Eric

doi:10.1007/978-3-642-38036-5_24

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 7875)

Abstract

Philippe et al. (2011) proposed a data structure called Gk arrays for indexing and querying large collections of high-throughput sequencing data in main memory. The data structure supports versatile queries for counting, locating, and analysing the coverage profile of k-mers in short-read data. The main drawback of Gk arrays is their space consumption, which can easily reach tens of gigabytes of main memory even for moderate-size inputs. We propose a compressed variant of Gk arrays that supports the same set of queries, but in both near-optimal time and space. In practice, the compressed Gk arrays scale up to much larger inputs, with query times that remain highly competitive with those of their non-compressed predecessor. The main applications include variant calling, error correction, coverage profiling, and sequence assembly.
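To make the query types named in the abstract concrete, here is a minimal sketch that answers counting, locating, and coverage-profile queries with a plain hash table. It is not the Gk arrays and not the compressed variant proposed in the paper; it only illustrates what the queries return. The function names (`build_index`, `count`, `locate`, `reads_containing`, `coverage_profile`) and the toy reads are assumptions made for this illustration, and "coverage" is taken here to mean the number of distinct reads sharing the k-mer that starts at each position.

```python
from collections import defaultdict


def build_index(reads, k):
    """Map every k-mer to the list of (read_id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for offset in range(len(read) - k + 1):
            index[read[offset:offset + k]].append((read_id, offset))
    return index


def count(index, kmer):
    """Counting query: total number of occurrences of kmer in the collection."""
    return len(index.get(kmer, ()))


def locate(index, kmer):
    """Locating query: every (read_id, offset) at which kmer occurs."""
    return list(index.get(kmer, ()))


def reads_containing(index, kmer):
    """Sorted ids of the distinct reads that contain kmer at least once."""
    return sorted({read_id for read_id, _ in index.get(kmer, ())})


def coverage_profile(index, read, k):
    """For each k-mer start position in read, the number of distinct reads sharing that k-mer."""
    return [
        len(reads_containing(index, read[i:i + k]))
        for i in range(len(read) - k + 1)
    ]


# Hypothetical toy input: three short reads, indexed with k = 3.
reads = ["ACGTAC", "CGTACG", "TTACGT"]
index = build_index(reads, 3)

print(count(index, "ACG"))                    # 3
print(locate(index, "ACG"))                   # [(0, 0), (1, 3), (2, 2)]
print(reads_containing(index, "GTA"))         # [0, 1]
print(coverage_profile(index, reads[0], 3))   # [3, 3, 2, 3]
```

A table like this stores one entry per k-mer occurrence on top of the k-mer strings themselves, which hints at why an uncompressed index over a large read collection can grow to the memory footprints the abstract mentions.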

References

Bauer, M.J., Cox, A.J., Rosone, G.: Lightweight BWT construction for very large string collections. In: Giancarlo, R., Manzini, G. (eds.) CPM 2011. LNCS, vol. 6661, pp. 219–231. Springer, Heidelberg (2011)
Beller, T., Gog, S., Ohlebusch, E., Schnattinger, T.: Computing the longest common prefix array based on the Burrows-Wheeler transform. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds.) SPIRE 2011. LNCS, vol. 7024, pp. 197–208. Springer, Heidelberg (2011)
Burkhardt, S., Crauser, A., Ferragina, P., Lenhof, H.-P., Rivals, E., Vingron, M.: q-gram Based Database Searching Using a Suffix Array (QUASAR). In: 3rd Int. Conf. on Computational Molecular Biology, pp. 77–83. ACM Press (1999)
Chikhi, R., Lavenier, D.: Localized genome assembly from reads to scaffolds: Practical traversal of the paired string graph. In: Przytycka, T.M., Sagot, M.-F. (eds.) WABI 2011. LNCS, vol. 6833, pp. 39–48. Springer, Heidelberg (2011)
Claude, F., Fariña, A., Martínez-Prieto, M.A., Navarro, G.: Compressed q-gram indexing for highly repetitive biological sequences. In: Proc. 10th IEEE Intl. Conf. on Bioinformatics and Bioengineering, pp. 86–91 (2010)
Conway, T.C., Bromage, A.J.: Succinct Data Structures for Assembling Large Genomes. Bioinformatics 27(4), 479–486 (2011)
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proc. 41st Annual Symposium on Foundations of Computer Science (FOCS), pp. 390–398. IEEE Computer Society (2000)
Fischer, J., Mäkinen, V., Navarro, G.: Faster entropy-bounded compressed suffix trees. Theor. Comput. Sci. 410(51), 5354–5364 (2009)
Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: 14th Ann. ACM-SIAM Symp. on Discrete Algorithms, pp. 841–850 (2003)
Hazelhurst, S., Lipták, Z.: Kaboom! a new suffix array based algorithm for clustering expression data. Bioinformatics 27(24), 3348–3355 (2011)
Hon, W.-K., Lam, T.-W., Sadakane, K., Sung, W.-K., Yiu, S.-M.: A space and time efficient algorithm for constructing compressed suffix arrays. Algorithmica 48(1), 23–36 (2007)
Hon, W.-K., Sadakane, K.: Space-economical algorithms for finding maximal unique matches. In: Apostolico, A., Takeda, M. (eds.) CPM 2002. LNCS, vol. 2373, pp. 144–152. Springer, Heidelberg (2002)
Hon, W.-K., Shah, R., Vitter, J.S.: Space-efficient framework for top-k string retrieval problems. In: FOCS, pp. 713–722. IEEE Computer Society (2009)
Jacobson, G.: Succinct Static Data Structures. PhD thesis, Carnegie–Mellon (1989)
Li, H.: Implementation of BCR, https://github.com/lh3/ropebwt
Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14), 1754–1760 (2009)
Marçais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6), 764–770 (2011)
Melsted, P., Pritchard, J.: Efficient counting of k-mers in DNA sequences using a Bloom filter. BMC Bioinformatics 12(1), 333 (2011)
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39(1) (2007)
Philippe, N., Salson, M., Commes, T., Rivals, E.: CRAC: an integrated approach to read analysis. Genome Biology (in press, 2013)
Philippe, N., Salson, M., Lecroq, T., Léonard, M., Commes, T., Rivals, E.: Querying large read collections in main memory: a versatile data structure. BMC Bioinformatics 12, 242 (2011)
Rizk, G., Lavenier, D., Chikhi, R.: DSK: k-mer counting with very low memory usage. Bioinformatics, Advance Access (January 2013)
Salmela, L., Schröder, J.: Correcting errors in short reads by multiple alignments. Bioinformatics 27(11), 1455–1461 (2011)
Sirén, J.: Compressed Full-Text Indexes for Highly Repetitive Collections. PhD thesis, Dept. of Computer Science, Report A-2012-5, University of Helsinki (2012)
Willard, D.E.: Log-logarithmic worst-case range queries are possible in space Theta(N). Inf. Process. Lett. 17(2), 81–84 (1983)

Author information

Authors and Affiliations

Genome-Scale Biology Research Program, and Department of Medical Genetics, Faculty of Medicine, University of Helsinki, Finland
Niko Välimäki
LIRMM and Institut de Biologie Computationnelle, CNRS & Université Montpellier 2, France
Eric Rivals

Editor information

Editors and Affiliations

Computer Science, Georgia State University, 34 Peachtree Street, Suite 1410, 30303, Atlanta, GA, USA
Zhipeng Cai
Computer Science, Iowa State University, 50011, Ames, IA, USA
Oliver Eulenstein
Bioinformatics and Genomics, University of North Carolina at Charlotte, 9201 University City Blvd, Suite 315, 28223, Charlotte, NC, USA
Daniel Janies
Physiology and Neurobiology, University of Connecticut, 75 North Eagleville Road, Unit 3156, 06269, Storrs, CT, USA
Daniel Schwartz

About this paper

Cite this paper

Välimäki, N., Rivals, E. (2013). Scalable and Versatile k-mer Indexing for High-Throughput Sequencing Data. In: Cai, Z., Eulenstein, O., Janies, D., Schwartz, D. (eds) Bioinformatics Research and Applications. ISBRA 2013. Lecture Notes in Computer Science, vol 7875. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38036-5_24

DOI: https://doi.org/10.1007/978-3-642-38036-5_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38035-8
Online ISBN: 978-3-642-38036-5
eBook Packages: Computer Science, Computer Science (R0)
