Skip to main content

Scalable and Versatile k-mer Indexing for High-Throughput Sequencing Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 7875))

Abstract

Philippe et al. (2011) proposed a data structure called Gk arrays for indexing and querying large collections of high-throughput sequencing data in main-memory. The data structure supports versatile queries for counting, locating, and analysing the coverage profile of k-mers in short-read data. The main drawback of the Gk arrays is its space-consumption, which can easily reach tens of gigabytes of main-memory even for moderate size inputs. We propose a compressed variant of Gk arrays that supports the same set of queries, but in both near-optimal time and space. In practice, the compressed Gk arrays scale up to much larger inputs with highly competitive query times compared to its non-compressed predecessor. The main applications include variant calling, error correction, coverage profiling, and sequence assembly.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Bauer, M.J., Cox, A.J., Rosone, G.: Lightweight BWT construction for very large string collections. In: Giancarlo, R., Manzini, G. (eds.) CPM 2011. LNCS, vol. 6661, pp. 219–231. Springer, Heidelberg (2011)

    Chapter  Google Scholar 

  2. Beller, T., Gog, S., Ohlebusch, E., Schnattinger, T.: Computing the longest common prefix array based on the burrows-wheeler transform. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds.) SPIRE 2011. LNCS, vol. 7024, pp. 197–208. Springer, Heidelberg (2011)

    Chapter  Google Scholar 

  3. Burkhardt, S., Crauser, A., Ferragina, P., Lenhof, H.-P., Rivals, E., Vingron, M.: q-gram Based Database Searching Using a Suffix Array (QUASAR). In: 3rd Int. Conf. on Computational Molecular Biology, pp. 77–83. ACM Press (1999)

    Google Scholar 

  4. Chikhi, R., Lavenier, D.: Localized genome assembly from reads to scaffolds: Practical traversal of the paired string graph. In: Przytycka, T.M., Sagot, M.-F. (eds.) WABI 2011. LNCS, vol. 6833, pp. 39–48. Springer, Heidelberg (2011)

    Chapter  Google Scholar 

  5. Claude, F., Fariña, A., Martínez-Prieto, M.A., Navarro, G.: Compressed q-gram indexing for highly repetitive biological sequences. In: Proc. 10th IEEE Intl. Conf. on Bioinformatics and Bioengineering, pp. 86–91 (2010)

    Google Scholar 

  6. Conway, T.C., Bromage, A.J.: Succinct Data Structures for Assembling Large Genomes. Bioinformatics 27(4), 479–486 (2011)

    Article  Google Scholar 

  7. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proc. 41st Annual Symposium on Foundations of Computer Science (FOCS), pp. 390–398. IEEE Computer Society (2000)

    Google Scholar 

  8. Fischer, J., Mäkinen, V., Navarro, G.: Faster entropy-bounded compressed suffix trees. Theor. Comput. Sci. 410(51), 5354–5364 (2009)

    Article  MATH  Google Scholar 

  9. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: 14th Ann. ACM-SIAM Symp. on Discrete Algorithms, pp. 841–850 (2003)

    Google Scholar 

  10. Hazelhurst, S., Lipták, Z.: Kaboom! a new suffix array based algorithm for clustering expression data. Bioinformatics 27(24), 3348–3355 (2011)

    Article  Google Scholar 

  11. Hon, W.-K., Lam, T.-W., Sadakane, K., Sung, W.-K., Yiu, S.-M.: A space and time efficient algorithm for constructing compressed suffix arrays. Algorithmica 48(1), 23–36 (2007)

    Article  MathSciNet  MATH  Google Scholar 

  12. Hon, W.-K., Sadakane, K.: Space-economical algorithms for finding maximal unique matches. In: Apostolico, A., Takeda, M. (eds.) CPM 2002. LNCS, vol. 2373, pp. 144–152. Springer, Heidelberg (2002)

    Chapter  Google Scholar 

  13. Hon, W.-K., Shah, R., Vitter, J.S.: Space-efficient framework for top-k string retrieval problems. In: FOCS, pp. 713–722. IEEE Computer Society (2009)

    Google Scholar 

  14. Jacobson, G.: Succinct Static Data Structures. PhD thesis, Carnegie–Mellon (1989)

    Google Scholar 

  15. Li, H.: Implementation of BCR, https://github.com/lh3/ropebwt

  16. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14), 1754–1760 (2009)

    Article  Google Scholar 

  17. Marçais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6), 764–770 (2011)

    Article  Google Scholar 

  18. Melsted, P., Pritchard, J.: Efficient counting of k-mers in dna sequences using a bloom filter. BMC Bioinformatics 12(1), 333 (2011)

    Article  Google Scholar 

  19. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39(1) (2007)

    Google Scholar 

  20. Philippe, N., Salson, M., Commes, T., Rivals, E.: CRAC: an integrated approach to read analysis. Genome Biology (in press, 2013)

    Google Scholar 

  21. Philippe, N., Salson, M., Lecroq, T., Léonard, M., Commes, T., Rivals, E.: Querying large read collections in main memory: a versatile data structure. BMC Bioinformatics 12, 242 (2011)

    Google Scholar 

  22. Rizk, G., Lavenier, D., Chikhi, R.: DSK: k-mer counting with very low memory usage. Bioinformatics, page Advance access (January 2013)

    Google Scholar 

  23. Salmela, L., Schröder, J.: Correcting errors in short reads by multiple alignments. Bioinformatics 27(11), 1455–1461 (2011)

    Article  Google Scholar 

  24. Sirén, J.: Compressed Full-Text Indexes for Highly Repetitive Collections. PhD thesis, Dept. of Computer Science, Report A-2012-5, University of Helsinki (2012)

    Google Scholar 

  25. Willard, D.E.: Log-logarithmic worst-case range queries are possible in space Theta(N). Inf. Process. Lett. 17(2), 81–84 (1983)

    Article  MathSciNet  MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Välimäki, N., Rivals, E. (2013). Scalable and Versatile k-mer Indexing for High-Throughput Sequencing Data. In: Cai, Z., Eulenstein, O., Janies, D., Schwartz, D. (eds) Bioinformatics Research and Applications. ISBRA 2013. Lecture Notes in Computer Science(), vol 7875. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38036-5_24

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-38036-5_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38035-8

  • Online ISBN: 978-3-642-38036-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics