Abstract
Many bioinformatics applications need the frequencies of all subsequences of length k (k-mers) in the short reads produced by DNA sequencing. We present an efficient parallel algorithm for k-mer counting based on a nested bucket sort, in which the partition sizes and the number of buckets per partition are precomputed. The algorithm is designed for multicore architectures and combines MPI (Open MPI) with POSIX threads. In our experiments on the genome of Drosophila melanogaster (SRX040485), it ran faster than the existing solutions we compared it against.
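To make the idea concrete, here is a minimal serial sketch of k-mer counting by bucketing. It is not the authors' implementation: it uses a single bucket level instead of the nested scheme, does not precompute bucket sizes, and involves neither MPI nor threads. The function names and the `prefix_len` parameter are hypothetical, chosen only for illustration.

```python
from itertools import groupby


def kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers_bucketed(reads, k, prefix_len=2):
    """Count k-mers by bucketing on a prefix, then sorting each bucket.

    prefix_len is an illustrative assumption; it decides how many
    buckets exist (at most 4**prefix_len for the DNA alphabet).
    """
    # Distribution pass: each k-mer goes to the bucket named by its prefix.
    buckets = {}
    for read in reads:
        for kmer in kmers(read, k):
            buckets.setdefault(kmer[:prefix_len], []).append(kmer)

    # Buckets are independent of each other, so this loop is the part
    # that a parallel version could hand out to separate workers.
    counts = {}
    for prefix in sorted(buckets):
        # After sorting, equal k-mers are adjacent and can be run-length counted.
        for kmer, run in groupby(sorted(buckets[prefix])):
            counts[kmer] = sum(1 for _ in run)
    return counts


reads = ["ACGTAC", "CGTACG"]
print(count_kmers_bucketed(reads, k=3))
# prints {'ACG': 2, 'CGT': 2, 'GTA': 2, 'TAC': 2}
```

Because every k-mer with a given prefix lands in the same bucket, no count is ever split across buckets, which is what makes the per-bucket work independent.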
Notes
- 1.
Here, sorting means classifying the data by some criterion, whereas ordering means arranging the data in non-increasing or non-decreasing order.
- 2.
The running time decreases because the final comparison-based \(O(n \log n)\) sort needs less time to sort k groups of l elements than one group of \(k \cdot l\) elements. The overall time complexity is therefore \(O(n + n \log \frac{n}{b})\), where b is the number of buckets. For reasonably high values of b, the complexity approaches \(O(n)\). Keep in mind that asymptotic complexity is a much less accurate guide than measured performance.
- 3.
Drosophila melanogaster (SRX040485) http://www.ebi.ac.uk/ena/data/view/SRX040485.
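The bound in note 2 can be checked with a short derivation, sketched here under the assumption (ours, not stated in the note) that the n elements spread evenly over the b buckets, so that each bucket holds \(\frac{n}{b}\) elements. Distributing the elements costs \(O(n)\), and each bucket is then sorted by a comparison sort:

\[ T(n) = O(n) + b \cdot O\!\left(\frac{n}{b} \log \frac{n}{b}\right) = O\!\left(n + n \log \frac{n}{b}\right). \]

If b grows proportionally to n, then \(\frac{n}{b}\) is bounded by a constant, the logarithmic term contributes only \(O(n)\), and the total is linear, as the note describes. With a skewed distribution the largest bucket dominates and the bound no longer holds in this form.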
Acknowledgements
This work was partially supported by the Institute of Informatics and Software Engineering, FIIT STU, Intelligent analysis of big data by semantic-oriented and bio-inspired methods in parallel environment, the Scientific Grant Agency of the Slovak Republic, grant No. VG 1/0752/14, project DNApuzzleDNA, FIIT STU, which allowed us to use high-performance computing on the STU cluster (project number 26230120002), and by the Research and Development Operational Programme as part of the project “International Centre of Excellence for Research of Intelligent and Secure Information-Communication Technologies and Systems”, ITMS 26240120039.
Copyright information
© 2016 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Farkaš, T., Kubán, P., Lucká, M. (2016). Effective Parallel Multicore-Optimized K-mers Counting Algorithm. In: Freivalds, R., Engels, G., Catania, B. (eds) SOFSEM 2016: Theory and Practice of Computer Science. SOFSEM 2016. Lecture Notes in Computer Science, vol 9587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49192-8_38
DOI: https://doi.org/10.1007/978-3-662-49192-8_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-49191-1
Online ISBN: 978-3-662-49192-8
eBook Packages: Computer Science, Computer Science (R0)