Abstract
Suffix trees, which are trie structures that represent the suffixes of sequences (e.g., strings), are widely used for sequence search in application domains such as text data mining, bioinformatics, and computational biology. They are particularly useful in bioinformatics applications because they can efficiently search for similar subsequences and extract frequent sequence patterns. In recent years, efficiently constructing suffix trees that support fast sequence searches has become one of the most important challenges, because the number and size of the data sets stored in sequence databases have been increasing exponentially. This paper proposes a novel parallelization model for approximate sequence matching on a multi-core CPU that uses disk-based suffix trees, which are built on hard disks rather than in main memory. In the proposed parallelization model, we divide an entire sequence database into two or more sub-databases called partitions. For each partition, we build a disk-based suffix tree, and we define a task as one approximate sequence matching operation on one disk-based suffix tree. Moreover, the proposed parallelization model includes a multiple buffering management system to avoid conflicts among CPU cores. We evaluated the proposed parallelization model using a real amino acid sequence database on a PC. The experimental results show a substantial improvement in computation performance.
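The partition-and-task scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: a brute-force Hamming-distance scan stands in for traversing a disk-based suffix tree, round-robin assignment stands in for the paper's partitioning, and a private result list per task stands in for the multiple buffering manager. All function names (partition, hamming_hits, search_partition, parallel_search) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split the database into n_parts sub-databases (round-robin; an assumption)."""
    return [items[i::n_parts] for i in range(n_parts)]


def hamming_hits(seq, pattern, max_mismatches):
    """Return start offsets where pattern matches seq with at most
    max_mismatches substitutions (a simple form of approximate matching)."""
    m = len(pattern)
    hits = []
    for start in range(len(seq) - m + 1):
        window = seq[start:start + m]
        mismatches = sum(1 for a, b in zip(window, pattern) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


def search_partition(task):
    """One task: approximate matching over one partition.

    In the paper this would query one disk-based suffix tree; here the
    partition is scanned directly as a stand-in.
    """
    part, pattern, max_mismatches = task
    buffer = []  # private to this task, so workers never share a buffer
    for seq_id, seq in part:
        for offset in hamming_hits(seq, pattern, max_mismatches):
            buffer.append((seq_id, offset))
    return buffer


def parallel_search(sequences, pattern, max_mismatches, n_parts=2):
    """Run one task per partition and merge the per-task buffers."""
    indexed = list(enumerate(sequences))  # keep the original sequence ids
    tasks = [(part, pattern, max_mismatches) for part in partition(indexed, n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        merged = [hit for hits in pool.map(search_partition, tasks) for hit in hits]
    return sorted(merged)


if __name__ == "__main__":
    db = ["MKTAYIAK", "AKTAYLLK", "GGGGGGGG", "TAYTAY"]
    print(parallel_search(db, "TAY", max_mismatches=0, n_parts=2))
```

A thread pool is used here only to keep the sketch portable. In CPython, CPU-bound tasks would need a process pool (or a native implementation) to actually occupy several cores.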
Author information
Additional information
Y. Watanuki, K. Tamura, H. Kitakami, and Y. Takahashi are with the Graduate School of Information Sciences, Hiroshima City University, 3-4-1, Ozuka-Higashi, Asa-Minami-Ku, Hiroshima 731-3194, Japan (corresponding e-mail: ktamura@hiroshima-cu.ac.jp).
This article is distributed under the terms of the Creative Commons Attribution License, which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
Yousuke Watanuki is a student at the Department of Intelligent Systems, Graduate School of Information Sciences, Hiroshima City University, Hiroshima, Japan. His research interests include suffix trees and parallel computing.
Keiichi Tamura received his B.Eng., M.Eng., and Ph.D. degrees in Information Science from Kyushu University, Fukuoka, Japan, in 1998, 2000, and 2005, respectively. He is presently an Associate Professor at the Department of Intelligent Systems, Graduate School of Information Sciences, Hiroshima City University, Hiroshima, Japan. He has served on the organizing committee of the IEEE SMC Hiroshima Chapter since 2012. His research interests include parallel computing, data engineering, data mining, and evolutionary computation. He is a member of the IEEE, the Information Processing Society of Japan, the Database Society of Japan, the Japanese Society for Artificial Intelligence, and the Japan Society for Fuzzy Theory and Intelligent Informatics.
Hajime Kitakami has been a Professor in the Department of Intelligent Systems, Graduate School of Information Sciences, Hiroshima City University, Japan, since 1994. He received his M.Eng. from Tohoku University in 1976 and his Ph.D. in engineering from Kyushu University in 1992. His paper received the 25th Anniversary Best Paper Award of the Information Processing Society of Japan (IPSJ) in 1985. He received a Paper Award from the Japanese Society for Engineering Education (JSEE) in 2003. His research interests include databases, data mining, distributed parallel processing, and bioinformatics. He has been an editorial board member for Transactions on Mathematical Modeling and its Applications (TOM), Journal of the Information Processing Society of Japan (IPSJ), since 2006, and for the Journal of the Database Society of Japan (DBSJ) since 2008.
Yoshifumi Takahashi received a Master of Information Engineering degree from Hiroshima City University, Japan, in 2010. He is now a doctoral student in the Graduate School of Information Sciences, Hiroshima City University.
Rights and permissions
Open Access. This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Tamura, K., Watanuki, Y., Kitakami, H. et al. Multiple Buffering for Parallel Approximate Sequence Matching using Disk-based Suffix Tree on Multi-core CPU. GSTF J Comput 3, 22 (2013). https://doi.org/10.7603/s40601-013-0022-0
DOI: https://doi.org/10.7603/s40601-013-0022-0