An Adaptive Similarity Search in Massive Datasets

Phan, Trong Nhan; Küng, Josef; Dang, Tran Khanh

doi:10.1007/978-3-662-49175-1_3

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 9480)

Abstract

Similarity search is an important task in many fields of study and application domains. The era of big data, however, poses challenges for existing information systems in general and for similarity search in particular. Aiming at large-scale data processing, we propose an adaptive similarity search scheme for massive datasets built on MapReduce. The scheme is applicable and adaptable to popular similarity search cases such as pairwise similarity, search-by-example, range queries, and k-Nearest Neighbour queries. We embed collaborative refinements that reduce the number of irrelevant data objects and avoid unnecessary computations. We also evaluate the proposed methods with two different document models, shingles and terms. Finally, we conduct intensive empirical experiments on real datasets, both to verify the methods themselves and to compare them with previous related work. The results confirm the effectiveness of the proposed methods and show that they outperform the previous work in terms of query processing.
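To make the query types and document models named in the abstract concrete, the following is a minimal single-machine sketch. It is not the authors' MapReduce scheme, and it contains none of their refinements. The choice of Jaccard similarity, the word-level shingle size, and all function names (shingles, terms, jaccard, range_query, knn_query) are assumptions made only for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple

Model = Callable[[str], Set[str]]


def shingles(text: str, k: int = 3) -> Set[str]:
    """Word-level k-shingles of a text (assumed model; k is illustrative)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def terms(text: str) -> Set[str]:
    """Term model: the set of distinct lowercased words."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b| (assumed measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def _score_all(query: str, corpus: Dict[str, str],
               model: Model) -> List[Tuple[str, float]]:
    """Score every document against the query, best match first."""
    q = model(query)
    scored = [(doc_id, jaccard(q, model(text)))
              for doc_id, text in corpus.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def range_query(query: str, corpus: Dict[str, str], threshold: float,
                model: Model = terms) -> List[Tuple[str, float]]:
    """All documents whose similarity to the query is >= threshold."""
    return [(d, s) for d, s in _score_all(query, corpus, model)
            if s >= threshold]


def knn_query(query: str, corpus: Dict[str, str], k: int,
              model: Model = terms) -> List[Tuple[str, float]]:
    """The k documents most similar to the query."""
    return _score_all(query, corpus, model)[:k]


if __name__ == "__main__":
    corpus = {
        "d1": "the quick brown fox",
        "d2": "the quick brown dog",
        "d3": "a slow green turtle",
    }
    query = "the quick brown fox jumps"
    print(range_query(query, corpus, threshold=0.5))
    print(knn_query(query, corpus, k=1, model=shingles))
```

This brute-force version compares the query against every document. A distributed scheme such as the one the abstract describes would aim to avoid most of those comparisons.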

Acknowledgements

We would like to thank Mr. Faruk Kujundžić of the Information Management team, Johannes Kepler University Linz, for kindly supporting us with the Alex cluster.

Author information

Authors and Affiliations

Institute for Application Oriented Knowledge Processing, Johannes Kepler University Linz, Linz, Austria
Trong Nhan Phan & Josef Küng
Faculty of Computer Science and Engineering, HCMC University of Technology, Ho Chi Minh City, Vietnam
Tran Khanh Dang

Corresponding author

Correspondence to Trong Nhan Phan.

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
FAW, University of Linz, Linz, Austria
Josef Küng
FAW, University of Linz, Linz, Austria
Roland Wagner
Ho Chi Minh City Univ. of Technology, Ho Chi Minh City, Vietnam
Tran Khanh Dang
Ho Chi Minh City Univ. of Technology, Ho Chi Minh City, Vietnam
Nam Thoai

About this chapter

Cite this chapter

Phan, T.N., Küng, J., Dang, T.K. (2016). An Adaptive Similarity Search in Massive Datasets. In: Hameurlain, A., Küng, J., Wagner, R., Dang, T., Thoai, N. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIII. Lecture Notes in Computer Science, vol 9480. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49175-1_3

DOI: https://doi.org/10.1007/978-3-662-49175-1_3
Published: 01 January 2016
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-49174-4
Online ISBN: 978-3-662-49175-1
eBook Packages: Computer Science (R0)
