An Adaptive Similarity Search in Massive Datasets

Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIII

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 9480)

Abstract

Similarity search is an important task across many fields of study and application domains. The era of big data, however, poses challenges for existing information systems in general and for similarity search in particular. Aiming at large-scale data processing, we propose an adaptive similarity search scheme for massive datasets with MapReduce. The scheme is applicable and adaptable to popular similarity search cases such as pairwise similarity, search-by-example, range queries, and k-Nearest Neighbour queries. We embed collaborative refinements that effectively minimize irrelevant data objects and unnecessary computations. We evaluate the proposed methods with two document models, shingles and terms. Finally, we conduct intensive empirical experiments on real datasets, both to verify the methods themselves and to compare them with a previous related work. The results confirm the effectiveness of the proposed methods and show that they outperform the previous work in terms of query processing.
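
To make the terminology in the abstract concrete, the following is a minimal single-machine sketch of the two document models (shingles and terms) and of range and k-Nearest Neighbour queries. It is not the chapter's MapReduce scheme. The use of Jaccard similarity, character-level shingles of length 3, and all function names are assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple


def char_shingles(text: str, k: int = 3) -> Set[str]:
    """Shingle model (assumed here: overlapping k-character substrings)."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < k:
        return {normalized} if normalized else set()
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def terms(text: str) -> Set[str]:
    """Term model: the set of whitespace-separated, lower-cased words."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b| (an assumed measure, not the paper's)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def range_query(query: Set[str], corpus: Dict[str, Set[str]],
                threshold: float) -> List[Tuple[str, float]]:
    """Return every document whose similarity to the query is >= threshold."""
    hits = [(doc_id, jaccard(query, feats)) for doc_id, feats in corpus.items()]
    hits = [h for h in hits if h[1] >= threshold]
    # Sort by descending similarity, breaking ties by document id.
    return sorted(hits, key=lambda h: (-h[1], h[0]))


def knn_query(query: Set[str], corpus: Dict[str, Set[str]],
              k: int) -> List[Tuple[str, float]]:
    """Return the k documents most similar to the query."""
    scored = [(doc_id, jaccard(query, feats)) for doc_id, feats in corpus.items()]
    return sorted(scored, key=lambda h: (-h[1], h[0]))[:k]
```

In this sketch, search-by-example corresponds to calling either query function with the features of one example document, and pairwise similarity corresponds to evaluating `jaccard` over all document pairs. Switching between the two document models only changes which feature extractor builds the corpus.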



Acknowledgements

We would like to thank Mr. Faruk Kujundžić, Information Management team, Johannes Kepler University Linz, for kindly supporting us with the Alex Cluster.

Corresponding author

Correspondence to Trong Nhan Phan.

Copyright information

© 2016 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Phan, T.N., Küng, J., Dang, T.K. (2016). An Adaptive Similarity Search in Massive Datasets. In: Hameurlain, A., Küng, J., Wagner, R., Dang, T., Thoai, N. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIII. Lecture Notes in Computer Science, vol 9480. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49175-1_3

  • DOI: https://doi.org/10.1007/978-3-662-49175-1_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-49174-4

  • Online ISBN: 978-3-662-49175-1

  • eBook Packages: Computer Science, Computer Science (R0)
