research-article

Efficient processing of k nearest neighbor joins using MapReduce

Published: 01 June 2012

Abstract

The k nearest neighbor join (kNN join), which finds the k nearest neighbors from a dataset S for every object in another dataset R, is a primitive operation widely adopted by many data mining applications. As a combination of the k nearest neighbor query and the join operation, kNN join is expensive. Given the increasing volume of data, it is difficult to perform a kNN join efficiently on a centralized machine. In this paper, we investigate how to perform kNN join using MapReduce, a well-accepted framework for data-intensive applications over clusters of computers. In brief, the mappers cluster objects into groups; the reducers then perform the kNN join on each group of objects separately. We design an effective mapping mechanism that exploits pruning rules for distance filtering, and hence reduces both the shuffling and computational costs. To reduce the shuffling cost further, we propose two approximate algorithms that minimize the number of replicas. Extensive experiments on our in-house cluster demonstrate that our proposed methods are efficient, robust and scalable.
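The group-then-join idea in the abstract can be illustrated with a small single-process sketch. It is not the paper's algorithm: the pivot-based grouping, the function names (`brute_force_knn_join`, `grouped_knn_join`), and the specific filter are assumptions chosen for illustration. The filter is a plain triangle-inequality bound: if a group has pivot p and radius U (the largest distance from p to any R object in the group), and d_k is the k-th smallest distance from p to objects of S, then no object s with dist(s, p) > 2U + d_k can be among the k nearest neighbors of any R object in that group, so it need not be shuffled to that group.

```python
import heapq
import math
from collections import defaultdict


def brute_force_knn_join(R, S, k):
    """Reference definition: for every r in R, indices of its k nearest objects in S."""
    return {
        i: heapq.nsmallest(k, range(len(S)), key=lambda j: math.dist(r, S[j]))
        for i, r in enumerate(R)
    }


def grouped_knn_join(R, S, k, pivots):
    """Simulated map/reduce kNN join (illustrative; assumes len(S) >= k >= 1).

    Returns (result, shuffled), where shuffled counts the S replicas
    sent to groups, a rough stand-in for shuffling cost.
    """
    # "Map" step for R: assign each object to the group of its nearest pivot.
    groups_r = defaultdict(list)
    for i, r in enumerate(R):
        g = min(range(len(pivots)), key=lambda p: math.dist(r, pivots[p]))
        groups_r[g].append(i)

    # "Map" step for S: replicate an object to a group only if the
    # triangle-inequality bound cannot rule it out.
    groups_s = {}
    for g, members in groups_r.items():
        radius = max(math.dist(R[i], pivots[g]) for i in members)
        d_to_pivot = [math.dist(s, pivots[g]) for s in S]
        kth = heapq.nsmallest(k, d_to_pivot)[-1]
        limit = 2 * radius + kth  # hypothetical filter, see lead-in
        groups_s[g] = [j for j, d in enumerate(d_to_pivot) if d <= limit]

    # "Reduce" step: an independent kNN join inside each group.
    result = {}
    for g, members in groups_r.items():
        for i in members:
            result[i] = heapq.nsmallest(
                k, groups_s[g], key=lambda j: math.dist(R[i], S[j])
            )

    shuffled = sum(len(candidates) for candidates in groups_s.values())
    return result, shuffled
```

Calling `grouped_knn_join(R, S, k, pivots)` with lists of coordinate tuples returns the same neighbor indices as `brute_force_knn_join(R, S, k)`, while `shuffled` shows how many copies of S objects the filter let through compared with replicating all of S to every group.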



Published in

Proceedings of the VLDB Endowment, Volume 5, Issue 10
June 2012
180 pages

Publisher: VLDB Endowment
