Efficient processing of k nearest neighbor joins using MapReduce

Authors:
Wei Lu, Yanyan Shen, Su Chen, Beng Chin Ooi
National University of Singapore

Proceedings of the VLDB Endowment, Volume 5, Issue 10, pp. 1016–1027. https://doi.org/10.14778/2336664.2336674

Published: 01 June 2012

Abstract

The k nearest neighbor join (kNN join), which finds the k nearest neighbors in a dataset S for every object in another dataset R, is a primitive operation widely adopted by many data mining applications. As a combination of the k nearest neighbor query and the join operation, kNN join is expensive. Given the increasing volume of data, it is difficult to perform a kNN join efficiently on a single centralized machine. In this paper, we investigate how to perform kNN join using MapReduce, a well-accepted framework for data-intensive applications over clusters of computers. In brief, the mappers cluster objects into groups, and the reducers perform the kNN join on each group of objects separately. We design an effective mapping mechanism that exploits pruning rules for distance filtering, and hence reduces both the shuffling and computational costs. To reduce the shuffling cost further, we propose two approximate algorithms to minimize the number of replicas. Extensive experiments on our in-house cluster demonstrate that our proposed methods are efficient, robust and scalable.
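
The group-then-join idea in the abstract can be illustrated with a small single-process sketch. This is not the paper's algorithm: the pivot-based grouping, the function names (`nearest_pivot`, `knn_join`, `brute_force`), and the specific pruning bound are assumptions chosen for illustration, and the map and reduce phases are simulated with plain loops rather than a MapReduce runtime. The pruning bound used here is a simple triangle-inequality argument: if U is the largest distance from a group's pivot to any of its R objects, and kth is the k-th smallest distance from that pivot to any S object, then every true neighbor of the group lies within 2U + kth of the pivot. The sketch assumes S contains at least k objects.

```python
import heapq
import math
from collections import defaultdict


def nearest_pivot(point, pivots):
    """Index of the pivot closest to `point` (ties go to the lower index)."""
    return min(range(len(pivots)), key=lambda i: math.dist(point, pivots[i]))


def knn_join(R, S, k, pivots):
    """Return {r_index: [s_index, ...]}: the k nearest S objects per R object.

    Hypothetical sketch: R objects are grouped by nearest pivot ("map"),
    then each group is joined against a pruned subset of S ("reduce").
    """
    # "Map" side: assign every R object to the group of its closest pivot.
    groups = defaultdict(list)
    for rid, r in enumerate(R):
        groups[nearest_pivot(r, pivots)].append(rid)

    result = {}
    for gid, rids in groups.items():
        pivot = pivots[gid]
        # U: radius of this R group around its pivot.
        U = max(math.dist(R[rid], pivot) for rid in rids)
        s_to_pivot = [math.dist(s, pivot) for s in S]
        # kth: k-th smallest pivot-to-S distance, so at least k S objects
        # lie within U + kth of every R object in the group.
        kth = heapq.nsmallest(k, s_to_pivot)[-1]
        theta = U + kth  # upper bound on any kNN distance in this group
        # Distance filter: an S object farther than theta + U from the
        # pivot cannot be a neighbor of any R object in this group.
        # The small slack guards against floating-point rounding.
        candidates = [sid for sid, d in enumerate(s_to_pivot)
                      if d <= theta + U + 1e-9]
        # "Reduce" side: exact kNN restricted to the surviving candidates.
        for rid in rids:
            r = R[rid]
            result[rid] = heapq.nsmallest(
                k, candidates, key=lambda sid: (math.dist(r, S[sid]), sid))
    return result


def brute_force(R, S, k):
    """Reference answer: compare every R object against every S object."""
    return {
        rid: heapq.nsmallest(
            k, range(len(S)), key=lambda sid: (math.dist(r, S[sid]), sid))
        for rid, r in enumerate(R)
    }
```

Calling `knn_join(R, S, k, pivots)` on lists of coordinate tuples returns the same mapping as `brute_force(R, S, k)`, while each group only examines the S objects that pass the distance filter. In a real deployment the filter would decide which S objects are shuffled to which reducer, which is where a tighter bound pays off.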

Published in

Proceedings of the VLDB Endowment, Volume 5, Issue 10
June 2012, 180 pages
ISSN: 2150-8097
Publisher: VLDB Endowment
