Abstract
The k nearest neighbor join (kNN join), which finds the k nearest neighbors in a dataset S for every object in another dataset R, is a primitive operation widely adopted by many data mining applications. As a combination of the k nearest neighbor query and the join operation, the kNN join is expensive. Given the increasing volume of data, it is difficult to perform a kNN join efficiently on a single centralized machine. In this paper, we investigate how to perform the kNN join using MapReduce, a well-accepted framework for data-intensive applications over clusters of computers. In brief, the mappers cluster objects into groups, and the reducers perform the kNN join on each group of objects separately. We design an effective mapping mechanism that exploits pruning rules for distance filtering, and hence reduces both the shuffling and computational costs. To reduce the shuffling cost further, we propose two approximate algorithms that minimize the number of replicas. Extensive experiments on our in-house cluster demonstrate that our proposed methods are efficient, robust and scalable.
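The map-then-reduce structure described above can be illustrated with a small in-process sketch. It is not the paper's algorithm: the abstract does not state how groups are formed or which pruning rules are used. The sketch assumes that each object of R is assigned to its closest pivot, and that an object of S is replicated to a group only if it passes a simple triangle-inequality bound. The names `knn_join_partitioned`, `knn_join_bruteforce`, and `pivots` are hypothetical, points are plain tuples, and S is assumed to hold at least k objects.

```python
import heapq
import math
from collections import defaultdict


def knn_join_bruteforce(R, S, k):
    """Reference answer: for each r in R, the k closest objects of S."""
    return {r: heapq.nsmallest(k, S, key=lambda s: math.dist(r, s)) for r in R}


def knn_join_partitioned(R, S, k, pivots):
    """Single-process imitation of a map/reduce kNN join (illustrative only)."""
    # "Map" step for R: send each r to the group of its closest pivot.
    groups_r = defaultdict(list)
    for r in R:
        i = min(range(len(pivots)), key=lambda j: math.dist(r, pivots[j]))
        groups_r[i].append(r)

    # "Map" step for S, with an assumed pruning rule (not taken from the paper).
    # For group i with pivot p, let radius = max distance from p to a member r,
    # and dk = distance from p to its k-th nearest object of S. Every member r
    # then has k objects of S within radius + dk, so an s that can appear in
    # some member's answer satisfies dist(s, p) <= 2 * radius + dk.
    groups_s = {}
    for i, members in groups_r.items():
        p = pivots[i]
        radius = max(math.dist(r, p) for r in members)
        dk = heapq.nsmallest(k, (math.dist(s, p) for s in S))[-1]
        bound = 2 * radius + dk
        groups_s[i] = [s for s in S if math.dist(s, p) <= bound]

    # "Reduce" step: an independent kNN join inside each group.
    result = {}
    for i, members in groups_r.items():
        result.update(knn_join_bruteforce(members, groups_s[i], k))
    return result
```

Calling `knn_join_partitioned(R, S, 3, pivots)` with pivots sampled from R should return the same neighbors as `knn_join_bruteforce(R, S, 3)`, while each group compares its members against only the replicated subset of S. In a real MapReduce job the size of that subset corresponds to the shuffling cost the abstract refers to.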