Abstract
The similarity join has become an important database primitive for supporting similarity search and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Two types of similarity join are well known: the distance range join, in which the user defines a distance threshold for the join, and the closest point query or k-distance join, which retrieves the k most similar pairs. In this paper, we propose an important third similarity join operation, the k-nearest neighbor join (k-nn join), which combines each point of one point set with its k nearest neighbors in the other set. We find that many standard algorithms of Knowledge Discovery in Databases (KDD), such as k-means and k-medoid clustering, nearest neighbor classification, data cleansing, and postprocessing of sampling-based data mining, can be implemented on top of the k-nn join operation to achieve performance improvements without affecting the quality of their results. Our list of possible applications includes standard methods for all stages of the KDD process: preprocessing, data mining, and postprocessing. Thus, our method accelerates the complete KDD process.
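To make the difference between the three join operations concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm proposed in the paper: the function names, the use of Euclidean distance, and the sample points are assumptions chosen for illustration, and each function simply scans all pairs.

```python
import heapq
import math

def distance_range_join(r, s, eps):
    """All pairs (p, q) from r x s whose distance is at most eps."""
    return [(p, q) for p in r for q in s if math.dist(p, q) <= eps]

def k_distance_join(r, s, k):
    """The k closest pairs over the whole cross product r x s."""
    all_pairs = ((p, q) for p in r for q in s)
    return heapq.nsmallest(k, all_pairs, key=lambda pq: math.dist(pq[0], pq[1]))

def knn_join(r, s, k):
    """For every point p in r, the pairs (p, q) with q among p's k nearest points in s."""
    return [
        (p, q)
        for p in r
        for q in heapq.nsmallest(k, s, key=lambda c: math.dist(p, c))
    ]

# Illustrative data (assumed, not taken from the paper).
r = [(0.0, 0.0), (5.0, 5.0)]
s = [(1.0, 0.0), (0.0, 2.0), (4.0, 5.0), (9.0, 9.0)]

range_pairs = distance_range_join(r, s, eps=1.5)
closest_pairs = k_distance_join(r, s, k=2)
knn_pairs = knn_join(r, s, k=2)
```

Note that the k-nn join always returns exactly k partners for each point of r (provided s holds at least k points), whereas the result sizes of the other two joins depend on the threshold or on a single global k. Under this reading, the assignment step of k-means could be phrased as a k-nn join with k = 1 between the data points and the current centroids.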
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Böhm, C., Krebs, F. (2003). Supporting KDD Applications by the k-Nearest Neighbor Join. In: Mařík, V., Retschitzegger, W., Štěpánková, O. (eds) Database and Expert Systems Applications. DEXA 2003. Lecture Notes in Computer Science, vol 2736. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45227-0_50
DOI: https://doi.org/10.1007/978-3-540-45227-0_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40806-2
Online ISBN: 978-3-540-45227-0
eBook Packages: Springer Book Archive