doi:10.1016/j.csda.2005.10.001
Copyright © 2005 Elsevier B.V. All rights reserved.
KNN-kernel density-based clustering for high-dimensional multivariate data
Thanh N. Tran, Ron Wehrens and Lutgarde M.C. Buydens
Analytical Chemistry, Institute for Molecules and Materials, Radboud University Nijmegen, Nijmegen, The Netherlands
Received 17 December 2004;
revised 3 October 2005;
accepted 3 October 2005.
Available online 24 October 2005.
Abstract
Density-based clustering algorithms for multivariate data often have difficulties with high-dimensional data and clusters of very different densities. A new density-based clustering algorithm, called KNNCLUST, is presented in this paper that is able to tackle these situations. It is based on the combination of nonparametric k-nearest-neighbor (KNN) and kernel (KNN-kernel) density estimation. The KNN-kernel density estimation technique makes it possible to model clusters of different densities in high-dimensional data sets. Moreover, the number of clusters is identified automatically by the algorithm. KNNCLUST is tested using simulated data and applied to a multispectral compact airborne spectrographic imager (CASI) image of a floodplain in the Netherlands to illustrate the characteristics of the method.
Keywords: Multivariate data; Classification; Clustering
Fig. 1. Triangular and Gaussian kernels.
Fig. 2. KNN and KNN-kernel density estimates for a sample data set of 500 objects generated from one Gaussian distribution (mean=0 and σ=1), with k=100. The dotted line is the theoretical pdf for the data set.
Fig. 3. Density estimation functions for the data set of two classes of different densities.
Fig. 4. A simple example showing how KNNCLUST works with k=2. Each row plots the objects at one particular step of the process, with their indices and with symbols illustrating cluster membership. The symbols *, o, x, □, and Δ stand for cluster membership; objects belonging to the same cluster have the same symbol.
Fig. 5. The simulated data set. Class one is a mixture of two Gaussians; the other three classes are generated from three single Gaussian distributions with very different cluster densities.
Fig. 6. Clustering result by KNNCLUST with k=180; the total accuracy is 95.9%.
Fig. 7. DBSCAN (a) min-points=10, ε=20; (b) min-points=20, ε=950.
Fig. 8. The best of 100 runs of EM with (a) 4 clusters; (b) 5 clusters.
Fig. 9. The gray-scale images of the first two principal components (PC1 on the left, PC2 on the right), and the six main object classes that have been identified in the area.
Fig. 10. Score plot of PC1 and PC2.
Fig. 11. Score plots of the first two PCs for DBSCAN with min-points=25: (a) 8 clusters found with ε=300, (b) 6 clusters found with ε=400, (c) 4 clusters found with ε=500 and (d) 2 clusters found with ε=900.
Fig. 12. Score plots of the first two PCs and result images of six clusters obtained by (a) K-means (the best of 100 runs); (b) EM (the best of 100 runs) and (c) KNNCLUST with k=550.
Fig. 13. Compactness index of K-means in 100 runs compared to the index of the KNNCLUST result.
Table 1.
Commonly used univariate kernels; where z=(x-xi)./H
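The kernels of Table 1 and the setting of Fig. 2 (500 objects from one Gaussian, k=100) can be illustrated with a minimal one-dimensional sketch. It is not the authors' implementation: it assumes the bandwidth H at each evaluation point is the distance to its k-th nearest neighbor, which is one common way to combine KNN and kernel density estimation. All function names are hypothetical, and only the triangular and Gaussian kernels are shown.

```python
import numpy as np


def gaussian_kernel(z):
    """Standard Gaussian kernel, evaluated at z = (x - x_i) / H."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)


def triangular_kernel(z):
    """Triangular kernel: 1 - |z| inside [-1, 1], zero outside."""
    return np.clip(1.0 - np.abs(z), 0.0, None)


def kth_neighbor_distance(x_eval, data, k):
    """Distance from each evaluation point to its k-th nearest sample."""
    dists = np.abs(x_eval[:, None] - data[None, :])
    return np.partition(dists, k - 1, axis=1)[:, k - 1]


def knn_density(x_eval, data, k):
    """Plain KNN estimate in 1-D: k / (n * length of the interval 2 * r_k)."""
    r_k = kth_neighbor_distance(x_eval, data, k)
    return k / (len(data) * 2.0 * r_k)


def knn_kernel_density(x_eval, data, k, kernel=gaussian_kernel):
    """Kernel estimate whose bandwidth H is the k-th neighbor distance.

    Assumption of this sketch: H is chosen per evaluation point x.
    """
    h = kth_neighbor_distance(x_eval, data, k)
    z = (x_eval[:, None] - data[None, :]) / h[:, None]
    return kernel(z).sum(axis=1) / (len(data) * h)


# Setting modeled on Fig. 2: 500 objects from N(0, 1), k = 100.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=500)
grid = np.linspace(-3.0, 3.0, 61)

f_knn = knn_density(grid, data, k=100)
f_knn_gauss = knn_kernel_density(grid, data, k=100)
f_knn_tri = knn_kernel_density(grid, data, k=100, kernel=triangular_kernel)
true_pdf = gaussian_kernel(grid)  # theoretical pdf of N(0, 1)
```

Plotting `f_knn`, `f_knn_gauss` and `true_pdf` against `grid` gives a picture of the same kind as Fig. 2. Because the bandwidth grows where data are sparse, both estimates tend to overestimate the density in the tails.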
