
Pattern Recognition

Volume 74, February 2018, Pages 1-14

A novel clustering method based on hybrid K-nearest-neighbor graph

https://doi.org/10.1016/j.patcog.2017.09.008

Highlights

  • A novel data model termed hybrid k-nearest-neighbor graph is proposed to represent the data sets.

  • A clustering method is developed based on the hybrid k-nearest-neighbor graph.

  • A novel internal validity index is proposed to evaluate the validity of nonlinear clustering results.

Abstract

Most existing clustering methods have difficulty in processing complex nonlinear data sets. To remedy this deficiency, in this paper, a novel data model termed the Hybrid K-Nearest-Neighbor (HKNN) graph, which combines the advantages of the mutual k-nearest-neighbor graph and the k-nearest-neighbor graph, is proposed to represent nonlinear data sets. Moreover, a Clustering method based on the HKNN graph (CHKNN) is proposed. CHKNN first generates several tight and small subclusters, then merges these subclusters by exploiting the connectivity among them. To select the optimal parameters for CHKNN, we further propose an internal validity index termed the K-Nearest-Neighbor Index (KNNI), which can also be used to evaluate the validity of nonlinear clustering results by varying a control parameter. Experimental results on synthetic and real-world data sets, as well as on video clustering, demonstrate significant performance improvements over existing nonlinear clustering methods and internal validity indices.

Introduction

Clustering is considered one of the most important problems in machine learning. It processes data with unknown distribution and serves as the foundation for further learning [1]. A large number of clustering methods have been proposed for nonlinear data sets, including kernel-based methods [2], [3], [4], graph-based methods [5], [6], [7], density-based methods [8], [9], [10], support vector-based methods [11], [12], and multi-exemplar methods [13], [14].

The kernel-based methods [2], [3], [4] use a nonlinear kernel mapping ϕ(.) to map the nonlinear data sets from the original space to a kernel space, in which the original data points can be linearly separated. The graph-based methods [6], [7] first construct a graph whose vertices represent the data points and whose edges represent the similarity between pairs of points, then either cut the graph into several parts [6] or utilize multi-prototype competitive learning to refine the coarse clusters initialized by another graph clustering method [7]. Instead of using a common similarity measure among data points, the Shared Nearest Neighbors (SNN) clustering method [5] constructs a graph using a similarity redefined as the number of near neighbors that two points share. The density-based methods [8], [9], [10] utilize a density-based notion that treats high-density regions as cluster centers. The support vector-based methods [11], [12] first map the data sets to a compact kernel space, as the kernel-based methods do, then find a hypersphere that surrounds most of the data points in this kernel space, and finally use the inverse of the kernel function to map the spherical surface of this hypersphere back to the original space for categorizing the data sets. As one of the multi-exemplar methods, Multi-Prototype Clustering (MPC) [13] first generates several center points and then uses a partition metric to decide which of these center points should be merged. In Multi-Exemplar Affinity Propagation (MEAP) [14], each data point is assigned to the most appropriate exemplar and each exemplar is assigned to the most appropriate super-exemplar.
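As an illustration of the SNN idea described above, the following is a minimal sketch of the shared-nearest-neighbor similarity; the function names and the brute-force distance computation are ours for illustration, not the implementation of [5]:

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors of each point (self excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

def snn_similarity(X, k):
    """SNN similarity: the number of k nearest neighbors two points share."""
    nbrs = knn_indices(X, k)
    n = len(X)
    sim = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(set(nbrs[i]) & set(nbrs[j]))
            sim[i, j] = sim[j, i] = shared
    return sim
```

Two points in the same dense region tend to share many neighbors and thus receive high similarity, whereas points from different clusters share few or none, even if their raw distance is moderate.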

Although these methods provide various methodologies and ideas for clustering nonlinear data sets, most of them have difficulty in processing large or complex nonlinear data sets. For kernel-based methods and support vector-based methods, the optimal nonlinear kernel mapping ϕ(.) is often unknown and difficult to determine in practice. Moreover, most of these methods are computationally expensive. Even the widely applied density-based methods cannot process data sets consisting of complex non-convex clusters. For instance, on a concentric ring data set, since the density of points in various parts of a ring may be similar, it is difficult to find cluster centers defined as high-density regions. Similar to density-based methods, the multi-exemplar methods tend to find exemplars and super-exemplars on data sets, but not all clusters have exemplars or super-exemplars.

Aiming to remedy the aforementioned deficiencies, in this paper, we abandon the concept of the cluster center and consider clusters as connective regions with high density. To find the connective regions and high-density regions simultaneously, we propose a Clustering method based on the Hybrid K-Nearest-Neighbor graph (CHKNN). CHKNN consists of two processing phases, namely a subcluster-finding phase and a merging phase. During the subcluster-finding phase, CHKNN aims to find high-density regions, generating a number of tight and small subclusters. During the merging phase, CHKNN aims to connect the high-density regions, merging parts of the subclusters according to the connectivity among them.

To choose the optimal parameters of CHKNN, we further propose an internal validity index termed the K-Nearest-Neighbor Index (KNNI). Most of the existing internal validity indices [15], [16], [17], [18], [19] ignore the connectivity information among data points, which is why they cannot accurately evaluate the validity of nonlinear clustering results. Instead, the KNNI focuses on the connectivity between each point and its nearest neighbors. Experimental results show that the KNNI can evaluate the validity of nonlinear clustering results accurately.

The contributions of this paper are summarized as follows:

  • A novel data model termed the hybrid k-nearest-neighbor graph is proposed to represent the original data sets. This model can discover the density and connectivity information contained in data sets conveniently and accurately.

  • A clustering method termed CHKNN is developed based on the hybrid k-nearest-neighbor graph. The CHKNN method can process the clustering problems on linear and nonlinear data sets effectively.

  • A novel internal validity index termed KNNI is proposed, which utilizes the connectivity information among data points to evaluate the validity of nonlinear clustering results.
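To make the hybrid k-nearest-neighbor graph contribution concrete, the sketch below shows one plausible reading of the idea: mutual k-NN edges (each endpoint is among the other's k nearest neighbors) delimit tight subclusters, while one-way k-NN edges serve as candidate connections between subclusters. All function names here are hypothetical; the exact construction used by CHKNN is defined in Section 3, not here:

```python
import numpy as np

def knn_sets(X, k):
    """k-nearest-neighbor index sets for each point (self excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return [set(row) for row in np.argsort(d, axis=1)[:, :k]]

def hybrid_knn_graph(X, k):
    """Split k-NN edges into mutual edges and one-way edges."""
    nbrs = knn_sets(X, k)
    n = len(nbrs)
    mutual, one_way = [], []
    for i in range(n):
        for j in range(i + 1, n):
            in_i, in_j = j in nbrs[i], i in nbrs[j]
            if in_i and in_j:
                mutual.append((i, j))       # both points agree: tight edge
            elif in_i or in_j:
                one_way.append((i, j))      # weaker edge: merging candidate
    return mutual, one_way

def subcluster_labels(n, mutual):
    """Subclusters as connected components of the mutual edges (union-find)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for i, j in mutual:
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]
```

Under this reading, the mutual edges produce the tight and small subclusters, and a merging step would then examine the one-way edges to decide which subclusters are connective enough to join.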

The rest of this paper is organized as follows. Section 2 briefly reviews related work on nonlinear graph-based clustering methods and internal validity indices. Sections 3 and 4 describe the proposed CHKNN method and the proposed KNNI in detail, respectively. Experimental results are reported in Section 5. Concluding remarks and directions for future work are presented in Section 6.


Related works

The majority of graph-based clustering methods aim to find highly correlated points [20]. In DBSCAN [8], the density of a point is measured by the number of points in its Eps-neighborhood. The Eps-neighborhood of a point p, denoted by N_Eps(p), is defined by N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}, where Eps is a positive real number. The density-connected and density-reachable points are considered the highly correlated points. Since DBSCAN cannot find clusters effectively with different
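The Eps-neighborhood definition above can be sketched as follows. This is a minimal illustration of the definition, not the DBSCAN implementation of [8]; the function names are ours, and `min_pts` plays the role of DBSCAN's MinPts threshold:

```python
import numpy as np

def eps_neighborhood(D, p, eps):
    """N_Eps(p) = {q in D | dist(p, q) <= Eps}; includes p itself."""
    dist = np.linalg.norm(D - D[p], axis=1)
    return set(np.nonzero(dist <= eps)[0])

def is_core_point(D, p, eps, min_pts):
    """A point is dense (a core point) if its Eps-neighborhood
    contains at least min_pts points."""
    return len(eps_neighborhood(D, p, eps)) >= min_pts
```

DBSCAN then grows clusters by chaining density-reachable core points, which is why it succeeds on arbitrarily shaped dense regions but struggles when clusters have differing densities.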

Proposed CHKNN method

For clarity in describing the proposed method, we first give definitions of the fundamental concepts in Section 3.1, then describe the proposed CHKNN method in Section 3.2, and finally present the computational complexity analysis of CHKNN in Section 3.3.

Proposed KNNI

In this section, the KNNI, which can be used to evaluate the validity of nonlinear clustering results by varying an input parameter M, is introduced in detail. We first formally define the related concepts in Section 4.1, then describe the proposed KNNI in Section 4.2.

Results

In this section, we first present experimental results of the proposed CHKNN on synthetic linear and nonlinear data sets in Section 5.1, then compare the results of CHKNN with nine methods from the literature on seven data sets (one synthetic and six real) in Section 5.2. Afterwards, we report the cross-validation performance of the nine methods in Section 5.3. Furthermore, we show the effectiveness of CHKNN on video clustering in Section 5.4. Finally, the comparative results of KNNI and other

Conclusions and future works

In this paper, we propose a novel nonlinear clustering method termed CHKNN, based on the HKNN graph, and an internal validity index termed KNNI. CHKNN is insensitive to noise and can find clusters correctly on linear and complex nonlinear data sets with appropriate parameters, while the KNNI helps to choose those optimal parameters. Experimental comparisons have been performed on both synthetic and real-world data sets to show the effectiveness of the proposed methods.

The KNNI can avoid the blindness of

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grants 61573150, 61573152, 61175114, and 91420302, by Guangdong Innovative Project 2013KJCX0009, by Guangzhou Projects 201604016113 and 201604046018, and by the Tip-top Scientific and Technical Innovative Youth Talents of Guangdong Special Support Program (No. 2016TQ03X542).

Yikun Qin is currently a Master's student in pattern recognition and intelligent systems at the South China University of Technology, Guangzhou, China. His research interests include pattern recognition and machine learning.

References (42)

  • B. Schölkopf et al.

    Nonlinear component analysis as a kernel eigenvalue problem

    Neural Comput.

    (1998)
  • C.-D. Wang et al.

    A conscience on-line learning approach for kernel based clustering

    Proceedings of the 2010 IEEE International Conference on Data Mining

    (2010)
  • C.-D. Wang et al.

    Conscience online learning: an efficient approach for robust kernel-based clustering

    Knowl. Inf. Syst.

    (2012)
  • L. Ertöz et al.

    Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data

    Proceedings of SDM

    (2003)
  • J. Shi et al.

    Normalized cuts and image segmentation

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2000)
  • C.-D. Wang et al.

    Graph-based multiprototype competitive learning and its applications

    IEEE Trans. Syst. Man Cybern. Part C Appl. Rev.

    (2012)
  • M. Ester et al.

    A density-based algorithm for discovering clusters in large spatial databases with noise

    Proceedings of KDD

    (1996)
  • M. Ester

    Density-based clustering

    Encyclopedia of Database Systems

    (2009)
  • A. Rodriguez et al.

    Clustering by fast search and find of density peaks

    Science

    (2014)
  • A. Ben-Hur et al.

    Support vector clustering

    J. Mach. Learn. Res.

    (2002)
  • C.-D. Wang et al.

    Multi-exemplar affinity propagation

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)


    Zhu Liang Yu received his BSEE in 1995 and MSEE in 1998, both in electronic engineering from the Nanjing University of Aeronautics and Astronautics, China. He received his Ph.D. in 2006 from Nanyang Technological University, Singapore. He joined Center for Signal Processing, Nanyang Technological University from 2000 as a research engineer, then as a Group Leader from 2001. In 2008, he joined the College of Automation Science and Engineering, South China University of Technology and was promoted to be a full professor in 2011. His research interests include signal processing, pattern recognition, machine learning and their applications in communications, biomedical engineering, etc.

    Chang-Dong Wang received his Ph.D. degree in computer science in 2013 from Sun Yat-sen University, China. He is currently an assistant professor at the School of Mobile Information Engineering, Sun Yat-sen University. His current research interests include machine learning and pattern recognition, especially data clustering and its applications. He has published over 30 scientific papers in international journals and conferences such as IEEE TPAMI, IEEE TKDE, IEEE TSMC-C, Pattern Recognition, Knowledge and Information Systems, Neurocomputing, ICDM, and SDM. His ICDM 2010 paper won an Honorable Mention for the Best Research Paper Awards. He received the 2012 Microsoft Research Fellowship Nomination Award and the 2015 Chinese Association for Artificial Intelligence (CAAI) Outstanding Dissertation Award.

    Zhenghui Gu received the Ph.D. degree from Nanyang Technological University in 2003. From 2002 to 2008, she was with Institute for Infocomm Research, Singapore. She joined the College of Automation Science and Engineering, South China University of Technology, in 2009 as an associate professor. She was promoted to be a full professor in 2015. Her research interests include the fields of signal processing and pattern recognition.

    Yuanqing Li was born in Hunan Province, China, in 1966. He received the B.S. degree in applied mathematics from Wuhan University, Wuhan, China, in 1988, the M.S. degree in applied mathematics from South China Normal University, Guangzhou, China, in 1994, and the Ph.D. degree in control theory and applications from South China University of Technology, Guangzhou, China, in 1997. Since 1997, he has been with South China University of Technology, where he became a full professor in 2004. From 2002 to 2004, he worked at the Laboratory for Advanced Brain Signal Processing, RIKEN Brain Science Institute, Saitama, Japan, as a researcher. From 2004 to 2008, he worked at the Laboratory for Neural Signal Processing, Institute for Infocomm Research, Singapore, as a research scientist. His research interests include blind signal processing, sparse representation, machine learning, brain-computer interfaces, and EEG and fMRI data analysis. He is the author or coauthor of more than 60 scientific papers in journals and conference proceedings.
