Minimum spanning tree based split-and-merge: A hierarchical clustering method
Introduction
Clustering plays an important role in data mining, pattern recognition, and machine learning. It aims at grouping N data points into K clusters so that data points within the same cluster are similar, while data points in different clusters are dissimilar. From the machine learning point of view, clustering is unsupervised learning, as it classifies a dataset without any a priori knowledge. A large number of clustering algorithms [20], [37], [46] have been presented in the literature since K-means [32], one of the most popular clustering algorithms, was published. These algorithms can be grouped into hierarchical clustering [12], [15], [16], [19], [27], partitional clustering [2], [6], [17], [24], [32], [35], density-based clustering [10], [18], grid-based clustering [1], [43], model-based clustering [11], [33], and graph-based clustering [29], [38], [40], [45], [47], [49]. Hierarchical and partitional clustering are the two most common groups [30].
Generally, a hierarchical clustering algorithm partitions a dataset into various clusters by an agglomerative or a divisive approach based on a dendrogram. Agglomerative clustering starts by considering each point as a cluster, and iteratively combines the two most similar clusters in terms of an objective function. In contrast, divisive clustering starts with a single cluster containing all data points, and iteratively selects a cluster and partitions it into two subclusters. The main advantage of hierarchical clustering is that it produces a nested tree of partitions and can therefore be more informative than non-hierarchical clustering. Furthermore, with the dendrogram, the optimal number of clusters can be determined. However, hierarchical clustering has a relatively high computational cost. Single linkage [39] and complete linkage [26] are two well-known examples of hierarchical clustering algorithms, and both take O(N² log N) time.
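For illustration, the following sketch (ours, not from the paper) runs agglomerative single-linkage clustering with SciPy on toy data and cuts the resulting dendrogram into K flat clusters:

```python
# Agglomerative (single-linkage) clustering sketch using SciPy.
# The toy data below is illustrative only.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # group A
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])  # group B

# Each point starts as its own cluster; the two closest clusters
# (minimum pairwise distance, i.e. single linkage) are merged iteratively.
Z = linkage(X, method='single')

# Cut the dendrogram to obtain K = 2 flat clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g. [1 1 1 2 2 2]
```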
Partitional clustering splits a dataset at once using an objective function. K-means is one of the most popular examples of partitional clustering. It employs the mean squared error as its objective function. Its main advantage is that it runs efficiently: its computational complexity is O(NKId), where I is the number of iterations for convergence and d is the dimensionality of the dataset. Since K and d are usually far less than N, the algorithm runs in linear time on low-dimensional data. However, no universal objective function exists that can discover all the different intrinsic structures of datasets. Therefore, partitional clustering produces inaccurate results when the objective function used does not capture the intrinsic structure of the data. This is why partitional clustering algorithms are incapable of dealing with clusters of arbitrary shape, different sizes, and different densities.
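As a reference point for the complexity argument, here is a minimal K-means sketch (our own, with illustrative variable names); each of the I iterations costs O(NKd), giving the O(NKId) total stated above:

```python
# Minimal K-means sketch; each iteration compares N points against
# K d-dimensional centroids, so I iterations cost O(NKId) in total.
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]  # random prototypes
    for _ in range(max_iter):                            # I iterations
        # Assignment step: nearest centroid for each point, O(NKd).
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members.
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):                  # converged
            break
        centroids = new
    return labels, centroids
```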
Clustering algorithms that combine the advantages of hierarchical and partitional clustering have been proposed in the literature [5], [23], [25], [28], [30], [31]. Such hybrid algorithms analyze the dataset in two stages. In the first stage, the dataset is split into a number of subsets with a partitioning criterion. In the second stage, the produced subsets are merged according to a similarity measure. Different split and merge approaches have been designed in several hybrid algorithms. CSM [30] first applies K-means to partition the dataset into K′ subsets, where K′ is an input parameter. Afterwards, single linkage, which uses a dedicated cohesion function as the similarity measure, is utilized to iteratively merge the K′ subsets until K subsets are achieved. Since K-means may produce different partitions in different runs of the split stage, the final results may be unstable.
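The generic two-stage pipeline can be sketched as follows. This is a schematic illustration, not CSM itself: plain single linkage between the subset centers stands in for CSM's dedicated cohesion function, and the parameter names are ours.

```python
# Schematic split-and-merge pipeline: over-partition with K-means,
# then merge the closest subsets until K groups remain.
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.cluster.hierarchy import linkage, fcluster

def split_and_merge(X, K, K_prime=20, seed=0):
    # Split stage: over-partition the data into K' > K subsets.
    # (The 'seed' keyword requires a recent SciPy.)
    centers, sub = kmeans2(X, K_prime, minit='++', seed=seed)
    # Merge stage: single linkage over the subset centers stands in
    # for the similarity measure; merge until K groups remain.
    Z = linkage(centers, method='single')
    group = fcluster(Z, t=K, criterion='maxclust')
    return group[sub]  # map each point through its subset's group
```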
CHAMELEON [23] is another example of a hybrid clustering algorithm. It constructs a K-nearest-neighbor graph and employs a graph-cut scheme to partition the graph into K′ subsets. Relative inter-connectivity and relative closeness are then defined to merge the subsets. Liu et al. [31] proposed a multi-prototype clustering algorithm, which can also be considered a hybrid method. The method uses a convergence mechanism and repeatedly performs split and merge operations until the prototypes remain unchanged; however, many empirical parameters are involved. Kaukoranta et al. [25] proposed a split-and-merge algorithm whose objective function minimizes the mean squared error.
A minimum spanning tree (MST) is a useful graph structure that has been employed to capture perceptual grouping [21]. Zahn defined several criteria of edge inconsistency for detecting clusters of different shapes [49]. However, for datasets consisting of differently shaped clusters, the method lacks an adaptive selection among the criteria. Xu et al. [47] proposed three MST-based algorithms: removing long MST edges, a center-based iterative algorithm, and a representative-based global optimal algorithm. For a specific dataset, however, users do not know in advance which of the three algorithms is suitable.
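The simplest of these criteria, cutting the longest MST edges, can be sketched as follows (our illustration; the threshold is the naive one, and ties at the cut weight may remove extra edges):

```python
# Zahn-style clustering sketch: build an MST over the complete Euclidean
# graph and cut its K-1 longest edges; components become clusters.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_cut_clusters(X, K):
    W = squareform(pdist(X))                  # complete-graph weights
    mst = minimum_spanning_tree(W).toarray()  # N-1 edges
    # Remove the K-1 heaviest edges (the "inconsistent" ones under
    # the simplest criterion).
    cut = np.sort(mst[mst > 0])[-(K - 1):].min() if K > 1 else np.inf
    mst[mst >= cut] = 0
    _, labels = connected_components(mst, directed=False)
    return labels
```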
In this paper, we propose a minimum spanning tree based split-and-merge method (SAM). It works on numerical data and assumes that the graph can be calculated in a vector space. In the split stage, three rounds of MSTs are used to construct a neighborhood graph called the 3-MST graph. The vertices with high degrees in this graph are selected as the initial prototypes, and K-means is then applied. In the merge stage, the subsets that are neighbors with respect to the MST are identified and considered for merging.
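A sketch of how such a 3-MST neighborhood graph can be built is given below: each round computes an MST on the edges not yet used, and the three trees are unioned. This is our reading of the construction; SAM's exact prototype scoring is omitted, and the sketch assumes distinct points and that the graph stays connected after edge removal.

```python
# Sketch of a 3-MST neighborhood graph: three edge-disjoint rounds of
# MSTs are unioned; high-degree vertices serve as initial prototypes.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def k_mst_graph(X, rounds=3):
    # Dense csgraph convention: a zero entry means "no edge".
    work = squareform(pdist(X))
    union = np.zeros_like(work, dtype=bool)
    for _ in range(rounds):
        mst = minimum_spanning_tree(work).toarray() > 0
        mst = mst | mst.T              # symmetrize the tree's adjacency
        union |= mst
        work[mst] = 0                  # drop used edges, compute next MST
    return union                       # adjacency of the 3-MST graph

# Candidate initial prototypes for K-means are then the vertices with
# the highest degree in the union graph:
# degrees = k_mst_graph(X).sum(axis=1)
```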
The rest of the paper is organized as follows: In Section 2, the proposed algorithm is described. The experimental results are presented in Section 3, and conclusions are drawn in Section 4.
Problem formulation
Given a set of data points X = {x1, x2, … , xi, … , xN}, where xi ∈ ℝ^d is a feature vector, the goal of clustering is to partition X into K clusters C1, C2, … , CK such that Ci ≠ ∅, Ci ∩ Cj = ∅, and X = C1 ∪ C2 ∪ ⋯ ∪ CK, for i = 1: K, j = 1: K, i ≠ j. An undirected graph has been employed to represent a dataset for clustering [38], [49]. Let G(X) = (V, E) denote the undirected complete graph of X, where V = X and E = {(xi, xj) ∣ xi, xj ∈ X, i ≠ j}; the weight of an edge can be calculated with a distance function w(xi, xj), typically the Euclidean distance ‖xi − xj‖.
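As a concrete instantiation of this formulation, assuming the Euclidean distance as the weight function:

```python
# Complete graph G(X) = (V, E) with Euclidean edge weights (toy data).
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(10, 2)        # N = 10 points in R^2
W = squareform(pdist(X))         # W[i, j] = ||x_i - x_j|| for i != j
E = [(i, j) for i in range(len(X)) for j in range(i + 1, len(X))]
```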
Experimental results
The clustering performance of the proposed method is evaluated on six synthetic and four real datasets. The first four synthetic datasets, DS1–DS4, are taken from the literature [22], [4], [14], [13], and the remaining two, DS5 and DS6, are from [42]. These datasets are illustrated in Fig. 5. The four real-world datasets are taken from the UCI repository [50]: IRIS, WINE, Breast Cancer Wisconsin (WBC), and Breast Cancer Wisconsin Diagnostic (WDBC). The descriptions of these datasets are shown in
Conclusion
The proposed method employs minimum spanning trees in different stages. Before the dataset is split into clusters, the hairs (leaves together with their connecting edges) of the first MST, computed for the whole dataset, are pruned.
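Under our reading of this description, one round of hair pruning can be sketched as follows (degree-1 MST vertices and their connecting edges are removed; the paper may iterate this step):

```python
# Sketch of hair pruning: leaves of the MST (degree-1 vertices) and
# their connecting edges are removed before the split stage.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def prune_hairs(X):
    mst = minimum_spanning_tree(squareform(pdist(X))).toarray()
    adj = (mst > 0) | (mst > 0).T        # symmetric adjacency of the MST
    degree = adj.sum(axis=1)
    keep = degree > 1                    # drop degree-1 leaves ("hairs")
    return X[keep], np.flatnonzero(keep)
```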
In the splitting stage, more than the desired K clusters are created by K-means. Three MSTs on an iteratively refined graph are computed and combined to determine the initial prototypes for K-means, because randomly selected initial prototypes would lead to
Acknowledgments
The authors thank the anonymous reviewers for their constructive comments and suggestions which helped improve the quality of this paper. The work of C. Zhong was supported by Zhejiang Provincial Natural Science Foundation of China, No. Y1090851, and the Center for International Mobility (CIMO). The work of D. Miao was supported by the National Natural Science Foundation of China, No. 60970061, No. 61075056, and the Research Fund for the Doctoral Program of Higher Education, No. 20060247039.
References (50)
- et al., Immune k-means and negative selection algorithms for data analysis, Inform. Sci. (2009).
- et al., Robust path-based spectral clustering, Pattern Recognit. (2008).
- et al., A time-efficient pattern reduction algorithm for k-means clustering, Inform. Sci. (2011).
- et al., Hierarchical clustering of mixed data based on distance hierarchy, Inform. Sci. (2007).
- et al., An agglomerative clustering algorithm using a dynamic k-nearest-neighbor list, Inform. Sci. (2011).
- et al., Data clustering by minimizing disconnectivity, Inform. Sci. (2011).
- et al., Clustering high dimensional data: a graph-based relaxed optimization approach, Inform. Sci. (2008).
- et al., A multi-prototype clustering algorithm, Pattern Recognit. (2009).
- Graph clustering, Comput. Sci. Rev. (2007).
- The relative neighborhood graph of a finite planar set, Pattern Recognit. (1980).
- Some new indexes of cluster validity, IEEE Trans. Syst. Man Cybern. B Cybern.
- A divide-and-merge methodology for clustering, ACM Trans. Database Syst.
- Density conscious subspace clustering for high-dimensional data, IEEE Trans. Knowl. Data Eng.
- Reducing redundancy in subspace clustering, IEEE Trans. Knowl. Data Eng.
- Introduction to Algorithms.
- Unsupervised learning of finite mixture models, IEEE Trans. Pattern Anal. Mach. Intell.
- Fast agglomerative clustering using a k-nearest neighbor graph, IEEE Trans. Pattern Anal. Mach. Intell.
- FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data, BMC Bioinformatics.
- Clustering aggregation, ACM Trans. Knowl. Disc. Data.
- The Elements of Statistical Learning: Data Mining, Inference and Prediction.