
Clustering Using a Combination of Particle Swarm Optimization and K-means

  • Garvishkumar K. Patel, Vipul K. Dabhi and Harshadkumar B. Prajapati

Abstract

Clustering is an unsupervised grouping of data points based on the similarity between them. This paper applies a combination of particle swarm optimization and K-means for data clustering. The proposed approach tries to improve the performance of traditional partition clustering techniques such as K-means by removing the requirement of specifying the number of clusters or centroids in advance. The proposed approach is evaluated using various primary and real-world datasets. Moreover, this paper also presents a comparison of the results produced by the proposed approach and by K-means based on clustering validity measures such as inter- and intra-cluster distances, quantization error, silhouette index, and Dunn index. The comparison of results shows that as the size of the dataset increases, the proposed approach produces significant improvement over the K-means partition clustering technique.

1 Introduction

Data clustering [17] divides data points into a number of clusters such that data points in the same cluster are more similar to each other than to data points in different clusters. The aim is to group the homogeneous objects of a dataset so as to extract some meaningful pattern or information. Clustering can be applied in many applications, such as image processing [9], the medical domain [3] for gene clustering, and web usage mining [4] for finding user access patterns. Different clustering methods [2, 5] are available that take into account the nature of the dataset and the input parameters in order to cluster the data. In partition clustering techniques, data points are clustered around centroids based on the distance of each data point to the centroid of its cluster. Depending on the nature of the clustering algorithm, the number of centroids is either defined in advance by the user or determined automatically by the algorithm. Finding the optimum number of clusters, or natural groups, in the data is an important task in data clustering. The clustering approaches available thus far are either partition based or hierarchy based; both approaches have their own advantages and limitations in terms of finding the number of clusters, the shape of clusters, and cluster overlapping. Some other approaches are based on the hybridization of different clustering techniques and involve optimization in the clustering process [12].

This paper proposes a combination of particle swarm optimization (PSO) and K-means to achieve better clustering results. The K-means algorithm is the most popular clustering algorithm because of its easy implementation and fast computation. However, it may converge to a local optimum due to its random selection of initial partitions. The proposed method applies optimization techniques to overcome this limitation of K-means. PSO offers a globalized search methodology but suffers from slow convergence near the optimal solution. The proposed approach uses the result of K-means as the initial position of one of the particles of PSO. The performance of the proposed algorithm is evaluated on the basis of the following measures: inter-cluster distance, intra-cluster distance, quantization error, silhouette index, and Dunn index.

This paper is organized as follows. Section 2 discusses related work, including a partition clustering technique (K-means), PSO, and clustering validity measures; moreover, it provides a survey of data clustering using PSO. Section 3 presents and discusses the proposed approach. Section 4 describes the experimental setup for the evaluation of the proposed approach. Moreover, the section presents, compares, and discusses the results obtained by using the proposed approach. Finally, Section 5 presents the conclusions.

2 Background and Related Work

This section provides a review of the partition clustering technique, PSO, and various clustering validity measures. It also presents related work on partition clustering attempted through the combination of PSO and K-means. A study of partition clustering and detailed analysis of applying PSO for improving partition clustering is available in Ref. [10].

2.1 Partition Clustering Technique: K-means

A partition clustering technique [1] partitions the data points by moving them from one cluster to another appropriate cluster. K-means is one of the well-known partition clustering techniques. K-means tries to optimize the objective function, given in Eq. (1):

$$E = \sum_{i=1}^{c} \sum_{x \in C_i} d(x, m_i), \qquad (1)$$

where $m_i$ is the center of cluster $C_i$ and $d(x, m_i)$ is the Euclidean distance between a data point $x$ belonging to cluster $C_i$ and its center $m_i$.

Objective function E tries to minimize the distance of each data point from the center (centroid) of its cluster. The K-means algorithm proceeds by initializing the number of clusters. Then, it assigns each data point to the cluster whose center or centroid is nearest. Next, the algorithm recalculates the centers or centroids (using the mean). The process continues until the centers of the clusters stop changing. The K-means technique suffers from the following drawbacks: (i) the final result depends on the selection of the initial centroids (cluster centers); (ii) K-means might converge to a local optimum; (iii) different runs of K-means on the same dataset may produce different results; (iv) the produced clusters are sensitive to outliers; and (v) K-means works well only on numeric datasets.
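To make the procedure concrete, the following is a minimal MATLAB sketch of the K-means loop described above. It is an illustration, not the paper's implementation, and it assumes the Statistics Toolbox function pdist2 is available:

```matlab
% Minimal K-means sketch: X is an n-by-d data matrix, k the number of
% clusters. Illustrative only; assumes the Statistics Toolbox (pdist2).
function [centroids, labels] = simple_kmeans(X, k)
    n = size(X, 1);
    centroids = X(randperm(n, k), :);      % random initial centroids
    labels = zeros(n, 1);
    for iter = 1:100                       % cap on the number of iterations
        D = pdist2(X, centroids);          % n-by-k Euclidean distances
        [~, newLabels] = min(D, [], 2);    % nearest-centroid assignment
        if isequal(newLabels, labels)      % stop when assignments settle
            break;
        end
        labels = newLabels;
        for j = 1:k                        % recompute centroids as means
            if any(labels == j)
                centroids(j, :) = mean(X(labels == j, :), 1);
            end
        end
    end
end
```

The random initialization in the first line is exactly the source of drawbacks (i)-(iii) above: a different randperm draw can lead the loop to a different local optimum.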

2.2 Particle Swarm Optimization

PSO is a stochastic search and optimization technique [6] that mimics the flocking of birds: the candidate solutions form a swarm, and each particle in the swarm represents a candidate solution for the given problem. Each particle updates its own position based on its own experience, Pbest (personal best), and the experience of the best particle in the swarm, Gbest (global best). The parameter $n$ is the dimensionality of the search space. The $i$th particle of the swarm is represented by a position $X_i = (X_{i1}, \ldots, X_{in})$ and a velocity $V_i = (V_{i1}, \ldots, V_{in})$. In every iteration, each particle updates its velocity and position using Eqs. (2) and (3), respectively:

$$V_{id}(t+1) = w\,V_{id}(t) + C_1\,r()\,(P_{id} - X_{id}(t)) + C_2\,r()\,(P_{g} - X_{id}(t)), \qquad (2)$$
$$X_{id}(t+1) = X_{id}(t) + V_{id}(t+1), \qquad (3)$$

where $P_{id}$ is the Pbest position found by the $i$th particle, and $P_g$ is the Gbest position found among all the particles in the swarm. $r()$ returns a uniformly distributed random number in [0, 1]. $C_1$ (which weights the cognitive component) and $C_2$ (which weights the social component) are acceleration constants, and $w$ is an inertia weight. The inertia weight $w$ is useful for maintaining a balance between the local and global search. If the velocity of a particle exceeds the maximum speed limit, velocity clamping is performed.
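As a sketch, the update of Eqs. (2) and (3) with velocity clamping can be written in MATLAB as follows. Variable names such as Vmax are assumptions for illustration:

```matlab
% One PSO update step for particle i (illustrative sketch).
% V, X, Pbest are nParticles-by-nDim matrices; Gbest is 1-by-nDim;
% w, C1, C2 follow the text, and Vmax is an assumed clamping limit.
V(i,:) = w * V(i,:) ...
       + C1 * rand(1, nDim) .* (Pbest(i,:) - X(i,:)) ...   % cognitive term
       + C2 * rand(1, nDim) .* (Gbest - X(i,:));           % social term
V(i,:) = max(min(V(i,:), Vmax), -Vmax);  % velocity clamping, Eq. (2)
X(i,:) = X(i,:) + V(i,:);                % position update, Eq. (3)
```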

2.3 Clustering Validity Measures

In this subsection, we discuss the evaluation measures used to assess the quality of the clusters obtained by K-means and by the proposed algorithm. Two types of measures are available for evaluating the results of a clustering algorithm: (i) internal validity measures and (ii) external validity measures. External validity measures use external information that is not contained in the dataset; they match the output of the clustering algorithm against known clusters (ground truth prepared by a human expert). Cluster purity, the F-measure, the Jaccard score, and entropy are a few examples of external validity measures. In the absence of external information (such as the true number of clusters), internal validity measures are used for evaluating the quality of a clustering algorithm. The Dunn index and the silhouette index are examples of internal validity measures.

2.3.1 Inter-cluster Distance

The distance between the centroids of the clusters is known as the inter-cluster distance. Good-quality clusters are separated as widely as possible; therefore, the inter-cluster distance should be maximized:

$$\text{Inter-clust dist} = \min_{i \neq j} \|c_i - c_j\|^2, \qquad (4)$$

where $c_i$ and $c_j$ are the centroids of clusters $i$ and $j$.

2.3.2 Intra-cluster Distance

The distance between the data instances within a cluster and that cluster's centroid is known as the intra-cluster distance. Good-quality clusters are compact, with little deviation from their centroids; therefore, the intra-cluster distance should be minimized:

$$\text{Intra-clust dist} = \frac{1}{n} \sum_{j=1}^{k} \sum_{x_i \in C_j} \|x_i - c_j\|^2. \qquad (5)$$

In the equation, $k$ is the number of clusters, $c_j$ is the centroid of cluster $j$, and $n$ is the number of data instances. The term $\|x_i - c_j\|^2$ is the squared distance between data instance $x_i$ and centroid $c_j$.

2.3.3 Quantization Error

For better cluster quality, the goal is to minimize the average quantization error, which is represented as follows:

$$\text{Quantization error} = \frac{1}{k} \sum_{j=1}^{k} \left[ \frac{1}{N_j} \sum_{x_i \in C_j} \|x_i - c_j\|^2 \right]. \qquad (6)$$

In the equation, $k$ is the number of clusters, $c_j$ is the centroid of cluster $j$, $N_j$ is the number of data points in cluster $j$, and $\|x_i - c_j\|^2$ is the squared distance between data point $x_i$ and its centroid.
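The three distance-based measures of Eqs. (4)-(6) can be computed together. The following MATLAB sketch is an illustration, not the paper's code; it assumes labels holds the cluster assignments, C the centroid matrix, and that no cluster is empty:

```matlab
% Distance-based validity measures of Eqs. (4)-(6) (illustrative).
% X: n-by-d data, labels: n-by-1 assignments, C: k-by-d centroids.
function [interD, intraD, qErr] = basic_measures(X, labels, C)
    k = size(C, 1);  n = size(X, 1);
    interD = min(pdist(C)).^2;       % Eq. (4): smallest squared centroid gap
    intraD = 0;  qErr = 0;
    for j = 1:k
        diffs = bsxfun(@minus, X(labels == j, :), C(j, :));
        d2 = sum(diffs.^2, 2);       % squared distances to centroid j
        intraD = intraD + sum(d2);   % Eq. (5) numerator
        qErr = qErr + mean(d2);      % Eq. (6): per-cluster average
    end
    intraD = intraD / n;             % Eq. (5)
    qErr = qErr / k;                 % Eq. (6)
end
```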

2.3.4 Silhouette Index

The silhouette index is an internal validity measure [13]. The silhouette value of the ith data point is calculated using Eq. (7):

$$S(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad (7)$$

where $a(i)$ denotes the average distance between the $i$th data point and all other points in the same cluster, and $b(i)$ denotes the minimum, over the other clusters, of the average distance between the $i$th data point and the points in that cluster. The silhouette value lies in the range of -1 to 1. The value of $a(i)$ represents the compactness of the cluster to which the $i$th data point belongs; the smaller the value of $a(i)$, the more compact the cluster. The value of $b(i)$ represents the separation of the cluster from other clusters; a high value of $b(i)$ indicates that the cluster is well separated from the other clusters.
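In MATLAB, per-point silhouette values can be obtained with the Statistics Toolbox's silhouette() function. This is a convenient shortcut for illustration; we do not claim the paper used it:

```matlab
% Mean silhouette index via the Statistics Toolbox (Euclidean metric).
s = silhouette(X, labels);   % per-point values of S(i), Eq. (7)
silIndex = mean(s);          % overall index, in the range [-1, 1]
```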

2.3.5 Dunn Index

The Dunn index [13] is an internal validity measure; a high value denotes good-quality clusters. The Dunn index is calculated using Eq. (8). It is a ratio of inter-cluster to intra-cluster distance; therefore, we first calculate the intra-cluster distance within every cluster and the inter-cluster distance between every pair of clusters.

$$\text{Dunn index} = \min_{1 \le i \le c} \left\{ \min_{j = i+1, \ldots, c} \left\{ \frac{d(c_i, c_j)}{\max_{1 \le k \le c} d(X_k)} \right\} \right\}, \qquad (8)$$

where $d(c_i, c_j)$ represents the inter-cluster distance between clusters $i$ and $j$, $d(X_k)$ represents the intra-cluster distance of cluster $k$, and $c$ is the number of clusters.
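A minimal MATLAB sketch of Eq. (8) follows, taking the centroid separation as the inter-cluster distance and the largest point-to-centroid distance as the intra-cluster spread. These are common conventions, assumed here for illustration; X, labels, and C are as in the earlier sketch:

```matlab
% Dunn index sketch per Eq. (8) (illustrative conventions).
interMin = min(pdist(C));                 % smallest centroid separation
intraMax = 0;
for j = 1:size(C, 1)
    diffs = bsxfun(@minus, X(labels == j, :), C(j, :));
    dj = sqrt(sum(diffs.^2, 2));          % distances to centroid j
    if ~isempty(dj)
        intraMax = max(intraMax, max(dj));% largest within-cluster distance
    end
end
dunnIndex = interMin / intraMax;          % high = well separated and compact
```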

2.4 Related Work: Data Clustering Using PSO and K-means

Van der Merwe and Engelbrecht [16] carried out clustering using PSO to find the centroids of clusters. They tested their proposed method on six datasets and compared it with K-means in terms of intra- and inter-cluster similarity and quantization error. Neshat et al. [8] combined PSO and K-means to exploit the global search ability of PSO and the faster convergence of K-means; they applied this hybrid algorithm to cluster six datasets and compared its efficiency with that of PSO and K-means. Rana et al. [11] applied PSO with K-means; their algorithm was mainly designed to avoid the drawback of K-means of getting stuck in local optima or stagnating. They used four datasets and compared the results with K-means and a genetic algorithm (GA). Saini and Kaur [14] applied the combination of PSO and K-means for improving cluster quality; their method uses the result of PSO as an input seed for K-means to obtain better results, and evaluates cluster quality based on various clustering measures such as inter-/intra-cluster distance and quantization error.

3 Proposed Approach

The traditional approach of partition clustering (the K-means algorithm) needs the value of K (the number of clusters) to drive the clustering process. The proposed approach does not require the exact value of K as input; rather, it takes a range of values and determines the appropriate value of K. We also validate the quality of the resulting clusters using different cluster validity measures. The proposed approach is a combination of a partition clustering technique (K-means) and a swarm intelligence technique (PSO). In clustering, we have to optimize the values of inter-cluster distance and intra-cluster distance to produce high-quality clusters. Therefore, we consider partition clustering as an optimization problem, and we attempt to use PSO to solve this optimization problem.

In the proposed approach, the user has to specify a range for the number of clusters. The proposed approach takes the values of the range one by one and returns the integer value (number of clusters) for which it finds the optimal solution, together with the centroids of these clusters. The steps performed in the proposed approach are shown in Figure 1, and a code sketch of this loop is given after the figure. For each value of the number of clusters, the K-means algorithm is executed once. The output produced by the K-means algorithm — the values of the centroids — is used to initialize the position of one of the particles of PSO; the remaining particles of PSO are randomly initialized. The proposed approach optimizes the values of inter-cluster distance and quantization error using PSO. For finding the optimal number of clusters, the proposed approach uses various cluster validity measures for measuring the efficiency of the produced clusters. The measures used are the following: (i) inter-cluster distance, (ii) intra-cluster distance, (iii) quantization error, (iv) silhouette index, and (v) Dunn index.

Figure 1: Flowchart of the Proposed Approach.
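The loop of Figure 1 can be summarized by the following MATLAB sketch. It is illustrative only: run_pso and assign_points are hypothetical helpers, and selecting K by quantization error alone is a simplification (the actual approach tracks all five measures):

```matlab
% High-level sketch of the proposed K-means -> PSO hybrid (illustrative).
best = struct('k', NaN, 'qErr', Inf, 'C', []);
for k = kMin:kMax                          % user-specified range for K
    [~, Ckm] = kmeans(X, k);               % K-means centroids seed one particle
    C = run_pso(X, k, Ckm);                % hypothetical: PSO refines centroids
    labels = assign_points(X, C);          % hypothetical: nearest-centroid labels
    [~, ~, qErr] = basic_measures(X, labels, C);   % Eqs. (4)-(6), sketched above
    if qErr < best.qErr                    % keep the best-scoring K
        best = struct('k', k, 'qErr', qErr, 'C', C);
    end
end
```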

Researchers [8, 11, 14, 16] have applied a hybrid approach (a combination of K-means and PSO) for data clustering and concluded that the hybrid approach gives better results than using K-means or PSO in standalone mode. As the objective function of the K-means algorithm is not convex, it may have many local minima. Moreover, the K-means algorithm is sensitive to the selection of the initial cluster centroids. Owing to these reasons, the K-means algorithm may converge to a local optimum. PSO is a nature-inspired algorithm that can escape local optima and give a globally optimal solution. Moreover, as PSO is a population-based algorithm (which searches in multiple directions in parallel), it helps in minimizing the effect of the selection of initial cluster centroids on the final outcome (cluster quality). Therefore, we use the K-means algorithm for local search and PSO for global search.

Our work differs in three ways from existing works: (i) the sequence of execution of the K-means and PSO algorithms, (ii) the number and types of problems tested, and (iii) the measures used for the evaluation of the experimental results. We have tested the proposed approach on diverse and heterogeneous datasets; therefore, our results are more representative than those of other existing works. Also, we have evaluated the results using more internal cluster validation indexes than other existing works.

  1. Sequence of execution of K-means and PSO algorithms: The hybrid approaches differ in the way the two algorithms are integrated. Some researchers [16] have applied K-means first, with the output of K-means given as input to PSO, whereas others have applied PSO first, with the output of PSO given as input to K-means [8, 11, 14]. Our approach is similar to that in Ref. [16]: K-means is executed first and its output is given as input to PSO.

  2. We have considered three primary datasets (Iris, Breast cancer, and Thyroid) and three real-world datasets (Bank customer, Stock price, and Uscensus). The real-world datasets are high dimensional. The authors of Ref. [16] considered six datasets (Artificial problem 1, Artificial problem 2, Iris plants, Wine, Breast cancer, and Automotives). The problems considered in Ref. [8] are Iris, Pima, Wine, Glass, Sonar, and WDBC. Rana et al. [11] considered two artificial problems and two standard benchmark problems (Wine and Iris). Saini and Kaur [14] evaluated their approach on three datasets: an artificial dataset, Iris, and Wine.

  3. Many researchers [11, 14, 16] have used the inter-cluster distance, intra-cluster distance, and quantization error criteria for evaluating the quality of clusters. In addition to these criteria, the authors of Ref. [14] used execution time and accuracy. Neshat et al. [8] used the sum of intra-cluster distances and the error rate for evaluating the performance of clustering techniques. We considered the following five measures for evaluating the quality of clusters: inter-cluster distance, intra-cluster distance, quantization error, silhouette index, and Dunn index.

3.1 Implementation Details

We set all particles’ initial positions in the range of 0–1 and their velocities in the range of 0–0.1. As each particle represents a solution of the clustering problem, the dimension of the variables representing the swarm velocity and swarm position is No. of clusters × No. of dimensions × No. of particles.

The particle’s personal best position (Pbest) is represented as a matrix of dimensions No. of clusters × No. of dimensions. We scale the position of each particle in every dimension (direction) to the allowed range (determined by taking the difference of the maximum and minimum values of every dimension). We initialize the position of the first particle of the swarm with the centroids given by K-means. The PSO algorithm then runs for a specified number of iterations for one particular value of the number of clusters (centroids).
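A sketch of this initialization in MATLAB follows; the dimension ordering follows the text, while the exact variable names are assumptions:

```matlab
% Swarm initialization sketch. Each particle stores k centroids, so the
% position/velocity arrays are k-by-nDim-by-nParticles, as in the text.
pos = rand(k, nDim, nParticles);           % positions in [0, 1]
vel = 0.1 * rand(k, nDim, nParticles);     % velocities in [0, 0.1]
lo = min(X, [], 1);  hi = max(X, [], 1);   % per-dimension data range
for p = 1:nParticles                       % scale positions into data range
    pos(:, :, p) = bsxfun(@plus, lo, bsxfun(@times, pos(:, :, p), hi - lo));
end
pos(:, :, 1) = kmeansCentroids;            % seed particle 1 with K-means output
```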

We calculate the difference between the first centroid of the first particle and the first data point (both represented by No. of dimensions values), and take the two-norm of this difference vector [using the norm() function of Matlab]. We repeat this step for all data points. As a result, we get a vector holding the distances between the first centroid of the first particle and all data points; its dimension is (No. of data points × 1).

We repeat this step to obtain the distance vector for every cluster (centroid) of the first particle, and then repeat the process for all particles. As a result, we obtain the Distances matrix, whose dimension is (No. of data points × No. of clusters × No. of particles). The first column of the Distances matrix holds the distances (norms of the differences over all dimensions/directions) of the first centroid from all data points. At this stage, we have a Distances matrix for every particle; it is used for calculating the fitness value of every particle.

The next step is to assign a cluster number to every data point, i.e., to find out which cluster centroid each data point is nearest to. We determine the minimum among all the distance values (norms of the differences between centroids and the data point, over all directions) obtained for every data point, and assign the data point to the cluster whose centroid has the minimum distance. We repeat this step for every particle. As a result, we obtain an Assignment matrix that holds the cluster number assigned to every data point, for every particle; a given data point may be assigned to the same or different cluster numbers by different particles. The dimension of the Assignment matrix is (No. of particles × No. of data points).
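These two steps can be sketched as follows, continuing the variable names of the initialization sketch above (an illustration, not the paper's code):

```matlab
% Distances matrix and Assignment matrix sketch. Dist(i, j, p) is the
% distance of data point i to centroid j of particle p.
Dist = zeros(n, k, nParticles);
for p = 1:nParticles
    for j = 1:k
        diffs = bsxfun(@minus, X, pos(j, :, p));  % point minus centroid j
        Dist(:, j, p) = sqrt(sum(diffs.^2, 2));   % two-norm per data point
    end
end
[minD, nearest] = min(Dist, [], 2);               % nearest centroid per point
Assign = squeeze(nearest).';                      % nParticles-by-n, as in text
```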

We then calculate the current fitness value of every particle, using the Assignment matrix and the Distances matrix computed in the above steps; we take the mean of the distance values (obtained from the Distances matrix) as the fitness measure of a particle. The personal best (Pbest) fitness of the particle is compared with its current fitness; if the current fitness is lower, the Pbest of the particle is updated. The fitness of every particle of the swarm is calculated and stored in a swarm fitness vector, whose minimum value represents the global best (Gbest).
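A sketch of this bookkeeping, continuing the sketches above (pbestFit and pbestPos are assumed to be initialized to the particles' starting fitness values and positions):

```matlab
% Fitness and Pbest/Gbest update sketch (illustrative).
fit = squeeze(mean(minD, 1));            % mean nearest-centroid distance
fit = fit(:);                            % nParticles-by-1 fitness vector
improved = fit < pbestFit;               % particles that beat their Pbest
pbestFit(improved) = fit(improved);
pbestPos(:, :, improved) = pos(:, :, improved);
[gbestFit, gIdx] = min(pbestFit);        % best fitness over the swarm
gbestPos = pbestPos(:, :, gIdx);         % Gbest centroid set
```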

The quantization error for the current value of the number of clusters is set to the Gbest fitness. The Euclidean distances between pairs of centroids of a particle are calculated [using the pdist() function of Matlab]; following Eq. (4), this gives the inter-cluster distance for the current number of clusters. The inertia, cognitive, and social components of every particle are calculated, and the position and velocity of every particle are updated. These steps are repeated for the specified number of iterations.
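For example, continuing the sketches above (illustrative; Vmax is an assumed clamping limit):

```matlab
% Per-iteration reporting and swarm update sketch.
qErr = gbestFit;                          % quantization error = Gbest fitness
interD = min(pdist(gbestPos));            % inter-cluster distance via pdist()
for p = 1:nParticles                      % inertia + cognitive + social terms
    r1 = rand(k, nDim);  r2 = rand(k, nDim);
    vel(:, :, p) = w * vel(:, :, p) ...
        + C1 * r1 .* (pbestPos(:, :, p) - pos(:, :, p)) ...
        + C2 * r2 .* (gbestPos - pos(:, :, p));
    vel(:, :, p) = max(min(vel(:, :, p), Vmax), -Vmax);  % velocity clamping
    pos(:, :, p) = pos(:, :, p) + vel(:, :, p);          % position update
end
```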

The entire process is repeated for the range of values specified for the number of clusters (centroids). At the end, the algorithm outputs the optimal number of clusters with its centroids, along with the quantization error, inter-cluster distance, intra-cluster distance, Dunn index, and silhouette index.

4 Experimental Setup and Results

In the implementation, we first run the K-means algorithm and obtain the centroids it returns. These centroids are used to initialize the position of one of the particles of the swarm; the positions of the remaining particles are initialized randomly. Every particle is represented by a number of cluster centroids, which are placed in the search space during the particle initialization step of the PSO algorithm. After initialization, data points are assigned to particles by considering the distance between the positions of the particles and the positions of the data points. The position, velocity, Pbest, and Gbest of the particles are then updated. PSO optimizes the values of inter-cluster distance and quantization error for producing high-quality clusters. We apply velocity clamping to ensure that the velocity of the particles remains in the specified range and does not exceed the maximum allowable velocity. After PSO completes, the cluster validity measures are calculated for measuring the quality of the resulting clusters. The process is repeated until every value in the user-specified range for the number of clusters has been tested.

In the experiments, the proposed approach optimizes the quantization error and inter-cluster distance to produce better-quality clusters. We also implemented the traditional K-means partition clustering technique for comparison of results, along with the various clustering validity measures used to validate the results produced by K-means and by the proposed approach.

4.1 Dataset Description

We implemented the proposed approach in Matlab R2014a and tested it on various primary and real-world datasets. The primary datasets used for the experiments are Iris, Breast cancer, and Thyroid, taken from the University of California Irvine Machine Learning Repository [15]. The real-world datasets Bank customer, Stock price, and Uscensus [7] are also tested. As partition clustering techniques work well only on numerical data, we use numerical datasets of different sizes and characteristics. Table 1 summarizes the important information of each dataset.

Table 1:

Datasets Used for Experiments.

Dataset          # attributes   # data vectors
Iris                   4              150
Breast cancer          9              699
Thyroid               21             7200
Bank customer         33             8192
Stock price           10              950
Uscensus              68             8000

4.2 Experimental Results and Discussion

This section presents a comparison of results obtained using the K-means algorithm with those obtained using the proposed approach on primary and real-world datasets. The main purpose is to compare the quality of the obtained clusters, where quality is measured using the following measures: inter-cluster distance, intra-cluster distance, quantization error, silhouette index, and Dunn index. To obtain better cluster quality, the proposed approach seeks to maximize the values of inter-cluster distance, silhouette index, and Dunn index, and minimize the values of intra-cluster distance and quantization error.

Table 2 presents the comparison of results obtained using K-means and the proposed algorithms for primary datasets: Iris, Breast cancer, and Thyroid. We present the mean and standard deviation values for different measures, obtained over 20 runs. The values of standard deviation are useful to find out the range of values to which the algorithm converges. For the proposed algorithm, we set the following PSO parameters to ensure good convergence: No. of particles=30, w=0.72, C1=C2=1.49.

Table 2:

Comparison of Results Obtained Using the Proposed Approach and K-means on Primary Datasets.

Dataset (# clusters) [# rows, # cols]   Technique           Inter-cluster distance (max)   Intra-cluster distance (min)   Quantization error (min)   Silhouette index (max)   Dunn index (max)
Iris (4 clusters) [150, 4]              K-means             1.1480±0.1931                  2.9284±0.4782                  0.7321±0.1195              0.6610±0.0264            0.0825±0.0227
                                        Proposed approach   1.1973±0.1705                  2.2756±0.0411                  0.5689±0.0102              0.6659±0.0239            0.0869±0.0245
Breast cancer (4 clusters) [699, 9]     K-means             0.6726±0.1218                  4.1813±1.0351                  1.0453±0.2587              0.4952±0.0843            0.1380±0.0319
                                        Proposed approach   0.6868±0.0827                  2.2433±0.0773                  0.5608±0.0193              0.5149±0.0514            0.1420±0.0253
Thyroid (4 clusters) [7200, 21]         K-means             0.9145±0.1334                  3.1980±1.9489                  0.7995±0.4872              0.4103±0.1057            0.3750±0.0889
                                        Proposed approach   0.9245±0.1262                  2.3581±0.2313                  0.5895±0.0578              0.4140±0.0565            0.3740±0.0782

It is observed from Table 2 that the proposed approach has smaller values of average quantization error than the K-means algorithm. Moreover, the standard deviation values of the average quantization error are also smaller than those of K-means, which suggests the stability of the proposed approach. For all problems, the proposed approach produced marginally higher values of inter-cluster distance. Considering intra-cluster distance, a significant difference between the values is observed for the Breast cancer dataset. The values obtained by the two algorithms for the silhouette index and the Dunn index are similar, except for the Breast cancer dataset.

A significant difference in results, in terms of the values of all measures, is observed between the two algorithms for all real-world datasets. Considering the values of inter-cluster distance, the proposed approach obtained significantly larger values, suggesting larger separation between clusters. Considering the values of intra-cluster distance, the proposed approach obtained significantly smaller values, suggesting the generation of compact clusters. Only for the Uscensus dataset is the value of the silhouette index for the proposed approach smaller than that of the K-means algorithm. For all real-world datasets, the proposed algorithm had a smaller average quantization error than the K-means algorithm. Hence, it is concluded that the proposed algorithm gives better-quality clusters than the K-means algorithm.

Table 3 shows the values of the clustering validity measures obtained on the real-world datasets — Bank customer, Stock price, and Uscensus — using K-means and the proposed approach. The results produced by the proposed approach show significant improvement in the clustering validity measures for the real-world datasets. For the Uscensus dataset, the value of the silhouette index produced by the proposed approach is lower than that produced by K-means. This is due to the size of the Uscensus dataset, which is the largest of the real-world datasets used in the experiment: it contains 68 attributes and 8000 data vectors. Therefore, from Tables 2 and 3, it can be concluded that the values of clustering validity measures such as the silhouette index and the Dunn index produced by the proposed approach can be worse when large datasets are considered in the experiments.

Table 3:

Comparison of Results Obtained Using the Proposed Approach and K-means on Real-World Datasets.

Dataset (# clusters) [# rows, # cols]   Technique           Inter-cluster distance (max)   Intra-cluster distance (min)   Quantization error (min)   Silhouette index (max)   Dunn index (max)
Bank customer (6 clusters) [8192, 33]   K-means             4.2515±0.1439                  136.0222±6.4171                22.6703±1.069              0.1178±0.0014            0.0898±0.0006
                                        Proposed approach   10.0009±2.8273                 13.4863±6.7771                 2.2477±1.1295              0.2302±0.2198            0.0915±0.0363
Stock price (10 clusters) [950, 10]     K-means             6.9557±1.1537                  262.3045±108.8180              26.2304±10.8818            0.4270±0.0249            0.0349±0.0044
                                        Proposed approach   8.2936±2.2537                  65.2668±2.2587                 6.5266±0.2258              0.4502±0.0340            0.0379±0.0106
Uscensus (6 clusters) [8000, 68]        K-means             11.7015±2.8830                 67.1739±3.1549                 11.1964±0.5265             0.7092±0.1136            0.1433±0.0394
                                        Proposed approach   19.3038±6.6077                 49.9808±5.0247                 8.3301±0.8374              0.6032±0.2282            0.4599±0.5328

From the results presented in Tables 2 and 3, it is observed that the improvement obtained by the proposed approach on the primary datasets is less significant than that obtained on the real-world datasets. The justification is as follows: the primary datasets are small in size, well separated, and non-overlapping. Owing to these characteristics, it is easy for the K-means algorithm to carry out partition clustering on them.

The proposed approach shows significant improvement over the traditional K-means technique on the real-world datasets. Real-world datasets are not linearly separable and not pre-processed; moreover, they contain clusters of small and irregular sizes. Therefore, it is difficult to perform partition clustering on real-world datasets, and the traditional partition clustering technique, K-means, gives poor results on them. The proposed approach optimizes the values of inter-cluster distance and quantization error, due to which it performs well on highly dense and non-linear data.

Figure 2 shows the plots obtained using the proposed approach for the optimization of quantization error and inter-cluster distance versus the number of iterations for the Breast cancer dataset. It shows that the quantization error decreases and the inter-cluster distance increases as the number of iterations increases. During every iteration of PSO, the velocities and positions of the swarm change and the algorithm seeks a better solution. A steady graph after a certain number of iterations suggests that no further improvement in the obtained solution is possible.

Figure 2: Optimization of Inter-cluster Distance and Quantization Error.

A further experiment was carried out on the Uscensus dataset by splitting the dataset based on different subsets of attributes. The proposed approach was run on the resulting datasets for different values of the number of clusters, and its results were compared with those of K-means based on the clustering validity measures. Figures 3–7 present comparisons of the results obtained for the Uscensus dataset based on inter-cluster distance, intra-cluster distance, quantization error, silhouette index, and Dunn index when splitting the dataset into 20, 40, and 68 attributes. The comparisons show that as the dimension of the dataset increases, the proposed approach produces significant improvements over the traditional partition clustering technique, K-means.

Figure 3: Comparison of Obtained Results for Uscensus Dataset Based on Inter-cluster Distance.

Figure 4: Comparison of Obtained Results for Uscensus Dataset Based on Intra-cluster Distance.

Figure 5: Comparison of Obtained Results for Uscensus Dataset Based on Quantization Error.

Figure 6: Comparison of Obtained Results for Uscensus Dataset Based on Silhouette Index.

Figure 7: Comparison of Obtained Results for Uscensus Dataset Based on Dunn Index.

5 Conclusions

In this paper, we have proposed an approach based on the combination of PSO and K-means and presented its steps. The proposed work was evaluated using various clustering validity measures on primary and real-world datasets of different sizes. A comparison of the results produced by the proposed approach and by K-means was presented to show the significant improvements achieved by the proposed approach. Furthermore, the paper also carried out a comparison of results obtained by splitting the data into different sizes to check the scalability of the proposed approach. Based on the result analysis, it can be concluded that as the dimension of the dataset increases, the proposed approach produces significant improvements over the traditional partition clustering technique, K-means.

Bibliography

[1] S. Ayramo and T. Karkkainen, Introduction to partitioning-based clustering methods with a robust example, in: Reports of the Department of Mathematics and Information Technology (Series C: Software and Computational Engineering), University of Jyväskylä, Finland, 2006.

[2] P. Berkhin, A survey of clustering data mining techniques, in: Grouping Multidimensional Data, pp. 25–71, Springer, Berlin, 2006. doi:10.1007/3-540-28349-8_2.

[3] K. J. Cios and G. W. Moore, Uniqueness of medical data mining, Artif. Intell. Med. 26 (2002), 1–24. doi:10.1016/S0933-3657(02)00049-0.

[4] R. Cooley, B. Mobasher and J. Srivastava, Web mining: information and pattern discovery on the world wide web, in: Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence, pp. 558–567, IEEE, Piscataway, NJ, 1997.

[5] A. K. Jain, M. N. Murty and P. J. Flynn, Data clustering: a review, ACM Comput. Surv. 31 (1999), 264–323. doi:10.1145/331499.331504.

[6] J. Kennedy, Particle swarm optimization, in: Encyclopedia of Machine Learning, pp. 760–766, Springer, Berlin, 2010.

[7] Minnesota Population Center, https://www.ipums.org/. Accessed September 2015.

[8] M. Neshat, S. F. Yazdi, D. Yazdani and M. Sargolzaei, A new cooperative algorithm based on PSO and k-means for data clustering, J. Comput. Sci. 8 (2012), 188–194. doi:10.3844/jcssp.2012.188.194.

[9] T. N. Pappas, An adaptive clustering algorithm for image segmentation, IEEE T. Signal Proces. 40 (1992), 901–914. doi:10.1109/78.127962.

[10] G. K. Patel, V. K. Dabhi and H. B. Prajapati, Study and analysis of particle swarm optimization for improving partition clustering, in: 2015 International Conference on Advances in Computer Engineering and Applications (ICACEA), pp. 218–225, IEEE, Piscataway, NJ, 2015. doi:10.1109/ICACEA.2015.7164699.

[11] S. Rana, S. Jasola and R. Kumar, A hybrid sequential approach for data clustering using K-Means and particle swarm optimization algorithm, Int. J. Eng. Sci. Technol. 2 (2010), 167–176. doi:10.4314/ijest.v2i6.63708.

[12] S. Rana, S. Jasola and R. Kumar, A review on particle swarm optimization algorithms and their applications to data clustering, Artif. Intell. Rev. 35 (2011), 211–222. doi:10.1007/s10462-010-9191-9.

[13] E. Rendón, I. Abundez, A. Arizmendi and E. M. Quiroz, Internal versus external cluster validation indexes, Int. J. Comput. Commun. 5 (2011), 27–34.

[14] G. Saini and H. Kaur, A novel approach towards K-mean clustering algorithm with PSO, Int. J. Comput. Sci. Inf. Technol. 5 (2014), 5978–5986.

[15] University of California Irvine Machine Learning Repository, http://archive.ics.uci.edu/ml/. Accessed September 2015.

[16] D. W. Van der Merwe and A. P. Engelbrecht, Data clustering using particle swarm optimization, in: The 2003 Congress on Evolutionary Computation (CEC’03), 1, pp. 215–220, IEEE, Piscataway, NJ, 2003. doi:10.1109/CEC.2003.1299577.

[17] R. Xu and D. Wunsch, Survey of clustering algorithms, IEEE T. Neural Networ. 16 (2005), 645–678. doi:10.1109/TNN.2005.845141.

Received: 2015-9-11
Published Online: 2016-5-26
Published in Print: 2017-7-26

©2017 Walter de Gruyter GmbH, Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
