
Pattern Recognition

Volume 88, April 2019, Pages 569-583

HG-means: A scalable hybrid genetic algorithm for minimum sum-of-squares clustering

https://doi.org/10.1016/j.patcog.2018.12.022

Highlights

  • An efficient hybrid GA is proposed for minimum sum-of-squares clustering (MSSC).

  • It finds higher-quality local minima than K-means and state-of-the-art algorithms.

  • Its computational effort grows linearly with the number of samples and clusters.

  • Better local minima of the MSSC translate into better cluster validity.

  • Large improvements are observable for datasets with many clusters and dimensions.

Abstract

Minimum sum-of-squares clustering (MSSC) is a widely used clustering model, of which the popular K-means algorithm constitutes a local minimizer. It is well known that the solutions of K-means can be arbitrarily distant from the true MSSC global optimum, and dozens of alternative heuristics have been proposed for this problem. However, no other algorithm has been predominantly adopted in the literature. This may be related to differences in computational effort, or to the assumption that a near-optimal solution of the MSSC has only a marginal impact on clustering validity.

In this article, we dispute this belief. We introduce an efficient population-based metaheuristic that uses K-means as a local search in combination with problem-tailored crossover, mutation, and diversification operators. This algorithm can be interpreted as a multi-start K-means, in which the initial center positions are carefully sampled based on the search history. The approach is scalable and accurate, outperforming all recent state-of-the-art algorithms for MSSC in terms of solution quality, measured by the depth of local minima. This enhanced accuracy leads to clusters which are significantly closer to the ground truth than those of other algorithms, for overlapping Gaussian-mixture datasets with a large number of features. Therefore, improved global optimization methods appear to be essential to better exploit the MSSC model in high dimension.

Introduction

Broadly defined, clustering is the problem of organizing a collection of elements into coherent groups in such a way that similar elements are in the same cluster and different elements are in different clusters. Of the models and formulations for this problem, the Euclidean minimum sum-of-squares clustering (MSSC) is prominent in the literature. MSSC can be formulated as an optimization problem in which the objective is to minimize the sum-of-squares of the Euclidean distances of the samples to their cluster means. This problem has been extensively studied over the last 50 years, as highlighted by various surveys and books [see, e.g., [14], [19], [22]].

The NP-hardness of MSSC [2] and the size of practical datasets explain why most MSSC algorithms are heuristics, designed to produce an approximate solution in a reasonable computational time. K-means [16] (also called Lloyd’s algorithm [32]) and K-means++ [5] are two popular local search algorithms for MSSC that differ in the construction of their initial solutions. Their simplicity and low computational complexity explain their extensive use in practice. However, these methods have two significant disadvantages: (i) their solutions can be distant from the global optimum, especially in the presence of a large number of clusters and dimensions, and (ii) their performance is sensitive to the initial conditions of the search.
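As a point of reference for the local search these methods build on, Lloyd's algorithm alternates an assignment step (each sample joins its nearest center) with an update step (each center moves to the mean of its cluster) until no center moves. A minimal, illustrative pure-Python sketch (function name, seeding, and the toy convergence test are ours, not the paper's) could look as follows:

```python
import random

def kmeans(points, m, iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and re-centering.
    `points` is a list of coordinate tuples; `m` is the number of clusters."""
    rng = random.Random(seed)
    # Random initial centers drawn from the samples (K-means++ would bias
    # this sampling toward spread-out points instead).
    centers = [list(p) for p in rng.sample(points, m)]
    for _ in range(iters):
        # Assignment step: each sample joins its nearest center.
        clusters = [[] for _ in range(m)]
        for p in points:
            j = min(range(m),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for j, cl in enumerate(clusters):
            if cl:
                new_centers.append([sum(x) / len(cl) for x in zip(*cl)])
            else:
                new_centers.append(centers[j])  # keep an emptied cluster in place
        if new_centers == centers:
            break  # converged to a local minimum of the MSSC objective
        centers = new_centers
    # MSSC objective value (sum of squared distances to the nearest center).
    sse = sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points)
    return centers, sse
```

Running this on two well-separated toy blobs recovers them; on harder instances, the quality of the returned local minimum depends entirely on the initial centers, which is precisely the sensitivity issue (ii) above.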

To circumvent these issues, a variety of heuristics and metaheuristics have been proposed with the aim of better escaping from shallow local minima (i.e., poor solutions in terms of the MSSC objective). Nearly all the classical metaheuristic frameworks have been applied, including simulated annealing, tabu search, variable neighborhood search, iterated local search, evolutionary algorithms [1], [15], [21], [34], [37], [39], as well as more recent incremental methods and convex optimization techniques [4], [7], [8], [23]. However, these sophisticated methods have not been predominantly used in machine learning applications. This may be explained by three main factors: (1) data size and computational time restrictions, (2) the limited availability of implementations, or (3) the belief that a near-optimal solution of the MSSC model has little impact on clustering validity.

To help remove these barriers, we introduce a simple and efficient hybrid genetic search for the MSSC called HG-means, and conduct extensive computational analyses to measure the correlation between solution quality (in terms of the MSSC objective) and clustering performance (based on external measures). Our method combines the improvement capabilities of the K-means algorithm with a problem-tailored crossover, an adaptive mutation scheme, and population-diversity management strategies. The overall method can be seen as a multi-start K-means algorithm, in which the initial center positions are sampled by the genetic operators based on the search history. HG-means’ crossover uses a minimum-cost matching algorithm as a subroutine, with the aim of inheriting genetic material from both parents without excessive perturbation and creating child solutions that can be improved in a limited number of iterations. The adaptive mutation operator has been designed to help cover distant samples without being excessively attracted by outliers. Finally, the population is managed so as to prohibit clones and favor the discovery of diverse solutions, a feature that helps to avoid premature convergence toward low-quality local minima.
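To make the matching-based crossover concrete, the idea is to pair each center of one parent with a center of the other parent via an exact minimum-cost matching on squared distances, then inherit one center from each pair. The sketch below is hypothetical (the function name and the 50/50 inheritance rule are our assumptions, and a brute-force search over permutations stands in for the polynomial matching algorithm, so it is only viable for small m):

```python
import random
from itertools import permutations

def matching_crossover(parent_a, parent_b, seed=0):
    """Illustrative matching-based crossover for MSSC center sets.
    Pairs each center of parent A with a center of parent B via an exact
    minimum-cost matching, then inherits one center per pair at random."""
    rng = random.Random(seed)
    m = len(parent_a)
    # Squared-distance cost between every pair of parent centers.
    dist = [[sum((a - b) ** 2 for a, b in zip(ca, cb)) for cb in parent_b]
            for ca in parent_a]
    # Exact minimum-cost matching, brute-forced over permutations here;
    # a polynomial bipartite-matching algorithm would replace this in practice.
    best = min(permutations(range(m)),
               key=lambda perm: sum(dist[i][perm[i]] for i in range(m)))
    # Each child center comes from one parent of its matched pair, so the
    # child mixes genetic material without excessive perturbation.
    return [parent_a[i] if rng.random() < 0.5 else parent_b[best[i]]
            for i in range(m)]
```

Because matched centers are close to each other, the child stays near both parents in solution space and K-means can repair it in few iterations, which is the design intent described above.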

As demonstrated by our experiments on a variety of datasets, HG-means produces MSSC solutions of significantly higher quality than those provided by previous algorithms. Its computational time is also lower than that of recent state-of-the-art optimization approaches, and it grows linearly with the number of samples and dimension. Moreover, when considering the reconstruction of a mixture of Gaussians, we observe that the standard repeated K-means and K-means++ approaches remain trapped in shallow local minima which can be very far from the ground truth, whereas HG-means consistently attains better local optima and finds more accurate clusters. The performance gains are especially pronounced on datasets with a larger number of clusters and a feature space of higher dimension, in which more independent information is available, but also in which pairwise distances are known to become more uniform and less meaningful. Therefore, some key challenges associated with high-dimensional data clustering may be overcome by improving the optimization algorithms, before even considering a change of clustering model or paradigm.

The remainder of this article is structured as follows. Section 2 formally defines the MSSC and reviews the related literature. Section 3 describes the proposed HG-means algorithm. Section 4 reports our computational experiments, and Section 5 provides some concluding remarks.


Problem statement

In a clustering problem, we are given a set P = {p_1, …, p_n} of n samples, where each sample p_i is represented as a point in R^d with coordinates (p_{i1}, …, p_{id}), and we seek to partition P into m disjoint clusters C = (C_1, …, C_m) so as to minimize a criterion f(C). There is no universal objective suitable for all applications, but f(·) should generally promote homogeneity (similar samples should be in the same cluster) and separation (different samples should be in different clusters). MSSC corresponds to
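The formulation is cut off in this excerpt; the standard MSSC objective it refers to, stated here for completeness, minimizes the sum of squared Euclidean distances from each sample to the mean of its cluster:

```latex
\min_{C_1,\dots,C_m} \; f(C) \;=\; \sum_{j=1}^{m} \; \sum_{p_i \in C_j} \lVert p_i - \mu_j \rVert^2,
\qquad \text{where} \quad \mu_j = \frac{1}{|C_j|} \sum_{p_i \in C_j} p_i .
```

For a fixed partition, each cluster mean μ_j is the unique minimizer of the within-cluster sum of squares, which is why the K-means update step recomputes centers as cluster means.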

Proposed methodology

HG-means is a hybrid metaheuristic that combines the exploration capabilities of a genetic algorithm and the improvement capabilities of a local search, along with general population-management strategies to preserve the diversity of the genetic information. Similarly to Kivijärvi et al. [26], Krishna and Murty [27] and some subsequent studies, the K-means algorithm is used as a local search. Moreover, the proposed method differs from previous work in its main variation operators: it relies on

Experimental analysis

We conducted extensive computational experiments to evaluate the performance of HG-means. After a description of the datasets and a preliminary parameter calibration (Sections 4.1–4.2), our first analysis focuses on solution quality from the perspective of the MSSC optimization problem (Section 4.3). We compare the solution quality obtained by HG-means with that of the current state-of-the-art algorithms, in terms of objective function value and computational time, and we study the sensitivity

Conclusions and perspectives

In this article, we have studied the MSSC problem, a classical clustering model of which the popular K-means algorithm constitutes a local minimizer. We have proposed a hybrid genetic algorithm, HG-means, that combines the improvement capabilities of K-means as a local search with the diversification capabilities of problem-tailored genetic operators. The algorithm uses an exact minimum-cost matching crossover operator and an adaptive mutation procedure to generate strategic initial center

Declarations of interest

None.

Acknowledgments

The authors thank the four anonymous referees for their detailed comments, which significantly contributed to improving this paper. This research is partially supported by CNPq [grant number 308498/2015-1] and FAPERJ [grant number E-26/203.310/2016] in Brazil. This support is gratefully acknowledged.

Daniel Gribel is currently pursuing a PhD degree in the field of optimization and automated reasoning, in the Department of Computer Science at the Pontifical Catholic University of Rio de Janeiro, Brazil. He is active in a variety of academic and industrial projects. His current research focuses on global optimization techniques applied to data mining problems, such as clustering, classification, regression and detection of patterns.

References (43)

  • N. Karmitsa et al., Clustering in large data sets with the limited memory bundle method, Pattern Recognit. (2018)

  • A. Likas et al., The global k-means clustering algorithm, Pattern Recognit. (2003)

  • U. Maulik et al., Genetic algorithm-based clustering technique, Pattern Recognit. (2000)

  • M. Sarkar et al., A clustering algorithm using an evolutionary programming-based approach, Pattern Recognit. Lett. (1997)

  • P. Scheunders, A comparison of clustering algorithms applied to color image quantization, Pattern Recognit. Lett. (1997)

  • S.Z. Selim et al., A simulated annealing algorithm for the clustering problem, Pattern Recognit. (1991)

  • K. Sörensen et al., MA|PM: Memetic algorithms with population management, Comput. Oper. Res. (2006)

  • D. Aloise et al., NP-hardness of Euclidean sum-of-squares clustering, Mach. Learn. (2009)

  • D. Aloise et al., An improved column generation algorithm for minimum sum-of-squares clustering, Math. Program. (2012)

  • L.T.H. An et al., New and efficient DCA based algorithms for minimum sum-of-squares clustering, Pattern Recognit. (2014)

  • D. Arthur et al., K-means++: The advantages of careful seeding, SODA'07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2007)

Thibaut Vidal is a professor in the Department of Computer Science at the Pontifical Catholic University of Rio de Janeiro, Brazil. Previously, he was a postdoctoral researcher at the Laboratory for Information and Decision Systems at MIT. His research interests include combinatorial optimization and integer and convex programming, with applications to machine learning, signal processing, resource allocation and logistics problems.
