Selecting the Number of Clusters K with a Stability Trade-off: An Internal Validation Criterion

Mourer, Alex; Forest, Florent; Lebbah, Mustapha; Azzag, Hanane; Lacaille, Jérôme

doi:10.1007/978-3-031-33374-3_17

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13935)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Model selection is a major challenge in non-parametric clustering. There is no universally accepted way to evaluate clustering results, for the obvious reason that no ground truth is available. The difficulty of finding a universal evaluation criterion is a consequence of the ill-defined objective of clustering. From this perspective, clustering stability has emerged as a natural and model-agnostic principle: an algorithm should find stable structures in the data. If data sets are repeatedly sampled from the same underlying distribution, an algorithm should find similar partitions. However, stability alone is not well suited to determining the number of clusters. For instance, it is unable to detect whether the number of clusters is too small. We propose a new principle: a good clustering should be stable, and within each cluster, there should exist no stable partition. This principle leads to a novel clustering validation criterion based on between-cluster and within-cluster stability, overcoming limitations of previous stability-based methods. We empirically demonstrate the effectiveness of our criterion in selecting the number of clusters and compare it with existing methods. Code is available at https://github.com/FlorentF9/skstab.

A. Mourer and F. Forest—Equal contribution. Supported by ANRT CIFRE grants and Safran Aircraft Engines.
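
The principle stated in the abstract can be illustrated with a small, self-contained sketch. This is not the criterion from the chapter and does not use the skstab API; every name below (`kmeans`, `rand_index`, `perturb`, `stability`, `within_stability`) is hypothetical. The sketch assumes that stability is measured by clustering several noise-perturbed copies of the data and averaging the pairwise Rand index between the resulting partitions, that k-means with a deterministic farthest-first initialisation is the clustering algorithm, and that between-cluster and within-cluster stability might be combined by a plain difference. None of these choices is confirmed by the abstract.

```python
import random
from itertools import combinations

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm on 2-D tuples with farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append((sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members)))
            else:
                updated.append(centers[j])  # keep the centre of an empty cluster
        if updated == centers:
            break
        centers = updated
    return labels

def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (same/different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    if not pairs:
        return 1.0
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def perturb(points, sigma, rng):
    """Assumption: perturbation means adding isotropic Gaussian noise."""
    return [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma)) for x, y in points]

def stability(points, k, sigma, rng, n_runs=6):
    """Mean pairwise Rand index between clusterings of perturbed copies."""
    runs = [kmeans(perturb(points, sigma, rng), k) for _ in range(n_runs)]
    scores = [rand_index(a, b) for a, b in combinations(runs, 2)]
    return sum(scores) / len(scores)

def within_stability(points, labels, sigma, rng, sub_ks=(2, 3)):
    """Size-weighted mean, over clusters, of the most stable sub-partition."""
    total = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        best = max(stability(members, k, sigma, rng) for k in sub_ks)
        total += best * len(members) / len(points)
    return total

# Toy data: three well-separated Gaussian blobs of 30 points each.
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (20.0, 0.0), (0.0, 20.0)]
data = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
        for cx, cy in blob_centers for _ in range(30)]

sigma = 0.5  # hypothetical noise level, chosen by hand for this toy data
for k in (1, 2, 3, 4):
    labels = kmeans(data, k)
    between = stability(data, k, sigma, rng)
    within = within_stability(data, labels, sigma, rng)
    print(f"K={k}  between={between:.2f}  within={within:.2f}  "
          f"difference={between - within:.2f}")
```

With K = 1 the partition is trivially stable, which is the limitation the abstract points out: between-cluster stability alone cannot tell that K is too small. The within-cluster term is what penalises that solution here, because the single cluster still contains a perfectly stable three-way split. How the noise level should be set and how the two terms should be traded off are left open in this sketch.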

Author information

Authors and Affiliations

SAMM, Université Paris 1 Panthéon Sorbonne, Paris, France
Alex Mourer
LIPN (CNRS UMR 7030), Université Sorbonne Paris Nord, Villetaneuse, France
Florent Forest & Hanane Azzag
DAVID lab, Université de Versailles/Paris-Saclay, Versailles, France
Mustapha Lebbah
Safran Aircraft Engines, Moissy-Cramayel, France
Alex Mourer, Florent Forest & Jérôme Lacaille

Corresponding author

Correspondence to Florent Forest.

Editor information

Editors and Affiliations

Kyoto University, Kyoto, Japan
Hisashi Kashima
IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY, USA
Tsuyoshi Ide
National Chiao Tung University, Hsinchu, Taiwan
Wen-Chih Peng

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3611 KB)

About this paper

Cite this paper

Mourer, A., Forest, F., Lebbah, M., Azzag, H., Lacaille, J. (2023). Selecting the Number of Clusters K with a Stability Trade-off: An Internal Validation Criterion. In: Kashima, H., Ide, T., Peng, WC. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2023. Lecture Notes in Computer Science, vol 13935. Springer, Cham. https://doi.org/10.1007/978-3-031-33374-3_17

DOI: https://doi.org/10.1007/978-3-031-33374-3_17
Published: 27 May 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-33373-6
Online ISBN: 978-3-031-33374-3
eBook Packages: Computer Science, Computer Science (R0)
