Stability and model selection in k-means clustering

  • Published: 29 April 2010
  • Volume 80, pages 213–243 (2010)
  • Ohad Shamir¹ &
  • Naftali Tishby²

Abstract

Clustering stability methods are a family of widely used model selection techniques for data clustering. Their unifying theme is that an appropriate model should yield a clustering that is robust to various kinds of perturbations. Despite their relative success, little is known theoretically about why or when they work, or even what kind of assumptions they make in choosing an ‘appropriate’ model. Moreover, recent theoretical work has shown that they might ‘break down’ for large enough samples. In this paper, we focus on the behavior of clustering stability using k-means clustering. Our main technical result is an exact characterization of the distribution to which suitably scaled measures of instability converge, based on a sample drawn from any distribution in ℝⁿ satisfying mild regularity conditions. From this, we can show that clustering stability does not ‘break down’ even for arbitrarily large samples, at least for the k-means framework. Moreover, it allows us to identify the factors that eventually determine the behavior of clustering stability. This leads to some basic observations about what kind of assumptions are made when using these methods. While often reasonable, these assumptions might also lead to unexpected consequences.
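
As a rough illustration of the general idea behind stability-based model selection, the sketch below estimates the instability of k-means for a given k: cluster two independent samples, then measure how often the two resulting clusterings disagree on fresh points, after matching cluster labels. It is not the estimator analyzed in the article. The function names, the toy three-blob distribution, the k-means++ style seeding, the number of restarts, and the disagreement measure (fraction of mismatched labels, minimized over label permutations) are all assumptions made for illustration, and the value returned is raw rather than scaled.

```python
import itertools
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(p, centers):
    """Index of the centroid nearest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))


def seed_centers(points, k, rng):
    """k-means++ style seeding (an illustrative choice): sample by squared distance."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, rng, iters=50, restarts=5):
    """Lloyd's algorithm with restarts; returns the lowest-cost centroids found."""
    best, best_cost = None, float("inf")
    for _restart in range(restarts):
        centers = seed_centers(points, k, rng)
        for _step in range(iters):
            groups = [[] for _ in range(k)]
            for p in points:
                groups[assign(p, centers)].append(p)
            # Move each centroid to the mean of its group; keep it if the group is empty.
            updated = [
                tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
                for j, g in enumerate(groups)
            ]
            if updated == centers:
                break
            centers = updated
        cost = sum(sq_dist(p, centers[assign(p, centers)]) for p in points)
        if cost < best_cost:
            best, best_cost = centers, cost
    return best


def clustering_distance(c1, c2, eval_points):
    """Fraction of eval_points labelled differently, minimised over label permutations."""
    labels1 = [assign(p, c1) for p in eval_points]
    labels2 = [assign(p, c2) for p in eval_points]
    fewest = len(eval_points)
    for perm in itertools.permutations(range(len(c1))):
        mismatches = sum(1 for a, b in zip(labels1, labels2) if perm[a] != b)
        fewest = min(fewest, mismatches)
    return fewest / len(eval_points)


def instability(sampler, k, m, rng, pairs=10, eval_size=300):
    """Average disagreement between k-means solutions fitted on independent samples of size m."""
    total = 0.0
    for _ in range(pairs):
        s1 = [sampler(rng) for _ in range(m)]
        s2 = [sampler(rng) for _ in range(m)]
        fresh = [sampler(rng) for _ in range(eval_size)]
        total += clustering_distance(kmeans(s1, k, rng), kmeans(s2, k, rng), fresh)
    return total / pairs


def three_blobs(rng):
    """Hypothetical toy distribution: equal mixture of three well-separated Gaussians in the plane."""
    cx, cy = rng.choice([(0.0, 0.0), (6.0, 0.0), (3.0, 5.0)])
    return (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))


if __name__ == "__main__":
    rng = random.Random(0)
    for k in range(2, 5):
        print(k, instability(three_blobs, k, 150, rng))
```

A stability-based procedure would typically prefer the k with the lowest estimated instability. On this toy mixture one would expect k = 3 to give a value near zero, since the three blobs are far apart relative to their spread. The permutation search is only practical for small k.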



Author information

Authors and Affiliations

  1. School of Computer Science and Engineering, The Hebrew University, Jerusalem, 91904, Israel

    Ohad Shamir

  2. School of Computer Science and Engineering, and Interdisciplinary Center for Neural Computation, The Hebrew University, Jerusalem, 91904, Israel

    Naftali Tishby


Corresponding author

Correspondence to Ohad Shamir.

Additional information

Editors: Sham Kakade and Ping Li.


About this article

Cite this article

Shamir, O., Tishby, N. Stability and model selection in k-means clustering. Mach Learn 80, 213–243 (2010). https://doi.org/10.1007/s10994-010-5177-8


  • Received: 15 March 2009

  • Accepted: 01 November 2009

  • Published: 29 April 2010

  • Issue Date: September 2010

  • DOI: https://doi.org/10.1007/s10994-010-5177-8


Keywords

  • Clustering
  • Model selection
  • Stability
  • Statistical learning theory