
BICO: BIRCH Meets Coresets for k-Means Clustering

  • Conference paper
Algorithms – ESA 2013 (ESA 2013)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8125)


Abstract

We design a data stream algorithm for the k-means problem, called BICO, that combines the data structure of the SIGMOD Test of Time Award-winning algorithm BIRCH [27] with the theoretical concept of coresets for clustering problems. The k-means problem asks for a set C of k centers minimizing the sum of the squared distances from every point in a set P to its nearest center in C. In a data stream, the points arrive one by one in arbitrary order and storage space is limited.
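
Concretely, the objective is cost(P, C) = Σ_{p ∈ P} min_{c ∈ C} ‖p − c‖². The minimal sketch below evaluates this cost with NumPy; the function name and array layout are our own illustration and are not taken from the paper's implementation.

    import numpy as np

    def kmeans_cost(P, C):
        # P: (n, d) array of points, C: (k, d) array of centers.
        # Sum of squared Euclidean distances from each point in P
        # to its nearest center in C.
        sq_dists = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # (n, k)
        return sq_dists.min(axis=1).sum()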

In practice, BICO computes high-quality solutions in a short time. First, BICO computes a summary S of the data with a provable quality guarantee: for every center set C, S has the same cost as P up to a factor of (1 + ε), i.e., S is a coreset. Then, it runs k-means++ [5] on S.
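
To illustrate the second stage, the sketch below applies k-means++-style seeding to a weighted summary, where each summary point carries the total weight of the input points it represents. This is a hedged sketch under our own assumptions (the names S, w, and the seeding-only scope are illustrative); it does not reproduce BICO's coreset construction, which maintains a BIRCH-like tree of clustering features.

    import numpy as np

    def weighted_kmeanspp_seeding(S, w, k, seed=None):
        # S: (m, d) array of summary (coreset) points, w: (m,) positive weights.
        # Returns k seed centers chosen from S, sampling each new center
        # proportionally to weight times squared distance to the nearest
        # center picked so far.
        rng = np.random.default_rng(seed)
        centers = [S[rng.choice(len(S), p=w / w.sum())]]
        d2 = ((S - centers[0]) ** 2).sum(axis=1)
        for _ in range(1, k):
            probs = w * d2
            c = S[rng.choice(len(S), p=probs / probs.sum())]
            centers.append(c)
            d2 = np.minimum(d2, ((S - c) ** 2).sum(axis=1))
        return np.array(centers)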

We compare BICO experimentally with popular and very fast heuristics (BIRCH, MacQueen [24]) and with approximation algorithms (StreamKM++ [2], StreamLS [16,26]) that have the best known quality guarantees. BICO achieves the same quality as these approximation algorithms in a much shorter running time, and it computes much better solutions than the heuristics at the cost of only a moderate increase in running time.

This research was partly supported by DFG grants BO 2755/1-1 and SO 514/4-3 and within the Collaborative Research Center SFB876, project A2.


References

  1. Asuncion, A., Newman, D.J.: UCI machine learning repository (2007)

  2. Ackermann, M.R., Märtens, M., Raupach, C., Swierkot, K., Lammersen, C., Sohler, C.: StreamKM++: A clustering algorithm for data streams. ACM Journal of Experimental Algorithmics 17(1) (2012)

  3. Agarwal, P.K., Erickson, J.: Geometric range searching and its relatives. Contemporary Mathematics 223, 1–56 (1999)

  4. Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Approximating extent measures of points. Journal of the ACM 51(4), 606–635 (2004)

  5. Arthur, D., Vassilvitskii, S.: How slow is the k-means method? In: Proc. of the 22nd SoCG, pp. 144–153 (2006)

  6. Arthur, D., Vassilvitskii, S.: k-means++: The advantages of careful seeding. In: Proc. of the 18th SODA, pp. 1027–1035 (2007)

  7. Bentley, J.L., Saxe, J.B.: Decomposable searching problems I: Static-to-dynamic transformation. J. Algorithms 1(4), 301–358 (1980)

  8. Chen, K.: On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM Journal on Computing 39(3), 923–947 (2009)

  9. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, pp. 226–231 (1996)

  10. Feldman, D., Langberg, M.: A unified framework for approximating and clustering data. In: Proc. of the 43rd STOC, pp. 569–578 (2011)

  11. Feldman, D., Monemizadeh, M., Sohler, C.: A PTAS for k-means clustering based on weak coresets. In: Proc. of the 23rd SoCG, pp. 11–18 (2007)

  12. Feldman, D., Schmidt, M., Sohler, C.: Constant-size coresets for k-means, PCA and projective clustering. In: Proc. of the 24th SODA, pp. 1434–1453 (2012)

  13. Fink, G.A., Plötz, T.: Open source project ESMERALDA

  14. Fisher, D.H.: Knowledge acquisition via incremental conceptual clustering. Machine Learning 2(2), 139–172 (1987)

  15. Frahling, G., Sohler, C.: Coresets in dynamic geometric data streams. In: Proc. of the 37th STOC, pp. 209–217 (2005)

  16. Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams: Theory and practice. IEEE TKDE 15(3), 515–528 (2003)

  17. Guha, S., Rastogi, R., Shim, K.: ROCK: A robust clustering algorithm for categorical attributes. Information Systems 25(5), 345–366 (2000)

  18. Guha, S., Rastogi, R., Shim, K.: CURE: An efficient clustering algorithm for large databases. Information Systems 26(1), 35–58 (2001)

  19. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: On clustering validation techniques. Journal of Intelligent Information Systems 17(2-3), 107–145 (2001)

  20. Har-Peled, S., Kushal, A.: Smaller coresets for k-median and k-means clustering. Discrete & Computational Geometry 37(1), 3–19 (2007)

  21. Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proc. of the 36th STOC, pp. 291–300 (2004)

  22. Langberg, M., Schulman, L.J.: Universal ε-approximators for integrals. In: Proc. of the 21st SODA, pp. 598–607 (2010)

  23. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)

  24. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proc. of the 5th Berkeley Symp. on Math. Stat. and Prob., pp. 281–297 (1967)

  25. Ng, R.T., Han, J.: CLARANS: A method for clustering objects for spatial data mining. IEEE TKDE 14(5), 1003–1016 (2002)

  26. O’Callaghan, L., Meyerson, A., Motwani, R., Mishra, N., Guha, S.: Streaming-data algorithms for high-quality clustering. In: Proc. of the 18th ICDE, pp. 685–694 (2002)

  27. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery 1(2), 141–182 (1997)


Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fichtenberger, H., Gillé, M., Schmidt, M., Schwiegelshohn, C., Sohler, C. (2013). BICO: BIRCH Meets Coresets for k-Means Clustering. In: Bodlaender, H.L., Italiano, G.F. (eds) Algorithms – ESA 2013. ESA 2013. Lecture Notes in Computer Science, vol 8125. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40450-4_41


  • DOI: https://doi.org/10.1007/978-3-642-40450-4_41

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40449-8

  • Online ISBN: 978-3-642-40450-4

