BICO: BIRCH Meets Coresets for k-Means Clustering

Fichtenberger, Hendrik; Gillé, Marc; Schmidt, Melanie; Schwiegelshohn, Chris; Sohler, Christian

doi:10.1007/978-3-642-40450-4_41

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8125)

Included in the following conference series:

European Symposium on Algorithms

Abstract

We design a data stream algorithm for the k-means problem, called BICO, that combines the data structure of BIRCH [27], an algorithm that won the SIGMOD Test of Time Award, with the theoretical concept of coresets for clustering problems. The k-means problem asks for a set C of k centers minimizing the sum of the squared distances from every point in a set P to its nearest center in C. In a data stream, the points arrive one by one in arbitrary order and there is limited storage space.
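
The k-means objective described above can be written out directly. The following is a minimal sketch in plain Python, not code from BICO; the names `squared_distance` and `kmeans_cost` and the optional `weights` argument (useful later for weighted summaries) are assumptions introduced for illustration.

```python
from typing import Optional, Sequence

Point = Sequence[float]


def squared_distance(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(
    points: Sequence[Point],
    centers: Sequence[Point],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Sum over all points of the (weighted) squared distance to the nearest center."""
    if weights is None:
        # Unweighted input: every point counts once.
        weights = [1.0] * len(points)
    return sum(
        w * min(squared_distance(p, c) for c in centers)
        for p, w in zip(points, weights)
    )


if __name__ == "__main__":
    P = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
    C = [(1.0, 0.0), (10.0, 0.0)]
    print(kmeans_cost(P, C))
```

A streaming algorithm cannot evaluate this function on all of P at once, because P does not fit into the limited storage; this is the motivation for replacing P by a small summary.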

BICO computes high-quality solutions quickly in practice. First, BICO computes a summary S of the data with a provable quality guarantee: for every center set C, the cost of S equals the cost of P up to a (1 + ε)-factor, i.e., S is a coreset. Then, it runs k-means++ [5] on S.
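
The two steps above can be illustrated with a small sketch. It is not BICO's implementation and does not show how the summary is built: it only checks the coreset condition for one given center set and performs the seeding step of k-means++ on a weighted point set. All function names, the weighted representation of S, and the symmetric form of the (1 + ε) check are assumptions made for this example.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, centers, weights) -> float:
    """Weighted sum of squared distances to the nearest center."""
    return sum(
        w * min(squared_distance(p, c) for c in centers)
        for p, w in zip(points, weights)
    )


def cost_within_factor(P, S, S_weights, centers, eps: float) -> bool:
    """Check the coreset condition for ONE center set (a coreset needs it for all)."""
    cost_p = kmeans_cost(P, centers, [1.0] * len(P))
    cost_s = kmeans_cost(S, centers, S_weights)
    return cost_s <= (1 + eps) * cost_p and cost_p <= (1 + eps) * cost_s


def kmeanspp_seed(points, weights, k: int, rng: random.Random) -> List[Point]:
    """k-means++ seeding on weighted points: sample proportionally to w * D^2."""
    centers = [rng.choices(points, weights=weights, k=1)[0]]
    d2 = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        probs = [w * d for w, d in zip(weights, d2)]
        if sum(probs) == 0:
            # Every point coincides with a chosen center; fall back to the weights.
            probs = list(weights)
        c = rng.choices(points, weights=probs, k=1)[0]
        centers.append(c)
        # Keep, for each point, the squared distance to its nearest chosen center.
        d2 = [min(d, squared_distance(p, c)) for p, d in zip(points, d2)]
    return centers


if __name__ == "__main__":
    P = [(0.0, 0.0), (0.0, 0.0), (4.0, 0.0)]
    S, w = [(0.0, 0.0), (4.0, 0.0)], [2.0, 1.0]  # duplicates merged into weights
    print(cost_within_factor(P, S, w, [(1.0, 0.0)], eps=0.0))
    print(kmeanspp_seed(S, w, k=2, rng=random.Random(0)))
```

In this toy example S merely merges duplicate points, so its weighted cost matches that of P exactly. A real coreset is much smaller than P and satisfies the bound only approximately, but for every center set C. The seeding step would normally be followed by iterations of the k-means method on the weighted points.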

We compare BICO experimentally with popular and very fast heuristics (BIRCH, MacQueen [24]) and with approximation algorithms (Stream-KM++ [2], StreamLS [16,26]) with the best known quality guarantees. We achieve the same quality as the approximation algorithms mentioned with a much shorter running time, and we get much better solutions than the heuristics at the cost of only a moderate increase in running time.

This research was partly supported by DFG grants BO 2755/1-1 and SO 514/4-3 and within the Collaborative Research Center SFB876, project A2.

References

Asuncion, A., Newman, D.J.: UCI machine learning repository (2007)
Ackermann, M.R., Märtens, M., Raupach, C., Swierkot, K., Lammersen, C., Sohler, C.: StreamKM++: A clustering algorithm for data streams. ACM Journal of Experimental Algorithmics 17(1) (2012)
Agarwal, P.K., Erickson, J.: Geometric range searching and its relatives. Contemporary Mathematics 223, 1–56 (1999)
Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Approximating extent measures of points. Journal of the ACM 51(4), 606–635 (2004)
Arthur, D., Vassilvitskii, S.: How slow is the k-means method? In: Proc. of the 22nd SoCG, pp. 144–153 (2006)
Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proc. of the 18th SODA, pp. 1027–1035 (2007)
Bentley, J.L., Saxe, J.B.: Decomposable searching problems I: Static-to-dynamic transformation. J. Algorithms 1(4), 301–358 (1980)
Chen, K.: On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM Journal on Computing 39(3), 923–947 (2009)
Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, pp. 226–231 (1996)
Feldman, D., Langberg, M.: A unified framework for approximating and clustering data. In: Proc. of the 43rd STOC, pp. 569–578 (2011)
Feldman, D., Monemizadeh, M., Sohler, C.: A PTAS for k-means clustering based on weak coresets. In: Proc. 23rd SoCG, pp. 11–18 (2007)
Feldman, D., Schmidt, M., Sohler, C.: Constant-size coresets for k-means, PCA and projective clustering. In: Proc. of the 24th SODA, pp. 1434–1453 (2012)
Fink, G.A., Plötz, T.: Open source project ESMERALDA
Fisher, D.H.: Knowledge acquisition via incremental conceptual clustering. Machine Learning 2(2), 139–172 (1987)
Frahling, G., Sohler, C.: Coresets in dynamic geometric data streams. In: Proc. of the 37th STOC, pp. 209–217 (2005)
Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams: Theory and practice. IEEE TKDE 15(3), 515–528 (2003)
Guha, S., Rastogi, R., Shim, K.: ROCK: A robust clustering algorithm for categorical attributes. Inform. Systems 25(5), 345–366 (2000)
Guha, S., Rastogi, R., Shim, K.: CURE: An efficient clustering algorithm for large databases. Inform. Systems 26(1), 35–58 (2001)
Halkidi, M., Batistakis, Y., Vazirgiannis, M.: On clustering validation techniques. Journal of Intelligent Inform. Systems 17(2-3), 107–145 (2001)
Har-Peled, S., Kushal, A.: Smaller coresets for k-median and k-means clustering. Discrete & Computational Geometry 37(1), 3–19 (2007)
Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proc. of the 36th STOC, pp. 291–300 (2004)
Langberg, M., Schulman, L.J.: Universal epsilon-approximators for integrals. In: Proc. of the 21st SODA, pp. 598–607 (2010)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symp. on Math. Stat. and Prob., pp. 281–297 (1967)
Ng, R.T., Han, J.: CLARANS: A method for clustering objects for spatial data mining. IEEE TKDE 14(5), 1003–1016 (2002)
O’Callaghan, L., Meyerson, A., Motwani, R., Mishra, N., Guha, S.: Streaming-data algorithms for high-quality clustering. In: Proc. 18th ICDE, pp. 685–694 (2002)
Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery 1(2), 141–182 (1997)

Author information

Authors and Affiliations

Efficient Algorithms and Complexity Theory, TU Dortmund, Germany
Hendrik Fichtenberger, Marc Gillé, Melanie Schmidt, Chris Schwiegelshohn & Christian Sohler

Editor information

Editors and Affiliations

Department of Computer Science, Utrecht University, Princetonplein 5, 3584 CC, Utrecht, The Netherlands
Hans L. Bodlaender
Dipartimento di Ingegneria Civile e Ingegneria Informatica, Università di Roma “Tor Vergata”, via del Politecnico 1, 00133, Rome, Italy
Giuseppe F. Italiano

About this paper

Cite this paper

Fichtenberger, H., Gillé, M., Schmidt, M., Schwiegelshohn, C., Sohler, C. (2013). BICO: BIRCH Meets Coresets for k-Means Clustering. In: Bodlaender, H.L., Italiano, G.F. (eds) Algorithms – ESA 2013. ESA 2013. Lecture Notes in Computer Science, vol 8125. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40450-4_41

DOI: https://doi.org/10.1007/978-3-642-40450-4_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40449-8
Online ISBN: 978-3-642-40450-4
eBook Packages: Computer Science (R0)
