Abstract
We design a data stream algorithm for the k-means problem, called BICO, that combines the data structure of the SIGMOD Test of Time award winning algorithm BIRCH [27] with the theoretical concept of coresets for clustering problems. The k-means problem asks for a set C of k centers minimizing the sum of the squared distances from every point in a set P to its nearest center in C. In a data stream, the points arrive one by one in arbitrary order and there is limited storage space.
BICO computes high quality solutions in a time short in practice. First, BICO computes a summary S of the data with a provable quality guarantee: For every center set C, S has the same cost as P up to a (1 + ε)-factor, i. e., S is a coreset. Then, it runs k-means++ [5] on S.
We compare BICO experimentally with popular and very fast heuristics (BIRCH, MacQueen [24]) and with approximation algorithms (Stream-KM++ [2], StreamLS [16,26]) with the best known quality guarantees. We achieve the same quality as the approximation algorithms mentioned with a much shorter running time, and we get much better solutions than the heuristics at the cost of only a moderate increase in running time.
This research was partly supported by DFG grants BO 2755/1-1 and SO 514/4-3 and within the Collaborative Research Center SFB876, project A2.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Asuncion, A., Newman, D.J.: UCI machine learning repository (2007)
Ackermann, M.R., Märtens, M., Raupach, C., Swierkot, K., Lammersen, C., Sohler, C.: Streamkm++: A clustering algorithm for data streams. ACM Journal of Experimental Algorithmics 17(1) (2012)
Agarwal, P.K., Erickson, J.: Geometric range searching and its relatives. Contemporary Mathematics 223, 1–56 (1999)
Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Approximating extent measures of points. Journal of the ACM 51(4), 606–635 (2004)
Arthur, D., Vassilvitskii, S.: How slow is the k-means method? In: Proc. of the 22nd SoCG, pp. 144–153 (2006)
Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proc. of the 18th SODA, pp. 1027–1035 (2007)
Bentley, J.L., Saxe, J.B.: Decomposable searching problems i: Static-to-dynamic transformation. J. Algorithms 1(4), 301–358 (1980)
Chen, K.: On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM Journal on Computing 39(3), 923–947 (2009)
Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, pp. 226–231 (1996)
Feldman, D., Langberg, M.: A unified framework for approximating and clustering data. In: Proc. of the 43rd STOC, pp. 569–578 (2011)
Feldman, D., Monemizadeh, M., Sohler, C.: A PTAS for k-means clustering based on weak coresets. In: Proc. 23rd SoCG, pp. 11–18 (2007)
Feldman, D., Schmidt, M., Sohler, C.: Constant-size coresets for k-means, pca and projective clustering. In: Proc. of the 24th SODA, pp. 1434–1453 (2012)
Fink, G.A., Plötz, T.: Open source project ESMERALDA
Fisher, D.H.: Knowledge acquisition via incremental conceptual clustering. Machine Learning 2(2), 139–172 (1987)
Frahling, G., Sohler, C.: Coresets in dynamic geometric data streams. In: Proc. of the 37th STOC, pp. 209–217 (2005)
Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams: Theory and practice. IEEE TKDE 15(3), 515–528 (2003)
Guha, S., Rastogi, R., Shim, K.: Rock: A robust clustering algorithm for categorical attributes. Inform. Systems 25(5), 345–366 (2000)
Guha, S., Rastogi, R., Shim, K.: Cure: An efficient clustering algorithm for large databases. Inform. Systems 26(1), 35–58 (2001)
Halkidi, M., Batistakis, Y., Vazirgiannis, M.: On clustering validation techniques. Journal of Intelligent Inform. Systems 17(2-3), 107–145 (2001)
Har-Peled, S., Kushal, A.: Smaller coresets for k-median and k-means clustering. Discrete & Computational Geometry 37(1), 3–19 (2007)
Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proc. of the 36th STOC, pp. 291–300 (2004)
Langberg, M., Schulman, L.J.: Universal epsilon-approximators for integrals. In: Proc. of the 21st SODA, pp. 598–607 (2010)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symp. on Math. Stat. and Prob., pp. 281–297 (1967)
Ng, R.T., Han, J.: Clarans: A method for clustering objects for spatial data mining. IEEE TKDE 14(5), 1003–1016 (2002)
O’Callaghan, L., Meyerson, A., Motwani, R., Mishra, N., Guha, S.: Streaming-data algorithms for high-quality clustering. In: Proc. 18th ICDE, pp. 685–694 (2002)
Zhang, T., Ramakrishnan, R., Livny, M.: Birch: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery 1(2), 141–182 (1997)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fichtenberger, H., Gillé, M., Schmidt, M., Schwiegelshohn, C., Sohler, C. (2013). BICO: BIRCH Meets Coresets for k-Means Clustering. In: Bodlaender, H.L., Italiano, G.F. (eds) Algorithms – ESA 2013. ESA 2013. Lecture Notes in Computer Science, vol 8125. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40450-4_41
Download citation
DOI: https://doi.org/10.1007/978-3-642-40450-4_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40449-8
Online ISBN: 978-3-642-40450-4
eBook Packages: Computer ScienceComputer Science (R0)