Abstract
We propose a new method, called SimClus, for clustering with a lower bound on similarity. Instead of taking the number of clusters k as input, this similarity-based approach imposes a lower bound on the similarity between each object and its cluster representative (with one representative per cluster). SimClus achieves an O(log n) approximation bound on the number of clusters, whereas for the best previous algorithm the bound can be as poor as O(n). Experiments on real and synthetic datasets show that our algorithm produces more than 40% fewer representative objects, yet offers the same or better clustering quality. We also propose a dynamic variant of the algorithm, which can be used effectively in an on-line setting.
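To make the problem setting concrete, the following is a minimal sketch of clustering under a similarity lower bound. It is not the SimClus algorithm itself, whose details are not given in this abstract. It uses the standard greedy set-cover heuristic, which is known to give a logarithmic approximation on the number of chosen sets. The function name `greedy_representatives`, the parameter names, and the choice of similarity function are illustrative assumptions.

```python
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def greedy_representatives(
    objects: Sequence[T],
    sim: Callable[[T, T], float],
    threshold: float,
) -> Tuple[List[int], Dict[int, int]]:
    """Pick representatives so that every object has similarity >= threshold
    to its representative. Hypothetical helper: a generic greedy set-cover
    sketch, not the authors' method."""
    n = len(objects)
    # covers[i]: indices of objects that object i could represent.
    # Every object can always represent itself.
    covers = [
        {j for j in range(n) if i == j or sim(objects[i], objects[j]) >= threshold}
        for i in range(n)
    ]
    uncovered = set(range(n))
    reps: List[int] = []
    while uncovered:
        # Greedy step: take the object that covers the most uncovered objects.
        best = max(range(n), key=lambda i: len(covers[i] & uncovered))
        reps.append(best)
        uncovered -= covers[best]
    # Assign each object to the first chosen representative that covers it.
    assignment = {j: next(r for r in reps if j in covers[r]) for j in range(n)}
    return reps, assignment
```

For example, with points on a line and the assumed similarity `1 / (1 + |a - b|)`, a threshold of 0.8 groups points that lie within 0.25 of a representative, and the number of clusters falls out of the bound rather than being supplied as k.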
This work was supported in part by NSF Grants EMT-0829835 and CNS-0103708, and NIH Grant 1R01EB0080161-01A1.
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hasan, M.A., Salem, S., Pupacdi, B., Zaki, M.J. (2009). Clustering with Lower Bound on Similarity. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_14
DOI: https://doi.org/10.1007/978-3-642-01307-2_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01306-5
Online ISBN: 978-3-642-01307-2
eBook Packages: Computer Science, Computer Science (R0)