Clustering with Lower Bound on Similarity

Hasan, Mohammad Al; Salem, Saeed; Pupacdi, Benjarath; Zaki, Mohammed J.

doi:10.1007/978-3-642-01307-2_14

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5476)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

We propose a new method, called SimClus, for clustering with a lower bound on similarity. Instead of taking the number of clusters k as input, this similarity-based approach imposes a lower bound on the similarity between each object and its cluster representative (with one representative per cluster). SimClus achieves an O(log n) approximation bound on the number of clusters, whereas for the best previous algorithm the bound can be as poor as O(n). Experiments on real and synthetic datasets show that our algorithm produces more than 40% fewer representative objects, yet offers the same or better clustering quality. We also propose a dynamic variant of the algorithm, which can be used effectively in an on-line setting.
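The abstract does not spell out how SimClus works, so the following is only an illustrative sketch of the problem setting, not the authors' algorithm. It assumes the task can be viewed as a covering problem: pick representatives so that every object has similarity at least a threshold to the representative of its cluster. It uses the standard greedy covering heuristic, which is known to give a logarithmic approximation for set-cover-style problems. The function name `greedy_representatives`, the threshold parameter `beta`, and the toy similarity function are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def greedy_representatives(
    objects: Sequence[T],
    similarity: Callable[[T, T], float],
    beta: float,
) -> Dict[int, List[int]]:
    """Map each chosen representative index to the indices assigned to it.

    Generic greedy covering heuristic (an assumption for illustration,
    not SimClus itself): every object ends up with similarity >= beta
    to the representative of the cluster it is assigned to.
    """
    n = len(objects)
    # covers[i] = indices that object i could represent; always includes i.
    covers = [
        {j for j in range(n) if j == i or similarity(objects[i], objects[j]) >= beta}
        for i in range(n)
    ]
    uncovered = set(range(n))
    clusters: Dict[int, List[int]] = {}
    while uncovered:
        # Greedy step: choose the candidate covering the most uncovered
        # objects; ties are broken in favor of the lowest index.
        best = max(range(n), key=lambda i: (len(covers[i] & uncovered), -i))
        members = covers[best] & uncovered
        # Note: a representative that was already covered earlier stays
        # in its earlier cluster; only still-uncovered objects are assigned.
        clusters[best] = sorted(members)
        uncovered -= members
    return clusters


# Toy example: points on a line, similarity = 1 - distance (hypothetical).
points = [0.0, 0.1, 0.2, 1.0, 1.1]
result = greedy_representatives(points, lambda a, b: 1.0 - abs(a - b), beta=0.85)
print(result)  # {1: [0, 1, 2], 3: [3, 4]}
```

In this toy run the number of clusters is not supplied; it falls out of the threshold. Raising `beta` yields more, tighter clusters, and lowering it yields fewer.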

This work was supported in part by NSF Grants EMT-0829835 and CNS-0103708, and NIH Grant 1R01EB0080161-01A1.

Author information

Authors and Affiliations

Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
Mohammad Al Hasan, Saeed Salem & Mohammed J. Zaki
Chulabhorn Research Institute, Laksi, Bangkok, Thailand
Benjarath Pupacdi

Editor information

Editors and Affiliations

Sirindhorn International Institute of Technology, Thammasat University, 131 Moo 5 Tiwanont Road, 12000, Bangkadi, Muang, Pathumthani, Thailand
Thanaruk Theeramunkong
Dept. of Computer Engineering, Faculty of Engineering, Chulalongkorn University, 10330, Bangkok, Thailand
Boonserm Kijsirikul
Faculty of Science & Engineering, York University, 355 Lumbers Building, 4700 Keele Street, M3J 1P3, Toronto, Ontario, Canada
Nick Cercone
School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, 923-1292, Ishikawa, Japan
Tu-Bao Ho

About this paper

Cite this paper

Hasan, M.A., Salem, S., Pupacdi, B., Zaki, M.J. (2009). Clustering with Lower Bound on Similarity. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_14

DOI: https://doi.org/10.1007/978-3-642-01307-2_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01306-5
Online ISBN: 978-3-642-01307-2
eBook Packages: Computer Science (R0)