On the equivalence of PLSI and projected clustering

Published: 17 January 2013

Abstract

The problem of projected clustering was first proposed at the ACM SIGMOD Conference in 1999, and the Probabilistic Latent Semantic Indexing (PLSI) technique was independently proposed at the ACM SIGIR Conference in the same year. Since then, more than two thousand papers have been written on these problems by the database, data mining, and information retrieval communities, along completely independent lines of work. In this paper, we show that these two problems are essentially equivalent under a probabilistic interpretation of the projected clustering problem. We show that the EM algorithm, when applied to the probabilistic version of the projected clustering problem, can be interpreted almost identically as the PLSI technique. The implications of this equivalence are significant: many of the techniques developed for these problems over the last decade can be used across both. We hope that our observations about the equivalence of these problems will stimulate further research that significantly improves the currently available solutions for either problem.
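As a rough illustration of the EM iteration the abstract refers to, the following is a minimal sketch of the standard symmetric PLSI model, P(d, w) = sum over z of P(z) P(d|z) P(w|z), fitted by EM on a small count matrix. It is not taken from the paper and does not reproduce its derivation. The function name `plsi_em`, its parameters, and the pure-Python data layout are assumptions made for this example. Under one reading of the equivalence, each latent component z plays the role of a cluster, P(d|z) weights the records that belong to it, and P(w|z) weights the dimensions that are relevant to it.

```python
import random


def plsi_em(counts, k, iters=50, seed=0):
    """Fit symmetric PLSI by EM (illustrative sketch, not the paper's code).

    counts: list of rows of nonnegative numbers with at least one positive
            entry; counts[d][w] is the weight of column w in record d.
    k:      number of latent components (hypothetical parameter name).
    Returns (p_z, p_d_z, p_w_z) with p_d_z[z][d] = P(d|z), p_w_z[z][w] = P(w|z).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def random_distribution(n):
        # Strictly positive random vector, normalized to sum to 1.
        v = [rng.random() + 1e-3 for _ in range(n)]
        s = sum(v)
        return [x / s for x in v]

    p_z = [1.0 / k] * k
    p_d_z = [random_distribution(n_docs) for _ in range(k)]
    p_w_z = [random_distribution(n_words) for _ in range(k)]

    for _ in range(iters):
        # Accumulators for expected counts under the current posterior.
        new_z = [0.0] * k
        new_d = [[0.0] * n_docs for _ in range(k)]
        new_w = [[0.0] * n_words for _ in range(k)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z) P(d|z) P(w|z).
                joint = [p_z[z] * p_d_z[z][d] * p_w_z[z][w] for z in range(k)]
                total = sum(joint)
                if total == 0.0:
                    continue
                for z in range(k):
                    r = n * joint[z] / total
                    new_z[z] += r
                    new_d[z][d] += r
                    new_w[z][w] += r
        grand = sum(new_z)
        if grand == 0.0:
            break  # nothing to re-estimate (e.g. all counts are zero)
        # M-step: renormalize the expected counts into distributions.
        for z in range(k):
            if new_z[z] > 0.0:
                p_d_z[z] = [x / new_z[z] for x in new_d[z]]
                p_w_z[z] = [x / new_z[z] for x in new_w[z]]
            p_z[z] = new_z[z] / grand
    return p_z, p_d_z, p_w_z
```

Calling `plsi_em(counts, k=2)` on a small matrix whose rows fall into two groups that use disjoint columns returns one mixing weight and two distributions per component. EM only guarantees a local optimum, so the components found can depend on `seed`.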

References

  1. C. C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, J.-S. Park. Fast Algorithms for Projected Clustering, ACM SIGMOD Conference, 1999.
  2. C. C. Aggarwal, P. S. Yu. Finding Generalized Projected Clusters in High Dimensional Space, ACM SIGMOD Conference, 2000.
  3. C. C. Aggarwal, J. Han, J. Wang, P. Yu. A Framework for Projected Clustering of High Dimensional Data Streams, VLDB, 2004.
  4. C. C. Aggarwal, C. Zhai. A Survey of Text Clustering Algorithms, Mining Text Data, Springer, 2012.
  5. C. C. Aggarwal. Re-designing Distance Functions and Distance-based Applications for High Dimensional Data, ACM SIGMOD Record, March, 2001.
  6. C. C. Aggarwal, C. Reddy. Data Clustering: Algorithms and Applications, CRC Press, 2013.
  7. C. C. Aggarwal, A. Hinneburg, D. Keim. On the Surprising Behavior of Distance Metrics in High Dimensional Space, ICDT, 2001.
  8. R. Agrawal, J. Gehrke, P. Raghavan, D. Gunopulos. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, SIGMOD Conference, 1998.
  9. K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft. When is nearest neighbor meaningful? ICDT Conference, 1999.
  10. D. Blei, A. Ng, M. Jordan. Latent Dirichlet allocation, Journal of Machine Learning Research, 3: pp. 993--1022, 2003.
  11. S. T. Deerwester, S. T. Dumais, G. Furnas, R. Harshman. Indexing by Latent Semantic Analysis, JASIS, 1990.
  12. A. P. Dempster, N. M. Laird, D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, B, vol. 39, no. 1, pp. 1--38, 1977.
  13. I. Dhillon. Co-clustering Documents and Words using bipartite spectral graph partitioning, ACM KDD Conference, 2001.
  14. A. Hinneburg, C. Aggarwal, D. Keim. What is the nearest neighbor in high dimensional space? VLDB Conference, 2000.
  15. T. Hofmann. Probabilistic Latent Semantic Indexing, ACM SIGIR Conference, 1999.
  16. S. C. Madeira, A. L. Oliveira. Bi-clustering Algorithms for Biological Data Analysis: A Survey, IEEE/ACM Transactions on Computational Biology, 1(1), pp. 24--35, 2004.
  17. G. Moise, J. Sander, M. Ester. P3C: A Robust Projected Clustering Algorithm, ICDM Conference, 2006.
  18. B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.
  19. A. Singh, G. Gordon. A Unified View of Matrix Factorization Models, ECML/PKDD Conference, 2008.
  20. W. Xu, X. Liu, Y. Gong. Document Clustering based on non-negative matrix factorization, ACM SIGIR Conference, 2003.

Published in

ACM SIGMOD Record, Volume 41, Issue 4, December 2012, 62 pages
ISSN: 0163-5808
DOI: 10.1145/2430456
Copyright © 2013 Author
Publisher: Association for Computing Machinery, New York, NY, United States
