On the equivalence of PLSI and projected clustering

Abstract
The problem of projected clustering was first proposed at the ACM SIGMOD Conference in 1999, and the Probabilistic Latent Semantic Indexing (PLSI) technique was independently proposed at the ACM SIGIR Conference in the same year. Since then, more than two thousand papers have been written on these problems by the database, data mining, and information retrieval communities, along completely independent lines of work. In this paper, we show that these two problems are essentially equivalent under a probabilistic interpretation of the projected clustering problem. We show that the EM algorithm, when applied to this probabilistic version of the projected clustering problem, can be interpreted almost identically to the PLSI technique. The implications of this equivalence are significant, in that they imply the cross-usability of many of the techniques developed for these problems over the last decade. We hope that our observations about the equivalence of these problems will stimulate further research that can significantly improve the currently available solutions to either of them.
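To make the EM connection concrete, the following is a minimal sketch of the PLSI parameter-update equations from Hofmann's formulation, not the implementation described in this paper. The function name `plsi_em` and the toy corpus are illustrative choices; the model is p(d, w) = Σ_z p(z) p(d|z) p(w|z), with the E-step computing the posterior p(z|d, w) and the M-step re-estimating the three distributions from expected counts.

```python
import numpy as np

def plsi_em(counts, n_topics, n_iters=50, seed=0):
    """Fit PLSI by EM on a document-word count matrix n(d, w).

    counts: (n_docs, n_words) nonnegative array.
    Returns (p_z, p_d_given_z, p_w_given_z).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initialisation of the model parameters.
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_d_z = rng.random((n_topics, n_docs))
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # E-step: posterior p(z | d, w), shape (z, d, w).
        post = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        post /= post.sum(axis=0, keepdims=True) + 1e-12
        # M-step: re-estimate from expected counts n(d, w) * p(z | d, w).
        weighted = counts[None, :, :] * post
        p_w_z = weighted.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_d_z = weighted.sum(axis=2)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True)
        p_z = weighted.sum(axis=(1, 2))
        p_z /= p_z.sum()
    return p_z, p_d_z, p_w_z

# Toy corpus: two groups of documents concentrated on disjoint word subsets,
# i.e. clusters that live in different projections of the word space.
counts = np.array([[10, 8, 0, 0],
                   [9, 7, 1, 0],
                   [0, 0, 10, 9],
                   [1, 0, 8, 10]], dtype=float)
p_z, p_d_z, p_w_z = plsi_em(counts, n_topics=2)
```

Under the equivalence discussed above, each latent factor z plays the role of a projected cluster: p(d|z) is a soft cluster assignment of documents, and p(w|z) concentrates its mass on the subspace of words relevant to that cluster.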