On the equivalence of PLSI and projected clustering

Abstract
The problem of projected clustering was first proposed at the ACM SIGMOD Conference in 1999, and the Probabilistic Latent Semantic Indexing (PLSI) technique was independently proposed at the ACM SIGIR Conference in the same year. Since then, more than two thousand papers have been written on these problems by the database, data mining, and information retrieval communities, along completely independent lines of work. In this paper, we show that these two problems are essentially equivalent under a probabilistic interpretation of the projected clustering problem. We show that the EM algorithm, when applied to this probabilistic version of the projected clustering problem, can be interpreted almost identically to the PLSI technique. The implications of this equivalence are significant, in that they imply the cross-usability of many of the techniques developed for these problems over the last decade. We hope that our observations about the equivalence of these problems will stimulate further research that can significantly improve the currently available solutions to either of them.
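To make the EM connection concrete, the following is a minimal sketch of the PLSI parameter-update equations from Hofmann's formulation, not the implementation described in this paper. The function name `plsi_em` and the toy corpus are illustrative choices; the model is p(d, w) = Σ_z p(z) p(d|z) p(w|z), with the E-step computing the posterior p(z|d, w) and the M-step re-estimating the three distributions from expected counts.

```python
import numpy as np

def plsi_em(counts, n_topics, n_iters=50, seed=0):
    """Fit PLSI by EM on a document-word count matrix n(d, w).

    counts: (n_docs, n_words) nonnegative array.
    Returns (p_z, p_d_given_z, p_w_given_z).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initialisation of the model parameters.
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_d_z = rng.random((n_topics, n_docs))
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # E-step: posterior p(z | d, w), shape (z, d, w).
        post = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        post /= post.sum(axis=0, keepdims=True) + 1e-12
        # M-step: re-estimate from expected counts n(d, w) * p(z | d, w).
        weighted = counts[None, :, :] * post
        p_w_z = weighted.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_d_z = weighted.sum(axis=2)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True)
        p_z = weighted.sum(axis=(1, 2))
        p_z /= p_z.sum()
    return p_z, p_d_z, p_w_z

# Toy corpus: two groups of documents concentrated on disjoint word subsets,
# i.e. clusters that live in different projections of the word space.
counts = np.array([[10, 8, 0, 0],
                   [9, 7, 1, 0],
                   [0, 0, 10, 9],
                   [1, 0, 8, 10]], dtype=float)
p_z, p_d_z, p_w_z = plsi_em(counts, n_topics=2)
```

Under the equivalence discussed above, each latent factor z plays the role of a projected cluster: p(d|z) is a soft cluster assignment of documents, and p(w|z) concentrates its mass on the subspace of words relevant to that cluster.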