Concept Decompositions for Large Sparse Text Data Using Clustering

Dhillon, Inderjit S.; Modha, Dharmendra S.

doi:10.1023/A:1007612920971

Published: January 2001

Volume 42, pages 143–175, (2001)

Machine Learning

Inderjit S. Dhillon¹ &
Dharmendra S. Modha²

Abstract

Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors: a few thousand dimensions and a sparsity of 95 to 99% are typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain “fractal-like” and “self-similar” behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned by all the concept vectors. We empirically establish that the approximation errors of the concept decompositions are close to the best possible, namely, to truncated singular value decompositions. As our third contribution, we show that the concept vectors are localized in the word space, are sparse, and tend towards orthonormality. In contrast, the singular vectors are global in the word space and are dense. Nonetheless, we observe the surprising fact that the linear subspaces spanned by the concept vectors and the leading singular vectors are quite close in the sense of small principal angles between them. In conclusion, the concept vectors produced by the spherical k-means algorithm constitute a powerful sparse and localized “basis” for text data sets.
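The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes NumPy, a small synthetic word-by-document matrix in place of real text, document vectors scaled to unit Euclidean norm, random-document initialization, and a fixed number of iterations. The names `toy_corpus` and `spherical_kmeans` are hypothetical. The sketch computes concept vectors, forms the least-squares concept decomposition, compares its error with a truncated SVD of the same rank, and measures the principal angles between the two subspaces.

```python
import numpy as np

def toy_corpus(n_words=60, n_docs=90, n_topics=3, seed=0):
    """Synthetic sparse word-by-document matrix with unit-norm columns (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = np.zeros((n_words, n_docs))
    block = n_words // n_topics
    for d in range(n_docs):
        t = d % n_topics                                   # topic of document d
        words = rng.choice(block, size=5, replace=False) + t * block
        X[words, d] = rng.integers(1, 4, size=5)           # word counts in 1..3
    return X / np.linalg.norm(X, axis=0)

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster unit-norm columns of X by cosine similarity; return concept vectors and labels."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k randomly chosen documents.
    C = X[:, rng.choice(X.shape[1], size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = np.argmax(C.T @ X, axis=0)                # closest concept by cosine
        for j in range(k):
            members = X[:, labels == j]
            if members.shape[1] == 0:
                continue                                   # keep the old vector for an empty cluster
            s = members.sum(axis=1)
            C[:, j] = s / np.linalg.norm(s)                # centroid scaled to unit norm
    return C, np.argmax(C.T @ X, axis=0)

k = 3
X = toy_corpus()
C, labels = spherical_kmeans(X, k)

# Concept decomposition: least-squares fit of X in the span of the concept vectors.
Z = np.linalg.lstsq(C, X, rcond=None)[0]
err_concept = np.linalg.norm(X - C @ Z)                    # Frobenius norm

# Rank-k truncated SVD, the best possible rank-k approximation.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
err_svd = np.linalg.norm(X - (U[:, :k] * s[:k]) @ Vt[:k])

# Principal angles between span(C) and the span of the k leading singular vectors.
Q, _ = np.linalg.qr(C)
cosines = np.linalg.svd(Q.T @ U[:, :k], compute_uv=False)
angles = np.arccos(np.clip(cosines, 0.0, 1.0))

print("concept decomposition error:", err_concept)
print("truncated SVD error:        ", err_svd)
print("principal angles (radians): ", angles)
```

The truncated SVD error is a lower bound for any rank-k approximation, so `err_svd` never exceeds `err_concept`; how close the two are on real text is the empirical question the paper studies, and this toy data does not settle it.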

Author information

Authors and Affiliations

Department of Computer Science, University of Texas, Austin, TX, 78712, USA
Inderjit S. Dhillon
IBM Almaden Research Center, 650 Harry Road, San Jose, CA, 95120, USA
Dharmendra S. Modha

About this article

Cite this article

Dhillon, I.S., Modha, D.S. Concept Decompositions for Large Sparse Text Data Using Clustering. Machine Learning 42, 143–175 (2001). https://doi.org/10.1023/A:1007612920971

Issue Date: January 2001
DOI: https://doi.org/10.1023/A:1007612920971
