Abstract
Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors; a few thousand dimensions and a sparsity of 95 to 99% are typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high dimensionality and sparsity of the text data, the clusters produced by the algorithm exhibit a certain “fractal-like” and “self-similar” behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned by all the concept vectors. We empirically establish that the approximation errors of the concept decompositions are close to the best possible, namely, to those of truncated singular value decompositions. As our third contribution, we show that the concept vectors are localized in the word space, are sparse, and tend towards orthonormality. In contrast, the singular vectors are global in the word space and are dense. Nonetheless, we observe the surprising fact that the linear subspaces spanned by the concept vectors and the leading singular vectors are quite close in the sense of small principal angles between them. In conclusion, the concept vectors produced by the spherical k-means algorithm constitute a powerful sparse and localized “basis” for text data sets.
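The pipeline summarized in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the random initialization from k documents, the fixed iteration count, the dense storage, and the synthetic data are all assumptions made for illustration. The sketch clusters unit-norm document vectors by cosine similarity, forms concept vectors as normalized centroids, projects the document matrix onto their span by least squares, and compares the result with a truncated SVD of the same rank.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster the unit-norm columns of X (words x documents) by cosine similarity.

    Returns cluster labels and a (words x k) matrix of unit-norm concept vectors.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    # Assumption: initialise concept vectors with k distinct documents.
    C = X[:, rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iter):
        # For unit vectors the inner product equals the cosine similarity.
        labels = np.argmax(C.T @ X, axis=0)
        for j in range(k):
            members = X[:, labels == j]
            if members.shape[1] == 0:
                continue  # empty cluster: keep the previous concept vector
            centroid = members.sum(axis=1)
            norm = np.linalg.norm(centroid)
            if norm > 0:
                C[:, j] = centroid / norm  # centroid normalized to unit norm
    labels = np.argmax(C.T @ X, axis=0)
    return labels, C

def concept_decomposition(X, C):
    """Least-squares approximation of X within the span of the concept vectors."""
    Z, *_ = np.linalg.lstsq(C, X, rcond=None)
    return C @ Z

def truncated_svd(X, k):
    """Best rank-k approximation of X in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k], U[:, :k]

def principal_angles(A, B):
    """Principal angles (radians) between the column spaces of A and B.

    Assumes both matrices have full column rank.
    """
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

def make_synthetic_documents(d=50, n=200, density=0.1, seed=0):
    """Hypothetical stand-in for a word-by-document matrix (about 90% zeros)."""
    rng = np.random.default_rng(seed)
    X = rng.random((d, n)) * (rng.random((d, n)) < density)
    X = X[:, np.linalg.norm(X, axis=0) > 0]   # drop empty documents
    return X / np.linalg.norm(X, axis=0)      # unit Euclidean norm per column

X = make_synthetic_documents()
k = 4
labels, C = spherical_kmeans(X, k)
X_cd = concept_decomposition(X, C)
X_svd, U_k = truncated_svd(X, k)

err_cd = np.linalg.norm(X - X_cd)    # Frobenius-norm errors
err_svd = np.linalg.norm(X - X_svd)
print("concept decomposition error:", err_cd)
print("truncated SVD error:        ", err_svd)
print("principal angles (radians): ", principal_angles(C, U_k))
```

On random synthetic data like this, the errors and angles need not reproduce the behavior reported for real text collections; the sketch only shows how the quantities are computed. Because the truncated SVD is the best rank-k approximation, its error is a lower bound for the concept decomposition error at the same k.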
Cite this article
Dhillon, I.S., Modha, D.S. Concept Decompositions for Large Sparse Text Data Using Clustering. Machine Learning 42, 143–175 (2001). https://doi.org/10.1023/A:1007612920971
DOI: https://doi.org/10.1023/A:1007612920971