Document Clustering

  • Reference work entry
Encyclopedia of Machine Learning

Synonyms

High-dimensional clustering; Text clustering; Unsupervised learning on document datasets

Definition

At a high level, the problem of document clustering is defined as follows. Given a set S of n documents, we would like to partition them into a predetermined number of k subsets S_1, S_2, …, S_k, such that the documents assigned to each subset are more similar to each other than to the documents assigned to different subsets. Document clustering is an essential part of text mining and has many applications in information retrieval and knowledge management. Document clustering faces two big challenges: the dimensionality of the feature space tends to be high (i.e., a document collection often contains thousands or tens of thousands of unique words), and the size of a document collection tends to be large.
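
As an illustration of the definition above, the following is a minimal sketch of one possible way to partition n documents into k subsets. It is not the method described in this entry. It assumes whitespace tokenization, a smoothed TF-IDF weighting, cosine similarity, and a k-means-style refinement with deterministic seeding, and it requires k to be no larger than the number of documents. All function names (`tfidf_vectors`, `dot`, `normalize`, `cluster_documents`) are hypothetical. Vectors are stored as sparse dictionaries because each document uses only a small fraction of the vocabulary.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a sparse vector (term -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def tfidf_vectors(docs):
    """Turn raw strings into unit-length sparse TF-IDF vectors."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch) keeps weights positive.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
               for t, c in tf.items()}
        vectors.append(normalize(vec))
    return vectors


def dot(u, v):
    """Sparse dot product; equals cosine similarity for unit vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def cluster_documents(docs, k, max_iter=20):
    """Return a list of cluster labels (0..k-1), one per document."""
    vectors = tfidf_vectors(docs)
    # Deterministic seeding (an assumption): start from document 0, then
    # repeatedly add the document least similar to the seeds chosen so far.
    seeds = [0]
    while len(seeds) < k:
        candidates = [i for i in range(len(vectors)) if i not in seeds]
        seeds.append(min(
            candidates,
            key=lambda i: max(dot(vectors[i], vectors[s]) for s in seeds)))
    centroids = [dict(vectors[s]) for s in seeds]

    labels = None
    for _ in range(max_iter):
        # Assign each document to its most similar centroid.
        new_labels = [max(range(k), key=lambda c: dot(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Recompute each centroid as the normalized sum of its members.
        for c in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == c:
                    for t, w in v.items():
                        total[t] += w
            if total:  # keep the old centroid if a cluster is empty
                centroids[c] = normalize(total)
    return labels


docs = [
    "stock market shares trading",
    "market trading stock prices",
    "football match goal team",
    "team football league goal",
]
print(cluster_documents(docs, k=2))  # prints [0, 0, 1, 1]
```

In this toy collection the two finance documents share no terms with the two football documents, so the similarity across groups is zero and the partition is unambiguous. On realistic collections the result would depend on the seeding and weighting choices assumed here.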

Motivation and Background

Clustering is an essential component of data mining and a fundamental means of knowledge discovery in data exploration. Fast and high-quality document...

Recommended Reading

  • Boley, D. (1998). Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4), 325–344.

  • Cutting, D. R., Pedersen, J. O., Karger, D. R., & Tukey, J. W. (1992). Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the ACM SIGIR (pp. 318–329). Copenhagen, Denmark.

  • Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1), 1–38.

  • Dhillon, I. S. (2001). Co-clustering documents and words using bipartite spectral graph partitioning. In Knowledge discovery and data mining (pp. 269–274). San Francisco: Morgan Kaufmann.

  • Ding, C., He, X., Zha, H., Gu, M., & Simon, H. (2001). Spectral min-max cut for graph partitioning and data clustering. Technical report TR-2001-XX, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA.

  • Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: Wiley.

  • Fisher, D. (1996). Iterative optimization and simplification of hierarchical clusterings. Journal of Artificial Intelligence Research, 4, 147–180.

  • Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Englewood Cliffs, NJ: Prentice-Hall.

  • Karypis, G. (2002). CLUTO: A clustering toolkit. Technical report 02-017, Department of Computer Science, University of Minnesota. Available at http://www.cs.umn.edu/~cluto.

  • King, B. (1967). Step-wise clustering procedures. Journal of the American Statistical Association, 62, 86–101.

  • MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th symposium on mathematical statistics and probability (pp. 281–297). Berkeley, CA: University of California Press.

  • Salton, G. (1989). Automatic text processing: The transformation, analysis, and retrieval of information by computer. Reading, MA: Addison-Wesley.

  • Sneath, P. H., & Sokal, R. R. (1973). Numerical taxonomy. San Francisco: Freeman.

  • Zahn, K. (1971). Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, (C-20), 68–86.

  • Zha, H., He, X., Ding, C., Simon, H., & Gu, M. (2001). Bipartite graph partitioning and data clustering. In Proceedings of the International Conference on Information and Knowledge Management.

  • Zhao, Y., & Karypis, G. (2004). Criterion functions for document clustering: Experiments and analysis. Machine Learning, 55, 311–331.

Copyright information

© 2011 Springer Science+Business Media, LLC

Cite this entry

Zhao, Y., Karypis, G. (2011). Document Clustering. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_231
