Document Clustering

Zhao, Ying; Karypis, George

doi:10.1007/978-0-387-30164-8_231

Synonyms

High-dimensional clustering; Text clustering; Unsupervised learning on document datasets

Definition

At a high level, the problem of document clustering is defined as follows. Given a set S of n documents, we would like to partition them into a predetermined number k of subsets S₁, S₂, …, Sₖ, such that the documents assigned to each subset are more similar to each other than to the documents assigned to different subsets. Document clustering is an essential part of text mining and has many applications in information retrieval and knowledge management. It faces two big challenges: the dimensionality of the feature space tends to be high (i.e., a document collection often contains thousands or tens of thousands of unique words), and the size of a document collection tends to be large.
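
To make the definition concrete, here is a minimal sketch of one common way to produce such a partition. It is not taken from this entry: the tf-idf weighting, the cosine similarity measure, the k-means-style iteration, and all function names (`tfidf_vectors`, `normalize`, `cosine`, `cluster`) are assumptions chosen for illustration. Documents are stored as sparse dictionaries because most of the thousands of possible terms are absent from any single document.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a sparse vector (term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)


def tfidf_vectors(docs):
    """Turn whitespace-tokenized documents into unit-length sparse tf-idf vectors."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(normalize({t: c * math.log(n / df[t]) for t, c in tf.items()}))
    return vectors


def cosine(u, v):
    """Sparse dot product; equals cosine similarity for unit-length vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def cluster(docs, seeds, iterations=10):
    """Partition docs into k = len(seeds) subsets.

    seeds holds the indices of the documents used as initial centroids
    (an assumption of this sketch; real systems often pick them randomly).
    """
    vectors = tfidf_vectors(docs)
    centroids = [vectors[i] for i in seeds]
    labels = [0] * len(docs)
    for _ in range(iterations):
        # Assignment step: each document joins its most similar centroid.
        labels = [
            max(range(len(centroids)), key=lambda j: cosine(v, centroids[j]))
            for v in vectors
        ]
        # Update step: each centroid becomes the normalized sum of its members.
        for j in range(len(centroids)):
            total = {}
            for v, label in zip(vectors, labels):
                if label == j:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            if total:
                centroids[j] = normalize(total)
    return labels


docs = [
    "stock market trading shares",
    "market shares investors stock",
    "football match goal team",
    "team goal league football",
]
labels = cluster(docs, seeds=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

With k = 2, the two finance documents land in one subset and the two football documents in the other, since documents within each subset share terms and documents across subsets share none.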

Motivation and Background

Clustering is an essential component of data mining and a fundamental means of knowledge discovery in data exploration. Fast and high-quality document...

Recommended Reading

Boley, D. (1998). Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4), 325–344.
Cutting, D. R., Pedersen, J. O., Karger, D. R., & Tukey, J. W. (1992). Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the ACM SIGIR (pp. 318–329). Copenhagen, Denmark.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1), 1–38.
Dhillon, I. S. (2001). Co-clustering documents and words using bipartite spectral graph partitioning. In Knowledge discovery and data mining (pp. 269–274). San Francisco: Morgan Kaufmann.
Ding, C., He, X., Zha, H., Gu, M., & Simon, H. (2001). Spectral min-max cut for graph partitioning and data clustering. Technical report TR-2001-XX, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: Wiley.
Fisher, D. (1996). Iterative optimization and simplification of hierarchical clusterings. Journal of Artificial Intelligence Research, 4, 147–180.
Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Englewood Cliffs, NJ: Prentice-Hall.
Karypis, G. (2002). CLUTO: A clustering toolkit. Technical report 02-017, Department of Computer Science, University of Minnesota. Available at http://www.cs.umn.edu/~cluto.
King, B. (1967). Step-wise clustering procedures. Journal of the American Statistical Association, 62, 86–101.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th symposium on mathematical statistics and probability (pp. 281–297). Berkeley, CA: University of California Press.
Salton, G. (1989). Automatic text processing: The transformation, analysis, and retrieval of information by computer. Reading, MA: Addison-Wesley.
Sneath, P. H., & Sokal, R. R. (1973). Numerical taxonomy. San Francisco: Freeman.
Zahn, C. T. (1971). Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, C-20, 68–86.
Zha, H., He, X., Ding, C., Simon, H., & Gu, M. (2001). Bipartite graph partitioning and data clustering. In Proceedings of the international conference on information and knowledge management.
Zhao, Y., & Karypis, G. (2004). Criterion functions for document clustering: Experiments and analysis. Machine Learning, 55, 311–331.

Editor information

Claude Sammut, School of Computer Science and Engineering, University of New South Wales, Sydney, Australia, 2052
Geoffrey I. Webb, Faculty of Information Technology, Clayton School of Information Technology, Monash University, P.O. Box 63, Victoria, Australia, 3800

About this entry

Cite this entry

Zhao, Y., Karypis, G. (2011). Document Clustering. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_231

DOI: https://doi.org/10.1007/978-0-387-30164-8_231
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-30768-8
Online ISBN: 978-0-387-30164-8
eBook Packages: Computer Science, Reference Module Computer Science and Engineering
