Definition
At a high level, the problem of document clustering is defined as follows. Given a set S of n documents, we would like to partition them into a predetermined number k of subsets S_1, S_2, …, S_k, such that the documents assigned to each subset are more similar to each other than to the documents assigned to different subsets. Document clustering is an essential part of text mining and has many applications in information retrieval and knowledge management. It faces two major challenges: the dimensionality of the feature space tends to be high (a document collection often contains thousands or tens of thousands of unique words), and the number of documents in a collection tends to be large.
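The definition above can be illustrated with a minimal sketch. It assumes a bag-of-words representation, cosine similarity, and a cosine-based (spherical) k-means loop; none of these choices is prescribed by the definition itself, and the function names, the toy documents, and the seed indices are hypothetical.

```python
import math
from collections import Counter

def tf_vector(text):
    """Unit-length term-frequency vector (assumes a non-empty document)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()}

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(term, 0.0) for term, w in u.items())

def centroid(vectors):
    """Mean direction of a non-empty group of vectors, rescaled to unit length."""
    total = Counter()
    for vec in vectors:
        for term, w in vec.items():
            total[term] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {term: w / norm for term, w in total.items()}

def cluster(docs, seeds, iters=10):
    """Partition docs into k = len(seeds) subsets.

    seeds holds the indices of the documents used as initial centroids
    (an assumption made here to keep the sketch deterministic).
    Returns one cluster label per document.
    """
    vecs = [tf_vector(d) for d in docs]
    cents = [vecs[i] for i in seeds]
    labels = None
    for _ in range(iters):
        # Assign each document to its most similar centroid.
        new_labels = [
            max(range(len(cents)), key=lambda j: cosine(v, cents[j]))
            for v in vecs
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Recompute each centroid from its current members.
        for j in range(len(cents)):
            members = [v for v, lab in zip(vecs, labels) if lab == j]
            if members:
                cents[j] = centroid(members)
    return labels

docs = [
    "stock market prices fall",
    "market investors sell stock",
    "team wins football match",
    "football team loses match",
]
labels = cluster(docs, seeds=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

Here n = 4 and k = 2. Documents in the same subset share vocabulary (cosine similarity 0.5 and 0.75 within the two subsets), while documents in different subsets share none (similarity 0), which is the property the definition asks for. Even in this toy collection the feature space has one dimension per unique word, which hints at why dimensionality grows quickly for real collections.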
Motivation and Background
Clustering is an essential component of data mining and a fundamental means of knowledge discovery in data exploration. Fast and high-quality document...
© 2011 Springer Science+Business Media, LLC
About this entry
Cite this entry
Zhao, Y., Karypis, G. (2011). Document Clustering. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_231
Print ISBN: 978-0-387-30768-8
Online ISBN: 978-0-387-30164-8