Abstract
This paper evaluates the performance of different criterion functions in the context of partitional clustering algorithms for document datasets. Our study involves seven criterion functions, three of which are introduced in this paper and four of which have been proposed in the past. We present a comprehensive experimental evaluation involving 15 different datasets, as well as an analysis of the characteristics of the various criterion functions and their effect on the clusters they produce. Our experimental results show that a set of criterion functions consistently outperforms the rest, and that some of the newly proposed criterion functions lead to the best overall results. Our theoretical analysis shows that the relative performance of the criterion functions depends on (i) the degree to which they can operate correctly when the clusters are of different tightness, and (ii) the degree to which they can lead to reasonably balanced clusters.
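To make the notion of a criterion function concrete, the following minimal sketch scores a given partition of unit-length document vectors with two cosine-based objectives. Both objectives and all names (`normalize`, `composite`, `norm_criterion`, `avg_similarity_criterion`) are illustrative assumptions; the abstract does not define the seven functions studied, so this should not be read as the authors' formulation.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


def composite(cluster):
    """Component-wise sum of the document vectors in one cluster."""
    return [sum(column) for column in zip(*cluster)]


def norm_criterion(clusters):
    """Sum over clusters of the composite vector's length (assumed objective)."""
    return sum(math.sqrt(sum(x * x for x in composite(c))) for c in clusters)


def avg_similarity_criterion(clusters):
    """Sum over clusters of the summed pairwise cosine similarity divided by
    cluster size (assumed objective). For unit vectors, the summed pairwise
    similarity, self-pairs included, equals the squared composite length."""
    total = 0.0
    for c in clusters:
        total += sum(x * x for x in composite(c)) / len(c)
    return total


# Toy term-frequency vectors over a hypothetical three-term vocabulary.
a = normalize([2, 1, 0])
b = normalize([1, 2, 0])
c = normalize([0, 1, 2])
d = normalize([0, 0, 1])

good = [[a, b], [c, d]]  # groups documents that share terms
bad = [[a, c], [b, d]]   # mixes documents with little term overlap

print(norm_criterion(good), norm_criterion(bad))
print(avg_similarity_criterion(good), avg_similarity_criterion(bad))
```

Both scores are meant to be maximized by a partitional algorithm. On this toy data the partition that groups documents sharing terms scores higher under both objectives.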
Cite this article
Zhao, Y., Karypis, G. Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering. Mach Learn 55, 311–331 (2004). https://doi.org/10.1023/B:MACH.0000027785.44527.d6
DOI: https://doi.org/10.1023/B:MACH.0000027785.44527.d6