Exploring Validity Indices for Clustering Textual Data

El Sayed, Ahmad; Hacid, Hakim; Zighed, Djamel

doi:10.1007/978-3-540-88067-7_16

Part of the book series: Studies in Computational Intelligence (SCI, volume 165)

Abstract

The goal of any clustering algorithm that produces flat partitions of data is to find both the optimal clustering solution and the optimal number of clusters. One natural way to reach this goal without the need for parameters is to involve a validity index in the clustering process, which can lead to an objective selection of the optimal number of clusters. In this chapter, we provide two main contributions. First, since validity indices have mostly been studied on two- or three-dimensional datasets, we have chosen to evaluate them in real-world applications: document and word clustering. Second, we propose a new context-aware method that aims at enhancing the use of validity indices as stopping criteria in agglomerative algorithms. Experimental results show that the method is a step forward in using validity indices more reliably as stopping criteria.

Author information

Authors and Affiliations

ERIC Laboratory, University of Lyon, 5 avenue Pierre Mendès-France, 69676 Bron cedex, France
Ahmad El Sayed, Hakim Hacid & Djamel Zighed

Editor information

Editors and Affiliations

University of Lyon, Lyon, France
Djamel A. Zighed & Hakim Hacid
Shimane University, Shimane, Japan
Shusaku Tsumoto
University of North Carolina, Charlotte, NC, USA
Zbigniew W. Ras

About this chapter

Cite this chapter

El Sayed, A., Hacid, H., Zighed, D. (2009). Exploring Validity Indices for Clustering Textual Data. In: Zighed, D.A., Tsumoto, S., Ras, Z.W., Hacid, H. (eds) Mining Complex Data. Studies in Computational Intelligence, vol 165. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88067-7_16

DOI: https://doi.org/10.1007/978-3-540-88067-7_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88066-0
Online ISBN: 978-3-540-88067-7
eBook Packages: Engineering, Engineering (R0)
