
Exploring Validity Indices for Clustering Textual Data

Chapter in: Mining Complex Data

Part of the book series: Studies in Computational Intelligence (SCI, volume 165)

Abstract

The goal of any clustering algorithm that produces flat partitions of data is to find both the optimal clustering solution and the optimal number of clusters. One natural way to reach this goal without the need for parameters is to involve a validity index in the clustering process, which can lead to an objective selection of the optimal number of clusters. In this chapter, we provide two main contributions. First, since validity indices have mostly been studied on two- or three-dimensional datasets, we evaluate them in two real-world applications: document clustering and word clustering. Second, we propose a new context-aware method that aims to enhance the use of validity indices as stopping criteria in agglomerative algorithms. Experimental results show that the method is a step forward in using validity indices more reliably as stopping criteria.
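The general idea of driving an agglomerative algorithm with a validity index can be illustrated with a minimal sketch. It is not the context-aware method proposed in the chapter, whose details are not given in the abstract. The sketch assumes average-linkage merging, the mean silhouette width as the validity index, Euclidean distance, and a toy two-dimensional dataset; the function names (average_linkage, silhouette, best_partition) are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import dist


def average_linkage(a, b, points):
    """Mean pairwise distance between two clusters (lists of point indices)."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))


def silhouette(clusters, points):
    """Mean silhouette width; singleton clusters contribute 0 by convention."""
    total = 0.0
    for ci, cluster in enumerate(clusters):
        if len(cluster) == 1:
            continue
        for i in cluster:
            # a: mean distance to the other members of the same cluster
            a = sum(dist(points[i], points[j]) for j in cluster if j != i) / (len(cluster) - 1)
            # b: smallest mean distance to the members of any other cluster
            b = min(
                sum(dist(points[i], points[j]) for j in other) / len(other)
                for cj, other in enumerate(clusters)
                if cj != ci
            )
            denom = max(a, b)
            total += (b - a) / denom if denom else 0.0
    return total / len(points)


def best_partition(points):
    """Merge bottom-up and keep the level with the highest index value.

    Assumes at least three points, so that at least one level is scored.
    """
    clusters = [[i] for i in range(len(points))]
    best_score, best_clusters = float("-inf"), None
    while len(clusters) > 2:
        # Find and merge the closest pair of clusters under average linkage.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], points),
        )
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
        # Score the partition obtained at this level of the hierarchy.
        score = silhouette(clusters, points)
        if score > best_score:
            best_score, best_clusters = score, [sorted(c) for c in clusters]
    return best_score, best_clusters


# Toy data (an assumption for illustration): three tight, well-separated groups.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]
score, clusters = best_partition(points)
print(len(clusters), clusters)
```

The sketch scores every level of the hierarchy and keeps the best one, so the number of clusters is selected by the index rather than supplied as a parameter. A true stopping criterion would instead halt merging as soon as the index starts to degrade, which is cheaper but more sensitive to local fluctuations of the index.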




Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

El Sayed, A., Hacid, H., Zighed, D. (2009). Exploring Validity Indices for Clustering Textual Data. In: Zighed, D.A., Tsumoto, S., Ras, Z.W., Hacid, H. (eds) Mining Complex Data. Studies in Computational Intelligence, vol 165. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88067-7_16

  • DOI: https://doi.org/10.1007/978-3-540-88067-7_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-88066-0

  • Online ISBN: 978-3-540-88067-7

  • eBook Packages: Engineering, Engineering (R0)
