Postimpact similarity: a similarity measure for effective grouping of unlabelled text using spectral clustering

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Text clustering partitions a set of documents into meaningful groups such that the documents within a cluster are more similar to each other than to the documents of other clusters, according to a similarity or dissimilarity measure. The choice of similarity measure is therefore crucial for producing good-quality clusters. Clusters are generally formed using the content similarity between two documents, measured by the terms they share. This may not be effective, however, for a reasonably large and high-dimensional corpus. A similarity measure is therefore proposed here to improve the performance of text clustering using the spectral method. The proposed measure assigns a score to a pair of documents based on their content similarity and on the similarity of each document with their shared neighbours over the corpus. Its effectiveness has been tested by clustering different standard corpora with the spectral clustering method. The empirical results on some well-known text collections show that the proposed method performs better than state-of-the-art text clustering techniques in terms of normalized mutual information, f-measure and v-measure.
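The pipeline outlined in the abstract (pairwise document similarity, spectral clustering on that similarity, evaluation against known labels) can be sketched with scikit-learn. This is a minimal illustration and not the authors' implementation. The postimpact formula is not given in the abstract, so plain cosine similarity on TF-IDF vectors stands in for the content similarity. The toy corpus, its labels and the helper name `cluster_documents` are all assumptions made for this example.

```python
from sklearn.cluster import SpectralClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import normalized_mutual_info_score, v_measure_score
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus invented for illustration: two topics, three documents each.
docs = [
    "the cat sat on the mat with another cat",
    "a cat and a kitten played on the mat",
    "the kitten chased the cat around the mat",
    "stock markets fell as investors sold shares",
    "investors bought shares when the stock markets rose",
    "shares and stock prices moved with the markets",
]
true_labels = [0, 0, 0, 1, 1, 1]


def cluster_documents(texts, n_clusters, seed=0):
    """Spectral clustering on a precomputed document-similarity matrix.

    Assumption: cosine similarity of TF-IDF vectors is used as the
    affinity. Any symmetric, non-negative similarity matrix (such as
    the one the paper proposes) could be passed in its place.
    """
    tfidf = TfidfVectorizer().fit_transform(texts)
    affinity = cosine_similarity(tfidf)  # symmetric, entries in [0, 1]
    model = SpectralClustering(
        n_clusters=n_clusters,
        affinity="precomputed",  # use the matrix as given
        random_state=seed,
    )
    return model.fit_predict(affinity)


pred = cluster_documents(docs, n_clusters=2)

# External evaluation against the known labels.
nmi = normalized_mutual_info_score(true_labels, pred)
v = v_measure_score(true_labels, pred)
print("labels:", pred.tolist())
print("NMI:", round(nmi, 3), "V-measure:", round(v, 3))
```

Because `affinity="precomputed"` makes the clustering step independent of how the matrix was built, a different similarity measure can be swapped in by replacing only the line that computes `affinity`.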

Notes

  1. http://glaros.dtc.umn.edu.

  2. http://trec.nist.gov.

  3. https://github.com/KlugerLab/SpectralNet.

  4. https://github.com/trigeorgis/Deep-Semi-NMF.

  5. http://scikit-learn.org/stable/modules/clustering.html.

  6. https://github.com/tanmaybasu/Postimpact-Smilarity-for-Text-Clustering.

Additional information

A portion of this work was done while the authors were affiliated with Ramakrishna Mission Vivekananda Educational and Research Institute, Belur Math, West Bengal, India.

Appendix A: additional results

Table 9 reports, in terms of NMI, the performance of the spectral clustering technique using the postimpact similarity measure alongside its performance using some standard linear distance functions, for all the corpora described in Sect. 4.1. These distance functions are defined in Table 8 along with their ranges of values. Table 9 shows that the postimpact similarity measure outperforms the other distance functions for text clustering using the spectral method. The t-test described in Sect. 4.4 was performed to check the statistical significance of the results in Table 9. The results are significant in all cases where the spectral clustering algorithm using postimpact similarity performs better than the same algorithm using the other distance functions. These results demonstrate the effectiveness of the postimpact similarity for text clustering.

Table 9 Performance of spectral clustering technique using different similarity measures in terms of NMI
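The significance check mentioned above can be sketched as a two-sample t-test on per-run NMI scores. The section that specifies the test is not reproduced here, so the unequal-variance (Welch) form is an assumption. The helper name `welch_t` and the score lists are also hypothetical: the numbers are placeholders, not results from the paper.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Assumption: the two score lists are independent samples whose
    variances need not be equal.
    """
    # Squared standard errors of the two sample means.
    se_a = statistics.variance(a) / len(a)
    se_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df


# Placeholder NMI scores over five runs (NOT taken from the paper).
nmi_proposed = [0.62, 0.64, 0.61, 0.65, 0.63]
nmi_baseline = [0.51, 0.55, 0.49, 0.53, 0.52]

t_stat, dof = welch_t(nmi_proposed, nmi_baseline)
print("t =", round(t_stat, 2), "df =", round(dof, 1))
```

The p-value then follows from the t-distribution with `dof` degrees of freedom. With SciPy installed, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same statistic together with the p-value.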

Cite this article

Roy, A.K., Basu, T. Postimpact similarity: a similarity measure for effective grouping of unlabelled text using spectral clustering. Knowl Inf Syst 64, 723–742 (2022). https://doi.org/10.1007/s10115-022-01658-9

  • DOI: https://doi.org/10.1007/s10115-022-01658-9
