Postimpact similarity: a similarity measure for effective grouping of unlabelled text using spectral clustering

Roy, Arnab Kumar; Basu, Tanmay

doi:10.1007/s10115-022-01658-9

Regular Paper
Published: 17 February 2022

Volume 64, pages 723–742, (2022)

Knowledge and Information Systems

Abstract

The task of text clustering is to partition a set of text documents into meaningful groups such that the documents in a cluster are more similar to each other than to the documents of other clusters, according to a similarity or dissimilarity measure. The similarity measure therefore plays a crucial role in producing good-quality clusters. The content similarity between two documents is generally used to form individual clusters, and it is measured by considering the terms shared between the documents. However, content similarity alone may not be effective for a reasonably large and high-dimensional corpus. A similarity measure is therefore proposed here to improve the performance of text clustering using the spectral method. The proposed measure assigns a score to two documents based on their content similarity and their individual similarity with the shared neighbours over the corpus. The effectiveness of the proposed measure has been tested by clustering different standard corpora with the spectral clustering method. The empirical results on some well-known text collections show that the proposed method performs better than the state-of-the-art text clustering techniques in terms of normalized mutual information, f-measure and v-measure.
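
The idea of scoring two documents by both their content similarity and their similarity with shared neighbours can be illustrated with a minimal sketch. It is not the postimpact formula, which the abstract does not state. The function name `neighbour_aware_similarity`, the threshold `theta`, the use of raw term counts, and the rule of averaging the two components are all assumptions made purely for illustration.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def neighbour_aware_similarity(docs, theta=0.2):
    """Hypothetical score, NOT the paper's measure.

    For each pair (i, j) it averages
      * the content similarity of i and j, and
      * the mean support from shared neighbours, i.e. documents k whose
        content similarity to both i and j is at least theta (assumed rule).
    Returns the content-similarity matrix and the combined score matrix.
    """
    vecs = [Counter(d.lower().split()) for d in docs]
    n = len(vecs)
    content = [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]
    score = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                score[i][j] = 1.0
                continue
            shared = [
                k
                for k in range(n)
                if k not in (i, j)
                and content[i][k] >= theta
                and content[j][k] >= theta
            ]
            # Support of a shared neighbour: the weaker of its two links.
            support = (
                sum(min(content[i][k], content[j][k]) for k in shared) / len(shared)
                if shared
                else 0.0
            )
            score[i][j] = 0.5 * (content[i][j] + support)
    return content, score


# Toy corpus, invented for this example only.
docs = [
    "spectral clustering of text documents",
    "clustering text documents with graphs",
    "graphs and spectral methods",
    "football match results",
]
content, score = neighbour_aware_similarity(docs)
```

A symmetric, non-negative matrix such as `score` could then serve as a precomputed affinity for a spectral clustering implementation, for instance scikit-learn's `SpectralClustering(affinity="precomputed")`. Whether the authors' experiments were set up this way is not stated in this excerpt.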

Author information

Authors and Affiliations

LICHESS.ORG, Maine-et-Loire, Loire Valley, France
Arnab Kumar Roy
Department of Data Science and Engineering, Indian Institute of Science Education and Research, Bhopal, India
Tanmay Basu

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A portion of the work was done when the authors were affiliated with Ramakrishna Mission Vivekananda Educational and Research Institute, Belur Math, West Bengal, India.

Appendix A: additional results

Table 9 reports, in terms of NMI, the performance of the spectral clustering technique using the postimpact similarity measure and using some standard linear distance functions, for all the corpora described in Sect. 4.1. These distance functions are defined in Table 8 along with their ranges of values. Table 9 shows that the postimpact similarity measure outperforms the other distance functions for text clustering using the spectral method. The t-test described in Sect. 4.4 has been performed to check the statistical significance of the results in Table 9. The results are significant in all the cases where the spectral clustering algorithm using postimpact similarity performs better than the same algorithm using the other distance functions in Table 9. These results demonstrate the effectiveness of the postimpact similarity for text clustering.

Table 9 Performance of spectral clustering technique using different similarity measures in terms of NMI
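
A significance check of the kind mentioned above can be sketched as follows. This excerpt does not say which variant of the t-test Sect. 4.4 uses, so an unequal-variance (Welch) test is assumed here. The helper `welch_t` and the NMI values are hypothetical placeholders and are not results from the paper.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples.

    Assumes each sample has at least two values and that the samples are
    not both constant (otherwise the denominator is zero).
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Placeholder NMI scores from repeated clustering runs; NOT values from Table 9.
nmi_proposed = [0.62, 0.64, 0.61, 0.65, 0.63]
nmi_baseline = [0.55, 0.58, 0.54, 0.57, 0.56]

t, df = welch_t(nmi_proposed, nmi_baseline)
print(f"t = {t:.3f}, df = {df:.2f}")
```

The statistic is then compared against the t distribution with `df` degrees of freedom at the chosen significance level. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with a p-value.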

About this article

Cite this article

Roy, A.K., Basu, T. Postimpact similarity: a similarity measure for effective grouping of unlabelled text using spectral clustering. Knowl Inf Syst 64, 723–742 (2022). https://doi.org/10.1007/s10115-022-01658-9

Received: 29 September 2019
Revised: 19 January 2022
Accepted: 21 January 2022
Published: 17 February 2022
Issue Date: March 2022
DOI: https://doi.org/10.1007/s10115-022-01658-9
