Fast Text Comparison Based on ElasticSearch and Dynamic Programming

Xiao, Pengcheng; Lu, Peng; Luo, Chunqi; Zhu, Zhousen; Liao, Xuehua

doi:10.1007/978-981-99-7254-8_5

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14306)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

Text comparison is the process of comparing and matching two or more texts to determine their similarities or differences. Calculating the similarity between two texts supports tasks such as classification, clustering, retrieval, and comparison. In this work, we improve existing text matching methods by combining ElasticSearch with dynamic programming. Leveraging the indexing and search capabilities of ElasticSearch, our method enables fast retrieval and comparison of relevant documents. During text comparison, we use an improved LCS (Longest Common Subsequence) algorithm to calculate the matches between the texts. We conduct extensive experiments on real-world datasets to evaluate the performance and effectiveness of our method. The results demonstrate that our approach accomplishes text comparison tasks more efficiently while handling various types of text noise.
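
For readers unfamiliar with the dynamic-programming step, the following is a minimal sketch of the classic textbook LCS computation in Python. It is not the improved variant described in this paper, whose details are not given in the abstract. The function names and the normalised score in `lcs_similarity` are assumptions introduced here for illustration only.

```python
def lcs_table(a, b):
    """Fill the classic LCS dynamic-programming table for sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                # Matching elements extend the best result of both prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise carry over the better of the two shorter prefixes.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def lcs(a, b):
    """Recover one longest common subsequence by walking the table backwards."""
    table = lcs_table(a, b)
    i, j = len(a), len(b)
    out = []
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def lcs_similarity(a, b):
    """Hypothetical score (an assumption, not from the paper):
    LCS length divided by the length of the longer input."""
    if not a and not b:
        return 1.0
    return len(lcs(a, b)) / max(len(a), len(b))
```

The functions accept any indexable sequences, so they can compare strings character by character or lists of tokens, for example `lcs("the quick fox".split(), "the fox".split())`. The table costs time and memory proportional to the product of the two input lengths, which is one reason a retrieval stage that narrows down the documents to compare can matter for speed.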

Author information

Authors and Affiliations

Sichuan Normal University, Chengdu, China
Pengcheng Xiao, Peng Lu, Chunqi Luo, Zhousen Zhu & Xuehua Liao

Corresponding author

Correspondence to Xuehua Liao.

Editor information

Editors and Affiliations

Renmin University of China, Beijing, China
Feng Zhang
Victoria University, Footscray, VIC, Australia
Hua Wang
Qatar University, Doha, Qatar
Mahmoud Barhamgi
Swinburne University of Technology, Hawthorn, Australia
Lu Chen
Swinburne University of Technology, Hawthorn, Australia
Rui Zhou

About this paper

Cite this paper

Xiao, P., Lu, P., Luo, C., Zhu, Z., Liao, X. (2023). Fast Text Comparison Based on ElasticSearch and Dynamic Programming. In: Zhang, F., Wang, H., Barhamgi, M., Chen, L., Zhou, R. (eds) Web Information Systems Engineering – WISE 2023. WISE 2023. Lecture Notes in Computer Science, vol 14306. Springer, Singapore. https://doi.org/10.1007/978-981-99-7254-8_5

DOI: https://doi.org/10.1007/978-981-99-7254-8_5
Published: 21 October 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7253-1
Online ISBN: 978-981-99-7254-8
eBook Packages: Computer Science, Computer Science (R0)
