Fast Text Comparison Based on ElasticSearch and Dynamic Programming

  • Conference paper
Web Information Systems Engineering – WISE 2023 (WISE 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14306)

Abstract

Text comparison is the process of comparing and matching two or more texts to determine their similarities or differences. Computing the similarity between two texts supports tasks such as classification, clustering, retrieval, and comparison. In this work, we improve existing text matching methods by combining ElasticSearch with dynamic programming. ElasticSearch's indexing and search capabilities enable fast retrieval of the relevant documents to compare. During comparison, we use an improved LCS (Longest Common Subsequence) algorithm to compute the matches between the texts. We conduct extensive experiments on real-world datasets to evaluate the performance and effectiveness of our method. The results demonstrate that our approach accomplishes text comparison tasks more efficiently while handling various types of text noise.
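For readers unfamiliar with LCS, the following is a minimal sketch of the textbook dynamic-programming algorithm that the abstract names as its starting point. It is not the improved variant proposed in the paper, which is not described in this preview. The function names `lcs_length` and `lcs_similarity`, and the normalization 2·LCS/(|a|+|b|), are illustrative assumptions rather than the authors' definitions.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (textbook DP)."""
    # Only the previous row of the DP table is needed, so memory is O(len(b)).
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]  # column 0: LCS against the empty prefix of b is 0
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Matching characters extend the LCS of both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one character from either a or b.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(b)]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalization (not from the paper): 2 * LCS / (|a| + |b|)."""
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return 2 * lcs_length(a, b) / (len(a) + len(b))
```

The running time is proportional to the product of the two text lengths. This cost is one reason a retrieval step that first narrows the set of candidate documents, as the abstract describes, can pay off before any pairwise comparison is run.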

Author information

Corresponding author

Correspondence to Xuehua Liao.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Xiao, P., Lu, P., Luo, C., Zhu, Z., Liao, X. (2023). Fast Text Comparison Based on ElasticSearch and Dynamic Programming. In: Zhang, F., Wang, H., Barhamgi, M., Chen, L., Zhou, R. (eds) Web Information Systems Engineering – WISE 2023. WISE 2023. Lecture Notes in Computer Science, vol 14306. Springer, Singapore. https://doi.org/10.1007/978-981-99-7254-8_5

  • DOI: https://doi.org/10.1007/978-981-99-7254-8_5

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-7253-1

  • Online ISBN: 978-981-99-7254-8

  • eBook Packages: Computer Science, Computer Science (R0)
