Abstract
The widespread availability of scientific documents in multiple languages, coupled with the development of automatic translation and editing tools, has created a demand for efficient methods that can detect plagiarism across languages. In this paper, we present a novel cross-lingual plagiarism detection approach. The algorithm merges two existing approaches, each of which achieves state-of-the-art (SOTA) or near-SOTA results on different benchmarks. The detailed analysis stages of the two approaches are merged sequentially, which offsets the weaknesses of each. The resulting algorithm significantly outperforms both of the approaches it is built from, surpassing them by 23 to 33% in Plagdet Score, depending on the language pair. The approaches were compared on a newly generated multilingual (English, Russian, Spanish, Armenian) test collection, in which each suspicious document may contain plagiarised fragments from several languages. The merged method is applicable to under-resourced languages, as demonstrated for Armenian.
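The abstract reports gains in Plagdet Score without defining it. As a minimal sketch, assuming the measure follows the definition commonly used in the PAN plagiarism detection evaluations (the F1 of precision and recall, divided by the base-2 logarithm of one plus granularity), it could be computed as below. The function name `plagdet` and its arguments are illustrative and not taken from the paper, and the precision, recall, and granularity values are assumed to be computed elsewhere.

```python
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """Plagdet under the assumed PAN-style definition: F1 / log2(1 + granularity).

    granularity is the average number of detections reported per detected
    plagiarism case; it is 1.0 when every case is reported exactly once.
    """
    if granularity < 1.0:
        raise ValueError("granularity must be at least 1.0")
    if precision + recall == 0.0:
        # No correct detections at all: F1 is defined as 0 here.
        return 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return f1 / math.log2(1.0 + granularity)


if __name__ == "__main__":
    # Hypothetical input values, for illustration only.
    print(plagdet(precision=0.8, recall=0.8, granularity=1.0))
```

With a granularity of 1.0 the denominator equals 1, so the score reduces to plain F1; a detector that splits each plagiarised fragment into several detections is penalised through the logarithmic term.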
ACKNOWLEDGMENTS
The authors thank ModelFront for providing us with the translation risk scores for our dataset.
Ethics declarations
The authors declare that they have no conflicts of interest.
Cite this article
Avetisyan, K., Gritsay, G. & Grabovoy, A. Cross-Lingual Plagiarism Detection: Two Are Better Than One. Program Comput Soft 49, 346–354 (2023). https://doi.org/10.1134/S0361768823040138
DOI: https://doi.org/10.1134/S0361768823040138