Cross-Lingual Plagiarism Detection: Two Are Better Than One

Avetisyan, K.; Gritsay, G.; Grabovoy, A.

doi:10.1134/S0361768823040138

Published: 28 July 2023

Volume 49, pages 346–354, (2023)

Programming and Computer Software

K. Avetisyan^2,3,
G. Gritsay^1,4 &
A. Grabovoy^1,4

Abstract

The widespread availability of scientific documents in multiple languages, coupled with the development of automatic translation and editing tools, has created a demand for efficient methods that can detect plagiarism across different languages. In this paper, we present a novel cross-lingual plagiarism detection approach. The algorithm merges two existing approaches, each of which achieves state-of-the-art (SOTA) or near-SOTA results on different benchmarks. The detailed analysis stages of the two approaches are merged sequentially, which offsets the disadvantages of each. The resulting algorithm significantly outperforms both of its component approaches, surpassing them by 23 to 33% in Plagdet score, depending on the language pair. The approaches were compared on a newly generated multilingual (English, Russian, Spanish, Armenian) test collection in which each suspicious document may contain plagiarised fragments from several languages. The merged method is applicable to various under-resourced languages, as demonstrated with Armenian.
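
The Plagdet score reported in the abstract is not defined on this page. As a minimal sketch, assuming the standard definition used in the PAN plagiarism detection evaluations (the F1 of precision and recall, divided by log2(1 + granularity), where granularity is at least 1 and penalises reporting one plagiarised case as several fragments), it could be computed as follows. The function name and the input values are illustrative assumptions and are not taken from the paper.

```python
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """Plagdet = F1 / log2(1 + granularity) (assumed PAN-style definition)."""
    if granularity < 1.0:
        raise ValueError("granularity is at least 1 by definition")
    if precision + recall == 0.0:
        # No correct detections at all: F1 is taken as 0.
        return 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return f1 / math.log2(1.0 + granularity)


# Hypothetical inputs, not results from the paper.
# With granularity 1 the score equals F1; higher granularity lowers it.
clean_score = plagdet(precision=0.8, recall=0.6, granularity=1.0)
fragmented_score = plagdet(precision=0.8, recall=0.6, granularity=3.0)
```

With a granularity of exactly 1 the denominator is log2(2) = 1, so the score reduces to plain F1; a detector that splits each case into three reported fragments has its F1 halved under this definition.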

ACKNOWLEDGMENTS

The authors thank ModelFront for providing the translation risk scores for the dataset.

Author information

Authors and Affiliations

Moscow Institute of Physics and Technology, 141701, Dolgoprudny, Moscow oblast, Russia
G. Gritsay & A. Grabovoy
Russian-Armenian University, 0051, Yerevan, Armenia
K. Avetisyan
Ivannikov Institute for System Programming of the RAS, 109004, Moscow, Russia
K. Avetisyan
Antiplagiat Company, Moscow, Russia
G. Gritsay & A. Grabovoy

Corresponding authors

Correspondence to K. Avetisyan, G. Gritsay or A. Grabovoy.

Ethics declarations

The authors declare that they have no conflicts of interest.

About this article

Cite this article

Avetisyan, K., Gritsay, G. & Grabovoy, A. Cross-Lingual Plagiarism Detection: Two Are Better Than One. Program Comput Soft 49, 346–354 (2023). https://doi.org/10.1134/S0361768823040138

Received: 14 March 2023
Revised: 31 March 2023
Accepted: 05 April 2023
Published: 28 July 2023
Issue Date: August 2023
DOI: https://doi.org/10.1134/S0361768823040138
