Cross-Lingual Plagiarism Detection: Two Are Better Than One

Programming and Computer Software

Abstract

The widespread availability of scientific documents in multiple languages, coupled with the development of automatic translation and editing tools, has created a demand for efficient methods that can detect plagiarism across languages. In this paper, we present a novel cross-lingual plagiarism detection approach. The algorithm merges two existing approaches, each of which achieves state-of-the-art (SOTA) or near-SOTA results on different benchmarks. The detailed analysis stages of the two approaches are merged sequentially, which levels out their individual disadvantages. The resulting algorithm significantly outperforms both of the approaches it was built from, surpassing them by 23 to 33% in Plagdet score, depending on the language pair. The approaches were compared on a newly generated multilingual (English, Russian, Spanish, Armenian) test collection in which each suspicious document may contain fragments plagiarised from sources in several languages. The merged method is applicable to various under-resourced languages, as demonstrated on the example of Armenian.
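The abstract reports results as a Plagdet score but does not define it. As a point of orientation, the sketch below follows the definition commonly used in PAN plagiarism detection evaluations, where the F1 of precision and recall is divided by log2(1 + granularity). That formula is an assumption brought in from the PAN evaluation framework, not something stated on this page, and the function and parameter names are hypothetical.

```python
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """Plagdet score as commonly defined in PAN evaluations (assumed here).

    precision, recall: detection precision and recall in [0, 1].
    granularity: average number of detections per detected case (>= 1);
                 a value of 1 means no case was reported in pieces.
    """
    if granularity < 1:
        raise ValueError("granularity must be at least 1")
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    # Fragmented detections (granularity > 1) shrink the score.
    return f1 / math.log2(1 + granularity)
```

With a granularity of exactly 1 the denominator is log2(2) = 1, so the score reduces to plain F1; reporting one plagiarised fragment as several separate detections lowers it.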

Notes

  1. https://github.com/spotify/annoy

  2. https://www.modelfront.com/

  3. https://disk.yandex.ru/d/wrMmU1vZ9cnZ7Q

ACKNOWLEDGMENTS

The authors thank ModelFront for providing the translation risk scores for the dataset.

Author information

Correspondence to K. Avetisyan, G. Gritsay or A. Grabovoy.

Ethics declarations

The authors declare that they have no conflicts of interest.

About this article

Cite this article

Avetisyan, K., Gritsay, G. & Grabovoy, A. Cross-Lingual Plagiarism Detection: Two Are Better Than One. Program Comput Soft 49, 346–354 (2023). https://doi.org/10.1134/S0361768823040138

  • DOI: https://doi.org/10.1134/S0361768823040138
