Abstract
Plagiarism detection can be divided into two subtasks: source retrieval and text alignment. The text alignment subtask extracts all plagiarized passages from a given pair of documents. The challenge is to identify passages that have been obfuscated, and a given pair of documents may contain different types of obfuscation. Information about the type of obfuscation in a document pair could help a plagiarism detection system choose the most suitable text alignment algorithm for each type. This paper proposes an approach that improves text alignment algorithms in this way. A support vector machine (SVM) classifies each document pair according to the obfuscation strategy used in it, and the parameter values of the proposed text alignment algorithm are then set based on the detected type. Results are reported on the Persian Plagdet 2016 corpus. The proposed algorithm ranked first among the nine participating teams in the Persian Plagdet 2016 competition.
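The idea of setting alignment parameters from a predicted obfuscation type can be sketched as a simple lookup. This is only an illustration under stated assumptions: the type labels (`none`, `artificial`, `simulated`), the `AlignmentParams` fields, every numeric value, and the rule-based `toy_classifier` are hypothetical and are not taken from the paper, which uses a trained SVM in place of the stand-in classifier shown here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class AlignmentParams:
    """Hypothetical settings for a seed-and-merge text aligner."""
    ngram_size: int        # length of the word n-grams used as seeds
    min_similarity: float  # minimum similarity for a seed match to be kept
    max_gap: int           # largest gap (in sentences) bridged when merging


# Illustrative values only; none of these numbers come from the paper.
PARAMS_BY_TYPE: Dict[str, AlignmentParams] = {
    "none": AlignmentParams(ngram_size=5, min_similarity=0.9, max_gap=1),
    "artificial": AlignmentParams(ngram_size=3, min_similarity=0.5, max_gap=3),
    "simulated": AlignmentParams(ngram_size=2, min_similarity=0.3, max_gap=5),
}


def toy_classifier(features: Sequence[float]) -> str:
    """Rule-based stand-in for a trained classifier (assumption, not an SVM).

    features[0] is assumed to be the share of exactly matching n-grams
    between the two documents, in the range [0, 1].
    """
    exact_overlap = features[0]
    if exact_overlap > 0.8:
        return "none"
    if exact_overlap > 0.4:
        return "artificial"
    return "simulated"


def choose_params(
    features: Sequence[float],
    classify: Callable[[Sequence[float]], str],
) -> AlignmentParams:
    """Predict the obfuscation type, then look up the matching settings."""
    return PARAMS_BY_TYPE[classify(features)]


if __name__ == "__main__":
    # A pair with high exact overlap gets the strict, verbatim-copy settings.
    print(choose_params([0.95], toy_classifier))
```

Because `choose_params` takes the classifier as an argument, the stand-in could be swapped for any trained model that maps a feature vector to one of the keys of `PARAMS_BY_TYPE`.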
Notes
- 5. In clustering, the values between 0 and 0.1 are removed because almost all document pairs contain at least one value between 0 and 0.1.
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Mashhadirajab, F., Shamsfard, M. (2018). Predicting Type of Obfuscation to Enhance Text Alignment Algorithms. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J. (eds) Text Processing. FIRE 2016. Lecture Notes in Computer Science, vol 10478. Springer, Cham. https://doi.org/10.1007/978-3-319-73606-8_6
DOI: https://doi.org/10.1007/978-3-319-73606-8_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73605-1
Online ISBN: 978-3-319-73606-8
eBook Packages: Computer Science, Computer Science (R0)