Predicting Type of Obfuscation to Enhance Text Alignment Algorithms

Mashhadirajab, Fatemeh; Shamsfard, Mehrnoush

doi:10.1007/978-3-319-73606-8_6

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10478)

Included in the following conference series:

Forum for Information Retrieval Evaluation

Abstract

Plagiarism detection can be divided into two subtasks: source retrieval and text alignment. The text alignment subtask extracts all plagiarized passages from a given pair of documents. The challenge is to identify passages of text that have been obfuscated, and a given pair of documents may contain different types of obfuscation. Knowing the type of obfuscation in a document pair lets the text alignment stage of a plagiarism detection system choose the most suitable algorithm for each type. This paper describes an approach to improving text alignment algorithms along these lines. A support vector machine (SVM) classifier assigns each document pair to the type of obfuscation strategy used in it, and the parameter values of the proposed text alignment algorithm are then set according to the detected type. Results of the proposed algorithm on the Persian PlagDet 2016 corpus are reported. The proposed algorithm ranked first among the nine participating teams in the Persian PlagDet 2016 competition.
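The idea of choosing alignment parameters from a predicted obfuscation type can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the two type labels, the two features, the centroid values, and the parameter names and values are all hypothetical. A nearest-centroid rule also stands in for the SVM classifier so that the example stays self-contained.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class AlignmentParams:
    """Hypothetical tunable settings of a text alignment algorithm."""
    min_seed_similarity: float  # how similar two sentences must be to form a seed
    max_merge_gap: int          # how many unmatched sentences may separate merged seeds


# Hypothetical parameter sets, one per obfuscation type (values are illustrative).
PARAMS_BY_TYPE = {
    "light": AlignmentParams(min_seed_similarity=0.9, max_merge_gap=1),
    "heavy": AlignmentParams(min_seed_similarity=0.4, max_merge_gap=4),
}

# Hypothetical class centroids over two made-up features of a document pair,
# e.g. (best sentence similarity, share of exactly matching n-grams).
CENTROIDS = {
    "light": (0.9, 0.9),
    "heavy": (0.3, 0.2),
}


def predict_obfuscation_type(features):
    """Placeholder classifier: nearest centroid, standing in for the SVM."""
    return min(CENTROIDS, key=lambda label: dist(features, CENTROIDS[label]))


def select_params(features):
    """Pick the parameter set that matches the predicted obfuscation type."""
    return PARAMS_BY_TYPE[predict_obfuscation_type(features)]


if __name__ == "__main__":
    pair_features = (0.2, 0.3)  # features extracted from one document pair
    print(predict_obfuscation_type(pair_features), select_params(pair_features))
```

In a real system the placeholder classifier would be replaced by a trained model, while the lookup from predicted type to parameter set could stay as a plain dictionary.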

Notes

1. http://pan.webis.de/
2. http://ictrc.ac.ir/plagdet/
3. http://fire.irsi.res.in/fire/2016/home
4. http://www.csie.ntu.edu.tw/~cjlin/libsvm/
5. In clustering, values between 0 and 0.1 are removed because almost all document pairs contain at least one value between 0 and 0.1.

Author information

Authors and Affiliations

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
Fatemeh Mashhadirajab & Mehrnoush Shamsfard

Corresponding author

Correspondence to Fatemeh Mashhadirajab.

Editor information

Editors and Affiliations

DAIICT, Gujarat, India
Prasenjit Majumder
Indian Statistical Institute, Kolkata, India
Mandar Mitra
DAIICT, Gujarat, India
Parth Mehta
DAIICT, Gujarat, India
Jainisha Sankhavara

About this paper

Cite this paper

Mashhadirajab, F., Shamsfard, M. (2018). Predicting Type of Obfuscation to Enhance Text Alignment Algorithms. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J. (eds) Text Processing. FIRE 2016. Lecture Notes in Computer Science, vol 10478. Springer, Cham. https://doi.org/10.1007/978-3-319-73606-8_6

DOI: https://doi.org/10.1007/978-3-319-73606-8_6
Published: 04 January 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73605-1
Online ISBN: 978-3-319-73606-8
eBook Packages: Computer Science, Computer Science (R0)
