Predicting Type of Obfuscation to Enhance Text Alignment Algorithms

  • Conference paper
Text Processing (FIRE 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10478)

Abstract

Plagiarism detection can be divided into two subtasks: source retrieval and text alignment. The text alignment subtask extracts all plagiarized passages from a given pair of documents. The challenge is to identify passages whose text has been obfuscated. A given pair of documents may contain different types of obfuscation. Information about the type of obfuscation in a document pair could help a plagiarism detection system choose the most suitable text alignment algorithm for each type. This paper describes an approach to improving text alignment algorithms on that basis. A support vector machine (SVM) classifier is used to classify document pairs according to the obfuscation strategy used in the pair. The parameter values of the proposed text alignment algorithm are then set based on the type of obfuscation detected. Results of the proposed algorithm on the Persian Plagdet 2016 corpus are reported. The proposed algorithm ranked first among nine participating teams in the Persian Plagdet 2016 competition.
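
The two-stage idea in the abstract (predict the obfuscation type of a document pair, then set the alignment parameters for that type) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the labels `none`, `light`, and `heavy`, the `AlignmentParams` fields and their values, and the threshold-based `placeholder_classifier`, which merely stands in for a trained SVM. None of it is taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class AlignmentParams:
    """Hypothetical tuning knobs for a text alignment algorithm."""
    seed_similarity_threshold: float
    max_merge_gap: int


# Illustrative labels and values only; the abstract does not list them.
PARAMS_BY_TYPE: Dict[str, AlignmentParams] = {
    "none": AlignmentParams(seed_similarity_threshold=0.9, max_merge_gap=2),
    "light": AlignmentParams(seed_similarity_threshold=0.6, max_merge_gap=4),
    "heavy": AlignmentParams(seed_similarity_threshold=0.3, max_merge_gap=8),
}


def placeholder_classifier(features: Sequence[float]) -> str:
    """Stand-in for a trained SVM: thresholds the mean feature value.

    The features are assumed to be similarity scores in [0, 1]
    computed for one document pair.
    """
    if not features:
        raise ValueError("at least one feature value is required")
    mean = sum(features) / len(features)
    if mean >= 0.8:
        return "none"
    if mean >= 0.4:
        return "light"
    return "heavy"


def select_params(
    features: Sequence[float],
    classify: Callable[[Sequence[float]], str] = placeholder_classifier,
) -> AlignmentParams:
    """Predict the obfuscation type, then look up the matching parameters."""
    return PARAMS_BY_TYPE[classify(features)]
```

For example, `select_params([0.5, 0.5])` returns the parameter set registered under `light`. Any callable that maps a feature vector to one of the labels, including the prediction method of a trained SVM, could be passed as `classify`.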

Notes

  1. http://pan.webis.de/.

  2. http://ictrc.ac.ir/plagdet/.

  3. http://fire.irsi.res.in/fire/2016/home.

  4. http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

  5. In clustering, the values between 0 and 0.1 are removed because almost all document pairs contain at least one value between 0 and 0.1.

Author information

Corresponding author

Correspondence to Fatemeh Mashhadirajab.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Mashhadirajab, F., Shamsfard, M. (2018). Predicting Type of Obfuscation to Enhance Text Alignment Algorithms. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J. (eds) Text Processing. FIRE 2016. Lecture Notes in Computer Science, vol 10478. Springer, Cham. https://doi.org/10.1007/978-3-319-73606-8_6

  • DOI: https://doi.org/10.1007/978-3-319-73606-8_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73605-1

  • Online ISBN: 978-3-319-73606-8

  • eBook Packages: Computer Science, Computer Science (R0)
