Word Length n-Grams for Text Re-use Detection

Barrón-Cedeño, Alberto; Basile, Chiara; Degli Esposti, Mirko; Rosso, Paolo

doi:10.1007/978-3-642-12116-6_58

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6008)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

The automatic detection of shared content in written documents, which includes text reuse and its unacknowledged form, plagiarism, has become an important problem in Information Retrieval. This task requires an exhaustive comparison of texts to determine how similar they are. However, such a comparison is infeasible when the number of documents is too large. We have therefore designed a model that pre-selects closely related documents so that the exhaustive comparison can be performed afterwards. We use a similarity measure based on word-level n-grams, which has proved quite effective in many applications. As this approach normally becomes impracticable for large real-world datasets, we propose a method based on a preliminary word-length encoding of texts, in which each word is substituted by its length. This provides three important advantages: (i) because the alphabet of the documents is reduced to nine symbols, the space needed to store n-gram lists is reduced; (ii) computation times decrease; and (iii) length n-grams can be represented in a trie, allowing a faster and more flexible comparison. We show experimentally, on the basis of the perplexity measure, that the noise introduced by the length encoding does not significantly decrease the expressiveness of the text. The method is then tested on two large datasets of co-derivatives and simulated plagiarism.
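
The encoding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made for illustration only: the tokenizer (a simple `\w+` regular expression), capping word lengths at 9 to obtain a nine-symbol alphabet, the Jaccard coefficient as a stand-in similarity measure, and a nested-dictionary trie. All function names are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Sequence, Set, Tuple

MAX_LEN = 9  # assumption: lengths above 9 are capped, giving nine symbols (1..9)

NGram = Tuple[int, ...]


def length_encode(text: str) -> List[int]:
    """Replace each word by its length, capped at MAX_LEN."""
    words = re.findall(r"\w+", text)  # assumption: words are runs of \w characters
    return [min(len(w), MAX_LEN) for w in words]


def length_ngrams(codes: Sequence[int], n: int) -> Set[NGram]:
    """Return the set of length n-grams of an encoded text."""
    return {tuple(codes[i:i + n]) for i in range(len(codes) - n + 1)}


def jaccard(a: Set[NGram], b: Set[NGram]) -> float:
    """Jaccard coefficient of two n-gram sets (stand-in similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_trie(ngrams: Iterable[NGram]) -> Dict:
    """Store n-grams in a nested-dict trie; shared prefixes are stored once."""
    root: Dict = {}
    for gram in ngrams:
        node = root
        for symbol in gram:
            node = node.setdefault(symbol, {})
    return root


def has_prefix(trie: Dict, prefix: Sequence[int]) -> bool:
    """Check whether any stored n-gram starts with the given prefix."""
    node = trie
    for symbol in prefix:
        if symbol not in node:
            return False
        node = node[symbol]
    return True


doc_a = "The quick brown fox jumps over the lazy dog"
doc_b = "The quick brown cat jumps over the lazy dog"
grams_a = length_ngrams(length_encode(doc_a), 3)
grams_b = length_ngrams(length_encode(doc_b), 3)
print(jaccard(grams_a, grams_b))
```

In this toy example the two sentences differ by one word of the same length, so their length encodings coincide. This is the kind of noise the abstract refers to; with longer n-grams and realistic documents, such accidental collisions become less likely. Because the trie shares prefixes, a query with a shorter n-gram can be answered with `has_prefix` without rebuilding the index, which is one way to read the flexibility the abstract mentions.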

Author information

Authors and Affiliations

NLEL-ELiRF, Department of Information Systems and Computation, Universidad Politécnica de Valencia, Spain
Alberto Barrón-Cedeño & Paolo Rosso
Dipartimento di Matematica, Università di Bologna, Italy
Chiara Basile & Mirko Degli Esposti

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, 07738, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Barrón-Cedeño, A., Basile, C., Degli Esposti, M., Rosso, P. (2010). Word Length n-Grams for Text Re-use Detection. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2010. Lecture Notes in Computer Science, vol 6008. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12116-6_58

DOI: https://doi.org/10.1007/978-3-642-12116-6_58
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12115-9
Online ISBN: 978-3-642-12116-6
eBook Packages: Computer Science, Computer Science (R0)