Identification of Original Document by Using Textual Similarities

Shrestha, Prasha; Solorio, Thamar

doi:10.1007/978-3-319-18117-2_48

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9042)

Abstract

When two documents share similar content, whether accidentally or intentionally, it is usually unknown which of the two is the original source of that content. This knowledge can be crucial for charging or acquitting someone of plagiarism, for establishing the provenance of a document, or, in the case of sensitive information, for making sure that the source of the information can be relied upon. Our system identifies the original document using the idea that pieces of text written by the same author resemble each other more than they resemble text written by different authors. Given a pair of documents with shared content, our system compares the shared part with the remaining text in each of the documents, treating each as a bag of words. When there is no reference text by one of the authors to compare against, our system makes its prediction based on the similarity of the shared content to just one of the documents.
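The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name a similarity measure or a tokenizer, so cosine similarity over raw word counts and a simple regex tokenizer are assumptions, and the function names (`bag_of_words`, `cosine_similarity`, `guess_original`) are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count word tokens (assumed, simplistic tokenizer)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count bags; 0.0 if either is empty."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def guess_original(shared, rest_a, rest_b):
    """Guess which document is the original.

    shared: the text common to both documents.
    rest_a, rest_b: the remaining (non-shared) text of documents A and B.
    Returns (label, sim_a, sim_b), where label is the document whose
    remaining text is more similar to the shared part (ties go to "A").
    """
    shared_bag = bag_of_words(shared)
    sim_a = cosine_similarity(shared_bag, bag_of_words(rest_a))
    sim_b = cosine_similarity(shared_bag, bag_of_words(rest_b))
    label = "A" if sim_a >= sim_b else "B"
    return label, sim_a, sim_b


if __name__ == "__main__":
    label, sim_a, sim_b = guess_original(
        "the quick brown fox jumps",
        "the quick brown dog sleeps",
        "stock markets fell sharply today",
    )
    print(label, sim_a, sim_b)
```

The sketch covers only the case where both documents have remaining text. The single-document case mentioned in the abstract would need a decision rule that the abstract does not specify, so it is left out here.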

Author information

Authors and Affiliations

Department of Computer Science, University of Houston, 4800 Calhoun Rd., Houston, TX, 77004, USA
Prasha Shrestha & Thamar Solorio

Corresponding author

Correspondence to Prasha Shrestha.

Editor information

Editors and Affiliations

Centro de Investigación en Computación, Instituto Politécnico Nacional, Mexico DF, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Shrestha, P., Solorio, T. (2015). Identification of Original Document by Using Textual Similarities. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2015. Lecture Notes in Computer Science, vol 9042. Springer, Cham. https://doi.org/10.1007/978-3-319-18117-2_48

DOI: https://doi.org/10.1007/978-3-319-18117-2_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18116-5
Online ISBN: 978-3-319-18117-2
eBook Packages: Computer Science, Computer Science (R0)
