Candidate document retrieval for cross-lingual plagiarism detection using two-level proximity information
Introduction
Plagiarism refers to unauthorised use of text, code and ideas (Potthast, Barrón-Cedeño, Stein, & Rosso, 2011). In automatic cross-language plagiarism detection, the task is to retrieve plagiarized text written in language L that has originated from another document in a language other than L. With the rapid growth of documents in different languages, the increased accessibility of electronic documents, and the availability of translation tools, cross-language plagiarism has become a serious problem and its detection requires more attention.
Given a suspicious document s′ and a set of potential source documents D, we should determine whether a fragment of the suspicious document was borrowed from a source document. This task comprises two main steps: candidate retrieval and detailed analysis. Candidate retrieval entails the identification of source documents that contain suspicious fragments. Detailed analysis requires closer comparison of the suspicious document with each suspected source and retrieval of plagiarized fragments. In this paper we focus on the first step, candidate document retrieval. Since a second phase will follow this step to eliminate false positive matches, we are more interested in high recall than in high precision in this research.
In cross-language plagiarism detection the languages of source and suspicious documents differ. To date only a few approaches have focused on cross-language plagiarism detection (Barrón-Cedeño, Gupta, & Rosso, 2013a). Most previous methods are based on translating the whole suspicious or source documents coupled with monolingual techniques (Barrón-Cedeño et al., 2013a). Document translation depends on the existence and quality of machine translators. Translating documents in languages with low-quality translation tools may yield poor-quality translated documents. In this paper we propose an approach for the candidate retrieval phase of cross-language plagiarism detection which only considers a set of representative words and phrases extracted from each document as its content representation, instead of using the whole text. Since documents are represented by some extracted words and phrases, this approach is insensitive to punctuation, extra white space, and permutation of the document context, and it requires less translation time than translating the entire document. Our approach is therefore less dependent on the quality of machine translation between two languages, and if no high-quality translation tool is available, any other translation resource such as dictionaries or parallel or comparable corpora could be used for translating representative words. Thus, our approach is applicable to languages with even limited translation resources.
Since plagiarism usually happens in parts of the text, there is a requirement to segment the texts into fragments to detect local similarity. There are some previous works that break the document into constituent parts such as sections, paragraphs (Nawab, 2012), or a fixed number of sentences (Pereira, Moreira, & Galante, 2010). In this paper a topic-based text segmentation approach is proposed in order to break the document based on its topical structure. Thus, a set of topically related passages from the suspicious document are used to retrieve potential sources.
In our proposed candidate retrieval process, after segmentation, we use a second level for considering proximity in retrieval of candidate documents. For each segment the word proximity is measured with positional language modelling (PLM) (Lv & Zhai, 2009). We believe this to be the first use of PLM in cross-language plagiarism detection.
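To make the idea of position-dependent word statistics concrete, the following is a minimal sketch of a positional language model in the spirit of Lv & Zhai (2009): every occurrence of a word propagates a kernel-weighted count to nearby positions, and the counts at one position are normalized into a distribution. The Gaussian kernel, the value of `sigma`, the function names, and the toy sentence are assumptions made for illustration; this excerpt does not state which kernel or parameters the paper uses.

```python
import math
from collections import defaultdict


def gaussian_kernel(i, j, sigma):
    """Weight with which a word at position j contributes to position i."""
    return math.exp(-((i - j) ** 2) / (2 * sigma ** 2))


def positional_lm(tokens, position, sigma=2.0):
    """Estimate P(w | document, position) from kernel-propagated counts.

    sigma is a hypothetical spread parameter chosen for illustration.
    """
    propagated = defaultdict(float)
    for j, word in enumerate(tokens):
        propagated[word] += gaussian_kernel(position, j, sigma)
    total = sum(propagated.values())
    return {word: count / total for word, count in propagated.items()}


# Toy example: the model at position 1 favours words close to "detection".
tokens = "plagiarism detection across languages needs proximity aware retrieval".split()
plm = positional_lm(tokens, position=1, sigma=1.0)
```

Words near the chosen position receive most of the probability mass, so two documents that share the same words but in distant positions are scored differently from documents in which those words occur close together.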
We present the results of no segmentation with a non-proximity-based language model as a baseline. In the candidate document retrieval experiments, the segmentation technique increased the F2 measure by about 0.11 (a 21% improvement) over the baseline. Combining the segmentation technique with the positional language model increased the F2 measure by about 0.13 (a 25% improvement) over the baseline. These results are further compared with CL-CNG (McNamee & Mayfield, 2004) and a combination of translation and monolingual analysis. The proposed approach, with text segmentation and proximity-based retrieval, outperforms these approaches with respect to F2.
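For readers unfamiliar with the F2 measure, the sketch below shows the standard F-beta formula with beta = 2, which weights recall more heavily than precision and so matches the stated preference for high recall in candidate retrieval. The function name and the precision and recall values are illustrative assumptions, not results from the paper.

```python
def f_beta(precision, recall, beta=2.0):
    """Standard F-beta score; beta=2 gives the F2 measure."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only: the same two numbers, swapped between
# precision and recall, give different F2 scores.
recall_heavy = f_beta(precision=0.4, recall=0.8)
precision_heavy = f_beta(precision=0.8, recall=0.4)
```

With these made-up inputs, the recall-heavy system scores higher under F2 than the precision-heavy one, which is the behaviour the measure is chosen for.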
The rest of the paper is organized as follows: Section 2 outlines related work in cross-language plagiarism detection. Section 3 describes the candidate document retrieval process in which the text segmentation approach, representative word extraction, and retrieval model are explained. Finally the experimental framework and results are discussed in Section 4, and our conclusion and future work are reported in Section 5.
Related work
Plagiarism detection methods can be classified into two approaches, intrinsic and external (Potthast et al., 2012). Intrinsic detection methods use style analysis to detect parts of the text that are inconsistent in terms of writing style (Meyer zu Eißen & Stein, 2006; Oberreuter & Velásquez, 2013). The aim of external plagiarism detection methods is not only to find the suspicious text, but also to find the source of the plagiarized text. In monolingual plagiarism detection,
Proximity-based candidate document retrieval
The problem of candidate retrieval is defined as follows: given a suspicious document s′ and a set of potential source documents D, retrieve those source documents s ∈ D that are likely to contain the source texts of some fragments of the suspicious document.
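The problem statement can be sketched as a segment-wise retrieval loop: each segment of the suspicious document acts as a query, and any source document ranked highly for at least one segment becomes a candidate. In the sketch below, the overlap score is a deliberately simple placeholder for the paper's proximity-based retrieval model, and `top_k`, the function names, and the toy documents are hypothetical.

```python
def overlap_score(query_terms, doc_terms):
    """Placeholder relevance score: fraction of query terms found in the document."""
    query = set(query_terms)
    if not query:
        return 0.0
    return len(query & set(doc_terms)) / len(query)


def retrieve_candidates(segments, sources, top_k=2, score=overlap_score):
    """Return ids of sources ranked in the top_k for at least one segment."""
    candidates = set()
    for segment in segments:
        ranked = sorted(
            sources,
            key=lambda doc_id: score(segment, sources[doc_id]),
            reverse=True,
        )
        for doc_id in ranked[:top_k]:
            # Ignore documents that share nothing with the segment.
            if score(segment, sources[doc_id]) > 0:
                candidates.add(doc_id)
    return candidates


# Toy data: representative terms per source document and per suspicious segment.
sources = {
    "d1": ["topic", "model", "segmentation"],
    "d2": ["language", "model", "retrieval"],
    "d3": ["football", "match", "report"],
}
segments = [["topic", "segmentation"], ["language", "retrieval", "proximity"]]
candidates = retrieve_candidates(segments, sources, top_k=1)
```

Taking the union over segments favours recall, since a source only needs to match one fragment of the suspicious document to be passed on to detailed analysis.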
Our proposed process of cross-language candidate retrieval is depicted in Fig. 1. Since source documents may be in different languages, the process on source documents starts with language detection. Language detection process uses the
Experimental framework
The aim of this section is to analyse and discuss the proposed approaches to text segmentation and plagiarism candidate retrieval. To this end, we performed two sets of experiments. The first set evaluates the performance of the text segmentation algorithm (reported in Section 4.1), and the second set evaluates the performance of cross-lingual plagiarism candidate retrieval (reported in Section 4.2).
Conclusion and future work
This paper proposes a candidate document retrieval technique for retrieving potential source documents as the first step of plagiarism detection across languages. The proposed approach is based on word proximity to retrieve the potential sources for each suspicious document. This is an important factor in plagiarism detection. We deal with proximity in two steps. First, since plagiarism usually happens in parts of the text, there is a requirement to segment the texts into fragments to detect
Acknowledgement
This research was in part supported by a grant from the Institute for Research in Fundamental Sciences (No. CS1393-4-43). We would like to acknowledge the assistance and information provided by Mr. Mohammad-Hossein Motallebi.
We also gratefully acknowledge the editor and the anonymous reviewers of our paper for their invaluable and constructive comments, which helped improve our manuscript.
References (45)
- et al. Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style. Expert Systems with Applications (2013)
- et al. A survey of paraphrasing and textual entailment methods. Artificial Intelligence Research (2010)
- et al. Automatic cross-language plagiarism detection. 7th international conference on natural language processing and knowledge engineering (nlp-ke), 2011 (2011)
- et al. Methods for cross-language plagiarism detection. Knowledge-Based Systems (2013)
- et al. On cross-lingual plagiarism analysis using a statistical model. Workshop on uncovering plagiarism, authorship, and social software misuse PAN08 (2008)
- et al. Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection. Computational Linguistics, MIT Press (2013)
- et al. Latent Dirichlet allocation. Journal of Machine Learning Research (2003)
- Advances in domain independent linear text segmentation. Proceedings of the 1st North American chapter of the association for computational linguistics conference (2000)
- Cross-language plagiarism detection methods. The student research workshop associated with recent advances in natural language processing, RANLP (2013)
- et al. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence (1979)
- Automatic cross-language retrieval using latent semantic indexing. AAAI spring symposium on cross-language text and speech retrieval
- Evaluating text segmentation using boundary edit distance. Association for computational linguistics, (ACL (1))
- A dynamic programming algorithm for linear text segmentation. Journal of Intelligent Information Systems
- Brown Corpus manual. Technical Report
- Cross-language plagiarism detection using a multilingual semantic network. Advances in information retrieval, proceedings of the 35th European conference on information retrieval (ECIR13)
- A systematic study of knowledge graph analysis for cross-language plagiarism detection. Information Processing & Management
- Computing semantic relatedness using wikipedia-based explicit semantic analysis. International joint conference on artificial intelligence, IJCAI
- Discourse segmentation of multi-party conversation. Proceedings of the 41st annual meeting on association for computational linguistics
- Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America
- TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics
- Solving the problem of language recognition. Technical Report
- On information and sufficiency. The Annals of Mathematical Statistics