Candidate document retrieval for cross-lingual plagiarism detection using two-level proximity information

https://doi.org/10.1016/j.ipm.2016.04.006

Highlights

  • Proposing a candidate retrieval model for cross-lingual plagiarism detection

  • The method relies on using two levels of proximity information

  • Proposing a topic-based text segmentation method

  • Comparing the method with other cross-lingual plagiarism detection approaches

  • Showing improvements using text segmentation and positional language models

Abstract

The rapid growth of documents in different languages, the increased accessibility of electronic documents, and the availability of translation tools have drawn increasing attention to cross-lingual plagiarism detection in recent years. Cross-language plagiarism detection entails two main steps: candidate retrieval and assessment of pairwise document similarity. In this paper we examine candidate retrieval, where the goal is to find potential source documents of a suspicious text. Our proposed method is a keyword-focused approach. Since plagiarism usually happens in parts of the text, texts must be segmented into fragments to detect local similarity. We therefore propose a topic-based segmentation algorithm that converts the suspicious document into a set of related passages. We then use a proximity-based model to retrieve documents with the best matching passages. Experiments show promising results for this important phase of cross-language plagiarism detection.

Introduction

Plagiarism refers to unauthorised use of text, code and ideas (Potthast, Barrón-Cedeño, Stein, & Rosso, 2011). In automatic cross-language plagiarism detection, the task is to retrieve plagiarized text written in language L that has originated from another document in a language other than L. With the rapid growth of documents in different languages, the increased accessibility of electronic documents, and the availability of translation tools, cross-language plagiarism has become a serious problem and its detection requires more attention.

Given a suspicious document s′ and a set of potential source documents D, we should determine whether a fragment of the suspicious document, sfs, was borrowed from a source document. This task comprises two main steps: candidate retrieval and detailed analysis. Candidate retrieval entails the identification of source documents that contain suspicious fragments. Detailed analysis requires closer comparison of the subject document with each suspected source and retrieval of plagiarized fragments. In this paper we focus on the first step, candidate document retrieval. Since a second phase will follow this step to eliminate false positive matches, we are more interested in high recall than in high precision in this research.

In cross-language plagiarism detection the languages of source and suspicious documents differ. To date only a few approaches have focused on cross-language plagiarism detection (Barrón-Cedeño, Gupta, & Rosso, 2013a). Most previous methods translate the whole suspicious or source documents and then apply monolingual techniques (Barrón-Cedeño et al., 2013a). Document translation depends on the existence and quality of machine translators, and translating documents in languages with low-quality translation tools may produce poor-quality translations. In this paper we propose an approach for the candidate retrieval phase of cross-language plagiarism detection which represents each document by a set of representative words and phrases extracted from it, instead of using the whole text. Because of this representation, the approach is insensitive to punctuation, extra white space, and permutation of the document context, and requires less translation time than translating the entire document. Our approach is therefore less dependent on the quality of machine translation between two languages: if no high-quality translation tool is available, any other translation resource, such as dictionaries or parallel or comparable corpora, can be used to translate the representative words. Thus, our approach is applicable even in languages with limited translation resources.
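The idea of representing a document by a few translated terms can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the paper's actual term-extraction method is not described in this section, so plain term frequency stands in for it, and the stopword list, the ASCII-only tokenizer, the function names, and the toy English-German dictionary are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption for illustration).
STOPWORDS = {"the", "a", "of", "and", "in", "is", "to"}

def representative_terms(text, k=5):
    """Return the k most frequent non-stopword terms of the text.

    Term frequency is only a stand-in for a real keyword extractor.
    Tokenizing on letters makes the result insensitive to punctuation
    and extra white space.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [term for term, _ in counts.most_common(k)]

def translate_terms(terms, dictionary):
    """Translate terms with a bilingual dictionary, keeping all senses.

    Terms missing from the dictionary are skipped.
    """
    translated = []
    for term in terms:
        translated.extend(dictionary.get(term, []))
    return translated

# Toy dictionary (assumption): each source term maps to candidate translations.
toy_dictionary = {
    "plagiarism": ["Plagiat"],
    "detection": ["Erkennung", "Detektion"],
}

terms = representative_terms("Plagiarism   detection, detection of plagiarism; plagiarism!", k=2)
query = translate_terms(terms, toy_dictionary)
```

Only the handful of extracted terms is translated, which is why a dictionary can replace a full machine translation system in this kind of setup.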

Since plagiarism usually happens in parts of the text, texts must be segmented into fragments to detect local similarity. Some previous works break the document into constituent parts such as sections, paragraphs (Nawab, 2012), or a fixed number of sentences (Pereira, Moreira, & Galante, 2010). In this paper a topic-based text segmentation approach is proposed in order to break the document based on its topical structure. Thus, a set of topically related passages from the suspicious document is used to retrieve potential sources.

In our proposed candidate retrieval process, after segmentation, we use a second level for considering proximity in retrieval of candidate documents. For each segment the word proximity is measured with positional language modelling (PLM) (Lv & Zhai, 2009). We believe this to be the first use of PLM in cross-language plagiarism detection.
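To make the notion of a positional language model concrete, the following is a minimal sketch of the general idea in Lv & Zhai (2009): every occurrence of a word propagates a discounted count to nearby positions, and the propagated counts at a position are normalized into a distribution. A Gaussian kernel and the value of sigma are assumed here, and the function names are hypothetical; the sketch omits smoothing and is not the retrieval model used in the paper.

```python
import math
from collections import defaultdict

def gaussian_kernel(i, j, sigma):
    """Weight propagated from position j to position i."""
    return math.exp(-((i - j) ** 2) / (2 * sigma ** 2))

def positional_lm(tokens, position, sigma=2.0):
    """Estimate p(w | D, position) from kernel-propagated word counts.

    Each occurrence of a word at position j contributes
    gaussian_kernel(position, j, sigma) to that word's virtual count,
    so words close to `position` dominate the distribution.
    """
    propagated = defaultdict(float)
    for j, word in enumerate(tokens):
        propagated[word] += gaussian_kernel(position, j, sigma)
    total = sum(propagated.values())
    return {word: count / total for word, count in propagated.items()}

tokens = ["a", "b", "c", "a"]
plm_at_start = positional_lm(tokens, position=0, sigma=1.0)
```

With a small sigma the model at position 0 is dominated by the word at that position; as sigma grows, the distribution approaches the ordinary document language model.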

We present the results of no-segmentation with a non-proximity-based language model as a baseline. In the candidate document retrieval experiments, the segmentation technique increased the F2 measure by about 0.11 (a 21% improvement) over the baseline. Combining the segmentation technique with the positional language model increased the F2 measure by about 0.13 (a 25% improvement) over the baseline. These results are further compared with CL-CNG (McNamee & Mayfield, 2004) and a combination of translation and monolingual analysis. The proposed approach, with text segmentation and proximity-based retrieval, outperforms these approaches with respect to F2.
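For readers unfamiliar with the F2 measure, it is the F-beta score with beta = 2, which weights recall more heavily than precision and so matches the recall-oriented goal of candidate retrieval. The sketch below uses the standard F-beta definition; the precision and recall values are made up for illustration and are not results from the paper.

```python
def f_beta(precision, recall, beta=2.0):
    """Standard F-beta score; beta > 1 favours recall over precision."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# Hypothetical systems: one recall-heavy, one precision-heavy.
recall_heavy = f_beta(precision=0.4, recall=0.8)
precision_heavy = f_beta(precision=0.8, recall=0.4)
```

Under F2 the recall-heavy system scores higher than the precision-heavy one, even though both would receive the same F1.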

The rest of the paper is organized as follows: Section 2 outlines related work in cross-language plagiarism detection. Section 3 describes the candidate document retrieval process in which the text segmentation approach, representative word extraction, and retrieval model are explained. Finally the experimental framework and results are discussed in Section 4, and our conclusion and future work are reported in Section 5.

Section snippets

Related work

Plagiarism detection methods can be classified into two approaches, intrinsic and external (Potthast et al., 2012). Intrinsic detection methods are those that use style analysis to detect parts of the text that are inconsistent in terms of writing style (Meyer zu Eißen & Stein, 2006; Oberreuter & Velásquez, 2013). The aim of external plagiarism detection methods is not only finding the suspicious text, but also finding the source for the plagiarized text. In monolingual plagiarism detection,

Proximity-based candidate document retrieval

The problem of candidate retrieval is defined as follows: given a suspicious document s′ and a set of potential source documents D, retrieve those source documents sD that likely contain source texts of some fragments of the suspicious document sfs.

Our proposed process of cross-language candidate retrieval is depicted in Fig. 1. Since source documents may be in different languages, the process on source documents starts with language detection. The language detection process uses the

Experimental framework

The aim of this section is to analyse and discuss the proposed approaches in text segmentation and plagiarism candidate retrieval. To this end, we performed two sets of experiments. The first set evaluates the performance of the text segmentation algorithm (reported in Section 4.1) and the second set evaluates the performance of cross-lingual plagiarism candidate retrieval (reported in Section 4.2).

Conclusion and future work

This paper proposes a candidate document retrieval technique for retrieving potential source documents as the first step of plagiarism detection across languages. The proposed approach is based on word proximity to retrieve the potential sources for each suspicious document. This is an important factor in plagiarism detection. We deal with proximity in two steps. First, since plagiarism usually happens in parts of the text, there is a requirement to segment the texts into fragments to detect

Acknowledgement

This research was in part supported by a grant from the Institute for Research in Fundamental Sciences (No. CS1393-4-43). We would like to acknowledge the assistance and information provided by Mr. Mohammad-Hossein Motallebi.

We also gratefully acknowledge the editor and anonymous reviewers of our paper for their invaluable and constructive comments, which helped improve our manuscript.

References (45)

  • G. Oberreuter et al.

    Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style

    Expert Systems with Applications

    (2013)
  • I. Androutsopoulos et al.

    A survey of paraphrasing and textual entailment methods

    Journal of Artificial Intelligence Research

    (2010)
  • A. Anguita et al.

    Automatic cross-language plagiarism detection

    7th international conference on natural language processing and knowledge engineering (nlp-ke), 2011

    (2011)
  • A. Barrón-Cedeño et al.

    Methods for cross-language plagiarism detection

    Knowledge-Based Systems

    (2013)
  • A. Barrón-Cedeño et al.

    On cross-lingual plagiarism analysis using a statistical model

    Workshop on uncovering plagiarism, authorship, and social software misuse PAN08

    (2008)
  • A. Barrón-Cedeño et al.

    Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection

    Computational Linguistics, MIT Press

    (2013)
  • D.M. Blei et al.

    Latent Dirichlet allocation

    Journal of Machine Learning Research

    (2003)
  • F.Y. Choi

    Advances in domain independent linear text segmentation

    Proceedings of the 1st North American chapter of the association for computational linguistics conference

    (2000)
  • V. Danilova

    Cross-language plagiarism detection methods

    The student research workshop associated with recent advances in natural language processing, RANLP

    (2013)
  • D.L. Davies et al.

    A cluster separation measure

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (1979)
  • S.T. Dumais et al.

    Automatic cross-language retrieval using latent semantic indexing

    AAAI spring symposium on cross-language text and speech retrieval

    (1997)
  • C. Fournier

    Evaluating text segmentation using boundary edit distance

    Association for computational linguistics, (ACL (1))

    (2013)
  • P. Fragkou et al.

    A dynamic programming algorithm for linear text segmentation

    Journal of Intelligent Information Systems

    (2004)
  • W.N. Francis et al.

    Brown Corpus manual

    Technical Report

    (1979)
  • M. Franco-Salvador et al.

    Cross-language plagiarism detection using a multilingual semantic network

    Advances in information retrieval, proceedings of the 35th European conference on information retrieval (ECIR13)

    (2013)
  • M. Franco-Salvador et al.

    A systematic study of knowledge graph analysis for cross-language plagiarism detection

    Information Processing & Management

    (2016)
  • E. Gabrilovich et al.

    Computing semantic relatedness using Wikipedia-based explicit semantic analysis

    International joint conference on artificial intelligence, IJCAI

    (2007)
  • M. Galley et al.

    Discourse segmentation of multi-party conversation

    Proceedings of the 41st annual meeting on association for computational linguistics

    (2003)
  • T.L. Griffiths et al.

    Finding scientific topics

    Proceedings of the National Academy of Sciences of the United States of America

    (2004)
  • M.A. Hearst

    TextTiling: Segmenting text into multi-paragraph subtopic passages

    Computational Linguistics

    (1997)
  • S. Johnson

    Solving the problem of language recognition

    Technical Report

    (1993)
  • S. Kullback et al.

    On information and sufficiency

    The Annals of Mathematical Statistics

    (1951)

Cited by (34)

    • Identifying cross-lingual plagiarism using rich semantic features and deep neural networks: A study on Arabic-English plagiarism cases

      2022, Journal of King Saud University - Computer and Information Sciences
      Citation Excerpt :

      A study proposed candidate retrieval for Arabic text-reuse from web documents that used encoded fingerprints to formulate queries and gave the best selection of source documents (Lulu et al., 2016). Apart from previous studies on candidate retrieval from the same language, a study for cross-lingual candidate retrieval using two-level proximity information was proposed (Ehsan and Shakery, 2016). Their study used a keyword-focused approach where the suspicious (i.e., query) document was divided into fragments using a topic-based segmentation algorithm, followed by a proximity-based model to retrieve sources relevant to the query segments.

    • An effective approach to candidate retrieval for cross-language plagiarism detection: A fusion of conceptual and keyword-based schemes

      2020, Information Processing and Management
      Citation Excerpt :

      However, as any word may have multiple meanings, translating without considering the context leads to wrong result. Ehsan and Shakery (2016) presented one of the newest approaches for the retrieval of candidates that considers a set of words and phrases as representatives of the content instead of the whole text. The approach starts by segmenting the text, extracting keywords from each one and translating words using a dictionary.

    • Framework for syntactic string similarity measures

      2019, Expert Systems with Applications
      Citation Excerpt :

      Measuring similarity of text has further been used in detecting plagiarism. Since plagiarism usually happens in some parts of text, the text should be segmented into smaller fragments before measuring the similarity (Ehsan & Shakery, 2016). Modern techniques use skip-grams and Word2Vec for cross-language plagiarism detection as well as q-grams (Barrón-Cedeño, Gupta, & Rosso, 2013; Franco-Salvador, Rosso, & Montes-y-Gómez, 2016).
