DOI: 10.1145/1277741.1277928
Article

Strategies for retrieving plagiarized documents

Published: 23 July 2007

ABSTRACT

For the identification of plagiarized passages in large document collections, we present retrieval strategies that rely on stochastic sampling and chunk indexes. Using the entire Wikipedia corpus, we compile n-gram indexes and compare them to a new kind of fingerprint index in a plagiarism analysis use case. Our index speeds up the analysis by a factor of 1.5 and is an order of magnitude smaller, while being equivalent in terms of precision and recall.
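To make the notion of a chunk index concrete, the following is a minimal sketch of a plain word n-gram index, that is, the kind of baseline the abstract compares against. It is not the paper's fingerprint index, whose construction is not described here. The function names, the choice of word trigrams, and the truncated MD5 key are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def word_ngrams(text, n=3):
    """Return the set of word n-grams (chunks) of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def chunk_hash(chunk):
    """Hash a chunk to a short hex key (truncated MD5 is an illustrative stand-in)."""
    return hashlib.md5(chunk.encode("utf-8")).hexdigest()[:16]


def build_index(docs, n=3):
    """Map each chunk hash to the ids of the documents that contain the chunk."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for chunk in word_ngrams(text, n):
            index[chunk_hash(chunk)].add(doc_id)
    return index


def candidates(index, query, n=3):
    """Count, per indexed document, how many chunks it shares with the query."""
    hits = defaultdict(int)
    for chunk in word_ngrams(query, n):
        for doc_id in index.get(chunk_hash(chunk), ()):
            hits[doc_id] += 1
    return dict(hits)
```

Documents that share many chunks with a suspicious text become candidates for a detailed comparison. One way to read the "stochastic sampling" mentioned in the abstract, offered here only as an assumption, is to look up a random subset of the query's chunks instead of all of them, trading some recall for fewer index lookups.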

Published in

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
July 2007
946 pages
ISBN: 9781595935977
DOI: 10.1145/1277741

Copyright © 2007 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions, 20%
