ABSTRACT
To identify plagiarized passages in large document collections, we present retrieval strategies that rely on stochastic sampling and chunk indexes. Using the entire Wikipedia corpus, we compile n-gram indexes and compare them to a new kind of fingerprint index in a plagiarism analysis use case. Our index speeds up the analysis by a factor of 1.5 and is an order of magnitude smaller, while being equivalent in terms of precision and recall.
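To make the idea of a chunk index with stochastic sampling concrete, here is a minimal sketch in Python. It is not the fingerprint index described in the abstract. It illustrates only the plain word n-gram baseline, and everything in it is an assumption made for illustration: the function names (`chunks`, `chunk_hash`, `build_index`, `candidates`), the chunk length of three words, the use of MD5 as the chunk hash, and the uniform random sampling of query chunks.

```python
import hashlib
import random
from collections import defaultdict


def chunks(text, n=3):
    """Yield every word n-gram (chunk) of the text (n=3 is an assumed default)."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def chunk_hash(chunk):
    """Map a chunk to a stable 64-bit integer key (MD5 is an assumed choice)."""
    digest = hashlib.md5(chunk.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_index(collection, n=3):
    """Build an inverted index: chunk hash -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for c in chunks(text, n):
            index[chunk_hash(c)].add(doc_id)
    return index


def candidates(index, suspicious, n=3, sample_size=20, seed=0):
    """Look up a random sample of the suspicious document's chunks.

    Returns (doc_id, hit_count) pairs sorted by descending hit count.
    """
    all_chunks = list(chunks(suspicious, n))
    rng = random.Random(seed)
    sample = rng.sample(all_chunks, min(sample_size, len(all_chunks)))
    hits = defaultdict(int)
    for c in sample:
        for doc_id in index.get(chunk_hash(c), ()):
            hits[doc_id] += 1
    return sorted(hits.items(), key=lambda kv: -kv[1])


if __name__ == "__main__":
    collection = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "an entirely different sentence about indexing large collections",
    }
    index = build_index(collection)
    suspicious = "he wrote that the quick brown fox jumps over the fence"
    print(candidates(index, suspicious, sample_size=100))
```

Sampling only some of the query chunks trades recall for lookup cost: a copied passage of any length shares many chunks with its source, so a modest sample is still likely to hit it. Documents returned with a high hit count would then be passed to a more detailed comparison.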