ABSTRACT
A framework is presented for discovering partial duplicates in large collections of scanned books with optical character recognition (OCR) errors. Each book in the collection is represented by the sequence of words that appear only once in the book, kept in the order in which they occur in the text. These words are referred to as "unique words", and they constitute a small percentage of all the words in a typical book. Together with the order information, the set of unique words provides a compact representation that is highly descriptive of the content and the flow of ideas in the book. By aligning the sequences of unique words from two books using the longest common subsequence (LCS), one can discover whether the two books are duplicates. Experiments on several datasets show that the framework, referred to as DUPNIQ, is both fast and more accurate than traditional duplicate detection methods such as shingling. On a collection of 100K scanned English books, DUPNIQ detects partial duplicates in 30 minutes using 350 cores, with precision 0.996 and recall 0.833, compared to precision 0.992 and recall 0.720 for shingling. The technique works on other languages as well and is demonstrated on a French dataset.
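The core idea of the abstract, representing each book by its ordered unique words and aligning two such sequences with the LCS, can be illustrated with a minimal Python sketch. This is not the DUPNIQ implementation: the function names, the whitespace tokenization, and the normalization of the LCS length by the shorter sequence are all assumptions made for illustration, and the quadratic dynamic program is chosen for clarity rather than speed.

```python
from collections import Counter


def unique_word_sequence(text):
    """Return the words that occur exactly once in `text`, in text order.

    Tokenization by lowercasing and whitespace splitting is an assumption;
    the abstract does not specify how words are extracted.
    """
    words = text.lower().split()
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]


def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b.

    Standard dynamic program, O(len(a) * len(b)) time, O(len(b)) memory.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def overlap_score(text_a, text_b):
    """Hypothetical similarity score in [0, 1].

    Dividing the LCS length by the shorter unique-word sequence is an
    illustrative choice, not a formula taken from the abstract.
    """
    a = unique_word_sequence(text_a)
    b = unique_word_sequence(text_b)
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / min(len(a), len(b))


if __name__ == "__main__":
    book_a = "the quick brown fox jumps over the lazy dog"
    book_b = "the quick brown cat jumps over the lazy dog"
    print(unique_word_sequence(book_a))
    print(overlap_score(book_a, book_b))
```

In this toy example the repeated word "the" is dropped from both sequences, and a single differing word (as an OCR error might produce) removes only one element from the alignment, so the score stays high. A threshold on such a score would be needed to decide whether two books count as duplicates; the abstract does not state one.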
Index Terms
- Partial duplicate detection for large book collections