DOI: 10.1145/2063576.2063647
research-article

Partial duplicate detection for large book collections

Published: 24 October 2011

ABSTRACT

A framework, DUPNIQ, is presented for discovering partial duplicates in large collections of scanned books that contain optical character recognition (OCR) errors. Each book is represented by the sequence of words that appear only once in the book, kept in the order in which they appear in the text. These words are referred to as "unique words", and they constitute a small percentage of all the words in a typical book. Together with the order information, the set of unique words provides a compact representation that is highly descriptive of the content and the flow of ideas in the book. By aligning the unique-word sequences of two books using the longest common subsequence (LCS), one can discover whether the two books are duplicates. Experiments on several datasets show that DUPNIQ is fast and more accurate than traditional duplicate detection methods such as shingling. On a collection of 100K scanned English books, DUPNIQ detects partial duplicates in 30 min using 350 cores, with precision 0.996 and recall 0.833, compared to precision 0.992 and recall 0.720 for shingling. The technique also works on other languages and is demonstrated on a French dataset.
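The representation and alignment described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the tokenizer, and the normalization by the shorter sequence are assumptions made for illustration, and the abstract does not say how the LCS length is turned into a duplicate decision. Because no word repeats within a unique-word sequence, the sketch computes the LCS length through a longest-increasing-subsequence reduction, which is a convenience chosen here rather than something stated in the source.

```python
import re
from bisect import bisect_left
from collections import Counter


def unique_word_sequence(text):
    """Return the words that occur exactly once in `text`, in order of appearance."""
    # Assumed tokenizer: lowercase runs of Unicode letters (keeps accented letters).
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]


def lcs_length_no_repeats(seq_a, seq_b):
    """LCS length of two sequences, each of which contains no repeated item."""
    # Map every item of seq_b to its position, then read off the positions of
    # the shared items in seq_a order. With no repeats on either side, a common
    # subsequence is exactly a strictly increasing run of these positions.
    pos_in_b = {w: i for i, w in enumerate(seq_b)}
    positions = [pos_in_b[w] for w in seq_a if w in pos_in_b]

    # Longest strictly increasing subsequence via patience sorting.
    tails = []
    for p in positions:
        k = bisect_left(tails, p)
        if k == len(tails):
            tails.append(p)
        else:
            tails[k] = p
    return len(tails)


def overlap_score(text_a, text_b):
    """Hypothetical score: LCS length divided by the shorter unique-word sequence."""
    a = unique_word_sequence(text_a)
    b = unique_word_sequence(text_b)
    if not a or not b:
        return 0.0
    return lcs_length_no_repeats(a, b) / min(len(a), len(b))
```

Under these assumptions, a pair of texts would be flagged as a candidate duplicate when `overlap_score` exceeds some threshold; the threshold is a free parameter of the sketch and would have to be tuned on labeled data.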


        • Published in

          CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
          October 2011
          2712 pages
          ISBN: 9781450307178
          DOI: 10.1145/2063576

          Copyright © 2011 ACM


          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          Overall acceptance rate: 1,861 of 8,427 submissions (22%)
