Partial duplicate detection for large book collections

Authors:
Ismet Zeki Yalniz, University of Massachusetts-Amherst, Amherst, MA, USA
Ethem F. Can, University of Massachusetts-Amherst, Amherst, MA, USA
R. Manmatha, University of Massachusetts-Amherst, Amherst, MA, USA

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management, October 2011, Pages 469–474, https://doi.org/10.1145/2063576.2063647

Published: 24 October 2011

ABSTRACT

A framework is presented for discovering partial duplicates in large collections of scanned books with optical character recognition (OCR) errors. Each book in the collection is represented by the sequence of words that appear only once in the book, kept in the order in which they occur in the text. These words are referred to as "unique words", and they constitute a small percentage of all the words in a typical book. Together with the order information, the set of unique words provides a compact representation that is highly descriptive of the content and the flow of ideas in the book. By aligning the unique-word sequences of two books using the longest common subsequence (LCS), one can discover whether the two books are duplicates. Experiments on several datasets show that the proposed method, DUPNIQ, is both fast and more accurate than traditional duplicate detection methods such as shingling. On a collection of 100K scanned English books, DUPNIQ detects partial duplicates in 30 minutes using 350 cores, with precision 0.996 and recall 0.833, compared with precision 0.992 and recall 0.720 for shingling. The technique also works for other languages, as demonstrated on a French dataset.
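The core idea in the abstract (keep only the words that occur once, preserve their order, then align two books with LCS) can be illustrated with a small Python sketch. This is not the authors' implementation: the tokenizer, the function names `unique_word_sequence`, `lcs_length_unique`, and `overlap_score`, and the normalization by the shorter sequence are all assumptions made for illustration. The sketch relies on one observation: because each unique-word sequence contains no repeated words, the LCS of two such sequences equals the longest increasing subsequence of matched positions, which can be computed in O(n log n).

```python
import re
from bisect import bisect_left
from collections import Counter


def unique_word_sequence(text):
    """Return the words that occur exactly once in `text`, in order of appearance.

    Tokenization (lowercasing, Unicode word characters) is an assumption;
    the abstract does not specify one.
    """
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]


def lcs_length_unique(a, b):
    """LCS length of two sequences that each contain no repeated elements.

    Map every element of `a` that also occurs in `b` to its position in `b`;
    the LCS is then the longest strictly increasing subsequence of positions.
    """
    pos_in_b = {w: i for i, w in enumerate(b)}
    positions = [pos_in_b[w] for w in a if w in pos_in_b]
    tails = []  # tails[k] = smallest last position of an increasing run of length k+1
    for p in positions:
        k = bisect_left(tails, p)
        if k == len(tails):
            tails.append(p)
        else:
            tails[k] = p
    return len(tails)


def overlap_score(text_a, text_b):
    """Hypothetical similarity score: LCS length over the shorter sequence length."""
    a = unique_word_sequence(text_a)
    b = unique_word_sequence(text_b)
    if not a or not b:
        return 0.0
    return lcs_length_unique(a, b) / min(len(a), len(b))


if __name__ == "__main__":
    book_a = "the quick brown fox jumps over the lazy dog"
    book_b = "a quick brown fox leaps over a lazy dog today"
    print(lcs_length_unique(unique_word_sequence(book_a),
                            unique_word_sequence(book_b)))  # prints 6
    print(overlap_score(book_a, book_b))
```

A pair of books would be flagged as partial duplicates when the score exceeds some threshold; the threshold value and how it should be chosen are not stated in the abstract, so none is assumed here.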

Index Terms

Partial duplicate detection for large book collections
1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Document collection models
  2. Information systems applications
    1. Digital libraries and archives

Published in
CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
October 2011
2712 pages
ISBN: 9781450307178
DOI: 10.1145/2063576
Editors: Bettina Berendt; Arjen de Vries; Wenfei Fan; Craig Macdonald (University of Glasgow, UK); Iadh Ounis (University of Glasgow, UK); Ian Ruthven (University of Strathclyde, UK)
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
partial duplicate detection
sequence matching
unique words
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%