On generating large-scale ground truth datasets for the deduplication of bibliographic records

Authors:
James A. Hammerton (Mendeley Ltd., London, UK)
Michael Granitzer (University of Passau)
Dan Harvey (State LTD, London, UK)
Maya Hristakeva
Kris Jack
WIMS '12: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics, June 2012, Article No.: 18, Pages 1–12. https://doi.org/10.1145/2254129.2254153

Published: 13 June 2012

ABSTRACT

Mendeley's crowd-sourced catalogue of research papers forms the basis of features such as searching for papers, finding papers related to the one currently being viewed, and personalised recommendations. To generate this catalogue, it is necessary to deduplicate the records uploaded from users' libraries and imported from external sources such as PubMed and arXiv. At Mendeley, this task is performed by an automated system.

However, the quality of the deduplication needs to be improved. "Ground truth" datasets are thus needed for evaluating the system's performance, but existing datasets are very small. In this paper, we tackle the problem of generating large-scale datasets from Mendeley's database. An approach based purely on random sampling produced very easy datasets, so we explored approaches that focus on more difficult examples. We found that selecting duplicates and non-duplicates from documents with similar titles produced more challenging datasets. Additionally, we established that a Solr-based deduplication system can achieve deduplication quality similar to that of the fingerprint-based system currently employed. Finally, we introduce a large-scale deduplication ground truth dataset that we hope will be useful to others tackling deduplication.
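
The idea of drawing duplicates and non-duplicates from documents with similar titles can be illustrated with a minimal Python sketch. It is not the procedure used in the paper: the similarity measure (`difflib.SequenceMatcher`), the 0.8 threshold, the `identity` key standing in for a trusted ground truth signal, and the sample records are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalise(title):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(title.lower().split())


def title_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalised titles."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def hard_pairs(records, threshold=0.8):
    """Return labelled pairs whose titles are at least `threshold` similar.

    `records` is a list of (record_id, title, identity) tuples. `identity`
    is a hypothetical trusted key used only to label a pair as duplicate
    (same key) or non-duplicate (different key).
    """
    pairs = []
    for (id_a, title_a, key_a), (id_b, title_b, key_b) in combinations(records, 2):
        if title_similarity(title_a, title_b) >= threshold:
            pairs.append((id_a, id_b, key_a == key_b))
    return pairs


# Made-up records: 1 and 2 are the same work, 3 is a different work with a
# near-identical title, and 4 is unrelated.
records = [
    (1, "Deduplication of bibliographic records", "A"),
    (2, "Deduplication of  Bibliographic Records", "A"),
    (3, "Deduplication of bibliographic records II", "B"),
    (4, "Storage systems for archival backup", "C"),
]
pairs = hard_pairs(records)
```

Here the unrelated record 4 never enters the sample, while record 3 contributes non-duplicate pairs that a title-only matcher would likely get wrong, which is what makes such a sample harder than a random one. Comparing all pairs is quadratic in the number of records, so at catalogue scale candidate pairs would have to come from an index or blocking step instead.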


Index Terms

On generating large-scale ground truth datasets for the deduplication of bibliographic records
1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems

Published in
WIMS '12: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
June 2012
571 pages
ISBN:9781450309158
DOI:10.1145/2254129
Conference Chair: Dumitru Dan Burdescu (University of Craiova, Romania)
Program Chairs: Rajendra Akerkar (Western Norway Research Institute, Norway), Costin Bădică (University of Craiova, Romania)
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 13 June 2012
Author Tags
bibliographic metadata
fingerprinting
near duplicate detection
nearest neighbor search
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 140 of 278 submissions, 50%
Article Metrics
- Total Citations: 8
- Total Downloads: 198
- Downloads (Last 12 months): 6
- Downloads (Last 6 weeks): 0