DOI: 10.1145/2254129.2254153
research-article

On generating large-scale ground truth datasets for the deduplication of bibliographic records

Published: 13 June 2012

ABSTRACT

Mendeley's crowd-sourced catalogue of research papers forms the basis of features such as searching for papers, finding papers related to the one currently being viewed, and personalised recommendations. To generate this catalogue, it is necessary to deduplicate the records uploaded from users' libraries and imported from external sources such as PubMed and arXiv. At Mendeley, this task is performed by an automated system.

However, the quality of the deduplication needs to be improved. "Ground truth" datasets are thus needed for evaluating the system's performance, but existing datasets are very small. In this paper, we tackle the problem of generating large-scale datasets from Mendeley's database. An approach based purely on random sampling produced very easy datasets, so we explored approaches that focus on more difficult examples. We found that selecting duplicates and non-duplicates from documents with similar titles produced more challenging datasets. Additionally, we established that a Solr-based deduplication system can achieve a deduplication quality similar to that of the fingerprint-based system currently employed. Finally, we introduce a large-scale deduplication ground truth dataset that we hope will be useful to others tackling deduplication.
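The idea of drawing both duplicates and non-duplicates from documents with similar titles can be illustrated with a minimal sketch. This is not the procedure used in the paper: the similarity measure (`difflib.SequenceMatcher`), the 0.8 threshold, the `cluster_id` label, the function names and the toy records are all assumptions made for illustration only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def title_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two lower-cased titles."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def hard_pairs(records, threshold=0.8):
    """Return labelled pairs whose titles are at least `threshold` similar.

    `records` is a list of (record_id, cluster_id, title) tuples. Two records
    count as duplicates when they share a cluster_id (a hypothetical label,
    standing in for whatever ground truth is available).
    """
    pairs = []
    for (id_a, cl_a, t_a), (id_b, cl_b, t_b) in combinations(records, 2):
        if title_similarity(t_a, t_b) >= threshold:
            # Keep the pair and record whether it is a true duplicate.
            pairs.append((id_a, id_b, cl_a == cl_b))
    return pairs


# Toy records invented for illustration (not data from the paper).
records = [
    ("r1", "c1", "Detecting near-duplicates for web crawling"),
    ("r2", "c1", "Detecting Near-Duplicates for Web Crawling."),
    ("r3", "c2", "Detecting near-duplicates for web crawlers"),
    ("r4", "c3", "Duplicate record detection: a survey"),
]
pairs = hard_pairs(records)
for id_a, id_b, is_duplicate in pairs:
    print(id_a, id_b, "duplicate" if is_duplicate else "non-duplicate")
```

In this sketch, the pair with clearly different titles is filtered out, so the remaining non-duplicates are exactly the near-miss cases that make a ground truth dataset harder than one built by random sampling. Comparing all pairs is quadratic, so a real system at catalogue scale would need some form of blocking or indexing first.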

Published in

WIMS '12: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
June 2012
571 pages
ISBN: 9781450309158
DOI: 10.1145/2254129

Copyright © 2012 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 140 of 278 submissions, 50%
