ABSTRACT
Mendeley's crowd-sourced catalogue of research papers forms the basis of features such as paper search, finding papers related to the one currently being viewed, and personalised recommendations. To generate this catalogue, the records uploaded from users' libraries and imported from external sources such as PubMed and arXiv must be deduplicated. At Mendeley, this task is performed by an automated system.
However, the quality of the deduplication needs to be improved. "Ground truth" datasets are thus needed for evaluating the system's performance, but existing datasets are very small. In this paper, we tackle the problem of generating large-scale datasets from Mendeley's database. An approach based purely on random sampling produced very easy datasets, so approaches that focus on more difficult examples were explored. We found that selecting duplicates and non-duplicates from documents with similar titles produced more challenging datasets. Additionally, we established that a Solr-based deduplication system can achieve a deduplication quality similar to that of the fingerprint-based system currently employed. Finally, we introduce a large-scale deduplication ground truth dataset that we hope will be useful to others tackling deduplication.
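The idea of drawing candidate pairs from documents with similar titles can be illustrated with a minimal sketch. This is not the method used in the paper: the abstract does not specify the similarity measure or any threshold, so `title_similarity`, `hard_candidate_pairs`, the character-level matcher from Python's standard `difflib` module, and the value `0.8` are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def title_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two titles, ignoring case and extra spaces."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def hard_candidate_pairs(records, threshold=0.8):
    """Return (id_a, id_b, score) for record pairs whose titles look alike.

    `records` is a list of (record_id, title) tuples (a hypothetical layout).
    The threshold is an assumed value, not one taken from the paper.
    Pairs returned here still need a ground-truth label: some will be true
    duplicates, others distinct papers that merely share a similar title.
    """
    pairs = []
    # Compares every pair, so this only suits small samples.
    for (id_a, title_a), (id_b, title_b) in combinations(records, 2):
        score = title_similarity(title_a, title_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs


if __name__ == "__main__":
    sample = [
        (1, "Duplicate record detection: A survey"),
        (2, "Duplicate Record Detection - a survey"),
        (3, "Detecting near-duplicates for web crawling"),
    ]
    for id_a, id_b, score in hard_candidate_pairs(sample):
        print(id_a, id_b, f"{score:.2f}")
```

The all-pairs comparison is quadratic in the number of records. At catalogue scale, candidate retrieval would need an index or blocking step instead, which this sketch deliberately omits.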
Index Terms
- On generating large-scale ground truth datasets for the deduplication of bibliographic records