ABSTRACT
Blocking is a key component of Entity Resolution (ER) that aims to improve efficiency by quickly pruning non-matching record pairs. However, depending on the noise in the dataset and the distribution of entity cluster sizes, existing techniques can be either (a) too aggressive, so that they scale well but can hurt ER effectiveness, or (b) too permissive, potentially harming ER efficiency. We propose progressive blocking, a new methodology that enables both efficient and effective ER and works across different entity cluster size distributions without manual fine-tuning. In this paper, we demonstrate BEER (Blocking for Effective Entity Resolution), the first end-to-end system that leverages intermediate ER output in a feedback loop to refine the blocking result in a data-driven fashion. BEER allows the user to explore the different components of the ER pipeline, analyze the effectiveness of alternative blocking techniques, and understand the interaction between blocking and ER. BEER supports visualization of the different entities present in a block, explains how the blocking output changes with every round of feedback, and allows the end user to interactively compare different techniques. BEER has been developed as open-source software; the code and the demonstration video are available at beer-system.github.io.
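The abstract does not spell out the algorithm, so the following is only a toy sketch of what "intermediate ER output in a feedback loop" could look like, not BEER's actual implementation. It assumes token blocking, a Jaccard-similarity stand-in for the matcher, a per-round comparison budget, and a block score equal to the fraction of already-compared pairs in the block that matched. Every function name, parameter, and record below is hypothetical.

```python
from itertools import combinations


def tokens(text):
    """Lower-cased whitespace tokens of a record (assumed representation)."""
    return set(text.lower().split())


def build_blocks(records):
    """Token blocking: one block per token, holding the ids of records that contain it."""
    blocks = {}
    for rid, text in records.items():
        for tok in tokens(text):
            blocks.setdefault(tok, set()).add(rid)
    # A block with a single record yields no candidate pairs, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def is_match(a, b, threshold=0.5):
    """Stand-in matcher (an assumption): Jaccard similarity of the token sets."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return bool(union) and len(ta & tb) / len(union) >= threshold


def progressive_blocking(records, rounds=2, budget=2):
    """Alternate between comparing pairs from high-scoring blocks and rescoring blocks."""
    blocks = build_blocks(records)
    # Initial score (an assumption): smaller blocks are presumed cleaner.
    scores = {tok: 1.0 / len(ids) for tok, ids in blocks.items()}
    compared = {}  # (id_a, id_b) -> matcher decision
    for _ in range(rounds):
        remaining = budget
        # Visit blocks from highest to lowest score; ties broken by token name.
        for tok in sorted(blocks, key=lambda t: (-scores[t], t)):
            if remaining == 0:
                break
            for pair in combinations(sorted(blocks[tok]), 2):
                if remaining == 0:
                    break
                if pair not in compared:
                    a, b = pair
                    compared[pair] = is_match(records[a], records[b])
                    remaining -= 1
        # Feedback step: rescore each block from the ER decisions made so far.
        for tok, ids in blocks.items():
            seen = [compared[p] for p in combinations(sorted(ids), 2) if p in compared]
            if seen:
                scores[tok] = sum(seen) / len(seen)
    matches = {pair for pair, matched in compared.items() if matched}
    return matches, scores


# Hypothetical records: two entities, with "black" as a noisy shared token.
records = {
    1: "apple iphone 12 black",
    2: "apple iphone 12",
    3: "samsung galaxy s21 black",
    4: "samsung galaxy s21",
}
matches, scores = progressive_blocking(records)
print(sorted(matches))
print(scores["black"], scores["apple"])
```

In this toy run the block for the token "black" mixes records of two different entities, so its score falls after the first round of feedback, while blocks that contain only true matches rise. A real system would act on such scores, for example by pruning or splitting low-scoring blocks.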