DOI: 10.1145/3448016.3452747
short-paper

BEER: Blocking for Effective Entity Resolution

Published: 18 June 2021

ABSTRACT

Blocking is a key component of Entity Resolution (ER) that aims to improve efficiency by quickly pruning out non-matching record pairs. However, depending on the noise in the dataset and the distribution of entity cluster sizes, existing techniques can be either (a) too aggressive, such that they help scale but can adversely affect the ER effectiveness, or (b) too permissive, potentially harming ER efficiency. We propose a new methodology of progressive blocking that enables both efficient and effective ER and works across different entity cluster size distributions without manual fine tuning. In this paper, we demonstrate BEER (Blocking for Effective Entity Resolution), the first end-to-end system that leverages intermediate ER output in a feedback loop to refine the blocking result in a data-driven fashion, thereby enabling effective entity resolution. BEER allows the user to explore the different components of the ER pipeline, analyze the effectiveness of alternative blocking techniques and understand the interaction between blocking and ER. BEER supports visualization of the different entities present in a block, explains the change in blocking output with every round of feedback and allows the end-user to interactively compare different techniques. BEER has been developed as open-source software; the code and the demonstration video are available at beer-system.github.io.
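
As a rough illustration of the feedback loop described in the abstract, the Python sketch below runs a toy blocking pipeline in which each round's match results re-weight the blocks used to rank the next round's candidate pairs. It is not BEER's algorithm or code. Token blocking, the Jaccard matcher, the smoothed block score, the per-round budget, the sample records, and every function name are assumptions chosen only to make the idea concrete.

```python
from collections import defaultdict
from itertools import combinations

def tokens(text):
    return set(text.lower().split())

def token_blocks(records):
    """Toy token blocking: one block per token shared by at least two records."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        for tok in tokens(text):
            blocks[tok].add(rid)
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}

def block_weights(blocks, compared, matches):
    """Hypothetical feedback score: smoothed share of a block's compared pairs that matched."""
    weights = {}
    for tok, ids in blocks.items():
        in_block = [p for p in compared if p[0] in ids and p[1] in ids]
        hits = sum(p in matches for p in in_block)
        weights[tok] = (hits + 1) / (len(in_block) + 2)
    return weights

def rank_pairs(blocks, weights, skip):
    """Rank not-yet-compared pairs by the summed weight of the blocks they share."""
    scores = defaultdict(float)
    for tok, ids in blocks.items():
        for pair in combinations(sorted(ids), 2):
            if pair not in skip:
                scores[pair] += weights[tok]
    return sorted(scores, key=lambda p: (-scores[p], p))

def is_match(records, pair, threshold=0.5):
    """Stand-in ER step: Jaccard similarity over tokens (an assumption, not BEER's matcher)."""
    a, b = tokens(records[pair[0]]), tokens(records[pair[1]])
    return len(a & b) / len(a | b) >= threshold

def progressive_blocking(records, rounds=2, budget=3):
    """Each round compares the top `budget` pairs, then feeds the results back into block weights."""
    blocks = token_blocks(records)
    compared, matches = set(), set()
    weights = block_weights(blocks, compared, matches)
    for _ in range(rounds):
        for pair in rank_pairs(blocks, weights, compared)[:budget]:
            compared.add(pair)
            if is_match(records, pair):
                matches.add(pair)
        weights = block_weights(blocks, compared, matches)
    return matches, weights

# Made-up sample records for illustration only.
RECORDS = {
    1: "apple iphone 12 black",
    2: "iphone 12 apple black 64gb",
    3: "samsung galaxy s21 black",
    4: "galaxy s21 samsung",
    5: "black leather case",
}

if __name__ == "__main__":
    found, final_weights = progressive_blocking(RECORDS)
    print(sorted(found))
    print(sorted(final_weights.items(), key=lambda kv: kv[1]))
```

In this toy setup, a block built on a common token such as "black" produces mostly non-matching comparisons, so its weight drops over the rounds, while blocks built on distinctive tokens keep a higher weight.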

Supplemental Material

3448016.3452747.mp4 (mp4, 31.8 MB)


Published in

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN: 9781450383431
DOI: 10.1145/3448016

Copyright © 2021 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 785 of 4,003 submissions, 20%
