Abstract
Entity resolution (ER) aims to detect multiple records that describe the same real-world entity in a dataset and to consolidate them into a single consistent record. ER plays a fundamental role in guaranteeing good data quality, e.g., as input for data science pipelines. Yet the traditional approach to ER requires cleaning the entire dataset before consistent queries can be run on it; hence, users struggle in common scenarios where time or resources are limited (e.g., when the data changes frequently or the user is interested in only a portion of the dataset for the task).
We previously introduced BrewER, a framework that evaluates SQL SP (select-project) queries on dirty data and progressively returns results, in a priority order defined by the user, as if the queries had been issued on cleaned data. In this demonstration, we show how BrewER can be exploited to ease the burden of ER, allowing data scientists to save a significant amount of resources for their tasks.
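To make the idea of query-time, priority-ordered cleaning concrete, here is a minimal toy sketch. It is not BrewER's implementation. It assumes, purely for illustration, that the priority is "price descending", that conflicting prices within an entity are resolved by taking the minimum, and that records match when their normalized model string is equal. All names (`progressive_query`, `RECORDS`, the field names) are hypothetical.

```python
import heapq
from collections import defaultdict

# Hypothetical dirty data: several records may describe the same product.
RECORDS = [
    {"model": "X100", "price": 500},
    {"model": "x100 ", "price": 300},
    {"model": "Y20", "price": 400},
    {"model": "Z5", "price": 350},
    {"model": "z5", "price": 100},
]


def progressive_query(records, where):
    """Yield resolved entities in descending price order, cleaning lazily.

    Toy assumptions (not taken from the paper): matching is equality of the
    normalized model string, and the resolved price is the minimum price
    among the matching records.
    """
    # Toy blocking step: group candidate duplicates by normalized model.
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["model"].strip().lower()].append(rec)

    # Max-heap via negated keys, seeded with an optimistic bound per block:
    # the resolved (minimum) price can never exceed the block's maximum.
    heap = [(-max(r["price"] for r in recs), key, False)
            for key, recs in blocks.items()]
    heapq.heapify(heap)

    while heap:
        neg_price, key, resolved = heapq.heappop(heap)
        if resolved:
            # Every remaining key is an upper bound on its entity's price,
            # so this entity can be emitted safely now.
            entity = {"model": key, "price": -neg_price}
            if where(entity):
                yield entity
        else:
            # Clean only this block, then re-queue it with its actual price.
            actual = min(r["price"] for r in blocks[key])
            heapq.heappush(heap, (-actual, key, True))


if __name__ == "__main__":
    for row in progressive_query(RECORDS, lambda e: e["price"] >= 200):
        print(row)
```

Because the function is a generator, a caller that stops after the first result never triggers cleaning of the lower-priority blocks, which is the resource saving the abstract describes. The selection predicate is applied to the resolved entity rather than to the raw records, so the output matches what the same query would return on fully cleaned data under the toy assumptions above.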