Abstract
Entity resolution (ER) is the problem of determining which virtual representations in one or more data sources refer to the same real-world entity. A central question in ER is how to find matching entity representations (so-called duplicates) efficiently and scalably. One general technique for addressing these issues is parallelization. Almost all work on parallel ER focuses on data parallelism. This paper instead focuses on task parallelism for ER. This type of parallelism makes it possible to support incremental ER, which computes the solution incrementally by streaming the results of intermediate ER stages as soon as they are computed. This can yield results in a more timely fashion and is also useful in a service-oriented setting with a limited time or monetary budget. In summary, this paper presents a framework for task-parallelization of ER that supports, in particular, ER over large amounts of semi-structured and heterogeneous data. We also discuss a possible implementation of our framework.
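To make the idea of task parallelism with streamed intermediate results concrete, the following is a minimal sketch and not the framework presented in the paper. It assumes a two-stage pipeline (blocking, then pairwise matching) connected by queues, with each stage running in its own thread. The function names, the first-letter blocking key, the string-similarity measure, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
import queue
import threading
from difflib import SequenceMatcher

DONE = object()  # sentinel marking the end of a stream


def blocking_stage(records, out_q):
    """Stage 1: emit candidate pairs that share a blocking key.

    The key (lower-cased first letter of the name) is an assumption
    chosen only to keep the example short.
    """
    blocks = {}
    for rid, name in records:
        key = name[0].lower()
        # Pair the new record with every earlier record in its block
        # and forward each pair immediately.
        for other in blocks.get(key, []):
            out_q.put((other, (rid, name)))
        blocks.setdefault(key, []).append((rid, name))
    out_q.put(DONE)


def matching_stage(in_q, out_q, threshold=0.8):
    """Stage 2: compare candidate pairs as they arrive, emit matches."""
    while True:
        item = in_q.get()
        if item is DONE:
            out_q.put(DONE)
            return
        (id_a, name_a), (id_b, name_b) = item
        score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
        if score >= threshold:
            out_q.put((id_a, id_b))


def resolve(records):
    """Run both stages concurrently and yield duplicates incrementally."""
    candidates = queue.Queue()
    matches = queue.Queue()
    threads = [
        threading.Thread(target=blocking_stage, args=(records, candidates)),
        threading.Thread(target=matching_stage, args=(candidates, matches)),
    ]
    for t in threads:
        t.start()
    while True:
        item = matches.get()
        if item is DONE:
            break
        yield item  # the caller sees each match as soon as it is found
    for t in threads:
        t.join()


if __name__ == "__main__":
    sample = [(1, "Jon Smith"), (2, "John Smith"), (3, "Jane Doe"),
              (4, "Alice Wong"), (5, "Jon Smyth")]
    for pair in resolve(sample):
        print("duplicate:", pair)
```

Because `resolve` is a generator, a consumer with a limited time budget can stop iterating early and keep the duplicates found so far, which is the pay-as-you-go behavior the abstract describes. In CPython, threads mainly illustrate the structure; a real deployment would likely use processes or a distributed runtime for actual speedup.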
Cite this article
Gazzarri, L., Herschel, M. Towards task-based parallelization for entity resolution. SICS Softw.-Intensiv. Cyber-Phys. Syst. 35, 31–38 (2020). https://doi.org/10.1007/s00450-019-00409-6
DOI: https://doi.org/10.1007/s00450-019-00409-6