Towards task-based parallelization for entity resolution

  • Special Issue Paper
SICS Software-Intensive Cyber-Physical Systems

Abstract

Entity resolution (ER) is the problem of determining which virtual representations in one or more data sources refer to the same real-world entity. A central question in ER is how to find matching entity representations (so-called duplicates) efficiently and scalably. One general technique for addressing this question is parallelization. Almost all work on parallel ER focuses on data parallelism; this paper instead focuses on task parallelism. Task parallelism makes it possible to support incremental ER, which computes the solution incrementally by streaming the results of intermediate ER stages as soon as they are computed. This may deliver results in a more timely fashion and is also useful in a service-oriented setting with a limited time or monetary budget. In summary, this paper presents a framework for task-parallel ER that supports, in particular, ER over large amounts of semi-structured and heterogeneous data. We also discuss a possible implementation of our framework.
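To make the idea of task parallelism with streamed intermediate results concrete, the following is a minimal sketch and not the framework presented in the paper. It assumes two hypothetical stages, token-based blocking and Jaccard-based matching, that run as separate threads connected by queues, so that duplicates are emitted as soon as they are found. All names (`tokens`, `blocking_stage`, `matching_stage`, `resolve`), the similarity measure, and the threshold are illustrative assumptions.

```python
import queue
import threading

SENTINEL = None  # marks the end of a stream between stages (assumption of this sketch)


def tokens(record):
    """Schema-agnostic view of a record: the lowercase tokens of all its values."""
    return {tok for value in record.values() for tok in str(value).lower().split()}


def blocking_stage(records, out_q):
    """Stream candidate pairs: pair each record with earlier records sharing a token."""
    index = {}  # token -> ids of the records seen so far
    for rid, record in records:
        candidates = set()
        for tok in tokens(record):
            candidates.update(index.get(tok, ()))
            index.setdefault(tok, []).append(rid)
        for other in sorted(candidates):
            out_q.put((other, rid))  # forwarded immediately, before blocking finishes
    out_q.put(SENTINEL)


def matching_stage(by_id, in_q, out_q, threshold):
    """Compare candidate pairs as they arrive and stream those judged duplicates."""
    while True:
        pair = in_q.get()
        if pair is SENTINEL:
            break
        a, b = pair
        ta, tb = tokens(by_id[a]), tokens(by_id[b])
        similarity = len(ta & tb) / len(ta | tb)  # Jaccard similarity of token sets
        if similarity >= threshold:
            out_q.put((a, b, similarity))
    out_q.put(SENTINEL)


def resolve(records, threshold=0.5):
    """Yield (id_a, id_b, similarity) duplicates incrementally while the stages run."""
    records = list(records)
    by_id = dict(records)
    pairs_q, matches_q = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=blocking_stage, args=(records, pairs_q)),
        threading.Thread(target=matching_stage,
                         args=(by_id, pairs_q, matches_q, threshold)),
    ]
    for worker in workers:
        worker.start()
    while True:
        match = matches_q.get()
        if match is SENTINEL:
            break
        yield match
    for worker in workers:
        worker.join()


if __name__ == "__main__":
    # Heterogeneous records: attribute names differ between sources.
    data = [
        (1, {"name": "Ada Lovelace", "city": "London"}),
        (2, {"full_name": "ada lovelace", "location": "london uk"}),
        (3, {"name": "Alan Turing", "city": "London"}),
    ]
    for match in resolve(data):
        print(match)
```

Because `resolve` is a generator, a consumer with a limited time budget could stop iterating early and keep the duplicates found so far. This is only meant to illustrate the kind of incremental behaviour the abstract describes.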

Author information

Corresponding author

Correspondence to Leonardo Gazzarri.

About this article

Cite this article

Gazzarri, L., Herschel, M. Towards task-based parallelization for entity resolution. SICS Softw.-Intensiv. Cyber-Phys. Syst. 35, 31–38 (2020). https://doi.org/10.1007/s00450-019-00409-6

  • DOI: https://doi.org/10.1007/s00450-019-00409-6
