Exploring hybrid parallel systems for probabilistic record linkage

Boratto, Murilo; Alonso, Pedro; Pinto, Clicia; Melo, Pedro; Barreto, Marcos; Denaxas, Spiros

doi:10.1007/s11227-018-2328-3

Published: 21 March 2018

Volume 75, pages 1137–1149 (2019)

The Journal of Supercomputing

Murilo Boratto¹,
Pedro Alonso ORCID: orcid.org/0000-0002-6882-6592²,
Clicia Pinto³,
Pedro Melo³,
Marcos Barreto³ &
Spiros Denaxas⁴

280 Accesses
3 Citations
4 Altmetric

Abstract

Record linkage is a technique widely used to gather data stored in disparate data sources that presumably pertain to the same real-world entity. This integration can be done deterministically or probabilistically, depending on the existence of common key attributes among all data sources involved. The probabilistic approach is very time-consuming because of the number of records that must be compared, especially in big data scenarios. In this paper, we propose and evaluate a methodology that simultaneously exploits multicore and multi-GPU architectures to perform the probabilistic linkage of large-scale Brazilian governmental databases. We present algorithmic optimizations that provide high accuracy and improve performance by selecting the best algorithm-architecture combination for a problem given its input size. We also discuss performance results obtained with different data samples, showing that a hybrid approach outperforms other configurations, providing an average speedup of 7.9 when linking up to 20.000 million records.
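
To make the cost of the probabilistic approach concrete, the following is a minimal sequential sketch, not the authors' implementation. It assumes a bigram Dice coefficient as the field similarity, an unweighted mean over fields, and a fixed threshold of 0.8; the function names, field names, and sample records are all hypothetical. It shows why the work grows with the product of the two source sizes: every record of one source is scored against every record of the other.

```python
from itertools import product


def bigrams(text):
    """Return the set of character bigrams of a lower-cased, trimmed string."""
    s = text.lower().strip()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient between the bigram sets of two strings (0.0 to 1.0)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Strings too short to have bigrams: fall back to exact comparison.
        return 1.0 if a.lower().strip() == b.lower().strip() else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def link(source_a, source_b, fields, threshold=0.8):
    """Score every pair of records and keep pairs whose mean field
    similarity reaches the threshold (an assumed decision rule)."""
    matches = []
    for i, j in product(range(len(source_a)), range(len(source_b))):
        score = sum(dice(source_a[i][f], source_b[j][f]) for f in fields) / len(fields)
        if score >= threshold:
            matches.append((i, j, score))
    return matches


# Hypothetical sample records; no shared key, only approximately equal attributes.
source_a = [{"name": "maria silva", "city": "salvador"}]
source_b = [
    {"name": "maria sylva", "city": "salvador"},
    {"name": "joao souza", "city": "recife"},
]

matches = link(source_a, source_b, fields=["name", "city"])
comparisons = len(source_a) * len(source_b)  # all-pairs cost of the naive loop
```

Each pair is scored independently of the others, which is the property that makes this kind of loop a natural candidate for distribution across CPU cores and GPUs.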

Acknowledgements

This work has been partially supported by CNPq, FAPESB, Bill & Melinda Gates Foundation, The Royal Society (UK), Medical Research Council (UK), NVIDIA Hardware Grant Program, Generalitat Valenciana (Grant PROMETEOII/2014/003), Spanish Government and European Commission through TEC2015-67387-C4-1-R (MINECO/FEDER), and network CAPAP-H. We have also worked in cooperation with the EU-COST Programme Action IC1305, “Network for Sustainable Ultrascale Computing (NESUS)”.

Author information

Authors and Affiliations

Núcleo de Arquitetura de Computadores e Sistemas Operacionais, Universidade do Estado da Bahia, Salvador, Bahia, Brazil
Murilo Boratto
Department of Information Systems and Computation, Universitat Politècnica de València, Valencia, Spain
Pedro Alonso
Laboratório de Sistemas Distribuídos, Universidade Federal da Bahia, Salvador, Bahia, Brazil
Clicia Pinto, Pedro Melo & Marcos Barreto
Institute of Health Informatics Research, School of Computer Science and Informatics, University College London, London, UK
Spiros Denaxas

Corresponding author

Correspondence to Pedro Alonso.

About this article

Cite this article

Boratto, M., Alonso, P., Pinto, C. et al. Exploring hybrid parallel systems for probabilistic record linkage. J Supercomput 75, 1137–1149 (2019). https://doi.org/10.1007/s11227-018-2328-3

Issue Date: 01 March 2019
