Abstract
Processing data that originates from different sources (such as environmental and medical data) can be difficult because of the heterogeneity of the variables, storage systems, and file formats involved. Moreover, once the amount of data reaches a certain threshold, conventional mining methods (based on spreadsheets or statistical software) become cumbersome or even impossible to apply. Data Extract, Transform, and Load (ETL) solutions provide a framework to normalize and integrate heterogeneous data into a local data store. Online Analytical Processing (OLAP), a set of Business Intelligence (BI) methodologies and practices for multidimensional data analysis, can then be an invaluable tool for examining and mining the integrated data. In this article, we describe a solution based on an ETL + OLAP tandem used for the on-the-fly analysis of tens of millions of individual medical, meteorological, and air quality observations from 16 provinces in Spain, provided by 20 different national and regional entities in a diverse array of file types and formats, with the intention of evaluating the effect of several environmental variables on human health in future studies. Our work shows how a sizable amount of data, spread across a wide range of file formats and structures and originating from sources in different business domains, can be integrated into a single system that researchers can use for global data analysis and mining.
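As a rough illustration of the ETL + OLAP idea summarized above, the following minimal Python sketch normalizes two differently formatted sources into one local table and then runs a roll-up query over it. It is not the tooling or schema used in the article: the sample records, the column names, the `observation` table, and the use of an in-memory SQLite database are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: semicolon-delimited CSV, DD/MM/YYYY dates, decimal commas.
AIR_CSV = (
    "fecha;provincia;no2\n"
    "01/03/2010;Cantabria;21,5\n"
    "02/03/2010;Cantabria;18,0\n"
    "01/03/2010;Madrid;44,2\n"
)

# Hypothetical source 2: records with ISO dates and upper-case province names.
ADMISSIONS = [
    {"date": "2010-03-01", "province": "CANTABRIA", "admissions": 12},
    {"date": "2010-03-02", "province": "CANTABRIA", "admissions": 9},
    {"date": "2010-03-01", "province": "MADRID", "admissions": 40},
]


def extract_air(text):
    """Extract and transform source 1 into (iso_date, province, variable, value)."""
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        day, month, year = row["fecha"].split("/")
        value = float(row["no2"].replace(",", "."))
        yield (f"{year}-{month}-{day}", row["provincia"].title(), "no2", value)


def extract_admissions(records):
    """Extract and transform source 2 into the same normalized tuple shape."""
    for rec in records:
        yield (rec["date"], rec["province"].title(), "admissions", float(rec["admissions"]))


def load(conn, rows):
    """Load normalized rows into a single local fact table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observation "
        "(day TEXT, province TEXT, variable TEXT, value REAL)"
    )
    conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?)", rows)


def rollup(conn):
    """OLAP-style roll-up: aggregate daily rows up to province x variable."""
    cur = conn.execute(
        "SELECT province, variable, COUNT(*), AVG(value) "
        "FROM observation GROUP BY province, variable "
        "ORDER BY province, variable"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
load(conn, extract_air(AIR_CSV))
load(conn, extract_admissions(ADMISSIONS))
summary = rollup(conn)
for row in summary:
    print(row)
```

A dedicated OLAP engine would expose such aggregations through dimensions and hierarchies (for example, day to month to year) rather than hand-written SQL; the `GROUP BY` here only stands in for that kind of multidimensional query.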
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest.
About this article
Cite this article
Villar, A., Zarrabeitia, M.T., Fdez-Arroyabe, P. et al. Integrating and analyzing medical and environmental data using ETL and Business Intelligence tools. Int J Biometeorol 62, 1085–1095 (2018). https://doi.org/10.1007/s00484-018-1511-9
DOI: https://doi.org/10.1007/s00484-018-1511-9