Skip to main content
Log in

Predicting Missing Values in Medical Data Via XGBoost Regression

  • Research Article
  • Published:
Journal of Healthcare Informatics Research Aims and scope Submit manuscript

Abstract

The data in a patient’s laboratory test result is a notable resource to support clinical investigation and enhance medical research. However, for a variety of reasons, this type of data often contains a non-trivial number of missing values. For example, physicians may neglect to order tests or document the results. Such a phenomenon reduces the degree to which this data can be utilized to learn efficient and effective predictive models. To address this problem, various approaches have been developed to impute missing laboratory values; however, their performance has been limited. This is due, in part, to the fact no approaches effectively leverage the contextual information (1) in individual or (2) between laboratory test variables. We introduce an approach to combine an unsupervised prefilling strategy with a supervised machine learning approach, in the form of extreme gradient boosting (XGBoost), to leverage both types of context for imputation purposes. We evaluated the methodology through a series of experiments on approximately 8200 patients’ records in the MIMIC-III dataset. The results demonstrate that the new model outperforms baseline and state-of-the-art models on 13 commonly collected laboratory test variables. In terms of the normalized root mean square derivation (nRMSD), our model exhibits an imputation improvement by over 20%, on average. Missing data imputation on the temporal variables can be largely improved via prefilling strategy and the supervised training technique, which leverages both the longitudinal and cross-sectional context simultaneously.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4

Similar content being viewed by others

Data Availability

The dataset is sampled from Medical Information Mart for Intensive Care III. (MIMIC-III) (https://mimic.physionet.org/).

Notes

  1. Data challenge details at http://www.ieee-ichi.org/challenge.html

References

  1. Evans RS (2016) Electronic health records: then, now, and in the future. International Medical Informatics Association (IMIA) 1:S48–S61

  2. Richesson RL, Horvath MM, Rusincovitch SA (2014) Clinical research informatics and electronic health record data. International Medical Informatics Association (IMIA) 23(1):215–223

  3. Köpcke F, Trinczek B, Majeed RW, Schreiweis B, Wenk J, Leusch T, Ganslandt T, Ohmann C, Bergh B, Röhrig R, Dugas M, Prokosch HU (2013) Evaluation of data completeness in the electronic health record for the purpose of patient recruitment into clinical trials: a retrospective analysis of element presence. BioMed Central (BMC) 13(1):37

  4. Hu Z, Melton GB, Arsoniadis EG, Wang Y, Kwaan MR, Simon GJ (2017) Strategies for handling missing clinical data for automated surgical site infection detection from the electronic health record. J Biomed Inform 68:112–120

    Article  Google Scholar 

  5. Beaulieu-Jones BK, Moore JH (2017) Missing data imputation in the electronic health record using deeply learned autoencoders. In Proceedings of the Pacific Symposium on Biocomputing. 207–218

  6. Weber GM, Adams WG, Bernstam EV, Bickel JP, Fox KP, Marsolo K, Raghavan VA, Turchin A, Zhou X, Murphy SN, Mandl KD (2017) Biases introduced by filtering electronic health records for patients with “complete data.”. J Am Med Inform Assoc 24:1134–1141

    Article  Google Scholar 

  7. Beaulieu-Jones BK, Lavage DR, Snyder JW, Moore JH, Pendergrass SA, Bauer CR (2018) Characterizing and managing missing structured data in electronic health records: data analysis. JMIR Med Inform 6:e11

    Article  Google Scholar 

  8. Pivovarov R, Albers DJ, Sepulveda JL, Elhadad N (2014) Identifying and mitigating biases in EHR laboratory tests. J Biomed Inform 51:24–34

    Article  Google Scholar 

  9. Waljee AK, Mukherjee A, Singal AG, Zhang Y, Warren J, Balis U, Marrero J, Zhu J, Higgins PDR (2013) Comparison of imputation methods for missing laboratory data in medicine. BMJ Open 3(8):e002847

    Article  Google Scholar 

  10. Buuren SV, Groothuis-Oudshoorn K (2011) MICE: multivariate imputation by chained equations in R. J Stat Softw 45(3):1–68

    Article  Google Scholar 

  11. Luo Y, Szolovits P, Dighe AS et al (2017) 3D-MICE: integration of cross-sectional and longitudinal imputation for multi-analyte longitudinal clinical data. J Am Med Inform Assoc 25:645–653

    Article  Google Scholar 

  12. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG (2016) MIMIC-III, a freely accessible critical care database. Scientific Data 3:160035

    Article  Google Scholar 

  13. Wells BJ, Chagin KM, Nowacki AS, Kattan MW (2013) Strategies for handling missing data in electronic health record derived data. EGEMS. 1(3):1035

    Article  Google Scholar 

  14. Cismondi F, Fialho AS, Vieira SM, Reti SR, Sousa JMC, Finkelstein SN (2013) Missing data in medical databases: impute, delete or classify. Artif Intell Med 58(1):63–72

    Article  Google Scholar 

  15. Li P, Stuart EA, Allison DB (2015) Multiple imputation: a flexible tool for handling missing data. JAMA. 314(18):1966–1967

    Article  Google Scholar 

  16. Donders AR, Van Der Heijden GJ, Stijnen T et al (2006) A gentle introduction to imputation of missing values. J Clin Epidemiol 59(10):1087–1091

    Article  Google Scholar 

  17. Schmitt P, Mandel J, Guedj M (2015) A comparison of six methods for missing data imputation. Journal of Biometrics & Biostatistics 6(1):1000224

    Google Scholar 

  18. Shrive FM, Stuart H, Quan H, Ghali WA (2006) Dealing with missing data in a multi-question de- pression scale: a comparison of imputation methods. BMC Med Res Methodol 6(1):57

    Article  Google Scholar 

  19. Troyanskaya O, Cantor M, Sherlock G, Hastie T, Tibshirani R, Botstein D, Altman R (2001) Missing value estimation methods for DNA microarrays. Bioinformatics. 17:520–525

    Article  Google Scholar 

  20. Deng Y, Chang C, Ido MS, Long Q (2016) Multiple imputation for general missing data patterns in the presence of high-dimensional data. Sci Rep 6(1):21689

    Article  Google Scholar 

  21. Zhang G, Little R (2009) Extensions of the penalized spline of propensity prediction method of imputation. Biometrics. 65(3):911–918

    Article  MathSciNet  Google Scholar 

  22. Luo Y, Szolovits P, Dighe AS, Baron JM (2016) Using machine learning to predict laboratory test results. American Journal of Clinical Pathology 145(6):7787–7788

    Article  Google Scholar 

  23. Little R, An H (2004) Robust likelihood-based analysis of multivariate data with missing values. Stat Sin 149(3):949–968

    MathSciNet  MATH  Google Scholar 

  24. Buuren SV, Boshuizen HC, Knook DL (1999) Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med 18(6):681–694

    Article  Google Scholar 

  25. Stekhoven DJ, Bühlmann P (2012) MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics 28:112–118

    Article  Google Scholar 

  26. Tang F, Ishwaran H (2017) Random forest missing data algorithms. Stat Anal Data Min 10(6):363–377

    Article  MathSciNet  Google Scholar 

  27. Hastie T, Mazumder R, Lee JD, Zadeh R (2015) Matrix completion and low-rank SVD via fast alternating least squares. J Mach Learn Res 16:3367–3402

    MathSciNet  MATH  Google Scholar 

  28. Mazumder R, Hastie T, Tibshirani R (2010) Spectral regularization algorithms for learning large incomplete matrices. J Mach Learn Res 11(80):2287–2322

    MathSciNet  MATH  Google Scholar 

  29. Liao Z, Lu X, Yang T, Wang H (2009) Missing data imputation: a fuzzy K-means clustering algorithm over sliding window. In Proceedings of the 6th International Conference on Fuzzy Systems and Knowledge Discovery. 133–137

  30. Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794

  31. PythonAPIReference. https://xgboost.readthedocs.io/en/latest/python/pythonapi.html. Accessed Aug 9 2019

Download references

Acknowledgments

We thank the 7th IEEE International Conference on Healthcare Informatics (ICHI) Data Analytics Challenge on Missing data Imputation (DACMI) challenge committee for the dataset used in this study.

Funding

This research was supported, in part, by the National Library of Medicine of the National Institutes of Health under Award Number R01LM012854. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Xinmeng Zhang.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Code Availability

Source code at https://github.com/yanchao0222/SMILES.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zhang, X., Yan, C., Gao, C. et al. Predicting Missing Values in Medical Data Via XGBoost Regression. J Healthc Inform Res 4, 383–394 (2020). https://doi.org/10.1007/s41666-020-00077-1

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s41666-020-00077-1

Keywords

Navigation