Review
A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models

https://doi.org/10.1016/j.jclinepi.2019.02.004

Abstract

Objectives

The objective of this study was to compare performance of logistic regression (LR) with machine learning (ML) for clinical prediction modeling in the literature.

Study Design and Setting

We conducted a Medline literature search (1/2016 to 8/2017) and extracted comparisons between LR and ML models for binary outcomes.

Results

We included 71 of 927 studies. The median sample size was 1,250 (range 72–3,994,872), with 19 predictors considered (range 5–563) and eight events per predictor (range 0.3–6,697). The most common ML methods were classification trees, random forests, artificial neural networks, and support vector machines. In 48 (68%) studies, we observed potential bias in the validation procedures. Sixty-four (90%) studies used the area under the receiver operating characteristic curve (AUC) to assess discrimination. Calibration was not addressed in 56 (79%) studies. We identified 282 comparisons between an LR and ML model (AUC range, 0.52–0.99). For 145 comparisons at low risk of bias, the difference in logit(AUC) between LR and ML was 0.00 (95% confidence interval, −0.18 to 0.18). For 137 comparisons at high risk of bias, logit(AUC) was 0.34 (0.20–0.47) higher for ML.

Conclusion

We found no evidence of superior performance of ML over LR. Improvements in methodology and reporting are needed for studies that compare modeling algorithms.

Introduction

Clinical risk prediction models are ubiquitous in many medical domains. These models aim to predict a clinically relevant outcome using person-level information. The traditional approach to develop these models involves the use of regression models, for example, logistic regression (LR) to predict disease presence (diagnosis) or disease outcomes (prognosis) [1]. Machine learning (ML) algorithms are gaining in popularity as an alternative approach for prediction and classification problems. ML methods include artificial neural networks, support vector machines, and random forests [2]. Although ML methods have been used sporadically for clinical prediction for some time [3], [4], the growing availability of large, rich data sets such as electronic health records has reignited interest in exploiting these methods [5], [6], [7].

Definitions of what constitutes ML, and how it differs from statistical modeling, have been discussed at length in the literature [8], yet the distinction is not clear-cut [9]. The seminal reference on this issue is Breiman's review of the “two cultures” [8]. Breiman contrasts theory-based models such as regression with empirical algorithms such as decision trees, artificial neural networks, support vector machines, or random forests. A useful definition of ML is that it focuses on models that directly and automatically learn from data [10]. By contrast, regression models are based on theory and assumptions, and benefit from human intervention and subject knowledge for model specification. For example, ML algorithms incorporate nonlinear associations and interaction terms more automatically than regression does [11]. To achieve this, ML algorithms are often highly flexible and require penalization to avoid overfitting [12]. Some researchers describe the distinction between statistical modeling and ML as a continuum [5]. Others label any method that deviates from basic regression models as ML [13], such as penalized regression (e.g., LASSO, elastic net) or generalized additive models (GAM). Under the “automatic learning from data” definition, these methods are not ML, and we did not classify them as such in this study.
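This distinction can be made concrete with a minimal sketch (illustrative only, not taken from the review; the data and function names are hypothetical): plain logistic regression cannot represent an interaction between two predictors unless the analyst specifies the product term by hand, which is exactly the model-specification step that flexible ML algorithms automate.

```python
import numpy as np

# Synthetic XOR-style outcome: the event occurs when exactly one of the
# two predictors is positive, i.e. the effect of x1 depends on x2.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(float)

def fit_logistic(features, y, steps=3000, lr=0.5):
    """Fit logistic regression by plain gradient descent on the log-loss."""
    w = np.zeros(features.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-features @ w))
        w -= lr * features.T @ (p - y) / len(y)
    return w

def accuracy(features, w, y):
    p = 1.0 / (1.0 + np.exp(-features @ w))
    return float(((p > 0.5) == y).mean())

# Main effects only: no linear boundary separates an XOR pattern,
# so accuracy stays near chance level.
F_main = np.hstack([np.ones((len(y), 1)), X])
acc_main = accuracy(F_main, fit_logistic(F_main, y), y)

# With the analyst-specified x1*x2 product term, LR separates the
# classes almost perfectly.
F_int = np.hstack([F_main, (X[:, 0] * X[:, 1])[:, None]])
acc_int = accuracy(F_int, fit_logistic(F_int, y), y)
```

The subject-knowledge step here is adding the product column; a tree-based learner would recover the same interaction from the data without it.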

Owing to its flexibility, ML is claimed to perform better than traditional statistical modeling and to better handle large numbers of potential predictors [5], [6], [7], [12], [14], [15], [16]. However, recent research suggests that ML requires more data than LR, which contradicts this claim [17]. Furthermore, ML models are typically assessed in terms of discrimination performance (e.g., accuracy, area under the receiver operating characteristic [ROC] curve [AUC]), while the reliability of the risk predictions themselves (calibration) is often not assessed [18]. The claim of improved performance in clinical prediction is therefore not established.
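The gap between discrimination and calibration can be illustrated with a small sketch (illustrative values, not from the review): the AUC depends only on the ranking of predicted risks, so a model whose risks are systematically too low can have a perfect AUC while being badly calibrated.

```python
import numpy as np

def auc(y, p):
    """AUC via the Mann-Whitney statistic: the probability that a random
    event receives a higher predicted risk than a random non-event."""
    pos, neg = p[y == 1], p[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def calibration_in_the_large(y, p):
    """Observed event rate minus mean predicted risk; 0 means the
    predictions are calibrated on average (the simplest calibration check)."""
    return float(y.mean() - p.mean())

y = np.array([0, 0, 0, 1, 1])
p_ok = np.array([0.10, 0.20, 0.30, 0.80, 0.90])  # plausible risk estimates
p_low = p_ok / 3                                  # same ranking, risks too low

# Both rankings are perfect, so discrimination is identical (AUC = 1.0
# for each), yet only p_ok is roughly calibrated: reporting the AUC
# alone would hide the miscalibration of p_low.
```

A full calibration assessment would also examine the calibration slope or a calibration plot; this sketch shows only the coarsest check.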

The primary objective of this study was to compare the performance of LR with ML algorithms for the development of diagnostic or prognostic clinical prediction models for binary outcomes based on clinical data. Secondary objectives were to describe the characteristics of the studies, the type of ML algorithms that were used, the validation process, the modeling aspects of LR and ML, reporting quality, and risk of bias for comparing performance between regression and ML [19].

Section snippets

Materials and methods

The study was registered with PROSPERO (CRD42018068587). We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement.

Results

Our search identified 927 articles published between 1/2016 and 8/2017, of which 802 studies were excluded based on title or abstract (Fig. 1). Fifty-four studies were excluded during full-text screening. Seventy-one studies met inclusion criteria and came from a wide variety of clinical domains, with oncology and cardiovascular medicine as the most common (Table A.3–4) [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46],

Discussion

Our systematic review of studies that compare clinical prediction models based on LR and ML yielded the following key findings. Reporting of methodology and findings was very often incomplete and unclear, and model validation procedures were often poor. Calibration of risk predictions was seldom examined, and AUC performance of LR and ML was on average no different when comparisons had low risk of bias. The latter finding is in line with the claim that traditional approaches often perform

Acknowledgments

This work was supported by the Research Foundation–Flanders (FWO) [grant G0B4716N]; Internal Funds KU Leuven [grant C24/15/037]; Cancer Research UK [grant 5529/A16895]; the NIHR Biomedical Research Centre, Oxford, UK. The funding sources had no role in the conception, design, data collection, analysis, or reporting of this study.

References (113)

  • T. van der Ploeg et al.

    Modern modeling techniques had limited external validity in predicting mortality from traumatic brain injury

    J Clin Epidemiol

    (2016)
  • Z. Zhou et al.

Predicting distant failure in early stage NSCLC treated with SBRT using clinical parameters

    Radiother Oncol

    (2016)
  • R. Asaoka et al.

    Validating the usefulness of the “random forests” classifier to diagnose early glaucoma with optical coherence tomography

    Am J Ophthalmol

    (2017)
  • A.M. Chiriac et al.

    Designing predictive models for beta-lactam allergy using the drug allergy and hypersensitivity database

    J Allergy Clin Immunol Pract

    (2018)
  • J.A. Dean et al.

    Normal tissue complication probability (NTCP) modelling of severe acute mucositis using a novel oral mucosal surface organ at risk

    Clin Oncol

    (2017)
  • Y. Fei et al.

Predicting risk for portal vein thrombosis in acute pancreatitis patients: a comparison of radial basis function artificial neural network and logistic regression models

    J Crit Care

    (2017)
  • Y. Fei et al.

    Artificial neural networks predict the incidence of portosplenomesenteric venous thrombosis in patients with acute pancreatitis

    J Thromb Haemost

    (2017)
  • Y. Fei et al.

    Predicting the incidence of portosplenomesenteric vein thrombosis in patients with acute pancreatitis using classification and regression tree algorithm

    J Crit Care

    (2017)
  • N.C. Hettige et al.

    Classification of suicide attempters in schizophrenia using sociocultural and clinical features: a machine learning approach

    Gen Hosp Psychiatry

    (2017)
  • Y.H. Hu et al.

    Predicting return visits to the emergency department for pediatric patients: applying supervised learning techniques to the Taiwan National Health Insurance Research Database

    Comput Methods Programs Biomed

    (2017)
  • C. Zhang et al.

    Subgroup identification of early preterm birth (ePTB): informing a future prospective enrichment clinical trial design

    BMC Pregnancy Childbirth

    (2017)
  • A.K. Arslan et al.

    Different medical data mining approaches based prediction of ischemic stroke

    Comput Methods Programs Biomed

    (2016)
  • J.A. Dean et al.

    Normal tissue complication probability (NTCP) modelling using spatial dose metrics and machine learning methods for severe acute oral mucositis resulting from head and neck radiotherapy

    Radiother Oncol

    (2016)
  • B. Van Calster et al.

    Reporting and interpreting decision curve analysis: a guide for investigators

    Eur Urol

    (2018)
  • E.W. Steyerberg

    Clinical prediction models

    (2009)
  • T. Hastie et al.

    The elements of statistical learning: data mining, inference, and prediction

    (2009)
  • A.L. Beam et al.

    Big data and machine learning in health care

    JAMA

    (2018)
  • J.H. Chen et al.

    Machine learning and prediction in medicine — beyond the peak of inflated expectations

    N Engl J Med

    (2017)
  • B.A. Goldstein et al.

    Moving beyond regression techniques in cardiovascular risk prediction: applying machine learning to address analytic challenges

    Eur Heart J

    (2017)
  • L. Breiman

    Statistical modeling: the two cultures (with comments and a rejoinder by the author)

    Stat Sci

    (2001)
  • K.G.M. Moons et al.

    Critical appraisal and data extraction for systematic reviews of prediction modelling studies: the CHARMS checklist

    PLoS Med

    (2014)
  • T.M. Mitchell

    Machine learning

    (1997)
  • A.L. Boulesteix et al.

    Machine learning versus statistical modeling

    Biom J

    (2014)
  • R.C. Deo et al.

    Learning about machine learning: the promise and pitfalls of big data and the electronic health record

    Circ Cardiovasc Qual Outcomes

    (2016)
  • H. He et al.

    Learning from imbalanced data

    IEEE Trans Knowl Data Eng

    (2008)
  • N.L.M.M. Pochet et al.

    Support vector machines versus logistic regression: improving prospective performance in clinical decision-making

    Ultrasound Obstet Gynecol

    (2006)
  • A. Rajkomar et al.

    Scalable and accurate deep learning for electronic health records

    NPJ Digit Med

    (2018)
  • W. Luo et al.

    Guidelines for developing and reporting machine learning predictive models in biomedical research: a multidisciplinary view

    J Med Internet Res

    (2016)
  • T. van der Ploeg et al.

    Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints

    BMC Med Res Methodol

    (2014)
  • G.S. Collins et al.

    Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement

    J Clin Epidemiol

    (2015)
  • A.L. Boulesteix et al.

    A plea for neutral comparison studies in computational sciences

    PLoS One

    (2013)
  • D.J. Hand

    Classifier technology and the illusion of progress

    Stat Sci

    (2006)
  • P.F. Whiting et al.

    QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies

    Ann Intern Med

    (2011)
  • P. Probst et al.

    Tunability: importance of hyperparameters of machine learning algorithms

    (2018)
  • G.S. Collins et al.

    Quantifying the impact of different approaches for handling continuous predictors on the performance of a prognostic model

    Stat Med

    (2016)
  • M.S. Pepe

    The statistical evaluation of medical tests for classification and prediction

    (2003)
  • M. Adavi et al.

    Artificial neural networks versus bivariate logistic regression in prediction diagnosis of patients with hypertension and diabetes

    Med J Islam Repub Iran

    (2016)
  • Z. Habibi et al.

    Predicting ventriculoperitoneal shunt infection in children with hydrocephalus using artificial neural network

    Childs Nerv Syst

    (2016)
  • M. Jahani et al.

    Comparison of predictive models for the early diagnosis of diabetes

    Healthc Inform Res

    (2016)
  • R.J. Kate et al.

    Prediction and detection models for acute kidney injury in hospitalized older adults

    BMC Med Inform Decis Mak

    (2016)