Data-Adaptive Target Parameters

Hubbard, Alan E.; Kennedy, Chris J.; van der Laan, Mark J.

doi:10.1007/978-3-319-65304-4_9

Part of the book series: Springer Series in Statistics (SSS)

Abstract

What factors are most important in predicting coronary heart disease? Heart disease is the leading cause of death and serious injury in the United States. To address this question we turn to the Framingham Heart Study, which was designed to investigate the health factors associated with coronary heart disease (CHD) at a time when cardiovascular disease was becoming increasingly prevalent. Beginning in 1948, this prospective cohort study monitored a population of 5209 men and women, ages 30–62, in Framingham, Massachusetts. Subjects received extensive medical examinations and lifestyle interviews every 2 years, providing longitudinal measurements that can be compared with outcome status. The data have been analyzed in countless observational studies and have yielded risk score equations that are widely used to assess the risk of coronary heart disease. Here, we conduct an analysis for comparison with Wilson et al. (1998), using the data-adaptive variable importance approach described in this chapter.

Notes

1. http://github.com/ck37/varImpact/.
2. The estimated g is truncated to bounds of [0.025, 0.975] as in the TMLE R-package (Gruber and van der Laan 2012a). As in the TMLE R-package, we use nonnegative least squares as the meta-learner for both Q and g.
3. Blood pressure levels are defined by JNC-V (Joint National Committee 1993): optimal (systolic ≤ 120 mm Hg and diastolic ≤ 80 mm Hg), normal blood pressure (systolic 120–129 mm Hg or diastolic 80–84 mm Hg), high normal blood pressure (systolic 130–139 mm Hg or diastolic 85–89 mm Hg), hypertension stage I (systolic 140–159 mm Hg or diastolic 90–99 mm Hg), and hypertension stage II–IV (systolic ≥ 160 or diastolic ≥ 100 mm Hg). “When systolic and diastolic pressures fell into different categories, the higher category was selected for the purposes of classification.”
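
The truncation in Note 2 can be illustrated with a minimal Python sketch. The function name `truncate_probabilities` and the example values are hypothetical, and this is not the code of the TMLE R-package; it only shows what clipping the estimated g to [0.025, 0.975] does to the inverse-probability weights. The nonnegative least squares meta-learner is not shown.

```python
def truncate_probabilities(g_hat, lower=0.025, upper=0.975):
    """Clip each estimated probability into the interval [lower, upper]."""
    if not 0.0 <= lower < upper <= 1.0:
        raise ValueError("bounds must satisfy 0 <= lower < upper <= 1")
    return [min(max(g, lower), upper) for g in g_hat]


# Hypothetical estimates of g, two of them very close to 0 or 1.
g_hat = [0.001, 0.30, 0.999]
g_trunc = truncate_probabilities(g_hat)

# Without truncation the largest inverse weight would be about 1000;
# with the lower bound of 0.025 it cannot exceed about 40.
max_weight_raw = max(1.0 / g for g in g_hat)
max_weight_trunc = max(1.0 / g for g in g_trunc)
```

Bounding g away from 0 and 1 limits how much any single observation can dominate a weighted estimate, at the cost of some bias when the true probabilities really are that extreme.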
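
The classification rule in Note 3 can be sketched in Python as follows. The names `BPCategory` and `classify_blood_pressure` are hypothetical, and the boundary handling is an assumption: as written, a systolic reading of exactly 120 mm Hg (or a diastolic reading of exactly 80 mm Hg) satisfies both the optimal and the normal definitions, and this sketch assigns such readings to the higher category, in the spirit of the quoted rule.

```python
from enum import IntEnum


class BPCategory(IntEnum):
    """Blood pressure categories from Note 3, ordered from lowest to highest."""
    OPTIMAL = 0
    NORMAL = 1
    HIGH_NORMAL = 2
    HYPERTENSION_STAGE_I = 3
    HYPERTENSION_STAGE_II_IV = 4


# Lower bounds in mm Hg for each category above OPTIMAL, highest first.
# Assumption: a boundary value (e.g. systolic 120) goes to the higher category.
SYSTOLIC_BOUNDS = [
    (160, BPCategory.HYPERTENSION_STAGE_II_IV),
    (140, BPCategory.HYPERTENSION_STAGE_I),
    (130, BPCategory.HIGH_NORMAL),
    (120, BPCategory.NORMAL),
]
DIASTOLIC_BOUNDS = [
    (100, BPCategory.HYPERTENSION_STAGE_II_IV),
    (90, BPCategory.HYPERTENSION_STAGE_I),
    (85, BPCategory.HIGH_NORMAL),
    (80, BPCategory.NORMAL),
]


def _category_for(value, bounds):
    """Return the highest category whose lower bound the value reaches."""
    for lower_bound, category in bounds:
        if value >= lower_bound:
            return category
    return BPCategory.OPTIMAL


def classify_blood_pressure(systolic, diastolic):
    """Classify a reading; when the two pressures disagree, the higher category wins."""
    return max(
        _category_for(systolic, SYSTOLIC_BOUNDS),
        _category_for(diastolic, DIASTOLIC_BOUNDS),
    )
```

For example, `classify_blood_pressure(125, 92)` falls in the normal range by systolic pressure but in hypertension stage I by diastolic pressure, so the reading is classified as hypertension stage I.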

References

L. Auret, C. Aldrich, Empirical comparison of tree ensemble variable importance measures. Chemom. Intell. Lab. Syst. 105(2), 157–170 (2011)
O. Bembom, M.L. Petersen, S.-Y. Rhee, W.J. Fessel, S.E. Sinisi, R.W. Shafer, M.J. van der Laan, Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection. Stat. Med. 28, 152–172 (2009)
Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57, 289–300 (1995)
D.I. Broadhurst, D.B. Kell, Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics 2(4), 171–196 (2006)
T. Chen, C. Guestrin, XGBoost: a scalable tree boosting system, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, New York, 2016), pp. 785–794
J.H. Friedman, T.J. Hastie, R.J. Tibshirani, Glmnet: lasso and elastic-net regularized generalized linear models (2010). http://CRAN.R-project.org/package=glmnet
A. Gelman, Y.-S. Su, M. Yajima, J. Hill, M.G. Pittau, J. Kerman, T. Zheng, Arm: data analysis using regression and multilevel/hierarchical models (2010). http://CRAN.R-project.org/package=arm
U. Grömping, Variable importance assessment in regression: linear regression versus random forest. Am. Stat. 63(4) (2009)
S. Gruber, M.J. van der Laan, tmle: an R package for targeted maximum likelihood estimation. J. Stat. Softw. 51(13) (2012a)
A.E. Hubbard, M.J. van der Laan, Mining with inference: data adaptive target parameters, in Handbook of Big Data. Chapman-Handbooks-Statistical-Methods, ed. by P. Bühlmann, P. Drineas, M. Kane, M.J. van der Laan (Chapman & Hall/CRC, Boca Raton, 2016)
A.E. Hubbard, I. Diaz Munoz, A. Decker, J.B. Holcomb, M.A. Schreiber, E.M. Bulger, K.J. Brasel, E.E. Fox, D.J. del Junco, C.E. Wade et al., Time-dependent prediction and evaluation of variable importance using superlearning in high-dimensional clinical data. J. Trauma-Injury Infect. Crit. Care 75(1), S53–S60 (2013)
A.E. Hubbard, S. Kherad-Pajouh, M.J. van der Laan, Statistical inference for data adaptive target parameters. Int. J. Biostat. 12(1), 3–19 (2016)
J.P. Ioannidis, Why most discovered true associations are inflated. Epidemiology 19(5), 640–648 (2008)
Joint National Committee, The fifth report of the joint national committee on detection, evaluation, and treatment of high blood pressure (JNC V). Arch. Intern. Med. 153(2), 154–183 (1993)
A. Liaw, M. Wiener, Classification and regression by randomForest. R News 2(3), 18–22 (2002)
A.R. Luedtke, M.J. van der Laan, Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Ann. Stat. 44(2), 713–742 (2016a)
A.R. Luedtke, M.J. van der Laan, Super-learning of an optimal dynamic treatment rule. Int. J. Biostat. 12(1), 305–332 (2016b)
S. Milborrow, T. Hastie, R. Tibshirani, Earth: multivariate adaptive regression spline models. R package version 3.2-7 (2014)
T. Mildenberger, Y. Rozenholc, D. Zasada, histogram: Construction of regular and irregular histograms with different options for automatic choice of bins (2009). http://CRAN.R-project.org/package=histogram
A. Peters, T. Hothorn, ipred: improved predictors (2009). http://CRAN.R-project.org/package=ipred
R. Pirracchio, M.L. Petersen, M.J. van der Laan, Improving propensity score estimators’ robustness to model misspecification using super learner. Am. J. Epidemiol. 181(2), 108–119 (2014)
E.C. Polley, M.J. van der Laan, SuperLearner: super learner prediction (2013). http://CRAN.R-project.org/package=SuperLearner
E.C. Polley, E. LeDell, C. Kennedy, M.J. van der Laan, SuperLearner: super learner prediction (2017). https://github.com/ecpolley/SuperLearner
R Development Core Team, R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna (2016). http://www.R-project.org
S. Rose, Robust machine learning variable importance analyses of medical conditions for health care spending. Health Serv. Res. (2018, in press)
Y. Rozenholc, T. Mildenberger, U. Gather, Combining regular and irregular histograms by penalized likelihood. Comput. Stat. Data Anal. 54(12), 3313–3323 (2010)
M.J. van der Laan, Statistical inference for variable importance. Int. J. Biostat. 2(1), Article 2 (2006b)
M.J. van der Laan, A.R. Luedtke, Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome. Technical Report, Division of Biostatistics, University of California, Berkeley
M.J. van der Laan, K.S. Pollard, Hybrid clustering of gene expression data with visualization and the bootstrap. J. Stat. Plann. Inference 117, 275–303 (2003)
M.J. van der Laan, E.C. Polley, A.E. Hubbard, Super learner. Stat. Appl. Genet. Mol. 6(1), Article 25 (2007)
M.J. van der Laan, S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data (Springer, Berlin, Heidelberg, New York, 2011)
H. Wang, S. Rose, M.J. van der Laan, Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning. Stat. Probab. Lett. 81(7), 792–796 (2011a)
H. Wang, S. Rose, M.J. van der Laan, Finding quantitative trait loci genes, in Targeted Learning: Causal Inference for Observational and Experimental Data, ed. by M.J. van der Laan, S. Rose (Springer, Berlin, Heidelberg, New York, 2011b)
H. Wang, Z. Zhang, S. Rose, M.J. van der Laan, A novel targeted learning method for quantitative trait loci mapping. Genetics 198(4), 1369–1376 (2014)
P. Wilson, R.B. D’Agostino, D. Levy, A.M. Belanger, H. Silbershatz, W.B. Kannel, Prediction of coronary heart disease using risk factor categories. Circulation 97(18), 1837–1847 (1998)

Author information

Authors and Affiliations

Division of Biostatistics, University of California, Berkeley, 101 Haviland Hall, #7358, Berkeley, CA, 94720, USA
Alan E. Hubbard & Chris J. Kennedy
Division of Biostatistics and Department of Statistics, University of California, Berkeley, 101 Haviland Hall, #7358, Berkeley, CA, 94720, USA
Mark J. van der Laan

Corresponding author

Correspondence to Alan E. Hubbard.

About this chapter

Cite this chapter

Hubbard, A.E., Kennedy, C.J., van der Laan, M.J. (2018). Data-Adaptive Target Parameters. In: Targeted Learning in Data Science. Springer Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-65304-4_9

DOI: https://doi.org/10.1007/978-3-319-65304-4_9
Published: 16 February 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-65303-7
Online ISBN: 978-3-319-65304-4
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
