Abstract
In recent years, mass spectrometry (MS)-based metabolomics has been extensively applied to characterize biochemical mechanisms and to study physiological processes and phenotypic changes associated with disease. Metabolomics has also been important for identifying biomarkers suitable for clinical diagnosis. In this chapter, we review supervised learning algorithms for predictive modeling, including random forest (RF), support vector machine (SVM), and partial least squares-discriminant analysis (PLS-DA). We also review feature selection methods for identifying the combination of metabolites that yields an accurate predictive model. We conclude with best practices for reproducibility: performing internal and external replication, reporting metrics to assess performance, avoiding overfitting, and dealing with imbalanced classes. An analysis of an example data set illustrates the use of different machine learning methods and performance metrics.
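The point about performance metrics and imbalanced classes can be illustrated with a minimal sketch. It is not taken from the chapter: the function names, the 90/10 class split, and the "always predict the majority class" model are hypothetical choices made for illustration. It computes accuracy alongside sensitivity, specificity, balanced accuracy, and the Matthews correlation coefficient (MCC) from a binary confusion matrix, using only the Python standard library.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Return common binary classification metrics as a dictionary."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    # Sensitivity and specificity are undefined when a class is absent.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    # By convention, MCC is set to 0 when its denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "mcc": mcc,
    }


# Hypothetical imbalanced data: 90 controls (0) and 10 cases (1).
y_true = [0] * 90 + [1] * 10
# Hypothetical trivial model that always predicts the majority class.
y_pred = [0] * 100

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the trivial model reaches high accuracy while detecting no cases, which is why reporting class-aware metrics next to accuracy is usually advisable when classes are imbalanced.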
References
Maniscalco M, Fuschillo S, Paris D, Cutignano A, Sanduzzi A, Motta A (2019) Clinical metabolomics of exhaled breath condensate in chronic respiratory diseases. Adv Clin Chem 88:121–149. https://doi.org/10.1016/bs.acc.2018.10.002
Pujos-Guillot E, Petera M, Jacquemin J, Centeno D, Lyan B, Montoliu I, Madej D, Pietruszka B, Fabbri C, Santoro A, Brzozowska A, Franceschi C, Comte B (2018) Identification of pre-frailty sub-phenotypes in elderly using metabolomics. Front Physiol 9:1903. https://doi.org/10.3389/fphys.2018.01903
Sarode GV, Kim K, Kieffer DA, Shibata NM, Litwin T, Czlonkowska A, Medici V (2019) Metabolomics profiles of patients with Wilson disease reveal a distinct metabolic signature. Metabolomics 15(3):43. https://doi.org/10.1007/s11306-019-1505-6
Wang X, Zhang A, Sun H (2013) Power of metabolomics in diagnosis and biomarker discovery of hepatocellular carcinoma. Hepatology 57(5):2072–2077
Caesar LK, Kellogg JJ, Kvalheim OM, Cech NB (2019) Opportunities and limitations for untargeted mass spectrometry metabolomics to identify biologically active constituents in complex natural product mixtures. J Nat Prod 82:469. https://doi.org/10.1021/acs.jnatprod.9b00176
Liu LL, Lin Y, Chen W, Tong ML, Luo X, Lin LR, Zhang HL, Yan JH, Niu JJ, Yang TC (2019) Metabolite profiles of the cerebrospinal fluid in neurosyphilis patients determined by untargeted metabolomics analysis. Front Neurosci 13:150. https://doi.org/10.3389/fnins.2019.00150
Sanchez-Arcos C, Kai M, Svatos A, Gershenzon J, Kunert G (2019) Untargeted metabolomics approach reveals differences in host plant chemistry before and after infestation with different pea aphid host races. Front Plant Sci 10:188. https://doi.org/10.3389/fpls.2019.00188
Wang R, Yin Y, Zhu ZJ (2019) Advancing untargeted metabolomics using data-independent acquisition mass spectrometry technology. Anal Bioanal Chem 411:4349. https://doi.org/10.1007/s00216-019-01709-1
Allwood JW, Xu Y, Martinez-Martin P, Palau R, Cowan A, Goodacre R, Marshall A, Stewart D, Howarth C (2019) Rapid UHPLC-MS metabolite profiling and phenotypic assays reveal genotypic impacts of nitrogen supplementation in oats. Metabolomics 15(3):42. https://doi.org/10.1007/s11306-019-1501-x
Fang J, Zhao H, Zhang Y, Wong M, He Y, Sun Q, Xu S, Cai Z (2019) Evaluation of gas chromatography-atmospheric pressure chemical ionization tandem mass spectrometry as an alternative to gas chromatography tandem mass spectrometry for the determination of polychlorinated biphenyls and polybrominated diphenyl ethers. Chemosphere 225:288–294. https://doi.org/10.1016/j.chemosphere.2019.03.011
Lohr KE, Camp EF, Kuzhiumparambil U, Lutz A, Leggat W, Patterson JT, Suggett DJ (2019) Resolving coral photoacclimation dynamics through coupled photophysiological and metabolomic profiling. J Exp Biol 222:jeb195982. https://doi.org/10.1242/jeb.195982
Baumeister TUH, Ueberschaar N, Schmidt-Heck W, Mohr JF, Deicke M, Wichard T, Guthke R, Pohnert G (2018) DeltaMS: a tool to track isotopologues in GC- and LC-MS data. Metabolomics 14(4):41. https://doi.org/10.1007/s11306-018-1336-x
Gilmore IS, Heiles S, Pieterse CL (2019) Metabolic imaging at the single-cell scale: recent advances in mass spectrometry imaging. Annu Rev Anal Chem (Palo Alto Calif) 12:201. https://doi.org/10.1146/annurev-anchem-061318-115516
Do KT, Wahl S, Raffler J, Molnos S, Laimighofer M, Adamski J, Suhre K, Strauch K, Peters A, Gieger C, Langenberg C, Stewart ID, Theis FJ, Grallert H, Kastenmuller G, Krumsiek J (2018) Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies. Metabolomics 14(10):128. https://doi.org/10.1007/s11306-018-1420-2
Liggi S, Hinz C, Hall Z, Santoru ML, Poddighe S, Fjeldsted J, Atzori L, Griffin JL (2018) KniMet: a pipeline for the processing of chromatography-mass spectrometry metabolomics data. Metabolomics 14(4):52. https://doi.org/10.1007/s11306-018-1349-5
Fielding S, Fayers PM, McDonald A, McPherson G, Campbell MK (2008) Simple imputation methods were inadequate for missing not at random (MNAR) quality of life data. Health Qual Life Outcomes 6(1):57
Schafer JL, Graham JW (2002) Missing data: our view of the state of the art. Psychol Methods 7(2):147
Steyerberg EW, van Veen M (2007) Imputation is beneficial for handling missing data in predictive models. J Clin Epidemiol 60(9):979
Smith CA, Want EJ, O’Maille G, Abagyan R, Siuzdak G (2006) XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem 78(3):779–787. https://doi.org/10.1021/ac051437y
Wei R, Wang J, Su M, Jia E, Chen S, Chen T, Ni Y (2018) Missing value imputation approach for mass spectrometry-based metabolomics data. Sci Rep 8(1):663. https://doi.org/10.1038/s41598-017-19120-0
Zhan X, Patterson AD, Ghosh D (2015) Kernel approaches for differential expression analysis of mass spectrometry-based metabolomics data. BMC Bioinformatics 16:77. https://doi.org/10.1186/s12859-015-0506-3
Gromski PS, Xu Y, Kotze HL, Correa E, Ellis DI, Armitage EG, Turner ML, Goodacre R (2014) Influence of missing values substitutes on multivariate analysis of metabolomics data. Metabolites 4(2):433–452. https://doi.org/10.3390/metabo4020433
Kumar N, Hoque MA, Shahjaman M, Islam SM, Mollah MN (2017) Metabolomic biomarker identification in presence of outliers and missing values. Biomed Res Int 2017:2437608. https://doi.org/10.1155/2017/2437608
Sun X, Langer B, Weckwerth W (2015) Challenges of inversely estimating Jacobian from metabolomics data. Front Bioeng Biotechnol 3:188. https://doi.org/10.3389/fbioe.2015.00188
Lee JY, Styczynski MP (2018) NS-kNN: a modified k-nearest neighbors approach for imputing metabolomics data. Metabolomics 14(12):153. https://doi.org/10.1007/s11306-018-1451-8
Di Guida R, Engel J, Allwood JW, Weber RJM, Jones MR, Sommer U, Viant MR, Dunn WB (2016) Non-targeted UHPLC-MS metabolomic data processing methods: a comparative investigation of normalisation, missing value imputation, transformation and scaling. Metabolomics 12(5):93. https://doi.org/10.1007/s11306-016-1030-9
Chen MX, Wang SY, Kuo CH, Tsai IL (2019) Metabolome analysis for investigating host-gut microbiota interactions. J Formos Med Assoc 118(Suppl 1):S10–S22. https://doi.org/10.1016/j.jfma.2018.09.007
Shen X, Zhu ZJ (2019) MetFlow: an interactive and integrated workflow for metabolomics data cleaning and differential metabolite discovery. Bioinformatics 35:2870. https://doi.org/10.1093/bioinformatics/bty1066
McLachlan GJ (2004) Discriminant analysis and statistical pattern recognition. Wiley-Interscience, Hoboken
McCallum A, Nigam K (1998) A comparison of event models for naive Bayes text classification. In: AAAI-98 workshop on learning for text categorization, vol 1. Citeseer, pp 41–48
Wang Q, Garrity GM, Tiedje JM, Cole JR (2007) Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73(16):5261–5267
Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106
Breiman L (2017) Classification and regression trees. Routledge, Boca Raton
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22
Gislason PO, Benediktsson JA, Sveinsson JR (2006) Random forests for land cover classification. Pattern Recogn Lett 27(4):294–300
Chen T, Cao Y, Zhang Y, Liu J, Bao Y, Wang C, Jia W, Zhao A (2013) Random forest in clinical metabolomics for phenotypic discrimination and biomarker selection. Evid Based Complement Alternat Med 2013:298183
Scott I, Lin W, Liakata M, Wood J, Vermeer CP, Allaway D, Ward J, Draper J, Beale M, Corol D (2013) Merits of random forests emerge in evaluation of chemometric classifiers by external validation. Anal Chim Acta 801:22–33
Ho TK (1998) Nearest neighbors in random subspaces. In: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, pp 640–648
Biau G (2012) Analysis of a random forests model. J Mach Learn Res 13(Apr):1063–1095
Hapfelmeier A, Hothorn T, Ulm K, Strobl C (2014) A new variable importance measure for random forests with missing data. Stat Comput 24(1):21–34
Menze BH, Kelm BM, Masuch R, Himmelreich U, Bachert P, Petrich W, Hamprecht FA (2009) A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics 10(1):213
Maker AV, Hu V, Kadkol SS, Hong L, Brugge W, Winter J, Yeo CJ, Hackert T, Buchler M, Lawlor RT, Salvia R, Scarpa A, Bassi C, Green S (2019) Cyst fluid biosignature to predict Intraductal papillary mucinous neoplasms of the pancreas with high malignant potential. J Am Coll Surg 228:721. https://doi.org/10.1016/j.jamcollsurg.2019.02.040
Tkachev V, Sorokin M, Mescheryakov A, Simonov A, Garazha A, Buzdin A, Muchnik I, Borisov N (2018) FLOating-window projective separator (FloWPS): a data trimming tool for support vector machines (SVM) to improve robustness of the classifier. Front Genet 9:717. https://doi.org/10.3389/fgene.2018.00717
Yerukala Sathipati S, Ho SY (2018) Identifying a miRNA signature for predicting the stage of breast cancer. Sci Rep 8(1):16138. https://doi.org/10.1038/s41598-018-34604-3
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory. ACM, pp 144–152
Bishop CM (2006) Pattern recognition and machine learning. Springer, Berlin
Ripley BD (1994) Flexible non-linear approaches to classification. In: From statistics to neural networks. Springer, Berlin, pp 105–126
Contreras-Jodar A, Nayan NH, Hamzaoui S, Caja G, Salama AAK (2019) Heat stress modifies the lactational performances and the urinary metabolomic profile related to gastrointestinal microbiota of dairy goats. PLoS One 14(2):e0202457. https://doi.org/10.1371/journal.pone.0202457
Park HG, Jang KS, Park HM, Song WS, Jeong YY, Ahn DH, Kim SM, Yang YH, Kim YG (2019) MALDI-TOF MS-based total serum protein fingerprinting for liver cancer diagnosis. Analyst 144:2231. https://doi.org/10.1039/c8an02241k
Quiros-Guerrero L, Albertazzi F, Araya-Valverde E, Romero RM, Villalobos H, Poveda L, Chavarria M, Tamayo-Castillo G (2019) Phenolic variation among Chamaecrista nictitans subspecies and varieties revealed through UPLC-ESI(−)-MS/MS chemical fingerprinting. Metabolomics 15(2):14. https://doi.org/10.1007/s11306-019-1475-8
Wang J, Yan D, Zhao A, Hou X, Zheng X, Chen P, Bao Y, Jia W, Hu C, Zhang ZL, Jia W (2019) Discovery of potential biomarkers for osteoporosis using LC-MS/MS metabolomic methods. Osteoporos Int 30:1491. https://doi.org/10.1007/s00198-019-04892-0
Grissa D, Petera M, Brandolini M, Napoli A, Comte B, Pujos-Guillot E (2016) Feature selection methods for early predictive biomarker discovery using untargeted metabolomic data. Front Mol Biosci 3:30. https://doi.org/10.3389/fmolb.2016.00030
Bayci AWL, Baker DA, Somerset AE, Turkoglu O, Hothem Z, Callahan RE, Mandal R, Han B, Bjorndahl T, Wishart D, Bahado-Singh R, Graham SF, Keidan R (2018) Metabolomic identification of diagnostic serum-based biomarkers for advanced stage melanoma. Metabolomics 14(8):105. https://doi.org/10.1007/s11306-018-1398-9
Catav SS, Elgin ES, Dag C, Stark JL, Kucukakyuz K (2018) NMR-based metabolomics reveals that plant-derived smoke stimulates root growth via affecting carbohydrate and energy metabolism in maize. Metabolomics 14(11):143. https://doi.org/10.1007/s11306-018-1440-y
Guo JG, Guo XM, Wang XR, Tian JZ, Bi HS (2019) Metabolic profile analysis of free amino acids in experimental autoimmune uveoretinitis rat plasma. Int J Ophthalmol 12(1):16–24. https://doi.org/10.18240/ijo.2019.01.03
Rodrigues-Neto JC, Correia MV, Souto AL, Ribeiro JAA, Vieira LR, Souza MT Jr, Rodrigues CM, Abdelnur PV (2018) Metabolic fingerprinting analysis of oil palm reveals a set of differentially expressed metabolites in fatal yellowing symptomatic and non-symptomatic plants. Metabolomics 14(10):142. https://doi.org/10.1007/s11306-018-1436-7
Wong M, Lodge JK (2012) A metabolomic investigation of the effects of vitamin E supplementation in humans. Nutr Metab (Lond) 9(1):110. https://doi.org/10.1186/1743-7075-9-110
Li Y, Chen M, Liu C, Xia Y, Xu B, Hu Y, Chen T, Shen M, Tang W (2018) Metabolic changes associated with papillary thyroid carcinoma: a nuclear magnetic resonance-based metabolomics study. Int J Mol Med 41(5):3006–3014. https://doi.org/10.3892/ijmm.2018.3494
Rezig L, Servadio A, Torregrossa L, Miccoli P, Basolo F, Shintu L, Caldarelli S (2018) Diagnosis of post-surgical fine-needle aspiration biopsies of thyroid lesions with indeterminate cytology using HRMAS NMR-based metabolomics. Metabolomics 14(10):141. https://doi.org/10.1007/s11306-018-1437-6
Westerhuis JA, van Velzen EJ, Hoefsloot HC, Smilde AK (2010) Multivariate paired data analysis: multilevel PLSDA versus OPLSDA. Metabolomics 6(1):119–128
Liquet B, Le Cao KA, Hocini H, Thiebaut R (2012) A novel approach for biomarker selection and the integration of repeated measures experiments from two assays. BMC Bioinformatics 13:325. https://doi.org/10.1186/1471-2105-13-325
Liu H, Motoda H (1998) Feature extraction, construction and selection: a data mining perspective, vol 453. Springer Science & Business Media, Norwell
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Weston J, Elisseeff A, Schölkopf B, Tipping M (2003) Use of the zero-norm with linear models and kernel methods. J Mach Learn Res 3(Mar):1439–1461
Mladenic D, Grobelnik M (1999) Feature selection for unbalanced class distribution and naive bayes. In: ICML 1999, pp 258–267
Bozdogan H (1987) Model selection and Akaike’s information criterion (AIC): the general theory and its analytical extensions. Psychometrika 52(3):345–370
Guan W, Zhou M, Hampton CY, Benigno BB, Walker LD, Gray A, McDonald JF, Fernández FM (2009) Ovarian cancer detection from metabolomic liquid chromatography/mass spectrometry data by support vector machines. BMC Bioinformatics 10(1):259
Platt J (1998) Sequential minimal optimization: a fast algorithm for training support vector machines
Kuhn M, Johnson K (2013) Applied predictive modeling, vol 26. Springer, New York
Behnamian A, Millard K, Banks SN, White L, Richardson M, Pasher J (2017) A systematic approach for variable selection with random forests: achieving stable variable importance values. IEEE Geosci Remote Sens Lett 14(11):1988–1992
Van Calster B, Vickers AJ (2015) Calibration of risk prediction models: impact on decision-analytic performance. Med Decis Making 35(2):162–169
Agresti A (2002) Categorical data analysis. Wiley, New York
Huang Y, Sullivan Pepe M, Feng Z (2007) Evaluating the predictiveness of a continuous marker. Biometrics 63(4):1181–1188
Holder LB, Haque MM, Skinner MK (2017) Machine learning for epigenetics and future medical applications. Epigenetics 12(7):505–514. https://doi.org/10.1080/15592294.2017.1329068
Chen C, Liaw A, Breiman L (2004) Using random forest to learn imbalanced data, vol 110. University of California, Berkeley, pp 1–12
Breiman L, Friedman J, Olshen RA, Stone CJ (1984) Classification and regression trees. Chapman & Hall, New York
Japkowicz N (2000) Learning from imbalanced data sets: a comparison of various strategies. In: AAAI workshop on learning from imbalanced data sets. Menlo Park, CA, pp 10–15
Maloof MA (2003) Learning when data sets are imbalanced and when costs are unequal and unknown. In: ICML-2003 workshop on learning from imbalanced data sets II, pp 2–1
Ling CX, Li C (1998) Data mining for direct marketing: problems and solutions. In: KDD 1998, pp 73–79
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: ICML 1997. Citeseer, pp 179–186
Domingos P (1999) Metacost: a general method for making classifiers cost-sensitive. In: KDD 1999, pp 155–164
Cateni S, Colla V, Vannucci M (2014) A method for resampling imbalanced datasets in binary classification tasks for real-world problems. Neurocomputing 135:32–41
Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Workshop on learning from imbalanced datasets II. Citeseer, pp 1–8
Collins GS, Reitsma JB, Altman DG, Moons KG (2015) Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMC Med 13(1):1
Cruickshank-Quinn CI, Jacobson S, Hughes G, Powell RL, Petrache I, Kechris K, Bowler R, Reisdorph N (2018) Metabolomics and transcriptomics pathway approach reveals outcome-specific perturbations in COPD. Sci Rep 8(1):17132
Regan EA, Hokanson JE, Murphy JR, Make B, Lynch DA, Beaty TH, Curran-Everett D, Silverman EK, Crapo JD (2010) Genetic epidemiology of COPD (COPDGene) study design. COPD 7(1):32–43. https://doi.org/10.3109/15412550903499522
Andersen SL, Briggs FBS, Winnike JH, Natanzon Y, Maichle S, Knagge KJ, Newby LK, Gregory SG (2019) Metabolome-based signature of disease pathology in MS. Mult Scler Relat Disord 31:12–21. https://doi.org/10.1016/j.msard.2019.03.006
Lee HS, Seo C, Hwang YH, Shin TH, Park HJ, Kim Y, Ji M, Min J, Choi S, Kim H, Park AK, Yee ST, Lee G, Paik MJ (2019) Metabolomic approaches to polyamines including acetylated derivatives in lung tissue of mice with asthma. Metabolomics 15(1):8. https://doi.org/10.1007/s11306-018-1470-5
Long NP, Yoon SJ, Anh NH, Nghi TD, Lim DK, Hong YJ, Hong SS, Kwon SW (2018) A systematic review on metabolomics-based diagnostic biomarker discovery and validation in pancreatic cancer. Metabolomics 14(8):109. https://doi.org/10.1007/s11306-018-1404-2
Regan EA, Hersh CP, Castaldi PJ, DeMeo DL, Silverman EK, Crapo JD, Bowler RP (2019) Omics and the search for blood biomarkers in COPD: insights from COPDGene. Am J Respir Cell Mol Biol 61:143. https://doi.org/10.1165/rcmb.2018-0245PS
Thévenot EA (2016) ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data
Rinaudo P, Boudah S, Junot C, Thévenot EA (2016) Biosigner: a new method for the discovery of significant molecular signatures from omics data. Front Mol Biosci 3:26
Zararsiz G, Goksuluk D, Korkmaz S, Eldem V, Duru IP, Unver T, Ozturk A (2014) Package ‘MLSeq’
Xia J, Psychogios N, Young N, Wishart DS (2009) MetaboAnalyst: a web server for metabolomic data analysis and interpretation. Nucleic Acids Res 37(suppl_2):W652–W660
Luan H, Ji F, Chen Y, Cai Z (2018) statTarget: a streamlined tool for signal drift correction and interpretations of quantitative mass spectrometry-based omics data. Anal Chim Acta 1036:66–72
Determan CE Jr (2015) Package ‘OmicsMarkeR’
Rohart F, Gautier B, Singh A, Le Cao K-A (2017) mixOmics: an R package for ‘omics feature selection and multiple data integration. PLoS Comput Biol 13(11):e1005752
Al-Akwaa FM, Yunits B, Huang S, Alhajaji H, Garmire LX (2018) Lilikoi: an R package for personalized pathway-based classification modeling using metabolomics data. GigaScience 7(12):giy136
Gift N, Gormley IC, Brennan L (2010) Package ‘MetabolAnalyze’
Gaude E, Chignola F, Spiliotopoulos D, Spitaleri A, Ghitti M, Garcìa-Manteiga JM, Mari S, Musco G (2013) Muma, an R package for metabolomics univariate and multivariate statistical analysis. Curr Metabol 1(2):180–189
Palla P (2015) Information management and multivariate analysis techniques for metabolomics data. Università degli Studi di Cagliari
Kuhn M (2008) Building predictive models in R using the caret package. J Stat Softw 28(5):1–26
Appendices
TRIPOD Checklist for Predictive Modeling for Metabolomics Data
| Section/topic | Item | Checklist item | Section in chapter |
|---|---|---|---|
| Title and abstract | | | |
| Title | 1 | Identify the study as developing and/or validating a multivariable prediction model, the target population, and the outcome to be predicted | See title |
| Abstract | 2 | Provide a summary of objectives, study design, setting, participants, sample size, predictors, outcome, statistical analysis, results, and conclusions | See abstract |
| Introduction | | | |
| Background and objectives | 3a | Explain the medical context (including whether diagnostic or prognostic) and rationale for developing or validating the multivariable prediction model, including references to existing models | Subheading 4.1 |
| | 3b | Specify the objectives, including whether the study describes the development or validation of the model or both | Internal validation, Subheading 4.4 |
| Methods | | | |
| Source of data | 4a | Describe the study design or source of data (e.g., randomized trial, cohort, or registry data), separately for the development and validation data sets, if applicable | Subheading 4.1 |
| | 4b | Specify the key study dates, including start of accrual; end of accrual; and, if applicable, end of follow-up | |
| Participants | 5a | Specify key elements of the study setting (e.g., primary care, secondary care, general population) including number and location of centers | N/A |
| | 5b | Describe eligibility criteria for participants | |
| | 5c | Give details of treatments received, if relevant | N/A |
| Outcome | 6a | Clearly define the outcome that is predicted by the prediction model, including how and when assessed | Subheading 4.1 |
| | 6b | Report any actions to blind assessment of the outcome to be predicted | N/A |
| Predictors | 7a | Clearly define all predictors used in developing or validating the multivariable prediction model, including how and when they were measured | 2999 predictors, for more details see [87] |
| | 7b | Report any actions to blind assessment of predictors for the outcome and other predictors | N/A |
| Sample size | 8 | Explain how the study size was arrived at | |
| Missing data | 9 | Describe how missing data were handled (e.g., complete-case analysis, single imputation, multiple imputation) with details of any imputation method | The data was already preprocessed and imputed, see Subheading 4.1 |
| Statistical analysis methods | 10c | For validation, describe how the predictions were calculated | Subheading 3.3 |
| | 10d | Specify all measures used to assess model performance and, if relevant, to compare multiple models | Subheading 3.3 |
| | 10e | Describe any model updating (e.g., recalibration) arising from the validation, if done | N/A |
| Risk groups | 11 | Provide details on how risk groups were created, if done | N/A |
| Development vs. validation | 12 | For validation, identify any differences from the development data in setting, eligibility criteria, outcome, and predictors | N/A |
| Results | | | |
| Participants | 13a | Describe the flow of participants through the study, including the number of participants with and without the outcome and, if applicable, a summary of the follow-up time. A diagram may be helpful | |
| | 13b | Describe the characteristics of the participants (basic demographics, clinical features, available predictors), including the number of participants with missing data for predictors and outcome | |
| | 13c | For validation, show a comparison with the development data of the distribution of important variables (demographics, predictors, and outcome) | |
| Model performance | 16 | Report performance measures (with CIs) for the prediction model | N/A |
| Model-updating | 17 | If done, report the results from any model updating (i.e., model specification, model performance) | Subheading 4.4 |
| Discussion | | | |
| Limitations | 18 | Discuss any limitations of the study (such as nonrepresentative sample, few events per predictor, missing data) | |
| Interpretation | 19a | For validation, discuss the results with reference to performance in the development data, and any other validation data | N/A |
| | 19b | Give an overall interpretation of the results, considering objectives, limitations, results from similar studies, and other relevant evidence | |
| Implications | 20 | Discuss the potential clinical use of the model and implications for future research | Subheadings 4.4 and 5. However, performance of the model is data-driven |
| Other information | | | |
| Supplementary information | 21 | Provide information about the availability of supplementary resources, such as study protocol, web calculator, and data sets | |
| Funding | 22 | Give the source of funding and the role of the funders for the present study | NIH |
Selected Open Source (R/Bioconductor/Web-Based) Tools for Supervised Learning Algorithms
Method | Source | Reference |
---|---|---|
PLS-DA | Bioconductor (ropls) | [93] |
PLS-DA, RF, and SVM | Bioconductor (biosigner) | [94] |
SVM, RF | Bioconductor (MLSeq) | [95] |
RF, SVM, PLS-DA | Web-based (MetaboAnalyst) | [96] |
PCA, PLS-DA, RF | Bioconductor (statTarget) | [97] |
Feature selection, metric evaluation | Bioconductor (OmicsMarkeR) | [98] |
Sparse PLS-DA | Bioconductor (mixOmics) | [99] |
Feature selection, metric evaluation | CRAN (lilikoi) | [100] |
Probabilistic principal component analysis | CRAN (MetabolAnalyze) | [101] |
Kernel-based metabolite differential analysis | CRAN (KMDA) | [21] |
PLS-DA, OPLS-DA | CRAN (muma) | [102] |
RF | CRAN (RFmarkerDetector) | [103] |
RF, SVM, PLS-DA | CRAN (caret) | [104] |
Copyright information
© 2020 Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Ghosh, T., Zhang, W., Ghosh, D., Kechris, K. (2020). Predictive Modeling for Metabolomics Data. In: Li, S. (eds) Computational Methods and Data Analysis for Metabolomics. Methods in Molecular Biology, vol 2104. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-0239-3_16
DOI: https://doi.org/10.1007/978-1-0716-0239-3_16
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-0238-6
Online ISBN: 978-1-0716-0239-3
eBook Packages: Springer Protocols