Predictive Modeling for Metabolomics Data

Part of the book series: Methods in Molecular Biology (MIMB, volume 2104)

Abstract

In recent years, mass spectrometry (MS)-based metabolomics has been applied extensively to characterize biochemical mechanisms and to study physiological processes and phenotypic changes associated with disease. Metabolomics has also been important for identifying biomarkers suitable for clinical diagnosis. In this chapter, we review supervised learning algorithms for predictive modeling, including random forest (RF), support vector machine (SVM), and partial least squares-discriminant analysis (PLS-DA). We also review feature selection methods for identifying the combination of metabolites that yields the most accurate predictive model. We conclude with best practices for reproducibility: performing internal and external replication, reporting metrics to assess performance, and following guidelines to avoid overfitting and to handle imbalanced classes. An analysis of an example data set illustrates the use of different machine learning methods and performance metrics.
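
As a rough illustration of why reporting several performance metrics matters when classes are imbalanced, the sketch below computes accuracy, sensitivity, specificity, balanced accuracy, and AUC for a binary classifier. It is not taken from the chapter: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made up for this example.

```python
# Illustrative sketch only: labels and scores are made-up toy values,
# not data or code from the chapter.

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded 1 (case) and 0 (control)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def summary_metrics(y_true, y_pred):
    """Threshold-based metrics; assumes both classes are present in y_true."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def auc(y_true, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced data: 2 cases and 8 controls (hypothetical).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = summary_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
print(metrics)
print("AUC:", auc_value)
```

On this toy data the accuracy is 0.8 even though only half of the cases are detected (sensitivity 0.5), which is the kind of gap that balanced accuracy and AUC are meant to expose.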

References

  1. Maniscalco M, Fuschillo S, Paris D, Cutignano A, Sanduzzi A, Motta A (2019) Clinical metabolomics of exhaled breath condensate in chronic respiratory diseases. Adv Clin Chem 88:121–149. https://doi.org/10.1016/bs.acc.2018.10.002

  2. Pujos-Guillot E, Petera M, Jacquemin J, Centeno D, Lyan B, Montoliu I, Madej D, Pietruszka B, Fabbri C, Santoro A, Brzozowska A, Franceschi C, Comte B (2018) Identification of pre-frailty sub-phenotypes in elderly using metabolomics. Front Physiol 9:1903. https://doi.org/10.3389/fphys.2018.01903

  3. Sarode GV, Kim K, Kieffer DA, Shibata NM, Litwin T, Czlonkowska A, Medici V (2019) Metabolomics profiles of patients with Wilson disease reveal a distinct metabolic signature. Metabolomics 15(3):43. https://doi.org/10.1007/s11306-019-1505-6

  4. Wang X, Zhang A, Sun H (2013) Power of metabolomics in diagnosis and biomarker discovery of hepatocellular carcinoma. Hepatology 57(5):2072–2077

  5. Caesar LK, Kellogg JJ, Kvalheim OM, Cech NB (2019) Opportunities and limitations for untargeted mass spectrometry metabolomics to identify biologically active constituents in complex natural product mixtures. J Nat Prod 82:469. https://doi.org/10.1021/acs.jnatprod.9b00176

  6. Liu LL, Lin Y, Chen W, Tong ML, Luo X, Lin LR, Zhang HL, Yan JH, Niu JJ, Yang TC (2019) Metabolite profiles of the cerebrospinal fluid in neurosyphilis patients determined by untargeted metabolomics analysis. Front Neurosci 13:150. https://doi.org/10.3389/fnins.2019.00150

  7. Sanchez-Arcos C, Kai M, Svatos A, Gershenzon J, Kunert G (2019) Untargeted metabolomics approach reveals differences in host plant chemistry before and after infestation with different pea aphid host races. Front Plant Sci 10:188. https://doi.org/10.3389/fpls.2019.00188

  8. Wang R, Yin Y, Zhu ZJ (2019) Advancing untargeted metabolomics using data-independent acquisition mass spectrometry technology. Anal Bioanal Chem 411:4349. https://doi.org/10.1007/s00216-019-01709-1

  9. Allwood JW, Xu Y, Martinez-Martin P, Palau R, Cowan A, Goodacre R, Marshall A, Stewart D, Howarth C (2019) Rapid UHPLC-MS metabolite profiling and phenotypic assays reveal genotypic impacts of nitrogen supplementation in oats. Metabolomics 15(3):42. https://doi.org/10.1007/s11306-019-1501-x

  10. Fang J, Zhao H, Zhang Y, Wong M, He Y, Sun Q, Xu S, Cai Z (2019) Evaluation of gas chromatography-atmospheric pressure chemical ionization tandem mass spectrometry as an alternative to gas chromatography tandem mass spectrometry for the determination of polychlorinated biphenyls and polybrominated diphenyl ethers. Chemosphere 225:288–294. https://doi.org/10.1016/j.chemosphere.2019.03.011

  11. Lohr KE, Camp EF, Kuzhiumparambil U, Lutz A, Leggat W, Patterson JT, Suggett DJ (2019) Resolving coral photoacclimation dynamics through coupled photophysiological and metabolomic profiling. J Exp Biol 222:jeb195982. https://doi.org/10.1242/jeb.195982

  12. Baumeister TUH, Ueberschaar N, Schmidt-Heck W, Mohr JF, Deicke M, Wichard T, Guthke R, Pohnert G (2018) DeltaMS: a tool to track isotopologues in GC- and LC-MS data. Metabolomics 14(4):41. https://doi.org/10.1007/s11306-018-1336-x

  13. Gilmore IS, Heiles S, Pieterse CL (2019) Metabolic imaging at the single-cell scale: recent advances in mass spectrometry imaging. Annu Rev Anal Chem (Palo Alto Calif) 12:201. https://doi.org/10.1146/annurev-anchem-061318-115516

  14. Do KT, Wahl S, Raffler J, Molnos S, Laimighofer M, Adamski J, Suhre K, Strauch K, Peters A, Gieger C, Langenberg C, Stewart ID, Theis FJ, Grallert H, Kastenmuller G, Krumsiek J (2018) Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies. Metabolomics 14(10):128. https://doi.org/10.1007/s11306-018-1420-2

  15. Liggi S, Hinz C, Hall Z, Santoru ML, Poddighe S, Fjeldsted J, Atzori L, Griffin JL (2018) KniMet: a pipeline for the processing of chromatography-mass spectrometry metabolomics data. Metabolomics 14(4):52. https://doi.org/10.1007/s11306-018-1349-5

  16. Fielding S, Fayers PM, McDonald A, McPherson G, Campbell MK (2008) Simple imputation methods were inadequate for missing not at random (MNAR) quality of life data. Health Qual Life Outcomes 6(1):57

  17. Schafer JL, Graham JW (2002) Missing data: our view of the state of the art. Psychol Methods 7(2):147

  18. Steyerberg EW, van Veen M (2007) Imputation is beneficial for handling missing data in predictive models. J Clin Epidemiol 60(9):979

  19. Smith CA, Want EJ, O’Maille G, Abagyan R, Siuzdak G (2006) XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem 78(3):779–787. https://doi.org/10.1021/ac051437y

  20. Wei R, Wang J, Su M, Jia E, Chen S, Chen T, Ni Y (2018) Missing value imputation approach for mass spectrometry-based metabolomics data. Sci Rep 8(1):663. https://doi.org/10.1038/s41598-017-19120-0

  21. Zhan X, Patterson AD, Ghosh D (2015) Kernel approaches for differential expression analysis of mass spectrometry-based metabolomics data. BMC Bioinformatics 16:77. https://doi.org/10.1186/s12859-015-0506-3

  22. Gromski PS, Xu Y, Kotze HL, Correa E, Ellis DI, Armitage EG, Turner ML, Goodacre R (2014) Influence of missing values substitutes on multivariate analysis of metabolomics data. Metabolites 4(2):433–452. https://doi.org/10.3390/metabo4020433

  23. Kumar N, Hoque MA, Shahjaman M, Islam SM, Mollah MN (2017) Metabolomic biomarker identification in presence of outliers and missing values. Biomed Res Int 2017:2437608. https://doi.org/10.1155/2017/2437608

  24. Sun X, Langer B, Weckwerth W (2015) Challenges of inversely estimating Jacobian from metabolomics data. Front Bioeng Biotechnol 3:188. https://doi.org/10.3389/fbioe.2015.00188

  25. Lee JY, Styczynski MP (2018) NS-kNN: a modified k-nearest neighbors approach for imputing metabolomics data. Metabolomics 14(12):153. https://doi.org/10.1007/s11306-018-1451-8

  26. Di Guida R, Engel J, Allwood JW, Weber RJM, Jones MR, Sommer U, Viant MR, Dunn WB (2016) Non-targeted UHPLC-MS metabolomic data processing methods: a comparative investigation of normalisation, missing value imputation, transformation and scaling. Metabolomics 12(5):93. https://doi.org/10.1007/s11306-016-1030-9

  27. Chen MX, Wang SY, Kuo CH, Tsai IL (2019) Metabolome analysis for investigating host-gut microbiota interactions. J Formos Med Assoc 118(Suppl 1):S10–S22. https://doi.org/10.1016/j.jfma.2018.09.007

  28. Shen X, Zhu ZJ (2019) MetFlow: an interactive and integrated workflow for metabolomics data cleaning and differential metabolite discovery. Bioinformatics 35:2870. https://doi.org/10.1093/bioinformatics/bty1066

  29. McLachlan GJ (2004) Discriminant analysis and statistical pattern recognition. Wiley-Interscience, Hoboken

  30. McCallum A, Nigam K (1998) A comparison of event models for naive Bayes text classification. In: AAAI-98 workshop on learning for text categorization, vol 1. Citeseer, pp 41–48

  31. Wang Q, Garrity GM, Tiedje JM, Cole JR (2007) Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73(16):5261–5267

  32. Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106

  33. Breiman L (2017) Classification and regression trees. Routledge, Boca Raton

  34. Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22

  35. Gislason PO, Benediktsson JA, Sveinsson JR (2006) Random forests for land cover classification. Pattern Recogn Lett 27(4):294–300

  36. Chen T, Cao Y, Zhang Y, Liu J, Bao Y, Wang C, Jia W, Zhao A (2013) Random forest in clinical metabolomics for phenotypic discrimination and biomarker selection. Evid Based Complement Alternat Med 2013:298183

  37. Scott I, Lin W, Liakata M, Wood J, Vermeer CP, Allaway D, Ward J, Draper J, Beale M, Corol D (2013) Merits of random forests emerge in evaluation of chemometric classifiers by external validation. Anal Chim Acta 801:22–33

  38. Ho TK (1998) Nearest neighbors in random subspaces. In: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, pp 640–648

  39. Biau G (2012) Analysis of a random forests model. J Mach Learn Res 13(Apr):1063–1095

  40. Hapfelmeier A, Hothorn T, Ulm K, Strobl C (2014) A new variable importance measure for random forests with missing data. Stat Comput 24(1):21–34

  41. Menze BH, Kelm BM, Masuch R, Himmelreich U, Bachert P, Petrich W, Hamprecht FA (2009) A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics 10(1):213

  42. Maker AV, Hu V, Kadkol SS, Hong L, Brugge W, Winter J, Yeo CJ, Hackert T, Buchler M, Lawlor RT, Salvia R, Scarpa A, Bassi C, Green S (2019) Cyst fluid biosignature to predict Intraductal papillary mucinous neoplasms of the pancreas with high malignant potential. J Am Coll Surg 228:721. https://doi.org/10.1016/j.jamcollsurg.2019.02.040

  43. Tkachev V, Sorokin M, Mescheryakov A, Simonov A, Garazha A, Buzdin A, Muchnik I, Borisov N (2018) FLOating-window projective separator (FloWPS): a data trimming tool for support vector machines (SVM) to improve robustness of the classifier. Front Genet 9:717. https://doi.org/10.3389/fgene.2018.00717

  44. Yerukala Sathipati S, Ho SY (2018) Identifying a miRNA signature for predicting the stage of breast cancer. Sci Rep 8(1):16138. https://doi.org/10.1038/s41598-018-34604-3

  45. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

  46. Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory. ACM, pp 144–152

  47. Bishop CM (2006) Pattern recognition and machine learning. Springer, Berlin

  48. Ripley BD (1994) Flexible non-linear approaches to classification. In: From statistics to neural networks. Springer, Berlin, pp 105–126

  49. Contreras-Jodar A, Nayan NH, Hamzaoui S, Caja G, Salama AAK (2019) Heat stress modifies the lactational performances and the urinary metabolomic profile related to gastrointestinal microbiota of dairy goats. PLoS One 14(2):e0202457. https://doi.org/10.1371/journal.pone.0202457

  50. Park HG, Jang KS, Park HM, Song WS, Jeong YY, Ahn DH, Kim SM, Yang YH, Kim YG (2019) MALDI-TOF MS-based total serum protein fingerprinting for liver cancer diagnosis. Analyst 144:2231. https://doi.org/10.1039/c8an02241k

  51. Quiros-Guerrero L, Albertazzi F, Araya-Valverde E, Romero RM, Villalobos H, Poveda L, Chavarria M, Tamayo-Castillo G (2019) Phenolic variation among Chamaecrista nictitans subspecies and varieties revealed through UPLC-ESI(−)-MS/MS chemical fingerprinting. Metabolomics 15(2):14. https://doi.org/10.1007/s11306-019-1475-8

  52. Wang J, Yan D, Zhao A, Hou X, Zheng X, Chen P, Bao Y, Jia W, Hu C, Zhang ZL, Jia W (2019) Discovery of potential biomarkers for osteoporosis using LC-MS/MS metabolomic methods. Osteoporos Int 30:1491. https://doi.org/10.1007/s00198-019-04892-0

  53. Grissa D, Petera M, Brandolini M, Napoli A, Comte B, Pujos-Guillot E (2016) Feature selection methods for early predictive biomarker discovery using untargeted metabolomic data. Front Mol Biosci 3:30. https://doi.org/10.3389/fmolb.2016.00030

  54. Bayci AWL, Baker DA, Somerset AE, Turkoglu O, Hothem Z, Callahan RE, Mandal R, Han B, Bjorndahl T, Wishart D, Bahado-Singh R, Graham SF, Keidan R (2018) Metabolomic identification of diagnostic serum-based biomarkers for advanced stage melanoma. Metabolomics 14(8):105. https://doi.org/10.1007/s11306-018-1398-9

  55. Catav SS, Elgin ES, Dag C, Stark JL, Kucukakyuz K (2018) NMR-based metabolomics reveals that plant-derived smoke stimulates root growth via affecting carbohydrate and energy metabolism in maize. Metabolomics 14(11):143. https://doi.org/10.1007/s11306-018-1440-y

  56. Guo JG, Guo XM, Wang XR, Tian JZ, Bi HS (2019) Metabolic profile analysis of free amino acids in experimental autoimmune uveoretinitis rat plasma. Int J Ophthalmol 12(1):16–24. https://doi.org/10.18240/ijo.2019.01.03

  57. Rodrigues-Neto JC, Correia MV, Souto AL, Ribeiro JAA, Vieira LR, Souza MT Jr, Rodrigues CM, Abdelnur PV (2018) Metabolic fingerprinting analysis of oil palm reveals a set of differentially expressed metabolites in fatal yellowing symptomatic and non-symptomatic plants. Metabolomics 14(10):142. https://doi.org/10.1007/s11306-018-1436-7

  58. Wong M, Lodge JK (2012) A metabolomic investigation of the effects of vitamin E supplementation in humans. Nutr Metab (Lond) 9(1):110. https://doi.org/10.1186/1743-7075-9-110

  59. Li Y, Chen M, Liu C, Xia Y, Xu B, Hu Y, Chen T, Shen M, Tang W (2018) Metabolic changes associated with papillary thyroid carcinoma: a nuclear magnetic resonance-based metabolomics study. Int J Mol Med 41(5):3006–3014. https://doi.org/10.3892/ijmm.2018.3494

  60. Rezig L, Servadio A, Torregrossa L, Miccoli P, Basolo F, Shintu L, Caldarelli S (2018) Diagnosis of post-surgical fine-needle aspiration biopsies of thyroid lesions with indeterminate cytology using HRMAS NMR-based metabolomics. Metabolomics 14(10):141. https://doi.org/10.1007/s11306-018-1437-6

  61. Westerhuis JA, van Velzen EJ, Hoefsloot HC, Smilde AK (2010) Multivariate paired data analysis: multilevel PLSDA versus OPLSDA. Metabolomics 6(1):119–128

  62. Liquet B, Le Cao KA, Hocini H, Thiebaut R (2012) A novel approach for biomarker selection and the integration of repeated measures experiments from two assays. BMC Bioinformatics 13:325. https://doi.org/10.1186/1471-2105-13-325

  63. Liu H, Motoda H (1998) Feature extraction, construction and selection: a data mining perspective, vol 453. Springer Science & Business Media, Norwell

  64. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182

  65. Weston J, Elisseeff A, Schölkopf B, Tipping M (2003) Use of the zero-norm with linear models and kernel methods. J Mach Learn Res 3(Mar):1439–1461

  66. Mladenic D, Grobelnik M (1999) Feature selection for unbalanced class distribution and naive bayes. In: ICML 1999, pp 258–267

  67. Bozdogan H (1987) Model selection and Akaike’s information criterion (AIC): the general theory and its analytical extensions. Psychometrika 52(3):345–370

  68. Guan W, Zhou M, Hampton CY, Benigno BB, Walker LD, Gray A, McDonald JF, Fernández FM (2009) Ovarian cancer detection from metabolomic liquid chromatography/mass spectrometry data by support vector machines. BMC Bioinformatics 10(1):259

  69. Platt J (1998) Sequential minimal optimization: a fast algorithm for training support vector machines

  70. Kuhn M, Johnson K (2013) Applied predictive modeling, vol 26. Springer, New York

  71. Behnamian A, Millard K, Banks SN, White L, Richardson M, Pasher J (2017) A systematic approach for variable selection with random forests: achieving stable variable importance values. IEEE Geosci Remote Sens Lett 14(11):1988–1992

  72. Van Calster B, Vickers AJ (2015) Calibration of risk prediction models: impact on decision-analytic performance. Med Decis Making 35(2):162–169

  73. Agresti A (2002) Categorical data analysis. Wiley, New York

  74. Huang Y, Sullivan Pepe M, Feng Z (2007) Evaluating the predictiveness of a continuous marker. Biometrics 63(4):1181–1188

  75. Holder LB, Haque MM, Skinner MK (2017) Machine learning for epigenetics and future medical applications. Epigenetics 12(7):505–514. https://doi.org/10.1080/15592294.2017.1329068

  76. Chen C, Liaw A, Breiman L (2004) Using random forest to learn imbalanced data, vol 110. University of California, Berkeley, pp 1–12

  77. Breiman L, Friedman J, Olshen RA, Stone CJ (1984) Classification and regression trees. Chapman & Hall, New York

  78. Japkowicz N (2000) Learning from imbalanced data sets: a comparison of various strategies. In: AAAI workshop on learning from imbalanced data sets. Menlo Park, CA, pp 10–15

  79. Maloof MA (2003) Learning when data sets are imbalanced and when costs are unequal and unknown. In: ICML-2003 workshop on learning from imbalanced data sets II, pp 2–1

  80. Ling CX, Li C (1998) Data mining for direct marketing: problems and solutions. In: KDD 1998, pp 73–79

  81. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357

  82. Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: ICML 1997. Citeseer, pp 179–186

  83. Domingos P (1999) Metacost: a general method for making classifiers cost-sensitive. In: KDD 1999, pp 155–164

  84. Cateni S, Colla V, Vannucci M (2014) A method for resampling imbalanced datasets in binary classification tasks for real-world problems. Neurocomputing 135:32–41

  85. Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Workshop on learning from imbalanced datasets II. Citeseer, pp 1–8

  86. Collins GS, Reitsma JB, Altman DG, Moons KG (2015) Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMC Med 13(1):1

  87. Cruickshank-Quinn CI, Jacobson S, Hughes G, Powell RL, Petrache I, Kechris K, Bowler R, Reisdorph N (2018) Metabolomics and transcriptomics pathway approach reveals outcome-specific perturbations in COPD. Sci Rep 8(1):17132

  88. Regan EA, Hokanson JE, Murphy JR, Make B, Lynch DA, Beaty TH, Curran-Everett D, Silverman EK, Crapo JD (2010) Genetic epidemiology of COPD (COPDGene) study design. COPD 7(1):32–43. https://doi.org/10.3109/15412550903499522

  89. Andersen SL, Briggs FBS, Winnike JH, Natanzon Y, Maichle S, Knagge KJ, Newby LK, Gregory SG (2019) Metabolome-based signature of disease pathology in MS. Mult Scler Relat Disord 31:12–21. https://doi.org/10.1016/j.msard.2019.03.006

  90. Lee HS, Seo C, Hwang YH, Shin TH, Park HJ, Kim Y, Ji M, Min J, Choi S, Kim H, Park AK, Yee ST, Lee G, Paik MJ (2019) Metabolomic approaches to polyamines including acetylated derivatives in lung tissue of mice with asthma. Metabolomics 15(1):8. https://doi.org/10.1007/s11306-018-1470-5

  91. Long NP, Yoon SJ, Anh NH, Nghi TD, Lim DK, Hong YJ, Hong SS, Kwon SW (2018) A systematic review on metabolomics-based diagnostic biomarker discovery and validation in pancreatic cancer. Metabolomics 14(8):109. https://doi.org/10.1007/s11306-018-1404-2

  92. Regan EA, Hersh CP, Castaldi PJ, DeMeo DL, Silverman EK, Crapo JD, Bowler RP (2019) Omics and the search for blood biomarkers in COPD: insights from COPDGene. Am J Respir Cell Mol Biol 61:143. https://doi.org/10.1165/rcmb.2018-0245PS

  93. Thévenot EA (2016) ropls: PCA, PLS (-DA) and OPLS (-DA) for multivariate analysis and feature selection of omics data

  94. Rinaudo P, Boudah S, Junot C, Thévenot EA (2016) Biosigner: a new method for the discovery of significant molecular signatures from omics data. Front Mol Biosci 3:26

  95. Zararsiz G, Goksuluk D, Korkmaz S, Eldem V, Duru IP, Unver T, Ozturk A, Zararsiz MG (2014) Package ‘MLSeq’

  96. Xia J, Psychogios N, Young N, Wishart DS (2009) MetaboAnalyst: a web server for metabolomic data analysis and interpretation. Nucleic Acids Res 37(suppl_2):W652–W660

  97. Luan H, Ji F, Chen Y, Cai Z (2018) statTarget: a streamlined tool for signal drift correction and interpretations of quantitative mass spectrometry-based omics data. Anal Chim Acta 1036:66–72

  98. Determan Jr CE, Determan Jr MCE (2015) Package ‘OmicsMarkeR’

  99. Rohart F, Gautier B, Singh A, Le Cao K-A (2017) mixOmics: an R package for ‘omics feature selection and multiple data integration. PLoS Comput Biol 13(11):e1005752

  100. Al-Akwaa FM, Yunits B, Huang S, Alhajaji H, Garmire LX (2018) Lilikoi: an R package for personalized pathway-based classification modeling using metabolomics data. GigaScience 7(12):giy136

  101. Gift N, Gormley IC, Brennan L, Gormley MC (2010) Package ‘MetabolAnalyze’

  102. Gaude E, Chignola F, Spiliotopoulos D, Spitaleri A, Ghitti M, Garcìa-Manteiga JM, Mari S, Musco G (2013) Muma, an R package for metabolomics univariate and multivariate statistical analysis. Curr Metabol 1(2):180–189

  103. Palla P (2015) Information management and multivariate analysis techniques for metabolomics data. Università degli Studi di Cagliari

  104. Kuhn M (2008) Building predictive models in R using the caret package. J Stat Softw 28(5):1–26

Author information

Corresponding author

Correspondence to Katerina Kechris.

Appendices

TRIPOD Checklist for Predictive Modeling for Metabolomics Data

| Section/topic | Item | Checklist item | Section in this chapter |
|---|---|---|---|
| Title and abstract | | | |
| Title | 1 | Identify the study as developing and/or validating a multivariable prediction model, the target population, and the outcome to be predicted | See title |
| Abstract | 2 | Provide a summary of objectives, study design, setting, participants, sample size, predictors, outcome, statistical analysis, results, and conclusions | See abstract |
| Introduction | | | |
| Background and objectives | 3a | Explain the medical context (including whether diagnostic or prognostic) and rationale for developing or validating the multivariable prediction model, including references to existing models | Subheading 4.1 |
| | 3b | Specify the objectives, including whether the study describes the development or validation of the model or both | Internal validation, Subheading 4.4 |
| Methods | | | |
| Source of data | 4a | Describe the study design or source of data (e.g., randomized trial, cohort, or registry data), separately for the development and validation data sets, if applicable | Subheading 4.1 |
| | 4b | Specify the key study dates, including start of accrual; end of accrual; and, if applicable, end of follow-up | Subheading 4.1, see [87] |
| Participants | 5a | Specify key elements of the study setting (e.g., primary care, secondary care, general population) including number and location of centers | N/A |
| | 5b | Describe eligibility criteria for participants | Subheading 4.1, see [87] |
| | 5c | Give details of treatments received, if relevant | N/A |
| Outcome | 6a | Clearly define the outcome that is predicted by the prediction model, including how and when assessed | Subheading 4.1 |
| | 6b | Report any actions to blind assessment of the outcome to be predicted | N/A |
| Predictors | 7a | Clearly define all predictors used in developing or validating the multivariable prediction model, including how and when they were measured | 2999 predictors, for more details see [87] |
| | 7b | Report any actions to blind assessment of predictors for the outcome and other predictors | N/A |
| Sample size | 8 | Explain how the study size was arrived at | Subheading 4.1, see [87] |
| Missing data | 9 | Describe how missing data were handled (e.g., complete-case analysis, single imputation, multiple imputation) with details of any imputation method | The data was already preprocessed and imputed, see Subheading 4.1 |
| Statistical analysis methods | 10c | For validation, describe how the predictions were calculated | Subheading 3.3 |
| | 10d | Specify all measures used to assess model performance and, if relevant, to compare multiple models | Subheading 3.3 |
| | 10e | Describe any model updating (e.g., recalibration) arising from the validation, if done | N/A |
| Risk groups | 11 | Provide details on how risk groups were created, if done | N/A |
| Development vs. validation | 12 | For validation, identify any differences from the development data in setting, eligibility criteria, outcome, and predictors | N/A |
| Results | | | |
| Participants | 13a | Describe the flow of participants through the study, including the number of participants with and without the outcome and, if applicable, a summary of the follow-up time. A diagram may be helpful | Subheading 4.1, see [87] |
| | 13b | Describe the characteristics of the participants (basic demographics, clinical features, available predictors), including the number of participants with missing data for predictors and outcome | Subheading 4.1, see [87] |
| | 13c | For validation, show a comparison with the development data of the distribution of important variables (demographics, predictors, and outcome) | Subheadings 4.3 and 4.4 |
| Model performance | 16 | Report performance measures (with CIs) for the prediction model | N/A |
| Model-updating | 17 | If done, report the results from any model updating (i.e., model specification, model performance) | Subheading 4.4 |
| Discussion | | | |
| Limitations | 18 | Discuss any limitations of the study (such as nonrepresentative sample, few events per predictor, missing data) | Subheading 4.1, see [87] |
| Interpretation | 19a | For validation, discuss the results with reference to performance in the development data, and any other validation data | N/A |
| | 19b | Give an overall interpretation of the results, considering objectives, limitations, results from similar studies, and other relevant evidence | Subheadings 4.4 and 5 |
| Implications | 20 | Discuss the potential clinical use of the model and implications for future research | Subheadings 4.4 and 5. However, performance of the model is data-driven |
| Other information | | | |
| Supplementary information | 21 | Provide information about the availability of supplementary resources, such as study protocol, web calculator, and data sets | Subheading 4.1, see [87] |
| Funding | 22 | Give the source of funding and the role of the funders for the present study | NIH |

Selected Open Source (R/Bioconductor/Web-Based) Tools for Supervised Learning Algorithms

| Method | Source | Reference |
|---|---|---|
| PLS-DA | Bioconductor (ropls) | [93] |
| PLS-DA, RF, and SVM | Bioconductor (biosigner) | [94] |
| SVM, RF | Bioconductor (MLSeq) | [95] |
| RF, SVM, PLS-DA | MetaboAnalyst (http://www.metaboanalyst.ca/) | [96] |
| PCA, PLS-DA, RF | Bioconductor (statTarget) | [97] |
| Feature selection, metric evaluation | Bioconductor (OmicsMarkeR) | [98] |
| Sparse PLS-DA | Bioconductor (mixOmics) | [99] |
| Feature selection, metric evaluation | CRAN (lilikoi) | [100] |
| Probabilistic principal component analysis | CRAN (MetabolAnalyze) | [101] |
| Kernel-based metabolite differential analysis | CRAN (KMDA) | [21] |
| PLS-DA, OPLS-DA | CRAN (muma) | [102] |
| RF | CRAN (RFmarkerDetector) | [103] |
| RF, SVM, PLS-DA | CRAN (caret) | [104] |

Copyright information

© 2020 Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Ghosh, T., Zhang, W., Ghosh, D., Kechris, K. (2020). Predictive Modeling for Metabolomics Data. In: Li, S. (eds) Computational Methods and Data Analysis for Metabolomics. Methods in Molecular Biology, vol 2104. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-0239-3_16

  • DOI: https://doi.org/10.1007/978-1-0716-0239-3_16

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-0716-0238-6

  • Online ISBN: 978-1-0716-0239-3

  • eBook Packages: Springer Protocols
