Predictive Modeling for Metabolomics Data

Ghosh, Tusharkanti; Zhang, Weiming; Ghosh, Debashis; Kechris, Katerina

doi:10.1007/978-1-0716-0239-3_16

Protocol
First Online: 18 January 2020

Part of the book series: Methods in Molecular Biology (MIMB, volume 2104)

Abstract

In recent years, mass spectrometry (MS)-based metabolomics has been extensively applied to characterize biochemical mechanisms and to study physiological processes and phenotypic changes associated with disease. Metabolomics has also been important for identifying biomarkers suitable for clinical diagnosis. In this chapter, we review supervised learning algorithms for predictive modeling, such as random forest (RF), support vector machine (SVM), and partial least squares-discriminant analysis (PLS-DA). We also review feature selection methods for identifying the combination of metabolites that yields the most accurate predictive model. We conclude with best practices for reproducibility: including internal and external replication, reporting metrics to assess performance, and following guidelines to avoid overfitting and to deal with imbalanced classes. An analysis of an example data set illustrates the use of different machine learning methods and performance metrics.
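
The abstract's points about reporting performance metrics and handling imbalanced classes can be made concrete with a small sketch. The Python below is not taken from the chapter; the function names (confusion_counts, summary_metrics, auc) and the toy labels are hypothetical, and it uses only the standard library. It assumes binary labels coded 1 (case) and 0 (control), and shows why accuracy alone can mislead when one class dominates.

```python
"""Illustrative sketch: binary-classifier metrics under class imbalance."""
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for labels coded 1 = case, 0 = control."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def summary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity, and balanced accuracy."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve as the fraction of (case, control) pairs
    in which the case has the higher score; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical imbalanced data: 10 cases, 90 controls.
    labels = [1] * 10 + [0] * 90
    # A trivial classifier that always predicts "control".
    always_control = [0] * 100
    for name, value in summary_metrics(labels, always_control).items():
        print(f"{name}: {value:.2f}")
```

In this toy setting the always-control classifier is correct on 90 of 100 samples yet detects no cases, so sensitivity and balanced accuracy expose what raw accuracy hides. The pairwise AUC is quadratic in sample size, which is fine for a sketch but not for large data sets.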

References

Maniscalco M, Fuschillo S, Paris D, Cutignano A, Sanduzzi A, Motta A (2019) Clinical metabolomics of exhaled breath condensate in chronic respiratory diseases. Adv Clin Chem 88:121–149. https://doi.org/10.1016/bs.acc.2018.10.002
Pujos-Guillot E, Petera M, Jacquemin J, Centeno D, Lyan B, Montoliu I, Madej D, Pietruszka B, Fabbri C, Santoro A, Brzozowska A, Franceschi C, Comte B (2018) Identification of pre-frailty sub-phenotypes in elderly using metabolomics. Front Physiol 9:1903. https://doi.org/10.3389/fphys.2018.01903
Sarode GV, Kim K, Kieffer DA, Shibata NM, Litwin T, Czlonkowska A, Medici V (2019) Metabolomics profiles of patients with Wilson disease reveal a distinct metabolic signature. Metabolomics 15(3):43. https://doi.org/10.1007/s11306-019-1505-6
Wang X, Zhang A, Sun H (2013) Power of metabolomics in diagnosis and biomarker discovery of hepatocellular carcinoma. Hepatology 57(5):2072–2077
Caesar LK, Kellogg JJ, Kvalheim OM, Cech NB (2019) Opportunities and limitations for untargeted mass spectrometry metabolomics to identify biologically active constituents in complex natural product mixtures. J Nat Prod 82:469. https://doi.org/10.1021/acs.jnatprod.9b00176
Liu LL, Lin Y, Chen W, Tong ML, Luo X, Lin LR, Zhang HL, Yan JH, Niu JJ, Yang TC (2019) Metabolite profiles of the cerebrospinal fluid in neurosyphilis patients determined by untargeted metabolomics analysis. Front Neurosci 13:150. https://doi.org/10.3389/fnins.2019.00150
Sanchez-Arcos C, Kai M, Svatos A, Gershenzon J, Kunert G (2019) Untargeted metabolomics approach reveals differences in host plant chemistry before and after infestation with different pea aphid host races. Front Plant Sci 10:188. https://doi.org/10.3389/fpls.2019.00188
Wang R, Yin Y, Zhu ZJ (2019) Advancing untargeted metabolomics using data-independent acquisition mass spectrometry technology. Anal Bioanal Chem 411:4349. https://doi.org/10.1007/s00216-019-01709-1
Allwood JW, Xu Y, Martinez-Martin P, Palau R, Cowan A, Goodacre R, Marshall A, Stewart D, Howarth C (2019) Rapid UHPLC-MS metabolite profiling and phenotypic assays reveal genotypic impacts of nitrogen supplementation in oats. Metabolomics 15(3):42. https://doi.org/10.1007/s11306-019-1501-x
Fang J, Zhao H, Zhang Y, Wong M, He Y, Sun Q, Xu S, Cai Z (2019) Evaluation of gas chromatography-atmospheric pressure chemical ionization tandem mass spectrometry as an alternative to gas chromatography tandem mass spectrometry for the determination of polychlorinated biphenyls and polybrominated diphenyl ethers. Chemosphere 225:288–294. https://doi.org/10.1016/j.chemosphere.2019.03.011
Lohr KE, Camp EF, Kuzhiumparambil U, Lutz A, Leggat W, Patterson JT, Suggett DJ (2019) Resolving coral photoacclimation dynamics through coupled photophysiological and metabolomic profiling. J Exp Biol 222:jeb195982. https://doi.org/10.1242/jeb.195982
Baumeister TUH, Ueberschaar N, Schmidt-Heck W, Mohr JF, Deicke M, Wichard T, Guthke R, Pohnert G (2018) DeltaMS: a tool to track isotopologues in GC- and LC-MS data. Metabolomics 14(4):41. https://doi.org/10.1007/s11306-018-1336-x
Gilmore IS, Heiles S, Pieterse CL (2019) Metabolic imaging at the single-cell scale: recent advances in mass spectrometry imaging. Annu Rev Anal Chem (Palo Alto Calif) 12:201. https://doi.org/10.1146/annurev-anchem-061318-115516
Do KT, Wahl S, Raffler J, Molnos S, Laimighofer M, Adamski J, Suhre K, Strauch K, Peters A, Gieger C, Langenberg C, Stewart ID, Theis FJ, Grallert H, Kastenmuller G, Krumsiek J (2018) Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies. Metabolomics 14(10):128. https://doi.org/10.1007/s11306-018-1420-2
Liggi S, Hinz C, Hall Z, Santoru ML, Poddighe S, Fjeldsted J, Atzori L, Griffin JL (2018) KniMet: a pipeline for the processing of chromatography-mass spectrometry metabolomics data. Metabolomics 14(4):52. https://doi.org/10.1007/s11306-018-1349-5
Fielding S, Fayers PM, McDonald A, McPherson G, Campbell MK (2008) Simple imputation methods were inadequate for missing not at random (MNAR) quality of life data. Health Qual Life Outcomes 6(1):57
Schafer JL, Graham JW (2002) Missing data: our view of the state of the art. Psychol Methods 7(2):147
Steyerberg EW, van Veen M (2007) Imputation is beneficial for handling missing data in predictive models. J Clin Epidemiol 60(9):979
Smith CA, Want EJ, O’Maille G, Abagyan R, Siuzdak G (2006) XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem 78(3):779–787. https://doi.org/10.1021/ac051437y
Wei R, Wang J, Su M, Jia E, Chen S, Chen T, Ni Y (2018) Missing value imputation approach for mass spectrometry-based metabolomics data. Sci Rep 8(1):663. https://doi.org/10.1038/s41598-017-19120-0
Zhan X, Patterson AD, Ghosh D (2015) Kernel approaches for differential expression analysis of mass spectrometry-based metabolomics data. BMC Bioinformatics 16:77. https://doi.org/10.1186/s12859-015-0506-3
Gromski PS, Xu Y, Kotze HL, Correa E, Ellis DI, Armitage EG, Turner ML, Goodacre R (2014) Influence of missing values substitutes on multivariate analysis of metabolomics data. Metabolites 4(2):433–452. https://doi.org/10.3390/metabo4020433
Kumar N, Hoque MA, Shahjaman M, Islam SM, Mollah MN (2017) Metabolomic biomarker identification in presence of outliers and missing values. Biomed Res Int 2017:2437608. https://doi.org/10.1155/2017/2437608
Sun X, Langer B, Weckwerth W (2015) Challenges of inversely estimating Jacobian from metabolomics data. Front Bioeng Biotechnol 3:188. https://doi.org/10.3389/fbioe.2015.00188
Lee JY, Styczynski MP (2018) NS-kNN: a modified k-nearest neighbors approach for imputing metabolomics data. Metabolomics 14(12):153. https://doi.org/10.1007/s11306-018-1451-8
Di Guida R, Engel J, Allwood JW, Weber RJM, Jones MR, Sommer U, Viant MR, Dunn WB (2016) Non-targeted UHPLC-MS metabolomic data processing methods: a comparative investigation of normalisation, missing value imputation, transformation and scaling. Metabolomics 12(5):93. https://doi.org/10.1007/s11306-016-1030-9
Chen MX, Wang SY, Kuo CH, Tsai IL (2019) Metabolome analysis for investigating host-gut microbiota interactions. J Formos Med Assoc 118(Suppl 1):S10–S22. https://doi.org/10.1016/j.jfma.2018.09.007
Shen X, Zhu ZJ (2019) MetFlow: an interactive and integrated workflow for metabolomics data cleaning and differential metabolite discovery. Bioinformatics 35:2870. https://doi.org/10.1093/bioinformatics/bty1066
McLachlan GJ (2004) Discriminant analysis and statistical pattern recognition. Wiley-Interscience, Hoboken
McCallum A, Nigam K (1998) A comparison of event models for naive Bayes text classification. In: AAAI-98 workshop on learning for text categorization, vol 1. Citeseer, pp 41–48
Wang Q, Garrity GM, Tiedje JM, Cole JR (2007) Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73(16):5261–5267
Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106
Breiman L (2017) Classification and regression trees. Routledge, Boca Raton
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22
Gislason PO, Benediktsson JA, Sveinsson JR (2006) Random forests for land cover classification. Pattern Recogn Lett 27(4):294–300
Chen T, Cao Y, Zhang Y, Liu J, Bao Y, Wang C, Jia W, Zhao A (2013) Random forest in clinical metabolomics for phenotypic discrimination and biomarker selection. Evid Based Complement Alternat Med 2013:298183
Scott I, Lin W, Liakata M, Wood J, Vermeer CP, Allaway D, Ward J, Draper J, Beale M, Corol D (2013) Merits of random forests emerge in evaluation of chemometric classifiers by external validation. Anal Chim Acta 801:22–33
Ho TK (1998) Nearest neighbors in random subspaces. In: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, pp 640–648
Biau G (2012) Analysis of a random forests model. J Mach Learn Res 13(Apr):1063–1095
Hapfelmeier A, Hothorn T, Ulm K, Strobl C (2014) A new variable importance measure for random forests with missing data. Stat Comput 24(1):21–34
Menze BH, Kelm BM, Masuch R, Himmelreich U, Bachert P, Petrich W, Hamprecht FA (2009) A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics 10(1):213
Maker AV, Hu V, Kadkol SS, Hong L, Brugge W, Winter J, Yeo CJ, Hackert T, Buchler M, Lawlor RT, Salvia R, Scarpa A, Bassi C, Green S (2019) Cyst fluid biosignature to predict Intraductal papillary mucinous neoplasms of the pancreas with high malignant potential. J Am Coll Surg 228:721. https://doi.org/10.1016/j.jamcollsurg.2019.02.040
Tkachev V, Sorokin M, Mescheryakov A, Simonov A, Garazha A, Buzdin A, Muchnik I, Borisov N (2018) FLOating-window projective separator (FloWPS): a data trimming tool for support vector machines (SVM) to improve robustness of the classifier. Front Genet 9:717. https://doi.org/10.3389/fgene.2018.00717
Yerukala Sathipati S, Ho SY (2018) Identifying a miRNA signature for predicting the stage of breast cancer. Sci Rep 8(1):16138. https://doi.org/10.1038/s41598-018-34604-3
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory. ACM, pp 144–152
Bishop CM (2006) Pattern recognition and machine learning. Springer, Berlin
Ripley BD (1994) Flexible non-linear approaches to classification. In: From statistics to neural networks. Springer, Berlin, pp 105–126
Contreras-Jodar A, Nayan NH, Hamzaoui S, Caja G, Salama AAK (2019) Heat stress modifies the lactational performances and the urinary metabolomic profile related to gastrointestinal microbiota of dairy goats. PLoS One 14(2):e0202457. https://doi.org/10.1371/journal.pone.0202457
Park HG, Jang KS, Park HM, Song WS, Jeong YY, Ahn DH, Kim SM, Yang YH, Kim YG (2019) MALDI-TOF MS-based total serum protein fingerprinting for liver cancer diagnosis. Analyst 144:2231. https://doi.org/10.1039/c8an02241k
Quiros-Guerrero L, Albertazzi F, Araya-Valverde E, Romero RM, Villalobos H, Poveda L, Chavarria M, Tamayo-Castillo G (2019) Phenolic variation among Chamaecrista nictitans subspecies and varieties revealed through UPLC-ESI(−)-MS/MS chemical fingerprinting. Metabolomics 15(2):14. https://doi.org/10.1007/s11306-019-1475-8
Wang J, Yan D, Zhao A, Hou X, Zheng X, Chen P, Bao Y, Jia W, Hu C, Zhang ZL, Jia W (2019) Discovery of potential biomarkers for osteoporosis using LC-MS/MS metabolomic methods. Osteoporos Int 30:1491. https://doi.org/10.1007/s00198-019-04892-0
Grissa D, Petera M, Brandolini M, Napoli A, Comte B, Pujos-Guillot E (2016) Feature selection methods for early predictive biomarker discovery using untargeted metabolomic data. Front Mol Biosci 3:30. https://doi.org/10.3389/fmolb.2016.00030
Bayci AWL, Baker DA, Somerset AE, Turkoglu O, Hothem Z, Callahan RE, Mandal R, Han B, Bjorndahl T, Wishart D, Bahado-Singh R, Graham SF, Keidan R (2018) Metabolomic identification of diagnostic serum-based biomarkers for advanced stage melanoma. Metabolomics 14(8):105. https://doi.org/10.1007/s11306-018-1398-9
Catav SS, Elgin ES, Dag C, Stark JL, Kucukakyuz K (2018) NMR-based metabolomics reveals that plant-derived smoke stimulates root growth via affecting carbohydrate and energy metabolism in maize. Metabolomics 14(11):143. https://doi.org/10.1007/s11306-018-1440-y
Guo JG, Guo XM, Wang XR, Tian JZ, Bi HS (2019) Metabolic profile analysis of free amino acids in experimental autoimmune uveoretinitis rat plasma. Int J Ophthalmol 12(1):16–24. https://doi.org/10.18240/ijo.2019.01.03
Rodrigues-Neto JC, Correia MV, Souto AL, Ribeiro JAA, Vieira LR, Souza MT Jr, Rodrigues CM, Abdelnur PV (2018) Metabolic fingerprinting analysis of oil palm reveals a set of differentially expressed metabolites in fatal yellowing symptomatic and non-symptomatic plants. Metabolomics 14(10):142. https://doi.org/10.1007/s11306-018-1436-7
Wong M, Lodge JK (2012) A metabolomic investigation of the effects of vitamin E supplementation in humans. Nutr Metab (Lond) 9(1):110. https://doi.org/10.1186/1743-7075-9-110
Li Y, Chen M, Liu C, Xia Y, Xu B, Hu Y, Chen T, Shen M, Tang W (2018) Metabolic changes associated with papillary thyroid carcinoma: a nuclear magnetic resonance-based metabolomics study. Int J Mol Med 41(5):3006–3014. https://doi.org/10.3892/ijmm.2018.3494
Rezig L, Servadio A, Torregrossa L, Miccoli P, Basolo F, Shintu L, Caldarelli S (2018) Diagnosis of post-surgical fine-needle aspiration biopsies of thyroid lesions with indeterminate cytology using HRMAS NMR-based metabolomics. Metabolomics 14(10):141. https://doi.org/10.1007/s11306-018-1437-6
Westerhuis JA, van Velzen EJ, Hoefsloot HC, Smilde AK (2010) Multivariate paired data analysis: multilevel PLSDA versus OPLSDA. Metabolomics 6(1):119–128
Liquet B, Le Cao KA, Hocini H, Thiebaut R (2012) A novel approach for biomarker selection and the integration of repeated measures experiments from two assays. BMC Bioinformatics 13:325. https://doi.org/10.1186/1471-2105-13-325
Liu H, Motoda H (1998) Feature extraction, construction and selection: a data mining perspective, vol 453. Springer Science & Business Media, Norwell
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Weston J, Elisseeff A, Schölkopf B, Tipping M (2003) Use of the zero-norm with linear models and kernel methods. J Mach Learn Res 3(Mar):1439–1461
Mladenic D, Grobelnik M (1999) Feature selection for unbalanced class distribution and naive bayes. In: ICML 1999, pp 258–267
Bozdogan H (1987) Model selection and Akaike’s information criterion (AIC): the general theory and its analytical extensions. Psychometrika 52(3):345–370
Guan W, Zhou M, Hampton CY, Benigno BB, Walker LD, Gray A, McDonald JF, Fernández FM (2009) Ovarian cancer detection from metabolomic liquid chromatography/mass spectrometry data by support vector machines. BMC Bioinformatics 10(1):259
Platt J (1998) Sequential minimal optimization: a fast algorithm for training support vector machines
Kuhn M, Johnson K (2013) Applied predictive modeling, vol 26. Springer, New York
Behnamian A, Millard K, Banks SN, White L, Richardson M, Pasher J (2017) A systematic approach for variable selection with random forests: achieving stable variable importance values. IEEE Geosci Remote Sens Lett 14(11):1988–1992
Van Calster B, Vickers AJ (2015) Calibration of risk prediction models: impact on decision-analytic performance. Med Decis Making 35(2):162–169
Agresti A (2002) Categorical data analysis. Wiley, New York
Huang Y, Sullivan Pepe M, Feng Z (2007) Evaluating the predictiveness of a continuous marker. Biometrics 63(4):1181–1188
Holder LB, Haque MM, Skinner MK (2017) Machine learning for epigenetics and future medical applications. Epigenetics 12(7):505–514. https://doi.org/10.1080/15592294.2017.1329068
Chen C, Liaw A, Breiman L (2004) Using random forest to learn imbalanced data, vol 110. University of California, Berkeley, pp 1–12
Breiman L, Friedman J, Olshen RA, Stone CJ (1984) Classification and regression trees. Chapman & Hall, New York
Japkowicz N (2000) Learning from imbalanced data sets: a comparison of various strategies. In: AAAI workshop on learning from imbalanced data sets. Menlo Park, CA, pp 10–15
Maloof MA (2003) Learning when data sets are imbalanced and when costs are unequal and unknown. In: ICML-2003 workshop on learning from imbalanced data sets II, pp 2–1
Ling CX, Li C (1998) Data mining for direct marketing: problems and solutions. In: KDD 1998, pp 73–79
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: ICML 1997. Citeseer, pp 179–186
Domingos P (1999) Metacost: a general method for making classifiers cost-sensitive. In: KDD 1999, pp 155–164
Cateni S, Colla V, Vannucci M (2014) A method for resampling imbalanced datasets in binary classification tasks for real-world problems. Neurocomputing 135:32–41
Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Workshop on learning from imbalanced datasets II. Citeseer, pp 1–8
Collins GS, Reitsma JB, Altman DG, Moons KG (2015) Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMC Med 13(1):1
Cruickshank-Quinn CI, Jacobson S, Hughes G, Powell RL, Petrache I, Kechris K, Bowler R, Reisdorph N (2018) Metabolomics and transcriptomics pathway approach reveals outcome-specific perturbations in COPD. Sci Rep 8(1):17132
Regan EA, Hokanson JE, Murphy JR, Make B, Lynch DA, Beaty TH, Curran-Everett D, Silverman EK, Crapo JD (2010) Genetic epidemiology of COPD (COPDGene) study design. COPD 7(1):32–43. https://doi.org/10.3109/15412550903499522
Andersen SL, Briggs FBS, Winnike JH, Natanzon Y, Maichle S, Knagge KJ, Newby LK, Gregory SG (2019) Metabolome-based signature of disease pathology in MS. Mult Scler Relat Disord 31:12–21. https://doi.org/10.1016/j.msard.2019.03.006
Lee HS, Seo C, Hwang YH, Shin TH, Park HJ, Kim Y, Ji M, Min J, Choi S, Kim H, Park AK, Yee ST, Lee G, Paik MJ (2019) Metabolomic approaches to polyamines including acetylated derivatives in lung tissue of mice with asthma. Metabolomics 15(1):8. https://doi.org/10.1007/s11306-018-1470-5
Long NP, Yoon SJ, Anh NH, Nghi TD, Lim DK, Hong YJ, Hong SS, Kwon SW (2018) A systematic review on metabolomics-based diagnostic biomarker discovery and validation in pancreatic cancer. Metabolomics 14(8):109. https://doi.org/10.1007/s11306-018-1404-2
Regan EA, Hersh CP, Castaldi PJ, DeMeo DL, Silverman EK, Crapo JD, Bowler RP (2019) Omics and the search for blood biomarkers in COPD: insights from COPDGene. Am J Respir Cell Mol Biol 61:143. https://doi.org/10.1165/rcmb.2018-0245PS
Thévenot EA (2016) ropls: PCA, PLS (-DA) and OPLS (-DA) for multivariate analysis and feature selection of omics data
Rinaudo P, Boudah S, Junot C, Thévenot EA (2016) Biosigner: a new method for the discovery of significant molecular signatures from omics data. Front Mol Biosci 3:26
Zararsiz G, Goksuluk D, Korkmaz S, Eldem V, Duru IP, Unver T, Ozturk A, Zararsiz MG (2014) Package ‘MLSeq’
Xia J, Psychogios N, Young N, Wishart DS (2009) MetaboAnalyst: a web server for metabolomic data analysis and interpretation. Nucleic Acids Res 37(suppl_2):W652–W660
Luan H, Ji F, Chen Y, Cai Z (2018) statTarget: a streamlined tool for signal drift correction and interpretations of quantitative mass spectrometry-based omics data. Anal Chim Acta 1036:66–72
Determan Jr CE, Determan Jr MCE (2015) Package ‘OmicsMarkeR’
Rohart F, Gautier B, Singh A, Le Cao K-A (2017) mixOmics: an R package for ‘omics feature selection and multiple data integration. PLoS Comput Biol 13(11):e1005752
Al-Akwaa FM, Yunits B, Huang S, Alhajaji H, Garmire LX (2018) Lilikoi: an R package for personalized pathway-based classification modeling using metabolomics data. GigaScience 7(12):giy136
Gift N, Gormley IC, Brennan L, Gormley MC (2010) Package ‘MetabolAnalyze’
Gaude E, Chignola F, Spiliotopoulos D, Spitaleri A, Ghitti M, García-Manteiga JM, Mari S, Musco G (2013) Muma, an R package for metabolomics univariate and multivariate statistical analysis. Curr Metabol 1(2):180–189
Palla P (2015) Information management and multivariate analysis techniques for metabolomics data. Università degli Studi di Cagliari
Kuhn M (2008) Building predictive models in R using the caret package. J Stat Softw 28(5):1–26

Author information

Authors and Affiliations

Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
Tusharkanti Ghosh, Weiming Zhang, Debashis Ghosh & Katerina Kechris

Corresponding author

Correspondence to Katerina Kechris.

Editor information

Editors and Affiliations

Department of Medicine, Emory University School of Medicine, Atlanta, GA, USA
Shuzhao Li

Appendices

TRIPOD Checklist for Predictive Modeling for Metabolomics Data

Section/topic	Item	Checklist item	Where addressed in this chapter
Title and abstract
Title	1	Identify the study as developing and/or validating a multivariable prediction model, the target population, and the outcome to be predicted	See title
Abstract	2	Provide a summary of objectives, study design, setting, participants, sample size, predictors, outcome, statistical analysis, results, and conclusions	See abstract
Introduction
Background and objectives	3a	Explain the medical context (including whether diagnostic or prognostic) and rationale for developing or validating the multivariable prediction model, including references to existing models	Subheading 4.1
Background and objectives	3b	Specify the objectives, including whether the study describes the development or validation of the model or both	Internal validation, Subheading 4.4
Methods
Source of data	4a	Describe the study design or source of data (e.g., randomized trial, cohort, or registry data), separately for the development and validation data sets, if applicable	Subheading 4.1
Source of data	4b	Specify the key study dates, including start of accrual; end of accrual; and, if applicable, end of follow-up	Subheading 4.1, see [87]
Participants	5a	Specify key elements of the study setting (e.g., primary care, secondary care, general population) including number and location of centers	N/A
	5b	Describe eligibility criteria for participants	Subheading 4.1, see [87]
	5c	Give details of treatments received, if relevant	N/A
Outcome	6a	Clearly define the outcome that is predicted by the prediction model, including how and when assessed	Subheading 4.1
Outcome	6b	Report any actions to blind assessment of the outcome to be predicted	N/A
Predictors	7a	Clearly define all predictors used in developing or validating the multivariable prediction model, including how and when they were measured	2999 predictors, for more details see [87]
Predictors	7b	Report any actions to blind assessment of predictors for the outcome and other predictors	N/A
Sample size	8	Explain how the study size was arrived at	Subheading 4.1, see [87]
Missing data	9	Describe how missing data were handled (e.g., complete-case analysis, single imputation, multiple imputation) with details of any imputation method	The data were already preprocessed and imputed; see Subheading 4.1
Statistical analysis methods	10c	For validation, describe how the predictions were calculated	Subheading 3.3
	10d	Specify all measures used to assess model performance and, if relevant, to compare multiple models	Subheading 3.3
	10e	Describe any model updating (e.g., recalibration) arising from the validation, if done	N/A
Risk groups	11	Provide details on how risk groups were created, if done	N/A
Development vs. validation	12	For validation, identify any differences from the development data in setting, eligibility criteria, outcome, and predictors	N/A
Results
Participants	13a	Describe the flow of participants through the study, including the number of participants with and without the outcome and, if applicable, a summary of the follow-up time. A diagram may be helpful	Subheading 4.1, see [87]
	13b	Describe the characteristics of the participants (basic demographics, clinical features, available predictors), including the number of participants with missing data for predictors and outcome	Subheading 4.1, see [87]
	13c	For validation, show a comparison with the development data of the distribution of important variables (demographics, predictors, and outcome)	Subheadings 4.3 and 4.4
Model performance	16	Report performance measures (with CIs) for the prediction model	N/A
Model-updating	17	If done, report the results from any model updating (i.e., model specification, model performance)	Subheading 4.4
Discussion
Limitations	18	Discuss any limitations of the study (such as nonrepresentative sample, few events per predictor, missing data)	Subheading 4.1, see [87]
Interpretation	19a	For validation, discuss the results with reference to performance in the development data, and any other validation data	N/A
Interpretation	19b	Give an overall interpretation of the results, considering objectives, limitations, results from similar studies, and other relevant evidence	Subheadings 4.4 and 5
Implications	20	Discuss the potential clinical use of the model and implications for future research	Subheadings 4.4 and 5. However, performance of the model is data-driven
Other information
Supplementary information	21	Provide information about the availability of supplementary resources, such as study protocol, web calculator, and data sets	Subheading 4.1, see [87]
Funding	22	Give the source of funding and the role of the funders for the present study	NIH

Selected Open Source (R/Bioconductor/Web-Based) Tools for Supervised Learning Algorithms

Method	Source	Reference
PLS-DA	Bioconductor (ropls)	[93]
PLS-DA, RF, and SVM	Bioconductor (biosigner)	[94]
SVM, RF	Bioconductor (MLSeq)	[95]
RF, SVM, PLS-DA	MetaboAnalyst web server (http://www.metaboanalyst.ca/)	[96]
PCA, PLS-DA, RF	Bioconductor (statTarget)	[97]
Feature selection, metric evaluation	Bioconductor (OmicsMarkeR)	[98]
Sparse PLS-DA	Bioconductor (mixOmics)	[99]
Feature selection, metric evaluation	CRAN (lilikoi)	[100]
Probabilistic principal component analysis	CRAN (MetabolAnalyze)	[101]
Kernel-based metabolite differential analysis	CRAN (KMDA)	[21]
PLS-DA, OPLS-DA	CRAN (muma)	[102]
RF	CRAN (RFmarkerDetector)	[103]
RF, SVM, PLS-DA	CRAN (caret)	[104]

About this protocol

Cite this protocol

Ghosh, T., Zhang, W., Ghosh, D., Kechris, K. (2020). Predictive Modeling for Metabolomics Data. In: Li, S. (eds) Computational Methods and Data Analysis for Metabolomics. Methods in Molecular Biology, vol 2104. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-0239-3_16

DOI: https://doi.org/10.1007/978-1-0716-0239-3_16
Published: 18 January 2020
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-0238-6
Online ISBN: 978-1-0716-0239-3
eBook Packages: Springer Protocols