Preprocessing, classification modeling and feature selection using flow injection electrospray mass spectrometry metabolite fingerprint data

Enot, David P; Lin, Wanchang; Beckmann, Manfred; Parker, David; Overy, David P; Draper, John

doi:10.1038/nprot.2007.511

Protocol
Published: 28 February 2008

Nature Protocols volume 3, pages 446–470 (2008)

Abstract

Metabolome analysis by flow injection electrospray mass spectrometry (FIE-MS) fingerprinting generates measurements relating to large numbers of m/z signals. Such data sets often exhibit high variance with a paucity of replicates, thus providing a challenge for data mining. We describe data preprocessing and modeling methods that have proved reliable in projects involving samples from a range of organisms. The protocols interact with software resources specifically for metabolomics provided in a Web-accessible data analysis package FIEmspro (http://users.aber.ac.uk/jhd) written in the R environment and requiring a moderate knowledge of R command-line usage. Specific emphasis is placed on describing the outcome of modeling experiments using FIE-MS data that require further preprocessing to improve quality. The salient features of both poor and robust (i.e., highly generalizable) multivariate models are outlined together with advice on validating classifiers and avoiding false discovery when seeking explanatory variables.

**Figure 1: Workflow in FIE-MS data analysis.**

**Figure 2: TIC checking for potential outlying samples and batch effects in FIE-MS data using FIEmspro function *ticstats*.**

**Figure 3: Use of baseline correction to improve FIE-MS fingerprint data representing pathogen-challenged *B. distachyon* plants.**

**Figure 4: Evaluation of an optimum number of k with the FIEmspro function *koptimp*.**

**Figure 5: Effect of log transformation on the signal variance to signal intensity dependency within FIE fingerprints representing the metabolome of *B. distachyon* challenged with the rice blast fungus.**

**Figure 6: Effect of TIC normalization on fingerprint data representing *B. distachyon* leaves after either 3 or 4 d infection with rice blast fungus.**

**Figure 7: Outlier detection in FIE-MS fingerprint data representing *B. distachyon* leaves after 3 d infection with rice blast fungus using PCA (*pccomp*) and FIEmspro function *outl.det*.**

**Figure 8: PCA of FIE-MS fingerprints representing a time course of *B. distachyon* infected with the rice blast fungus.**

**Figure 9: LDA and HCA of FIE-MS representing disease progression in *B. distachyon* plants infected with the rice blast fungus.**

**Figure 10: Examples of RF statistics derived from the classification of FIE-MS fingerprint data representing *B. distachyon* plants during a time course of infection with rice blast fungus.**

**Figure 11: Validation of discrimination models using FIE-MS fingerprint data that describe disease progression in *B. distachyon* infected with the rice blast fungus.**

**Figure 12: Relationship between RF importance score and variable ranking for explanatory value in binary comparisons of *B. distachyon* leaves taken at different times after infection with the rice blast fungus (ST, a significance threshold for explanatory variables).**
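
Two of the preprocessing steps named in the captions for Figures 5 and 6, log transformation and TIC (total ion count) normalization, can be illustrated with a minimal sketch. This is not FIEmspro code, which is written in R and whose functions are not reproduced on this page. The function names `tic_normalize` and `log_transform`, the base-10 logarithm and the offset of 1 are all assumptions made for illustration, and the protocol may apply these steps in a different form or order.

```python
import math


def tic_normalize(fingerprint):
    """Scale one fingerprint so its intensities sum to 1 (hypothetical helper).

    `fingerprint` is a list of non-negative intensities, one per m/z bin.
    """
    total = sum(fingerprint)
    if total <= 0:
        raise ValueError("total ion count must be positive")
    return [x / total for x in fingerprint]


def log_transform(fingerprint, offset=1.0):
    """Apply log10(x + offset) to each intensity (hypothetical helper).

    The offset (assumed to be 1 here) keeps zero intensities finite.
    """
    return [math.log10(x + offset) for x in fingerprint]


# Toy fingerprint with three m/z bins (made-up values, not real FIE-MS data).
raw = [9.0, 99.0, 999.0]
logged = log_transform(raw)          # compresses the dynamic range
normalized = tic_normalize(raw)      # removes differences in total signal
```

Log transformation is commonly used to reduce the dependence of signal variance on signal intensity, and TIC normalization to make fingerprints with different overall signal levels comparable; whether either is appropriate for a given data set has to be judged from diagnostics such as those shown in the figures.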

Acknowledgements

Financial support was provided for W.L., M.B. and D.P.O. by the UK Food Standards Agency G03012 programme. D.P.E. and D.P. were funded by grants MET20483 and BBD0069531, respectively, from the UK Biotechnology and Biological Sciences Research Council.

Author information

David P Overy
Present address: CIBE, Merck Sharp & Dohme de España, Madrid 28027, Spain.

Authors and Affiliations

Institute of Biological Sciences, Aberystwyth University, Aberystwyth, SY23 3DA, UK
David P Enot, Wanchang Lin, Manfred Beckmann, David Parker, David P Overy & John Draper

Corresponding author

Correspondence to John Draper.

About this article

Cite this article

Enot, D., Lin, W., Beckmann, M. et al. Preprocessing, classification modeling and feature selection using flow injection electrospray mass spectrometry metabolite fingerprint data. Nat Protoc 3, 446–470 (2008). https://doi.org/10.1038/nprot.2007.511

Issue Date: March 2008
