A comprehensive automatic data analysis strategy for gas chromatography-mass spectrometry based untargeted metabolomics
Introduction
GC-MS-based untargeted metabolomics has been widely used in many laboratories to thoroughly characterize large numbers of semi-volatile and volatile compounds in complex samples [1]. A notable advantage of this technique is that compounds can be accurately identified by matching their acquired mass spectra against those in a library, such as that of the National Institute of Standards and Technology (NIST). However, automatic data mining for GC-MS, which involves compound feature extraction, coeluted component resolution, and peak alignment, is still a bottleneck for untargeted metabolomics [2].
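To illustrate the idea of library matching, the following is a minimal sketch that scores a query spectrum against reference spectra with a plain cosine similarity on a unit-mass grid. It is not the NIST search algorithm, whose match factor uses a more elaborate weighted scoring, so the numbers are not interchangeable with NIST match factors. The function names, the 0-1000 scaling, and the toy spectra are assumptions made for illustration only.

```python
import math

def to_vector(peaks, max_mz=500):
    """Place (m/z, intensity) pairs on a unit-mass grid from 0 to max_mz."""
    vec = [0.0] * (max_mz + 1)
    for mz, intensity in peaks:
        idx = int(round(mz))
        if 0 <= idx <= max_mz:
            vec[idx] += intensity
    return vec

def cosine_match(query, reference, max_mz=500):
    """Cosine similarity between two spectra, scaled to the range 0-1000."""
    q = to_vector(query, max_mz)
    r = to_vector(reference, max_mz)
    dot = sum(a * b for a, b in zip(q, r))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in r))
    return 1000.0 * dot / norm if norm > 0 else 0.0

def best_hit(query, library):
    """Return the library entry with the highest score and that score."""
    scores = {name: cosine_match(query, spec) for name, spec in library.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical toy spectra with made-up intensities (not NIST entries).
library = {
    "compound_A": [(43, 100), (74, 80), (87, 60)],
    "compound_B": [(41, 100), (55, 90), (69, 70)],
}
query = [(43, 95), (74, 85), (87, 55), (55, 5)]

name, score = best_hit(query, library)
print(name, round(score))
```

In this toy case the query shares its three major ions with compound_A and only one minor ion with compound_B, so compound_A is returned with a high score.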
One of the most challenging tasks is accurately retrieving the components under total ion chromatogram (TIC) peaks. Compared with the well-developed TIC peak detection algorithms [3], [4], [5], [6], automatic component resolution for each detected TIC peak remains an unresolved task in GC-MS-based untargeted metabolomics. Thus far, the Automated Mass Spectral Deconvolution and Identification System (AMDIS) is one of the most widely used tools for the deconvolution of coeluted components. Its notable advantage is that hundreds of TIC peaks can be analyzed simultaneously, and the retrieved mass spectral profiles can be directly imported into the NIST library for compound identification. Broeckling et al. [7] took advantage of AMDIS and developed MET-IDEA, which provides a component list table to benefit subsequent data analysis, such as screening metabolites. Du et al. and Domingo-Almenara et al. developed a number of high-performance algorithms [8], [9], [10], [11], [12], [13] to automatically resolve coeluted components. All these methods can automatically perform peak detection and component resolution, which is very helpful for practical applications. The performance of the above-mentioned methods in resolving coeluted components depends on the selection of model ions, which should be selective ions of the underlying components.
Chemometric methods provide another choice. Algorithms such as trilinear decomposition [14,15] and multivariate curve resolution [16,17] have been widely used to perform component resolution in complex samples. These methods exploit the bilinear structure of the instrumental response of each sample to retrieve the underlying chromatographic and mass spectral profiles of components by iteratively optimizing randomly initialized values. A prerequisite of these chemometric methods is that one has to manually set a number of initial parameters for each TIC peak, such as the elution range and the number of components, which greatly limits their application in GC-MS-based untargeted metabolomics.
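The bilinear structure mentioned above can be written compactly. The notation below is an assumption introduced for illustration (it is not taken from the original text): for one TIC peak region, X is the data matrix of I scans by J m/z channels, C holds the elution profiles of N components, S holds their mass spectra, and E is the residual. An alternating least squares scheme of the multivariate curve resolution type starts from an initial estimate and updates the two factors in turn, typically enforcing constraints such as non-negativity after each update.

```latex
% Bilinear model for one TIC peak region (notation assumed for illustration)
\mathbf{X} = \mathbf{C}\,\mathbf{S}^{\mathsf{T}} + \mathbf{E},
\qquad
\mathbf{X} \in \mathbb{R}^{I \times J},\;
\mathbf{C} \in \mathbb{R}^{I \times N},\;
\mathbf{S} \in \mathbb{R}^{J \times N}

% Unconstrained least-squares updates, alternated until convergence
\hat{\mathbf{S}}^{\mathsf{T}} = \left(\mathbf{C}^{\mathsf{T}}\mathbf{C}\right)^{-1}\mathbf{C}^{\mathsf{T}}\mathbf{X},
\qquad
\hat{\mathbf{C}} = \mathbf{X}\,\mathbf{S}\left(\mathbf{S}^{\mathsf{T}}\mathbf{S}\right)^{-1}
```

Both the elution range (which rows of X are included) and the number of components N must be fixed before the iteration starts, which is why these parameters have to be supplied for every TIC peak.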
Time shift across samples is another problem that obstructs the direct comparison of multiple samples to find biomarkers. Mass spectra can be used as a valid tool for aligning components. In GC-MS-based untargeted metabolomics, however, components with similar chemical structures are frequently encountered; these generate similar mass spectra and may thus lead to inaccurate peak alignment if they elute closely. In conclusion, new comprehensive methods for automatically implementing data mining in GC-MS-based untargeted metabolomics are urgently required [18,19].
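As a rough illustration of what a time shift between samples means in practice, the sketch below estimates a single global shift between two chromatographic traces by maximizing their cross-correlation. This is a generic textbook approach, not the time-shift correction algorithm proposed in this work, and it does not address the problem of closely eluting components with similar mass spectra. All function names and the simulated traces are assumptions made for illustration.

```python
import math

def gaussian_trace(n_points, center, width=3.0, height=1.0):
    """Simulate one Gaussian chromatographic peak sampled at integer scans."""
    return [height * math.exp(-0.5 * ((i - center) / width) ** 2)
            for i in range(n_points)]

def estimate_shift(reference, sample, max_shift=20):
    """Return the lag (in scans) that maximizes the cross-correlation.

    A positive lag means the sample elutes later than the reference.
    """
    n = len(reference)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_shift, max_shift + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += reference[i] * sample[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def apply_shift(sample, lag):
    """Move the sample back by `lag` scans, padding the ends with zeros."""
    n = len(sample)
    return [sample[i + lag] if 0 <= i + lag < n else 0.0 for i in range(n)]

# Simulated traces: the sample is the reference delayed by 5 scans.
reference = [a + b for a, b in zip(gaussian_trace(120, 40),
                                   gaussian_trace(120, 70, height=0.6))]
sample = [a + b for a, b in zip(gaussian_trace(120, 45),
                                gaussian_trace(120, 75, height=0.6))]

lag = estimate_shift(reference, sample)
aligned = apply_shift(sample, lag)
print("estimated shift (scans):", lag)
```

A single global lag is the simplest possible model; real retention time drift usually varies along the chromatogram, which is one reason more elaborate alignment strategies are needed.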
Our research group has developed a number of methods [18], [20], [21], [22] for GC-MS data analysis, such as employing chemometric methods like HELP and MCR-ALS to resolve screened TIC peaks. However, these methods cannot automatically perform TIC peak resolution, which means one has to manually set calculation parameters, such as the number of components under a TIC peak and the corresponding selective zones, for either HELP or MCR-ALS. Aiming to develop a comprehensive automatic GC-MS data analysis workflow, we propose a novel TIC peak detection and peak resolution strategy in this work. Additionally, we provide a new time-shift correction and component registration algorithm. Based on these novel algorithms, we develop a new comprehensive automatic GC-MS data analysis method, called autoGCMSDataAnal, for automatically performing GC-MS-based untargeted metabolomics. The performance of autoGCMSDataAnal is demonstrated with both standards and complex plant samples.
Standards
A mixture of 11 organic acid compounds (see Table 1) is prepared and then diluted to obtain a series of calibration samples with different concentration levels. Methyl esterification is performed on these calibration samples. Finally, 1 μL of solution is injected into an Agilent 7890-5977 GC-MS. A DB-5MS column (50 m × 0.25 mm, 0.25 μm) is used. The column temperature is linearly increased from 50 °C to 280 °C at a rate of 3 °C min−1. Only the full scan mode is set for the mass
Theory
Fig. 1 outlines the workflow of autoGCMSDataAnal, which consists of a single sample analysis step and a batch analysis step. The single sample analysis is further divided into component resolution and time-shift correction, whereas the batch analysis is divided into component registration and peak filling, which prepare the data for the statistical analysis used to screen metabolites. The retrieved components can be imported into the NIST library for identification.
Standard sample analysis
The performance of the developed strategy in compound identification and quantification is investigated with standards. Resolution results for the 11 compounds are provided in Table 1. These compounds can be automatically and precisely identified by autoGCMSDataAnal. The match factors (MF) provided by NIST are larger than 900 for all compounds. Moreover, the coefficients of determination of the regression lines are larger than 0.9900.
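For readers unfamiliar with the quantification criterion, the following minimal sketch shows how a coefficient of determination can be computed for a linear calibration curve fitted by ordinary least squares. The concentration levels and peak areas are made-up numbers for illustration only; they are not the values behind Table 1, and the function names are assumptions.

```python
def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(x, y):
    """Coefficient of determination of the least-squares regression line."""
    slope, intercept = linear_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical calibration series (arbitrary units), not data from this work.
concentration = [1, 2, 5, 10, 20, 50]
peak_area = [1020, 1980, 5100, 9900, 20300, 49800]

r2 = r_squared(concentration, peak_area)
print("R^2 above 0.99:", r2 > 0.99)
```

Here the peak area would be the integrated area of the resolved chromatographic profile of each component, so the quality of the regression also reflects how well coeluted components were resolved.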
An example for automatic TIC detection and component resolution
Conclusion
Automatic data analysis of complex samples remains one of the most challenging tasks in GC-MS-based untargeted metabolomics. This work provides a novel comprehensive data analysis strategy, autoGCMSDataAnal, which can automatically perform TIC peak detection, TIC peak resolution, time-shift correction and component registration, statistical analysis, and compound identification. The performance of autoGCMSDataAnal has been demonstrated with standards and complex plant samples. Results
CRediT authorship contribution statement
Yu-Ying Zhang: Methodology, Writing - original draft. Qian Zhang: Methodology, Writing - original draft. Yue-Ming Zhang: Methodology, Writing - original draft. Wei-Wei Wang: Data curation. Li Zhang: Data curation. Yong-Jie Yu: Conceptualization, Software. Chang-Cai Bai: Formal analysis. Ji-Zhao Guo: Formal analysis. Hai-Yan Fu: Writing - review & editing. Yuanbin She: Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no competing financial interests.
Acknowledgments
The authors gratefully acknowledge the financial support of the National Natural Science Foundation of China (Grant nos. 21868028, 21606137, 21776259, and 21305160), NGY2016124, and Ningxia Medical University (XT2016003). Hai-Yan Fu acknowledges the financial support of the Talented Youth Cultivation Program of South-Central University for Nationalities (No. CRZ18002).
References (27)
- et al., Marker discovery in volatolomics based on systematic alignment of GC-MS signals: application to food authentication, Anal. Chim. Acta (2017)
- et al., Chemometric methods in data processing of mass spectrometry-based metabolomics: a review, Anal. Chim. Acta (2016)
- et al., Compound identification in gas chromatography/mass spectrometry-based metabolomics by blind source separation, J. Chromatogr. A (2015)
- et al., Automated resolution of chromatographic signals by independent component analysis–orthogonal signal deconvolution in comprehensive gas chromatography/mass spectrometry-based metabolomics, Comput. Meth. Prog. Bio. (2016)
- et al., Avoiding hard chromatographic segmentation: a moving window approach for the automated resolution of gas chromatography–mass spectrometry-based metabolomics signals by multivariate methods, J. Chromatogr. A (2016)
- et al., MVC2: a MATLAB graphical interface toolbox for second-order multivariate calibration, Chemometr. Intell. Lab. (2009)
- et al., A flexible and novel strategy of alternating trilinear decomposition method coupled with two-dimensional linear discriminant analysis for three-way chemical data analysis: characterization and classification, Anal. Chim. Acta (2018)
- et al., Knowledge integration strategies for untargeted metabolomics based on MCR-ALS analysis of CE-MS and LC-MS data, Anal. Chim. Acta (2017)
- et al., MCR-ALS GUI 2.0: new features and applications, Chemometr. Intell. Lab. (2015)
- et al., Mass-spectra-based peak alignment for automatic nontargeted metabolic profiling analysis for biomarker screening in plant samples, J. Chromatogr. A (2017)
- Multiscale peak alignment for chromatographic datasets, J. Chromatogr. A
- A chemometric-assisted method based on gas chromatography–mass spectrometry for metabolic profiling analysis, J. Chromatogr. A
- Automatic untargeted metabolic profiling analysis coupled with chemometrics for improving metabolite identification quality to enhance geographical origin discrimination capability, J. Chromatogr. A
1 These authors contributed equally to this work.