A comprehensive automatic data analysis strategy for gas chromatography-mass spectrometry based untargeted metabolomics

https://doi.org/10.1016/j.chroma.2019.460787

Highlights

  • We propose a comprehensive data analysis workflow for GC-MS-based metabolomics.

  • An automatic TIC peak detection and resolution methodology is proposed.

  • A new time-shift correction and component registration algorithm is developed.

  • A MATLAB GUI based on the proposed strategy is developed for users.

Abstract

Automatic data analysis for gas chromatography-mass spectrometry (GC-MS) is a challenging task in untargeted metabolomics. In this work, we provide a novel comprehensive data analysis strategy for GC-MS-based untargeted metabolomics (autoGCMSDataAnal) by developing a new automatic strategy for performing TIC peak detection and resolution and proposing a novel time-shift correction and component registration algorithm. autoGCMSDataAnal uses original acquired GC-MS datafiles as input to automatically perform TIC peak detection, component resolution, time-shift correction and component registration, statistical analysis, and compound identification. We utilize standards and complex plant samples to comprehensively investigate the performance of autoGCMSDataAnal. The results suggest that the developed strategy is comparable with several state-of-the-art methods that are widely used in GC-MS-based untargeted metabolomics. Based on the proposed strategy, we develop a user-friendly MATLAB GUI for users who are unfamiliar with programming languages to facilitate their routine analysis, which can be freely downloaded at: http://software.tobaccodb.org/software/autogcmsdataanal.

Introduction

GC-MS-based untargeted metabolomics has been widely used in many laboratories for thoroughly characterizing large numbers of semi-volatile and volatile compounds in complex samples [1]. A remarkable advantage of this technique is that compounds can be accurately identified by matching their acquired mass spectra against those in a library, such as that of the National Institute of Standards and Technology (NIST). However, automatic data mining for GC-MS, involving compound feature extraction, coeluted component resolution, and peak alignment, is still a bottleneck for untargeted metabolomics [2].
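The library matching mentioned above can be illustrated with a minimal sketch. It scores a query spectrum against each library entry with a plain cosine similarity scaled to 0-1000. This is a simplification for illustration only: the NIST search software uses its own weighted scoring, and the function names, the dictionary layout, and the example intensities below are all assumptions, not taken from the paper.

```python
import math


def match_factor(query, reference):
    """Cosine similarity between two spectra, scaled to the range 0-1000.

    Each spectrum is a dict mapping an integer m/z value to an intensity.
    Simplified stand-in for a library match score (not the NIST algorithm).
    """
    mz_values = set(query) | set(reference)
    dot = sum(query.get(mz, 0.0) * reference.get(mz, 0.0) for mz in mz_values)
    norm_q = math.sqrt(sum(v * v for v in query.values()))
    norm_r = math.sqrt(sum(v * v for v in reference.values()))
    if norm_q == 0.0 or norm_r == 0.0:
        return 0.0  # an empty spectrum cannot match anything
    return 1000.0 * dot / (norm_q * norm_r)


def best_library_hit(query, library):
    """Return (name, score) for the library entry most similar to the query."""
    scored = ((name, match_factor(query, spec)) for name, spec in library.items())
    return max(scored, key=lambda hit: hit[1])


# Hypothetical spectra with made-up intensities, for illustration only.
library = {
    "compound A": {74: 100.0, 87: 60.0, 143: 10.0},
    "compound B": {55: 100.0, 69: 80.0, 83: 40.0},
}
query = {74: 98.0, 87: 63.0, 143: 9.0}
name, score = best_library_hit(query, library)
print(name, round(score))
```

A score close to 1000 means the two spectra are nearly proportional across all m/z channels. Components with similar structures can score high against each other, which is why spectral matching alone can be ambiguous.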

One of the most challenging tasks is accurately retrieving components under TIC peaks. Compared with the well-developed TIC peak detection algorithms [3], [4], [5], [6], automatic component resolution for each detected TIC peak remains an unresolved task in GC-MS-based untargeted metabolomics. Thus far, the Automated Mass Spectral Deconvolution and Identification System (AMDIS) is one of the most widely utilized tools for the deconvolution of coeluted components. A remarkable advantage of AMDIS is that hundreds of TIC peaks can be analyzed simultaneously, and the retrieved mass spectral profiles can be directly imported into the NIST library for compound identification. Broeckling et al. [7] took advantage of AMDIS and developed MET-IDEA, which provides a component list table to benefit subsequent data analysis, such as screening metabolites. Du et al. and Domingo-Almenara et al. developed a number of high-performance algorithms [8], [9], [10], [11], [12], [13] to automatically resolve coeluted components. All these methods can automatically perform peak detection and component resolution, which is very helpful for practical applications. However, the performance of the above-mentioned methods for coeluted component resolution depends on the selection of model ions, which should be the selective ions of the underlying components.
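For orientation, the TIC is the sum of intensities over all m/z channels in each scan, and a TIC peak is a local maximum of that trace. The sketch below shows this with a deliberately naive local-maximum detector. It is not the detection algorithm of refs. [3]-[6] or of autoGCMSDataAnal; the function names, the `min_height` threshold, and the toy data are assumptions made for illustration.

```python
def total_ion_chromatogram(scans):
    """Sum the intensities over all m/z channels for each scan.

    `scans` is a list of scans; each scan is a list of intensities,
    one per m/z channel.
    """
    return [sum(scan) for scan in scans]


def detect_tic_peaks(tic, min_height):
    """Return the scan indices of local maxima that reach `min_height`.

    A plateau is reported once, at its first point.
    """
    peaks = []
    for i in range(1, len(tic) - 1):
        rises = tic[i - 1] < tic[i]
        does_not_rise_after = tic[i] >= tic[i + 1]
        if tic[i] >= min_height and rises and does_not_rise_after:
            peaks.append(i)
    return peaks


# Toy data: 7 scans with 2 m/z channels each (made-up intensities).
scans = [[1, 0], [2, 3], [10, 5], [3, 1], [0, 1], [4, 4], [1, 0]]
tic = total_ion_chromatogram(scans)
print(detect_tic_peaks(tic, min_height=5))
```

A detector like this only locates peaks in the summed trace. It says nothing about how many components coelute under each peak, which is the resolution problem discussed above.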

Chemometrics methods provide another choice. Algorithms such as trilinear decomposition [14,15] and multivariate curve resolution [16,17] have been widely used to perform component resolution in complex samples. These methods exploit the bilinear structure of the instrumental response of each sample to retrieve the underlying chromatographic and mass spectral profiles of components by iteratively optimizing randomly initialized values. A prerequisite of these chemometrics methods is that one has to manually set a number of initial parameters for each TIC peak, such as the elution range and the number of components, which greatly limits their application in GC-MS-based untargeted metabolomics.
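The bilinear structure referred to above is conventionally written as follows. The notation is an assumption introduced here for illustration and does not come from the paper: X is the data matrix of one TIC peak region (I scans by J m/z channels), C holds the chromatographic profiles of N components, S holds their mass spectra, and E is the residual.

```latex
\mathbf{X} = \mathbf{C}\,\mathbf{S}^{\mathsf{T}} + \mathbf{E},
\qquad
\mathbf{X} \in \mathbb{R}^{I \times J},\quad
\mathbf{C} \in \mathbb{R}^{I \times N},\quad
\mathbf{S} \in \mathbb{R}^{J \times N}
```

In the usual alternating least squares scheme for multivariate curve resolution, the two factors are updated in turn from an initial estimate, typically with constraints such as non-negativity applied after each step (the superscript + denotes the pseudoinverse):

```latex
\mathbf{S}^{\mathsf{T}} \leftarrow \mathbf{C}^{+}\,\mathbf{X},
\qquad
\mathbf{C} \leftarrow \mathbf{X}\,\bigl(\mathbf{S}^{\mathsf{T}}\bigr)^{+}
```

Both the row range of X (the elution range) and N (the number of components) must be fixed before the iteration starts, which is the manual step that limits automation.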

Time shift across samples is another problem that obstructs the direct comparison of multiple samples to find biomarkers. Mass spectra can be used as a valid tool for aligning components. In GC-MS-based untargeted metabolomics, however, components with similar chemical structures are frequently encountered; these can generate similar mass spectra and thus may lead to inaccurate peak alignment results if they elute closely. In conclusion, new comprehensive methods for automatically implementing data mining in GC-MS-based untargeted metabolomics are urgently required [18,19].
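A generic way to align components across samples combines a retention-time window with a spectral similarity threshold. The sketch below is a hedged illustration of that idea and not the time-shift correction and registration algorithm proposed in this work. The function names, the greedy assignment, the tolerance values, and the example components are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two spectra given as {m/z: intensity} dicts."""
    dot = sum(v * b.get(mz, 0.0) for mz, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def register_components(reference, sample, rt_tolerance, min_similarity):
    """Pair reference components with sample components.

    Each component is a dict with 'rt' (minutes) and 'spectrum'.
    A pair is a candidate only if the retention times differ by at most
    `rt_tolerance` and the spectra reach `min_similarity`. Candidates are
    accepted greedily, most similar first, and each component is used once.
    Returns a sorted list of (reference_index, sample_index) pairs.
    """
    candidates = []
    for i, ref in enumerate(reference):
        for j, comp in enumerate(sample):
            if abs(ref["rt"] - comp["rt"]) <= rt_tolerance:
                similarity = cosine(ref["spectrum"], comp["spectrum"])
                if similarity >= min_similarity:
                    candidates.append((similarity, i, j))
    candidates.sort(reverse=True)
    used_ref, used_sample, matches = set(), set(), []
    for _, i, j in candidates:
        if i not in used_ref and j not in used_sample:
            matches.append((i, j))
            used_ref.add(i)
            used_sample.add(j)
    return sorted(matches)


# Made-up example: two closely eluting components with similar spectra.
reference = [
    {"rt": 10.00, "spectrum": {74: 100.0, 87: 60.0}},
    {"rt": 10.20, "spectrum": {74: 100.0, 87: 55.0, 55: 20.0}},
]
sample = [
    {"rt": 10.05, "spectrum": {74: 100.0, 87: 61.0}},
    {"rt": 10.26, "spectrum": {74: 100.0, 87: 54.0, 55: 21.0}},
]
print(register_components(reference, sample, rt_tolerance=0.3, min_similarity=0.9))
```

In this example all four cross-pairings pass both the retention-time window and the similarity threshold, so the result depends entirely on small differences in similarity. That sensitivity is the weakness of spectrum-based alignment for closely eluting, structurally similar compounds.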

Our research group has developed a number of methods [18], [20], [21], [22] for GC-MS data analysis, such as employing chemometrics methods like HELP and MCR-ALS for the resolution of screened TIC peaks. However, these methods cannot automatically perform TIC peak resolution, which means one has to manually set calculation parameters, such as the number of components under a TIC peak and the corresponding selective zones, for either HELP or MCR-ALS. Aiming to develop a comprehensive automatic GC-MS data analysis workflow, we propose a novel TIC peak detection and peak resolution strategy in this work. Additionally, we provide a new time-shift correction and component registration algorithm. Based on these novel algorithms, we develop a new comprehensive automatic GC-MS data analysis method, called autoGCMSDataAnal, for automatically performing GC-MS-based untargeted metabolomics. The performance of the developed autoGCMSDataAnal is demonstrated by both standards and complex plant samples.

Section snippets

Standards

A mixture of 11 organic acid compounds (see Table 1) is prepared and then diluted to obtain a series of calibration samples with different concentration levels. Methyl esterification is performed for these calibration samples. Finally, 1 μL of solution is injected into an Agilent 7890-5977 GC-MS. A DB-5MS column (50 m × 0.25 mm, 0.25 μm) is used. The column temperature is linearly increased from 50 °C to 280 °C at a rate of 3 °C min−1. Only the full scan mode is set for the mass

Theory

Fig. 1 provides a brief workflow of autoGCMSDataAnal, which consists of single sample analysis and batch analysis steps. The single sample analysis is further divided into component resolution and time-shift correction, whereas the batch analysis is divided into component registration and peak filling for statistical analysis to screen metabolites. The retrieved components can be imported to NIST for identification.

Standard sample analysis

The performance of the developed strategy in compound identification and quantification is investigated using standards. Resolution results for the 11 compounds are provided in Table 1. These compounds can be automatically and precisely identified by autoGCMSDataAnal. The match factors (MF) provided by NIST are larger than 900 for all compounds. Moreover, the coefficients of determination of the regression lines are larger than 0.9900.
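The coefficient of determination quoted above measures how well a straight calibration line explains the response. A minimal sketch of an ordinary least-squares fit and its R² follows. The concentrations and peak areas are made-up numbers for illustration and are not the values in Table 1; the function name is likewise an assumption.

```python
def linear_fit_r2(x, y):
    """Fit y = slope * x + intercept by least squares; return (slope, intercept, r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical calibration series: concentration level vs. resolved peak area.
concentration = [1.0, 2.0, 3.0, 4.0, 5.0]
peak_area = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r2 = linear_fit_r2(concentration, peak_area)
print(f"slope={slope:.3f} intercept={intercept:.3f} R2={r2:.4f}")
```

An R² above 0.99 for every compound indicates that the resolved peak areas scale almost linearly with concentration over the calibrated range.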

An example for automatic TIC detection and component resolution

Conclusion

Automatic data analysis of complex samples remains one of the most challenging tasks in GC-MS-related untargeted metabolomics. This work provides a novel comprehensive data analysis strategy, autoGCMSDataAnal, for users, which can automatically perform TIC peak detection, TIC peak resolution, time-shift correction and component registration, statistical analysis, and compound identification. Performance of autoGCMSDataAnal has been demonstrated by standards and complex plant samples. Results

CRediT authorship contribution statement

Yu-Ying Zhang: Methodology, Writing - original draft. Qian Zhang: Methodology, Writing - original draft. Yue-Ming Zhang: Methodology, Writing - original draft. Wei-Wei Wang: Data curation. Li Zhang: Data curation. Yong-Jie Yu: Conceptualization, Software. Chang-Cai Bai: Formal analysis. Ji-Zhao Guo: Formal analysis. Hai-Yan Fu: Writing - review & editing. Yuanbin She: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no competing financial interests.

Acknowledgments

The authors gratefully acknowledge the financial support of the Foundation of the National Natural Science Foundation of China (Grant nos. 21868028, 21606137, 21776259, 21776259, and 21305160), NGY2016124, and Ningxia Medical University (XT2016003). The author Hai-Yan Fu wants to acknowledge the financial support of the Talented Youth Cultivation Program of South-Central University for Nationalities (No. CRZ18002).

1 These authors contributed equally to this work.
