Investigation of different linear and nonlinear chemometric methods for modeling of retention index of essential oil components: Concerns to support vector machine
Introduction
Essential oils are sometimes used as flavoring compounds in food and have antimicrobial activity. They can also be toxic to humans, with reported effects including carcinogenicity, reproductive and developmental toxicity, neurotoxicity and acute toxicity. This applies whether they are taken internally, applied to the skin or simply inhaled. As with most medicinal drugs, whether of "synthetic" or "natural" origin, the compounds present in essential oils can cause serious, even fatal, toxic effects if ingested in overly large quantities or used incorrectly [1], [2].
In addition, essential oils play a determining role as natural flavoring compounds in both the perfumery and the pharmaceutical industries. In Japan, Citrus sudachi is a famous sour citrus fruit on account of its unique pleasant citrus odor and its savory taste [3], [4], [5]. It is cultivated in Tokushima prefecture on Shikoku Island. Citrus sudachi is commonly used as a seasoning in foods or as a flavoring in alcoholic beverages. The essential oil components of this natural product include alcohols, organic acids, aldehydes, ketones, esters, aromatic compounds and terpenes, some of which have shown antibacterial activity [6]. All these compounds have been identified by gas chromatography-mass spectrometry (GC-MS). Nevertheless, the mass spectra do not always provide enough evidence for structure elucidation, and a prediction model should be used to verify the molecular structure. These methodologies, called quantitative structure-retention relationships (QSRR), permit the generation of useful equations for predicting the retention indices of molecules that are similar to, but not among, those used to develop the model [7]. Seeking the quantitative relationship between molecular structure and gas chromatographic retention indices has been a basic task in chemistry. From a theoretical viewpoint, correlations between the GC retention indices and the molecular structures can provide deeper insight into the interactions between the eluents and the stationary phases.
Recently, several QSPR/QSRR studies on the retention of essential oil components have been reported. These include models of Kovats gas chromatographic retention indices on both apolar (DB-1) and polar (DB-Wax) columns for 48 compounds from Ylang-Ylang essential oil by Olivero et al. [8], capillary column gas chromatographic retention times of natural sterols (trimethylsilyl ethers) from olive oil by Acuña-Cueva et al. [9], Kovats retention indices of terpenes by Hemmateenejad et al. [10], and retention indices of pyrazines by Stanton et al. [7].
This work presents a quantitative structure-retention investigation of the essential oil components extracted from the Citrus sudachi fruit and their retention indices (RI). In QSAR/QSRR studies, several techniques can be applied for model construction. Linear techniques such as multiple linear regression (MLR) and partial least squares (PLS), and nonlinear regressions such as polynomial PLS (poly-PLS), can be used to inspect linear and nonlinear relations, respectively, between the property of interest and the molecular descriptors. MLR yields models that are simpler and easier to interpret than PLS models, because PLS performs the regression on latent variables that do not have any direct physical meaning. However, when the structural descriptors are collinear, MLR is not able to extract useful information from the structural data reliably, and an overfitting problem is encountered. PLS is a factor analytical technique which uses factors, or latent variables, to create a target matrix used for calibration. PLS is suitable if there are fewer factors in the target matrix than the number of factors originally present in the data matrix. In PLS, the combination step and the regression stage are amalgamated with the decomposition step and the production of the latent variables, so that the eigenvectors of the data matrix are extracted in a sequence congruent with the eigenvectors of the target matrix [11]. Normal or linear PLS uses a linear function to regress the scores of the descriptor matrix on the scores of the retention index matrix to find the inner relation. Polynomial PLS employs a nonlinear function, in this case a squared function, to find this inner relation. A good explanation of PLS and nonlinear PLS is given in the paper by Wold et al. [12], which describes the PLS process and the similarities and differences between the linear and nonlinear methods.
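The difference between the linear and the squared inner relation can be illustrated with a minimal sketch. It is not the authors' implementation: it extracts only the first PLS component for a single response, fits the polynomial on fixed scores (the full poly-PLS algorithm of Wold et al. also updates the weights), and uses an invented five-point toy data set with two perfectly collinear descriptors. All function names are hypothetical.

```python
import math

def center(values):
    """Subtract the mean from a list of numbers."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def first_pls_scores(X, y):
    """Scores t = Xw of the first PLS component (X and y already centered)."""
    n, p = len(X), len(X[0])
    w = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    return [sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]

def fit_polynomial(t, u, degree):
    """Least-squares polynomial coefficients (lowest order first)."""
    k = degree + 1
    A = [[sum(ti ** (r + c) for ti in t) for c in range(k)] for r in range(k)]
    b = [sum((ti ** r) * ui for ti, ui in zip(t, u)) for r in range(k)]
    # Gaussian elimination with partial pivoting on the normal equations.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * k
    for r in range(k - 1, -1, -1):
        s = sum(A[r][c] * coef[c] for c in range(r + 1, k))
        coef[r] = (b[r] - s) / A[r][r]
    return coef

def sse(t, u, coef):
    """Sum of squared residuals of the inner relation u ~ poly(t)."""
    return sum(
        (ui - sum(c * ti ** p for p, c in enumerate(coef))) ** 2
        for ti, ui in zip(t, u)
    )

# Toy data (invented): the second descriptor is exactly twice the first,
# and the response depends on the first descriptor quadratically.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
X = [[v, 2.0 * v] for v in x1]
y = center([v + 0.5 * v * v for v in x1])

t = first_pls_scores(X, y)
sse_linear = sse(t, y, fit_polynomial(t, y, 1))     # linear inner relation
sse_quadratic = sse(t, y, fit_polynomial(t, y, 2))  # squared inner relation
```

On this toy set the collinear descriptors collapse into a single score vector, and the squared inner relation reproduces the response essentially exactly while the linear one leaves a residual.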
The support vector machine (SVM) is a new algorithm developed by the machine learning community [13], [14]. The SVM approach automatically controls the flexibility of the resulting classifier on the training data. Accordingly, by the design of the algorithm, the deteriorating effect of the input dimensionality on the generalization ability is largely suppressed. Owing to its remarkable generalization performance, SVM has attracted attention and found extensive application in areas such as pattern recognition [15], [16], drug design [17], QSAR [18], [19], [20], [21] and QSPR analysis [22], [23]. In most of these cases, the performance of SVM modeling either matches or is significantly better than that of traditional machine learning approaches.
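For a continuous property such as the retention index, the regression variant of SVM is the natural choice. The snippet does not spell out the formulation, so the sketch below assumes the standard epsilon-insensitive loss and a Gaussian (RBF) kernel. It shows only the building blocks of a trained model's prediction, not the training step, and every name and number in it is hypothetical.

```python
import math

def rbf_kernel(x, z, gamma):
    """Gaussian kernel exp(-gamma * ||x - z||^2) between two descriptor vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def eps_insensitive_loss(y_true, y_pred, eps):
    """Errors smaller than eps cost nothing; larger errors cost the excess."""
    return max(0.0, abs(y_true - y_pred) - eps)

def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Prediction of an already trained model: weighted kernel sum plus bias.

    dual_coefs are assumed to hold (alpha_i - alpha_i*) for each support vector.
    """
    return bias + sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )

# Illustrative values only, not taken from the paper.
example_prediction = svr_predict(
    x=[0.5, 1.0],
    support_vectors=[[0.5, 1.0]],
    dual_coefs=[20.0],
    bias=1200.0,
    gamma=0.1,
)
```

The two tunable quantities visible here, the tube width `eps` and the kernel spread `gamma`, together with the regularization constant used during training, control how flexible the fitted function is.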
The main aim of the present work was to establish a new QSRR model for predicting the retention index property of the organic compounds, derived from the essential oil of Citrus sudachi using the SVM techniques. The performance of this model was compared with those obtained by the MLR, PLS and poly-PLS methods. This is the first research on QSRR of the essential oil compounds against the retention index, using SVM.
Data set
The data set used in this study was taken from the work of Mookdasanit et al. [24] and is presented in Table 1. This set contains the retention indices of Citrus sudachi essential oil compounds, which were measured under the same conditions on an HP5 column (30 m×0.32 mm i.d.; Hewlett Packard, CA). The retention indices of the compounds ranged from 800 for hexanal to 1752 for α-sinensal, with a mean value of 1249.
Equipment
A Pentium IV personal computer (CPU at 3.06 GHz) with a Windows XP
Results and discussion
To select the most important descriptors, a genetic algorithm (GA) was run many times with different initial populations. In the end, a population of good models was obtained. Among these models, one presented the highest statistical quality, and it was used repeatedly in comparison with the other models.
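A GA-based descriptor selection of this kind can be sketched as follows. The snippet does not state the fitness function or the GA settings, so everything here is an assumption made for illustration: each chromosome is a 0/1 mask over the descriptors, the fitness is the R² of an MLR fit minus a small penalty per selected descriptor, and the operators are tournament selection, one-point crossover, bit-flip mutation and elitism. The data are synthetic and the names are hypothetical.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    A = [row[:] for row in A]
    b = b[:]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        x[r] = (b[r] - sum(A[r][c] * x[c] for c in range(r + 1, k))) / A[r][r]
    return x

def fitness(mask, X, y, penalty=0.01):
    """R^2 of an MLR fit on the selected descriptors minus a size penalty."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return float("-inf")
    Z = [[1.0] + [row[j] for j in cols] for row in X]  # intercept + descriptors
    k = len(cols) + 1
    # A tiny ridge term keeps the normal equations solvable if descriptors are collinear.
    A = [[sum(z[r] * z[c] for z in Z) + (1e-8 if r == c else 0.0)
          for c in range(k)] for r in range(k)]
    b = [sum(z[r] * yi for z, yi in zip(Z, y)) for r in range(k)]
    coef = solve(A, b)
    pred = [sum(c * v for c, v in zip(coef, z)) for z in Z]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot - penalty * len(cols)

def ga_select(X, y, pop_size=20, generations=30, p_mut=0.1, seed=0):
    """Return the best descriptor mask and the best fitness per generation."""
    rng = random.Random(seed)
    p = len(X[0])
    score = lambda m: fitness(m, X, y)
    pop = [[rng.randint(0, 1) for _ in range(p)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        history.append(score(pop[0]))
        nxt = [pop[0][:]]  # elitism: the best mask always survives
        while len(nxt) < pop_size:
            a = max(rng.sample(pop, 3), key=score)  # tournament selection
            b = max(rng.sample(pop, 3), key=score)
            cut = rng.randint(1, p - 1)             # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]
            nxt.append(child)
        pop = nxt
    best = max(pop, key=score)
    history.append(score(best))
    return best, history

# Synthetic data: only descriptors 0 and 3 actually drive the response.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in range(25)]
y = [3.0 * row[0] - 2.0 * row[3] + data_rng.gauss(0, 0.1) for row in X]
best_mask, history = ga_select(X, y)
```

Running `ga_select` again with a different `seed` mimics the repeated runs from different initial populations; masks that keep reappearing across runs are the natural candidates for the final model.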
The descriptors selected by this method were used to construct linear and nonlinear models with the MLR, PLS, poly-PLS and SVM techniques.
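Models built on the same descriptors are usually compared through their prediction errors. The snippet does not list the statistics that were used, so the sketch below assumes two common ones, the root mean square error (RMSE) and the coefficient of determination (R²). The numbers and the model names are placeholders, not results from this study.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def rank_models(observed, predictions):
    """Sort model names from lowest to highest RMSE."""
    return sorted(predictions, key=lambda name: rmse(observed, predictions[name]))

# Placeholder retention indices and predictions, for illustration only.
observed_ri = [800.0, 1000.0, 1200.0, 1400.0]
predictions = {
    "model_b": [830.0, 970.0, 1230.0, 1370.0],
    "model_a": [810.0, 990.0, 1210.0, 1390.0],
}
ranking = rank_models(observed_ri, predictions)
```

For a fair comparison the statistics should be computed on compounds that were not used to fit the models, for example a held-out test set or cross-validation folds.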
Conclusion
In the present study, two linear methods (MLR and PLS) and two nonlinear methods (poly-PLS and SVM) were used to construct a quantitative relation between the retention index of some essential oil components and their calculated descriptors.
The results obtained by SVM were compared with the results obtained by MLR, PLS and poly-PLS. The results demonstrated that SVM was more powerful in the retention index prediction of the essential oil compounds than MLR, PLS and poly-PLS. A suitable model
References (41)
- et al. Molecular structure and gas chromatographic retention behavior of the components of Ylang-Ylang oil. J. Pharm. Sci. (1997)
- et al. Quantitative structure-retention relationship for the Kovats retention indices of a large set of terpenes: a combined data splitting-feature selection strategy. Anal. Chim. Acta (2007)
- et al. Nonlinear PLS modeling. Chemometr. Intell. Lab. Syst. (1989)
- et al. Drug design by machine learning: support vector machines for pharmaceutical data analysis. Comput. Chem. (2001)
- et al. Prediction of selectivity coefficients of a theophylline-selective electrode using MLR and ANN. Talanta (2006)
- et al. Application of GA-MLR, GA-PLS and the DFT quantum chemical (QM) calculations for the prediction of the selectivity coefficients of a histamine-selective electrode. Sens. Actuators B (2008)
- et al. Genetic algorithms applied to feature selection in PLS regression: how and when to use them. Chemometr. Intell. Lab. Syst. (1998)
- et al. Determination of the spread parameter in the Gaussian kernel for classification and regression. Neurocomputing (2003)
- et al. The antimicrobial properties of marjoram (Origanum majorana L.) volatile oil. Flavour Fragrance J. (1990)
- et al. Effects of farnesol and the off-flavor derivative geosmin on Streptomyces tendae. Appl. Environ. Microbiol. (1991)
- Volatile constituents in the peel oil of sudachi (Citrus sudachi). Agric. Biol. Chem.
- Aroma profiles of peel oils of acid citrus
- Seasonal change of volatile compounds of Citrus sudachi during maturation. Food Sci. Technol. Res.
- Naturally-occurring antiacne agents. J. Nat. Prod.
- Computer-assisted prediction of gas chromatographic retention indices of pyrazines. Anal. Chem.
- Quantitative structure-capillary column gas chromatographic retention time relationships for natural sterols (trimethylsilyl ethers) from olive oil. JAOCS
- Factor Analysis in Chemistry
- Estimation of Dependencies Based on Empirical Data
- A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Disc.
- Comparison of support vector machine and artificial neural network systems for drug/nondrug classification. J. Chem. Inf. Comput. Sci.