Review
Variables selection methods in near-infrared spectroscopy
Introduction
In recent years, near-infrared (NIR) spectroscopy has gained wide acceptance in different fields by virtue of its advantages over other analytical techniques, the most salient of which is its ability to record spectra for solid and liquid samples without any pretreatment. This characteristic makes it especially attractive for straightforward, speedy characterization of natural and synthetic products. NIR measurements often bring cost savings through improved process control and product quality, and they can provide results significantly faster than traditional laboratory analysis. In batch processes, NIR allows several quality estimates to be made within a manufacturing cycle, as opposed to a single end-of-batch analysis. It can therefore reveal potential problems early in the process and prompt corrective action, which is particularly valuable where safety is a factor. Intrinsically safe measurement probes and fiber optics are a further safety advantage. Over the past 15 years, NIR spectroscopy has increasingly been adopted as an analytical tool in a variety of fields, for example in the petrochemical [1], [2], pharmaceutical [3], [4], environmental [5], [6], clinical [7], [8], [9], agricultural [6], [10], [11], [12], food [13] and biomedical [14] sectors.
Typically, modern NIR analysis involves the rapid acquisition of a large number of absorbance values over a selected spectral range. The information contained in the spectral curve is then used to predict the chemical composition of the sample by extracting the appropriate variables of interest. Generally, NIR spectroscopy is used in combination with multivariate techniques for qualitative or quantitative analysis. The large number of spectral variables in most data sets encountered in spectral chemometrics often complicates the prediction of a dependent variable; however, the problem may be minimised by the use of suitable projection or selection techniques. Selection and projection methods differ in several aspects [15].
Projection methods, for example partial least squares (PLS) and principal component regression (PCR), are generally applicable but do not presuppose any bias or weights on the principal axes. Projection calibration models are straightforward, and the model calculations can be performed quickly with commercially available software packages. Earlier full-spectrum PCR and PLS methods did not feature preliminary selection, but introduced latent variables composed of combinations of the original features. Even where prediction properties are good, these methods usually suffer from the fact that the latent variables are hardly interpretable in terms of the original features (wavelengths in the case of infrared spectra). Multivariate calibration models such as PLS regression were developed for quantitative analysis of spectral data because of their ability to reduce the impact of common problems such as collinearity, band overlaps, and interactions. However, even with such sophisticated chemometric tools as PLS, data that do not contain critical information can severely corrupt the resulting calibration models, because not all variables or regions are equally important for the modeling; some of them, such as noise areas, may even be harmful. Projecting the data onto an abstract factor space reduces the error but does not eliminate it entirely; part of the error is carried into the new data space, often confounding the model. Therefore, removing the variables in which noise dominates over the relevant information often leads to better accuracy and performance of the analytical methods.
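To make concrete how a projection method builds latent variables as combinations of the original wavelengths, the following is a minimal sketch of single-response PLS (PLS1) using the NIPALS recursion. It is an illustration under stated assumptions, not the implementation of any package discussed in this review: the function names (pls1_fit, pls1_predict) and the tiny data set are invented for the example, and the data are only mean-centred, with no scaling or other pretreatment.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pls1_fit(X, y, n_components):
    """Fit a PLS1 model by NIPALS on mean-centred data (illustrative sketch)."""
    n, n_vars = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(n_vars)]
    y_mean = sum(y) / n
    E = [[v - m for v, m in zip(row, x_mean)] for row in X]  # X residual
    f = [v - y_mean for v in y]                               # y residual
    components = []
    for _ in range(n_components):
        # Weights: direction in wavelength space most covariant with y.
        w = [dot([row[j] for row in E], f) for j in range(n_vars)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:  # nothing left for the spectra to explain
            break
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]  # scores: one latent variable
        tt = dot(t, t)
        p = [dot([row[j] for row in E], t) / tt for j in range(n_vars)]  # loadings
        q = dot(f, t) / tt              # regression of y on the scores
        # Deflate: remove what this latent variable has explained.
        E = [[e - ti * pj for e, pj in zip(row, p)] for row, ti in zip(E, t)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
        components.append((w, p, q))
    return x_mean, y_mean, components

def pls1_predict(model, x):
    """Predict y for one spectrum by replaying the deflation steps."""
    x_mean, y_mean, components = model
    e = [v - m for v, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in components:
        t = dot(e, w)
        y_hat += q * t
        e = [ei - t * pj for ei, pj in zip(e, p)]
    return y_hat

# Invented toy data: 5 samples, 3 "wavelengths", noise-free linear response.
X_toy = [[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 4.0, 0.0],
         [4.0, 3.0, 2.5], [5.0, 6.0, 1.0]]
y_toy = [2.0 * a - b + 0.5 * c for a, b, c in X_toy]
model = pls1_fit(X_toy, y_toy, n_components=3)
print([round(pls1_predict(model, row), 6) for row in X_toy])
```

Each weight vector w mixes all wavelengths, which is why the resulting latent variables are hard to read back in terms of individual spectral features.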
In contrast, selection methods choose a small number of variables from the original set, which makes the resulting models easier to interpret. Variable selection is a very important step in multivariate analysis, because the removal of non-informative variables will produce better predictions and simpler models. It has been shown that a judicious pre-selection of wavelengths can increase the predictive ability and reduce the complexity of the model. It is now widely accepted that a well-performed variable selection can result in models having a greater predictive ability [15].
Variable or feature selection, also called “frequency” or “wavelength” selection when applied to spectroscopic data, is a critical step in data analysis, as it allows interactive improvement of the quality of data during the calibration procedure. The goal of frequency selection is to identify a subset of spectral frequencies that produces the smallest possible errors when used to perform operations such as making quantitative determinations or discriminating between dissimilar samples. Recently, considerable effort has been directed toward developing and evaluating different procedures that objectively identify variables that contribute useful information and/or eliminate variables containing mostly noise. Classically, this selection is made from basic knowledge of the spectroscopic properties of the sample (knowledge-based selection) [16], but it has been shown that there are mathematical strategies for variable selection that are more efficient.
From a conceptual point of view, a variable selection procedure includes first the choice of a relevance measure and, second, the choice of a search algorithm to perform the optimization. The relevance measure aims at evaluating the influence of a particular subset of X-variables on the dependent variables, y. As for the search algorithm, stochastic algorithms are often used in applications such as spectroscopic multivariate calibration. This approach is usually called computer-aided variable selection. Computer-aided variable selection is an important pre-processing procedure in chemometrics, which is widely used to improve the performance of various multivariate methods and algorithms, such as regression methods, factor analysis, and curve resolution. Multivariate approaches can exploit all variables and effectively extract necessary information in the analysis. Computer-aided variable selection is also important in industry for several reasons. Variable selection can improve model performance, provide robust models that may be readily transferred, and allow non-expert users to build reliable models with only limited expert intervention. Furthermore, computer-aided selection of variables may be the only approach for some models, for example when predicting a physical property from spectral data. Exploiting state-of-the-art theories and techniques of the late 20th and the 21st centuries has enabled tremendous progress in NIR spectroscopy.
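The two ingredients named above, a relevance measure and a search algorithm, can be sketched in a few lines. The example below is only one possible pairing, chosen for brevity: the relevance measure is the leave-one-out RMSECV of a multiple linear regression on the candidate subset, and the search is a plain random search, used here as a crude stand-in for the stochastic algorithms mentioned in the text. All function names and the toy data are assumptions made for the illustration.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def mlr_fit(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    Z = [[1.0] + list(r) for r in rows]
    p = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    b = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return solve(A, b)

def rmsecv(X, y, subset):
    """Relevance measure: leave-one-out RMSE of MLR on the chosen columns."""
    rows = [[row[j] for j in subset] for row in X]
    squared_error = 0.0
    for i in range(len(y)):
        coef = mlr_fit(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:])
        pred = coef[0] + sum(c * v for c, v in zip(coef[1:], rows[i]))
        squared_error += (y[i] - pred) ** 2
    return math.sqrt(squared_error / len(y))

def random_search(X, y, subset_size, n_trials, seed=0):
    """Search algorithm: sample random subsets and keep the best-scoring one."""
    rng = random.Random(seed)
    best_subset, best_score = None, float("inf")
    for _ in range(n_trials):
        subset = tuple(sorted(rng.sample(range(len(X[0])), subset_size)))
        score = rmsecv(X, y, subset)
        if score < best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

# Invented toy data: 8 samples, 4 variables; y depends only on columns 1 and 3.
X_toy = [[1, 3, 2, 5], [2, 1, 7, 3], [3, 4, 1, 5], [4, 1, 8, 8],
         [5, 5, 2, 9], [6, 9, 8, 7], [7, 2, 1, 9], [8, 6, 8, 3]]
X_toy = [[float(v) for v in row] for row in X_toy]
y_toy = [1.0 + 2.0 * row[1] - 0.5 * row[3] for row in X_toy]
print(random_search(X_toy, y_toy, subset_size=2, n_trials=200))
```

Swapping in a different relevance measure (for example a PLS-based error) or a different search (simulated annealing, a genetic algorithm) changes only one of the two functions, which is the practical value of keeping them separate.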
There is a multitude of approaches available for variable selection. These may be categorized as follows. First, “Univariate” approaches select those variables that have the greatest correlation with the response; they were used mainly in the early period of variable selection in NIR spectroscopy. Secondly, “Sequential” approaches rank variables in order and pair the variables in a forward or backward progression. A more sophisticated approach iterates the progression to reassess previous selections. An inherent problem with these approaches is that only a very small part of the experimental domain is explored. These methods were used from the middle of the 1970s to the middle of the 1990s. Thirdly, since the 1990s, “multivariate” methods of variable selection have been introduced, for example interactive variable selection, uninformative variable elimination (UVE), interval PLS (iPLS), significance tests of model parameters, and the use of genetic algorithms (GAs).
This review emphasizes variable selection methods in NIR spectroscopy, and is organized as follows. The importance of variable selection in NIR spectroscopy is presented in Section 2 from different viewpoints. Section 3 gives a brief review of the global calibration methods PCR and PLS and of the most common method, MLR, because these methods are always used as relevance measures in variable selection. Variable selection methods are discussed in Section 4. Some classical approaches, such as the manual approach (knowledge-based selection) and the “Univariate” and “Sequential” selection methods, are introduced in the first and second parts of Section 4. The third part of Section 4 discusses the relatively sophisticated methods, the successive projections algorithm (SPA) and uninformative variable elimination (UVE). The fourth part of Section 4 focuses on elaborate search-based strategies, such as simulated annealing (SA), artificial neural networks (ANN) and genetic algorithms (GAs). Interval partial least squares (iPLS), including moving window PLS and iterative PLS, is discussed in the fifth part of Section 4. The last part of Section 4 introduces some other selection methods, such as B-spline and Kalman filter methods. Finally, the websites of some variable selection software and toolboxes for non-commercial use are given in Section 5. The review ends with a brief summary.
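The first two categories can be illustrated with a short sketch. Below, rank_by_correlation orders variables by the absolute correlation coefficient with the response (a univariate approach), and forward_select adds, at each step, the variable that removes the largest share of the remaining residual sum of squares (a forward sequential approach, implemented here by successive orthogonalization). The function names and the toy data are assumptions introduced for this example only.

```python
import math

def centre(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_by_correlation(X, y):
    """Univariate selection: order variables by |r| with the response."""
    yc = centre(y)
    scores = []
    for j in range(len(X[0])):
        xc = centre([row[j] for row in X])
        denom = math.sqrt(dot(xc, xc) * dot(yc, yc))
        r = dot(xc, yc) / denom if denom > 0 else 0.0  # constant column: r = 0
        scores.append((abs(r), j))
    return [j for _, j in sorted(scores, reverse=True)]

def forward_select(X, y, n_select):
    """Forward selection: greedily add the variable that explains most residual."""
    cols = {j: centre([row[j] for row in X]) for j in range(len(X[0]))}
    resid = centre(y)
    chosen = []
    for _ in range(n_select):
        gains = {j: dot(c, resid) ** 2 / dot(c, c)
                 for j, c in cols.items() if dot(c, c) > 1e-12}
        if not gains:
            break
        best = max(gains, key=gains.get)
        b = cols.pop(best)
        bb = dot(b, b)
        coef = dot(b, resid) / bb
        resid = [r - coef * v for r, v in zip(resid, b)]
        # Orthogonalise the remaining candidates against the chosen variable,
        # so later gains measure only information not already selected.
        cols = {j: [ci - (dot(c, b) / bb) * bi for ci, bi in zip(c, b)]
                for j, c in cols.items()}
        chosen.append(best)
    return chosen

# Invented toy data: 8 samples, 4 variables; y depends only on columns 1 and 3.
X_toy = [[1, 3, 2, 5], [2, 1, 7, 3], [3, 4, 1, 5], [4, 1, 8, 8],
         [5, 5, 2, 9], [6, 9, 8, 7], [7, 2, 1, 9], [8, 6, 8, 3]]
X_toy = [[float(v) for v in row] for row in X_toy]
y_toy = [1.0 + 2.0 * row[1] - 0.5 * row[3] for row in X_toy]
print(rank_by_correlation(X_toy, y_toy))
print(forward_select(X_toy, y_toy, 2))
```

In this toy data, variable 3 has the weakest individual correlation with y even though it is part of the true model, so a purely univariate ranking places it last while forward selection picks it up at the second step. Both procedures still visit only a handful of the possible subsets, which is the limitation noted above.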
Section snippets
The importance of variable selection in NIR spectroscopy
There is an extensive literature on the importance of variable selection in NIR spectroscopy. Here, the different aspects of variable selection are summarized.
A brief review of regression methods
The chemometric methods commonly used for the analysis of NIR spectra can be divided into three main groups of techniques. (i) Mathematical pretreatments, which enhance the information sought in the study and decrease the influence of the side information contained in the spectra. Spectral pre-processing is considered well known and is not described in this text. The classical pretreatments are normalizations, derivatives and smoothing. For more details, readers are referred to textbooks [25], [26]
Manual approaches – knowledge based selection
For manual approaches, one possibility is to remove the variables that have poor informational quality. In many studies [23], [54], [55], some data points at the lower and higher ends of the spectral range were omitted from the spectral data sets because of the insensitivity of the NIR instrument detector in those regions. Fig. 3(a) [56] shows apple spectra collected by a NIR instrument. The data points at the lower and higher ends were cut from the spectral data sets before regression because of their low signal-to-noise ratio (S/N).
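Trimming the noisy end regions of a spectrum is simple to express in code. The sketch below keeps only the points inside a chosen wavelength window; the function name trim_spectrum, the wavelength limits and the absorbance values are all hypothetical and would in practice come from inspecting the spectra and knowing the detector's usable range.

```python
def trim_spectrum(wavelengths, absorbances, low_nm, high_nm):
    """Keep only the points with low_nm <= wavelength <= high_nm."""
    kept = [(w, a) for w, a in zip(wavelengths, absorbances)
            if low_nm <= w <= high_nm]
    return [w for w, _ in kept], [a for _, a in kept]

# Hypothetical spectrum sampled every 100 nm; the limits are illustrative only.
wavelengths = list(range(900, 1800, 100))
absorbances = [0.91, 0.40, 0.42, 0.45, 0.47, 0.44, 0.43, 0.41, 0.88]
trimmed_w, trimmed_a = trim_spectrum(wavelengths, absorbances, 1000, 1600)
print(trimmed_w)
```

The same window must be applied to every calibration and prediction spectrum so that the variables stay aligned across samples.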
Software of wavelength selection methods
In Table 1, we list the software packages (mainly in Matlab) used in non-commercial studies. Software packages supplied by NIR instrument companies and bundled with their spectrometers, such as OPUS version 4.2 from Bruker GmbH (Bremen, Germany), the OMNIC series from Thermo Nicolet, and TQ Analyst, will not be discussed here. The packages cited in Table 1 are freely available, and the authors would like to express their gratitude to the developers.
Summary
Although NIR instrumentation produces large volumes of data, it often requires, as we have described, careful and sophisticated processing in order to extract information. Chemometric methods have been found to be very useful for extracting information from NIR spectra, and there is great interest in using NIR technology for measurements of phenomena of different analytes [25], [27], [47], [59]. However, there is still a great need for new methods that can handle data from these modern
Acknowledgements
The authors gratefully acknowledge the financial support provided by the foundations of NSFC (Grant no. 6091079), Chinese 863 Program (Grant nos. 2008AA10Z208, 2008AA10Z204), the Postdoctoral Foundation of China (20070411024, 0601003C) and the talent foundation of Jiangsu University. Dr. Zou Xiabo thanks Dr. Jianshe Chen (University of Leeds) for advice and encouragement, and the many researchers who have contributed stimulating work in this field.
Glossary
- MVC: multivariate calibration
- LMVC: linear multivariate calibration
- PLS: partial least squares
- MLR: multiple linear regression
- PCR: principal components regression
- SEC: standard error of calibration
- RMSECV: root mean square error of cross-validation
- r: correlation coefficient
- SEP: standard error of prediction
- LOOCV: leave-one-out cross-validation
- SPA: successive projections algorithm
- UVE: uninformative variable elimination
- SA: simulated annealing
- ANN: artificial neural networks
- GA: genetic algorithm
- iPLS: interval partial least squares
- BP-ANN
References (107)
et al., Renewable and Sustainable Energy Reviews (2009)
et al., Renewable and Sustainable Energy Reviews (2006)
et al., European Journal of Pharmaceutics and Biopharmaceutics (2008)
et al., Journal of Pharmaceutical and Biomedical Analysis (2007)
et al., Fuel (2004)
et al., Medical Engineering & Physics (2009)
et al., Journal of the American College of Cardiology (2006)
et al., Biochemical and Biophysical Research Communications (2006)
et al., Journal of Food Engineering (2009)
et al., Food Chemistry (2007)
Small Ruminant Research
Analytica Chimica Acta
Forensic Science International
Chemometrics and Intelligent Laboratory Systems
Journal of Cereal Science
Vibrational Spectroscopy
Chemometrics and Intelligent Laboratory Systems
Analytica Chimica Acta
NeuroImage
Analytica Chimica Acta
Chemometrics and Intelligent Laboratory Systems
Remote Sensing of Environment
Chemometrics and Intelligent Laboratory Systems
Analytica Chimica Acta
Analytica Chimica Acta
Chemometrics and Intelligent Laboratory Systems
Analytica Chimica Acta
Chemometrics and Intelligent Laboratory Systems
Chemometrics and Intelligent Laboratory Systems
Analytica Chimica Acta
Trends in Analytical Chemistry
Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy
Talanta
Talanta
Chemometrics and Intelligent Laboratory Systems
Chemometrics and Intelligent Laboratory Systems
Analytica Chimica Acta
Analytica Chimica Acta
Analytica Chimica Acta
Chemometrics and Intelligent Laboratory Systems
Chemometrics and Intelligent Laboratory Systems
Chemometrics and Intelligent Laboratory Systems
Chemometrics and Intelligent Laboratory Systems
Vibrational Spectroscopy
Chemometrics and Intelligent Laboratory Systems
Chemometrics and Intelligent Laboratory Systems
Analytica Chimica Acta
Analytica Chimica Acta
Chemometrics and Intelligent Laboratory Systems
Chemometrics and Intelligent Laboratory Systems