Review
Variables selection methods in near-infrared spectroscopy

https://doi.org/10.1016/j.aca.2010.03.048

Abstract

Near-infrared (NIR) spectroscopy has increasingly been adopted as an analytical tool in various fields, such as the petrochemical, pharmaceutical, environmental, clinical, agricultural, food and biomedical sectors, during the past 15 years. A NIR spectrum of a sample is typically measured by modern scanning instruments at hundreds of equally spaced wavelengths. The large number of spectral variables in most data sets encountered in NIR spectral chemometrics often renders the prediction of a dependent variable unreliable. Recently, considerable effort has been directed towards developing and evaluating different procedures that objectively identify variables which contribute useful information and/or eliminate variables containing mostly noise. This review focuses on variable selection methods in NIR spectroscopy. Selection methods include some classical approaches, such as the manual approach (knowledge based selection) and “Univariate” and “Sequential” selection methods; sophisticated methods such as the successive projections algorithm (SPA) and uninformative variable elimination (UVE); elaborate search-based strategies such as simulated annealing (SA), artificial neural networks (ANN) and genetic algorithms (GAs); and interval-based algorithms such as interval partial least squares (iPLS), windows PLS and iterative PLS. Wavelength selection with B-splines, Kalman filtering, Fisher's weights and Bayesian approaches is also mentioned. Finally, the websites of some variable selection software and toolboxes for non-commercial use are given.

Introduction

In recent years, near-infrared (NIR) spectroscopy has gained wide acceptance in different fields by virtue of its advantages over other analytical techniques, the most salient of which is its ability to record spectra for solid and liquid samples without any pretreatment. This characteristic makes it especially attractive for straightforward, speedy characterization of natural and synthetic products. NIR measurements often yield cost savings through improved control and product quality, and they provide results significantly faster than traditional laboratory analysis. In batch processes, NIR allows several quality estimates to be performed within a manufacturing cycle, as opposed to a single end-of-batch analysis. It can therefore reveal potential problems early in the process and prompt corrective actions, which may be particularly advantageous where safety is a factor. Safety is itself an advantage of the technique, owing to intrinsically safe measurement probes and fiber optics. NIR spectroscopy has increasingly been adopted as an analytical tool in a variety of fields during the past 15 years, for example in the petrochemical [1], [2], pharmaceutical [3], [4], environmental [5], [6], clinical [7], [8], [9], agricultural [6], [10], [11], [12], food [13] and biomedical [14] sectors.

Typically, modern NIR analysis involves the rapid acquisition of a large number of absorbance values for a selected spectral range. The information contained in the spectral curve is then used to predict the chemical composition of the sample by extracting the appropriate variables of interest. Generally, NIR spectroscopy is used in combination with multivariate techniques for qualitative or quantitative analysis. The large number of spectral variables in most data sets encountered in spectral chemometrics often complicates the prediction of a dependent variable; however, the problem may be minimised by the use of suitable projection or selection techniques. Selection and projection methods differ in several aspects [15].

Projection methods, for example partial least squares (PLS) and principal component regression (PCR), are generally applicable but do not presuppose any bias or weights on the principal axes. Projection calibration models are straightforward, and the model calculations can be performed quickly by commercially available software packages. Earlier PCR and PLS full-spectrum methods did not feature preliminary selection, but introduced latent variables composed of combinations of the original features. Even where prediction properties are good, these methods usually suffer from the fact that the latent variables are hardly interpretable in terms of the original features (wavelengths in the case of infrared spectra). Multivariate calibration models such as PLS regression were developed for quantitative analysis of spectral data because of their ability to reduce the impact of common problems such as collinearity, band overlaps, and interactions. However, even with such sophisticated chemometric tools as PLS, data that do not contain critical information can severely corrupt the resulting calibration models, because not all variables or regions are equally important for the modeling; some of them, like noise areas, may even be harmful. Data projection onto an abstract factor space reduces the error but does not eliminate it entirely; the error is partially projected onto the new data space, often confounding the model. Therefore, removal of the variables in which noise dominates over the relevant information often leads to better accuracy and performance of the analytical methods.

In contrast, selection methods are based on the principle of choosing a small number of variables from the original set, which provides easier interpretation. Variable selection is a very important step in multivariate analysis, because the removal of non-informative variables produces better predictions and simpler models. It has been shown that a judicious pre-selection of wavelengths can increase the predictive ability and reduce the complexity of the model. It is now widely accepted that a well-performed variable selection can result in models having a greater predictive ability [15].

Variable or feature selection, also called “frequency” or “wavelength” selection when applied to spectroscopic data, is a critical step in data analysis, as it allows interactive improvement of the quality of data during the calibration procedure. The goal of frequency selection is to identify a subset of spectral frequencies that produce the smallest possible errors when used to perform operations such as making quantitative determinations or discriminating between dissimilar samples. Recently, considerable effort has been directed toward developing and evaluating different procedures that objectively identify variables that contribute useful information and/or eliminate variables containing mostly noise. Classically, this selection is made from the basic knowledge about the spectroscopic properties of the sample – knowledge based selection [16], but it has been shown that there are mathematical strategies for variable selection that are more efficient.

From a conceptual point of view, a variable selection procedure includes first the choice of a relevance measure and, second, the choice of a search algorithm to perform optimization. The relevance measure aims at evaluating the influence of a particular subset of X-variables on the dependent variables, y. Concerning the search algorithm, stochastic algorithms are commonly employed in applications such as spectroscopic multivariate calibration. This approach is usually called computer aided variable selection. Computer aided variable selection is an important pre-processing procedure in chemometrics, which is widely used to improve the performance of various multivariate methods and algorithms, such as regression methods, factor analysis, and curve resolution. Multivariate approaches can exploit all variables and effectively extract necessary information in the analysis. Computer aided variable selection is also important in industry for several reasons. Variable selection can improve model performance, provide robust models that may be readily transferred, and allow non-expert users to build reliable models with only limited expert intervention. Furthermore, computer aided selection of variables may be the only approach for some models, for example when predicting a physical property from spectral data. Exploiting state-of-the-art theories and techniques of the late 20th and the 21st centuries has enabled tremendous progress in NIR spectroscopy.
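The two ingredients named above, a relevance measure and a search algorithm, can be illustrated with a minimal Python sketch. It is not a method taken from the review: it assumes leave-one-out RMSECV of an MLR model as the relevance measure and a plain random search as a stand-in for the stochastic algorithms. The function names `rmsecv_mlr` and `random_search` and the synthetic data are hypothetical.

```python
import numpy as np

def rmsecv_mlr(X, y, subset):
    """Relevance measure (assumed here): leave-one-out RMSECV of an MLR model
    built only on the columns listed in `subset`."""
    n = len(y)
    A = np.column_stack([np.ones(n), X[:, subset]])  # intercept + chosen variables
    errors = []
    for i in range(n):
        keep = np.arange(n) != i                      # leave sample i out
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        errors.append(y[i] - A[i] @ coef)             # prediction error on sample i
    return float(np.sqrt(np.mean(np.square(errors))))

def random_search(X, y, n_select, n_iter=200, seed=0):
    """Search algorithm (assumed here): draw random subsets of size `n_select`
    and keep the one with the lowest relevance score."""
    rng = np.random.default_rng(seed)
    best_subset, best_score = None, np.inf
    for _ in range(n_iter):
        subset = np.sort(rng.choice(X.shape[1], size=n_select, replace=False))
        score = rmsecv_mlr(X, y, subset)
        if score < best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

# Synthetic stand-in for spectra: 30 samples, 20 variables, y depends on columns 3 and 11.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 20))
y = 2.0 * X[:, 3] - 1.5 * X[:, 11] + 0.05 * rng.normal(size=30)

subset, score = random_search(X, y, n_select=2)
print("selected variables:", subset, "RMSECV:", round(score, 3))
```

Keeping the two parts separate means that either one can be swapped, for example a PLS-based RMSECV as the relevance measure or a genetic algorithm as the search, without touching the other.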

There is a multitude of approaches available for variable selection. These may be categorized as follows. First, “Univariate” approaches select those variables that have the greatest correlation with the response; they were used mainly in the early period of variable selection in NIR spectroscopy. Secondly, “Sequential” approaches rank variables in order and pair the variables in a forward or backward progression. A more sophisticated approach iterates the progression to reassess previous selections. An inherent problem with these approaches is that only a very small part of the experimental domain is explored. These methods were used from the middle of the 1970s to the middle of the 1990s. Thirdly, since the 1990s, “multivariate” methods of variable selection have been introduced, for example interactive variable selection, uninformative variable elimination (UVE), interval PLS (iPLS), significance tests of model parameters, and the use of genetic algorithms (GAs).
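The first two categories can be sketched in a few lines of Python. This is an illustrative sketch, not code from any of the cited works: it assumes Pearson correlation for the univariate ranking and the residual sum of squares of an MLR fit as the criterion for forward selection. The names `univariate_rank` and `forward_select` and the synthetic data are hypothetical.

```python
import numpy as np

def univariate_rank(X, y):
    """Univariate approach: rank variables by |Pearson correlation| with y, best first."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(r)), r

def forward_select(X, y, n_select):
    """Sequential (forward) approach: at each step add the variable that gives
    the lowest residual sum of squares for an MLR fit with an intercept."""
    selected = []
    for _ in range(n_select):
        best_j, best_rss = None, np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            A = np.column_stack([np.ones(len(y)), X[:, selected + [j]]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(np.sum((y - A @ coef) ** 2))
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected

# Synthetic stand-in for spectra: y depends strongly on column 5 and weakly on column 12.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 20))
y = 3.0 * X[:, 5] + 1.0 * X[:, 12] + 0.01 * rng.normal(size=40)

order, r = univariate_rank(X, y)
chosen = forward_select(X, y, n_select=2)
print("top univariate variable:", int(order[0]))
print("forward selection:", chosen)
```

The forward procedure never revisits an earlier choice, which is one way to see why such methods explore only a small part of the possible variable combinations.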

This review emphasizes variable selection methods in NIR spectroscopy, and is organized as follows. The importance of variable selection in NIR spectroscopy is discussed in Section 2 from different viewpoints. Section 3 gives a brief review of the global calibration methods PCR and PLS and the most common method, MLR, because these methods are always used as relevance measures in variable selection. Variable selection methods are discussed in Section 4. Some classical approaches, such as the manual approach (knowledge based selection) and “Univariate” and “Sequential” selection methods, are introduced in the first and second parts of Section 4. The third part of Section 4 discusses the relatively sophisticated methods, the successive projections algorithm (SPA) and uninformative variable elimination (UVE). The fourth part of Section 4 focuses on elaborate search-based strategies, such as simulated annealing (SA), artificial neural networks (ANN) and genetic algorithms (GAs). Interval partial least squares (iPLS), including moving windows PLS and iterative PLS, is discussed in the fifth part of Section 4. The last part of Section 4 introduces some other selection methods, such as B-spline and Kalman filter methods. Finally, the websites of some variable selection software and toolboxes for non-commercial use are given in Section 5. The review ends with a brief summary.

Section snippets

The importance of variable selection in NIR spectroscopy

There is much literature about the importance of variable selection in NIR spectroscopy. Here, the different aspects of variable selection are summarized.

A brief review of regression methods

The chemometric methods commonly used for the analysis of NIR spectra can be divided into three main groups of techniques. (i) Mathematical pretreatments to enhance the information sought in the study and decrease the influence of the side information contained in the spectra. Spectral pre-processing is considered well known and is not described in this text. The classical pretreatments are normalizations, derivatives and smoothing. For more details, readers are referred to textbooks [25], [26].

Manual approaches – knowledge based selection

For manual approaches, one possibility is to remove the variables that have poor informational quality. In many studies [23], [54], [55], some data points at the lower and higher ends of the spectral range were omitted from the spectral data sets because of the insensitivity of the NIR instrument detector in those regions. Fig. 3(a) [56] shows apple spectra collected by a NIR instrument. The data points at the lower and higher regions were cut from the spectral data sets before regression because of their low signal-to-noise ratio (S/N).
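The edge trimming described above amounts to masking the wavelength axis. The following Python sketch illustrates this under assumed values: the function name `trim_spectral_edges`, the wavelength grid and the cut-off limits are hypothetical and are not taken from the cited studies.

```python
import numpy as np

def trim_spectral_edges(wavelengths, spectra, lower, upper):
    """Keep only the variables whose wavelength lies in [lower, upper].

    wavelengths: 1-D array of length p; spectra: 2-D array (n_samples, p).
    """
    mask = (wavelengths >= lower) & (wavelengths <= upper)
    return wavelengths[mask], spectra[:, mask]

# Hypothetical grid: 16 points from 1000 to 2500 nm, 5 synthetic spectra.
wavelengths = np.linspace(1000.0, 2500.0, 16)
spectra = np.random.default_rng(0).normal(size=(5, 16))

# Assumed cut-offs; in practice they come from knowledge of the detector response.
kept_wl, kept_spectra = trim_spectral_edges(wavelengths, spectra, 1200.0, 2300.0)
print(kept_wl.size, kept_spectra.shape)
```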

Software of wavelength selection methods

In Table 1, we list the software packages (mainly in Matlab) used in non-commercial studies. Software packages supplied by NIR spectroscopy companies and bundled with NIR instruments, such as OPUS version 4.2 from Bruker GmbH (Bremen, Germany), the OMNIC series from Thermo Nicolet, and TQ Analyst, are not discussed here. The packages cited in Table 1 are freely available, and the authors would like to express their gratitude to the developers.

Summary

Although NIR instrumentation produces large volumes of data, it often, as we have described, requires careful and sophisticated processing in order to extract information. Chemometric methods have been found to be very useful for extracting information from NIR spectra, and there is great interest in using NIR technology for measurements of phenomena of different analytes [25], [27], [47], [59]. However, there is still a great need for new methods that can handle data from these modern

Acknowledgements

The authors gratefully acknowledge the financial support provided by the foundations of NSFC (Grant no. 6091079), Chinese 863 Program (Grant nos. 2008AA10Z208, 2008AA10Z204), the Postdoctoral Foundation of China (20070411024, 0601003C) and the talent foundation of Jiangsu University. Dr. Zou Xiabo thanks Dr. Jianshe Chen (University of Leeds) for advice and encouragement, and the many researchers who have contributed stimulating work in this field.

Glossary

MVC: multivariate calibration
LMVC: linear multivariate calibration
PLS: partial least squares
MLR: multiple linear regression
PCR: principal components regression
SEC: standard error of calibration
RMSECV: root mean square error of cross-validation
r: correlation coefficient
SEP: standard error of prediction
LOOCV: leave-one-out cross-validation
SPA: successive projections algorithm
UVE: uninformative variable elimination
SA: simulated annealing
ANN: artificial neural networks
GA: genetic algorithm
iPLS: interval partial least squares
BP-ANN

References (107)

  • A. Murugesan et al., Renewable and Sustainable Energy Reviews (2009)
  • L.C. Meher et al., Renewable and Sustainable Energy Reviews (2006)
  • C. Gendrin et al., European Journal of Pharmaceutics and Biopharmaceutics (2008)
  • Y. Roggo et al., Journal of Pharmaceutical and Biomedical Analysis (2007)
  • J. Nyström et al., Fuel (2004)
  • S.J. Erickson et al., Medical Engineering & Physics (2009)
  • J.D. Caplan et al., Journal of the American College of Cardiology (2006)
  • A. Sakudo et al., Biochemical and Biophysical Research Communications (2006)
  • G.P. Moreda et al., Journal of Food Engineering (2009)
  • R. Karoui et al., Food Chemistry (2007)
  • S. Landau et al., Small Ruminant Research (2006)
  • H. Namkung et al., Analytica Chimica Acta (2008)
  • R.C. Schneider et al., Forensic Science International (2003)
  • C.B. Zachariassen et al., Chemometrics and Intelligent Laboratory Systems (2005)
  • T.M. Baye et al., Journal of Cereal Science (2006)
  • D.D. Archibald et al., Vibrational Spectroscopy (2000)
  • L.O. Rodrigues et al., Chemometrics and Intelligent Laboratory Systems (2005)
  • K. Krämer et al., Analytica Chimica Acta (2000)
  • H. Sato et al., NeuroImage (2004)
  • M. Casale et al., Analytica Chimica Acta (2006)
  • M. Forina et al., Chemometrics and Intelligent Laboratory Systems (2007)
  • B.L. Becker et al., Remote Sensing of Environment (2007)
  • U.G. Indahl et al., Chemometrics and Intelligent Laboratory Systems (1999)
  • P.J. de Groot et al., Analytica Chimica Acta (1999)
  • Q. Guo et al., Analytica Chimica Acta (1999)
  • W. Wu et al., Chemometrics and Intelligent Laboratory Systems (1999)
  • W. Wu et al., Analytica Chimica Acta (1996)
  • W. Wu et al., Chemometrics and Intelligent Laboratory Systems (1996)
  • W. Wu et al., Chemometrics and Intelligent Laboratory Systems (1996)
  • W. Wu et al., Analytica Chimica Acta (1996)
  • M. Zeaiter et al., Trends in Analytical Chemistry (2005)
  • B. Hemmateenejad et al., Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy (2007)
  • M. Khanmohammadi et al., Talanta (2007)
  • B. Hemmateenejad et al., Talanta (2006)
  • G.A. Bakken et al., Chemometrics and Intelligent Laboratory Systems (1999)
  • W. Wu et al., Chemometrics and Intelligent Laboratory Systems (2000)
  • L. Pasti et al., Analytica Chimica Acta (1998)
  • D. Jouan-Rimbaud et al., Analytica Chimica Acta (1995)
  • A. Donachie et al., Analytica Chimica Acta (1999)
  • E. Vigneau et al., Chemometrics and Intelligent Laboratory Systems (1996)
  • P.D. Wentzell et al., Chemometrics and Intelligent Laboratory Systems (2003)
  • S. Wold et al., Chemometrics and Intelligent Laboratory Systems (2001)
  • Z. Xiaobo et al., Chemometrics and Intelligent Laboratory Systems (2007)
  • X. Zou et al., Vibrational Spectroscopy (2007)
  • E. Bertran et al., Chemometrics and Intelligent Laboratory Systems (1999)
  • M.C.U. Araújo et al., Chemometrics and Intelligent Laboratory Systems (2001)
  • M.J.C. Pontes et al., Analytica Chimica Acta (2009)
  • F. Liu et al., Analytica Chimica Acta (2009)
  • W. Cai et al., Chemometrics and Intelligent Laboratory Systems (2008)
  • S. Ye et al., Chemometrics and Intelligent Laboratory Systems (2008)