Review of the most common pre-processing techniques for near-infrared spectra

https://doi.org/10.1016/j.trac.2009.07.007

Abstract

Pre-processing of near-infrared (NIR) spectral data has become an integral part of chemometrics modeling. The objective of the pre-processing is to remove physical phenomena in the spectra in order to improve the subsequent multivariate regression, classification model or exploratory analysis. The most widely used pre-processing techniques can be divided into two categories: scatter-correction methods and spectral derivatives. This review describes and compares the theoretical and algorithmic foundations of current pre-processing methods plus the qualitative and quantitative consequences of their application. The aim is to provide NIR users with better end-models through fundamental knowledge on spectral pre-processing.

Introduction

There is no substitute for optimal data collection, but, after proper data collection, pre-processing of spectral data is the most important step before chemometric bi-linear modeling [e.g., Principal Component Analysis (PCA) and Partial Least Squares (PLS)]. There is substantial literature on multivariate spectroscopic applications of food, feed and pharmaceutical analysis, in which comparative pre-processing studies are an integral part. Near-infrared reflectance/transmittance (NIR/NIT) spectroscopy is the spectroscopic technique that has led to by far the largest number and greatest diversity of pre-processing techniques, primarily because the spectra can be significantly influenced by non-linearities introduced by light scatter. Because the wavelengths of NIR electromagnetic radiation are comparable in size to the particles in biological samples, NIR spectroscopy is a battleground for undesired scatter effects (both baseline shift and non-linearities) that will influence the recorded sample spectra. However, by applying suitable pre-processing, these effects can largely be eliminated.

In application studies, comparisons are almost exclusively of the relative performances in the calibration models developed (quantitative descriptor-response relations). Almost no evaluation of the differences and the similarities between the alternative techniques has been reported, and the implications of corrections (e.g., spectral descriptor data) are seldom discussed. This article aims to discuss the relations between the established pre-processing methods for NIR/NIT, restricted to techniques that are independent of response variables, i.e., methods that do not require a response value. We focus on both the theoretical aspects of the pre-processing technique and the practical effect that the operation has on the NIR/NIT spectra.

For solid samples, undesired systematic variations are primarily caused by light scattering and differences in the effective path length. These undesired variations often constitute the major part of the total variation in the sample set, and can be observed as shifts in baseline (multiplicative effects) and other phenomena called non-linearities. In general, NIR-reflectance measurement of a sample will measure diffusely reflected and specularly reflected radiation (mirror-like reflections). Specular reflections are normally minimized by instrument design and sampling geometry, as they do not contain any chemical information. The diffusely reflected light, which is reflected in a broad range of directions, is the primary source of information in the NIR spectra. However, the diffusely reflected light will contain information on not only the chemical composition of the sample (absorption) but also the micro-structure (scattering). The primary forms of light scattering (that do not include energy transfer with the sample) are Rayleigh and Lorentz-Mie. Both are processes in which the electromagnetic radiation is scattered (e.g., by small particles, bubbles, surface roughness, droplets, crystalline defects, microorganelles, cells, fibers, and density fluctuations).

Rayleigh scattering, which is strongly wavelength dependent (∼1/λ⁴), occurs when the particles are much smaller in diameter than the wavelength of the electromagnetic radiation (<λ/10).

When the particle sizes are larger than the wavelength, as is generally the case for NIR spectroscopy, Lorentz-Mie scattering is predominant. In contrast to Rayleigh scattering, Lorentz-Mie scattering is anisotropic, dependent on the shapes of the scattering particles and not strongly wavelength dependent.

For biological samples, the scattering properties are exceedingly complex, so soft (model-free) spectral pre-processing techniques, such as those we discuss in this article, are required to separate the scatter from the pure, desirable absorbance spectra.

Obviously, pre-processing cannot correct for specular reflectance (direct scattering), since the spectra do not contain any fine structure. Spectra dominated by specular reflectance should always be removed as outliers prior to multivariate data analysis, since they will remain outliers, even after pre-processing. Fig. 1 shows a set of 13 good sucrose samples with different particle sizes plus one bad sucrose example showing how (extreme) specular reflectance manifests itself compared to normal spectra.

Fig. 1 also illustrates the general layout of most figures in this article. In the upper part of the figure, a bar graph shows PCA-score values on the first principal component (PC) for the sample set after data mean centering [1]. The lower part shows the effect the pre-processing has on the dataset (or, in this case, no pre-processing). The squared correlation coefficient r² between the bar values and a selected reference variable is included (in this case, known average particle sizes of 13 sucrose samples). For the sucrose dataset, for example, this correlation should be low if particle-originating scatter is assumed to be a hindrance: as little information as possible on the particle size should remain after the right pre-processing.
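The diagnostic described above (first-PC scores after mean-centering, compared with a reference variable through a squared correlation) can be sketched in a few lines. This is a minimal illustration only: the spectra are synthetic rather than the sucrose data, and the helper names `first_pc_scores` and `squared_correlation`, as well as the choice of an SVD-based PCA, are assumptions made for the example.

```python
import numpy as np

def first_pc_scores(spectra):
    """Scores on the first principal component after mean-centering."""
    X = np.asarray(spectra, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each wavelength (column)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, 0] * s[0]                        # one PC1 score per sample

def squared_correlation(a, b):
    """Squared Pearson correlation coefficient between two vectors."""
    r = np.corrcoef(a, b)[0, 1]
    return r ** 2

# Synthetic illustration (hypothetical values): a single absorption band whose
# baseline offset grows with an assumed particle size for 13 samples.
wavelengths = np.linspace(1100, 2500, 200)
band = np.exp(-0.5 * ((wavelengths - 1900) / 60.0) ** 2)
particle_size = np.linspace(50, 500, 13)
spectra = band[None, :] + 0.002 * particle_size[:, None]

scores = first_pc_scores(spectra)
r2 = squared_correlation(scores, particle_size)
print(f"r^2 between PC1 scores and particle size: {r2:.3f}")
```

In this constructed case the only variation between samples is the particle-size-dependent offset, so the first PC captures it completely and the squared correlation is high. After a suitable scatter correction, the same diagnostic would be expected to drop.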

An example of the pre-processed sucrose data can be seen in Fig. 2, which also contains a standard-deviation plot, showing the effect that pre-processing has on the variation between samples for different wavelength regions. The selected pre-processing (detailed later on) removes some, but not all, of the undesired scatter or particle-size information in the spectra, as can be observed from, e.g., the first PC bars.

From now on in this article, we demonstrate the effect of different pre-processing techniques on a small pectin dataset containing only seven samples with very different degrees of esterification (%DE; in the range 0–93%) [2]. These samples were measured in NIR-reflectance mode in the spectral range 1100–2500 nm (collecting every 2-nm interval; Fig. 3). We present the corresponding first-factor PCA sample score after mean-centering as a bar graph, together with the centered absorbance value at wavelength 2244 nm. We selected this peak as it should, in theory, describe the %DE perfectly. For this article, we assume that the information in the spectra that is related to the pectin particle size and shape should be removed by the pre-processing technique, and that the bar graph should show a linear behavior correlated to %DE.

To illustrate the impact of pre-processing on quantification, we use data taken from Christensen et al. [3]. They studied a set of 32 marzipan mixtures, based on nine different recipes, with data available on the Internet (www.models.life.ku.dk). All marzipan samples were measured with six different NIR instruments and chemical-reference analyses for moisture and sugar content were made. Before building a quantitative regression model, it is important to clean the predictor data from unsystematic scatter variations, since they can have a significant impact on the predictive model performance and the model complexity or parsimony. In this article, we use PLS to predict this quantitative response information [4].

Pre-processing techniques

The most widely used pre-processing techniques in NIR spectroscopy (in both reflectance and transmittance mode) can be divided into two categories: scatter-correction methods and spectral derivatives.

The first group of scatter-corrective pre-processing methods includes Multiplicative Scatter Correction (MSC), Inverse MSC, Extended MSC (EMSC), Extended Inverse MSC, de-trending, Standard Normal Variate (SNV) and normalization.

The spectral derivation group is represented by two techniques in this

Scatter corrections

Under scatter-correction methods, we consider three pre-processing concepts: MSC, SNV and normalization. These techniques are designed to reduce the (physical) variability between samples due to scatter. All three also adjust for baseline shifts between samples.
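As a rough illustration of two of these concepts, the sketch below implements SNV and MSC in their commonly used forms. The excerpt does not give the formulas, so the details are assumptions: SNV centers and scales each spectrum by its own mean and standard deviation, and MSC regresses each spectrum on a reference (here the mean spectrum) and removes the fitted intercept and slope. Normalization is not shown. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: center and scale each spectrum (row) by itself."""
    X = np.asarray(spectra, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std

def msc(spectra, reference=None):
    """Multiplicative Scatter Correction against a reference spectrum.

    Assumption: the mean spectrum is used when no reference is supplied.
    """
    X = np.asarray(spectra, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(X)
    for i, x in enumerate(X):
        slope, intercept = np.polyfit(ref, x, 1)   # fit x ~ intercept + slope * ref
        corrected[i] = (x - intercept) / slope
    return corrected

# Synthetic spectra: one underlying band, distorted per sample by an additive
# offset and a multiplicative gain (purely illustrative values).
wavelengths = np.linspace(1100, 2500, 300)
base = np.exp(-0.5 * ((wavelengths - 1900) / 80.0) ** 2)
offsets = np.array([0.0, 0.1, 0.3, 0.5])
gains = np.array([1.0, 1.2, 0.8, 1.5])
spectra = offsets[:, None] + gains[:, None] * base[None, :]

snv_corrected = snv(spectra)
msc_corrected = msc(spectra)
```

Because the synthetic samples differ only by an offset and a gain, both corrections collapse them onto a single curve. Real spectra contain chemical variation as well, so the corrected spectra would not coincide.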

Spectral derivatives

Derivatives have the capability to remove both additive and multiplicative effects in the spectra and have been used in analytical spectroscopy for decades. This concept is demonstrated in Fig. 14 for a simple Gaussian peak with added baseline and baseline plus multiplicative effect. The first derivative removes only the baseline; the second derivative removes both baseline and linear trend. In this article, we discuss two different methods: Savitzky-Golay (SG) and Norris-Williams (NW). Both derivation techniques use smoothing
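The behavior of the first and second derivative on a Gaussian peak can be checked numerically. The sketch below uses SciPy's `savgol_filter` as one possible Savitzky-Golay implementation. The window length, polynomial order, and the sizes of the offset and linear trend are arbitrary choices for illustration, not settings taken from the article, and the helper `sg_derivative` is hypothetical.

```python
import numpy as np
from scipy.signal import savgol_filter

# Simple Gaussian peak, then the same peak with a constant baseline offset,
# and with an offset plus a linear trend (illustrative values).
x = np.linspace(-5, 5, 201)
dx = x[1] - x[0]
peak = np.exp(-0.5 * x ** 2)
with_offset = peak + 0.5
with_slope = peak + 0.5 + 0.1 * x

def sg_derivative(y, order, window=11, polyorder=2):
    """Savitzky-Golay smoothed derivative of the given order."""
    return savgol_filter(y, window_length=window, polyorder=polyorder,
                         deriv=order, delta=dx)

d1_peak = sg_derivative(peak, 1)
d1_offset = sg_derivative(with_offset, 1)
d1_slope = sg_derivative(with_slope, 1)
d2_peak = sg_derivative(peak, 2)
d2_slope = sg_derivative(with_slope, 2)

# First derivative: the constant offset vanishes, the linear trend remains
# as a constant shift of 0.1.
print(np.abs(d1_offset - d1_peak).max())
print((d1_slope - d1_peak).mean())
# Second derivative: offset and linear trend both vanish.
print(np.abs(d2_slope - d2_peak).max())
```

The price of differentiation is noise amplification, which is why the window length and polynomial order matter in practice. This noise-free example does not show that trade-off.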

Interval and combined versions

Of the pre-processing techniques mentioned thus far, only the estimation of the derivatives operates by a moving-window operation, where only a local part (window) of the spectra is used at any time to estimate the correction. However, all the other methods can equally well be performed in a window-wise fashion.
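To make the window-wise idea concrete, the sketch below applies SNV separately within contiguous wavelength intervals instead of over the whole spectrum. This is only one simple way to localize a correction. It is not the piecewise MSC procedure from the literature, and the function name `interval_snv`, the number of intervals, and the random test spectra are assumptions for illustration.

```python
import numpy as np

def interval_snv(spectra, n_intervals):
    """Apply SNV independently within contiguous wavelength intervals."""
    X = np.asarray(spectra, dtype=float)
    out = np.empty_like(X)
    # Split the variable indices into roughly equal, non-overlapping intervals.
    for idx in np.array_split(np.arange(X.shape[1]), n_intervals):
        block = X[:, idx]
        mean = block.mean(axis=1, keepdims=True)
        std = block.std(axis=1, ddof=1, keepdims=True)
        out[:, idx] = (block - mean) / std
    return out

# Synthetic smooth-ish "spectra": cumulative sums of random noise.
rng = np.random.default_rng(0)
spectra = rng.normal(size=(7, 700)).cumsum(axis=1)
corrected = interval_snv(spectra, n_intervals=5)
```

Non-overlapping intervals introduce discontinuities at the interval borders. A moving window avoids these, but costs more computation and adds a window-size parameter to tune.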

Isaksson and Kowalski [26] suggested this for MSC, and named it piecewise MSC (PMSC). Andersson [27] compared alternative pre-processing methods with two versions of PMSC: moving-window

A quantitative example

We will now apply all the pre-processing methods discussed to a quantitative spectroscopic task involving 32 marzipan samples measured on six very distinct spectrometers as predictor variables for two different response variables: moisture and sugar content. The data are taken from a study by Christensen et al. [3]. Fig. 20 shows one of the spectral sets. For a summary of the data, see Table 1. Here, we show the PLS-regression models, which were built for all the six NIR instruments, and

Concluding remarks

Obviously, our quantitative example does not give the authoritative answer about which pre-processing to use in any given case. However, it does appear sensible to use normalization for short-wave NIR-transmission spectra and to use MSC (with first-order reference correction) or standard SNV for most other cases.

While it is difficult to find the best pre-processing, it is indeed possible to use wrong pre-processing. This will primarily happen due to incorrect parameter settings of window size

References (29)

  • S. Wold et al., Chemom. Intell. Lab. Syst. (1987)
  • H. Martens et al., J. Pharm. Biomed. Anal. (1991)
  • I.S. Helland et al., Chemom. Intell. Lab. Syst. (1995)
  • Q. Guo et al., Anal. Chim. Acta (1999)
  • C.A. Andersson, Chemom. Intell. Lab. Syst. (1999)
  • R. Wehrens et al., Chemom. Intell. Lab. Syst. (2000)
  • C.B. Zachariassen et al., Chemom. Intell. Lab. Syst. (2005)
  • S.B. Engelsen et al., Progr. Colloid Polym. Sci. (1998)
  • J. Christensen et al., J. Near Infrared Spectrosc. (2004)
  • S. Wold et al., Lect. Notes Math. (1983)
  • H. Martens et al., Multivariate Calibration (1989)
  • H. Martens et al., Multivariate linearity transformations for near infrared reflectance spectroscopy
  • P. Geladi et al., Appl. Spectrosc. (1985)
  • D.K. Pedersen et al., Appl. Spectrosc. (2002)