Abstract

This paper reports a rapid identification method for a Chinese green tea with PGI, Anji-white tea, by class modeling techniques and NIR spectroscopy. 167 real and representative Anji-white tea samples were collected from 8 tea plantations in their original producing areas for model training. Another 81 non-Anji-white tea samples of similar appearance were collected from 7 important tea producing areas and used for validation of model specificity. Diffuse NIR spectra were measured with finely ground tea powders. OCPLS and SIMCA were used to describe the distribution of representative Anji-white tea objects and predict the authenticity of new objects. For data preprocessing, smoothing, derivatives, and SNV were applied to improve the raw spectra and classification performance. It is demonstrated that taking derivatives and SNV can improve classification accuracy and reduce the complexity of class models by removing spectral background and baseline. For the best models, the sensitivity and specificity were 0.886 and 0.951 for OCPLS, 0.886 and 0.938 for SIMCA with SNV spectra, respectively. Although it is difficult to perform an exhaustive analysis of all types of potential false objects, the proposed method can detect most of the important non-Anji-white teas in the Chinese market.

1. Introduction

As one of the most popular beverages in the world, tea is favored because of its pleasant aroma, taste, and putative health benefits [1–7]. Teas can be generally grouped into three principal types according to the degree of fermentation: unfermented, partially fermented, and fully fermented [8].

China has a long history of tea cultivation, processing, and consumption. Among the various types, green tea accounts for the bulk of total production and is favored by most Chinese consumers. In China, tea producing areas cover most of the central and southern provinces, which differ greatly in geographical and natural conditions as well as in tea species, cultivation techniques, and processing procedures. Therefore, almost all of the famous teas in China are named after their origins. Among the most famous teas, Anji-white tea with protected geographical indication (PGI) has a somewhat confusing name because it is a typical green tea. It is called a “white” tea because its leaves are very light in color due to their low chlorophyll and polyphenol contents [9]. Its processing procedure, which consists of withering, pan firing, and shaping, followed by gentle firing over charcoal, makes it a green tea. This specific variety of tea bush, reported in ancient literature, was rediscovered growing wild in the 1980s at an altitude of over 800 m. It is now cultivated in the mountains of Anji County, along the Huangpu River, in a spectacular area of heavy mist and vast forests of bright green bamboo. The flat and straight leaves produce a lasting fragrance and a unique taste. Traditional tea-tasting specialists attribute the high quality of Anji-white tea to its species, growing environment, and processing procedure. Therefore, PGI authentication of Anji-white tea is needed to identify false products and protect consumer interests. Numerous studies have investigated how the chemical composition of teas is influenced by various factors [10–15], including species, season, age of the leaves (plucking position), climate, and horticultural conditions (soil, water, minerals, fertilizers, etc.).
Such investigations are important for understanding the biological and health effects of teas but usually lack a comprehensive view of the chemical composition. Because the chemical composition of teas is very complex, it is very difficult to perform a thorough component analysis and to represent the quality or class of a tea by the contents of a few chemical components. In traditional sensory analysis, the quality of teas is evaluated by professional tea tasters. Because training a qualified tea taster may take years and is very expensive [16], instrumental techniques are an attractive alternative for evaluating tea quality.

Recently, spectroscopy coupled with chemometric methods has been widely applied in food analysis [17–22]. The principle of such techniques is that the chemical composition of a complicated sample is represented by multivariate spectral signals; relevant and useful information concerning food quality parameters can then be extracted by multivariate analysis methods. Near infrared (NIR) spectroscopy has been one of the most commonly used spectroscopic techniques in food quality evaluation and has some advantages over traditional chemical analysis methods, including lower sample preparation requirements, reduced analysis time and cost, the ability to perform simultaneous multicomponent analysis, and the potential for online analysis [23].

The performance of NIR spectroscopy analysis depends heavily on the proper use of chemometric methods. It has been pointed out that PGI authentication is a typical one-class problem [24], where a decision needs to be made on whether a new object should be accepted or rejected by a target class. In such cases, the commonly used classification methods that discriminate between two or more predefined classes are unsuitable for several reasons. Firstly, PGI identification requires identifying various unknown false objects, which are difficult to collect and analyze exhaustively. Moreover, a discrimination/classification model would be highly complex and have poor generalization performance if it included many different classes of training objects. Therefore, a class model is required to describe the representative samples belonging to the target class and predict the identities of unknown objects. A class model aims at describing the distribution of a target class and has reduced model complexity. However, the sensitivity and specificity of a class model should be sufficiently validated to ensure its usefulness. The sampling procedure should be representative and comprehensive enough to include most if not all of the significant variations likely to be encountered in future test materials [18].

With the above considerations, the objective of this paper is to develop a rapid and well-validated PGI authentication method for Anji-white tea by using class modeling techniques and NIR spectroscopy, with emphasis on representative sample collection and validation of class models.

2. Materials and Methods

2.1. Tea Samples Analyzed

167 authentic Anji-white and 81 other main non-Anji-white tea samples were collected directly from the market branches of tea plantations in original producing areas with official certifications. All of the samples were made of green tea leaves picked before Qingming Festival 2011 (April 5, 2011). The detailed information concerning samples is shown in Table 1. All of the samples were stored in a cool, dark, and dry place with integral packaging before spectroscopic analysis.

2.2. FTIR Spectroscopy

Diffuse NIR spectra were collected using a Bruker-TENSOR37 FTIR spectrometer (Bruker Optics, Ettlingen, Germany) in the wavenumber range from 4000 to 12000 cm−1. Tea samples were finely ground into particles using an agate pestle and mortar and passed through a 40-mesh sieve. The powders were then packed into a NIR sample cup. The sample cup was filled fully and compacted naturally without further pressing. For each sample, 128 scans were performed with a resolution of 8 cm−1 at room temperature using OPUS software. An increase in scanning time did not significantly improve the signal. The average of the 128 scans was used as a raw spectrum for further data analysis. The scanning interval was 3.857 cm−1; therefore, each raw spectrum had 2074 individual data points.

2.3. Outlier Diagnosis and Data Preprocessing

All of the data analysis was performed in MATLAB 7.0.1 (Mathworks, Sherborn, MA, USA). In practical data analysis, outliers in the data can cause model bias or even breakdown of the models. Therefore, robust principal component analysis (rPCA) [25] was used to detect the outliers. rPCA can overcome the masking effects caused by the presence of multiple outliers. Considering the high-dimensional nature of the NIR spectral data (for the raw spectra, each object has 2074 variables), an improved rPCA [26] was used, which was shown to be more numerically stable for high-dimensional data and to have a moderate computational cost. According to the computed score distance (SD) and orthogonal distance (OD), an rPCA diagnosis plot classifies the samples into four groups: regular data (with small SD and small OD), good PCA-leverage points (with large SD and small OD), orthogonal outliers (with small SD and large OD), and bad PCA-leverage points (with large SD and large OD).
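The four groups of the diagnosis plot can be illustrated with a minimal Python sketch. It assumes the SD and OD values and their cutoffs have already been computed by the robust PCA step; the function name, the sample values, and the cutoff values are hypothetical and are not taken from the paper.

```python
def classify_outlier(sd, od, sd_cutoff, od_cutoff):
    """Assign one sample to a group of the rPCA diagnosis plot.

    sd, od: score distance and orthogonal distance of the sample.
    sd_cutoff, od_cutoff: thresholds separating "small" from "large"
    (hypothetical inputs; the paper does not list the cutoff values).
    """
    large_sd = sd > sd_cutoff
    large_od = od > od_cutoff
    if large_sd and large_od:
        return "bad PCA-leverage point"
    if large_sd:
        return "good PCA-leverage point"
    if large_od:
        return "orthogonal outlier"
    return "regular"


# Made-up (SD, OD) pairs, for illustration only
samples = {"a": (1.2, 0.3), "b": (4.5, 0.4), "c": (0.9, 2.8), "d": (5.1, 3.3)}
groups = {
    name: classify_outlier(sd, od, sd_cutoff=3.0, od_cutoff=1.5)
    for name, (sd, od) in samples.items()
}
```

In this sketch only the groups with a large OD ("orthogonal outlier" and "bad PCA-leverage point") would be candidates for removal.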

The data with outliers removed were then split into a representative training set and a test set by the Kennard and Stone (K-S) algorithm [27]. The K-S algorithm selects a representative test set in such a way that the objects are scattered uniformly in the range of the training objects. Because the distributions of tea samples from each producing area were not the same, the K-S method was performed separately for teas from different producing areas. For class modeling, the training and test samples selected by the K-S algorithm from each producing area were then pooled to form the overall training and test sets.
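A minimal sketch of the K-S selection rule in plain Python follows. It assumes Euclidean distances and treats the selected objects as the training set and the remainder as the test set, which is the common usage and an assumption about how the split was done here; the function name and the toy data are hypothetical.

```python
import math


def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by the Kennard-Stone rule."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of points")
    # Full Euclidean distance matrix (fine for a few hundred samples)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Start with the two most distant points
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda ij: dist[ij[0]][ij[1]],
    )
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_select:
        # Pick the candidate whose nearest selected point is farthest away
        best = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy one-dimensional "spectra", for illustration only
toy = [(0.0,), (1.0,), (2.0,), (10.0,), (5.0,)]
train_idx = kennard_stone(toy, 3)
test_idx = [k for k in range(len(toy)) if k not in train_idx]
```

To mirror the per-area split described above, the function would be called once for each producing area and the resulting index sets pooled.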

Smoothing, taking derivatives, and standard normal variate (SNV) [28] were used to improve the training and prediction performance of class models. Smoothing can suppress random noise in spectra and improve the signal-to-noise ratio (SNR). The Savitzky-Golay (S-G) polynomial fitting algorithm [29] was used considering its popularity and simplicity. Taking derivatives can enhance spectral resolution and remove baseline and background, so first-order and second-order derivatives were used. To prevent the degradation of SNR by differencing, derivatives were also computed by S-G algorithms. SNV has been shown to be effective in reducing scattering effects and correcting the interference caused by variations in optical path. In this paper, SNV was used to reduce the spectral variations caused by possible differences in powder bulk density.
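The preprocessing steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' MATLAB code: the S-G window width and polynomial order are not given in the text, so the standard 5-point quadratic coefficients are assumed, and end points are simply dropped. All function and constant names are hypothetical.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own
    mean and (sample) standard deviation."""
    m = mean(spectrum)
    s = stdev(spectrum)
    return [(x - m) / s for x in spectrum]


# Standard Savitzky-Golay coefficients for a 5-point quadratic fit
# (window width and order are assumptions, not values from the paper)
SG5_SMOOTH = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]
SG5_DERIV1 = [-2 / 10, -1 / 10, 0.0, 1 / 10, 2 / 10]


def sg_filter(spectrum, coeffs, step=1.0, order=0):
    """Apply S-G coefficients to the interior points of a spectrum.

    step: spacing between data points (e.g. in cm-1);
    order: derivative order the coefficients correspond to (0 = smoothing).
    The output is shorter than the input by len(coeffs) - 1 points.
    """
    half = len(coeffs) // 2
    out = []
    for i in range(half, len(spectrum) - half):
        window = spectrum[i - half:i + half + 1]
        out.append(sum(c * x for c, x in zip(coeffs, window)) / step ** order)
    return out


# A sloping baseline: its first derivative is constant, so any constant
# offset disappears after differentiation
baseline = [2 + 0.5 * i for i in range(10)]
smoothed = sg_filter(baseline, SG5_SMOOTH)
first_deriv = sg_filter(baseline, SG5_DERIV1, step=1.0, order=1)
scaled = snv(baseline)
```

Because SNV uses each spectrum's own mean and standard deviation, an additive offset or a positive multiplicative factor applied to a whole spectrum leaves its SNV transform unchanged, which is why it can compensate for effects such as differences in powder packing.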

2.4. Class Modeling Techniques

Recently, a new class modeling technique was proposed by us using one-class partial least squares (OCPLS) [30] regression. It was used for authentication of pure sesame oils by mid-infrared spectra and was demonstrated to have a performance comparable to that of soft independent modeling of class analogy (SIMCA) [24]. OCPLS develops a partial least squares regression model relating the features to a class response vector 1 with all the elements being ones. The use of 1 as a response vector means that all the objects in the same class should be distributed as compactly as possible. Unlike SIMCA, which projects the raw variables onto a few principal components (PCs) explaining most of the data variance, OCPLS considers both the explained variance and the compactness of a class by projecting the raw features onto the class average. The modeling error of the response variable is assumed to have a normal distribution and is used as the distance measurement from an object to the class center. The class center is estimated as the mean of the modeling error. Since OCPLS can be performed in the framework of multivariate calibration, estimation of its model complexity is more straightforward than for SIMCA; for example, a well-established F-test combined with Monte Carlo cross validation (MCCV) [31] was demonstrated to be effective in reducing the risk of overfitting [32].
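The OCPLS description above can be written compactly as follows. The notation is an assumption introduced here for illustration: X is the matrix of training spectra, b the PLS regression vector, and e the modeling error. The acceptance rule in the second line is one plausible reading of "distance from the class center" under the normality assumption, not necessarily the exact decision rule of [30].

```latex
\begin{align*}
% Regression of the all-ones response on the spectra
\mathbf{1} &= \mathbf{X}\mathbf{b} + \mathbf{e}, \qquad
e_i = 1 - \mathbf{x}_i^{\mathsf{T}}\mathbf{b}, \qquad
e_i \sim N(\mu, \sigma^2) \\
% Assumed acceptance rule for a new object at significance level \alpha
\left| e_{\mathrm{new}} - \hat{\mu} \right| &\le z_{1-\alpha/2}\,\hat{\sigma}
\end{align*}
```

Here the estimated mean and standard deviation of the training errors play the roles of class center and class spread, and the number of PLS components used to compute b is the model complexity selected by MCCV.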

SIMCA describes the class structure of the training objects by the PCs space spanned by a few significant PCs. The magnitude of residual error can be used as a distance measurement from an object to the class center. To reject or accept a new object, its residuals can be tested with an F-test procedure. It was realized that the residual error could be underestimated when it is computed directly from PCA of the training samples. This would lead to a large number of objects that are wrongly rejected (a large α-error); therefore, residuals predicted by leave-one-out cross-validation (LOOCV) rather than the training residuals were used [33]. This procedure was shown to be effective in reducing the number of false outliers.

3. Results and Discussion

Some of the spectra of the authentic Anji-white tea and non-Anji-white tea samples are shown in Figure 1. The spectral range of 9000–12000 cm−1 carries little chemical information and has a very low SNR, so this wavenumber range was excluded from further data analysis. As seen in Figure 1, all the teas have very similar absorbance bands in the range of 4000–9000 cm−1. The wide bands in 8000–9000 cm−1 can be attributed to the second overtone of –C–H stretching. Peaks in 6000–7000 cm−1 involve the contributions of O–H stretching vibrations and stretching vibration of N–H (~6800 cm−1) in amino acids. Other obvious bands include 5600 cm−1 (fundamental stretching of –C–H), 5200 cm−1 (combination of O–H and C–O stretching), 4700 cm−1 (combination of O–H bending and C–O stretching), and 4300 cm−1 (combination of C–H stretching and –CH2 deformation). The raw spectra are highly overlapped and characterized by poor peak resolution, so accurate assignment of specific peaks is very difficult.

The low level of detail in the raw spectra can be attributed to the contributions of multiple components and the shifts and distortions resulting from their interactions. Although different tea varieties had very similar absorbance patterns, the relative intensities of the bands differed. Therefore, class modeling techniques are useful for extracting the subtle information in the spectral data that characterizes real Anji-white teas.

To improve the classification performance of class models, smoothing, first- and second-order S-G derivatives, and SNV were used to preprocess the raw spectra. Some of the preprocessed spectra of Anji-white and non-Anji-white teas are plotted in Figure 2. As seen in Figure 2, although smoothing can slightly improve the SNR of the raw spectra, it carries the risk of losing some useful high-frequency information in the raw data. The second-order derivative spectra can remove most of the baselines and enhance some detailed information. SNV spectra can reduce some spectral variations while enhancing others. The actual effects of data preprocessing should be evaluated by classification performance.

Outlier detection was performed by rPCA of the raw spectra. The number of PCs was determined by robust pooled predicted residual sum of squares (PRESS) values. Following the rule of thumb, the first seven PCs were selected to account for 95.32% (>95%) of the total data variance. The rPCA diagnosis plot of the 167 authentic Anji-white tea samples is shown in Figure 3. OD is a measure of the distance from the sample to the model space spanned by the selected PCs, and SD describes the sample dispersion in the class projected onto the model space. Therefore, both orthogonal outliers and bad PCA-leverage points should be excluded from the training set. Because the real Anji-white tea samples came from different producing areas, there might be considerable differences in the contents of different chemical components. Therefore, good PCA-leverage points (objects 49, 51, and 135) were retained to represent the spectral variations among the real Anji-white tea samples from different producing areas. In Figure 3, three orthogonal outliers (objects 26, 115, and 156) were removed. The K-S algorithm was then used to split the remaining 164 Anji-white samples into a training set of 120 samples and a test set of 44 samples. Therefore, the test set had 44 positive (Anji) objects and 81 negative (non-Anji) objects.

OCPLS and SIMCA models were developed to describe the distribution of real Anji-white tea samples. For SIMCA, the improved decision region [33] was adopted to reduce the risk of having a large number of objects wrongly rejected. Cross-validation was performed to evaluate the number of significant PCs; the criterion of 95% total explained variance was also considered. For OCPLS, Monte Carlo cross-validation (MCCV) with 10% of the objects left out was used to determine the number of PLS components, and the number of sampling runs was 100. The PRESS values obtained by MCCV were subjected to an F-test [32]. As suggested, a significance level of 0.25 was adopted to select the smallest number of latent variables with a PRESS value not significantly larger than the minimum value according to the F-test. Sensitivity and specificity were used to evaluate the performance of different models and preprocessing options. The training and prediction results of SIMCA and OCPLS are shown in Table 2. As seen from Table 2, preprocessing generally improved the classification performance in terms of sensitivity and specificity. However, the models based on smoothed spectra seem to have inferior performance, which might be attributed to the possible loss of detailed frequency information. Second derivative and SNV significantly improved the class models by reducing the baseline and background. The model complexity of SIMCA and OCPLS based on such preprocessing was reduced. For the best models, the sensitivity and specificity were 0.886 and 0.951 for OCPLS and 0.886 and 0.938 for SIMCA with SNV spectra, respectively. For both OCPLS and SIMCA, the best class models were obtained by SNV preprocessing, and the prediction results are shown in Figures 4 and 5. The comparison of different preprocessing methods demonstrated that the spectral variations caused by scattering effects and baseline shifts played a more important role than an inferior SNR.
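The reported figures of merit can be related to the test set sizes with a short worked example. The test set sizes (44 authentic and 81 non-Anji objects) come from the text; the individual counts of correct and incorrect decisions below are not reported in the text and are inferred here only because they reproduce the published rates after rounding.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of authentic (target class) objects that are accepted."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-target objects that are rejected."""
    return true_neg / (true_neg + false_pos)


# Inferred counts (assumption): 39 of 44 authentic test objects accepted
sens_best = round(sensitivity(39, 5), 3)    # matches the reported 0.886
# Inferred counts (assumption): 77 of 81 non-Anji objects rejected by OCPLS
spec_ocpls = round(specificity(77, 4), 3)   # matches the reported 0.951
# Inferred counts (assumption): 76 of 81 non-Anji objects rejected by SIMCA
spec_simca = round(specificity(76, 5), 3)   # matches the reported 0.938
```

Under these inferred counts, the two best models would differ by a single non-Anji test object.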

4. Conclusions

Rapid identification methods for a PGI green tea, Anji-white tea, were developed using NIR spectroscopy and chemometric class modeling techniques. With SNV preprocessing, OCPLS (sensitivity 0.886 and specificity 0.951) and SIMCA (sensitivity 0.886 and specificity 0.938) achieved the best classification performance in terms of prediction sensitivity and specificity. The results indicate that removal of spectral background and baseline plays a more important role than a higher SNR. Taking derivatives and SNV transformation can not only improve classification accuracy but also reduce the complexity of OCPLS and SIMCA models. Although it is hard to perform an exhaustive collection and analysis of all types of white teas, this study provides a reliable and effective tool to identify Anji-white tea against most of the important non-Anji-white teas in the Chinese market.

Authors’ Contribution

Lu Xu and Peng-Tao Shi equally contributed to this work.

Acknowledgments

The authors thank the National Public Welfare Industry Projects of China (nos. 201210010 and 201210092) and the Hangzhou Programs for Agricultural Science and Technology Development (no. 20101032B28) for their financial support. C.-B. Cai is also grateful for the financial support of the Applied and Basic Research Project of Yunnan Provincial Science and Technology Department (no. 2010CD087).