Introduction

Cachaça, like other spirits, is often aged in wooden barrels. Besides improving its taste and consumer acceptance, aging increases its commercial value. For example, a famous cachaça brand from Salinas (Brazil) that is aged in balsam wood for 8 years can cost up to twenty times more than common ones.

According to Brazilian legislation (MAPA 2005), truly aged cachaça is the beverage stored in wooden barrels for 1 year or more, and it is classified according to the aging time and the proportion of aged distillate in the final product. The specific class of an aged cachaça may appear on the label only when the aging process has been inspected and certified by MAPA (Ministry of Agriculture, Livestock and Food Supply). Because of the difficulties inherent in this certification process, many small producers age their cachaça and label it as “stored” rather than aged.

The wood most commonly used to produce barrels for aging beverages is oak (Quercus sp.). However, because oak is not a typical tree of tropical climates, oak barrels come at a very high price for Brazilian producers. An alternative is to produce barrels from native woods. Umburana (Amburana cearensis), jequitibá (Cariniana estrellensis), balsam (Myroxylon peruiferum), and amendoim (Pterogyne nitens) are the native woods most used for aging cachaças produced in Minas Gerais and other states of Brazil.

With the increasing production of quality cachaças and the growing demand in foreign and local markets, fast and reliable methods are needed to discriminate cachaças by the type of wood used in the aging process. Such methods prevent a low-quality product from being sold as a higher-quality one, contributing to the consolidation of a good image for this iconic Brazilian distillate. To classify cachaças according to the wood used for their aging, some authors have published studies using chemometric methods of pattern recognition (PCA, principal component analysis) and discrimination (LDA, linear discriminant analysis), mainly combined with UV-Vis spectroscopy (Faria et al. 2003; Silva et al. 2012), but also with ESI-MS (electrospray ionization mass spectrometry) (Souza et al. 2007) and HPLC (high-performance liquid chromatography) (Silva et al. 2009). These papers used different analytical techniques to discriminate or classify cachaças according to the wood used in their aging, but no study has compared the techniques or combined them to obtain more robust models. Thus, the aim of this work was to explore the capabilities and performances of UV-Vis spectrophotometry and spectrofluorimetry coupled to PLS-DA (partial least squares for discriminant analysis) and NPLS-DA (N-way PLS-DA) for discriminating cachaças according to the wood used in their aging. In addition to the models built individually with the data obtained from each of these two techniques, two data fusion strategies, at low and mid levels, were also considered in order to improve the results. Another important aspect of this study is the use of a significant number of commercial cachaça samples, more than a hundred, for constructing the classification models.

Materials and Methods

Samples Description

One hundred and nine aged cachaça samples (originating from six different states of Brazil) were purchased in local markets and analyzed. These samples were labeled as aged in different woods: amendoim (19), balsam (23), oak (21), jequitibá (23), and umburana (23). In addition, 15 cachaça samples were aged in these five woods (three samples in each one) in our laboratory and analyzed. These samples were kept in 1.5-L casks for 6 months.

Spectra Acquisition

The samples were diluted 30 times with ethanol/water (40:60 %, v/v) directly in the cuvette, and the UV-Vis spectra were recorded from 190 to 500 nm (2 nm steps) in a Hewlett-Packard model 8451 A diode array spectrophotometer, using a 10-mm quartz cuvette. Fluorescence spectra were obtained in a Varian Cary Eclipse spectrofluorimeter, using a 10-mm quartz cuvette, without any sample pre-treatment. All the excitation-emission matrices (EEM) were obtained in the excitation range from 240 to 500 nm (10 nm steps) and in the emission range from 260 to 600 nm (2 nm steps). The excitation and emission monochromator slit widths were both 5.0 nm and the scanning rate was 9600 nm min−1.
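
The acquisition ranges above fix the number of variables per sample, which is worth checking against the matrix dimensions reported later (156 UV-Vis wavelengths, 171 emission and 27 excitation wavelengths). The short Python sketch below does this arithmetic; the function name grid_points is a hypothetical helper introduced here, and the grids are assumed to include both end points.

```python
def grid_points(start_nm, stop_nm, step_nm):
    """Number of points on an evenly spaced wavelength grid, end points included."""
    span = stop_nm - start_nm
    if span % step_nm != 0:
        raise ValueError("stop wavelength does not fall on the grid")
    return span // step_nm + 1


# Ranges and steps as stated in the text (assumption: inclusive end points).
uv_vis_points = grid_points(190, 500, 2)        # UV-Vis absorbance channels
emission_points = grid_points(260, 600, 2)      # EEM emission wavelengths
excitation_points = grid_points(240, 500, 10)   # EEM excitation wavelengths

print(uv_vis_points, emission_points, excitation_points)
```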

Data Treatment and Analysis

Data were handled using MATLAB, version 7.9 (The MathWorks, Natick, MA, USA). The PLS-DA and NPLS routines came from the PLS Toolbox, version 6.5 (Eigenvector Technologies, Manson, WA, USA). UV-Vis spectra were preprocessed by mean centering. The EEM were preprocessed to remove Rayleigh and Raman scattering (Bahram et al. 2006), and the models were obtained after mean-centering the first mode of the EEM. The fused matrices were preprocessed by autoscaling. All the pretreatments used are listed in Table 1.

Table 1 Pretreatment, LV, and variance captured by each model in the different data treatment strategies

UV-Vis spectra were split into two data sets, 82 samples for the training set and 42 samples for the test set, using the Kennard-Stone algorithm. This is a uniform mapping algorithm able to select the most representative samples for the training set based on their Euclidean distances (Kennard and Stone 1969). To ensure the representativeness of each class, the algorithm was applied to each class separately, and the same assignment of samples to the training and test sets was kept for all the models, so that the results obtained are fully comparable.
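
The class-wise Kennard-Stone selection described above can be sketched in plain Python as follows. This is a minimal illustration, not the routine used in the study: the function names kennard_stone and split_by_class and the argument layout are assumptions, and spectra are represented as plain lists of numbers.

```python
import math


def kennard_stone(samples, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone rule."""
    n = len(samples)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Pairwise Euclidean distances between all samples.
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [k for k in range(n) if k not in (first, second)]
    while len(selected) < n_select:
        # Add the candidate whose nearest selected sample is farthest away.
        best = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


def split_by_class(samples, labels, n_train_per_class):
    """Apply Kennard-Stone within each class; return (train, test) index lists."""
    train = []
    for cls, n_train in n_train_per_class.items():
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        picked = kennard_stone([samples[i] for i in idx], n_train)
        train.extend(idx[p] for p in picked)
    train_set = set(train)
    test = [i for i in range(len(samples)) if i not in train_set]
    return sorted(train), test
```

Because the function returns index lists, the same training and test indices can be reused for the UV-Vis, EEM, and fused data, which is what keeps the models comparable.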

PLS-DA is a variant of PLS used for supervised classification, in which a model is constructed correlating the spectral data to class membership vectors. One vector of dummy variables is built for each class, with a value of 1 if a sample belongs to that class and 0 if it does not. The spectral data and the dummy variable vectors are simultaneously decomposed into LV (latent variables), searching for the maximum discrimination among classes. The number of LV is chosen based on the lowest value of the cross-validation classification error (CVCE), estimated using venetian blinds cross-validation (Wise et al. 2006). A complete description of the method is given by Barker and Rayens (2003).
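
Two of the ingredients just described, the dummy class-membership vectors and the venetian blinds splits, are easy to illustrate in Python. The sketch below is only illustrative: the function names are hypothetical, and the number of splits is a free parameter because it is not stated in the text.

```python
def dummy_matrix(labels, classes):
    """One row per sample, one column per class: 1 if the sample belongs, else 0."""
    return [[1 if lab == cls else 0 for cls in classes] for lab in labels]


def venetian_blinds(n_samples, n_splits):
    """Yield (train_idx, validation_idx) pairs.

    Every n_splits-th sample goes to the same validation fold, so each fold
    takes evenly spaced "slats" from the ordered sample list.
    """
    for fold in range(n_splits):
        validation = list(range(fold, n_samples, n_splits))
        train = [i for i in range(n_samples) if i % n_splits != fold]
        yield train, validation


# Example with the five wood classes of this study and three toy samples.
classes = ["amendoim", "balsam", "oak", "jequitibá", "umburana"]
Y = dummy_matrix(["oak", "balsam", "oak"], classes)
folds = list(venetian_blinds(7, 3))
```

In a full workflow, a PLS-DA model would be fitted on each training part for each candidate number of LV, and the number giving the lowest classification error on the validation parts would be retained.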

NPLS is a method for building multi-way calibration models and is an extension of the classical two-way PLS for cases in which the independent variables form a tensor of third or higher order. In applications treating EEM, the independent variables are in a three-way data array formed by the fluorescence intensity measurements of various samples, and the dependent variable is in a concentration vector. NPLS decomposes a multidimensional data array X into a set of triads. Each triad is equivalent to a latent variable (a component) in the two-way PLS and consists of a score vector, t, related to the first way, and two weight vectors, w^J and w^K, related to the other two ways (emission and excitation wavelengths). A complete description of this method is given by Bro (1996). When used for discriminant analysis (NPLS-DA), this method allows the supervised classification of samples by separating classes with a high degree of similarity (Hall et al. 2005).
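
The triad decomposition described above can be written compactly as a trilinear model. The expression below is a sketch of that structure only; the indices i, j, k, the number of triads F, and the residual term e are notation introduced here for illustration and do not come from the original text.

```latex
% i: sample, j: emission wavelength, k: excitation wavelength
% F: number of triads (factors), e_{ijk}: residual not captured by the model
x_{ijk} = \sum_{f=1}^{F} t_{if}\, w^{J}_{jf}\, w^{K}_{kf} + e_{ijk}
```

Each term of the sum is one triad: the score t of sample i on factor f, multiplied by the emission weight and the excitation weight of that factor.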

With the wide availability of analytical techniques and the increasing capacity of data processing, it has become easy to obtain large amounts of data, which has allowed the emergence of data fusion models. These models process data provided by different analytical instruments or sensors, and different chemometric methods may be required for this processing. For each sample, all spectral variables from different instrument types and sources are concatenated into a single vector, known as a meta-spectrum (Godinho et al. 2014). Compilation of data from different nonspecific techniques provides complementary interpretations and facilitates full product description (Vera et al. 2011).

There are three strategies for data fusion: low-level, mid-level, and high-level. Low-level data fusion directly combines the original signals after preprocessing and then builds the model with the concatenated matrix. Mid-level fusion involves extracting the characteristic features of the data sets obtained for each technique before data fusion and model construction. Finally, in the high-level strategy, a multivariate model is built separately for each technique and the individual responses are combined to produce the final result. Biancolillo et al. (2014) have presented a discussion on this subject, highlighting the benefits and drawbacks of each data fusion level.
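
The low-level and mid-level strategies share the same mechanical step: two blocks of variables measured on the same samples are joined side by side and then scaled. The Python sketch below illustrates this with autoscaling, the pretreatment reported for the fused matrices. It is a minimal illustration under stated assumptions: the function names fuse and autoscale are hypothetical, and the sample standard deviation (n − 1 denominator) is assumed for the scaling.

```python
import statistics


def fuse(block_a, block_b):
    """Concatenate two blocks sample by sample (rows must be in the same order)."""
    if len(block_a) != len(block_b):
        raise ValueError("both blocks must contain the same samples")
    return [list(a) + list(b) for a, b in zip(block_a, block_b)]


def autoscale(matrix):
    """Mean-center each column and scale it to unit sample standard deviation."""
    columns = list(zip(*matrix))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in matrix
    ]


# Low-level fusion: the blocks are the (preprocessed) spectra themselves.
# Mid-level fusion: the blocks are features extracted from each technique,
# for example the scores of the individual models.
uv_block = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]      # toy values
fluo_block = [[10.0], [20.0], [30.0]]                # toy values
fused = autoscale(fuse(uv_block, fluo_block))
```

Autoscaling after concatenation gives every variable the same variance, so a block with larger raw intensities does not dominate the fused model.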

Data fusion has been applied especially to the analysis of complex food matrices, such as olive oils (Casale et al. 2010; Casale et al. 2012; Pizarro et al. 2013), meat (Alamprese et al. 2013), spices (Di Anibal et al. 2011), and beers (Vera et al. 2011; Biancolillo et al. 2014).

Results and Discussion

UV-Vis PLS-DA Model

The typical UV-Vis spectra for the analyzed aged cachaças are shown in Fig. 1. Most of the samples studied present two characteristic bands, with maxima at 200 nm and between 250 and 300 nm. These wavelengths are associated with absorption by benzene rings substituted with groups such as hydroxyl, commonly found in phenolic compounds present in aged cachaças. The PLS2-DA algorithm was employed because it provided a classification performance similar to that of PLS1-DA. Unlike PLS1-DA, which builds a separate model for each class, PLS2-DA predicts all classes simultaneously in a single model. The training set was built with 82 samples (15 samples of amendoim; 17 of balsam; 16 of oak; 17 of jequitibá; and 17 of umburana), while 42 samples (seven samples of amendoim; nine of balsam; eight of oak; nine of jequitibá; and nine of umburana) were used in the test set. The analyzed matrix has dimensions of 124 × 156 (samples × wavelengths). The best model was constructed with six LV, which accounted for 100 and 37 % of the X and Y variance, respectively (Table 1). The variable importance in projection (VIP) scores of the PLS-DA model for the classification of each type of aged cachaça are shown in Fig. 2. From this figure, it was possible to identify the wavelengths that contributed most to the classification. Assigning these specific wavelengths to the markers found in aged cachaças is difficult, considering that UV-Vis spectrophotometry is not a technique commonly used for fingerprinting. As noted above, these wavelengths are associated with absorption by hydroxyl-substituted benzene rings, commonly found in phenolic compounds present in aged cachaças, such as phenolic acids (vanillic, gallic, syringic, and ellagic), phenolic aldehydes (vanillin, syringaldehyde, coniferaldehyde, and sinapaldehyde), and coumarin (Aquino et al. 2006).

Fig. 1 Typical UV-Vis spectra for aged cachaças. Black line amendoim, dotted line balsam, grey line oak, inverted triangle jequitibá, and circle umburana

Fig. 2 VIP scores of the UV-Vis PLS-DA model for the classification of aged cachaças. Black line amendoim, dotted line balsam, grey line oak, inverted triangle jequitibá, and circle umburana

PLS-DA estimates a “prediction probability” and a classification threshold for each modeled class. These thresholds are calculated using a Bayesian method. The probability was calculated employing the plsdthres function present in the PLS Toolbox software (Wise et al. 2006). Basically, this function takes the predicted y values from the PLS-DA model, fits a normal distribution to the values of each group (samples belonging to the class, A, and samples not belonging to it, B), and then uses these distributions to calculate the probability of observing a given y value. The two distributions typically “cross” in only one place, which leads to a single point where both P(B|y) and P(A|y) are 0.5. This point is selected as the threshold for the PLS-DA. For a complete description of this method see Pérez et al. (2009).
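
The idea behind this threshold can be sketched in Python with the standard library. This is not the plsdthres routine itself: it is a simplified illustration that assumes equal prior probabilities for the two groups, and the function names bayes_threshold and member_probability are hypothetical.

```python
from statistics import NormalDist


def member_probability(y, in_class, out_class):
    """P(member | y) for two fitted normal distributions and equal priors."""
    p_in, p_out = in_class.pdf(y), out_class.pdf(y)
    return p_in / (p_in + p_out)


def bayes_threshold(y_members, y_others):
    """Predicted-y value at which both groups are equally probable.

    y_members: predicted y values of samples that belong to the class.
    y_others:  predicted y values of samples that do not belong to it.
    """
    in_class = NormalDist.from_samples(y_members)
    out_class = NormalDist.from_samples(y_others)
    lo, hi = out_class.mean, in_class.mean

    def diff(y):
        return in_class.pdf(y) - out_class.pdf(y)

    if lo >= hi or diff(lo) >= 0 or diff(hi) <= 0:
        raise ValueError("densities do not cross between the two group means")
    # Bisection: the sign of diff changes exactly where the densities cross.
    for _ in range(100):
        mid = (lo + hi) / 2
        if diff(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2, in_class, out_class


# Toy predicted values: members scatter around 1, non-members around 0.
threshold, dist_in, dist_out = bayes_threshold([0.8, 1.0, 1.2], [-0.2, 0.0, 0.2])
```

A sample whose predicted y lies above the threshold would be assigned to the class. With unequal spreads in the two groups, the threshold moves away from the midpoint between the group means.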

After the model was built, three important performance parameters for qualitative methodologies were estimated: the sensitivity, selectivity, and reliability rates. The sensitivity is the ability of the method to detect truly positive samples as positive, and the sensitivity rate (SNR) is the probability that the method classifies known positive samples as positive. Similarly, the selectivity of the method is its ability to detect truly negative samples as negative, and the selectivity rate (SLR) is the probability that the method classifies known negative samples as negative (Trullols et al. 2004; Gondim et al. 2014). The reliability rate (RLR) indicates how reliable a method is and is defined as the difference between the total of results (100 %) and the sum of the false positive and false negative rates (Gondim et al. 2011; Botelho et al. 2015).
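
For one class, these three rates follow directly from the counts of correct and incorrect assignments. The Python sketch below shows the calculation; the function name is hypothetical, and it assumes that the false negative rate is computed over the known positive samples and the false positive rate over the known negative samples, so that RLR = SNR + SLR − 100.

```python
def classification_rates(is_member, predicted_member):
    """Return (SNR, SLR, RLR) in percent for one class.

    is_member:        True where a sample truly belongs to the class.
    predicted_member: True where the model assigns the sample to the class.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(is_member, predicted_member):
        if truth and pred:
            tp += 1
        elif truth:
            fn += 1
        elif pred:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    snr = 100.0 * tp / (tp + fn)          # sensitivity rate
    slr = 100.0 * tn / (tn + fp)          # selectivity rate
    false_negative_rate = 100.0 - snr
    false_positive_rate = 100.0 - slr
    rlr = 100.0 - (false_negative_rate + false_positive_rate)
    return snr, slr, rlr


# Toy example: 4 members (3 found), 6 non-members (1 wrongly accepted).
truth = [True] * 4 + [False] * 6
pred = [True, True, True, False, False, False, False, False, False, True]
snr, slr, rlr = classification_rates(truth, pred)
```

In this toy case the sensitivity rate is 75 %, the selectivity rate is about 83 %, and the reliability rate is about 58 %.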

These three parameters were calculated for the training and test sets, and the results are listed in Tables 2 and 3. Table 3 shows that the RLR ranged from 46 to 87 % for the training set and from 56 to 89 % for the test set. The UV-Vis PLS-DA model provided better results for the classes of cachaça aged in amendoim and jequitibá barrels. Silva et al. (2012) obtained accuracies between 80 and 100 % in the classification of aged cachaças certified by the manufacturer using UV-Vis spectra and LDA.

Table 2 Results achieved for the sensitivity and selectivity rates for the classification of commercial aged cachaças within classes for training and test sets
Table 3 Reliability rates for the models obtained for the classification of commercial cachaça aged in different woods (the best ones are in italics)

A new classification model was then built from the EEM data, in search of better results.

EEM NPLS-DA Model

Typical contour maps for the analyzed aged cachaças are shown in Fig. 3. The fluorophores present in aged cachaças are derived primarily from the oxidation of lignin in the presence of alcohol, which leads to phenolic aldehydes and phenolic acids (Aquino et al. 2006). Standard solutions (10.00 mg L−1) of various phenolic compounds and coumarin were prepared in ethanol/water (40:60 %), and their EEM were recorded under the same conditions previously described. It was found that phenolic acids (gallic, syringic, vanillic, ellagic) and coumarin were characterized by maximum excitation at 250 nm and emission at 300 nm, while phenolic aldehydes (sinapaldehyde, coniferaldehyde, syringaldehyde, vanillin) showed maximum excitation at 350 nm and emission between 400 and 450 nm.

Fig. 3 Typical EEMs for aged cachaças. a Amendoim, b balsam, c oak, d jequitibá, and e umburana

The analyzed array has dimensions of 124 × 171 × 27 (samples × emission wavelengths × excitation wavelengths). An NPLS2-DA model was obtained, i.e., all classes were predicted through a single model. The best model was constructed with ten factors that accounted for 99 and 56 % of the X and Y variance, respectively (Table 1).

The PLS Toolbox software does not calculate VIP scores for NPLS models, but Favilla et al. (2013) have developed a routine for this purpose. This routine was therefore used, and the VIP scores for the developed model are shown in Fig. 4. From this figure, it was possible to identify the wavelengths that contributed most to the classification: 250, 280, 330, 360, and 450 nm (excitation wavelengths), and 352 and 446 nm (emission wavelengths). As described earlier in this section, most of these excitation and emission wavelengths are characteristic of the phenolic aldehydes and acids that characterize aged cachaça. Table 3 shows that the RLR ranged from 48 to 91 % for the training set and from 37 to 91 % for the test set.

Fig. 4 VIP scores of the NPLS-DA model for classification of aged cachaças. Black line excitation, grey line emission

The threshold for each class in this model was also calculated through Bayesian statistics, employing the plsdthres function. The EEM NPLS-DA model provided better results for the classes of cachaça aged in oak and balsam barrels.

Fused PLS-DA Models

The two previously generated models showed that the UV-Vis spectrophotometric data provided better results for two classes of aged cachaça, amendoim and jequitibá, while the spectrofluorimetric data provided better results for two other classes, balsam and oak. For the class of cachaça aged in umburana, both models provided good results. Thus, the two types of data were merged in an attempt to build a model able to correctly classify the largest number of samples with good sensitivity and selectivity rates: a fused PLS-DA model.

First, a mid-level data fusion strategy was tried, using the scores of the two previously generated models (six LV from the UV-Vis PLS-DA model and ten factors from the NPLS-DA model). The analyzed matrix has dimensions of 124 × 16 (samples × variables). The best model was constructed with six LV that accounted for 76 and 48 % of the X and Y variance, respectively (Table 1). Table 3 shows that the RLR ranged from 55 to 93 % for the training set and from 43 to 100 % for the test set.

Because these results did not provide 100 % correct classification for all the test samples, the low-level data fusion strategy was also investigated. The row-wise augmented data matrix was formed by fusing the UV-Vis spectra with the fluorescence emission spectra at five excitation wavelengths (250, 280, 330, 360, and 450 nm), totaling 1011 variables. The choice of the excitation wavelengths was based on the analysis of the NPLS-DA VIP scores. In Fig. 5, the two spectral data sets are plotted before the fusion. For better visualization of the spectral profiles, the emission bands were normalized to the same intensity scale as the UV-Vis spectra.
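
The construction of one low-level meta-spectrum can be sketched in Python as follows. The sketch assumes that each EEM is stored with emission wavelengths as rows and excitation wavelengths as columns, and that the UV-Vis spectrum is placed first, followed by the emission spectra in order of increasing excitation wavelength; the ordering and the function name meta_spectrum are assumptions made for illustration.

```python
EXCITATION_NM = list(range(240, 501, 10))   # 27 excitation wavelengths
SELECTED_NM = [250, 280, 330, 360, 450]     # chosen from the NPLS-DA VIP scores


def meta_spectrum(uv_vis, eem):
    """Concatenate a UV-Vis spectrum with selected emission spectra.

    uv_vis: list of absorbances (156 values for the 190 to 500 nm range).
    eem:    list of rows, one per emission wavelength (171 rows), each row
            holding the intensities at the 27 excitation wavelengths.
    """
    fused = list(uv_vis)
    for excitation in SELECTED_NM:
        column = EXCITATION_NM.index(excitation)
        # Emission spectrum recorded at this excitation wavelength.
        fused.extend(row[column] for row in eem)
    return fused


# Toy data with the dimensions reported in the text.
toy_uv_vis = [0.0] * 156
toy_eem = [[float(k) for k in range(27)] for _ in range(171)]
fused = meta_spectrum(toy_uv_vis, toy_eem)
print(len(fused))   # 156 + 5 * 171 variables
```

The length of the resulting vector, 156 + 5 × 171 = 1011, matches the number of variables reported for the low-level fused matrix.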

Fig. 5 Meta-spectra obtained by fusing UV-Vis and fluorescence (250, 280, 330, 360, and 450 nm excitations) data. Black line amendoim, dotted line balsam, grey line oak, inverted triangle jequitibá, and circle umburana

The best model was constructed with eight LV that accounted for 82 and 61 % of the X and Y variance, respectively (Table 1). The results in Table 3 show that the best models were obtained with the low-level fused data: the RLR ranged from 56 to 100 % for the training set and from 60 to 94 % for the test set. For all the classes except oak, this model showed the best RLR for the training and test sets. Among the classes correctly classified, the class of cachaças aged in amendoim provided the worst RLR. This may indicate that producers make more blends with this type of wood without stating it on the label.

To check the developed models, the classification of the samples aged in our laboratory was evaluated. The model with the highest success rate was the low-level fused model, with 100 % correct classification.

Conclusion

In conclusion, UV-Vis spectra and molecular fluorescence in combination with supervised classification chemometric methods provided good models for the prediction of the wood used in the aging of commercial cachaças.

In particular, low-level data fusion, which proved to be the best fusion protocol, gave the highest RLR for four of the five analyzed classes. This model can be used for routine analysis, facilitating the quality control of this drink and its supervision by inspection bodies, and helping to combat fraud.