Properties of the Box–Cox transformation for pattern classification
Introduction
Many important results and techniques in statistical data analysis follow from the assumption that the data are normally distributed. In situations where this condition does not hold, one possible option [3], [4] is to transform the data so that their distribution is closer to normal. The Box–Cox transformation, abstracted from the original context of the linear regression model in which it was first introduced in the 1960s [1], [2], belongs to this class of approaches and can be regarded as a parametric way to nonlinearly transform a set of points with the aim of making their distribution approximately Gaussian (see for instance [5]). Since its introduction, this transformation has been widely studied and applied in many different data analysis settings, mostly in economics, econometrics, statistics and medicine, but also, closer to the pattern recognition area, in medical image segmentation [6], EEG signal analysis [7], geoscience [8], system dynamics modeling and prediction [9], time series forecasting [10], [11] and expression microarray data [12]. It is important to note that in many of the applications referenced above the final goal was not the design of a classification system: the usefulness of the Box–Cox transformation as a pre-processing tool for pattern classification has received much less attention, with relatively few papers published in the literature (see Section 2 for a detailed list). Clearly, in the specific Pattern Classification context, increasing the Gaussianity of a dataset is no longer the crucial aspect, since Gaussianity of the dataset does not imply Gaussianity of the classes (see for example the datasets in Fig. 1), and it is the latter that is assumed by several standard classifiers, such as the nearest mean classifier.
In this paper, we provide some evidence that the Box–Cox transformation may also be useful in the Pattern Classification context, typically operating in parameter ranges that can be very far from those optimal for Gaussianity. This success is likely due to the nonlinear nature of the Box–Cox transformation, which deforms the problem space in a nonuniform way, highlighting useful structures or making the data more suitable for a given classifier (e.g. by increasing the class separability); see the example in Fig. 3. To investigate these aspects, we start from a broad analysis involving a set of different datasets and classifiers, and we empirically relate the behaviour of the accuracies, as the Box–Cox parameter λ varies, to three criteria (Gaussianity, Gaussianity of every class, Class separability), showing that the accuracy curves are linked more to class separability than to Gaussianity. In the second part of the paper, we also derive some simple and practical criteria for selecting good and effective values of the parameter λ, exploiting the characteristics of the specific scenario, i.e. the labels.
The remainder of the paper is organized as follows: in Section 2 we briefly review the basics of the Box–Cox Transformation; subsequently, in Section 3, we empirically investigate its properties in the Pattern Classification domain. The analysis on the automatic estimation of the best parameter is then reported in Section 4. Finally, in Section 5 conclusions are drawn.
Section snippets
The Box–Cox transformation
In our paper we focus on the basic formulation of the Box–Cox transform as given in the original Box–Cox paper [1], which transforms a given variable $x$ into $x^{(\lambda)}$ via the following equation:

$$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log x & \text{if } \lambda = 0 \end{cases}$$

the transformation being defined for $x > 0$. Many other slightly different versions or formulations have appeared over the years [2], which are nevertheless all minor variations of the same idea. In some versions the parameter was restricted to a specific range, in some others a shift
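As a minimal sketch of the basic formulation from [1], assuming the standard piecewise form ((x^λ − 1)/λ for λ ≠ 0 and log x for λ = 0, with x > 0), the transformation could be written in Python as follows. The function name `box_cox` is illustrative and is not taken from the paper.

```python
import numpy as np

def box_cox(x, lam):
    """Basic Box-Cox transform of strictly positive data (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("the basic Box-Cox transform is defined only for x > 0")
    if lam == 0:
        # Limit of (x**lam - 1) / lam as lam -> 0
        return np.log(x)
    return (x ** lam - 1.0) / lam
```

With λ = 1 the transform is a simple shift (x − 1), so it leaves the shape of the data unchanged; values of λ below 1 compress large values, and values above 1 stretch them.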
Experimental evaluation part I: understanding the Box–Cox transform
In this section, starting from an extensive analysis of different datasets, classifiers, and parameter configurations, we analyse the behaviour of the Box–Cox transform in the Pattern Classification scenario, trying to link accuracy improvements to criteria that capture different aspects of the transformed data.
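To make this kind of analysis concrete, the sketch below sweeps λ over a grid for a single two-class feature and records two quantities for each value: a Gaussianity score and a class separability score. Both scores are assumptions chosen for illustration, since the exact criteria used in the paper are not given here: Gaussianity is measured by the usual Box–Cox profile log-likelihood under a normal model, and separability by a two-class Fisher ratio. The synthetic data and all function names are hypothetical.

```python
import numpy as np

def box_cox(x, lam):
    """Basic Box-Cox transform for x > 0, numerically stable near lam = 0."""
    x = np.asarray(x, dtype=float)
    if abs(lam) < 1e-12:
        return np.log(x)
    return np.expm1(lam * np.log(x)) / lam

def gaussianity_score(x, lam):
    """Box-Cox profile log-likelihood under a normal model (higher is better)."""
    x = np.asarray(x, dtype=float)
    y = box_cox(x, lam)
    return (lam - 1.0) * np.log(x).sum() - 0.5 * x.size * np.log(y.var())

def fisher_separability(x, labels, lam):
    """Fisher ratio of the transformed feature; assumes exactly two classes."""
    y = box_cox(x, lam)
    labels = np.asarray(labels)
    a, b = np.unique(labels)
    ya, yb = y[labels == a], y[labels == b]
    return (ya.mean() - yb.mean()) ** 2 / (ya.var() + yb.var())

# Synthetic two-class, strictly positive feature (assumption for the demo)
rng = np.random.default_rng(0)
x = np.concatenate([rng.lognormal(0.0, 0.5, 200), rng.lognormal(1.0, 0.5, 200)])
labels = np.repeat([0, 1], 200)

lambdas = np.linspace(-2.0, 2.0, 41)
gauss = np.array([gaussianity_score(x, lam) for lam in lambdas])
sep = np.array([fisher_separability(x, labels, lam) for lam in lambdas])

# The two criteria need not peak at the same value of lambda
print("lambda maximizing Gaussianity score:", lambdas[gauss.argmax()])
print("lambda maximizing Fisher separability:", lambdas[sep.argmax()])
```

Plotting `gauss`, `sep` and a classifier's accuracy against `lambdas` on the same axis is one simple way to see which criterion the accuracy curve follows more closely.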
Experimental evaluation part II: automatic estimation of λ
The main goal of this section is to provide some feasible and practical hints on how to set the λ parameter, a crucial problem in the actual application of the Box–Cox Transformation. Typically, in the literature, such a parameter is (i) set by hand, or (ii) found by an exhaustive search, or (iii) automatically estimated via numerical optimization of a criterion – many criteria have been studied, starting from the historical works [1], [20], [21] up to more recent ones [9]. In the peculiar
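As an illustration of option (ii), the following sketch performs an exhaustive search over a grid of λ values and exploits the labels by scoring each value with the cross-validated accuracy of a nearest mean classifier. The scoring rule, the grid, the synthetic data and all function names are assumptions made for this example; they are not the criteria derived in the paper.

```python
import numpy as np

def box_cox(X, lam):
    """Basic Box-Cox transform for X > 0, numerically stable near lam = 0."""
    X = np.asarray(X, dtype=float)
    if abs(lam) < 1e-12:
        return np.log(X)
    return np.expm1(lam * np.log(X)) / lam

def nearest_mean_accuracy(X_tr, y_tr, X_te, y_te):
    """Accuracy of a nearest mean classifier with Euclidean distance."""
    classes = np.unique(y_tr)
    means = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dist = ((X_te[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    pred = classes[dist.argmin(axis=1)]
    return float((pred == y_te).mean())

def select_lambda(X, y, lambdas, n_folds=5, seed=0):
    """Return the grid value with the best cross-validated accuracy, plus all scores."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for lam in lambdas:
        Z = box_cox(X, lam)  # lam is fixed here, so nothing is fitted on test folds
        accs = []
        for k in range(n_folds):
            te = folds[k]
            tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            accs.append(nearest_mean_accuracy(Z[tr], y[tr], Z[te], y[te]))
        scores.append(np.mean(accs))
    scores = np.array(scores)
    return lambdas[int(scores.argmax())], scores

# Synthetic strictly positive two-class data (assumption for the demo)
rng = np.random.default_rng(1)
X = np.vstack([rng.lognormal(0.0, 1.0, (150, 2)),
               rng.lognormal(1.5, 1.0, (150, 2))])
y = np.repeat([0, 1], 150)

lambdas = np.linspace(-2.0, 2.0, 21)
best_lam, scores = select_lambda(X, y, lambdas)
print("selected lambda:", best_lam)
```

Because λ is chosen using the labels, the accuracy obtained at the selected value is an optimistic estimate; a separate held-out test set would be needed to report performance.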
Conclusions
In this paper we provided an empirical analysis of the behaviour of the Box–Cox transformation for pattern classification, showing that it is a useful preprocessing tool, even when used in operational ranges that are far from those leading to maximum Gaussianity. These results open the door to the analysis of different nonlinear data pre-processing methods, which can be successfully exploited in the pattern recognition field.
Acknowledgements
MB would like to thank Bob Duin for pointing out similarities between the Box–Cox transformation and the power transformation, and P. Lovato for helpful discussions on the experimental evaluation. The authors would also like to thank the Observatorio Vulcanológico y Sismológico de Manizales, Colombia (in particular John Makario Londoño-Bonilla) for providing the volcano data set.
References (32)
- et al., Rule induction for forecasting method selection: meta-learning the characteristics of univariate time series, Neurocomputing (2009)
- et al., Handwritten digit recognition: investigation of normalization and feature extraction techniques, Pattern Recognit. (2004)
- A note on the multivariate Box–Cox transformation to normality, Stat. Probab. Lett. (1993)
- et al., The DTW-based representation space for seismic pattern classification, Comput. Geosci. (2015)
- et al., Combining information theoretic kernels with generative embeddings for classification, Neurocomputing (2013)
- et al., An analysis of transformations, J. R. Stat. Soc.: Ser. B (Methodol.) (1964)
- The Box–Cox transformation technique: a review, Statistician (1992)
- The Theory and Applications of the Linear Model (1976)
- Statistical Pattern Recognition (1990)
- et al., The Box–Cox metric for nearest neighbour classification improvement, Pattern Recognit. (1997)
- MR image segmentation using a power transformation approach, IEEE Trans. Med. Imaging
- Analysis of amplitude-integrated EEG in the newborn based on approximate entropy, IEEE Trans. Biomed. Eng.
- Visual-semantic modeling in content-based geospatial information retrieval using associative mining techniques, IEEE Geosci. Remote Sens. Lett.
- A fast identification algorithm for Box–Cox transformation based radial basis function neural network, IEEE Trans. Neural Netw.
- The bias in reversing the Box–Cox transformation in time series forecasting: an empirical study based on neural networks, Neurocomputing
- Estimation of transformation parameters for microarray data, Bioinformatics
Manuele Bicego received his Laurea degree and PhD degree in Computer Science from the University of Verona in 1999 and 2003, respectively. From 2004 to 2008 he was at the University of Sassari. Currently he is an assistant professor (ricercatore) at the University of Verona. From June 2009 to February 2011 he was also a team leader at the Istituto Italiano di Tecnologia (IIT, Genova, Italy).
His research interests include statistical pattern recognition, mainly probabilistic models (GMM, HMM) and kernel machines (e.g. SVM), with application to video analysis, biometrics and, recently, bioinformatics. Manuele Bicego is the author of several papers on the above subjects, published in international journals and conferences.
Sisto Baldo received his Laurea degree in Mathematics from the University of Pisa in 1989. From 1989 to 1991 he was a PhD student at the Scuola Normale Superiore in Pisa. In 1991 he was appointed assistant professor (ricercatore) at the University of Trento. Since 1999 he has been an associate professor of mathematical analysis: from 1999 to 2002 at the Università della Basilicata in Potenza, from 2002 to 2008 at the University of Trento, and since 2008 at the University of Verona. His main research interests are in calculus of variations, geometric measure theory, elliptic partial differential equations and applications to physics (phase transitions, superconductivity, Bose-Einstein condensation). He is the author of several scientific papers on the above subjects, published in international journals.