Elsevier

Neurocomputing

Volume 218, 19 December 2016, Pages 390-400
Properties of the Box–Cox transformation for pattern classification

https://doi.org/10.1016/j.neucom.2016.08.081

Abstract

The Box–Cox transformation [1,2] (Box and Cox, 1964; Sakia, 1992) is usually regarded as a parametric pre-processing technique that makes the distribution of a set of points approximately Gaussian. Since normality is an assumption underlying many statistical data analysis tools, this technique has been widely applied in different fields of Computer Science. In this paper we provide evidence that the technique can also be useful in Pattern Classification, where Gaussianity of the dataset is not so critical. By letting the Box–Cox transform work in operational ranges which do not necessarily correspond to an increase in Gaussianity, we show that class separability can be improved: this is likely due to the nonlinear nature of the Box–Cox transformation, which deforms the space in a nonuniform way. We also suggest criteria that can be used to automatically estimate the best parameter of the Box–Cox transformation in the Pattern Classification context.

Introduction

Many important results and techniques in statistical data analysis follow from the assumption that data are normally distributed. When this condition does not hold, one possible option [3], [4] is to transform the data so that their distribution is closer to normal. The Box–Cox transformation, abstracted from the original context of the linear regression model where it was first introduced in the 1960s [1], [2], belongs to this class of approaches and can be regarded as a parametric way to nonlinearly transform a set of points with the aim of making their distribution approximately Gaussian (see for instance [5]). Since its introduction, the transformation has been widely studied and applied to many different data analysis situations, mostly in economics, econometrics, statistics and medicine, but also, closer to the pattern recognition area, in medical image segmentation [6], EEG signal analysis [7], geoscience [8], system dynamics modeling and prediction [9], time series forecasting [10], [11] and expression microarray data [12]. It is important to notice that in many of these applications the final goal was not the design of a classification system: the usefulness of the Box–Cox transformation as a pre-processing tool for pattern classification has received much less attention, with relatively few papers published in the literature (see Section 2 for a detailed list). Clearly, in the specific Pattern Classification context, increasing the Gaussianity of a dataset is no longer the crucial aspect, since Gaussianity of the dataset does not imply Gaussianity of the classes (see for example the datasets in Fig. 1); it is the latter property that several standard classifiers, such as the nearest mean classifier, assume.

In this paper, we provide some evidence that the Box–Cox transformation may also be useful in the Pattern Classification context, typically operating in parameter ranges which can be very far from those optimal for Gaussianity. This success is likely due to the nonlinear nature of the Box–Cox transformation, which deforms the problem space in a nonuniform way, highlighting useful structures or making the data more suitable for a given classifier (e.g. by increasing the class separability); see the example in Fig. 3. To investigate these aspects, we start from a broad analysis involving a set of different datasets and classifiers, and we empirically link the behaviour of the accuracies as the Box–Cox parameter λ varies to three criteria (Gaussianity of the dataset, Gaussianity of every class, and class separability), showing that the accuracy curves are linked more to class separability than to Gaussianity. In the second part of the paper, we also derive some simple and practical criteria to select effective values of the parameter λ, exploiting the information available in this specific scenario, i.e. the labels.

The remainder of the paper is organized as follows: in Section 2 we briefly review the basics of the Box–Cox transformation; in Section 3 we empirically investigate its properties in the Pattern Classification domain. The analysis of the automatic estimation of the best parameter is then reported in Section 4. Finally, conclusions are drawn in Section 5.

The Box–Cox transformation

In our paper we focus on the basic formulation of the Box–Cox transform as given in the original Box–Cox paper [1], which transforms a given variable $x$ into $x^{(\lambda)}$ via the following equation:

$$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(x) & \text{if } \lambda = 0 \end{cases}$$

the transformation being defined for $x > 0$. Many other slightly different versions or formulations have appeared over the years [2], which are nevertheless all minor variations of the same idea. In some versions the parameter was restricted to the range (0,1], in some others a shift
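The basic formulation above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `box_cox` and `box_cox_all` and the tolerance `eps` used to detect λ = 0 are assumptions introduced here.

```python
import math


def box_cox(x, lam, eps=1e-12):
    """Basic Box-Cox transform of a single value x > 0.

    `eps` is an illustrative tolerance (an assumption, not part of the
    original definition) below which lambda is treated as zero.
    """
    if x <= 0:
        raise ValueError("the Box-Cox transform is defined only for x > 0")
    if abs(lam) < eps:
        # Limit case lambda = 0: natural logarithm.
        return math.log(x)
    return (x ** lam - 1.0) / lam


def box_cox_all(values, lam):
    """Apply the same lambda to every value of one feature."""
    return [box_cox(v, lam) for v in values]
```

With λ = 1 the transform reduces to a shift by one (`box_cox(2.0, 1.0)` gives `1.0`), while for λ close to 0 the power branch approaches the logarithm, which is why the two cases of the definition join continuously.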

Experimental evaluation part I: understanding the Box–Cox transform

In this section, starting from an extensive analysis of different datasets, different classifiers, and different parameter configurations, we analyse the behaviour of the Box–Cox transform in the Pattern Classification scenario, trying to link accuracy improvements to different criteria relative to different aspects.
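To make the idea of linking λ to a separability criterion concrete, the sketch below scans a grid of λ values on one feature of a two-class toy problem. The excerpt does not say which separability measure the paper uses, so the two-class Fisher ratio is an assumption made here purely for illustration; the data and the function names are likewise made up.

```python
import math
from statistics import mean, pvariance


def box_cox(x, lam):
    """Basic Box-Cox transform for x > 0."""
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam


def fisher_ratio(a, b):
    """Two-class, one-dimensional Fisher criterion (illustrative choice):
    squared gap between class means over the sum of class variances."""
    return (mean(a) - mean(b)) ** 2 / (pvariance(a) + pvariance(b))


def separability_curve(class_a, class_b, lambdas):
    """Return (lambda, Fisher ratio) pairs after transforming both classes."""
    curve = []
    for lam in lambdas:
        ta = [box_cox(v, lam) for v in class_a]
        tb = [box_cox(v, lam) for v in class_b]
        curve.append((lam, fisher_ratio(ta, tb)))
    return curve


# Made-up toy data: two classes that differ mainly in scale.
class_a = [1.0, 2.0, 4.0, 8.0]
class_b = [16.0, 32.0, 64.0, 128.0]
lambdas = [-1.0, -0.5, 0.0, 0.5, 1.0]

for lam, score in separability_curve(class_a, class_b, lambdas):
    print(f"lambda={lam:+.1f}  fisher={score:.3f}")
```

On this toy data the logarithm (λ = 0) yields a higher Fisher ratio than the shifted identity (λ = 1), a small instance of the nonuniform deformation of the space changing how well the classes separate.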

Experimental evaluation part II: automatic estimation of λ

The main goal of this section is to provide some feasible and practical hints on how to set the λ parameter, a crucial problem in the actual application of the Box–Cox Transformation. Typically, in the literature, such a parameter is (i) set by hand, or (ii) found by an exhaustive search, or (iii) automatically estimated via numerical optimization of a criterion – many criteria have been studied, starting from the historical works [1], [20], [21] up to more recent ones [9]. In the peculiar
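As an illustration of options (ii) and (iii), the sketch below runs an exhaustive search over a grid of λ values and keeps the one that maximizes the classical normal profile log-likelihood for an i.i.d. sample, the kind of Gaussianity-oriented criterion associated with the historical works. It is not the label-based criterion developed in this section, which the excerpt does not show; the grid, the sample and the function names are assumptions.

```python
import math
from statistics import pvariance


def box_cox(x, lam):
    """Basic Box-Cox transform for x > 0."""
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam


def profile_loglik(values, lam):
    """Normal profile log-likelihood of lambda, up to an additive constant:
    -(n/2) * log(ML variance of transformed data) + (lam - 1) * sum(log x)."""
    n = len(values)
    transformed = [box_cox(v, lam) for v in values]
    jacobian = (lam - 1.0) * sum(math.log(v) for v in values)
    return -0.5 * n * math.log(pvariance(transformed)) + jacobian


def best_lambda(values, lambdas):
    """Exhaustive search: return the grid value with the highest likelihood."""
    return max(lambdas, key=lambda lam: profile_loglik(values, lam))


# Illustrative grid from -2 to 2 in steps of 0.1, and a made-up sample
# whose logarithms are symmetric around zero.
lambdas = [k / 10 for k in range(-20, 21)]
sample = [math.exp(z) for z in (-1.5, -0.5, 0.0, 0.5, 1.5)]
print(best_lambda(sample, lambdas))
```

Replacing `profile_loglik` with a criterion computed from the class labels would turn the same grid search into a classification-oriented selection of λ.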

Conclusions

In this paper we provided an empirical analysis of the behaviour of the Box–Cox transformation for pattern classification, showing that it is a useful pre-processing tool even when used in operational ranges which are far from those leading to the maximum Gaussianity. These results open the door to the analysis of other nonlinear data pre-processing methods, which could be successfully exploited in the pattern recognition field.

Acknowledgements

MB would like to thank Bob Duin for pointing out similarities between the Box–Cox transformation and the power transformation, and P. Lovato for helpful discussions on the experimental evaluation. The authors would also like to thank the Observatorio Vulcanológico y Sismológico de Manizales, Colombia (in particular John Makario Londoño-Bonilla) for providing the volcano data set.

References (32)

  • J.-D. Lee et al., MR image segmentation using a power transformation approach, IEEE Trans. Med. Imaging (2009)
  • L. Li et al., Analysis of amplitude-integrated EEG in the newborn based on approximate entropy, IEEE Trans. Biomed. Eng. (2010)
  • A. Barb et al., Visual-semantic modeling in content-based geospatial information retrieval using associative mining techniques, IEEE Geosci. Remote Sens. Lett. (2010)
  • X. Hong, A fast identification algorithm for Box–Cox transformation based radial basis function neural network, IEEE Trans. Neural Netw. (2006)
  • A. da Costa et al., The bias in reversing the Box–Cox transformation in time series forecasting: an empirical study based on neural networks, Neurocomputing (2014)
  • B. Durbin et al., Estimation of transformation parameters for microarray data, Bioinformatics (2003)
Manuele Bicego received his Laurea degree and PhD degree in Computer Science from the University of Verona in 1999 and 2003, respectively. From 2004 to 2008 he was at the University of Sassari. Currently he is assistant professor (ricercatore) at the University of Verona. From June 2009 to February 2011 he was also team leader at the Istituto Italiano di Tecnologia (IIT - Genova Italy).

His research interests include statistical pattern recognition, mainly probabilistic models (GMM, HMM) and kernel machines (e.g. SVM), with application to video analysis, biometrics and, recently, bioinformatics. Manuele Bicego is the author of several papers on the above subjects, published in international journals and conferences.

Sisto Baldo received his Laurea degree in Mathematics from the University of Pisa in 1989. From 1989 to 1991 he was a PhD student at the Scuola Normale Superiore in Pisa. In 1991 he was appointed assistant professor (ricercatore) at the University of Trento. Since 1999 he has been associate professor of mathematical analysis: 1999–2002 at the Università della Basilicata in Potenza, 2002–2008 at the University of Trento, and since 2008 at the University of Verona. His main research interests are in calculus of variations, geometric measure theory, elliptic partial differential equations and applications to physics (phase transitions, superconductivity, Bose-Einstein condensation). He is the author of several scientific papers on the above subjects, published in international journals.
