Abstract

It can be argued that the identification of sound mathematical models is the ultimate goal of any scientific endeavour. On the other hand, particularly in the investigation of complex systems and nonlinear phenomena, discriminating between alternative models can be a very challenging task. Quite sophisticated model selection criteria are available, but their deployment in practice can be problematic. In this work, the Akaike Information Criterion is reformulated with the help of purely information-theoretic quantities, namely, the Gibbs–Shannon entropy and the mutual information. Systematic numerical tests have shown the improved performance of the proposed upgrades, including increased robustness against noise and the presence of outliers. The same modifications can also be implemented to rewrite Bayesian statistical criteria, such as the Schwarz criterion, in terms of information-theoretic quantities, supporting the generality of the approach and the validity of the underlying assumptions.

1. Introduction to Nonfrequentist Model Selection Criteria

The promised land of modern scientific enterprises is often the formulation of robust and generally applicable mathematical models [1, 2]. The ultimate validation of any model resides in the comparison with the results of experiments or observations. In the last decades, enormous quantities of data have become available in many fields of science and engineering. Statistical inference has therefore progressively moved to centre stage. The older frequentist techniques, based on traditional significance level criteria, have been complemented by a series of Bayesian and information-theoretic criteria, which are in many respects better suited to managing large amounts of information.

One of the most popular model selection criteria (MSC) is the Akaike Information Criterion (AIC) [3]. The AIC can be derived from the Kullback–Leibler divergence and can be interpreted as the loss of information associated with the adoption of a model different from the exact one generating the data. The basic idea underlying the AIC is that the less information a model loses, the higher its quality. The theoretical derivation of the AIC gives the unbiased form of the criterion [4]:

$$\mathrm{AIC} = -2\ln L + 2k, \tag{1}$$

where L is the likelihood of the data given the model and k is the number of estimated parameters in the model. The AIC is a metric that is minimised by the best model as a compromise between the goodness of fit (the first term) and complexity (the second term).
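As an illustration of the trade-off between the two terms, the general form AIC = 2k − 2 ln L can be evaluated directly whenever the likelihood is computable. The following is a minimal sketch that assumes i.i.d. zero-mean Gaussian errors of known standard deviation; the function names and the numerical values are hypothetical and are not taken from the paper.

```python
import math

def gaussian_log_likelihood(residuals, sigma):
    """Log-likelihood of the residuals under i.i.d. zero-mean Gaussian noise (assumed)."""
    n = len(residuals)
    sse = sum(r * r for r in residuals)
    return -0.5 * n * math.log(2.0 * math.pi * sigma ** 2) - sse / (2.0 * sigma ** 2)

def aic_from_likelihood(log_likelihood, k):
    """General AIC: 2k - 2 ln(L). Lower values indicate the preferred model."""
    return 2.0 * k - 2.0 * log_likelihood

# Hypothetical residuals of a fitted model.
residuals = [0.1, -0.2, 0.05, 0.0, -0.1]
log_l = gaussian_log_likelihood(residuals, sigma=0.1)

# Same goodness of fit, different number of parameters:
aic_small = aic_from_likelihood(log_l, k=2)
aic_large = aic_from_likelihood(log_l, k=5)
```

With an identical likelihood, the model with three additional parameters is penalised by exactly 2 × 3 = 6 units, which is the whole role of the complexity term in this formulation.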

The general formulation of the AIC is not always easy to apply in practice, as can be appreciated by a simple inspection of (1). First, in many instances, it can be impossible to reliably calculate the likelihood. Moreover, it is well known that the number of parameters is a poor quantifier of a model's complexity and it is not inherently an information-theoretic indicator. The more practical expression of the AIC, very often the one used in practice, is even more distant from its information-theoretic origin, as discussed in the next section.

The first quantity proposed to improve the AIC is the Gibbs–Shannon entropy H:

$$H = -\sum_{i} p_i \ln p_i. \tag{2}$$

The higher the value of H, the higher the uniformity of the corresponding probability distribution function (whose values are indicated with $p_i$). The Gibbs–Shannon entropy can significantly improve the quantification of the model complexity, as discussed in detail in Section 2.2.
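The link between entropy and uniformity can be checked numerically. The sketch below estimates H = −Σ p_i ln p_i from an equal-width histogram; the binning scheme, the number of bins, and the function name are assumptions made for illustration, since the paper does not specify how the pdf is estimated.

```python
import math
from collections import Counter

def shannon_entropy(samples, n_bins=10):
    """Estimate H = -sum(p_i * ln p_i) from an equal-width histogram (in nats)."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / n_bins or 1.0  # guard against constant data
    counts = Counter(min(int((s - lo) / width), n_bins - 1) for s in samples)
    n = len(samples)
    return sum((c / n) * math.log(n / c) for c in counts.values())

uniform_like = [i / 100 for i in range(100)]  # evenly spread over all bins
peaked = [0.0] * 95 + [1.0] * 5               # concentrated on two values

h_uniform = shannon_entropy(uniform_like)
h_peaked = shannon_entropy(peaked)
```

The evenly spread sample fills all ten bins equally and reaches the maximum value ln 10, whereas the sample concentrated on two values yields a much lower entropy.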

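The second quantity used in the rest of the work is the mutual information, MI:

$$MI(X;Y) = \sum_{x}\sum_{y} P_{x,y} \ln\frac{P_{x,y}}{P_x P_y}, \tag{3}$$

where $P_{x,y}$ is the joint pdf of the random variables X and Y, and $P_x$ and $P_y$ are the corresponding marginal pdfs. Mutual information can play a fundamental role in determining the goodness of fit of the models, as discussed in Section 2.1.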
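A simple way to see what the mutual information measures is to estimate it from a joint histogram. The sketch below uses a plug-in estimator with equal-width bins; the estimator, the bin count, and all names are illustrative assumptions and not necessarily the procedure adopted in this work.

```python
import math
import random
from collections import Counter

def bin_indices(values, n_bins):
    """Map each value to an equal-width bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # guard against constant data
    return [min(int((v - lo) / span * n_bins), n_bins - 1) for v in values]

def mutual_information(xs, ys, n_bins=8):
    """Plug-in estimate of sum P_xy * ln(P_xy / (P_x * P_y)), in nats."""
    n = len(xs)
    bx, by = bin_indices(xs, n_bins), bin_indices(ys, n_bins)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
dependent = [x ** 2 for x in xs]                        # nonlinear function of x
independent = [rng.uniform(-1.0, 1.0) for _ in xs]      # unrelated to x

mi_dependent = mutual_information(xs, dependent)
mi_independent = mutual_information(xs, independent)
```

The pair (x, x²) has essentially zero linear correlation on a symmetric interval, yet its estimated mutual information is large, while the independent pair gives a value close to zero (only the small positive bias of the histogram estimator remains).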

With regard to the organization of the paper, the next section introduces the rationale and details of the proposed information-theoretic upgrades of the Akaike Information Criterion. Section 3 is devoted to a simple but challenging didactic case, meant to illustrate the effects of the modifications with an easy-to-grasp example. The family of functions and the types of noise statistics, implemented to perform a series of systematic tests, are summarised in Section 4. The results of the aforementioned tests are exemplified in Section 5 with the help of some representative cases. The extension of the approach to the Bayesian Selection criterion is covered in Section 6 before the conclusions and lines of future developments are discussed in the final section of the paper.

2. Model Selection Formulated in terms of Information Theoretic Quantities

Among the many indicators for identifying the “best model” within a set of candidates, the Akaike Information Criterion (AIC) was originally conceived as a pure information-theoretic criterion. Unfortunately, the original formulation of the AIC is typically problematic to implement in practice, particularly in applications involving complex systems and nonlinear phenomena. Both terms in the AIC present significant issues [5–7]. To bypass the practical difficulties of calculating the likelihood, the strong assumption that the data are identically distributed and independently sampled from a normal distribution is the one most commonly invoked. If this hypothesis, traditionally called iid, is valid, it can be demonstrated that the AIC can be written (up to an immaterial additive constant depending only on the number of entries in the database) as follows:

$$\mathrm{AIC} = n\ln(\mathrm{MSE}) + 2k. \tag{4}$$

In (4), formally derived in [4], the Mean Squared Error (MSE) is calculated in terms of the residuals, i.e., the differences between the data and the estimates of the models; n indicates the number of entries in the database.
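Under the Gaussian iid hypothesis the criterion reduces to the familiar expression n ln(MSE) + 2k, which needs only the residuals. The following minimal sketch applies it to two hypothetical candidate models on synthetic data; the data, the models, and the function names are invented for illustration.

```python
import math

def mean_squared_error(observed, predicted):
    """MSE of the residuals (data minus model estimates)."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def aic_mse(observed, predicted, k):
    """Practical AIC under Gaussian iid residuals: n * ln(MSE) + 2k."""
    n = len(observed)
    return n * math.log(mean_squared_error(observed, predicted)) + 2 * k

# Synthetic data: a straight line plus a small alternating perturbation.
xs = list(range(10))
ys = [2.0 * x + (0.1 if x % 2 == 0 else -0.1) for x in xs]

line_pred = [2.0 * x for x in xs]          # two-parameter model (slope, intercept)
const_pred = [sum(ys) / len(ys)] * len(ys)  # one-parameter model (mean only)

aic_line = aic_mse(ys, line_pred, k=2)
aic_const = aic_mse(ys, const_pred, k=1)
```

The straight line pays a larger complexity penalty but is still preferred, because its much smaller MSE dominates the first term. Note that the whole statistical content of the residuals enters only through this single number.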

Equation (4) is certainly the most widely used form of the AIC. On the other hand, as can be easily appreciated by inspection, the criterion is now expressed in terms of quantities that are no longer information theoretic. Moreover, all the statistical information content, originally in the likelihood, is reduced to the mere MSE of the residuals. The first obvious question is whether some additional statistical information about the distribution of the residuals could be taken into account to improve the discriminatory capability of the criterion. The practical relevance of this issue is quite significant, also because, in many applications, the assumptions behind (4) are clearly violated. In real life, indeed, the statistics of the noise can have a non-Gaussian distribution, memory effects can be important, and a significant number of outliers can be unavoidable. How to improve the model selection criteria in this respect is the subject of Section 2.1.

The second term in (4) is also problematic because it is well known that the number of parameters is a rather poor indicator of the complexity of a model. More sophisticated quantifiers exist, such as the VC dimension [8] and the Rademacher complexity [9], but they are often impossible to calculate for most practical functions. An alternative information-theoretic and computationally simple way to calculate a model's complexity is the subject of Section 2.2.

2.1. Expressing the Goodness of Fit in terms of Mutual Information

The main idea informing one of the AIC upgrades proposed in this work is based on the observation that the better a model, the more similar the residuals are to the noise affecting the measurements. In the case of a perfect model, the residuals should present exactly the same distribution as the noise. Assuming that the noise is not correlated with the measurements, which is legitimate in most practical applications, this consideration can be quantified mathematically by calculating the mutual information between the model predictions and the residuals, $MI_{MRes}$.

The AIC can therefore be rewritten as follows:

Conceptually, (6) is to be preferred to (4) for various reasons. First, it formulates the criterion in terms of an information-theoretic quantity, the mutual information. Moreover, it retains much more statistical information about the model and the residuals. At the same time, $MI_{MRes}$ also takes into account nonlinear correlations and does not make any “a priori” assumption about the statistics of the noise or the presence of outliers. Consequently, as shown by numerical tests, AICMI is a much more general and sensitive model selection criterion than the original AIC.
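The rationale of this section can be illustrated with a small numerical experiment: for the generating model the residuals are pure noise and share almost no information with the predictions, whereas an inadequate model leaves structure in the residuals that the mutual information detects. The sketch below is only an illustration under assumed choices (histogram estimator, quadratic generating function, straight-line competitor, noise level); it does not attempt to reproduce the criterion in (6).

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys, n_bins=8):
    """Plug-in histogram estimate of the mutual information, in nats."""
    def bins(values):
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0
        return [min(int((v - lo) / span * n_bins), n_bins - 1) for v in values]
    bx, by, n = bins(xs), bins(ys), len(xs)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(c / n * math.log(c * n / (px[i] * py[j])) for (i, j), c in joint.items())

rng = random.Random(1)
xs = [rng.uniform(0.0, 1.0) for _ in range(2000)]
data = [x ** 2 + rng.gauss(0.0, 0.02) for x in xs]  # hypothetical measurements

# Candidate 1: the generating model itself, so the residuals are pure noise.
pred_right = [x ** 2 for x in xs]

# Candidate 2: a least-squares straight line, which leaves structured residuals.
mx, my = sum(xs) / len(xs), sum(data) / len(data)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, data)) / sum((x - mx) ** 2 for x in xs)
pred_wrong = [my + slope * (x - mx) for x in xs]

mi_right = mutual_information(pred_right, [d - p for d, p in zip(data, pred_right)])
mi_wrong = mutual_information(pred_wrong, [d - p for d, p in zip(data, pred_wrong)])
```

The residuals of the straight line are, by construction, linearly uncorrelated with its predictions, so a correlation coefficient would miss the problem; the mutual information does not, because the leftover parabolic trend is a nonlinear function of the predictions.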

2.2. Expressing the Complexity in terms of the Shannon Entropy

The other weakness in the original definition of AIC is certainly the quantification of complexity. Indeed, the simple number of parameters in a model is a very poor indicator of its flexibility and in particular of its potential to overfit (see Section 3). A possible alternative relies on the traditional idea that complexity is the middle ground between randomness and determinism. According to this view, complete randomness and perfect determinism are considered less complex than a combination of the two. This approach to complexity has a long pedigree and can be traced back to the interpretation of information as uncertainty, the concept at the basis of information theory [10]. A possible way of expressing this idea in mathematical terms is the following complexity measure C[X]:

where H is the usual Shannon entropy and D is the distance from a uniform distribution:

where, with the usual notation, n is the number of entries in the database. The distance D reduces the estimated complexity of models whose predictions are uniform. The entropy reduces the estimated complexity of models whose outputs are concentrated on a few well-defined values. Conceptually, the implementation of this quantification of complexity is quite simple. The pdf of the model predictions can be inserted in (7) to obtain a simple indicator, implementing the aforementioned information-theoretic interpretation of complexity.
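The behaviour described above can be reproduced with a toy measure that multiplies a power of the entropy by a power of the distance from the uniform distribution. The functional form used below, H^α · D^β with D taken as the summed squared deviation from uniform probabilities, is an assumption made purely for illustration; the exact expressions of (7) and (8) and the values of the exponents should be taken from the original equations.

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution, in nats."""
    return sum(q * math.log(1.0 / q) for q in p if q > 0)

def distance_from_uniform(p):
    """Assumed distance: sum of squared deviations from the uniform probabilities."""
    u = 1.0 / len(p)
    return sum((q - u) ** 2 for q in p)

def complexity(p, alpha=1.0, beta=1.0):
    """Illustrative complexity H**alpha * D**beta (hypothetical form and exponents)."""
    return entropy(p) ** alpha * distance_from_uniform(p) ** beta

uniform = [0.25, 0.25, 0.25, 0.25]     # complete randomness: D = 0
deterministic = [1.0, 0.0, 0.0, 0.0]   # perfect determinism: H = 0
mixed = [0.7, 0.1, 0.1, 0.1]           # in between

c_uniform = complexity(uniform)
c_deterministic = complexity(deterministic)
c_mixed = complexity(mixed)
```

Both extremes receive zero complexity, one because the distance vanishes and the other because the entropy vanishes, while the intermediate distribution is the only one with a strictly positive value.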

The most delicate aspect of (7) is the choice of the exponents α and β because they contribute significantly to determining the trade-off between entropy and distance. To this end, the increments of the model predictions have been calculated as follows:

The moving averages (Mov) of the mean and standard deviation of the squared increments are good indicators of the flexibility of a model and therefore of its potential to overfit. The normalized versions of these quantities are defined in (10).

The ratio of the two averages calculated in (10) is

The parameter MF increases for functions that have stronger variations in the domain of interest and can therefore be considered more complex. Indeed, these more nervous functions would have a higher potential of overfitting the data by following the noise. This is the interpretation of the quantity MF, which is used to determine the exponents α and β.
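The idea that the increments of the predictions reveal how "nervous" a function is can be sketched as follows. The indicator computed here (moving-averaged squared increments, normalised by the variance of the predictions) is a hypothetical stand-in chosen for illustration; it is not the definition of MF, whose exact normalisation is given by (9)–(11).

```python
import math

def squared_increments(values):
    """Squared differences between consecutive model predictions."""
    return [(b - a) ** 2 for a, b in zip(values, values[1:])]

def moving_average(values, window):
    """Simple moving average over a sliding window."""
    return [sum(values[i:i + window]) / window for i in range(len(values) - window + 1)]

def flexibility(values, window=10):
    """Hypothetical indicator: mean of the moving-averaged squared increments,
    normalised by the variance of the predictions."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    smoothed = moving_average(squared_increments(values), window)
    return (sum(smoothed) / len(smoothed)) / variance

xs = [i / 200 for i in range(201)]
smooth = [x ** 2 for x in xs]                             # slowly varying function
nervous = [math.sin(2 * math.pi * 20 * x) for x in xs]    # 20 oscillations in [0, 1]

flex_smooth = flexibility(smooth)
flex_nervous = flexibility(nervous)
```

The rapidly oscillating sinusoid scores orders of magnitude higher than the parabola, even though it depends on fewer parameters, which is the kind of behaviour a parameter count cannot capture.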

Finally, the proposed final versions of the AIC expressed only in terms of the mentioned information theoretic quantities read

3. A Didactic Example to Illustrate the Main Characteristics of AICMICx

To illustrate the potential and the meaning of the proposed upgrades of the AIC, an academic but challenging example, already discussed in detail in the literature [11], is described in this section. To this end, it is assumed that the actual data is generated with a polynomial function depending on 5 parameters.

The equations, considered as possible candidate models for the data generated with (14), are reported in Table 1.

A comment about the sinusoidal functions is in order. These functions can be tuned to fit perfectly the data generated with (14) by increasing their frequency. This fact can be appreciated by inspection of the first two plots of Figure 1. If there is any noise added to the data, the sinusoidal functions, given their higher flexibility, can fit the data even better than the original equation generating it.

On the other hand, they depend only on two parameters, their amplitude and frequency. Therefore, the traditional version of the AIC would tend to prefer a well-adjusted sinusoidal model (because it would achieve lower values of both terms of the indicator). The proposed version AICMICx, on the contrary, manages to properly identify the right model, as shown in Figure 2. The plots report the differences between the AIC and AICMICx of the candidate models and the reference, the equation used to generate the data.

When these differences are positive, the reference model is the preferred one; the negative cases indicate that the criteria would have selected the wrong model. From the plots of Figure 2, it appears quite clearly that the traditional AIC would have preferred the sinusoids (particularly model 1) for various numbers of entries, whereas the AICMICx always identifies the reference model as the right one. This is achieved by taking into account the distributions of the residuals and by better estimating the complexity of the models. The details of the comparison between the traditional AIC and the new version proposed in this paper are fully documented in Appendix A for the specific example reported in this section.

4. The Main Functional Classes and Noise Statistics for Practical Applications

To assess the performance of the alternative AIC model selection criterion proposed in Section 2, a series of systematic numerical tests has been performed. The analysis is focussed mainly on four classes of models that cover the most widely used in practice. They are the classes of polynomials, power laws, power laws multiplied by a squashing term, and exponential functions. In the rest of the paper, only the results for bidimensional functions (of the form z = f(x, y)) are discussed, because they lend themselves to clear visualization, which helps illustrate the properties of the criterion. The extension to a larger number of variables is straightforward and does not pose any conceptual difficulty. Therefore, the considerations and conclusions reported have to be assumed valid also in higher dimensions. For the reader's convenience, the mathematical form of the aforementioned models is reported in the left column of Table 2.

Significant attention has been devoted to noise statistics. Three of the most relevant distribution functions have been tested: Gaussian, uniform, and multi-Gaussian [12]. Again for the reader's convenience, the mathematical formulation of these types of noise is summarised in the right column of Table 2, together with the parameter values valid for the runs reported in the rest of the paper. Since in practice very often the presence of outliers in the data cannot be excluded, the robustness of the proposed upgrade of the AIC in this respect has also been verified. This has been achieved by randomly adding to the synthetic data values sampled from a Gaussian distribution of small variance but nonzero mean (see the entry called Asymmetric noise in Table 2 for a precise mathematical definition).
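For readers who wish to set up similar tests, the sketch below generates zero-mean Gaussian and uniform noise and injects outliers drawn from a narrow Gaussian with nonzero mean into a random subset of the data. All parameter values (noise scale, outlier fraction, shift, and spread) are hypothetical placeholders; the values actually used in the runs are those listed in Table 2, and the multi-Gaussian case is not covered here.

```python
import random

def add_noise(clean, kind, scale, rng):
    """Add zero-mean noise; `scale` is the std (Gaussian) or the half-width (uniform)."""
    if kind == "gaussian":
        return [v + rng.gauss(0.0, scale) for v in clean]
    if kind == "uniform":
        return [v + rng.uniform(-scale, scale) for v in clean]
    raise ValueError(f"unknown noise kind: {kind}")

def add_outliers(values, fraction, shift, spread, rng):
    """Add a Gaussian value of nonzero mean `shift` and small std `spread`
    to a randomly chosen fraction of the entries."""
    n_out = int(fraction * len(values))
    hit = set(rng.sample(range(len(values)), n_out))
    return [v + rng.gauss(shift, spread) if i in hit else v for i, v in enumerate(values)]

rng = random.Random(42)
clean = [0.0] * 1000  # placeholder for the noise-free model output

gaussian = add_noise(clean, "gaussian", 0.3, rng)
uniform = add_noise(clean, "uniform", 0.3, rng)
with_outliers = add_outliers(clean, fraction=0.05, shift=5.0, spread=0.1, rng=rng)
n_shifted = sum(1 for v in with_outliers if v != 0.0)
```

Using a seeded `random.Random` instance keeps every test run reproducible, which matters when comparing criteria across many noise realisations.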

5. Representative Results of Numerical Tests

As mentioned, a systematic series of tests with synthetic data has been performed to assess the competitive advantage of the proposed version of the AIC. All the combinations of cases summarised in Section 4 have been investigated. The new version AICMICx has always proved to have better discriminatory capabilities than the traditional AIC. In practice, this means that AICMICx at least provides better separation between the right model (the one used to generate the data) and its wrong competitors. This has proved to occur for any type of function, noise statistics, and levels of outliers. In general, the more severe the conditions, i.e., the higher the level of noise or outliers, the better the AICMICx performance compared to the traditional AIC. In some cases, such as the one already discussed in Section 3, only the AICMICx can converge on the right model.

In the rest of this section, some relevant examples of the performed tests are reported. They have to be considered absolutely representative of the vast majority of systematic investigations performed.

In the first case discussed in the following, the model generating the data consists of a power law multiplied by a squashing term. The importance and popularity of power laws are difficult to overstate. Self-similarity can result in many quantities presenting a power law trend. Power laws are also particularly important for the investigation of scalings. On the other hand, power law monomials can be too rigid and the multiplication by a squashing factor can provide some additional flexibility. The function implemented to generate the synthetic data is reported in the last row of Table 3. The other rows of the same table report the alternative models. The synthetic data generated with the reference model of Table 3 is shown in Figure 3, together with the functions constituting the alternative models. Two different levels of Gaussian additive noise are shown, corresponding to a standard deviation of 15% and 30% of the average amplitude of the synthetic data. As can be derived by simple inspection of the plots, AICMICx not only increases the separation between the models, compared to the traditional AIC, but it also allows identifying the equation generating the data. Indeed, whereas for some numbers of entries and 30% of added noise the AIC of the candidate models can be lower than that of the reference one, the AICMICx always identifies the model generating the data as the best; this can be seen by noticing that the values of the AICMICx differences, with respect to the best model, are always positive.

The discriminatory power of AICMICx is even higher in the case of high noise. This fact is illustrated by the following example, in which the generating model belongs to the class of exponential functions. The alternative models are reported in Table 4, whose last row reports the equation used to generate the data. In addition to Gaussian noise, with a standard deviation of 30% and 60% of the average amplitude of the synthetic data, some concentrated high noise has also been added, according to the relations specified in the last row of Table 2. The better performance of AICMICx compared to the traditional AIC can be easily recognised by inspection of the plots in Figure 4. Indeed, the separation between the alternative models and the right one is much larger for the AICMICx than for the traditional AIC (note also the different scales of the plots in Figure 4).

6. Extension to Bayesian Model Selection

It is worth noting that the same modifications proposed for the AIC can be applied also to the Bayesian information criterion (BIC) [13]. BIC is based on Bayesian theory and has been designed to maximize the posterior probability of a model given the data. BIC is again a cost function and therefore it is also an indicator to be minimised. The BIC's most general form is

$$\mathrm{BIC} = -2\ln L + k\ln(n), \tag{15}$$

where again L is the likelihood of the data given the model, k is the number of estimated parameters in the model, and n is the number of entries in the database. BIC has the same structural form as the AIC and is affected by the same difficulties in practical applications, in particular the challenges posed by the calculation of the likelihood and the quantification of the model complexity.

Assumptions similar to the ones leading to (4) allow expressing the BIC criterion as follows:

$$\mathrm{BIC} = n\ln(\mathrm{MSE}) + k\ln(n). \tag{16}$$

Even if the conceptual origins of BIC are different, the proposed changes have the same effects, namely, they improve BIC’s discriminatory power by including more statistical information about the residuals and by better quantifying the models' complexity. In full analogy to (13), the final upgraded version of the BIC criterion is

The tests of the AIC have been performed also for the BIC and they produce basically the same results. The discriminatory capability of BICMICx is clearly superior to the original version of the indicator, as can be seen in the plots of Appendix B. Of course, given the fact that BIC is based on Bayesian statistics, the argument that the implemented upgrades improve the coherence with information-theoretic definitions and assumptions cannot be made. On the other hand, the fact that the proposed modifications also improve the quality of a Bayesian type of selection criterion increases the confidence in the validity of the ideas that have led to them.
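The structural similarity between the two traditional criteria is easy to see in code. Under Gaussian iid residuals, the standard practical forms are n ln(MSE) + 2k for the AIC and n ln(MSE) + k ln(n) for the BIC, so they differ only in the complexity penalty. The sketch below uses hypothetical values of n, k, and MSE and does not include the upgraded versions.

```python
import math

def aic_mse(mse, n, k):
    """Practical AIC under Gaussian iid residuals: n * ln(MSE) + 2k."""
    return n * math.log(mse) + 2 * k

def bic_mse(mse, n, k):
    """Practical BIC under Gaussian iid residuals: n * ln(MSE) + k * ln(n)."""
    return n * math.log(mse) + k * math.log(n)

# Hypothetical fit: 100 entries, 5 parameters, MSE of 0.04.
n, k, mse = 100, 5, 0.04
gap = bic_mse(mse, n, k) - aic_mse(mse, n, k)  # equals k * (ln(n) - 2)
```

Since ln(n) exceeds 2 as soon as n ≥ 8, the BIC penalises additional parameters more heavily than the AIC for all but the smallest databases, while the goodness-of-fit term is identical.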

7. Conclusions

The Akaike Information Criterion was conceived to minimise the out-of-sample error and it is based on information theory. Statistical models are indeed developed to represent the process that generated the data, and the AIC estimates the relative amount of information lost by a given model. On this basis, it is assumed that the better a model, the less information it loses. Unfortunately, the deployment of AIC is problematic because its practical versions are affected by significant limitations. Indeed, the most widely used version of AIC is valid under the assumption that the data are affected by Gaussian, zero-mean additive noise. These hypotheses have to be accepted because, in most practical applications, it is often very difficult, if not impossible, to compute the likelihood of the data given the model. If the processes generating the data do not satisfy these assumptions, the traditional versions of the AIC can become poorly effective or even misleading.

On the other hand, other information-theoretic quantities can be implemented to improve the discrimination potential of the criterion. In particular, the mutual information between the model estimates and the residuals can help reward the goodness of fit. The entropy in its turn can be used to quantify the model complexity. With these upgrades, the proposed version of the AIC has always proved to have much better convergence properties than the traditional version in all respects, including robustness against noise and zero-sum outliers. This has occurred in all the numerical tests performed, some of which consist of very challenging selection tasks, given the fact that some candidate models assume values very similar to the right one in the range covered by the data. The proposed improvements have an equally positive impact on the other criteria of the AIC family, such as TIC and AICc [4]. The extension of the same concepts to the Bayesian information criterion supports the soundness of the basic rationale behind the proposed modifications. The good performance in the presence of non-normal noise distributions is particularly encouraging because model assessment in such situations has not yet received a lot of attention in the literature. Indeed, only a few publications have addressed the fact that many existing model selection criteria such as the BIC and Cp may not be suitable for generalized linear model regression, in which the conditional mean and variance of the response are dependent [14]. Synergies with other formulations of the complexity term would also be very interesting from the methodological point of view [15].

Given the quite positive results obtained with synthetic data, proving their better discriminatory capability, the proposed new versions of the selection criteria are expected to become useful in various fields. They are already being deployed for the investigation of complex systems, ranging from high-temperature plasmas [16–23] to remote sensing of the atmosphere and radar [24–26]. Another promising application seems to be in support of the regularization of recent tomographic inversion methods [27–29]. In these fields, Dimensional Analysis (DA) is a methodology widely used to identify key variables based on physical dimensions. Even if it has been granted some attention recently, in most literature DA is treated as merely a preprocessing tool, creating various statistical problems [30]. The upgrades of the criteria proposed in this work could hopefully help in devising an appropriate statistical methodology that integrates DA and model selection.

Appendix

A. Calculation of the AICMICx and BICMICx Quantities of Section 3

Figures 5–9 in this Appendix document all the quantities required to calculate AICMICx and BICMICx for the didactic case of Section 3, involving polynomial and sinusoidal models.

B. Performance Details of the BIC and BICMICx Quantities of Section 5

This Appendix documents the performance of BICMICx for the numerical cases described in Section 5: power laws multiplied by a squashing term and exponentials. Figures 10 and 11 show the comparison of BIC and BICMICx.

Data Availability

The Matlab scripts and data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

AM conceived this research; ML participated in the design of the code and interpretation of the results; ML and RR performed the validation of the analysis; AM and MG wrote the paper and participated in the revisions of it. MG provided the funding and supervised the project. All authors read and approved the final manuscript.