Entropy coefficient of determination for generalized linear models

https://doi.org/10.1016/j.csda.2009.12.003

Abstract

The objective of the present paper is to propose a predictive power measure for generalized linear models (GLMs). First, basic predictive power measures for GLMs are compared with respect to some desirable properties. We propose a generalized coefficient of determination for GLMs, which is referred to as the entropy coefficient of determination (ECD). The advantage of the measure is discussed in the GLM framework. Second, the asymptotic properties of the maximum likelihood estimator of ECD are discussed. Third, ECD is applied to GLMs with polytomous response variables. Finally, discussions and a conclusion to this study are provided.

Introduction

Measuring the predictive or explanatory power of GLMs (Nelder and Wedderburn, 1972, McCullagh and Nelder, 1989) is important for regression analysis as well as for model selection, i.e. the determination of significant factors in regression models. GLMs include various useful regression models such as normal linear regression, logistic regression, and loglinear models, and are widely applied in practical data analyses. In normal linear regression, the coefficient of determination plays an important role in the measurement of predictive power; however, regression models for non-normal responses, especially polytomous ones, need other measures for assessing predictive power. By using variation functions of response variables, generalized coefficients of determination were proposed as proportions of the variation explained by explanatory variables (Efron, 1978, Agresti, 1986, Korn and Simon, 1991). Predictive power measures based on the likelihood function (Theil, 1970, Goodman, 1971) and entropy (Haberman, 1982) can be viewed as measures of this type. For logistic regression models, the squared sample correlation between responses and their conditional expectations and the sums-of-squares measure are considered preferable predictive power measures (Mittlböck and Schemper, 1996, Ash and Shwartz, 1999). Zheng and Agresti (2000) recommended the correlation coefficient between responses and their conditional expectations as a predictive power measure for GLMs. The random component of the GLM is an exponential family distribution, and the covariance between the response variable and the canonical parameter can be expressed in terms of the Kullback–Leibler information that measures the difference between the model concerned and the independent (null) model (Eshima and Tabata, 2007).
The desirable properties of the predictive power measures for GLMs may be (i) interpretability; (ii) being the multiple correlation coefficient or the coefficient of determination in normal linear regression models; (iii) entropy-based property; (iv) applicability to all GLMs; in addition to these, it may be appropriate for a measure to have the following property: (v) monotonicity in the complexity of the linear predictor.
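
Property (ii) can be illustrated numerically. The sketch below assumes the single-response form ECD = Cov(θ, Y) / (Cov(θ, Y) + a(φ)), the scalar analogue of the polytomous formula given later in this text, and a normal linear model with θ = β0 + β1·x and a(φ) = σ². The design points, weights and coefficients are hypothetical values chosen only for illustration.

```python
def ecd_normal_linear(x_vals, g, beta0, beta1, sigma2):
    """ECD for a normal linear model with a discrete design.

    Assumes theta(x) = E[Y | x] = beta0 + beta1 * x and a(phi) = sigma2.
    g[i] is the probability (or design weight) of x_vals[i].
    """
    theta = [beta0 + beta1 * x for x in x_vals]
    e_theta = sum(gi * t for gi, t in zip(g, theta))
    e_y = e_theta  # E[Y] = E[E[Y | X]]
    e_theta_y = sum(gi * t * t for gi, t in zip(g, theta))  # E[theta * E[Y | X]]
    cov = e_theta_y - e_theta * e_y
    return cov / (cov + sigma2)


def r_squared_normal_linear(x_vals, g, beta0, beta1, sigma2):
    """Population R^2 as the squared correlation between X and Y."""
    mean_y_given_x = [beta0 + beta1 * x for x in x_vals]
    e_x = sum(gi * x for gi, x in zip(g, x_vals))
    e_y = sum(gi * m for gi, m in zip(g, mean_y_given_x))
    e_xy = sum(gi * x * m for gi, x, m in zip(g, x_vals, mean_y_given_x))
    var_x = sum(gi * (x - e_x) ** 2 for gi, x in zip(g, x_vals))
    # E[Y^2] = E[Var(Y | X) + E[Y | X]^2]
    e_y2 = sum(gi * (m * m + sigma2) for gi, m in zip(g, mean_y_given_x))
    var_y = e_y2 - e_y ** 2
    cov_xy = e_xy - e_x * e_y
    return cov_xy ** 2 / (var_x * var_y)


# Hypothetical example values.
x_vals = [0.0, 1.0, 2.0, 4.0]
g = [0.1, 0.4, 0.3, 0.2]
ecd = ecd_normal_linear(x_vals, g, beta0=1.0, beta1=0.8, sigma2=1.5)
r2 = r_squared_normal_linear(x_vals, g, beta0=1.0, beta1=0.8, sigma2=1.5)
```

Under these assumptions the two quantities agree up to floating-point error, which is what property (ii) requires of a generalized coefficient of determination.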

The aim of the present paper is to propose a generalized coefficient of determination for GLMs. Section 2 proposes the entropy coefficient of determination as a predictive power measure for GLMs. Basic predictive power measures are compared with respect to desirable properties for assessing the effects of factors in GLMs. In Section 3, the asymptotic properties of the maximum likelihood estimator of ECD are discussed. Section 4 considers advantageous properties of ECD in GLMs with polytomous response variables. In Section 5, the ECD approach is applied to a logit model. Finally, in Section 6 discussions and a conclusion to this study are given.

Section snippets

Basic predictive power measures for GLMs

In the GLM framework, predictive power measures are compared, and a generalized coefficient of determination is proposed. Let $X=(X_1,X_2,\ldots,X_p)^T$ be a $p\times 1$ factor or explanatory variable vector; $Y$ be a response variable; and let $f(y|x)$ be the conditional probability or density function of $Y$ given $X=x$. In GLMs, the function $f(y|x)$ is assumed to be the following exponential form: $f(y|x)=\exp\left(\frac{y\theta-b(\theta)}{a(\varphi)}+c(y,\varphi)\right)$, where $\theta$ and $\varphi$ are parameters, and $a(\varphi)$, $b(\theta)$ (>0) and $c(y,\varphi)$ are specific functions.
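
As a minimal sketch of the exponential form above, the following code writes three familiar distributions in terms of θ, a(φ), b(θ) and c(y, φ). The helper names are hypothetical, and the parameterizations are the standard textbook ones rather than anything taken from the paper.

```python
import math


def glm_density(y, theta, a_phi, b, c):
    """Evaluate exp((y * theta - b(theta)) / a(phi) + c(y, phi))."""
    return math.exp((y * theta - b(theta)) / a_phi + c(y))


def bernoulli_pmf(y, p):
    # theta = logit(p), b(theta) = log(1 + e^theta), a(phi) = 1, c = 0
    theta = math.log(p / (1.0 - p))
    return glm_density(y, theta, 1.0,
                       lambda t: math.log1p(math.exp(t)),
                       lambda y_: 0.0)


def poisson_pmf(y, mu):
    # theta = log(mu), b(theta) = e^theta, a(phi) = 1, c(y) = -log(y!)
    theta = math.log(mu)
    return glm_density(y, theta, 1.0,
                       math.exp,
                       lambda y_: -math.lgamma(y_ + 1))


def normal_pdf(y, mu, sigma2):
    # theta = mu, b(theta) = theta^2 / 2, a(phi) = sigma^2,
    # c(y, phi) = -y^2 / (2 sigma^2) - log(2 pi sigma^2) / 2
    return glm_density(y, mu, sigma2,
                       lambda t: 0.5 * t * t,
                       lambda y_: -y_ * y_ / (2.0 * sigma2)
                                  - 0.5 * math.log(2.0 * math.pi * sigma2))
```

Each function reproduces the usual probability or density function, so the same `glm_density` helper covers binary, count and continuous responses by changing only the component functions.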

Asymptotic property of the ML estimator of ECD

A basic association measure, mPP(Y|X), was considered in the case where $X$ is uniformly distributed or not random (Eshima and Tabata, 2007). In this section, more general cases are considered. Let $f(y)$ and $g(x)$ be the marginal density or probability functions of $Y$ and $X$, respectively. Then, the association measure is expressed as mPP(Y|X) $=\iint f(y|x)g(x)\log\frac{f(y|x)}{f(y)}\,dx\,dy+\iint f(y)g(x)\log\frac{f(y)}{f(y|x)}\,dx\,dy$. If $Y$ is discrete, the integral is replaced with the summation. If $X$ is not random and takes values $x_k$
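
The association measure can be sketched for discrete X and Y, where both integrals become sums. The example below uses a hypothetical logistic model with a(φ) = 1 and compares the measure with Cov(θ, Y), following the link between the covariance and the Kullback–Leibler information noted in the Introduction. The function names and numerical values are assumptions made for illustration.

```python
import math


def association_measure(g, cond):
    """Symmetric Kullback-Leibler-type measure for discrete X and Y.

    g[i] = P(X = x_i); cond[i][j] = f(y_j | x_i).
    All conditional probabilities must be strictly positive.
    """
    n_y = len(cond[0])
    f_marg = [sum(gi * row[j] for gi, row in zip(g, cond)) for j in range(n_y)]
    total = 0.0
    for gi, row in zip(g, cond):
        for f_yx, f_y in zip(row, f_marg):
            total += gi * f_yx * math.log(f_yx / f_y)  # first term
            total += gi * f_y * math.log(f_y / f_yx)   # second term
    return total


def cov_theta_y(g, theta, mean_y):
    """Cov(theta, Y) computed from theta(x_i) and E[Y | x_i]."""
    e_theta = sum(gi * t for gi, t in zip(g, theta))
    e_y = sum(gi * m for gi, m in zip(g, mean_y))
    e_theta_y = sum(gi * t * m for gi, t, m in zip(g, theta, mean_y))
    return e_theta_y - e_theta * e_y


# Hypothetical logistic model: theta(x) = -0.5 + 1.2 x, Y in {0, 1}.
x_vals = [-1.0, 0.0, 2.0]
g = [0.3, 0.5, 0.2]
theta = [-0.5 + 1.2 * x for x in x_vals]
p = [1.0 / (1.0 + math.exp(-t)) for t in theta]
cond = [[1.0 - pi, pi] for pi in p]

kl_value = association_measure(g, cond)
cov_value = cov_theta_y(g, theta, p)
```

In this sketch `kl_value` and `cov_value` coincide up to rounding, and the measure is zero when every row of `cond` is identical, i.e. when Y does not depend on X.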

GLMs with polytomous response variables

Let $Y$ be a polytomous response variable with levels $\{1,2,\ldots,K\}$ and let $Y_k=1$ if $Y=k$ and $Y_k=0$ if $Y\neq k$. Then, dummy variable vector $Y=(Y_1,Y_2,\ldots,Y_K)^T$ is identified with response $Y$. Then, the random component of a GLM is described as follows: $f(y|x)=\exp\left(\frac{y^T\theta-b(\theta)}{a(\varphi)}+c(y,\varphi)\right)$, where $\theta=(\theta_1,\theta_2,\ldots,\theta_K)^T$. For model identification we set $\theta_K=0$. For an appropriate link function, $\theta$ is a function of explanatory variable vector $X$. In this case, ECD is given by $\mathrm{ECD}(X,Y)=\frac{\sum_{k=1}^{K-1}\mathrm{Cov}(\theta_k,Y_k)}{\sum_{k=1}^{K-1}\mathrm{Cov}(\theta_k,Y_k)+a(\varphi)}$.
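
The ECD formula for a polytomous response can be sketched as follows. The code assumes a single multinomial trial with a(φ) = 1, a discrete X with known probabilities, and known canonical parameters θ_k(x) with θ_K = 0. It works with population quantities rather than estimates, and the function names and example values are hypothetical.

```python
import math


def category_probabilities(theta):
    """P(Y = k | x) for k = 1..K, given (theta_1, ..., theta_{K-1}); theta_K = 0."""
    full = list(theta) + [0.0]
    m = max(full)  # subtract the maximum for numerical stability
    weights = [math.exp(t - m) for t in full]
    s = sum(weights)
    return [w / s for w in weights]


def ecd_polytomous(g, thetas, a_phi=1.0):
    """ECD = sum_k Cov(theta_k, Y_k) / (sum_k Cov(theta_k, Y_k) + a(phi)).

    g[i] = P(X = x_i); thetas[i] = (theta_1, ..., theta_{K-1}) at x_i.
    """
    probs = [category_probabilities(t) for t in thetas]
    cov_sum = 0.0
    for k in range(len(thetas[0])):
        e_theta = sum(gi * t[k] for gi, t in zip(g, thetas))
        e_y = sum(gi * p[k] for gi, p in zip(g, probs))
        e_theta_y = sum(gi * t[k] * p[k] for gi, t, p in zip(g, thetas, probs))
        cov_sum += e_theta_y - e_theta * e_y
    return cov_sum / (cov_sum + a_phi)


# Hypothetical three-category example (K = 3) with three design points.
g = [0.25, 0.25, 0.5]
thetas = [(0.0, 0.0), (1.0, -1.0), (2.0, 0.5)]
ecd_value = ecd_polytomous(g, thetas)
```

Because the sum of covariances is non-negative, the result lies in [0, 1), and it equals 0 when θ does not vary with x.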

Theorem 3

For the GLM with a

Application to a generalized logit model

A baseline-category logit model is considered. Let $X_1$ and $X_2$ be categorical factors that take levels $\{1,2,\ldots,I\}$ and $\{1,2,\ldots,J\}$, respectively, and let $Y$ be a categorical response variable with levels $\{1,2,\ldots,K\}$. Let $X_{ai}=1$ if $X_a=i$ and $X_{ai}=0$ if $X_a\neq i$ $(a=1,2)$, and $Y_k=1$ if $Y=k$ and $Y_k=0$ if $Y\neq k$. Then, dummy variable vectors $X_1=(X_{11},X_{12},\ldots,X_{1I})^T$, $X_2=(X_{21},X_{22},\ldots,X_{2J})^T$ and $Y=(Y_1,Y_2,\ldots,Y_K)^T$ are identified with factors $X_1$, $X_2$ and response $Y$, respectively. From this, the systematic component of the baseline-category logit model is assumed as

Discussion

As an extension of the coefficient of determination, ECD has been proposed for GLMs. In the GLM framework, regression models are described with random, systematic and link components, and the explanatory variables are related to the entropy of the response variables. It may therefore be suitable to propose an entropy-based predictive power measure for GLMs. Nowadays, GLMs are widely applied to practical data analyses, and especially canonical link GLMs play an important role in regression

Acknowledgements

The authors would like to thank the referees and the editor for their useful comments and suggestions, which drastically improved the first version of this paper. This research was supported by Grant-in-aid for Scientific Research 18500216, Ministry of Education, Culture, Sports, Science and Technology of Japan.

References (14)

  • N. Eshima et al., Entropy correlation coefficient for measuring predictive power of generalized linear models, Statistics & Probability Letters (2007)
  • A. Agresti, Applying R²-type measures to ordered categorical data, Technometrics (1986)
  • A. Agresti, Categorical Data Analysis (2002)
  • A. Ash et al., R²: A useful measure of model performance when predicting a dichotomous outcome, Statistics in Medicine (1999)
  • B. Efron, Regression and ANOVA with zero-one data: Measures of residual variation, Journal of the American Statistical Association (1978)
  • L.A. Goodman, The analysis of multinomial contingency tables: Stepwise procedures and direct estimation methods for building models for multiple classifications, Technometrics (1971)
  • S.J. Haberman, Analysis of dispersion of multinomial responses, Journal of the American Statistical Association (1982)
There are more references available in the full text version of this article.

Cited by (24)

  • Regression correlation coefficient for a Poisson regression model, Computational Statistics and Data Analysis (2016)
  • Goodness of fit in restricted measurement error models, Journal of Multivariate Analysis (2016)
  • Three predictive power measures for generalized linear models: The entropy coefficient of determination, the entropy correlation coefficient and the regression correlation coefficient, Computational Statistics and Data Analysis (2011)

    Citation excerpt: In this paper, the coefficient is referred to as the regression correlation coefficient (RCC). The entropy correlation coefficient (ECC; Eshima and Tabata, 2007) and the entropy coefficient of determination (ECD; Eshima and Tabata, 2010) were proposed with consideration of their desirable properties: (i) interpretability; (ii) being the multiple-correlation coefficient or the coefficient of determination in normal linear regression models; (iii) the entropy-based property; and (iv) applicability to all GLMs. The aim of the present paper is to compare three predictive power measures, ECC, ECD and RCC, in theoretical aspects, and the utility of ECC and ECD is demonstrated, with ECC and ECD being applied to GLMs with polytomous responses.
