Variable selection in qualitative models via an entropic explanatory power

https://doi.org/10.1016/S0378-3758(02)00286-0

Abstract

The variable selection method proposed in this paper is based on the evaluation of the Kullback–Leibler distance between the full (or encompassing) model and its submodels. The Bayesian implementation of the method does not require a separate prior modeling of the submodels, since the corresponding submodel parameters are defined as the Kullback–Leibler projections of the full-model parameters. The result of the selection procedure is the submodel with the smallest number of covariates that lies at an acceptable distance from the full model. We introduce the notion of explanatory power of a model and scale the maximal acceptable distance in terms of the explanatory power of the full model. Moreover, an additivity property between embedded submodels shows that our selection procedure is equivalent to selecting the submodel with the smallest number of covariates that has a sufficient explanatory power. We illustrate the performance of this method on a breast cancer dataset.

Introduction

Bayesian model choice is usually based on the premise that posterior probabilities, Bayes factors or related quantities should be compared, according to various scales. As noted by Gelfand and Dey (1994), there is no agreement on the Bayesian course of action in this setup, as the problem can be stated in many different formats. See Clyde (1999), Godsill (2001), Hoeting et al. (1999), Racugno (1999) or Robert (2001, Chapter 7) for recent perspectives on the whole issue of Bayesian model choice. Most of these approaches require prior specifications for each possible submodel, with at least a proper prior on each (sub)set of parameters and often a prior weight attached to each submodel. The complexity of this modeling is at odds with the parsimony requirement inherent to variable selection, and it creates difficulties and ad hoc choices even in moderately informative settings. For instance, usual prior modeling rules imply that the weights depend on the number of submodels under consideration, regardless of the prior information and of the tree structure of the submodels. Automated prior selection methods, as in Bernardo (1979, 1999) and McCulloch and Rossi (1993), also encounter difficulties, on either implementation or theoretical grounds.

General arguments that motivate our variable selection strategy have already been defended in Mengersen and Robert (1996), Dupuis (1997), and Goutis and Robert (1998). However, these papers rely on a more conventional approach, namely Bayes factors, while the current paper develops an inferential interpretation of the Kullback–Leibler distance that places the whole model choice problem on new decision-theoretic ground. In addition, and contrary to existing methods, the present approach only requires a (possibly improper) prior distribution on the full model. The submodels under consideration are projections of the full model, namely the submodels closest, in the Kullback–Leibler sense, to the full model.

The general principle of the method is presented in Section 2, while the derivation of the Kullback–Leibler projections is detailed in Sections 3 (discrete case) and 4 (logit and polylogit cases). Section 5 examines the important issue of scaling the Kullback–Leibler distance and of deriving the corresponding bound on this distance. Section 6 proposes an algorithmic implementation of the method, which is illustrated in Section 7 on a dataset of Raftery and Richardson (1996).

Section snippets

Distance between models

We consider p+1 random variables y,x1,…,xk,…,xp, where y represents the phenomenon under scrutiny and the xk's are either discrete or continuous covariates. (Although we will primarily provide examples in the case y has a finite support, the results of this paper equally apply to y's with continuous support.) As understood in this paper, the goal of variable selection is to reduce as much as possible the dimension of the covariates vector from the original p while preserving enough of the
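
For ease of reading, the criterion can be restated compactly in the notation of the following sections (this is only a summary of the principle of Section 2, with $d(\cdot,\cdot)$ the Kullback–Leibler distance and $g(y\,|\,x_A)$ the full-model conditional; the class notation $M_g$, $M_B$ follows Section 5):
$$
d(M_g,M_B)\;=\;\min_{h}\;\mathbb{E}_{x_A}\!\left[\,d\big(g(y\,|\,x_A),\,h(y\,|\,x_B)\big)\right],
\qquad
\text{retain the submodel with the fewest covariates such that } d(M_g,M_B)\le\varepsilon .
$$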

Projections in the discrete case

Starting from $A\subset\{1,\dots,p\}$, we denote by $\beta(x_A)$ the vector of the $\beta_j(x_A)=\Pr(y=j\,|\,x_A)$. For a subset $B$ of $A$, the conditional density of $y$ in the class $M_B$ is $\gamma_y(x_B)$. An advantage of the discrete case, when compared with the general setup addressed in Section 4, is that the projection $\beta(x_B)$ of model $M_A$ on the class $M_B$ can be derived in closed form.

Proposition 3.1

The minimization program
$$
\mathop{\mathrm{arg\,min}}_{\gamma(x_B)}\;\mathbb{E}_{x_A}\!\left[\,d\big(g(y\,|\,x_A),\,h(y\,|\,x_B)\big)\right]
$$
has a unique solution $\beta(x_B)=\big(\mathbb{E}[\beta_j(x_A)\,|\,x_B]\big)_{j=1,\dots,J}$.

Proof

Since
$$
\mathbb{E}_{x}\big[\,d\big(f(y\,|\,x_A),\,g(y\,|\,x_B)\big)\big]
=\sum_{x} f(x)\sum_{j}\beta_j(x_A)\log\frac{\beta_j(x_A)}{\gamma_j(x_B)}\;\cdots
$$
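
As an illustration only (not code from the paper), a minimal numerical sketch of the closed-form projection of Proposition 3.1, assuming the marginal pmf $f(x_A)$ and the full-model conditionals $\beta_j(x_A)$ are available on a finite grid of covariate configurations; the array names and the `group_of` mapping from $x_A$ configurations to $x_B$ values are illustrative conventions, not notation from the paper.

```python
import numpy as np

def project_discrete(f_xA, beta_xA, group_of):
    """Kullback-Leibler projection in the discrete case (Proposition 3.1):
    beta_j(x_B) = E[beta_j(x_A) | x_B].

    f_xA     : (n_config,) marginal pmf of the full covariate vector x_A
    beta_xA  : (n_config, J) full-model conditionals Pr(y = j | x_A)
    group_of : (n_config,) integer label of the x_B value induced by each x_A
    """
    n_B = int(group_of.max()) + 1
    beta_xB = np.zeros((n_B, beta_xA.shape[1]))
    for b in range(n_B):
        mask = group_of == b
        w = f_xA[mask]                      # f(x_A) restricted to {x_A : x_B = b}
        if w.sum() > 0:
            # conditional expectation of beta(x_A) given x_B = b
            beta_xB[b] = (w[:, None] * beta_xA[mask]).sum(axis=0) / w.sum()
    return beta_xB
```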

Logit model

If $y_i\in\{0,1\}$ and $x_i\in\mathbb{R}^k$ are related by a logit model
$$
P(y_i=1\,|\,x_i,\alpha)=1-P(y_i=0\,|\,x_i,\alpha)=\frac{\exp(\alpha^{t}x_i)}{1+\exp(\alpha^{t}x_i)},
$$
the projection of $\alpha$ on the subspace corresponding to the covariates $z_i$ is associated with $\beta$, solution of the minimization program
$$
\min_{\beta}\;\sum_{i=1}^{n}\left[(\alpha^{t}x_i-\beta^{t}z_i)\,\frac{\exp(\alpha^{t}x_i)}{1+\exp(\alpha^{t}x_i)}-\log\frac{1+\exp(\alpha^{t}x_i)}{1+\exp(\beta^{t}z_i)}\right],
$$
i.e. of the implicit equations (in $\beta$)
$$
\sum_{i=1}^{n}\frac{\exp(\beta^{t}z_i)}{1+\exp(\beta^{t}z_i)}\,z_i=\sum_{i=1}^{n}\frac{\exp(\alpha^{t}x_i)}{1+\exp(\alpha^{t}x_i)}\,z_i.
$$
As pointed out in Goutis and Robert (1998), these equations are formally equivalent to the MLE equations for the logit model, …
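
As a hedged illustration (not code from the paper), the projection can be computed numerically by minimizing the objective above, written in the numerically stable form $(\alpha^{t}x_i-\beta^{t}z_i)\,p_i-\log\big[(1+e^{\alpha^{t}x_i})/(1+e^{\beta^{t}z_i})\big]$; the sketch below assumes design matrices `X` (full model) and `Z` (submodel) whose rows are $x_i^{t}$ and $z_i^{t}$, and relies only on NumPy/SciPy.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic function

def kl_projection_logit(alpha, X, Z, beta0=None):
    """Project the full logit model with coefficients `alpha` (covariates X)
    onto the submodel using covariates Z, by minimizing the summed
    Kullback-Leibler divergence between the two Bernoulli models."""
    a = X @ alpha
    p = expit(a)                            # full-model success probabilities

    def kl(beta):
        b = Z @ beta
        # sum_i (a_i - b_i) p_i - [log(1 + e^{a_i}) - log(1 + e^{b_i})]
        return np.sum((a - b) * p - (np.logaddexp(0.0, a) - np.logaddexp(0.0, b)))

    if beta0 is None:
        beta0 = np.zeros(Z.shape[1])
    res = minimize(kl, beta0, method="BFGS")
    return res.x, res.fun                   # projected coefficients, distance
```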

Explanatory power and scaling

The variable selection principle stated in Section 2.4 is based on the Kullback–Leibler distance and its scaling via $\varepsilon=\varrho\,d(M_g,M_0)$, where the class $M_0$ contains the covariate-free submodels of $M_g$, that is, those such that $f(y\,|\,x,\alpha)=g(y\,|\,\alpha)$. For instance, in the setup of Section 4, $M_0$ corresponds to the submodels with only an intercept.

We now proceed to justify that $d(M_g,M_0)$ defines a proper reference scale (or explanatory power) for the derivation of the threshold $\varepsilon$, that is, of the ability of $x$ to
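
As a reading aid (not a statement from the paper): the abstract invokes an additivity property between embedded submodels; assuming it takes the Pythagorean-like form $d(M_g,M_0)=d(M_g,M_B)+d(M_B,M_0)$ for $M_0\subset M_B\subset M_g$, the scaled distance bound can be rewritten in explanatory-power terms as
$$
d(M_g,M_B)\;\le\;\varrho\,d(M_g,M_0)
\quad\Longleftrightarrow\quad
d(M_B,M_0)\;\ge\;(1-\varrho)\,d(M_g,M_0),
$$
so that keeping the smallest submodel within the scaled distance amounts to keeping the smallest submodel whose explanatory power is at least a fraction $1-\varrho$ of that of the full model.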

MCMC implementation

As mentioned in the introduction, the method is necessarily distinct from a testing approach. From a Bayesian perspective, this means that the focus is on estimating the posterior distance between the (embedding) full model and some submodels. In the setup of Section 4.1, given a sample of α's produced from the posterior distribution for the full model by an MCMC algorithm (see Albert and Chib, 1993; Gilks et al., 1996; or Robert and Casella, 1999), it is then possible to compute the projected
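
As an illustrative sketch only (assuming the `kl_projection_logit` helper sketched above and an array `alpha_draws` of posterior draws of the full logit coefficients), the posterior distribution of the distance to a given submodel can be approximated draw by draw:

```python
import numpy as np

def posterior_projection(alpha_draws, X, Z):
    """For each posterior draw of the full-model coefficients, compute the
    KL projection onto the submodel spanned by Z and the associated distance."""
    T = len(alpha_draws)
    betas = np.empty((T, Z.shape[1]))
    dists = np.empty(T)
    for t, a in enumerate(alpha_draws):
        betas[t], dists[t] = kl_projection_logit(a, X, Z)
    return betas, dists

# The submodel can then be compared with the scaled threshold, for instance
# through the posterior expectation of the distance:
#     dists.mean() <= rho * d_full_to_null
```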

An illustration

The projection method developed in this paper is used to select covariates in a logistic regression model for an epidemiological study already considered from a classical point of view by Richardson et al. (1989), and by Raftery and Richardson (1996), who used Bayes factors (Madigan and Raftery, 1994) and variables transformed via alternating conditional expectations (ACE).

The study evaluates the role of dietary factors on breast cancer and consists of 740 women from

Acknowledgements

The authors are grateful to Sylvia Richardson for sharing and discussing her dataset, as well as to Peter Müller and Marco Pacifico for helpful discussions.

References (21)

  • Albert, J.H., Chib, S., 1993. Bayesian analysis of binary and polychotomous response data. J. Amer. Statist. Assoc.
  • Bernardo, J.M., 1979. Reference posterior distributions for Bayesian inference (with discussion). J. Roy. Statist. Soc. B.
  • Bernardo, J.M., 1999. Nested hypothesis testing: the Bayesian reference criterion (with discussion). In: Berger, J.O., ...
  • Christensen, R., 1997. Log-linear Models and Logistic Regression.
  • Clyde, M., 1999. Bayesian model averaging and model search strategies. In: Bernardo, J.M., Dawid, A.P., Berger, J.O., ...
  • Dupuis, J.A., 1994. Bayesian test of homogeneity for Markov chains with missing data by Kullback proximity. Technical ...
  • Dupuis, J.A., 1997. Bayesian test of homogeneity for Markov chains. Statist. Probab. Lett.
  • Gelfand, A., Dey, D.K., 1994. Bayesian model choice: asymptotics and exact calculations. J. Roy. Statist. Soc. B.
  • Gelman, A., Gilks, W.R., Roberts, G.O., 1996. Efficient Metropolis jumping rules. In: Berger, J.O., Bernardo, J.M., ...
  • Gilks, W.R., Richardson, S., Spiegelhalter, D.J., 1996. Markov Chain Monte Carlo in Practice.
There are more references available in the full text version of this article.
