Variable selection in qualitative models via an entropic explanatory power
Introduction
Bayesian model choice is usually based on the premise that posterior probabilities, Bayes factors or related quantities should be compared, according to various scales. As noted by Gelfand and Dey (1994), there is no agreement on the Bayesian course of action in this setup, as the problem can be stated in many different formats. See Clyde (1999), Godsill (2001), Hoeting et al. (1999), Racugno (1999) or Robert (2001, Chapter 7) for recent perspectives on the whole issue of Bayesian model choice. Most of these approaches require prior specifications for each possible submodel, with at least a proper prior on each (sub)set of parameters and often a prior weight attached to each submodel. The complexity of this modeling is at odds with the parsimony requirement inherent to variable selection, and it creates difficulties and calls for ad hoc fixes even in moderately informative settings. For instance, usual prior modeling rules imply that the weights depend on the number of submodels under consideration, notwithstanding the prior information and the tree structure of the submodels. Automated prior selection methods, as in Bernardo (1979, 1999) and McCulloch and Rossi (1993), also encounter difficulties, on either implementation or theoretical grounds.
General arguments motivating our variable selection strategy have already been put forward in Mengersen and Robert (1996), in Dupuis (1997), and in Goutis and Robert (1998). However, these papers rely on a more conventional approach, namely Bayes factors, while the current paper develops an inferential interpretation of the Kullback–Leibler distance that sets the whole model choice problem on new decision-theoretic ground. In addition, and contrary to existing methods, the present approach only requires a (possibly improper) prior distribution on the full model. The submodels under consideration are projections of the full model, namely the submodels closest to the full model in the sense of the Kullback–Leibler distance.
The general principle of the method is presented in Section 2, while the derivation of the Kullback–Leibler projections is detailed in Section 3 (discrete case) and Section 4 (logit and polylogit cases). Section 5 examines the important issue of scaling the Kullback–Leibler distance and of deriving the corresponding bound on this distance. Section 6 proposes an algorithmic implementation of the method, which is illustrated in Section 7 on a dataset of Raftery and Richardson (1996).
Section snippets
Distance between models
We consider p+1 random variables y, x1, …, xp, where y represents the phenomenon under scrutiny and the xk's are either discrete or continuous covariates. (Although we will primarily provide examples in the case where y has a finite support, the results of this paper equally apply to y's with continuous support.) As understood in this paper, the goal of variable selection is to reduce as much as possible the dimension of the covariate vector from the original p while preserving enough of the
Projections in the discrete case
Starting from , we denote by the vector of the . For a subset of , the conditional density of y in the class is . An advantage of the discrete case, when compared with the general setup addressed in Section 4, is that the projection of the model on this class can be derived in closed form.
Proposition 3.1. The minimization program has a unique solution.
Proof. Since
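Although the original symbols of Proposition 3.1 are not reproduced in this excerpt, the closed-form solution can be sketched in generic notation (the symbols A, f and g below are ours): for a subset A of covariate indices, the Kullback–Leibler projection of the full conditional f(y|x) onto the class of models depending on x only through x_A solves

```latex
\min_{g}\; \mathbb{E}_{x}\!\left[\,\sum_{y} f(y\mid x)\,
   \log\frac{f(y\mid x)}{g(y\mid x_A)}\right],
```

and, since the objective separates over the values of x_A and each term is then a cross-entropy in g(·|x_A), the unique minimizer is the conditional

```latex
g^{\perp}(y\mid x_A) \;=\; \mathbb{E}\!\left[\,f(y\mid x)\,\middle|\,x_A\,\right]
\;=\; f(y\mid x_A),
```

which is what makes the discrete case tractable without numerical optimization.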
Logit model
If yi∈{0,1} and xi are related by a logit model, the projection α⊥ on the subspace corresponding to the covariates zi is associated with β, the solution of the minimization program, i.e. of the implicit equations (in β). As pointed out in Goutis and Robert (1998), these equations are formally equivalent to the MLE equations for the logit model,
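As a concrete illustration, the implicit equations can be solved numerically. The sketch below assumes the standard logit parametrization Pr(yi=1|xi) = 1/(1+exp(-xi'α)) and uses plain gradient descent on the Kullback–Leibler objective; the function names are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def project_logit(x, alpha, z, n_iter=2000, lr=0.5):
    """Kullback-Leibler projection of the full logit model
    p_i = sigmoid(x_i' alpha) onto the sub-covariates z: minimizes
    sum_i KL(Bernoulli(p_i) || Bernoulli(sigmoid(z_i' beta)))
    by gradient descent, whose stationary point solves the implicit
    equations z' (p - q(beta)) = 0."""
    p = sigmoid(x @ alpha)                   # full-model probabilities
    beta = np.zeros(z.shape[1])
    for _ in range(n_iter):
        q = sigmoid(z @ beta)                # projected probabilities
        beta += lr * z.T @ (p - q) / len(p)  # gradient step on the KL objective
    return beta
```

Setting z = x recovers β = α, since the full model is its own projection.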
Explanatory power and scaling
The variable selection principle stated in Section 2.4 is based on the Kullback–Leibler distance and on its scaling by the distance to the class of covariate-free submodels of Mg, that is, those such that f(y|x,α)=g(y|α). For instance, in the setup of Section 4, this class corresponds to the submodels with only an intercept.
We now proceed to justify that this scaling defines a proper reference (or explanatory power) for the derivation of the threshold ε, that is, of the ability of x to
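In symbols (the notation below is ours, chosen to match the description above), one natural way to scale the distance is to measure each submodel against the covariate-free projection: for a covariate subset A, define

```latex
\rho_A \;=\; 1 \;-\; \frac{d\big(f,\;f^{\perp}_{A}\big)}{d\big(f,\;f^{\perp}_{\emptyset}\big)},
```

where f⊥∅ denotes the projection on the covariate-free class. Then ρ∅ = 0, the full model has ρ = 1, and choosing a threshold ε amounts to retaining the smallest subset A with ρ_A ≥ 1 − ε, i.e. a submodel that keeps at least a fraction 1 − ε of the explanatory power of x.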
MCMC implementation
As mentioned in the introduction, the method is necessarily distinct from a testing approach. From a Bayesian perspective, this means that the focus is on estimating the posterior distance between the (embedding) full model and some submodels. In the setup of Section 4.1, given a sample of α's produced from the posterior distribution of the full model by an MCMC algorithm (see Albert and Chib, 1993; Gilks et al., 1996; or Robert and Casella, 1999), it is then possible to compute the projected
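In this spirit, the posterior distance can be approximated by Monte Carlo: for each posterior draw of α, project it onto the submodel and record the resulting Kullback–Leibler distance. A minimal sketch for the logit setup (the projection routine is passed in as a function; all names are illustrative):

```python
import numpy as np

def bern_kl(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def posterior_distances(x, z, alpha_draws, project):
    """For each posterior draw of alpha (e.g. produced by an MCMC sampler
    on the full model), compute the empirical KL distance between the
    full logit model and its projection on the covariates z."""
    dists = []
    for alpha in alpha_draws:
        p = 1.0 / (1.0 + np.exp(-x @ alpha))  # full-model probabilities
        beta = project(x, alpha, z)           # KL-projection of this draw
        q = 1.0 / (1.0 + np.exp(-z @ beta))   # projected probabilities
        dists.append(bern_kl(p, q).mean())    # average over observations
    return np.array(dists)
```

The posterior sample of distances can then be compared with the scaled threshold of Section 5, for instance via its posterior mean or quantiles.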
An illustration
The projection method developed in this paper is used to select covariates in a logistic regression model for an epidemiological study already considered by Richardson et al. (1989) (Fig. 1) from a classical point of view, and by Raftery and Richardson (1996), who used Bayes factors together with variables transformed via alternating conditional expectations (ACE) (Madigan and Raftery, 1994).
The study evaluates the role of dietary factors on breast cancer and consists of 740 women from
Acknowledgements
The authors are grateful to Sylvia Richardson for sharing and discussing her dataset, as well as to Peter Müller and Marco Pacifico for helpful discussions.
References (21)
- Albert, J.H., Chib, S., 1993. Bayesian analysis of binary and polychotomous response data. J. Amer. Statist. Assoc.
- Bernardo, J.M., 1979. Reference posterior distributions for Bayesian inference (with discussion). J. Roy. Statist. Soc. B.
- Bernardo, J.M., 1999. Nested hypothesis testing: the Bayesian reference criterion (with discussion). In: Berger, J.O.,...
- Christensen, R., 1997. Log-linear Models and Logistic Regression.
- Clyde, M., 1999. Bayesian model averaging and model search strategies. In: Bernardo, J.M., Dawid, A.P., Berger, J.O.,...
- Dupuis, J.A., 1994. Bayesian test of homogeneity for Markov chains with missing data by Kullback proximity. Technical...
- Dupuis, J.A., 1997. Bayesian test of homogeneity for Markov chains. Statist. Probab. Lett.
- Gelfand, A.E., Dey, D.K., 1994. Bayesian model choice: asymptotics and exact calculations. J. Roy. Statist. Soc. B.
- Gelman, A., Gilks, W.R., Roberts, G.O., 1996. Efficient Metropolis jumping rules. In: Berger, J.O., Bernardo, J.M.,...
- Gilks, W.R., et al., 1996. Markov Chain Monte Carlo in Practice.