On the estimation of a binary response model in a selected population

https://doi.org/10.1016/j.jspi.2011.04.014

Abstract

A generalization of the Probit model is presented, with the extended skew-normal cumulative distribution as a link function, which can be used for modelling a binary response variable in the presence of selectivity bias. Estimation of the parameters via ML is addressed, and inference on the parameters expressing the degree of selection is discussed. The assumption underlying the model is that the selection mechanism influences the unmeasured factors and does not affect the explanatory variables. When this assumption is violated, but other conditional independencies hold, the model proposed here can still be derived. In particular, the instrumental variable formula still applies and the model arises at the second stage of the estimation procedure.

Introduction

We propose a generalization of the Probit model for a binary response which makes use of the extended skew-normal cumulative distribution as a link function. Since the extended skew-normal distribution may be generated by truncation on a latent variable (see e.g. Arnold and Beaver, 2000), this choice makes it possible to build binary response models generated by a hidden selection mechanism. Probit models for binary response in the presence of selectivity bias have been extensively used in the biomedical literature, and a review is in Bhattacharya et al. (2006). We here assume that selection acts as truncation, that is, only the population which satisfies the selection criterion is observable. This could be, e.g., the self-selected population who volunteer for an experiment, or the patients who qualify for an insurance program. A further example is in the applied section of the paper. The model contains the one proposed by Chen et al. (1999) as a particular case, and therefore may also arise from a different data generating mechanism.

We show how to perform ML estimation of the parameters from a sample drawn at random from the selected population. As expected, inference on one of the parameters expressing the selection mechanism, which we denote by α, is problematic. By using the results in Rotnitzky et al. (2000), we show how it is possible to test the null hypothesis that α=0, corresponding to the absence of selection. Furthermore, following Copas and Li (1997), we construct a profile log likelihood for this parameter and look at the effects of small deviations of α around zero. A similar approach was taken by Boehmke (2003) where, however, ML estimation of the parameters of the model was not pursued.

The model proposed here may also be applied to the complete cases in situations where the explanatory variables are observable for all units in the population and the binary response variable is missing in an informative way. In this case, the parameter related to the proportion of complete cases in the population, which we denote by τ, may be assumed either to be known or to belong to an interval of real values. With this additional information, inference on the model is much simplified.

Section snippets

The extended skew-Probit model

Let (W, Y) be two unobserved random variables such that:
$$W = \tau + U_W, \qquad Y = \tilde{\beta}^{\top} x + U_Y,$$
where $U_W$ and $U_Y$ are two jointly Gaussian random variables, with $\mathrm{Var}(U_W)=1$ and $\mathrm{Var}(U_Y)=\sigma^2$ and ρ the correlation coefficient. Let x be a k-dimensional vector of independent variables with $x_1=1$ and $\tilde{\beta}$ is a k-dimensional vector of unknown coefficients. We now assume that the population is selected, that is, only the observations with W>0 are observed. In this case,
$$(Y \mid W>0) = \tilde{\beta}^{\top} x + \varepsilon,$$
where $\varepsilon = U_Y \mid U_W > -\tau$ has an extended
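As a rough numerical illustration of the truncation mechanism described in this section, the following Python sketch simulates the latent pair $(U_Y, U_W)$, keeps only the draws with $W>0$, and compares the empirical distribution of the retained errors with an extended skew-normal CDF. Several ingredients are assumptions not stated in the snippet above: unit variance for $U_Y$ (σ=1), the usual parametrization $\alpha = \rho/\sqrt{1-\rho^2}$ and $\alpha_0 = \tau\sqrt{1+\alpha^2}$, and the helper names `esn_cdf` and `simulate_selected`, which are purely illustrative.

```python
import numpy as np
from scipy import integrate, stats


def esn_cdf(z, alpha, tau):
    """Extended skew-normal CDF (assumed parametrization):
    F(z) = (1 / Phi(tau)) * int_{-inf}^{z} phi(t) * Phi(alpha0 + alpha * t) dt,
    with alpha0 = tau * sqrt(1 + alpha^2)."""
    alpha0 = tau * np.sqrt(1.0 + alpha**2)

    def integrand(t):
        return stats.norm.pdf(t) * stats.norm.cdf(alpha0 + alpha * t)

    value, _ = integrate.quad(integrand, -np.inf, z)
    return value / stats.norm.cdf(tau)


def simulate_selected(n, rho, tau, rng):
    """Draw (U_Y, U_W) standard bivariate Gaussian with correlation rho
    and return U_Y for the units with W = tau + U_W > 0."""
    u_w = rng.standard_normal(n)
    u_y = rho * u_w + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    return u_y[u_w > -tau]


rng = np.random.default_rng(0)
rho, tau = 0.6, 0.3                       # illustrative values only
alpha = rho / np.sqrt(1.0 - rho**2)       # assumed link between rho and alpha
eps = simulate_selected(200_000, rho, tau, rng)

z = 0.5
empirical = np.mean(eps <= z)             # empirical CDF of the selected errors
theoretical = esn_cdf(z, alpha, tau)      # extended skew-normal CDF
print(f"empirical={empirical:.4f}  theoretical={theoretical:.4f}")
```

Under these assumptions, setting `alpha = 0` makes `esn_cdf` collapse to the standard normal CDF, which corresponds to the ordinary Probit link when there is no selection.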

The likelihood function

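We assume that we have a random sample of n observations $(y_i, x_i)$, drawn from model (3). Let $\theta=(\beta,\alpha,\tau)$. The likelihood function of the sample is then
$$L(\theta)=\prod_{i=1}^{n}\{F(\beta^{\top}x_i)\}^{y_i}\{1-F(\beta^{\top}x_i)\}^{1-y_i},$$
which by making use of (5) becomes
$$\ell(\theta)=-n\ln\Phi(\tau)+\sum_{i=1}^{n}\left[(1-y_i)\ln\left\{\Phi(\tau)-\int_{-\infty}^{\beta^{\top}x_i}\int_{-\infty}^{\alpha_0+\alpha t}\varphi(t)\varphi(u)\,du\,dt\right\}+y_i\ln\int_{-\infty}^{\beta^{\top}x_i}\int_{-\infty}^{\alpha_0+\alpha t}\varphi(t)\varphi(u)\,du\,dt\right].$$
The maximum likelihood estimate of θ is obtained by partial differentiation of the log likelihood function with respect to β, α and τ. As in the GLM, the MLEs are not available in closed form. The score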
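The log likelihood of this section can be evaluated numerically, as in the following minimal Python sketch. It is only an illustration under stated assumptions: the link is taken to be $F(z)=\Phi(\tau)^{-1}\int_{-\infty}^{z}\varphi(t)\Phi(\alpha_0+\alpha t)\,dt$ with $\alpha_0=\tau\sqrt{1+\alpha^2}$ (a relation not given in the snippet), the function names `esn_cdf` and `log_likelihood` are hypothetical, and the data are simulated placeholders rather than anything from the paper.

```python
import numpy as np
from scipy import integrate, special


def esn_cdf(z, alpha, tau):
    """Assumed extended skew-normal link:
    F(z) = (1 / Phi(tau)) * int_{-inf}^{z} phi(t) * Phi(alpha0 + alpha * t) dt."""
    alpha0 = tau * np.sqrt(1.0 + alpha**2)   # assumed definition of alpha0

    def integrand(t):
        phi_t = np.exp(-0.5 * t * t) / np.sqrt(2.0 * np.pi)
        return phi_t * special.ndtr(alpha0 + alpha * t)

    value, _ = integrate.quad(integrand, -np.inf, z)
    return value / special.ndtr(tau)


def log_likelihood(theta, X, y):
    """Bernoulli log likelihood with theta = (beta_1, ..., beta_k, alpha, tau).
    X is an n x k design matrix and y a vector of 0/1 responses."""
    theta = np.asarray(theta, dtype=float)
    beta, alpha, tau = theta[:-2], theta[-2], theta[-1]
    p = np.array([esn_cdf(eta, alpha, tau) for eta in X @ beta])
    p = np.clip(p, 1e-12, 1.0 - 1e-12)       # guard against log(0)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log1p(-p)))


# Placeholder data, used only to exercise the function.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.standard_normal(50)])
y = rng.integers(0, 2, size=50).astype(float)
beta = np.array([0.2, -0.5])

# With alpha = 0 the link reduces to the Probit link, whatever tau is.
ll_esn = log_likelihood(np.r_[beta, 0.0, 0.3], X, y)
p_probit = special.ndtr(X @ beta)
ll_probit = float(np.sum(y * np.log(p_probit) + (1.0 - y) * np.log1p(-p_probit)))
print(ll_esn, ll_probit)
```

The negative of `log_likelihood` could be passed to a general-purpose optimizer such as `scipy.optimize.minimize`. Since the text notes that inference on α is problematic, holding α (and possibly τ) fixed and maximizing over β may be a more stable way to explore the surface than a joint maximization.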

EM algorithm

The EM algorithm may be easily implemented by making reference to the truncated binary regression model (1). To emphasize the relationship with the bivariate Gaussian distribution, in this section we parametrize the model with ρ instead of α. For $x_i$ given, the complete data are then $(u_{Yi}, u_{Wi} \mid u_{Wi} > -\tau)$, which are observations from a truncated bivariate Gaussian distribution with null expected value and correlation coefficient ρ. The incomplete data are $y_i = I[u_{Yi} \le \beta^{\top} x_i \mid u_{Wi} > -\tau]$. The log likelihood,

Endogenous regressors

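We illustrate situations where a binary response model with extended skew-normal link function may arise. Let $V=(Y_1,\ldots,Y_k)$ be a multivariate random vector with jointly Gaussian distribution, with $\mu=E[V]$. Let $\mathrm{cov}(V)=\Sigma$ and $\mathrm{corr}(V)=\tilde{\Sigma}$ be, in order, the covariance and the correlation matrix. We partition $V=(Y_o,Y_s)$. The vector μ is partitioned accordingly, as $\mu=(\mu_o,\mu_s)$. The matrices $\Sigma$ and $\tilde{\Sigma}$ are then
$$\Sigma=\begin{pmatrix}\Sigma_{oo} & \delta\\ \cdot & \sigma_{ss}\end{pmatrix},\qquad \tilde{\Sigma}=\begin{pmatrix}\tilde{\Sigma}_{oo} & \tilde{\delta}\\ \cdot & 1\end{pmatrix}.$$
We assume that the observed population is selected according to $Y_s>\tau$.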

Health scores data

Some outcomes of interest in medical research, such as mental or physical health, cannot be measured directly. One way to assess them is to ask patients to fill in a questionnaire. Each item aims at measuring one particular aspect of health or ability, by providing a numerical evaluation, or score. Typically, scores have a finite range. The overall quality of life score is formed as a weighted sum of different scores, and is therefore bounded over an interval. For repeated measurements, later

Concluding remarks

We have proposed a binary response model with an asymmetric link function, which comprises the Probit model as a particular case. Other models with asymmetric links exist in the literature, and a review is in Bazán et al. (2006). The consequences of a misspecified link on the ML estimates of the regression parameters are investigated in Czado and Santner (1992). The model used in this paper can be seen as arising from a particular form of selection which can be modelled by truncation on one hidden

Acknowledgments

We thank Prof. Sallie Lamb, Prof. Matthew Cooke and the CAST team for permission to use these data, and Prof. Jane Hutton for useful insights on the data analysis. The CAST trial was funded by a grant from the UK Department of Health through its Health Technology Assessment Programme, project no. 01/14/10. Part of this paper was completed when the corresponding author was visiting the Department of Statistics, University of Warwick, the hospitality of which is gratefully acknowledged. This

References (27)

  • Capitanio, A., et al., 2003. Graphical models for skew-normal variates. Scandinavian Journal of Statistics.
  • Capobianco, R., 2006. Discrete choice models with skewed link. Atti XLIII Riunione Scientifica SIS. Sessioni Spontanee,...
  • Chen, M.H., et al., 1999. A new skewed link model for dichotomous quantal response data. Journal of the American Statistical Association.