On the estimation of a binary response model in a selected population
Introduction
We propose a generalization of the Probit model for a binary response which uses the extended skew-normal cumulative distribution as a link function. Since the extended skew-normal distribution may be generated by truncation on a latent variable (see e.g. Arnold and Beaver, 2000), this choice makes it possible to build binary response models generated by a hidden selection mechanism. Probit models for binary response in the presence of selectivity bias have been extensively used in the biomedical literature, and a review is given in Bhattacharya et al. (2006). Here we assume that selection acts as truncation, that is, only the part of the population which satisfies the selection criterion is observable. This could be, for example, the self-selected population of those who volunteer for an experiment, or the patients who qualify for an insurance program. A further example is in the applied section of the paper. The model contains the one proposed by Chen et al. (1999) as a particular case, and may therefore also arise from a different data generating mechanism.
We show how to perform ML estimation of the parameters from a sample drawn at random from the selected population. As expected, inference on one of the parameters that express the selection mechanism is problematic. By using the results in Rotnitzky et al. (2000), we show how to test the null hypothesis that this parameter is zero, which corresponds to the absence of selection. Furthermore, following Copas and Li (1997), we construct a profile log likelihood for this parameter and look at the effects of small deviations of the parameter around zero. A similar approach was taken by Boehmke (2003), where, however, ML estimation of the parameters of the model was not pursued.
The model proposed here may also be applied to the complete cases in situations where the explanatory variables are observable for all units in the population and the binary response variable is missing in an informative way. In this case, the parameter related to the proportion of complete cases in the population may be assumed either to be known or to belong to an interval of real values. With this additional information, inference on the model is much simplified.
The extended skew-Probit model
Let (W, Y⁎) be two unobserved random variables such that: where U_W and U_{Y⁎} are two jointly Gaussian random variables, with and and the correlation coefficient. Let x be a k-dimensional vector of independent variables with x_1=1 and is a k-dimensional vector of unknown coefficients. We now assume that the population is selected, that is, only the observations with are observed. In this case, where has an extended
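The effect of selection by truncation on the response probability can be illustrated numerically. The sketch below is not taken from the paper. It assumes W = tau + U_W and Y⁎ = eta + U_{Y⁎}, with unit variances and correlation rho, that a unit is observed when W > 0, and that Y = 1 when Y⁎ > 0. The names `tau`, `eta`, `rho` and `selected_prob` are illustrative assumptions and need not match the paper's notation. Under these assumptions, P(Y = 1 | W > 0) is the integral of phi(u) Phi((eta + rho u)/sqrt(1 - rho^2)) over u > -tau, divided by Phi(tau).

```python
import math


def norm_pdf(z):
    """Standard Gaussian density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def selected_prob(eta, tau, rho, n_steps=2000):
    """P(Y = 1 | W > 0) under the assumed latent model (hypothetical notation).

    eta : linear predictor of the latent response Y*
    tau : location of the latent selection variable W
    rho : correlation between the two Gaussian errors, |rho| < 1
    The integral over u > -tau is approximated by Simpson's rule
    (n_steps must be even); the upper limit is cut where the density
    is numerically negligible.
    """
    s = math.sqrt(1.0 - rho * rho)
    lo = -tau
    hi = max(lo, 0.0) + 10.0
    h = (hi - lo) / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        u = lo + i * h
        weight = 1 if i in (0, n_steps) else (4 if i % 2 else 2)
        total += weight * norm_pdf(u) * norm_cdf((eta + rho * u) / s)
    return (h / 3.0) * total / norm_cdf(tau)


if __name__ == "__main__":
    # With rho = 0 the selection is ignorable and the Probit link is recovered.
    print(selected_prob(0.3, 0.5, 0.0), norm_cdf(0.3))
    # With rho > 0 the selected population has a higher response probability.
    print(selected_prob(0.3, 0.5, 0.6))
```

Setting `rho = 0` returns the ordinary Probit probability, which is the sense in which the Probit model is a particular case of a model with selection.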
The likelihood function
We assume that we have a random sample of n observations , drawn from model (3). Let . The likelihood function of the sample is then which, by making use of (5), becomes The maximum likelihood estimate of is obtained by partial differentiation of the log likelihood function with respect to , and . As in the GLM, the MLEs are not in closed form. The score
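As a minimal sketch of how such a log likelihood can be evaluated, the code below treats each observed response as a Bernoulli variable whose success probability is the response probability in the selected population. It is an illustration under assumptions, not the paper's expression: the latent model (W = tau + U_W observed when W > 0, Y = 1 when x'beta + U_{Y⁎} > 0, correlation rho) and the names `beta`, `tau`, `rho`, `selected_prob` and `log_likelihood` are hypothetical.

```python
import math


def norm_pdf(z):
    """Standard Gaussian density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def selected_prob(eta, tau, rho, n_steps=2000):
    """P(Y = 1 | W > 0) by Simpson's rule over u > -tau (assumed model)."""
    s = math.sqrt(1.0 - rho * rho)
    lo = -tau
    hi = max(lo, 0.0) + 10.0
    h = (hi - lo) / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        u = lo + i * h
        weight = 1 if i in (0, n_steps) else (4 if i % 2 else 2)
        total += weight * norm_pdf(u) * norm_cdf((eta + rho * u) / s)
    return (h / 3.0) * total / norm_cdf(tau)


def log_likelihood(beta, tau, rho, xs, ys):
    """Bernoulli log likelihood of binary responses ys given covariate rows xs."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = sum(b * xj for b, xj in zip(beta, x))
        p = selected_prob(eta, tau, rho)
        # Guard against log(0) for extreme linear predictors.
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


if __name__ == "__main__":
    # Toy data, invented for illustration only; first covariate is the constant 1.
    xs = [(1.0, -1.0), (1.0, 0.0), (1.0, 0.5), (1.0, 2.0)]
    ys = [0, 1, 0, 1]
    for rho in (-0.5, 0.0, 0.5):
        print(rho, log_likelihood((0.2, 0.7), 0.5, rho, xs, ys))
```

Because no closed form is available, a function of this kind would be passed to a numerical optimizer. Evaluating it on a grid of `rho` values with the other parameters re-estimated at each point is one way to trace a profile log likelihood.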
EM algorithm
The EM algorithm may be easily implemented by making reference to the truncated binary regression model (1). To emphasize the relationship with the bivariate Gaussian distribution, in this section we parametrize the model with instead of . For x_i given, the complete data are then which are observations from a truncated bivariate Gaussian distribution with null expected value and correlation coefficient . The incomplete data are . The log likelihood,
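The distinction between complete and incomplete data can be made concrete by simulation. The sketch below is not the EM algorithm itself and is not taken from the paper. It draws latent pairs from a bivariate Gaussian by rejection, keeps only those satisfying an assumed selection rule W = tau + U_W > 0, and records the binary response Y = 1 when eta + U_{Y⁎} > 0. The names `draw_selected`, `eta`, `tau` and `rho` are hypothetical.

```python
import math
import random


def draw_selected(eta, tau, rho, n, seed=0):
    """Draw n units from the selected population by rejection sampling.

    Returns a list of (w, y_star, y): the latent pair (the 'complete'
    data in this sketch) and the binary response (the 'incomplete' data).
    """
    rng = random.Random(seed)
    s = math.sqrt(1.0 - rho * rho)
    sample = []
    while len(sample) < n:
        u_w = rng.gauss(0.0, 1.0)
        # Error of the latent response, correlated with u_w through rho.
        u_y = rho * u_w + s * rng.gauss(0.0, 1.0)
        w = tau + u_w
        if w > 0.0:  # only selected units are observable
            y_star = eta + u_y
            sample.append((w, y_star, 1 if y_star > 0.0 else 0))
    return sample


if __name__ == "__main__":
    data = draw_selected(eta=0.0, tau=0.0, rho=0.8, n=20000, seed=1)
    freq = sum(y for _, _, y in data) / len(data)
    # For eta = tau = 0 the orthant probability gives 1/2 + arcsin(rho)/pi.
    print(freq, 0.5 + math.asin(0.8) / math.pi)
```

In an EM scheme, the latent pairs would not be available and their conditional expectations given the observed responses would take their place in the E-step.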
Endogenous regressors
We illustrate situations where a binary response model with extended skew-normal link function may arise. Let be a multivariate random vector with jointly Gaussian distribution, with . Let and be, in order, the covariance and the correlation matrix. We partition . The vector is partitioned accordingly, as . The matrices and are then We assume that the observed population is selected according to .
Health scores data
Some outcomes of interest in medical research, such as mental or physical health, cannot be measured directly. One way to assess them is to ask the patients to fill in a questionnaire. Each item aims at measuring one particular aspect of health or ability, by providing a numerical evaluation, or score. Typically, scores have a finite range. The overall quality of life score is formed as a weighted sum of different scores, and is therefore bounded over an interval. For repeated measurements, later
Concluding remarks
We have proposed a binary response model with an asymmetric link function, which comprises the Probit model as a particular case. Other models with asymmetric links exist in the literature, and a review is given in Bazán et al. (2006). The consequences of a misspecified link on the ML estimates of the regression parameters are investigated in Czado and Santner (1992). The model used in this paper can be seen as arising from a particular form of selection which can be modelled by truncation on one hidden
Acknowledgments
We thank Prof. Sallie Lamb, Prof. Matthew Cooke and the CAST team for permission to use these data, and Prof. Jane Hutton for useful insights on the data analysis. The CAST trial was funded by a grant from the UK Department of Health through its Health Technology Assessment Programme, project no. 01/14/10. Part of this paper was completed when the corresponding author was visiting the Department of Statistics, University of Warwick, the hospitality of which is gratefully acknowledged. This
References (27)
- Czado and Santner. The effect of link misspecification on binary regression inference. Journal of Statistical Planning and Inference (1992).
- et al. Mechanical supports for acute, severe ankle sprains: a pragmatic, multi-centre, randomised controlled trial. The Lancet (2009).
- Bias prevention of maximum likelihood estimates for scalar skew normal and skew t distributions. Journal of Statistical Planning and Inference (2006).
- Arnold and Beaver. Hidden truncation models. Sankhya (2000).
- et al. Some properties of the multivariate skew-normal distribution. Journal of the Royal Statistical Society B (1999).
- Azzalini. The skew-normal distribution and related multivariate families. Scandinavian Journal of Statistics (2005).
- Bazán, J.L., Bolfarine, H., Branco, M.D., 2006. A generalized skew-Probit class link for binary regression. RT-MAE...
- Bhattacharya et al. Estimating probit models with self-selected treatments. Statistics in Medicine (2006).
- Boehmke. Using auxiliary data to estimate selection bias models, with an application to interest group use of the direct initiative process. Political Analysis (2003).
- et al. Instrumental Variables (1990).
- Graphical models for skew-normal variates. Scandinavian Journal of Statistics.
- Chen et al. A new skewed link model for dichotomous quantal response data. Journal of the American Statistical Association (1999).