Regularized simultaneous model selection in multiple quantiles regression

https://doi.org/10.1016/j.csda.2008.05.013

Abstract

Simultaneously estimating multiple conditional quantiles is often regarded as a more appropriate regression tool than the usual conditional mean regression for exploring the stochastic relationship between the response and covariates. When multiple quantile regressions are considered, it is of great importance to share strength among them. In this paper, we propose a novel regularization method that explores the similarity among multiple quantile regressions by selecting a common subset of covariates to model multiple conditional quantiles simultaneously. The penalty we employ is a matrix norm that encourages sparsity in a column-wise fashion. We demonstrate the effectiveness of the proposed method using both simulations and an application of gene expression data analysis.

Introduction

Consider a general regression setting where $y$ represents the response and $x=(x_1,\ldots,x_p)^T$ represents a set of predictors. Classical regression methods focus on recovering the conditional expectation $E(Y|X)$. Quantile regression (Koenker and Bassett, 1978) instead estimates the conditional quantile functions. Suppose we want to infer the $100\tau\%$ quantile (say $\tau=0.5$) of the conditional distribution of the response ($y$) given the covariates ($x$) based on $n$ independent observations $\{(x_i,y_i)\}_{i=1}^{n}$. Koenker and Bassett (1978) showed that one can estimate the conditional $\tau$-quantile by minimizing $\sum_{i=1}^{n}\rho_\tau(y_i-\beta_0-x_i^T\beta)$, where $\rho_\tau(t)=\tau t_{+}+(1-\tau)t_{-}$ is the so-called check function, with the subscripts '$+$' and '$-$' standing for the positive and negative parts, respectively. Quantile regression has been widely used in many areas, such as economics (Koenker and Hallock, 2001) and survival analysis (Koenker and Geling, 2001), among others. Nonlinear estimates can be obtained by the same method, except that the covariates $x$ are replaced with basis functions such as splines (He et al., 1998; Koenker et al., 1994; Yuan, 2006). In this paper we consider the variable selection problem in the linear quantile regression model.
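As a concrete illustration (not taken from the paper), the check-loss minimization above can be cast as a linear program by splitting each residual into its positive and negative parts, $y_i-\beta_0-x_i^T\beta=u_i-v_i$ with $u_i,v_i\ge 0$. The sketch below uses SciPy's HiGHS solver; the function name is ours:

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression(X, y, tau):
    """Estimate the conditional tau-quantile by minimizing the check loss,
    cast as the LP:  min  tau * sum(u) + (1 - tau) * sum(v)
                     s.t. y = b0 + X @ beta + u - v,   u, v >= 0."""
    n, p = X.shape
    # decision vector: [b0, beta (p entries), u (n entries), v (n entries)]
    c = np.concatenate([np.zeros(1 + p),
                        tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([np.ones((n, 1)), X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * (1 + p) + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[0], res.x[1:1 + p]
```

On noiseless linear data the minimizer reproduces the generating coefficients; with noisy data and $\tau=0.5$ it gives the usual median regression fit.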

Variable selection in conditional mean regression has received a lot of attention in recent years. Several regularization techniques have been invented for automatic variable selection, including the lasso (Tibshirani, 1996) and the SCAD (Fan and Li, 2001). As in conditional mean regression, variable selection is also crucial in quantile regression when the number of predictors is large. A sparse model is much more interpretable in practice and often enjoys improved estimation accuracy by eliminating irrelevant variables. Koenker (2004) considered using $L_1$ quantile regression to automatically select significant predictors. The $L_1$ quantile regression model is estimated by $\hat\beta(L_1\text{-norm})=\arg\min_{\beta_0,\beta}\sum_{i=1}^{n}\rho_\tau(y_i-\beta_0-x_i^T\beta)+\lambda\|\beta\|_1$, where $\|\beta\|_1=\sum_{j=1}^{p}|\beta_j|$ is the $L_1$-norm penalty (or lasso penalty) on $\beta$. When the tuning parameter $\lambda$ is appropriately chosen, some components of $\hat\beta$ are shrunk to exactly zero, and the corresponding variables are excluded from the final model.
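The $L_1$ penalty keeps the problem a linear program: write $\beta=\beta^{+}-\beta^{-}$ with $\beta^{+},\beta^{-}\ge 0$, so that $\|\beta\|_1=\sum_j(\beta_j^{+}+\beta_j^{-})$ becomes linear in the decision variables. A minimal sketch (function name and solver choice are ours, not the paper's):

```python
import numpy as np
from scipy.optimize import linprog

def l1_quantile_regression(X, y, tau, lam):
    """L1-penalized (lasso) quantile regression as an LP.
    beta is split as bp - bm with bp, bm >= 0; the intercept is unpenalized."""
    n, p = X.shape
    # variables: [b0 (free), bp (p), bm (p), u (n), v (n)]
    c = np.concatenate([[0.0], lam * np.ones(2 * p),
                        tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([np.ones((n, 1)), X, -X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] + [(0, None)] * (2 * p + 2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    beta = res.x[1:1 + p] - res.x[1 + p:1 + 2 * p]
    return res.x[0], beta
```

Setting `lam = 0` recovers the unpenalized fit, while a large `lam` shrinks every coefficient to exactly zero, illustrating the continuous selection path.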

In this paper, we consider the so-called simultaneous multiple quantiles regression (SMQR, for short), where we are interested in estimating multiple conditional quantile functions simultaneously. As demonstrated in Koenker (2005), a main advantage of quantile regression over classical mean regression is its ability to examine multiple conditional quantile functions, and provide a more comprehensive description of the relationship between the response and the covariates. Another example of SMQR arises naturally when jointly modelling multiple responses.

When considering multiple regression models such as SMQR, it is of great importance to share strength among different models, as illustrated in Breiman and Friedman (1997). When there is a large number of covariates, it is of particular interest to find a common set of variables that can be used for all models under investigation (Turlach et al., 2005). In the context of mean regression, Turlach et al. (2005) considered the problem of selecting a subset of 770 wavelengths suitable as predictors for 14 different but correlated infra-red spectrometry measurements, and they proposed a novel regularization method to perform simultaneous variable selection. Complementary to this earlier work, we study simultaneous variable selection in multiple quantiles regression. Simultaneous model selection is actually more relevant in quantile regression than in classical mean regression, since estimating multiple quantiles of a single response is routinely done in practice. It is often desirable to select a common set of significant variables for modeling a sequence of quantiles of the response. More generally, it is natural in many applications for the sets of variables used to model different conditional quantiles to overlap. The goal of this paper is to develop a model selection method that is capable of exploring such similarity and performing simultaneous model selection in multiple quantiles regression whenever that is desirable.

A naive approach to model selection in multiple quantiles regression would be to fit individual $L_1$ quantile regression models separately and take the union of the variables selected by each model. This naive approach cannot guarantee that the same set of variables is selected across the $L_1$ quantile regression models. Furthermore, it may also be suboptimal in terms of predictive accuracy. For example, consider the classical model underlying linear quantile regression, $y=x_1\beta_1+\cdots+x_p\beta_p+\epsilon$, where we omit the intercept for brevity. Clearly, all conditional quantiles can be described by the same set of variables, and estimation can be greatly improved by recognizing this fact. The naive approach does not share information across the quantile regression models, hence its results might be suboptimal compared with methods that combine strength from multiple models. It is well known that when estimating multiple statistical models, it is beneficial to share information across them (Breiman and Friedman, 1997). The same wisdom applies to multiple quantiles regression, as demonstrated in Section 4.

To overcome the drawbacks of the naive approach, we introduce a new regularization method for performing simultaneous model selection in multiple quantiles regression. We propose to penalize the sum of the check functions of multiple quantile regression models by a norm of the coefficient matrix that encourages column-wise sparsity. As the regularization parameter varies, the penalty does simultaneous variable selection via continuous shrinkage. The rest of the paper is organized as follows. In Section 2 we present methodological details of the penalized multiple quantiles regression. Section 3 discusses the implementation details of the proposed method. Simulation results are presented in Section 4 and we also demonstrate the utility of the proposed method on the cardiomyopathy data in Section 5.

Section snippets

Penalized multiple quantiles regression

To fix ideas, consider first estimating multiple quantiles of a single response. Suppose we want to estimate the $\tau_1,\ldots,\tau_G$ quantiles of the conditional distribution $y|x$. Denote by $\beta^{(k)}=(\beta_1^{(k)},\ldots,\beta_p^{(k)})^T$ the coefficients of the covariates in the $\tau_k$ conditional quantile function of $y$ given $x$, where $k=1,2,\ldots,G$. We also write $\beta_{(j)}=(\beta_j^{(1)},\ldots,\beta_j^{(G)})^T$ for each $j=1,2,\ldots,p$, and call $\beta_{(j)}$ the coefficient vector of variable $x_j$. Let $\beta$ be the coefficient matrix whose $(k,j)$ element is $\beta_j^{(k)}$. With such a notation
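The snippet above is truncated before the penalty is written out. One matrix norm with the column-wise sparsity effect described in the abstract (our illustrative assumption, not necessarily the paper's exact choice) is the sum over variables of the sup-norm of each coefficient vector $\beta_{(j)}$:

```python
import numpy as np

def columnwise_penalty(B):
    """Sum over variables j of the sup-norm of the coefficient vector
    beta_(j) = (beta_j^(1), ..., beta_j^(G)); B has shape (G, p) with
    B[k, j] = beta_j^(k). A variable is excluded from all G quantile
    models exactly when its entire column of B is zero."""
    return np.abs(B).max(axis=0).sum()
```

Because the penalty charges only the largest coefficient in each column, shrinking a column to zero removes the variable from every quantile model at once, which is the simultaneous selection behavior being sought.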

Implementation

In this section we show that (5) can be solved efficiently by linear programming techniques. We also consider data-driven methods for selecting the regularization parameter λ.
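Problem (5) itself is not reproduced in this snippet. As an illustration of why the optimization remains a linear program, assume a sup-norm column penalty $\lambda\sum_j\max_k|\beta_j^{(k)}|$: bounding each column with an auxiliary variable $m_j\ge|\beta_j^{(k)}|$ turns the penalty into a linear objective with linear inequality constraints (all names below are ours):

```python
import numpy as np
from scipy.optimize import linprog

def smqr(X, y, taus, lam):
    """Sketch of a simultaneous multiple-quantiles LP: G check losses plus
    lam * sum_j m_j, where m_j >= |beta_j^(k)| for every quantile level k."""
    n, p = X.shape
    G = len(taus)
    # variables: [b0 (G), beta (G*p, grouped by k), m (p), u (G*n), v (G*n)]
    nb = G + G * p + p
    c = np.zeros(nb + 2 * G * n)
    c[G + G * p:nb] = lam                      # penalty on the m_j
    for k, tau in enumerate(taus):             # check-loss weights
        c[nb + k * n:nb + (k + 1) * n] = tau
        c[nb + G * n + k * n:nb + G * n + (k + 1) * n] = 1 - tau
    # equality: y = b0^(k) + X beta^(k) + u^(k) - v^(k), for each k
    A_eq = np.zeros((G * n, nb + 2 * G * n))
    for k in range(G):
        r = slice(k * n, (k + 1) * n)
        A_eq[r, k] = 1.0
        A_eq[r, G + k * p:G + (k + 1) * p] = X
        A_eq[r, nb + k * n:nb + (k + 1) * n] = np.eye(n)
        A_eq[r, nb + G * n + k * n:nb + G * n + (k + 1) * n] = -np.eye(n)
    # inequalities: +beta_j^(k) - m_j <= 0 and -beta_j^(k) - m_j <= 0
    A_ub = np.zeros((2 * G * p, nb + 2 * G * n))
    row = 0
    for k in range(G):
        for j in range(p):
            for s in (1.0, -1.0):
                A_ub[row, G + k * p + j] = s
                A_ub[row, G + G * p + j] = -1.0
                row += 1
    bounds = [(None, None)] * (G + G * p) + [(0, None)] * (p + 2 * G * n)
    res = linprog(c, A_eq=A_eq, b_eq=np.tile(y, G),
                  A_ub=A_ub, b_ub=np.zeros(2 * G * p),
                  bounds=bounds, method="highs")
    return res.x[:G], res.x[G:G + G * p].reshape(G, p)
```

With `lam = 0` the G quantile fits decouple; as `lam` grows, entire columns of the coefficient matrix are driven to zero simultaneously.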

Simulation

In this section we conduct a Monte Carlo simulation to check the performance of the proposed method. Two criteria are considered: model error and model selection performance. For any fit $\{\hat f^{(k)}\}_{k=1}^{G}$, its model error is defined as $\mathrm{ME}(\hat f)=E_{y,x}\left[\frac{1}{G}\sum_{k=1}^{G}\left(\frac{1}{n}\sum_{i=1}^{n}E_{z^{(k)}}\left[\rho_{\tau_k}\left(z_i^{(k)}-\hat f^{(k)}(x_i)\right)\right]\right)\right]$. In all simulated examples the underlying model has a sparse representation. Model selection performance is measured by the sparsity of the fitted model.
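The inner expectations in the model error can be approximated by Monte Carlo on replicated responses. A small sketch (helper names are ours):

```python
import numpy as np

def check_loss(t, tau):
    """rho_tau(t) = tau * t_+ + (1 - tau) * t_-."""
    return tau * np.maximum(t, 0) + (1 - tau) * np.maximum(-t, 0)

def empirical_model_error(preds, z, taus):
    """Monte Carlo version of ME(fhat): average check loss over the G
    quantile levels, the n design points, and R replicated responses.
    preds is a list of G arrays of shape (n,); z has shape (R, n)."""
    return np.mean([check_loss(z - preds[k], tau).mean()
                    for k, tau in enumerate(taus)])
```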

As a comparison we also include the $L_1$ quantile regression

Real data

In this section we apply the developed penalized multiple quantiles regression method to analyze the cardiomyopathy data. The response variable in this study is a G protein-coupled receptor, designated Ro1. When the receptor is over-expressed in the heart of adult mice, the mice develop a lethal dilated cardiomyopathy that has many hallmarks of human disease. The mice recover when the expression of the receptor is turned off (Segal et al., 2003). The goal of the study is to investigate the

Conclusion

In this paper we have proposed a new regularization technique to perform simultaneous model selection in multiple quantiles regression. We have demonstrated the promising performance of the proposed method using simulated and real data. It is also worth noting that SMQR provides a unified solution to three different multiple quantiles regression problems: (1) multiple quantiles of a single response; (2) the same quantile of multiple responses; and (3) multiple quantiles

Acknowledgements

We thank Professor Mark Segal for kindly providing us with the cardiomyopathy data. The authors sincerely thank an AE and two referees for their helpful comments that substantially improved an earlier version of this paper. This work is supported by NSF grant DMS-0706733.

References (22)

  • R. Koenker

    Quantile regression for longitudinal data

    Journal of Multivariate Analysis

    (2004)
  • M. Yuan

GACV for quantile smoothing splines

    Computational Statistics and Data Analysis

    (2006)
  • L. Breiman et al.

    Predicting multiple responses in multiple linear regression (with discussion)

    Journal of the Royal Statistical Society: Series B

    (1997)
  • J. Fan et al.

    Variable selection via nonconcave penalized likelihood and its oracle properties

    Journal of the American Statistical Association

    (2001)
  • Fan, J., Lv, J., 2008. Sure independence screening for ultra-high dimensional feature space (with discussion). Journal...
  • T. Hastie et al.

The Elements of Statistical Learning: Data Mining, Inference, and Prediction

    (2001)
  • X. He et al.

    Bivariate quantile smoothing splines

Journal of the Royal Statistical Society: Series B

    (1998)
  • R. Koenker

    Quantile Regression

  • R. Koenker et al.

    Regression quantiles

    Econometrica

    (1978)
  • R. Koenker et al.

    Reappraising medfly longevity: A quantile regression survival analysis

    Journal of the American Statistical Association

    (2001)
  • R. Koenker et al.

    Quantile regression

    Journal of Economic Perspectives

    (2001)