
Bayesian variable selection in linear quantile mixed models for longitudinal data with application to macular degeneration

Abstract

This paper presents a Bayesian analysis of linear mixed models for quantile regression based on a Cholesky decomposition for the covariance matrix of random effects. We develop a Bayesian shrinkage approach to quantile mixed regression models using a Bayesian adaptive lasso and an extended Bayesian adaptive group lasso. We also consider variable selection procedures for both fixed and random effects in a linear quantile mixed model via the Bayesian adaptive lasso and extended Bayesian adaptive group lasso with spike and slab priors. To improve mixing of the Markov chains, a simple and efficient partially collapsed Gibbs sampling algorithm is developed for posterior inference. Simulation experiments and an application to the Age-Related Macular Degeneration Trial data are presented to demonstrate the proposed methods.

Introduction

Quantile regression (QR) for longitudinal or panel data models has received increasing attention in recent years. For example, [1] proposed a general method to estimate the QR coefficients using an L1-penalized likelihood approach. Later, [2] generalized the work of [1] to a more flexible model with endogenous variables. [3] offered an estimator of individual effects for panel data QR models that is widely used by practitioners because it is computationally simple. [4] studied quantile panel models with correlated random effects. [5] proposed a Stochastic Approximation of the EM (SAEM) algorithm to analyze linear quantile mixed models (LQMMs) via the asymmetric Laplace distribution. Other related papers include [6–9], among many others.

As in any regression problem, when there are many covariates in longitudinal or panel QR models, variable selection becomes necessary to avoid overfitting and multicollinearity. [10] presented penalized quantile regression models for random intercepts using Bayesian lasso priors and Bayesian adaptive lasso priors, respectively. [11] also considered a Bayesian lasso approach to jointly estimate a vector of covariate effects and a vector of random effects by introducing an l1 penalty. However, for the random effects, their method mainly focuses on penalizing the diagonal elements of the covariance matrix of random effects after a modified Cholesky decomposition. In addition, they neglect the nonnegativity constraint on the diagonal terms of that decomposition, which can cause non-uniqueness of the decomposition [12]. We therefore first propose a new Bayesian shrinkage approach to address these problems in LQMMs. In particular, we develop an extended Bayesian adaptive group lasso that accommodates the nonnegativity constraint on the diagonal terms while attempting to identify the non-zero random effects. Unfortunately, like [10] and [11], the shrinkage approach still requires ad hoc thresholding to perform variable selection, for the well-known reason that continuous shrinkage priors never set coefficients exactly to zero. An alternative for Bayesian variable selection is to use spike and slab priors. We therefore also develop a general variable selection technique, via the Bayesian adaptive lasso and the extended Bayesian adaptive group lasso with spike and slab priors, to identify the significant fixed and random effects simultaneously.

The paper is organized as follows. Section 2 presents the reparameterized LQMMs based on a Cholesky decomposition of the covariance matrix of random effects. Section 3 describes the structure of our hierarchical Bayesian LQMMs and discusses our prior specifications; we also develop a partially collapsed Gibbs (PCG) sampling algorithm for posterior inference. In Section 4, we develop a variable selection procedure for both fixed and random effects in a LQMM with spike and slab priors, together with an efficient PCG sampler. We illustrate the performance of our methods through simulation studies and a real data example in Sections 5 and 6; the results show that the proposed approaches perform very well. Section 7 concludes the paper.

The re-parameterized LQMMs

Consider n subjects, the ith containing ni observations, for j = 1, …, ni and i = 1, …, n, where yij is the jth response for subject i, is the jth row of a known ni × p matrix xi, and is the jth row of a known ni × q matrix zi. The canonical linear mixed model is as follows: (1) where β is a p × 1 vector of unknown population fixed parameters and αi is a q × 1 vector of unobservable random effects with covariance matrix D. In order to guarantee that D is positive semidefinite, we apply the Cholesky decomposition D = ΓΓ′, where Γ = (γst) is a q × q lower triangular matrix with nonnegative diagonal entries. Similar to [13], model (1) can then be re-expressed as follows: (2) Here the row vector multiplying γ in (2) has dimension 1 × q(q + 1)/2, γk consists of the non-zero lower triangular elements of the kth row of Γ, k = 1, …, q, Jq is a transformation matrix of size q2 × q(q + 1)/2 such that vec(Γ) = Jqγ, and Iq denotes the q × q identity matrix. For example, if q = 2 the transformation matrix J2 would have the following form With model (2), the linear conditional τth (0 < τ < 1) mixed quantile estimator can be calculated by (3) In a Bayesian setup, this is equivalent to assuming that the error terms εij in (2) follow the asymmetric Laplace distribution (ALD), whose probability density function is given by (4) where σ−1 and τ are the scale and skewness parameters, respectively. [14] first used the ALD to establish Bayesian QR in a linear model for independent data and proposed a random walk Metropolis-Hastings algorithm to draw samples. Since then, a large literature has extended and applied QR based on the ALD, e.g., [15–18]. However, direct sampling from the ALD is often inconvenient and does not generalize easily to more complex scenarios. We therefore employ the mixture representation of the ALD proposed by [19], which has also been used by many authors (e.g., [11, 20–22]), to facilitate Bayesian inference in LQMMs: (5) where vij ∼ exp(1/σ) has an exponential prior with mean 1/σ, ζij ∼ N(0, 1) has a standard normal prior and is independent of vij. Let v = (vij: i = 1, …, n; j = 1, …, ni) and ζ = (ζij: i = 1, …, n; j = 1, …, ni). Hence, based on the Cholesky decomposition and the mixture representation of the ALD, we obtain the following hierarchical model,
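As a concrete illustration of the transformation matrix Jq, the short Python sketch below (our own illustration, not code from the paper) constructs Jq for arbitrary q and verifies that vec(Γ) = Jqγ when γ stacks the lower-triangular elements of Γ row by row; the function name build_Jq and the numerical values are hypothetical.

```python
import numpy as np

def build_Jq(q):
    """Transformation matrix Jq (q^2 x q(q+1)/2) such that vec(Gamma) = Jq @ gamma,
    where gamma stacks the lower-triangular entries of Gamma row by row."""
    m = q * (q + 1) // 2
    Jq = np.zeros((q * q, m))
    col = 0
    for i in range(q):              # row index of Gamma
        for j in range(i + 1):      # column index j <= i (lower triangle)
            Jq[j * q + i, col] = 1.0   # vec() stacks columns: entry (i, j) sits at j*q + i
            col += 1
    return Jq

# Check for q = 2, where gamma = (gamma_11, gamma_21, gamma_22)'
q = 2
Gamma = np.array([[1.3, 0.0],
                  [0.4, 0.9]])
gamma = Gamma[np.tril_indices(q)]                  # row-wise lower-triangular elements
assert np.allclose(build_Jq(q) @ gamma, Gamma.flatten(order="F"))
print(build_Jq(q))
```

For q = 2 this reproduces a 4 × 3 matrix whose third row is all zeros, reflecting the structural zero above the diagonal of Γ.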

Shrinkage estimation of linear quantile mixed regression

We first propose a new Bayesian shrinkage approach for LQMMs using the Bayesian adaptive lasso and an extended Bayesian adaptive group lasso. A regularized τth quantile estimate of the fixed and random effect coefficients can be defined by (6) Motivated by [23], we place Laplace priors on the βs, where , Laplace priors on the γk, where , and suppose that the error terms εij come from the ALD (4). Then the posterior distribution of β and γ is proportional to (7) It is easily seen that maximizing the posterior distribution (7) is equivalent to minimizing the objective function (6) if we ignore the nonnegativity constraints γkk ≥ 0, k = 1, …, q. However, it is not obvious how to incorporate these constraints within the Bayesian penalized LQMM framework. Using truncated normal priors, [24] and [25] solved a similar problem for the selection of fixed effects, which have no group structure, in linear and generalized linear mixed models, respectively. We extend this idea to a Bayesian group lasso for the shrinkage of random effects with group structure. We specify the conditional prior distribution of γk as . To facilitate posterior inference, we use a mixture representation to yield where . Then γk can be expressed hierarchically as (8) for k = 1, …, q. Turning to the fixed effects, β can also be written hierarchically as follows (9) for s = 1, …, p. There are two main ways to estimate the tuning parameters in Bayesian regularization: a fully Bayesian method, which places a hyperprior on them, such as a conjugate Gamma distribution [23], and an empirical Bayes method, which estimates them by maximum marginal likelihood [26]. Here we take the fully Bayesian approach and place Gamma priors on the parameters , and σ to complete the Bayesian hierarchy. We develop a partially collapsed Gibbs (PCG) sampler, which integrates the full conditional distribution of β with respect to {bi}, to sample from the posterior distribution. PCG samplers were proposed by [27] to improve the convergence behavior of the Gibbs sampler and generalize the blocking and collapsing methods of [28]. The key idea of the PCG is to replace some of the full conditional distributions with marginal posterior distributions while preserving the target distribution. Next, we present the conditional distributions of the unknown parameters.
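Before turning to the individual conditionals, the following sketch spells out one plausible form of the objective in (6) implied by the priors above: the quantile check loss plus an adaptive lasso penalty on β and an adaptive group-lasso penalty on the rows γk of Γ. The design matrices X and F and the weight vectors lam1 and lam2 are placeholders for the corresponding quantities in (6), so this is an illustration rather than the paper's exact formula.

```python
import numpy as np

def check_loss(u, tau):
    """Quantile check function rho_tau(u) = u * (tau - I(u < 0))."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))

def penalized_objective(y, X, F, beta, gamma_rows, tau, lam1, lam2):
    """Check loss of the LQMM residuals plus an adaptive lasso penalty on beta
    and an adaptive group-lasso penalty on the rows gamma_k of Gamma."""
    resid = y - X @ beta - F @ np.concatenate(gamma_rows)
    fit = check_loss(resid, tau).sum()
    pen_beta = np.sum(np.asarray(lam1) * np.abs(beta))          # adaptive lasso, one weight per beta_s
    pen_gamma = sum(l * np.linalg.norm(g) for l, g in zip(lam2, gamma_rows))  # group penalty per gamma_k
    return fit + pen_beta + pen_gamma
```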

(1) Conditional distribution of β:

Let T = diag(ts, s = 1, …, p) and . The conditional distribution of β is then a multivariate normal distribution, where

(2) Conditional distribution of bi:

The conditional distribution of bi is a multivariate normal distribution with mean and variance where

(3) Conditional distribution of :

The full conditional distribution of follows an Inverse Gaussian(μ′, λ′) with p.d.f. (10) Here and , where .
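Draws from an Inverse Gaussian(μ′, λ′) full conditional such as (10) can be generated with the Michael–Schucany–Haas (1976) transformation; the generic sampler below takes μ′ and λ′ as inputs and is only a sketch of one standard way to perform this step, not code from the paper.

```python
import numpy as np

def rinvgauss(mu, lam, rng=None):
    """Draw one Inverse Gaussian(mu, lam) variate via the
    Michael-Schucany-Haas (1976) transformation."""
    rng = rng or np.random.default_rng()
    nu = rng.standard_normal()
    y = nu ** 2
    x = mu + (mu ** 2 * y) / (2 * lam) \
        - (mu / (2 * lam)) * np.sqrt(4 * mu * lam * y + mu ** 2 * y ** 2)
    # Accept x with probability mu / (mu + x); otherwise return mu^2 / x
    if rng.uniform() <= mu / (mu + x):
        return x
    return mu ** 2 / x
```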

(4) Conditional distributions of and :

The full conditional distributions of and are independent Gamma distributions,

(5) Conditional distributions of ts and ηk:

The full conditional distributions of ts and ηk are independent Inverse Gaussian,

(6) Conditional distribution of γ:

Denote , i = 1, …, n; j = 1, …, ni, and let Fijk be the covariate vector corresponding to γk, k = 1, …, q. The full conditional distribution of γk is then a truncated normal distribution, where and Fij(k) represent the covariate matrix corresponding to γ(k).
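In particular, the diagonal entries γkk are drawn from a normal distribution truncated to [0, ∞), which enforces the nonnegativity constraint. A minimal sketch using scipy, with an assumed conditional mean mu and standard deviation sd standing in for the expressions above, is:

```python
from scipy.stats import truncnorm

def draw_nonneg_gamma(mu, sd, rng=None):
    """Draw gamma_kk from N(mu, sd^2) truncated to [0, infinity),
    enforcing the nonnegativity constraint on the Cholesky diagonal."""
    a = (0.0 - mu) / sd        # standardized lower bound
    b = float("inf")           # no upper bound
    return truncnorm.rvs(a, b, loc=mu, scale=sd, random_state=rng)
```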

(7) Conditional distribution of σ:

The full conditional distribution of σ is

Bayesian model selection in LQMMs

The shrinkage approach proposed in the previous section has potential for variable selection in LQMMs. However, the estimates are never exactly zero because of the continuous priors on β and γ. This section develops a general model selection method with spike and slab priors to identify the relevant effects. We first discuss the prior specification and propose the hierarchical model. Subsequently, we consider a partially collapsed Gibbs sampling algorithm for posterior computation and derive the relevant conditional distributions.

We assume the following hierarchical Bayesian lasso with independent spike and slab type priors for β: where δ0(⋅) denotes a point mass at 0, is the prior probability of excluding the sth fixed effect from the model, which is assigned a beta prior with parameters and , resulting in a noninformative uniform prior on (0, 1). Furthermore, we assume the following extended hierarchical Bayesian adaptive group lasso with independent spike and slab type priors for γ: where π0 is the prior probability of excluding the kth random effect from the model, which is also assigned a noninformative uniform prior on (0, 1). We now present the conditional distributions of the unknown parameters.
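To make the spike and slab structure concrete, the sketch below draws a single fixed-effect coefficient from a prior of this type: with probability pi_s the coefficient is set exactly to zero (the spike), and otherwise it is drawn from a Laplace slab written as a normal scale mixture, mirroring the adaptive lasso construction. The hyperparameter values are placeholders, not those used in the paper.

```python
import numpy as np

def draw_spike_slab_beta(pi_s, lam_s, rng=None):
    """Prior draw of beta_s under a spike-and-slab adaptive-lasso prior:
    beta_s = 0 with probability pi_s (the spike), otherwise
    beta_s ~ Laplace(0, 1/lam_s), written as a normal scale mixture (the slab)."""
    rng = rng or np.random.default_rng()
    if rng.uniform() < pi_s:                      # spike: exclude the s-th fixed effect
        return 0.0
    t_s = rng.exponential(2.0 / lam_s ** 2)       # mixing variance t_s, exponential with rate lam_s^2 / 2
    return rng.normal(0.0, np.sqrt(t_s))          # slab: beta_s | t_s ~ N(0, t_s)

# Example: inclusion probability drawn from its uniform Beta(1, 1) hyperprior
rng = np.random.default_rng(1)
print(draw_spike_slab_beta(rng.beta(1.0, 1.0), lam_s=2.0, rng=rng))
```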

(1) Conditional distribution of β:

Let , and . The conditional distribution of βs is then a spike and slab distribution, where

(2) Conditional distribution of bi:

The conditional distribution of bi is a multivariate normal distribution with mean and variance where

(3) Conditional distribution of :

The full conditional distribution of follows an Inverse Gaussian(μ′, λ′) with and , where .

(4) Conditional distributions of λ1s and λ2k:

The full conditional distributions of λ1s and λ2k are independent Gamma distributions,

(5) Conditional distributions of ts and ηk: where InverseGamma(a, b) denotes an Inverse Gamma distribution with shape parameter a and scale parameter b.

(6) Conditional distribution of and π0:

The full conditional distributions of and π0 are independent Beta distributions.

(7) Conditional distribution of γ:

Denote , , , and let Fijk be the covariate vector corresponding to γk, k = 1, …, q. The full conditional distribution of γk is then a spike and slab distribution, where Here, Φ(⋅) denotes the cumulative distribution function of the standard normal distribution, is the kth element of and is the kth diagonal element of .

(8) Conditional distribution of σ:

The full conditional distribution of σ is

Simulation studies

In this section we examine the performance of the proposed extended Bayesian adaptive group lasso approach (adGL) and the Bayesian adaptive group lasso with independent spike and slab type priors (adSpikeGL) through simulations. We also compare them with other methods: the PMQ method of [10], the BL and BAL methods of [11], and the ordinary quantile regression estimator (QR), which treats the observations as independent. The test data are generated from the following model: for i = 1, …, 50, j = 1, …, 5, where εij has τth quantile equal to 0 and the random effects αi ∼ N(0, D), with D given by (11) Similar formulations of D in mixed models have been used by [29] and [30]. We set and , where the xijk, k = 1, …, 8, are drawn independently from the uniform distribution on [−2, 2]. The fixed effects parameters β = (β1, …, β8)′ are set as follows:

  • Simulation 1: β = (3, 1.5, 0, 0, 2, 0, 0, 0)′ to illustrate a sparse case;
  • Simulation 2: β = (0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85)′ to illustrate a dense case;
  • Simulation 3: β = (5, 0, 0, 0, 0, 0, 0, 0)′ to illustrate a very sparse case.

For each specification of β, we generate data under three different quantiles, i.e., τ = 0.1, 0.3 and 0.5, and four different error distributions (normal, normal mixture, Laplace, and Laplace mixture), as follows:

  • (1) εij ∼ N(0, 9), i = 1, …, n, j = 1, …, ni;
  • (2) εij ∼ 0.1N(0, 1) + 0.9N(0, 5), i = 1, …, n, j = 1, …, ni;
  • (3) εij ∼ Laplace(0, 3), i = 1, …, n, j = 1, …, ni;
  • (4) ;

We consider three sample sizes: n = 30, ni = 5; n = 50, ni = 5; and n = 100, ni = 5. Therefore, we have a total of 36 simulation setups for each sample size. We simulate 100 data sets for each setup. Priors for the Bayesian methods are taken to be weakly informative: the hyperparameters in the Gamma priors for the tuning and scale parameters are all set to 0.1. We run our PCG sampler for 20,000 iterations following an initial burn-in of 10,000 cycles. The convergence of the proposed PCG sampler is monitored using the multivariate potential scale reduction factor (MPSRF) introduced by [31]. Figs 1–6 show that the MPSRF generally becomes stable and approaches 1 after about 10,000 iterations under the normal mixture error distribution, suggesting that the burn-in size is large enough to ensure convergence of the PCG sampler. The results for the other error distributions are similar and omitted.
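For reference, the sketch below generates one replicate of Simulation 1 under the normal mixture error. The covariance matrix D in (11) and the random-effect design are not reproduced here, so the choices below (random effects attached to the first three covariates and an assumed D) are illustrative assumptions only; the error is re-centered so that its τth quantile equals zero, as the design requires.

```python
import numpy as np

def simulate_one(n=50, ni=5, tau=0.5, seed=0):
    """One replicate of Simulation 1 under the normal mixture error
    0.1*N(0,1) + 0.9*N(0,5); the error is shifted so its tau-th quantile is 0.
    The random-effect design and D are illustrative assumptions, not the paper's."""
    rng = np.random.default_rng(seed)
    beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])   # Simulation 1 (sparse case)
    D = np.array([[1.0, 0.3, 0.3],                               # assumed covariance of alpha_i
                  [0.3, 1.0, 0.3],
                  [0.3, 0.3, 1.0]])

    def mixture(size):
        comp = rng.uniform(size=size) < 0.1
        return np.where(comp, rng.normal(0.0, 1.0, size), rng.normal(0.0, np.sqrt(5.0), size))

    shift = np.quantile(mixture(200_000), tau)    # Monte Carlo tau-quantile of the raw error
    y, X, Z, groups = [], [], [], []
    for i in range(n):
        xi = rng.uniform(-2.0, 2.0, size=(ni, 8))
        zi = xi[:, :3]                            # assumed: random effects on first 3 covariates
        alpha_i = rng.multivariate_normal(np.zeros(3), D)
        eps = mixture(ni) - shift                 # tau-th quantile of eps is (approximately) 0
        y.append(xi @ beta + zi @ alpha_i + eps)
        X.append(xi); Z.append(zi); groups.append(np.full(ni, i))
    return np.concatenate(y), np.vstack(X), np.vstack(Z), np.concatenate(groups)
```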

Fig 1. MPSRF for adGL in Simulation 1 under the normal mixture error distribution.

https://doi.org/10.1371/journal.pone.0241197.g001

Fig 2. MPSRF for adGL in Simulation 2 under the normal mixture error distribution.

https://doi.org/10.1371/journal.pone.0241197.g002

Fig 3. MPSRF for adGL in Simulation 3 under the normal mixture error distribution.

https://doi.org/10.1371/journal.pone.0241197.g003

Fig 4. MPSRF for adSpikeGL in Simulation 1 under the normal mixture error distribution.

https://doi.org/10.1371/journal.pone.0241197.g004

Fig 5. MPSRF for adSpikeGL in Simulation 2 under the normal mixture error distribution.

https://doi.org/10.1371/journal.pone.0241197.g005

Fig 6. MPSRF for adSpikeGL in Simulation 3 under the normal mixture error distribution.

https://doi.org/10.1371/journal.pone.0241197.g006

We use the posterior median estimates of β and D as our point estimators, denoted and , respectively, and we consider two error measures: the mean absolute deviation (MAD) and the root mean squared error (RMSE). To assess the parameter estimation, we also calculate the median of the mean square error for β () and the median of the quadratic loss error for D (). More specifically, where is the predicted value of yij, and the medians are taken over the 100 simulations. The MAD and RMSE results are summarized in Figs 7–12, while the median mean square errors and quadratic loss errors are listed in Tables 1–9. Because BL, BAL and QR neglect the covariance structure of the random effects, we only report the results of the other three methods for these comparisons.
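One natural implementation of the two prediction-error measures, under the assumption that MAD and RMSE are computed from the fitted responses, is the following sketch:

```python
import numpy as np

def mad(y, y_hat):
    """Mean absolute deviation between observed and fitted responses."""
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    """Root mean squared error between observed and fitted responses."""
    return np.sqrt(np.mean((y - y_hat) ** 2))
```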

Fig 7. Boxplots summarizing the MAD for three sample sizes in Simulation 1.

The rows from top to bottom correspond to normal, normal mixture, Laplace, and Laplace mixture error distributions, respectively.

https://doi.org/10.1371/journal.pone.0241197.g007

Fig 8. Boxplots summarizing the RMSE for three sample sizes in Simulation 1.

The rows from top to bottom correspond to normal, normal mixture, Laplace, and Laplace mixture error distributions, respectively.

https://doi.org/10.1371/journal.pone.0241197.g008

Fig 9. Boxplots summarizing the MAD for three sample sizes in Simulation 2.

The rows from top to bottom correspond to normal, normal mixture, Laplace, and Laplace mixture error distributions, respectively.

https://doi.org/10.1371/journal.pone.0241197.g009

Fig 10. Boxplots summarizing the RMSE for three sample sizes in Simulation 2.

The rows from top to bottom correspond to normal, normal mixture, Laplace, and Laplace mixture error distributions, respectively.

https://doi.org/10.1371/journal.pone.0241197.g010

Fig 11. Boxplots summarizing the MAD for three sample sizes in Simulation 3.

The rows from top to bottom correspond to normal, normal mixture, Laplace, and Laplace mixture error distributions, respectively.

https://doi.org/10.1371/journal.pone.0241197.g011

Fig 12. Boxplots summarizing the RMSE for three sample sizes in Simulation 3.

The rows from top to bottom correspond to normal, normal mixture, Laplace, and Laplace mixture error distributions, respectively.

https://doi.org/10.1371/journal.pone.0241197.g012

Table 1. Median of mean square errors and quadratic loss errors at τ = 0.1 for Simulation 1.

are in parentheses.

https://doi.org/10.1371/journal.pone.0241197.t001

Table 2. Median of mean square errors and quadratic loss errors at τ = 0.3 for Simulation 1.

are in parentheses.

https://doi.org/10.1371/journal.pone.0241197.t002

Table 3. Median of mean square errors and quadratic loss errors at τ = 0.5 for Simulation 1.

are in parentheses.

https://doi.org/10.1371/journal.pone.0241197.t003

Table 4. Median of mean square errors and quadratic loss errors at τ = 0.1 for Simulation 2.

are in parentheses.

https://doi.org/10.1371/journal.pone.0241197.t004

Table 5. Median of mean square errors and quadratic loss errors at τ = 0.3 for Simulation 2.

are in parentheses.

https://doi.org/10.1371/journal.pone.0241197.t005

Table 6. Median of mean square errors and quadratic loss errors at τ = 0.5 for Simulation 2.

are in parentheses.

https://doi.org/10.1371/journal.pone.0241197.t006

Table 7. Median of mean square errors and quadratic loss errors at τ = 0.1 for Simulation 3.

are in parentheses.

https://doi.org/10.1371/journal.pone.0241197.t007

Table 8. Median of mean square errors and quadratic loss errors at τ = 0.3 for Simulation 3.

are in parentheses.

https://doi.org/10.1371/journal.pone.0241197.t008

Table 9. Median of mean square errors and quadratic loss errors at τ = 0.5 for Simulation 3.

are in parentheses.

https://doi.org/10.1371/journal.pone.0241197.t009

Several observations can be drawn from the results. First, Figs 7–12 show that the proposed approaches perform well in general; adGL generally outperforms the other methods in terms of MAD and RMSE, which indicates that explicitly penalizing the off-diagonal covariance components can improve the model selection procedure. Similar findings were obtained by [32]. The adSpikeGL also performs well, although not quite as well as adGL. Second, Tables 1–9 show that adSpikeGL performs best in terms of and . Taking n = 50 as an example, adSpikeGL has the smallest in 22 out of 36 simulation setups and the smallest in 24 out of 36 setups. Although PMQ performs well for τ = 0.1, its performance drops at τ = 0.3 and τ = 0.5, especially in the dense case (Simulation 2). Third, Tables 1–9 also show that our approaches work well in terms of MAD, RMSE and the two median error measures even when the true error distribution is not the ALD, indicating that the ALD in model (2) is merely a working likelihood whose maximization is equivalent to minimizing (3). Finally, Figs 7–12 and Tables 1–9 show that, in terms of the four criteria, adGL, adSpikeGL and PMQ all exhibit an overall decreasing error trend as the sample size increases, whereas BL, BAL and QR generally fail to estimate accurately, possibly because they almost completely ignore the random effects that describe the within-subject dependence, and their performance does not improve as the sample size grows.

For adSpikeGL, the median posterior probability model (MPPM) criterion [33] and the highest posterior probability model (HPPM) criterion are used for model selection. Tables 10, 12 and 14 (MPPM) and Tables 11, 13 and 15 (HPPM) summarize the average numbers of correct and wrong zero coefficients for all cases. On the whole, adSpikeGL yields promising results in most cases, even for the extreme quantile τ = 0.1, which is notoriously challenging. As the sample size grows from 30 to 100, the number of wrong zero coefficients goes to zero at all considered quantiles, and the number of correct zero coefficients also increases at the moderate and median quantiles. These results indicate the good performance of our method in variable selection.
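Given posterior draws of the inclusion indicators, the two criteria can be computed as below: MPPM keeps every effect whose marginal posterior inclusion probability exceeds 0.5 [33], while HPPM selects the configuration most frequently visited by the sampler. This is a minimal sketch that assumes the indicator draws are stored as a draws-by-effects 0/1 array.

```python
import numpy as np
from collections import Counter

def mppm(indicators):
    """Median probability model: keep effects with inclusion probability > 0.5."""
    return (indicators.mean(axis=0) > 0.5).astype(int)

def hppm(indicators):
    """Highest posterior probability model: most frequently visited configuration."""
    counts = Counter(map(tuple, indicators.astype(int)))
    return np.array(counts.most_common(1)[0][0])

# Example with synthetic indicator draws (3 effects, 1000 MCMC iterations)
rng = np.random.default_rng(0)
draws = rng.binomial(1, [0.9, 0.1, 0.6], size=(1000, 3))
print(mppm(draws), hppm(draws))
```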

Table 10. Average numbers of correct and wrong zero coefficients for fixed effects and random effects in terms of MPPM for Simulation 1.

Random effects are in parentheses.

https://doi.org/10.1371/journal.pone.0241197.t010

Table 11. Average numbers of correct and wrong zero coefficients for fixed effects and random effects in terms of HPPM for Simulation 1.

Random effects are in parentheses.

https://doi.org/10.1371/journal.pone.0241197.t011

Table 12. Average numbers of correct and wrong zero coefficients for fixed effects and random effects in terms of MPPM for Simulation 2.

Random effects are in parentheses.

https://doi.org/10.1371/journal.pone.0241197.t012

Table 13. Average numbers of correct and wrong zero coefficients for fixed effects and random effects in terms of HPPM for Simulation 2.

Random effects are in parentheses.

https://doi.org/10.1371/journal.pone.0241197.t013

Table 14. Average numbers of correct and wrong zero coefficients for fixed effects and random effects in terms of MPPM for Simulation 3.

Random effects are in parentheses.

https://doi.org/10.1371/journal.pone.0241197.t014

Table 15. Average numbers of correct and wrong zero coefficients for fixed effects and random effects in terms of HPPM for Simulation 3.

Random effects are in parentheses.

https://doi.org/10.1371/journal.pone.0241197.t015

Next, in order to evaluate the sensitivity of our methods, we consider the following four different priors for the tuning and scale parameters in Simulation 1: case (1) , case (2) , case (3) , case (4) . Table 16 shows that the differences among the four priors are quite small in general. Hence our methods are not sensitive to the choice of hyper-parameters for these priors.

Table 16. MAD, RMSE, and for different priors in Simulation 1, when normal mixture error distribution.

https://doi.org/10.1371/journal.pone.0241197.t016

Real data example

In this section, we use the Age-Related Macular Degeneration Trial (ARMD) data to assess the behavior of the proposed models at different quantiles. This data set consists of 4 variables and is available in the R package nlme. Visual acuity of 188 patients with ARMD was measured at 4 time points, resulting in 752 observations. We consider the relationship between visual acuity (VISUAL) and three explanatory variables: the value of visual acuity measured at baseline (VISUAL0), the time of the measurement (TIME), and the treatment indicator (TREAT). A detailed description is given in [34]. [34] analyzed the data set using a linear mixed model for the mean and found possible interactions between the time and treatment effects, so we consider the following LQMM (i = 1, …, 188, j = 1, …, 4):

Here we choose five different values of τ, i.e., 0.1, 0.3, 0.5, 0.7 and 0.9, to thoroughly describe the response distribution. We implement the adGL, adSpikeGL, PMQ, BL, BAL and QR models on this data set. Table 17 shows summaries of the MAD and the RMSE. The results in Table 17 suggest that PMQ provides smaller MAD and RMSE than the other methods at the upper quantiles (τ = 0.7 and τ = 0.9), whereas adGL and adSpikeGL perform better at the lower and median quantiles (τ = 0.1, 0.3 and 0.5). We also report the posterior median estimates and 95% credible intervals for the fixed effects and random effect variances in Tables 18 and 19, respectively. As in the simulations, adGL, adSpikeGL and PMQ tend to behave similarly, as do BAL, BL and QR, and no method except adSpikeGL shrinks any fixed or random effect exactly to zero.

The results of adSpikeGL in Tables 18 and 19 show that (i) for the fixed effects, VISUAL0, TIME and TREAT are all identified as effective predictors of visual acuity at all quantiles, except for TREAT at the extreme quantile (τ = 0.1), possibly because the treatment has no significant effect on patients with poor visual acuity; (ii) for the random effects, VISUAL0 and TREAT are deemed significant predictors of visual acuity at all quantiles, whereas TIME is insignificant at the lower and median quantiles; and (iii) it is worth noting that neither the fixed nor the random effect of the time-by-treatment interaction (TIME×TREAT) is selected by adSpikeGL at any quantile, suggesting that the effect of the treatment does not change over time regardless of the patient's vision.

Table 18. Posterior median and 95% credible intervals for the fixed effects of ARMD data.

https://doi.org/10.1371/journal.pone.0241197.t018

Table 19. Posterior median and 95% credible intervals for the random effect variances of ARMD data.

https://doi.org/10.1371/journal.pone.0241197.t019

Conclusion and discussion

In this work, we presented a novel shrinkage method for linear quantile mixed models for analysing and interpreting longitudinal data. We also considered variable selection based on a Bayesian adaptive lasso and an extended Bayesian adaptive group lasso with spike and slab priors. Unlike other approaches for LQMMs, which require ad hoc thresholding for model selection, adSpikeGL can directly identify the significant fixed and random effects simultaneously. Several criteria were adopted to evaluate the finite-sample performance of the proposed methods thoroughly through simulations and a real data example. The results reveal that the proposed methods are very competitive with existing methods. Moreover, as the sample size grows, adGL and adSpikeGL generally provide smaller MMAD and MRMSE across all scenarios, and adSpikeGL also performs well in variable selection in most cases. The only exception is at τ = 0.1, where the number of correct zero coefficients of adSpikeGL did not show an increasing trend with the sample size. This problem may be caused by a drawback of the ALD, which does not allow the skewness, kurtosis and tails of the distribution to vary freely. To overcome this problem, [35] recently introduced a generalized asymmetric Laplace distribution (GALD) for QR in linear regression. In the future, we would like to extend the GALD to linear mixed effects models. Moreover, we would also like to consider other Bayesian shrinkage priors, such as the double-Pareto prior [36], the horseshoe prior [37] and the normal-gamma prior [38], although lasso-type priors remain the most widely used shrinkage priors in the literature.

Acknowledgments

We thank the editor, the associate editor and two anonymous referees for their helpful comments which led to a considerable improvement of the original article.

References

  1. Koenker R. Quantile regression for longitudinal data. Journal of Multivariate Analysis. 2004;91:74–89.
  2. Harding M, Lamarche C. A quantile regression approach for estimating panel data models using instrumental variables. Economics Letters. 2009;104:133–135.
  3. Canay AI. A simple approach to quantile regression for panel data. Econometrics Journal. 2011;14:368–386.
  4. Graham B, Hahn J, Poirier A, Powell J. A quantile correlated random coefficients panel data model. Journal of Econometrics. 2018;206:305–335.
  5. Galarza C, Lachos V, Bandyopadhyay D. Quantile regression in linear mixed models: a stochastic approximation EM approach. Statistics and Its Interface. 2017;10:471–482. pmid:29104713
  6. Chernozhukov V, Fernandez Val I, Hahn J, Newey W. Average and quantile effects in nonseparable panel models. Econometrica. 2013;81:535–580.
  7. Harding M, Lamarche C. Estimating and testing a quantile regression model with interactive effects. Journal of Econometrics. 2014;178:101–113.
  8. Sriram K, Shi P, Ghosh P. A Bayesian quantile regression model for insurance company costs data. Journal of the Royal Statistical Society, Series A. 2016;179:177–202.
  9. Geraci M, Bottai M. Linear quantile mixed models. Statistics and Computing. 2014;24:461–479.
  10. Aghamohammadi A, Mohammadi S. Bayesian analysis of penalized quantile regression for longitudinal data. Statistical Papers. 2017;58:1035–1053.
  11. Alhamzawi R, Yu K. Bayesian Lasso mixed quantile regression. Journal of Statistical Computation and Simulation. 2014;84:868–880.
  12. Seber GAF. Linear Regression Analysis. New York: Wiley; 1977.
  13. Ibrahim JG, Zhu H, Garcia RR, Guo R. Fixed and random effects selection in mixed effects models. Biometrics. 2011;67:495–503. pmid:20662831
  14. Yu K, Moyeed RA. Bayesian quantile regression. Statistics and Probability Letters. 2001;54:437–447.
  15. Yu K, Zhang J. A three-parameter asymmetric Laplace distribution and its extension. Communications in Statistics – Theory and Methods. 2005;34:1867–1879.
  16. Geraci M, Bottai M. Quantile regression for longitudinal data using the asymmetric Laplace distribution. Biostatistics. 2007;8:140–154. pmid:16636139
  17. Li Q, Xi R, Lin N. Bayesian regularized quantile regression. Bayesian Analysis. 2010;5:533–556.
  18. Alhamzawi R, Yu K. Bayesian adaptive LASSO quantile regression. Statistical Modelling. 2012;12:279–297.
  19. Kozumi H, Kobayashi G. Gibbs sampling methods for Bayesian quantile regression. Journal of Statistical Computation and Simulation. 2011;81:1565–1578.
  20. Ji YG, Lin N, Zhang BX. Model selection in binary and tobit quantile regression using the Gibbs sampler. Computational Statistics & Data Analysis. 2012;56:827–839.
  21. Alshaybawee T, Midi H, Alhamzawi R. Bayesian elastic net single index quantile regression. Journal of Applied Statistics. 2017;44:853–871.
  22. Kobayashi G. Bayesian endogenous tobit quantile regression. Bayesian Analysis. 2017;12:161–191.
  23. Li Q, Lin N. The Bayesian elastic net. Bayesian Analysis. 2010;5:151–170.
  24. Chen Z, Dunson D. Random effects selection in linear mixed models. Biometrics. 2003;59:762–769. pmid:14969453
  25. Kinney SK, Dunson D. Fixed and random effects selection in linear and logistic models. Biometrics. 2008;63:690–698.
  26. van der Pas S, Szabo B, van der Vaart A. Adaptive posterior contraction rates for the horseshoe. Electronic Journal of Statistics. 2017;11:3196–3225.
  27. van Dyk DA, Park T. Partially collapsed Gibbs samplers: theory and methods. Journal of the American Statistical Association. 2008;103:790–796.
  28. Liu JS, Wong WH, Kong A. Covariance structure of the Gibbs sampler, with applications to comparisons of estimators and augmentation schemes. Biometrika. 1994;81:27–40.
  29. Bondell HD, Krishna A, Ghosh SK. Joint variable selection for fixed and random effects in linear mixed effects models. Biometrics. 2010;66:1069–1077. pmid:20163404
  30. Peng H, Lu Y. Model selection in linear mixed effect models. Journal of Multivariate Analysis. 2012;109:109–129.
  31. Brooks SP, Gelman A. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics. 1998;7:434–455.
  32. Williams JP, Lu Y. Covariance selection in the linear mixed effect model. JMLR: Workshop and Conference Proceedings. 2015;44:277–291.
  33. Barbieri MM, Berger JO. Optimal predictive model selection. The Annals of Statistics. 2004;32:870–897.
  34. Galecki A, Burzykowski T. Linear Mixed-Effects Models Using R. Springer; 2013.
  35. Yan Y, Kottas A. A new family of error distributions for Bayesian quantile regression. Working paper. 2017.
  36. Armagan T, Dunson D, Lee J. Generalized double Pareto shrinkage. Statistica Sinica. 2013;23:119–143. pmid:24478567
  37. Carvalho CM, Polson NG, Scott JG. The horseshoe estimator for sparse signals. Biometrika. 2010;97:465–480.
  38. Griffin JE, Brown PJ. Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis. 2010;5:171–188.