Combining Sparse Group Lasso and Linear Mixed Model Improves Power to Detect Genetic Variants Underlying Quantitative Traits

Guo, Yingjie; Wu, Chenxi; Guo, Maozu; Zou, Quan; Liu, Xiaoyan; Keinan, Alon

doi:10.3389/fgene.2019.00271

ORIGINAL RESEARCH article

Front. Genet., 10 April 2019

Sec. Computational Genomics

Volume 10 - 2019 | https://doi.org/10.3389/fgene.2019.00271

This article is part of the Research Topic Machine Learning Advanced Dynamic Omics Data Analysis for Precision Medicine View all 40 articles

Combining Sparse Group Lasso and Linear Mixed Model Improves Power to Detect Genetic Variants Underlying Quantitative Traits

$\r\nYingjie Guo,$ Yingjie Guo^1,2

Chenxi Wu³

Maozu Guo^1,4^*

Quan Zou⁵

Xiaoyan Liu¹

Alon Keinan^2,6^*

¹School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
²Department of Computational Biology, Cornell University, Ithaca, NY, United States
³Department of Mathematics, Rutgers University, Piscataway, NJ, United States
⁴School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
⁵Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
⁶Cornell Center for Comparative and Population Genomics, Center for Vertebrate Genomics, and Center for Enervating Neuroimmune Disease, Cornell University, Ithaca, NY, United States

Genome-Wide association studies (GWAS), based on testing one single nucleotide polymorphism (SNP) at a time, have revolutionized our understanding of the genetics of complex traits. In GWAS, there is a need to consider confounding effects such as due to population structure, and take groups of SNPs into account simultaneously due to the “polygenic” attribute of complex quantitative traits. In this paper, we propose a new approach SGL-LMM that puts together sparse group lasso (SGL) and linear mixed model (LMM) for multivariate associations of quantitative traits. LMM, as has been often used in GWAS, controls for confounders, while SGL maintains sparsity of the underlying multivariate regression model. SGL-LMM first sets a fixed zero effect to learn the parameters of random effects using LMM, and then estimates fixed effects using SGL regularization. We present efficient algorithms for hyperparameter tuning and feature selection using stability selection. While controlling for confounders and constraining for sparse solutions, SGL-LMM also provides a natural framework for incorporating prior biological information into the group structure underlying the model. Results based on both simulated and real data show SGL-LMM outperforms previous approaches in terms of power to detect associations and accuracy of quantitative trait prediction.

1. Introduction

Quantitative traits are important in medicine, agriculture, and evolution but, until recently, few polymorphisms have been shown to be related in these traits. Genome-wide association studies (GWAS) is a statistical technique that has been used successfully in the identification of over 65,000 single-nucleotide polymorphisms (SNPs) that are connected to various traits or diseases (MacArthur et al., 2017). Typically, GWAS are carried out using single-locus models (i.e., testing for association between each SNP and a given phenotype independently using linear or logistic regression). However, according to the popular “polygenic theory” (Li et al., 2015b; Dudbridge, 2016), complex traits are often controlled by multiple SNPs collectively. Due to the need to eliminate multi-testing corrections that decrease statistical power, a better understanding of the underlying heritable genetic architecture of complex traits requires one to move beyond single-locus models to multivariate linear regression models that incorporate the joint effects of multiple SNPs explicitly (Ma et al., 2013).

Usually, the multi-locus GWAS are large p small n problems (i.e., the number of features (SNPs) far exceeds the number of samples, and one would expect only a small number of features are associated with the phenotype predictor). Therefore, as is customary for similar regression problems, it is necessary to regularize by demanding sparsity in the coefficients of the final model to prevent over-fitting and to maintain interpretability. The most popular regularizing penalty that serves this purpose is the lasso (i.e., least absolute shrinkage and selection operator) (Tibshirani, 1996), which is the L1 norm of the coefficients of features. Yang et al. (2012) fit sparse predictors for all genome-wide SNPs using stepwise, forward selection. Li et al. (2011) imposed a Laplace prior, which led to the Bayesian lasso. Arbet et al. (2017) developed a permutation-based, selection procedure to test the significance of lasso coefficients.

In GWAS, one expects the effective SNPs to be clustered in a few genes or pathways, hence, adding group structure by mandating sparsity on the group level is a good way to apply this prior knowledge that can potentially outperform the simple lasso. Yuan and Lin (2006) proposed using the group lasso for the linear regressions, which imposed a regularization penalty of the sum of the L2 norm on groups that guaranteed that few groups were selected. But if a group is selected, so are all the predictors in it.

The group lasso has already enjoyed much success in GWAS (Li et al., 2015a; Lim and Hastie, 2015). A caveat, however, is its assumption that either all SNPs in a group being associated or none of the SNPs in a group being associated. It is desirable to not only constrain sparsity between groups (only a few groups are associated), but also within groups; only a few SNPs in each active group are associated. Hence, we propose to employ a sparse group lasso (SGL), which is a regularization method aimed at achieving both between- and within-group sparsity simultaneously (Rao et al., 2013, 2016; Simon et al., 2013). The SGL has a L2 penalty that promotes the selection of only a subset of the groups and L1 penalty that promotes the selection of only a subset of the predictors within a group.

Another important factor in genetic association studies is the existence of confounding, which are indirect associations between markers and traits due to factors like population structure, family structure, and cryptic relatedness. Methods for correcting these confounding factors include EIGENSTRAT (Price et al., 2006), family-based association, genomic control, and linear mixed models (LMMs) (Fisher, 1919; Hoffman, 2013; Hoffman et al., 2014). Compared with other methods, LMMs provide more fine-grained control by modeling the contribution of these confounders as a random effect term. They are capable of capturing the cumulative effect of all types of confounding simultaneously without the need of prior knowledge on which confounding is present and without the need to estimate them individually. However, the time and space costs of LMM are high compared with simpler confounding models. Previous attempts to improve the performance of LMM includes Zhou and Stephens (2012) (EMMA), Kang et al. (2010) (EMMAX), Zhang et al. (2010) (P3D), Lippert et al. (2011) (FaST-LMM), and Li et al. (2017) (StepLMM). All of these methods are univariate models that are powerful in detecting few associations with large effect sizes.

Although joint modeling of multiple weak effects and correction for population structure have been tackled individually, few existing methods are capable of addressing them simultaneously. Segura et al. (2012) proposed a multi-locus, mixed model approach using stepwise forward selection. Rakitsch et al. (2012) and Papachristou et al. (2016) developed new association methods that combined LMM and lasso to enjoy the benefits of both methods.

There are a variety of patterns that typically arise in regularization (Figure 1). Prior knowledge can be utilized by using the SGL, which maintains both between- and within-group sparsity. The relative strength between L1 and L2 norms can be used to represent prior knowledge on the comparative degrees of sparsity at the SNP and gene level. In particular, by varying the ratio between L1 and L2 norms, the approach includes both group lasso and lasso as special cases.

FIGURE 1

Figure 1. A comparison of different sparsity patterns that occur in the analysis of genome-wide association studies. SNPs belong to M genes. Association SNPs that influence the phenotype are represented by boxes that are shaded gray. (a) Shows a lasso sparse pattern. An example of the group sparse pattern is shown in (b). In (c), we show the pattern in which we are interested in this paper.

In this paper, we propose a novel analysis that not only combines multivariate analysis with population correction using Fast-LMM, but we also incorporate the group structure of the SNPs as biological priors. We use the gene as the group unit, and it is reasonable to assume that the model should be sparse not only on the SNP-level (only relatively few SNPs are involved), but on the gene level as well (those functional SNPs belong to relatively few genes). Experiments on semi-empirical data showed that the combination of sparse group lasso and a linear mixed model yielded better power to identify marker associations in a large range of settings, and application to real datasets have verified that SGL-LMM generated a sparse solution with accurate prediction of phenotypes and interpretable detection of marker associations.

2. Materials and Methods

2.1. Method

We used a linear mixed model to model the genetic effects on the phenotypes. More precisely, we modeled the phenotype as a sum of three terms: a fixed effect determined by the association SNPs, a random confounding effect due to population structure, and an i.i.d. noise as follows:

\begin{array}{l} y = X β + y_{p o p} + ϕ & (1) \end{array}

where y is a vector of observed phenotypes of size m × 1 for m samples, X is a m × q matrix that consists of SNPs and other (e.g., environmental, familial etc.) variables of the m samples, y_pop is a m × 1 random matrix with distribution $N (0, σ_{g}^{2} K)$ where K is an m by m matrix called realized relationship matrix(RRM) that captures the overall genetic similarity between all pairs of samples, and $ϕ ~ N (0, σ_{e}^{2} I)$ .

To make a prediction on y, one only needs β and $δ = \frac{σ_{e}^{2}}{σ_{g}^{2}}$ . Following FAST-LMM, our overall strategy for estimating the parameters β and δ goes as follows:

1. Set β = 0, find the optimal δ.

2. Use the δ from the first step to estimate β, regularizing by using SGL.

Now we describe each of the two steps in more detail.

2.1.1. Estimate of δ

To calculate δ we use an approach similar to Fast-LMM. Because β was set to 0, we have:

\begin{array}{l} y ~ N (0, σ_{g}^{2} (K + δ I)) & (2) \end{array}

Hence the log likelihood for a given y is

\begin{array}{l} L L (δ, σ_{g}^{2}) = \log N (0, σ_{g}^{2} (K + δ I)) \\ = - \frac{1}{2} (m \log (2 π σ_{g}^{2}) + \log (\det (K + δ I) \\ + \frac{1}{σ_{g}^{2}} y^{T} {(K + δ I)}^{- 1} y) & (3) \end{array}

Diagonalize K into K = USU^T where U is orthogonal and S is diagonal, and we have:

\begin{array}{l} L L (δ, σ_{g}^{2}) = - \frac{1}{2} (m \log (2 π σ_{g}^{2}) + \log (\det (S + δ I) \\ + \frac{1}{σ_{g}^{2}} {(U^{T} y)}^{T} {(S + δ I)}^{- 1} (U^{T} y)) & (4) \end{array}

Substitute $σ_{g}^{2}$ with the optimal value:

\begin{array}{l} \hat{σ_{g}^{2}} = \frac{{(U^{T} y)}^{T} {(S + δ I)}^{- 1} (U^{T} y)}{m} & (5) \end{array}

we have:

\begin{array}{l} L L (δ) = - \frac{1}{2} (log (det (S + δ I)) + m log \frac{{(U^{T} y)}^{T} {(S + δ I)}^{- 1} (U^{T} y)}{m}) \\ + C & (6) \end{array}

Where C does not depend on δ. The optimal δ can then be calculated from above as a one dimensional optimization problem:

\begin{array}{l} \hat{δ} = arg min (log (det (S + δ I)) + m log \frac{{(U^{T} y)}^{T} {(S + δ I)}^{- 1} (U^{T} y)}{m}) & (7) \end{array}

2.1.2. Estimate of β

In this subsection, we describe the estimation for β based on model described by Equation (1), then, in the next subsection, we introduce the SGL regularization.

Equation (1) implies that:

\begin{array}{l} y ~ N (X β, σ_{g}^{2} (K + δ I)) & (8) \end{array}

Hence, using the diagonalization we see that, after δ and $σ_{g}^{2}$ have been estimated in the previous subsection, the log-likelihood becomes:

\begin{array}{l} L L (β) = \log N (X β, \hat{σ_{g}^{2}} (K + \hat{δ} I)) \\ = - \frac{m}{2} \log (2 π \hat{σ_{g}^{2}}) - \frac{1}{2} \log (\det (S + \hat{δ} I) \\ - \frac{1}{2 \hat{σ_{g}^{2}}} (U^{T} (y - X β))^{T} {(S + \hat{δ} I)}^{- 1} (U^{T} (y - X β)) \\ = - \frac{1}{2 \hat{σ_{g}^{2}}} (U^{T} (y - X β))^{T} {(S + \hat{δ} I)}^{- 1} (U^{T} (y - X β)) + C & (9) \end{array}

Let $S_{\hat{δ}}$ be the non-negative diagonal matrix defined by $S_{\hat{δ}}^{- 2} = S + \hat{δ} I$ , or, more concretely, ${(S_{\hat{δ}})}_{i i} = {(S_{i i} + \hat{δ})}^{- 1 / 2}$ , then the MLE of β is

\begin{array}{l} \hat{β} = arg min {(U^{T} (y - X β))}^{T} {(S + \hat{δ} I)}^{- 1} (U^{T} (y - X β)) \\ = arg min {(S_{\hat{δ}} U^{T} y - S_{\hat{δ}} U^{T} X β)}^{T} (S_{\hat{δ}} U^{T} y - S_{\hat{δ}} U^{T} X β) \\ = arg min ‖ S_{\hat{δ}} U^{T} y - S_{\hat{δ}} U^{T} X β ‖_{2}^{2} & (10) \end{array}

Here || · ||₂ is the L² norm. $S_{\hat{δ}} U^{T} y$ and $S_{\hat{δ}} U^{T} X$ are obtained from y and X by a rotation and a scaling, and to simplify notations we denote them as $\tilde{y}$ and $\tilde{X}$ , respectively.

2.1.3. Sparse Group Lasso

To maintain sparsity in the estimated β, we need to add a regularizer to Equation (10). We used the SGL regularizer: let $G$ be a family of possibly overlapping groups of components in β, for each group $G \in G$ , let β_G be the vector that consists of these components, let λ > 1 and 0 ≤ α ≤ 1, then the regularized optimization problem becomes:

\begin{array}{l} {\hat{β}}_{r e g} = arg min ‖ S_{δ} U^{T} y - S_{δ} U^{T} X β ‖_{2}^{2} + λ (1 - α) \sum_{G \in G} ‖ β_{G} ‖_{2} + λ α ‖ β_{G} ‖_{1} & (11) \end{array}

Here λ is the strength of regularization, and α is the comparative strength of the L¹ and L² regularization, with indicating how much sparsity at the SNP level is desired compared to the sparsity at the group level. From a Bayesian perspective, one can think of it as adding a regularizing prior to β of the form:

\begin{array}{l} log p (β) \propto (1 - α) \sum_{G \in G} ‖ β_{G} ‖_{2} + α ‖ β_{G} ‖_{1} & (12) \end{array}

2.1.4. Phenotype Prediction

With estimated β and δ, phenotype prediction follows from a straight-forward MLE using Equation (1). Suppose there are other samples with genotype X′ and unknown phenotype y′, then

\begin{array}{l} L L (y^{'}) \propto {([\begin{matrix} y^{'} \\ y \end{matrix}] - [\begin{matrix} X^{'} \\ X \end{matrix}] \hat{β})}^{T} {(K + \hat{δ} I)}^{- 1} ([\begin{matrix} y^{'} \\ y \end{matrix}] - [\begin{matrix} X^{'} \\ X \end{matrix}] \hat{β}) & (13) \end{array}

Here $K = [\begin{matrix} K_{X^{'} X^{'}} & K_{X^{'} X} \\ K_{X^{'} X}^{T} & K_{X X} \end{matrix}]$ So, by linear algebra, the MLE estimate for y′ is

\begin{array}{l} \hat{y^{'}} = X^{'} \hat{β} + K_{X^{'} X} {(K_{X X} + \hat{δ} I)}^{- 1} (y - X \hat{β}) & (14) \end{array}

We can summarize the SGL-LMM significant SNPs selection in the following algorithm:

ALGORITHM 1

Algorithm 1: Parameter estimate for LMM with SGL regularization

2.1.5. Complexity Analysis

Let n be the number of samples and s be the number of SNPs. When training the null model, the complexity is O(n³) which is from the computation of eigenvalues and eigenvectors. This is reasonable when n is about 10k but for higher n one can improve on the time complexity by only taking into account the dominant eigenevalues. The proximal gradient step has a complexity of about O(ns), and since n is usually much less than s, one can see it as more or less O(s). The prediction step has a complexity of O(nn′s), where n′ is the size of the testing set. From the complexity analysis, we can see that SGL-LMM is scalable for the genome-wide association analysis. But when analysing with a huge genome such as the human genome, we recommend to analysis each chromosome individually or doing a 2nd step based on suggested loci from GWAS.

2.2. Model Selection

When solving the Equation (11), we employ SGL R package. Instead of doing a two dimensional grid search of λ and α to determine the optimal parameters, the package fix the mixing parameter α and compute solutions for a path with many λ values. The path begins with lambda sufficiently large to set $\hat{β} = 0$ and let lambda decrease until the result is close to unregularized. Taking advantage of this mechanism, we carry out feature selection using LMM-SGL through the following steps:

(1) Finding the λ that optimizes phenotype prediction accuracy

In order to find the best λ for phenotype prediction, we first fitted the sparse group lasso model with the whole dataset to find a λ path. We then used 5-fold cross validation to find the appropriate λ, which maximize the average explained variance on the test dataset.

(2) Stability selection

To evaluate the significance of individual SNPs, we carry out stability selection (Meinshausen and Bühlmann, 2010). To obtain a more accurate ranking of SNPs, after the optimal λ was selected in the step above, we chose another 9 λs from the larger λs in the λ path evenly spaced. This group of λs were used in each stability selection process to rank the features by the order of inclusion into the model. We drew randomly no more than 50% of the samples as proposed in the original artical 100 times. We selected all SNPs that were found in ≥ 50% of all results. Significance estimate can be deduced from the selection frequency of individual SNPs.

We summarize the process as the algorithm below and the overall pipeline of SGL-LMM method as Figure 2:

ALGORITHM 2

Algorithm 2: Feature selection using SGL-LMM

FIGURE 2

Figure 2. Flow chart of the SGL-LMM method. The dotted line shows the data flow, and the solid line shows the flow of control.

2.3. Simulation Study

To evaluate the accuracy of SGL-LMM and pervious methods for association mapping, we considered a semi-empirical example based on the genotypic and phenotypic data for up to 1307 world-wide accessions of Arabidopsis thaliana from Atwell et al. (2010). The data can be downloaded from https://github.com/Gregor-Mendel-Institute/atpolydb. Based on the quality control provided by GWAS, we excluded a SNP if its Minor Allele Frequency (MAF) was < 0.05, if its missing rate was >0.05 of the population, or its allele frequencies were not in Hardy-Weinberg equilibrium (P < 0.0001). After filtering, there were 200155 SNPs left.

To simulate the effect of population structure, we used the real phenotypic leaf number at flowering time (LN,16°C,16 h daylight) which is available for 177 plants of the 1307 plants of A.thaliana. Univariate analyses showed that the phenotype had an excess of associations when population structure was not taken into account (Atwell et al., 2010). After correction for population effect, the p-values are approximately uniformly distributed, Which means this phenotype is totally subjected to population structure. Hence, we use this phenotype to simulate the confounding effect. First, to determine the fraction δ of genetic and residual variance, we fit a random effects model to LN, which we subsequently used to predict the population structure for the remaining 1,130 plants. We run the random effect model multiple times, and choose the final dataset which the difference of genetic variance parameter between real and synthetic data are less than 0.0001. In addition to this empirical background, we added simulated association with different effect sizes and a range of complexities of genetic models.

We then simulated the phenotype as follows:

\begin{array}{l} y = σ_{s i g} y_{s i g} + (1 - σ_{s i g}) [σ_{p o p} y_{p o p} + (1 - σ_{p o p}) φ] & (15) \end{array}

where $y_{s i g} = X^{k} β$ , X^k is the genotype data for the k causal SNPs. By introducing the group structure, we consider a case with N_g = 200 genes(groups) on the chromsome1 which covered 2000 SNPs, we set m groups to be active. We vary the sparsity level of the active groups to get the total active SNPs to be k. $β ~ N (0, I)$ and $φ ~ N (0, I)$ . During the simulation, we maintained the original LD structure in each gene.

The initial setting used for simulation were 3 active groups each containing 5 effective SNP (k = 15 and m = 3). To investigate the influence of the confounding effect strength and the overall noise, we considered varied σ_pop ∈ {0.5, 0.7, 0.9} and σ_sig ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. For each combination of σ_pop and σ_sig, we generate 10 datasets, resulting in 120 datasets in total for the 12 combinations.

2.4. Application With Arabidopsis thaliana Data

To assess the capacity of SGL-LMM to deal with real association mapping of quantitative phenotypes, we investigated the susceptibility of a set of SNPs that belong to genes of several flowering phenotypes in A. thaliana. We used the same dataset as in the simulation study. From the 107 phenotypes, we chose 10 flowering time phenotypes (Table S1).

To verify our method, we constructed our dataset in the following ways:

1. We obtained gene information from the A. thaliana annotation file. For each gene, 10kb of buffer region was added both upstream and downstream of the defined gene location. All SNPs between the regions were considered.

2. From chromosome 1 to chromosome 5, we chose the top 1,000 largest genes to form a genotype data file. There were a total 49,962 SNPs in the 1,000 genes.

3. According to the most promising association listed in Atwell's paper, we chose 19 genes that were related strongly to flowering time and added them to the genotype. The 19 genes consisted of 367 SNPs, so that the final genotype file had 50,329 SNPs (Table S2).

4. For each phenotype, a corresponding kinship matrix was generated in the same way as described in the simulation study.

3. Results

3.1. Existing Methods

To compare our SGL-LMM method with existing techniques, we considered standard regularization methods that included Lasso and SGL, which model all SNPs simultaneously without correcting for population structure. Also, we combined LMM with different regularization strategies (e.g., Lasso-LMM was listed as a comparison). All the methods that were related to regularization were fit in identical ways (see section 2.2).

3.2. Performance Measurements

In this paper, all the models output a ranking list of SNPs with their frequencies of being chosen; true significant markers were rare and accounted for only 15 out of 1,993 in our simulation datasets. Hence, we treated this as a binary classification problem with an imbalanced dataset where we assigned association markers as label 1 and background markers as label 0. The frequency of each marker was treated as the predicted probability for label 1.

The ROC (Receive operating characteristic) curve and the PR (Precision-Recall) curve are commonly used to evaluate performance of classification models. The ROC curve is created by plotting the Sensitivity against the Specificity while varying the threshold settings:

\begin{array}{l} s e n s i t i v i t y (T r u e P o s i t i v e R a t e, T P R) = \frac{T P}{T P + F N} \end{array}

\begin{array}{l} s p e c i f i c i t y (F a l s e P o s i t i v e R a t e, F P R) = \frac{T N}{T N + F P} \end{array}

The PR curve is created by plotting the Precision against the Recall at various threshold settings:

\begin{array}{l} p r e c i s i o n = \frac{T P}{T P + F P} \end{array}

\begin{array}{l} r e c a l l = \frac{T P}{T P + F N} \end{array}

where TP=TruePositve, TN=TrueNegative, FP=FalsePositive, and FN=FalseNegative.

In our imbalanced setting, the ROC curve was not a good visual illustration, because the False Positive Rate did not drop drastically when the True Negative was huge. Whereas, the PR curve was highly sensitive to False Positive and was not impacted by a large True Negative denominator. Hence, we chose the PR curve to evaluate the performance for all the methods, and we used the average AUC (Area Under Curve) of the PR curve to explore the impact of various simulation settings.

3.3. Results of the Simulation Study

3.3.1. SGL-LMM Ranks Causal SNPs Higher Than Alternative Methods

We assessed the performance in recovering causal SNPs with a true simulated association. PR curves were constructed while varying σ_pop in {0.5, 0.7, 0.9} with σ_sig set at 0.2 (Figure 3). Notice that a larger AUC score indicated better performance. For this experiment, we chose effective SNPs from three of the 200 groups, while taking sparsity into account, and we set the ratio α of L1 and L2 penalty in SGL-LMM to be 0.95. The two methods that incorporated LMM for population correction performed better than those without, and SGL-LMM was the best model (Figure 3). For most sets of parameters, SGL-LMM outperformed Lasso-LMM in AUC by about 10%.

FIGURE 3

Figure 3. The Precision Recall curve by varying the σ_pop ∈ {0.5, 0.7, 0.9} and set σ_sig = 0.2. Each parameter combination had five datasets. The legend in each subplot shows the area under the curve (AUC) for each method. L represents Lasso only, LL means Lasso-LMM, S is SGL only, and SL means SGL-LMM. (A) Shows the PR curve under σ_pop = 0.5, σ_sig = 0.2, (B) for σ_pop = 0.7,σ_sig = 0.2, and (C) for σ_pop = 0.9,σ_sig = 0.2.

Next, we explored the impact of various simulated setting. As mentioned in section 3.2, the area under the Precision-Recall curve is a summary performance measurement to assess different methods. The AUC under the PR curve is shown as a function of an increasing ratio between true genetic marker signals compared with confounding and noise (Figure 4). The performance of all methods improved when σ_sig became larger, and the AUC = 1 at σ_sig = 0.5 for all methods. Among them, SGL-LMM was the best. We also notice that when σ_sig = 0.1, only SGL was more accurate than Lasso-LMM in the majority of datasets. SGL and Lasso-LMM performed similarly (Figure 3). One possible explanation is that when the variation explained by causal SNPs was relatively small, noise dominated the results. Under this scenario, eliminating false positives caused by population structure did not improve the performance of the models significantly. However, imposing group structure seems to be useful in generating accurate results.

FIGURE 4

Figure 4. The boxplot of a sample of five points for each method with a specific when varying the σ_pop. Each method has a different color frame, and each that is filled with a different color is shown in the legend. (A) for σ_pop = 0.5, (B) for σ_pop = 0.7, and (C) for σ_pop = 0.9.

The AUC under the PR curve is shown as a function of an increasing ratio of population structure and independent random noise with a specific σ_sig and, as expected, strong confounding was harmful to performance, because the AUC of all methods decreased when the confounding ratio increased. Again, SGL-LMM was superior to its counterparts. However, when σ_sig = 0.3, the performance of methods with the population correction exhibited an upper trend when σ_pop varied from 0.5 to 0.7 (Figure 5C). The performance of δ_sig to be 0.1,0.2 and 0.4 can be found in Figures 5A,B,D. This effect indicated that with a medium signal to noise ratio, it was advantageous to include a genetic covariance matrix K that accounted for confounding that was caused by population structure. SGL-LMM performed better than alternative methods for the entire range of considered settings. The benefits of population correction and inclusion of group structure in SGL-LMM were most pronounced in the scenario with strong confounding.

FIGURE 5

Figure 5. The boxplot of a sample of five points for each method with a specific σ_sig when varying the σ_pop. Each method with a different color frame and each σ_sig filled with a different color are shown in the legend. (A) for σ_sig = 0.1, (B) for σ_sig = 0.2, (C) for σ_sig = 0.3, and (D) σ_sig = 0.4.

3.4. Application With Arabidopsis thaliana Data

Having shown the accuracy of SGL-LMM in recovering the association SNPs in the simulation study, we can demonstrate that the SGL-LMM models association mapping in the A. thaliana dataset better than other models. For this experiment based on real data, we compared the performance of SGL-LMM and Lasso-LMM in predicting phenotype and in selecting predictive SNPs. For the ratio α between L1 and L2 penalty, we considered eight possible values {0.95, 0.85, 075, 0.65, 0.55, 0.45, 0.35, 0.25}; we picked the one that resulted in the largest correlation coefficient between the predicted and the real phenotype for subsequent stability selection. Because it is a verification experiment, we did not cover all genes in the experimental design. It may be the case that few, or even none, of the related genes in the selected phenotypes were covered in our genotype file. As a consequence, when setting the threshold for stability selection to be 50%, few SNPs are chosen by Lasso-LMM, and usually no more than 20 SNPs are chosen by SGL-LMM. Hence, we chose to rank the SNPs by their frequency of being chosen in both approaches and to investigate the first 100 SNPs. We summarized the genes to which these 100 SNPs belonged and the number of these genes in the candidate gene list (Table 1).

TABLE 1

Table 1. Summary of associations found in SGL-LMM and Lasso-LMM in real data application.

SGL-LMM had the following two advantages (Table 1):

3.4.1. SGL-LMM Had Higher Prediction Accuracy

For most of the 10 phenotypes, correlation coefficients between the predicted and the true phenotypes were higher using SGL-LMM than those obtained with Lasso-LMM by >10%; for FT10, the predictions by SGL-LMM had a correlation coefficient 100% higher than that obtained by Lasso-LMM. Therefore, incorporating prior knowledge of genetic structure significantly improved the accuracy of models of quantitative phenotypes.

3.4.2. SGL-LMM Selected Fewer Genes, and It Tended to Find More Genes That Were Known to be Functional

Compared with Lasso-LMM, associations that were located by SGL-LMM were more enriched to known candidate genes (Table 1). It linked more candidate genes in five phenotypes, and it linked the same number of candidate genes in the phenotypes SD and SDV. However, SGL-LMM linked many fewer genes compared with Lasso-LMM, which was consistent with our assumption that phenotypes should be related to a few SNPs in a few genes. Hence, adding group information into SGL-LMM made the results more interpretable and more meaningful biologically. The remaining three phenotypes that were related to leaf numbers seemed to be largely unrelated to the 19 candidate genes and to the randomly selected background genes and, therefore, both methods performed badly.

4. Discussion

Quantitative traits are important in medicine, agriculture, and evolution, but the association mapping studies of these traits are insufficient. In this paper, we have proposed a sparse group lasso, multi-marker mixed model (SGL-LMM) to identify genetic associations in quantitative traits with the presence of confounding influences, such as population structure. The approach benefits from the attractive properties of linear mixed models that allow for elegant correction of confounding effects and those of group-based, multi-marker models that not only consider the joint effects of sets of genetic markers rather than one single locus at a time, but that also incorporate biological group information as prior knowledge. As a consequence, SGL-LMM was able to better predict the phenotype and to identify true genetic associations, even in challenging scenarios with complex underlying genetic models, weak effects of individual markers, or presence of strong confounding effects.

SGL-LMM is useful for genome-wide association studies of complex quantitative phenotypes. In this paper, we have illustrated such practical use through a semi-empirical simulation study and retrospective analysis of A. thaliana. First, we found that imposing gene structure as group structure into the model improved both the prediction of phenotype from genotype and the selection of association SNPs, which suggested that incorporating prior biological knowledge into models led to a better fit to real genetic architectures. Second, the combination of a random effect model and a multivariate linear model is a way to reveal the true association of complex phenotypes, especially with a medium signal to noise ratio. It is widely accepted that parts of the unexplained portion of genetic variance can be due to a large number of loci that have a joint effect on the phenotype, but which lead to only a weak signal if considered independently. In addition, SGL-LMM yields much more biologically meaningful and interpretable associations, which suits the biological assumption that complex traits are only related to a few SNPs in a few genes. Our experiments on the flowering phenotype of A. thaliana showed that SGL-LMM linked many more candidate genes, but this was true only in a smaller gene set compared with the Lasso-LMM method.

The SGL-LMM included both GL-LMM (group lasso with linear mixed model) and Lasso-LMM as special cases by varying the ratio between the L1 and L2 norms. The sparsity within groups and group-wise sparsity influenced the performance of SGL-LMM. Small groups did not benefit from the within-group sparsity that led the method act as group lasso with LMM. In practical use, we recommend doing imputation first, which can ensure a moderate size for each group. The SGL-LMM can be made even more powerful by adding a strategy to deal with overlapping groups, which has been shown to be feasible by Jacob et al. (2009). Assessing the statistical significance of association results of SGL-LMM remains a challenge for future research. In summary, SGL-LMM is a useful addition to the current toolbox of computational models for unraveling associations of quantitative traits.

Author Contributions

YG, AK, MG, and XL conceived and designed the project. YG and CW derived the formula of the method. YG implemented the software, performed the experiment, analyzed data, and wrote the paper with CW and QZ. All authors read, edited, and approved the final version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 61571163, 61532014, and 61671189), the National Key Research and Development Plan of China (Grant No. 2016YFC0901902), and the National Institute of Health (grants R01HG006849 and R01GM108805 to AK).

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer YWC declared a past co-authorship with one of the authors QZ to the handling editor.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00271/full#supplementary-material

References

Arbet, J., McGue, M., Chatterjee, S., and Basu, S. (2017). Resampling-based tests for lasso in genome-wide association studies. BMC Genet. 18:70. doi: 10.1186/s12863-017-0533-3

PubMed Abstract | CrossRef Full Text | Google Scholar

Atwell, S., Huang, Y. S., Vilhjálmsson, B. J., Willems, G., Horton, M., Li, Y., et al. (2010). Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465:627. doi: 10.1038/nature08800

PubMed Abstract | CrossRef Full Text | Google Scholar

Dudbridge, F. (2016). Polygenic epidemiology. Genet. Epidemiol. 40, 268–272. doi: 10.1002/gepi.21966

PubMed Abstract | CrossRef Full Text | Google Scholar

Fisher, R. A. (1919). The correlation between relatives on the supposition of mendelian inheritance. Earth Environ. Sci. Trans. R. Soc. Edinburgh 52, 399–433. doi: 10.1017/S0080456800012163

CrossRef Full Text | Google Scholar

Hoffman, G. E. (2013). Correcting for population structure and kinship using the linear mixed model: theory and extensions. PLoS ONE 8:e75707. doi: 10.1371/journal.pone.0075707

CrossRef Full Text | Google Scholar

Hoffman, G. E., Mezey, J. G., and Schadt, E. E. (2014). lrgpr: interactive linear mixed model analysis of genome-wide association studies with composite hypothesis testing and regression diagnostics in r. Bioinformatics 30, 3134–3135. doi: 10.1093/bioinformatics/btu435

PubMed Abstract | CrossRef Full Text | Google Scholar

Jacob, L., Obozinski, G., and Vert, J.-P. (2009). “Group lasso with overlap and graph lasso,” in Proceedings of the 26th Annual International Conference on Machine Learning (Montreal: ACM), 433–440.

Google Scholar

Kang, H. M., Sul, J. H., Zaitlen, N. A., Kong, S.-Y., Freimer, N. B., Sabatti, C., et al. (2010). Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42:348. doi: 10.1038/ng.548

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, H., Su, G., Jiang, L., and Bao, Z. (2017). An efficient unified model for genome-wide association studies and genomic selection. Genet. Select. Evol. 49:64. doi: 10.1186/s12711-017-0338-x

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, J., Das, K., Fu, G., Li, R., and Wu, R. (2011). The bayesian lasso for genome-wide association studies. Bioinformatics 27, 516–523. doi: 10.1093/bioinformatics/btq688

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, J., Wang, Z., Li, R., and Wu, R. (2015a). Bayesian group lasso for nonparametric varying-coefficient models with application to functional genome-wide association studies. Ann. Appl. Stat. 9:640. doi: 10.1214/15-AOAS808

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, P., Guo, M., Wang, C., Liu, X., and Zou, Q. (2015b). An overview of snp interactions in genome-wide association studies. Brief. Funct. Genom. 14, 143–155. doi: 10.1093/bfgp/elu036

PubMed Abstract | CrossRef Full Text | Google Scholar

Lim, M., and Hastie, T. (2015). Learning interactions via hierarchical group-lasso regularization. J. Comput. Graph. Stat. 24, 627–654. doi: 10.1080/10618600.2014.938812

PubMed Abstract | CrossRef Full Text | Google Scholar

Lippert, C., Listgarten, J., Liu, Y., Kadie, C. M., Davidson, R. I., and Heckerman, D. (2011). Fast linear mixed models for genome-wide association studies. Nat. Methods 8:833. doi: 10.1038/nmeth.1681

PubMed Abstract | CrossRef Full Text | Google Scholar

Ma, L., Clark, A. G., and Keinan, A. (2013). Gene-based testing of interactions in association studies of quantitative traits. PLoS Genet. 9:e1003321. doi: 10.1371/journal.pgen.1003321

PubMed Abstract | CrossRef Full Text | Google Scholar

MacArthur, J., Bowler, E., Cerezo, M., Gil, L., Hall, P., Hastings, E., et al. (2017). The new nhgri-ebi catalog of published genome-wide association studies (gwas catalog). Nucl. Acids Res. 45, D896–D901. doi: 10.1093/nar/gkw1133

PubMed Abstract | CrossRef Full Text | Google Scholar

Meinshausen, N., and Bühlmann, P. (2010). Stability selection. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 72, 417–473. doi: 10.1111/j.1467-9868.2010.00740.x

CrossRef Full Text | Google Scholar

Papachristou, C., Ober, C., and Abney, M. (2016). “A lasso penalized regression approach for genome-wide association analyses using related individuals: application to the genetic analysis workshop 19 simulated data,” in BMC Proceedings, Vol. 10 (Vienna: BioMed Central), 53.

Google Scholar

Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., and Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38:904. doi: 10.1038/ng1847

PubMed Abstract | CrossRef Full Text | Google Scholar

Rakitsch, B., Lippert, C., Stegle, O., and Borgwardt, K. (2012). A lasso multi-marker mixed model for association mapping with population structure correction. Bioinformatics 29, 206–214. doi: 10.1093/bioinformatics/bts669

PubMed Abstract | CrossRef Full Text | Google Scholar

Rao, N., Cox, C., Nowak, R., and Rogers, T. T. (2013). “Sparse overlapping sets lasso for multitask learning and its application to fMRI analysis,” in Advances in Neural Information Processing Systems (Harrah's Lake Tahoe), 2202–2210.

Google Scholar

Rao, N. S., Nowak, R. D., Cox, C. R., and Rogers, T. T. (2016). Classification with the sparse group lasso. IEEE Trans. Signal Process. 64, 448–463. doi: 10.1109/TSP.2015.2488586

CrossRef Full Text | Google Scholar

Segura, V., Vilhjálmsson, B. J., Platt, A., Korte, A., Seren, Ü., Long, Q., et al. (2012). An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nat. Genet. 44:825. doi: 10.1038/ng.2314

PubMed Abstract | CrossRef Full Text | Google Scholar

Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2013). A sparse-group lasso. J. Comput. Graph. Stat. 22, 231–245. doi: 10.1080/10618600.2012.681250

CrossRef Full Text | Google Scholar

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 58, 267–288. doi: 10.1111/j.2517-6161.1996.tb02080.x

CrossRef Full Text | Google Scholar

Yang, J., Ferreira, T., Morris, A. P., Medland, S. E., Madden, P. A., Heath, A. C., et al. (2012). Conditional and joint multiple-snp analysis of gwas summary statistics identifies additional variants influencing complex traits. Nat. Genet. 44:369. doi: 10.1038/ng.2213

PubMed Abstract | CrossRef Full Text | Google Scholar

Yuan, M., and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 68, 49–67. doi: 10.1111/j.1467-9868.2005.00532.x

CrossRef Full Text | Google Scholar

Zhang, Z., Ersoz, E., Lai, C.-Q., Todhunter, R. J., Tiwari, H. K., Gore, M. A., et al. (2010). Mixed linear model approach adapted for genome-wide association studies. Nat. Genet. 42:355. doi: 10.1038/ng.546

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhou, X., and Stephens, M. (2012). Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44:821. doi: 10.1038/ng.2310

PubMed Abstract | CrossRef Full Text | Google Scholar

Keywords: genome-wide association studies, single nucleotide polymorphisms, quantitative traits, linear mixed model, sparse group lasso

Citation: Guo Y, Wu C, Guo M, Zou Q, Liu X and Keinan A (2019) Combining Sparse Group Lasso and Linear Mixed Model Improves Power to Detect Genetic Variants Underlying Quantitative Traits. Front. Genet. 10:271. doi: 10.3389/fgene.2019.00271

Received: 17 December 2018; Accepted: 12 March 2019;
Published: 10 April 2019.

Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

Reviewed by:

Yan Huang, Harvard Medical School, United States
Yen-Wei Chu, National Chung Hsing University, Taiwan

Copyright © 2019 Guo, Wu, Guo, Zou, Liu and Keinan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Maozu Guo, guomaozu@bucea.edu.cn
Alon Keinan, alon.keinan@cornell.edu

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.