Abstract
High-dimensional linear regression problems are often fitted using Lasso approaches. Although the Lasso objective function is convex, it is not differentiable everywhere, so minimization via gradient descent methods is not straightforward. To avoid this technical issue, we apply Nesterov smoothing to the original (unsmoothed) Lasso objective function. We introduce a closed-form smoothed Lasso which preserves the convexity of the Lasso function, is uniformly close to the unsmoothed Lasso, and yields closed-form derivatives everywhere for efficient and fast minimization via gradient descent. Our simulation studies are focused on polygenic risk scores using genetic data from a genome-wide association study (GWAS) for chronic obstructive pulmonary disease (COPD). We compare accuracy and runtime of our approach to the current gold standard in the literature, the FISTA algorithm. Our results suggest that the proposed methodology provides estimates with equal or higher accuracy than the FISTA algorithm while having the same asymptotic runtime scaling. The proposed methodology is implemented in the R-package smoothedLasso, available on the Comprehensive R Archive Network (CRAN).
References
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sci. 2(1), 183–202 (2009)
Chi, E., Goldstein, T., Studer, C., Baraniuk, R.: fasta: fast adaptive shrinkage/thresholding algorithm. R-package version 1 (2018)
Daubechies, I., Defrise, M., Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 57(11), 1413–1457 (2004)
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32(2), 407–499 (2004)
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1–22 (2010)
Hahn, G., Banerjee, M., Sen, B.: Parameter estimation and inference in a continuous piecewise linear regression model (2017). http://www.cantab.net/users/ghahn/preprints/PhaseRegMultiDim.pdf. Accessed 21 Mar 2017
Hahn, G., Lutz, S.M., Laha, N., Lange, C.: smoothedLasso: smoothed LASSO regression via Nesterov smoothing. R-package version 1.3 (2020). https://cran.r-project.org/package=smoothedLasso. Accessed 21 Mar 2017
Hastie, T., Efron, B.: lars: least angle regression, lasso and forward stagewise. R-package version 1.2 (2013)
Khera, A.V., Chaffin, M., Aragam, K.G., Haas, M.E., Roselli, C., Choi, S.H., Natarajan, P., Lander, E.S., Lubitz, S.A., Ellinor, P.T., Kathiresan, S.: Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224 (2018)
Mak, T., Porsch, R., Choi, S., Zhou, X., Sham, P.: Polygenic scores via penalized regression on summary statistics. Genet. Epidemiol. 41(6), 469–480 (2016)
Michelot, C.: A finite algorithm for finding the projection of a point onto the canonical simplex of \(\mathbb{R}^n\). J. Optim. Theory App. 50(1), 195–200 (1986)
Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Dokl. Akad. Nauk SSSR 269(3), 543–547 (1983)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. Ser. A 103, 127–152 (2005)
NHLBI TOPMed: Boston Early-Onset COPD Study in the National Heart, Lung, and Blood Institute (NHLBI) Trans-Omics for Precision Medicine (TOPMed) Program (2018). https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000946.v3.p1. Accessed 18 Oct 2016
R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Stat Comp, Vienna, Austria (2014). http://www.R-project.org/. Accessed 2 Sept 2019
Regan, E., Hokanson, J., Murphy, J., Make, B., Lynch, D., Beaty, T., Curran-Everett, D., Silverman, E., Crapo, J.: Genetic epidemiology of COPD (COPDGene) study design. COPD 7, 32–43 (2010)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B Methodol. 58(1), 267–288 (1996)
Tibshirani, R.: Model selection and validation 1: cross-validation (2013). https://www.stat.cmu.edu/~ryantibs/datamining/lectures/18-val1.pdf. Accessed 2 Sept 2019
Wu, T., Chen, Y., Hastie, T., Sobel, E., Lange, K.: Genome-wide association analysis by Lasso penalized logistic regression. Bioinformatics 25(6), 714–721 (2009)
Appendices
Selection of the Lasso regularization parameter via cross validation
We aim to select the Lasso regularization parameter \(\uplambda \) using cross validation. To this end, for the simulation scenario described in Sect. 3.1 (in particular, for the chosen noise level of \(\sigma =0.5\) and the sparsity level of \(20\%\), as well as \(n=1000\) and \(p=100\)), we perform 10-fold cross validation as described in Tibshirani (2013).
To be precise, we first fix a grid of admissible values of \(\uplambda \) from which we would like to choose the regularization parameter (here, \(\uplambda \in \{0,0.05,0.1,0.15,\ldots ,1\}\)). We then randomly divide the n data points into \(K=10\) disjoint sets (folds) \(I_1,\ldots ,I_K\) such that \(\bigcup _{j=1}^K I_j = \{1,\ldots ,n\}\). For each \(j \in \{1,\ldots ,K\}\), we withhold the indices in \(I_j\) and fit a linear model \(y_{-I_j}=X_{-I_j,\cdot } \beta \) using the FISTA algorithm. After obtaining an estimate \(\hat{\beta }\), we use the withheld rows of X indexed by \(I_j\) to predict the withheld entries of y, that is, we compute \(X_{I_j,\cdot } \hat{\beta }\). We evaluate the accuracy of the prediction with the \(L_2\) norm, that is, we compute \(\Vert X_{I_j,\cdot } \hat{\beta } - y_{I_j} \Vert _2\). Repeating this computation for all \(j \in \{1,\ldots ,K\}\) allows us to compute an average \(L_2\) error over the K folds (the cross-validation error), which we plot as a function of \(\uplambda \).
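As an illustration, the cross-validation procedure just described can be sketched in a few lines of code. The snippet below is a minimal stand-in, not the implementation used here: fista_lasso is a textbook FISTA solver for the Lasso objective, and cv_lambda performs the K-fold split and error averaging over a candidate grid.

```python
import numpy as np

def fista_lasso(X, y, lam, n_iter=500):
    """Textbook FISTA for the Lasso objective 0.5*||X b - y||_2^2 + lam*||b||_1."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the smooth part
    beta, z, t = np.zeros(p), np.zeros(p), 1.0
    for _ in range(n_iter):
        w = z - X.T @ (X @ z - y) / L      # gradient step at the momentum point
        beta_new = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # soft threshold
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = beta_new + ((t - 1.0) / t_new) * (beta_new - beta)
        beta, t = beta_new, t_new
    return beta

def cv_lambda(X, y, lambdas, K=10, seed=0):
    """Select lambda by K-fold cross validation; returns (best lambda, CV errors)."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    errors = []
    for lam in lambdas:
        err = 0.0
        for idx in folds:
            train = np.setdiff1d(np.arange(n), idx)     # withhold fold idx
            beta_hat = fista_lasso(X[train], y[train], lam)
            err += np.linalg.norm(X[idx] @ beta_hat - y[idx])  # held-out L2 error
        errors.append(err / K)
    return lambdas[int(np.argmin(errors))], errors
```

On the grid above one would call, e.g., `cv_lambda(X, y, np.arange(0, 1.05, 0.05))` and plot the returned errors against \(\uplambda \).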
The result is shown in Fig. 5. We observe that for the simulation scenario we consider in Sect. 3.1, the choice \(\uplambda =0.3\) is sensible.
Sensitivity analysis
In the linear regression model \(y=X\beta +\epsilon \) under consideration in this work (see Sect. 1), it is easy to see that the larger the noise/error \(\epsilon \), the harder it will be to obtain accurate estimates of \(\beta \).
To quantify this statement, Fig. 6 presents a sensitivity analysis of the recovery accuracy of the parameter estimate \(\beta \) (measured as the \(L_2\) norm between the truth and the fitted parameter estimate returned by the unsmoothed Lasso, the FISTA algorithm, and the smoothed Lasso, respectively) as a function of the standard deviation \(\sigma \). The setup of the simulation is identical to the one of Sect. 3.1, though now \(n=100\) and \(p=200\) are fixed. The entries of the noise vector \(\epsilon \in \mathbb {R}^n\) in the model \(y = X\beta + \epsilon \) are generated independently from a normal distribution with mean zero and a varying standard deviation \(\sigma \in [0,10]\).
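As a sketch of this setup, the following function generates one data set from the model \(y = X\beta + \epsilon \) at a given noise level \(\sigma \) and returns the \(L_2\) recovery error. It is an illustration only: a plain ISTA (proximal gradient) solver serves as a stand-in for the methods compared in the paper, and the defaults for the design and regularization parameter are the ones used in this appendix.

```python
import numpy as np

def recovery_error(sigma, n=100, p=200, sparsity=0.2, lam=0.3, n_iter=500, seed=0):
    """L2 distance between a true sparse beta and its Lasso estimate,
    for noise standard deviation sigma (plain ISTA solver as a stand-in)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    k = int(sparsity * p)                   # number of nonzero coefficients
    beta[:k] = rng.standard_normal(k)
    y = X @ beta + sigma * rng.standard_normal(n)
    # proximal gradient (ISTA) on 0.5*||X b - y||_2^2 + lam*||b||_1
    L = np.linalg.norm(X, 2) ** 2           # Lipschitz constant of the smooth part
    b = np.zeros(p)
    for _ in range(n_iter):
        w = b - X.T @ (X @ b - y) / L
        b = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # soft threshold
    return float(np.linalg.norm(b - beta))
```

Evaluating this function over a grid of \(\sigma \in [0,10]\) reproduces the qualitative behavior of Fig. 6 (left): the recovery error grows with the noise level.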
Figure 6 (left) shows that, as expected, the accuracy of the recovered estimate of \(\beta \) decreases for all methods as \(\sigma \) increases. However, this deterioration is rather slow. The runtime as a function of \(\sigma \), depicted in Fig. 6 (right), stays roughly constant for all methods, as expected.
Proof of Proposition 1
Proof
The bounds on \(L_e^\mu \) and \(L_s^\mu \) follow from Eqs. (8) and (10) after a direct calculation. In particular, for the entropy prox function,
\[
\left| L_e^\mu (\beta ) - L(\beta ) \right| = \uplambda \left| \sum _{i=1}^p \left( f_e^\mu (\beta _i) - f(\beta _i) \right) \right| \le \uplambda \sum _{i=1}^p \left| f_e^\mu (\beta _i) - f(\beta _i) \right| \le \uplambda p \mu \log 2,
\]
where f is as defined in Sect. 2.2 and where it was used that \(\uplambda \ge 0\). The result for the squared error prox smoothed \(L_s^\mu \) is proven analogously.
Since both \(f_e^\mu \) and \(f_s^\mu \) are convex according to Nesterov (2005, Theorem 1), and since the least squares term \(\frac{1}{2} \Vert X\beta - y \Vert _2^2\) is convex, it follows that both \(L_e^\mu \) and \(L_s^\mu \) remain convex as the sum of two convex functions.
To be precise, strict convexity holds true. Observe that the second derivative of the entropy smoothed absolute value of Sect. 2.2.1 is given by
\[
\frac{\mathrm {d}^2}{\mathrm {d} z^2} f_e^\mu (z) = \frac{4}{\mu \left( e^{z/\mu } + e^{-z/\mu } \right) ^2},
\]
which is always positive, thus making \(f_e^\mu \) strictly convex. Therefore, \(L_e^\mu \) is strictly convex as the sum of a convex function and a strictly convex function. Similar arguments show that \(L_s^\mu \) is strictly convex. \(\square \)
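The uniform closeness underlying this proof can also be checked numerically. The snippet below assumes the standard entropy-prox smoothing of the absolute value, \(f_e^\mu (z) = \mu \log \big ( (e^{z/\mu } + e^{-z/\mu })/2 \big )\), and verifies on a grid that its gap to \(|z|\) never exceeds \(\mu \log 2\):

```python
import numpy as np

mu = 0.1                                   # smoothing parameter
z = np.linspace(-5.0, 5.0, 100001)

# entropy-prox smoothed absolute value, computed via logaddexp for stability
f_e = mu * (np.logaddexp(z / mu, -z / mu) - np.log(2.0))

gap = np.abs(np.abs(z) - f_e)
assert gap.max() <= mu * np.log(2.0) + 1e-12   # uniform bound, up to rounding
```

The gap vanishes at \(z = 0\) and approaches \(\mu \log 2\) for large \(|z|\), matching the bound used in the proof.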
Hahn, G., Lutz, S.M., Laha, N. et al. A fast and efficient smoothing approach to Lasso regression and an application in statistical genetics: polygenic risk scores for chronic obstructive pulmonary disease (COPD). Stat Comput 31, 35 (2021). https://doi.org/10.1007/s11222-021-10010-0