Lasso, fractional norm and structured sparse estimation using a Hadamard product parametrization

https://doi.org/10.1016/j.csda.2017.06.007

Abstract

Using a multiplicative reparametrization, it is shown that a subclass of $L_q$ penalties with $q \le 1$ can be expressed as sums of $L_2$ penalties. It follows that the lasso and other norm-penalized regression estimates may be obtained using a very simple and intuitive alternating ridge regression algorithm. As compared to a similarly intuitive EM algorithm for $L_q$ optimization, the proposed algorithm avoids some numerical instability issues and is also competitive in terms of speed. Furthermore, the proposed algorithm can be extended to accommodate sparse high-dimensional scenarios, generalized linear models, and can be used to create structured sparsity via penalties derived from covariance models for the parameters. Such model-based penalties may be useful for sparse estimation of spatially or temporally structured parameters.

Introduction

Consider estimation for the normal linear regression model $y \sim N_n(X\beta, \sigma^2 I)$, where $X \in \mathbb{R}^{n \times p}$ is a matrix of predictor variables and $\beta \in \mathbb{R}^p$ is a vector of regression coefficients to be estimated. A least squares estimate is a minimizer of the residual sum of squares $\|y - X\beta\|^2$. A popular alternative estimate is the lasso estimate (Tibshirani, 1996), which minimizes $\|y - X\beta\|^2 + \lambda\|\beta\|_1$, a penalized residual sum of squares that balances fit to the data against the possibility that some or many of the elements of $\beta$ are small or zero. Indeed, minimizers of this penalized sum of squares may have elements that are exactly zero.

There exist a large variety of optimization algorithms for finding lasso estimates (see Schmidt et al. (2007) for a review). However, the details of many of these algorithms are somewhat opaque to data analysts who are not well-versed in the theory of optimization. One exception is the local quadratic approximation (LQA) algorithm of Fan and Li (2001), which proceeds by iteratively computing a series of ridge regressions. Fan and Li (2001) also suggested using LQA for non-convex $L_q$ penalization when $q < 1$, and this technique was used by Kabán and Durrant (2008) and Kabán (2013) in their studies of non-convex $L_q$-penalized logistic regression. However, LQA can be numerically unstable for some combinations of models and penalties. To remedy this, Hunter and Li (2005) suggested optimizing a surrogate "perturbed" objective function. This perturbation must be user-specified, and its value can affect the parameter estimate. As an alternative to using local quadratic approximations, Zou and Li (2008) suggest $L_q$-penalized optimization using local linear approximations (LLA). While this approach avoids the instability of LQA, the algorithm is implemented by iteratively solving a series of $L_1$ penalization problems, for which an optimization algorithm must be chosen as well.

This article develops a simple alternative technique for obtaining $L_q$-penalized regression estimates for many values of $q \le 1$. The technique is based on a non-identifiable Hadamard product parametrization (HPP) of $\beta$ as $\beta = u \circ v$, where "$\circ$" denotes the Hadamard (element-wise) product of the vectors $u$ and $v$. As shown in Section 2, if $\hat{u}$ and $\hat{v}$ are optimal $L_2$-penalized values of $u$ and $v$, then $\hat{\beta} = \hat{u} \circ \hat{v}$ is an optimal $L_1$-penalized value of $\beta$. An alternating ridge regression algorithm for obtaining $\hat{u} \circ \hat{v}$ is easy to understand and implement, and is competitive with LQA in terms of speed. Furthermore, a modified version of the HPP can be adapted to provide fast convergence in sparse, high-dimensional scenarios. In Section 3 we consider extensions of this algorithm to non-convex $L_q$-penalized regression with $q < 1$. As in the $L_1$ case, $L_q$-penalized linear regression estimates may be found using alternating ridge regression, whereas estimates in generalized linear models can be obtained with a modified version of an iteratively reweighted least squares algorithm. In Section 4 we show how the HPP can facilitate structured sparsity in parameter estimates: the $L_2$ penalty on the vectors $u$ and $v$ can be interpreted as independent Gaussian prior distributions on the elements of $u$ and $v$. If instead we choose a penalty that mimics a dependent Gaussian prior, then we can achieve structured sparsity among the elements of $\hat{\beta} = \hat{u} \circ \hat{v}$. This technique is illustrated with an analysis of brain imaging data, for which a spatially structured HPP penalty is able to identify spatially contiguous regions of differential brain activity. A discussion follows in Section 5.
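The scalar identity underlying this equivalence is that the smallest value of $(u^2 + v^2)/2$ over all factorizations $uv = \beta$ is exactly $|\beta|$, attained at $|u| = |v| = \sqrt{|\beta|}$. A quick numerical check in Python/NumPy (illustrative code, not from the paper):

    import numpy as np

    # For a fixed coefficient beta, minimize (u**2 + v**2)/2 subject to u*v = beta.
    # By the arithmetic-geometric mean inequality the minimum is |beta|,
    # attained at |u| = |v| = sqrt(|beta|).
    beta = -1.7
    u = np.linspace(1e-3, 5.0, 200001)   # search over positive u; then v = beta/u
    vals = (u**2 + (beta / u)**2) / 2.0
    print(vals.min(), abs(beta))         # both are approximately 1.7

Applied coordinate-wise, this is why quadratic penalties on $u$ and $v$ translate into an $L_1$ penalty on $\beta = u \circ v$.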

Section snippets

The Hadamard product parametrization

The lasso or $L_1$-penalized regression estimate $\hat{\beta}$ of $\beta$ for the model $y \sim N_n(X\beta, \sigma^2 I)$ is the minimizer of $\|y - X\beta\|^2 + \lambda\|\beta\|_1$, or equivalently of the objective function
$$f(\beta) = \beta^\top Q \beta - 2\beta^\top l + \lambda\|\beta\|_1,$$
where $Q = X^\top X$ and $l = X^\top y$. Now reparametrize the model so that $\beta = u \circ v$, where "$\circ$" is the Hadamard (element-wise) product. We refer to this parametrization as the Hadamard product parametrization (HPP). Estimation of $u$ and $v$ using $L_2$ penalties corresponds to the following objective function:
$$g(u, v) = (u \circ v)^\top Q (u \circ v) - 2(u \circ v)^\top l + \lambda(u^\top u + v^\top v)/2.$$ …
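The resulting alternating scheme is easy to state: with $v$ held fixed, $g$ is an ordinary ridge objective in $u$ (with design matrix $X\,\mathrm{diag}(v)$), and vice versa. A minimal Python/NumPy sketch; the penalty constant $\lambda/2$ follows the objective above, and the function name and defaults are illustrative rather than taken from the paper:

    import numpy as np

    def hpp_lasso(X, y, lam, n_iter=1000, tol=1e-10):
        # Alternating ridge regressions for g(u, v); returns beta_hat = u * v.
        p = X.shape[1]
        Q, l = X.T @ X, X.T @ y
        u, v = np.ones(p), np.ones(p)
        for _ in range(n_iter):
            beta_old = u * v
            # With v fixed, g is a ridge objective in u:
            #   u' [diag(v) Q diag(v)] u - 2 u'(v*l) + (lam/2) u'u
            u = np.linalg.solve(Q * np.outer(v, v) + 0.5 * lam * np.eye(p), v * l)
            # Same form in v with u fixed.
            v = np.linalg.solve(Q * np.outer(u, u) + 0.5 * lam * np.eye(p), u * l)
            if np.max(np.abs(u * v - beta_old)) < tol:
                break
        return u * v

Each update is an ordinary ridge regression on a rescaled design matrix, which is what makes the algorithm straightforward to implement with standard linear-algebra routines.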

HPP for $L_q$-penalized linear regression

A natural generalization of the HPP is to write $\beta = u_1 \circ \cdots \circ u_K$ and optimize
$$g(u_1, \ldots, u_K) = (u_1 \circ \cdots \circ u_K)^\top Q (u_1 \circ \cdots \circ u_K) - 2(u_1 \circ \cdots \circ u_K)^\top l + \frac{\lambda}{K}\left(u_1^\top u_1 + \cdots + u_K^\top u_K\right).$$
For $K = 1$ the optimal $u$-value is the $L_2$-penalized ridge regression estimate, and for $K = 2$ the optimal value of $(u_1, u_2)$ gives the $L_1$-penalized lasso regression estimate, as discussed in the previous section. Values of $K$ greater than 2 correspond to non-convex $L_q$ penalties with $q = 2/K$. For example, the $L_{1/2}$-penalized estimate is obtained by optimizing (6) with $K = 4$.
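Under the same assumptions as the two-factor sketch above, the corresponding algorithm cycles through $u_1, \ldots, u_K$, updating each by a ridge regression with the other factors held fixed (names illustrative):

    import numpy as np

    def hpp_lq(X, y, lam, K=4, n_iter=5000, tol=1e-10):
        # Cycle of ridge updates for the K-factor HPP; K = 2 gives the lasso,
        # K = 4 targets the non-convex L_{1/2} penalty, and q = 2/K in general.
        p = X.shape[1]
        Q, l = X.T @ X, X.T @ y
        U = np.ones((K, p))
        for _ in range(n_iter):
            beta_old = np.prod(U, axis=0)
            for k in range(K):
                w = np.prod(np.delete(U, k, axis=0), axis=0)   # product of the other K-1 factors
                U[k] = np.linalg.solve(Q * np.outer(w, w) + (lam / K) * np.eye(p), w * l)
            if np.max(np.abs(np.prod(U, axis=0) - beta_old)) < tol:
                break
        return np.prod(U, axis=0)

For $K > 2$ the objective is non-convex, so the result can depend on the starting values; running the updates from several starting points is a reasonable precaution.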

Structured penalization with the HPP

It is well known that the lasso objective function $f(\beta) = \beta^\top Q \beta - 2\beta^\top l + \lambda\|\beta\|_1$ is equal to the scaled log posterior density of $\beta$ under a Laplace prior distribution on the elements of $\beta$ (Tibshirani, 1996; Figueiredo, 2003; Park and Casella, 2008). Specifically, for the linear regression model $y \sim N(X\beta, \sigma^2 I)$ and prior distribution $\beta_1, \ldots, \beta_p$ i.i.d. Laplace$(\lambda/[2\sigma^2])$, the posterior density of $\beta$ is given by
$$p(\beta \mid y, X, \sigma^2) \propto \exp\!\left(-\frac{\|y - X\beta\|^2}{2\sigma^2}\right)\exp\!\left(-\frac{\lambda\|\beta\|_1}{2\sigma^2}\right) \propto \exp\!\left(-\frac{1}{2\sigma^2}\left[\beta^\top Q \beta - 2\beta^\top l + \lambda\|\beta\|_1\right]\right) = \exp\!\left(-\frac{1}{2\sigma^2}f(\beta)\right),$$
where $Q = X^\top X$ and $l = X^\top y$ as above. …
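One way to realize the dependent-prior idea, sketched here under stated assumptions, is to replace the identity matrix in the ridge updates with the precision matrix of a structured Gaussian prior, so that the penalty becomes $\lambda(u^\top \Psi^{-1} u + v^\top \Psi^{-1} v)/2$ rather than $\lambda(u^\top u + v^\top v)/2$. The AR(1)-type precision below is purely illustrative and is not the spatial covariance model used for the brain imaging analysis in the paper:

    import numpy as np

    def hpp_structured(X, y, lam, Prec, n_iter=1000, tol=1e-10):
        # As in the lasso sketch above, but penalizing (lam/2)(u' Prec u + v' Prec v),
        # where Prec is the precision matrix of a dependent Gaussian prior.
        p = X.shape[1]
        Q, l = X.T @ X, X.T @ y
        u, v = np.ones(p), np.ones(p)
        for _ in range(n_iter):
            beta_old = u * v
            u = np.linalg.solve(Q * np.outer(v, v) + 0.5 * lam * Prec, v * l)
            v = np.linalg.solve(Q * np.outer(u, u) + 0.5 * lam * Prec, u * l)
            if np.max(np.abs(u * v - beta_old)) < tol:
                break
        return u * v

    # Illustrative precision for serially ordered coefficients (AR(1)-type neighbor
    # structure); Prec = np.eye(p) recovers the independent penalty of Section 2.
    p, rho = 50, 0.9
    Prec = ((1 + rho**2) * np.eye(p) - rho * (np.eye(p, k=1) + np.eye(p, k=-1))) / (1 - rho**2)

With such a precision matrix, coefficients at neighboring locations are encouraged to be active (or zero) together, which is the sense in which the penalty induces structured sparsity.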

Discussion

The Hadamard product parametrization provides a simple and intuitive method for obtaining $L_q$-penalized regression estimates for certain values of $q$. In terms of accessibility to practitioners, the HPP algorithm is similar to the LQA algorithm: both proceed by iterative ridge regression. Unlike the "ridge" term of the LQA algorithm, that of the HPP algorithm is bounded near zero, suggesting that the HPP is to be preferred over LQA for reasons of numerical stability. However, for the …

Acknowledgments

I thank Panos Toulis for helpful comments. This research was supported by NSF grant DMS-1505136.

References

  • Deutsch, G.K., et al., 2005. Correlations between white matter microstructure and reading performance in children. Cortex.
  • Styan, G.P.H., 1973. Hadamard products and multivariate statistical analysis. Linear Algebra Appl.
  • Efron, B., 2010. Large-Scale Inference. Cambridge University Press.
  • Fan, J., et al., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc.
  • Figueiredo, M.A.T., 2003. Adaptive sparseness for supervised learning. IEEE Trans. Pattern Anal. Mach. Intell.
  • Friedman, J., et al., 2010. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw.
  • Fu, W.J., 1998. Penalized regressions: the bridge versus the lasso. J. Comput. Graph. Statist.
  • Goeman, J.J., Meijer, R.J., Chaturvedi, N., 2016. Penalized: L1 (Lasso and Fused Lasso) and L2 (Ridge) Penalized...
  • Griffin, J.E., et al., 2010. Inference with normal-gamma prior distributions in regression problems. Bayesian Anal.
  • Hunter, D.R., et al., 2005. Variable selection using MM algorithms. Ann. Statist.
  • Kabán, A., 2013. Fractional norm regularization: learning with very few relevant features. IEEE Trans. Neural Netw. Learn. Syst.