Lasso, fractional norm and structured sparse estimation using a Hadamard product parametrization

https://doi.org/10.1016/j.csda.2017.06.007

Abstract

Using a multiplicative reparametrization, it is shown that a subclass of $L_q$ penalties with $q \le 1$ can be expressed as sums of $L_2$ penalties. It follows that the lasso and other norm-penalized regression estimates may be obtained using a very simple and intuitive alternating ridge regression algorithm. As compared to a similarly intuitive EM algorithm for $L_q$ optimization, the proposed algorithm avoids some numerical instability issues and is also competitive in terms of speed. Furthermore, the proposed algorithm can be extended to accommodate sparse high-dimensional scenarios, generalized linear models, and can be used to create structured sparsity via penalties derived from covariance models for the parameters. Such model-based penalties may be useful for sparse estimation of spatially or temporally structured parameters.

Introduction

Consider estimation for the normal linear regression model $y \sim N_n(X\beta, \sigma^2 I)$, where $X \in \mathbb{R}^{n \times p}$ is a matrix of predictor variables and $\beta \in \mathbb{R}^p$ is a vector of regression coefficients to be estimated. A least squares estimate is a minimizer of the residual sum of squares $\|y - X\beta\|^2$. A popular alternative estimate is the lasso estimate (Tibshirani, 1996), which minimizes $\|y - X\beta\|^2 + \lambda\|\beta\|_1$, a penalized residual sum of squares that balances fit to the data against the possibility that some or many of the elements of $\beta$ are small or zero. Indeed, minimizers of this penalized sum of squares may have elements that are exactly zero.

There exist a large variety of optimization algorithms for finding lasso estimates (see Schmidt et al. (2007) for a review). However, the details of many of these algorithms are somewhat opaque to data analysts who are not well-versed in the theory of optimization. One exception is the local quadratic approximation (LQA) algorithm of Fan and Li (2001), which proceeds by iteratively computing a series of ridge regressions. Fan and Li (2001) also suggested using LQA for non-convex $L_q$ penalization when $q < 1$, and this technique was used by Kabán and Durrant (2008) and Kabán (2013) in their studies of non-convex $L_q$-penalized logistic regression. However, LQA can be numerically unstable for some combinations of models and penalties. To remedy this, Hunter and Li (2005) suggested optimizing a surrogate "perturbed" objective function. This perturbation must be user-specified, and its value can affect the parameter estimate. As an alternative to using local quadratic approximations, Zou and Li (2008) suggest $L_q$-penalized optimization using local linear approximations (LLA). While this approach avoids the instability of LQA, the algorithm is implemented by iteratively solving a series of $L_1$ penalization problems, for which an optimization algorithm must be chosen as well.

This article develops a simple alternative technique for obtaining $L_q$-penalized regression estimates for many values of $q \le 1$. The technique is based on a non-identifiable Hadamard product parametrization (HPP) of $\beta$ as $\beta = u \circ v$, where "$\circ$" denotes the Hadamard (element-wise) product of the vectors $u$ and $v$. As shown in Section 2, if $\hat{u}$ and $\hat{v}$ are optimal $L_2$-penalized values of $u$ and $v$, then $\hat{\beta} = \hat{u} \circ \hat{v}$ is an optimal $L_1$-penalized value of $\beta$. An alternating ridge regression algorithm for obtaining $\hat{u} \circ \hat{v}$ is easy to understand and implement, and is competitive with LQA in terms of speed. Furthermore, a modified version of the HPP can be adapted to provide fast convergence in sparse, high-dimensional scenarios. In Section 3 we consider extensions of this algorithm to non-convex $L_q$-penalized regression with $q < 1$. As in the $L_1$ case, $L_q$-penalized linear regression estimates may be found using alternating ridge regression, whereas estimates in generalized linear models can be obtained with a modified version of an iteratively reweighted least squares algorithm. In Section 4 we show how the HPP can facilitate structured sparsity in parameter estimates: the $L_2$ penalty on the vectors $u$ and $v$ can be interpreted as independent Gaussian prior distributions on the elements of $u$ and $v$. If instead we choose a penalty that mimics a dependent Gaussian prior, then we can achieve structured sparsity among the elements of $\hat{\beta} = \hat{u} \circ \hat{v}$. This technique is illustrated with an analysis of brain imaging data, for which a spatially structured HPP penalty is able to identify spatially contiguous regions of differential brain activity. A discussion follows in Section 5.
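The scalar identity underlying this equivalence is that the smallest value of $(u^2 + v^2)/2$ over all factorizations $uv = \beta$ is exactly $|\beta|$, attained at $|u| = |v| = \sqrt{|\beta|}$. A quick numerical check in Python/NumPy (illustrative code, not from the paper):

    import numpy as np

    # For a fixed coefficient beta, minimize (u**2 + v**2)/2 subject to u*v = beta.
    # By the arithmetic-geometric mean inequality the minimum is |beta|,
    # attained at |u| = |v| = sqrt(|beta|).
    beta = -1.7
    u = np.linspace(1e-3, 5.0, 200001)   # search over positive u; then v = beta/u
    vals = (u**2 + (beta / u)**2) / 2.0
    print(vals.min(), abs(beta))         # both are approximately 1.7

Applied coordinate-wise, this is why quadratic penalties on $u$ and $v$ translate into an $L_1$ penalty on $\beta = u \circ v$.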

Section snippets

The Hadamard product parametrization

The lasso or $L_1$-penalized regression estimate $\hat{\beta}$ of $\beta$ for the model $y \sim N_n(X\beta, \sigma^2 I)$ is the minimizer of $\|y - X\beta\|^2 + \lambda\|\beta\|_1$, or equivalently of the objective function
$$f(\beta) = \beta^\top Q \beta - 2\beta^\top l + \lambda\|\beta\|_1,$$
where $Q = X^\top X$ and $l = X^\top y$. Now reparametrize the model so that $\beta = u \circ v$, where "$\circ$" is the Hadamard (element-wise) product. We refer to this parametrization as the Hadamard product parametrization (HPP). Estimation of $u$ and $v$ using $L_2$ penalties corresponds to the following objective function:
$$g(u, v) = (u \circ v)^\top Q (u \circ v) - 2(u \circ v)^\top l + \lambda(u^\top u + v^\top v)/2.$$ …
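The resulting alternating scheme is easy to state: with $v$ held fixed, $g$ is an ordinary ridge objective in $u$ (with design matrix $X\,\mathrm{diag}(v)$), and vice versa. A minimal Python/NumPy sketch; the penalty constant $\lambda/2$ follows the objective above, and the function name and defaults are illustrative rather than taken from the paper:

    import numpy as np

    def hpp_lasso(X, y, lam, n_iter=1000, tol=1e-10):
        # Alternating ridge regressions for g(u, v); returns beta_hat = u * v.
        p = X.shape[1]
        Q, l = X.T @ X, X.T @ y
        u, v = np.ones(p), np.ones(p)
        for _ in range(n_iter):
            beta_old = u * v
            # With v fixed, g is a ridge objective in u:
            #   u' [diag(v) Q diag(v)] u - 2 u'(v*l) + (lam/2) u'u
            u = np.linalg.solve(Q * np.outer(v, v) + 0.5 * lam * np.eye(p), v * l)
            # Same form in v with u fixed.
            v = np.linalg.solve(Q * np.outer(u, u) + 0.5 * lam * np.eye(p), u * l)
            if np.max(np.abs(u * v - beta_old)) < tol:
                break
        return u * v

Each update is an ordinary ridge regression on a rescaled design matrix, which is what makes the algorithm straightforward to implement with standard linear-algebra routines.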

HPP for $L_q$-penalized linear regression

A natural generalization of the HPP is to write $\beta = u_1 \circ \cdots \circ u_K$ and optimize
$$g(u_1, \ldots, u_K) = (u_1 \circ \cdots \circ u_K)^\top Q (u_1 \circ \cdots \circ u_K) - 2(u_1 \circ \cdots \circ u_K)^\top l + \frac{\lambda}{K}\left(u_1^\top u_1 + \cdots + u_K^\top u_K\right).$$
For $K = 1$ the optimal $u$-value is the $L_2$-penalized ridge regression estimate, and for $K = 2$ the optimal value of $(u_1, u_2)$ gives the $L_1$-penalized lasso regression estimate, as discussed in the previous section. Values of $K$ greater than 2 correspond to non-convex $L_q$ penalties with $q = 2/K$. For example, the $L_{1/2}$-penalized estimate is obtained by optimizing (6) with $K = 4$.
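Under the same assumptions as the two-factor sketch above, the corresponding algorithm cycles through $u_1, \ldots, u_K$, updating each by a ridge regression with the other factors held fixed (names illustrative):

    import numpy as np

    def hpp_lq(X, y, lam, K=4, n_iter=5000, tol=1e-10):
        # Cycle of ridge updates for the K-factor HPP; K = 2 gives the lasso,
        # K = 4 targets the non-convex L_{1/2} penalty, and q = 2/K in general.
        p = X.shape[1]
        Q, l = X.T @ X, X.T @ y
        U = np.ones((K, p))
        for _ in range(n_iter):
            beta_old = np.prod(U, axis=0)
            for k in range(K):
                w = np.prod(np.delete(U, k, axis=0), axis=0)   # product of the other K-1 factors
                U[k] = np.linalg.solve(Q * np.outer(w, w) + (lam / K) * np.eye(p), w * l)
            if np.max(np.abs(np.prod(U, axis=0) - beta_old)) < tol:
                break
        return np.prod(U, axis=0)

For $K > 2$ the objective is non-convex, so the result can depend on the starting values; running the updates from several starting points is a reasonable precaution.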

Structured penalization with the HPP

It is well known that the lasso objective function $f(\beta) = \beta^\top Q \beta - 2\beta^\top l + \lambda\|\beta\|_1$ is equal to the scaled log posterior density of $\beta$ under a Laplace prior distribution on the elements of $\beta$ (Tibshirani, 1996; Figueiredo, 2003; Park and Casella, 2008). Specifically, for the linear regression model $y \sim N(X\beta, \sigma^2 I)$ and prior distribution $\beta_1, \ldots, \beta_p$ i.i.d. Laplace$(\lambda/[2\sigma^2])$, the posterior density of $\beta$ is given by
$$p(\beta \mid y, X, \sigma^2) \propto \exp\!\left(-\frac{\|y - X\beta\|^2}{2\sigma^2}\right)\exp\!\left(-\frac{\lambda\|\beta\|_1}{2\sigma^2}\right) \propto \exp\!\left(-\frac{1}{2\sigma^2}\left[\beta^\top Q \beta - 2\beta^\top l + \lambda\|\beta\|_1\right]\right) = \exp\!\left(-\frac{1}{2\sigma^2}f(\beta)\right),$$
where $Q = X^\top X$ and $l = X^\top y$ as above. …
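One way to realize the dependent-prior idea, sketched here under stated assumptions, is to replace the identity matrix in the ridge updates with the precision matrix of a structured Gaussian prior, so that the penalty becomes $\lambda(u^\top \Psi^{-1} u + v^\top \Psi^{-1} v)/2$ rather than $\lambda(u^\top u + v^\top v)/2$. The AR(1)-type precision below is purely illustrative and is not the spatial covariance model used for the brain imaging analysis in the paper:

    import numpy as np

    def hpp_structured(X, y, lam, Prec, n_iter=1000, tol=1e-10):
        # As in the lasso sketch above, but penalizing (lam/2)(u' Prec u + v' Prec v),
        # where Prec is the precision matrix of a dependent Gaussian prior.
        p = X.shape[1]
        Q, l = X.T @ X, X.T @ y
        u, v = np.ones(p), np.ones(p)
        for _ in range(n_iter):
            beta_old = u * v
            u = np.linalg.solve(Q * np.outer(v, v) + 0.5 * lam * Prec, v * l)
            v = np.linalg.solve(Q * np.outer(u, u) + 0.5 * lam * Prec, u * l)
            if np.max(np.abs(u * v - beta_old)) < tol:
                break
        return u * v

    # Illustrative precision for serially ordered coefficients (AR(1)-type neighbor
    # structure); Prec = np.eye(p) recovers the independent penalty of Section 2.
    p, rho = 50, 0.9
    Prec = ((1 + rho**2) * np.eye(p) - rho * (np.eye(p, k=1) + np.eye(p, k=-1))) / (1 - rho**2)

With such a precision matrix, coefficients at neighboring locations are encouraged to be active (or zero) together, which is the sense in which the penalty induces structured sparsity.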

Discussion

The Hadamard product parametrization provides a simple and intuitive method for obtaining $L_q$-penalized regression estimates for certain values of $q$. In terms of accessibility to practitioners, the HPP algorithm is similar to the LQA algorithm: both proceed by iterative ridge regression. Unlike the "ridge" term of the LQA algorithm, that of the HPP algorithm is bounded near zero, suggesting that the HPP is to be preferred over LQA for reasons of numerical stability. However, for the …

Acknowledgments

I thank Panos Toulis for helpful comments. This research was supported by NSF grant DMS-1505136.

References

  • Deutsch, G.K., et al., 2005. Correlations between white matter microstructure and reading performance in children. Cortex.
  • Styan, G.P.H., 1973. Hadamard products and multivariate statistical analysis. Linear Algebra Appl.
  • Efron, B., 2010. Large-Scale Inference. Cambridge University Press.
  • Fan, J., et al., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc.
  • Figueiredo, M.A.T., 2003. Adaptive sparseness for supervised learning. IEEE Trans. Pattern Anal. Mach. Intell.
  • Friedman, J., et al., 2010. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw.
  • Fu, W.J., 1998. Penalized regressions: the bridge versus the lasso. J. Comput. Graph. Statist.
  • Goeman, J.J., Meijer, R.J., Chaturvedi, N., 2016. Penalized: L1 (Lasso and Fused Lasso) and L2 (Ridge) Penalized...
  • Griffin, J.E., et al., 2010. Inference with normal-gamma prior distributions in regression problems. Bayesian Anal.
  • Hunter, D.R., et al., 2005. Variable selection using MM algorithms. Ann. Statist.
  • Kabán, A., 2013. Fractional norm regularization: learning with very few relevant features. IEEE Trans. Neural Netw. Learn. Syst.