1 Introduction

In this paper we propose an algorithm for optimization problems of the form

$$\begin{aligned} \displaystyle \mathop {\hbox {minimize}}_{x}\quad f(x)\quad {\hbox {subject to}}\quad x \in {\mathcal {C}}, \end{aligned}$$
(1)

where \(f:{\mathbb {R}}^n \rightarrow {\mathbb {R}}\) is a convex, twice continuously differentiable function, and \({\mathcal {C}}\) is a polyhedral set in \({\mathbb {R}}^n\). The main focus of the paper is the application and specialization of the framework to the Lasso problem [1]:

$$\begin{aligned} \displaystyle \mathop {\hbox {minimize}}_{x}\quad \Vert Ax-b\Vert _2\quad {\hbox {subject to}}\quad \Vert x\Vert _1 \le \tau , \end{aligned}$$
(\({{\hbox {LS}}_{\tau }}\))

and its extension to the weighted one-norm. The work in this paper is motivated by the need for an efficient and accurate solver for the Lasso subproblems appearing in the spgl1 [2] solver for basis-pursuit denoise [3] problems of the form

$$\begin{aligned} \displaystyle \mathop {\hbox {minimize}}_{x}\quad \Vert x\Vert _1\quad {\hbox {subject to}}\quad \Vert Ax-b\Vert _2 \le \sigma . \end{aligned}$$
(\({{\hbox {BP}}_{\sigma }}\))

Both the (\({{\hbox {BP}}_{\sigma }}\)) and (\({{\hbox {LS}}_{\tau }}\)) formulations are central to compressed sensing [4, 5] as a means of recovering sparse or approximately sparse vectors \(x_0\) from linearly compressed and often noisy observations \(b = Ax_0 + z\). The basis-pursuit denoise formulation (\({{\hbox {BP}}_{\sigma }}\)) is often a more natural choice in practice compared to the Lasso formulation, as it is parameterized in the noise level \(\sigma \), rather than in \(\tau \), the one-norm of the unknown signal \(x_0\), which may be more difficult to determine.

It was shown in [2] that basis-pursuit denoise and Lasso are connected through the Pareto curve

$$\begin{aligned} p(\tau ) = \min _{x}\quad \Vert Ax-b\Vert _2 \quad {\hbox {subject to}}\quad \Vert x\Vert _1 \le \tau , \end{aligned}$$

and that solving (\({{\hbox {BP}}_{\sigma }}\)) can be reduced to finding the smallest \(\tau \) for which the Lasso solution \(x_{\tau }^*\) satisfies \(\Vert Ax_{\tau }^*-b\Vert \le \sigma \). Denoting by \(\tau _{\sigma }\) this critical value of \(\tau \), and assuming that b lies in the range space of A, it was shown in [2] that the Pareto curve is convex and differentiable at all \(\tau \in [0,\tau _0)\) with derivative \(p'(\tau ) = -\Vert A^Tr\Vert _{\infty } / \Vert r\Vert _2\), where r denotes the misfit \(Ax_{\tau }^* - b\). Evaluation of both \(p(\tau )\) and \(p'(\tau )\) relies on the misfit r, which can be obtained by solving (\({{\hbox {LS}}_{\tau }}\)). The spgl1 solver proposed in [2] applies root finding on the Pareto curve, as illustrated in Fig. 1, to solve \(p(\tau ) = \sigma \) and thereby reduce basis-pursuit denoise to a series of Lasso problems. In spgl1 these subproblems are solved using the spectral projected-gradient (SPG) algorithm [6], which we discuss in more detail in Sect. 2.
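To make the root-finding scheme concrete, the following sketch performs Newton updates on \(p(\tau ) = \sigma \) using the derivative formula above. The helpers `proj_l1` and `solve_lasso` are deliberately simple toy stand-ins (not the spgl1 implementation), and all names and parameter defaults are ours:

```python
import numpy as np

def proj_l1(v, tau):
    """Euclidean projection of v onto the l1 ball of radius tau
    (standard sort-based algorithm, O(n log n))."""
    if tau <= 0:
        return np.zeros_like(v)
    u = np.abs(v)
    if u.sum() <= tau:
        return v.copy()
    s = np.sort(u)[::-1]
    cssv = np.cumsum(s)
    k = np.nonzero(s - (cssv - tau) / (np.arange(len(s)) + 1) > 0)[0][-1]
    lam = (cssv[k] - tau) / (k + 1.0)
    return np.sign(v) * np.maximum(u - lam, 0.0)

def solve_lasso(A, b, tau, iters=500):
    """Toy projected-gradient Lasso solver; a stand-in for the SPG subproblem solver."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        r = A @ x - b
        x = proj_l1(x - (A.T @ r) / L, tau)
    return x

def bpdn_by_root_finding(A, b, sigma, tau0=0.0, iters=20):
    """Newton root finding on the Pareto curve: solve p(tau) = sigma."""
    tau = tau0
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = solve_lasso(A, b, tau)
        r = A @ x - b
        p = np.linalg.norm(r)              # p(tau) = ||A x* - b||_2
        if p <= sigma:
            break
        dp = -np.linalg.norm(A.T @ r, np.inf) / p   # p'(tau)
        tau = tau - (p - sigma) / dp                # Newton update
    return x, tau
```

With exact subproblem solves, each Newton step requires one Lasso solve, which is why an efficient and accurate Lasso solver is central to the overall scheme.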

For certain problems it was observed that SPG generates long sequences of iterates that all lie on the same face of the polyhedral set of feasible points. This suggests the use of an active-set type method in which a quasi-Newton method, such as the limited-memory BFGS (L-BFGS) method [7], is used to minimize the problem restricted to the currently active face. In order to obtain an efficient solver it is important to avoid unnecessarily costly subproblem solves or complicated heuristics to determine when to switch between the solvers. In this paper we therefore propose a hybrid algorithm that switches between the two methods in a seamless and lightweight fashion.

Fig. 1 Example of root finding on the Pareto curve \(p(\tau )\)

1.1 Paper outline

In Sect. 2 we provide a concise background on the SPG and L-BFGS methods along with some of their theoretical properties. We then describe the proposed algorithm for the general problem formulation (1) in Sect. 3. In Sect. 4 we study the geometry of the constraints in the Lasso problem, and develop the tools needed for an efficient implementation of the framework for Lasso. Numerical experiments are provided in Sect. 5, followed by the conclusions in Sect. 6.

1.2 Notation and definitions

We use calligraphic capital letters for sets. Given any two sets \({\mathcal {S}}_1\) and \({\mathcal {S}}_2\), we write \({\mathcal {S}}_1 + {\mathcal {S}}_2\) for \(\{x_1 + x_2 \mid x_1 \in {\mathcal {S}}_1,\ x_2\in {\mathcal {S}}_2\}\), and likewise for \({\mathcal {S}}_1 - {\mathcal {S}}_2\). In the apparent absence of established terminology, we define the difference hull of a set \({\mathcal {S}}\), \(\mathrm {diff\ hull}({{\mathcal {S}}})\), as the linear hull of differences, \(\mathrm {span}\{u_1 - u_2 \mid u_1,u_2\in {\mathcal {S}}\}\). The difference hull can be seen as the linear subspace corresponding to the affine hull of \({\mathcal {S}}\) translated to contain the origin. For any x in a polyhedral set \({\mathcal {C}}\), we define \({\mathcal {F}}(x)\) to be the unique face \({\mathcal {F}}\) of \({\mathcal {C}}\) for which \(x \in \mathrm {relint}({\mathcal {F}})\); this may be \({\mathcal {C}}\) itself. The normal cone of \({\mathcal {C}}\) at x is given by \( {\mathcal {N}}(x)\,{:}{=}\,\{d \in {\mathbb {R}}^n \mid {\mathcal {P}}(x + d) = x\}\), where \({\mathcal {P}}\) denotes orthogonal projection onto \({\mathcal {C}}\), defined below. The normal cones \({\mathcal {N}}(x)\) are identical for all \(x \in \mathrm {relint}({\mathcal {F}})\), and we can therefore define the normal cone of a face \({\mathcal {N}}({\mathcal {F}})\) as \({\mathcal {N}}(x)\) evaluated at any arbitrary \(x \in \mathrm {relint}({\mathcal {F}})\). Orthogonal projection of any vector \(v \in {\mathbb {R}}^n\) onto \({\mathcal {C}}\) is defined as

$$\begin{aligned} {\mathcal {P}}(v)\,{:}{=}\,\mathop {\hbox {arg min}}_{x}\quad \Vert x-v\Vert _2\quad {\hbox {subject to}}\quad x\in {\mathcal {C}}. \end{aligned}$$

Given any point \(x \in \mathrm {relint}({\mathcal {F}})\) we are interested in the set of directions \(d \in {\mathbb {R}}^n\) for which there exists an \(\epsilon > 0\) such that the projection of \(x + \epsilon d\) lies in \({\mathcal {F}}\). It can be verified that

$$\begin{aligned} {\mathcal {S}}(x)&{:}{=}\{ d \in {\mathbb {R}}^n \mid \exists \ \epsilon > 0: {\mathcal {F}}[{\mathcal {P}}(x + \epsilon d)]= {\mathcal {F}}(x)\}\\&= {\mathcal {N}}(x) + \mathrm {diff\ hull}({{\mathcal {F}}(x)}). \end{aligned}$$

Because the normal cone \({\mathcal {N}}(x)\) is the same for all \(x \in \mathrm {relint}({\mathcal {F}})\) we can define the self-projection cone of a face \({\mathcal {F}}\) as \({\mathcal {S}}({\mathcal {F}})\,{:}{=}\,{\mathcal {N}}({\mathcal {F}}) + \mathrm {diff\ hull}({{\mathcal {F}}})\). Note that \({\mathcal {N}}({\mathcal {F}}) \perp \mathrm {diff\ hull}({{\mathcal {F}}})\), or stronger yet, that the difference hull of \({\mathcal {F}}\) is the orthogonal complement of the linear hull of \({\mathcal {N}}({\mathcal {F}})\). For any k-face \({\mathcal {F}}\) of \({\mathcal {C}}\), \(k\ge 1\), we denote by \({\varPhi }({\mathcal {F}}) \in {\mathbb {R}}^{n \times k}\) an arbitrary but fixed orthonormal basis for \(\mathrm {diff\ hull}({{\mathcal {F}}})\). We will never use \({\varPhi }({\mathcal {F}})\) when \({\mathcal {F}}\) is a vertex and therefore leave it undefined. We denote by \(e_i\) the ith column of an identity matrix whose size is clear from the context. The proximal operator of a function f is defined as \(\mathrm {prox}_f(u)\,{:}{=}\,\mathop {\hbox {arg min}}_x f(x) + {\textstyle {\frac{1}{2}}}\Vert x-u\Vert _2^2\).

2 Background

2.1 The nonmonotone spectral projected-gradient method

Algorithm 1 The nonmonotone spectral projected-gradient (SPG) method

The nonmonotone spectral projected-gradient method (SPG), outlined in Algorithm 1, was introduced by Birgin et al. [6] for problems of the form (1), with \({\mathcal {C}}\) a closed convex set in \({\mathbb {R}}^n\), and \(f: {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) a function with continuous partial derivatives on an open set that contains \({\mathcal {C}}\). The SPG algorithm is based on the curvilinear projected-gradient method, which performs a line search along the curvilinear trajectory (see also [8]):

$$\begin{aligned} x(\alpha ) = {\mathcal {P}}(x_i - \alpha \nabla f(x_i)),\quad \alpha \ge 0. \end{aligned}$$
(2)

In order to speed up the method, two important modifications were introduced in [6]. The first modification allows a limited level of nonmonotonicity in the objective value. Given \(\mu ,\gamma \in (0,1)\), the Armijo-type line search starts with an initial step length \(\alpha _i\), and then finds the first nonnegative integer k such that

$$\begin{aligned} f(x(\mu ^k \alpha _i)) \le \max _{0 \le j \le \min \{i,M-1\} }\{f(x_{i-j})\} + \gamma (\nabla f(x_i))^T( x(\mu ^k\alpha _i) - x_i). \end{aligned}$$
(3)

The right-hand side of this condition ensures sufficient descent, but only with respect to the maximum of up to M of the most recent objective values. In case \(M=1\) this reduces to the standard Armijo line-search condition. The second modification is the use of the spectral step length, as proposed by Barzilai and Borwein [9]. Given \(s = x_i - x_{i-1}\) and \(y = \nabla f(x_i) - \nabla f(x_{i-1})\), the initial step length at iteration i is defined as

$$\begin{aligned} \alpha _{i} = {\left\{ \begin{array}{ll} \max \left( \min \left( \frac{s^Ts}{s^Ty},\ \alpha _{\max }\right) ,\ \alpha _{\min }\right) &{}\quad \mathrm {if}\ s^Ty > 0, \\ \alpha _{\max } &{}\quad \mathrm {otherwise}, \end{array}\right. } \end{aligned}$$

where \(0< \alpha _{\min } < \alpha _{\max }\) are fixed parameters. More information on the motivation behind this particular choice of step length can be found in [6, 10]. It was shown in [6] that any accumulation point \(x^*\) of the sequence \(\{x_i\}\) lies in \({\mathcal {C}}\) and satisfies

$$\begin{aligned} \Vert {\mathcal {P}}(x^* - \nabla f(x^*)) - x^*\Vert _2 = 0. \end{aligned}$$
(4)

Practical implementations of Algorithm 1 may use a relaxed version of (4) along with other conditions as a stopping criterion.
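The structure of Algorithm 1 can be sketched as follows, assuming callables `f`, `grad`, and `proj` for the objective, gradient, and projection onto \({\mathcal {C}}\); this is a minimal illustration, and the parameter defaults are ours rather than those of [6]:

```python
import numpy as np

def spg(f, grad, proj, x0, M=10, mu=0.5, gamma=1e-4,
        a_min=1e-10, a_max=1e10, iters=1000, tol=1e-8):
    """Sketch of the nonmonotone spectral projected-gradient method."""
    x = proj(np.asarray(x0, dtype=float))
    g = grad(x)
    hist = [f(x)]                          # recent objective values for (3)
    alpha = 1.0
    for _ in range(iters):
        # optimality measure (4): ||P(x - grad f(x)) - x||
        if np.linalg.norm(proj(x - g) - x) <= tol:
            break
        fref = max(hist[-M:])              # nonmonotone reference value
        k = 0
        while True:                        # Armijo-type backtracking, cf. (3)
            xk = proj(x - (mu ** k) * alpha * g)
            if f(xk) <= fref + gamma * g @ (xk - x) or k > 30:
                break
            k += 1
        s, y = xk - x, grad(xk) - g
        x, g = xk, g + y
        hist.append(f(x))
        sty = s @ y                        # spectral (Barzilai-Borwein) step
        alpha = a_max if sty <= 0 else min(max(s @ s / sty, a_min), a_max)
    return x
```

Note that the curvilinear trajectory (2) is traced implicitly: each backtracking trial re-projects the scaled gradient step.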

2.2 Limited-memory BFGS

The L-BFGS algorithm by Liu and Nocedal [7] is a popular quasi-Newton method for unconstrained minimization of smooth functions \({\hat{f}}: {\mathbb {R}}^n \rightarrow {\mathbb {R}}\):

$$\begin{aligned} \displaystyle \mathop {\hbox {minimize}}_{x}\quad {\hat{f}}(x). \end{aligned}$$

At each iteration, the algorithm constructs a positive-definite approximation \(H_i\) of the inverse Hessian of \({\hat{f}}\) at \(x_i\), based on an initial positive-definite matrix H along with \({\hat{n}} = \min \{i,N\}\) of the most recent vector pairs \(\{s_{i-j}\), \(y_{i-j}\}_{j=0}^{{\hat{n}}-1}\), where

$$\begin{aligned} s_j = x_j - x_{j-1},\quad \mathrm {and}\quad y_j = \nabla {\hat{f}}(x_j) - \nabla {\hat{f}}(x_{j-1}), \end{aligned}$$
(5)

and N denotes the memory size for L-BFGS. The iterates in L-BFGS are of the form \(x_{i+1} = x_{i} + \alpha _i d_i\), where the search direction \(d_i\) is given by

$$\begin{aligned} d_{i} = -H_i \cdot \nabla {\hat{f}}(x_i), \end{aligned}$$
(6)

and the step size \(\alpha _i\) is chosen to satisfy the Wolfe conditions:

$$\begin{aligned}&{\hat{f}}(x_i + \alpha _i d_i) \le {\hat{f}}(x_i) + \gamma _1 \alpha _i (\nabla {\hat{f}}(x_i))^T d_i, \end{aligned}$$
(7a)
$$\begin{aligned}&(\nabla {\hat{f}}(x_i + \alpha _i d_i))^Td_i \ge \gamma _2 (\nabla {\hat{f}}(x_i))^Td_i. \end{aligned}$$
(7b)

Parameters \(\gamma _1\) and \(\gamma _2\) are chosen such that \(0< \gamma _1 < {\textstyle {\frac{1}{2}}}\), and \(\gamma _1< \gamma _2 < 1\). For details on the structure of the inverse approximation \(H_i\) and efficient ways of evaluating the matrix-vector product in (6), see [7, 11].
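The product in (6) is typically evaluated without ever forming \(H_i\), via the standard two-loop recursion [7, 11]. A minimal sketch, with initial matrix \(H = \gamma I\) and the pairs ordered oldest to newest:

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list, gamma=1.0):
    """Two-loop recursion: compute d = -H g from stored (s, y) pairs,
    with initial inverse Hessian H0 = gamma * I."""
    q = np.asarray(g, dtype=float).copy()
    cache = []
    for s, y in reversed(list(zip(s_list, y_list))):   # newest to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        cache.append((a, rho, s, y))
        q = q - a * y
    r = gamma * q                                      # apply H0
    for a, rho, s, y in reversed(cache):               # oldest to newest
        beta = rho * (y @ r)
        r = r + (a - beta) * s
    return -r                                          # d = -H g
```

With an empty history this reduces to steepest descent scaled by \(\gamma \); with pairs stored, the resulting \(H\) satisfies the secant condition \(Hy_j \approx s_j\) for the most recent pair.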

2.2.1 Convergence results

For the analysis of the L-BFGS algorithm, Liu and Nocedal make the following assumptions [7]:

Assumption 1

Given starting point \(x_0\), we have:

  1. The objective function \({\hat{f}}\) is twice continuously differentiable;

  2. The level set \({\mathcal {D}}\,{:}{=}\,\{ x \in {\mathbb {R}}^n \mid {\hat{f}}(x) \le {\hat{f}}(x_0)\}\) is convex; and

  3. There exist positive constants \(\mu _1\) and \(\mu _2\) such that

     $$\begin{aligned} \mu _1 \Vert v\Vert ^2 \le v^T [\nabla ^2 {\hat{f}}(x)]\, v \le \mu _2\Vert v\Vert ^2, \end{aligned}$$
     (8)

     for all \(x \in {\mathcal {D}}\) and \(v \in {\mathbb {R}}^n\).

Under these conditions, and with some simplifications, they prove the following:

Theorem 2

(Liu and Nocedal [7]) The L-BFGS algorithm generates a sequence \(\{x_i\}\) that converges to the unique minimizer \(x^*\) in \({\mathcal {D}}\). Moreover, there exists a constant \(c > 0\) such that

$$\begin{aligned} {\hat{f}}(x_{i+1}) - {\hat{f}}(x^*) \le (1 - c)({\hat{f}}(x_i) - {\hat{f}}(x^*)). \end{aligned}$$
(9)

3 Proposed algorithm

The proposed algorithm can be seen as a modification of the SPG method that allows the use of quasi-Newton steps over a currently active face. The basic idea is that whenever two successive iterates \(x_i\) and \(x_{i-1}\) lie on the same face, we can form or update a quadratic model of \({\hat{f}}\), the objective function restricted to the face. Whenever a model for the current face is available, the algorithm will attempt a quasi-Newton step that is restricted to the face and satisfies the Wolfe conditions (7). If the quasi-Newton step fails, or is otherwise abandoned, the algorithm simply falls back to the SPG method and takes a regular spectral projected-gradient step. After each step—regardless of the type—we check the conditions required to update the quadratic model and initiate the quasi-Newton step:

$$\begin{aligned} {\mathcal {F}}(x_{i}) = {\mathcal {F}}(x_{i-1})\quad \text{ and }\quad -\nabla f(x_i) \in \mathrm {self\ proj}({{\mathcal {F}}_i}). \end{aligned}$$
(10)

If these conditions are not met, we discard the Hessian approximation used in the quadratic model. The self-projection criterion in (10) is essential; omitting it could cause the algorithm to take repeated quasi-Newton iterations that converge to a minimum on the relative interior of the face rather than to the global minimum.
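For illustration only, condition (10) is easy to instantiate when \({\mathcal {C}}\) is the unit box \([0,1]^n\) (not the Lasso set of Sect. 4): a face is encoded by which bounds are active, and the self-projection cone simply constrains the sign of the direction at active coordinates. All function names below are ours:

```python
import numpy as np

def face_of_box(x, tol=1e-12):
    """Face of [0,1]^n containing x in its relative interior, encoded as
    -1 (at lower bound), +1 (at upper bound), 0 (free) per coordinate."""
    return tuple(-1 if xi <= tol else (1 if xi >= 1 - tol else 0) for xi in x)

def in_self_projection_cone(face, d):
    """For the box: d is in self proj(F) iff d_i <= 0 at lower bounds and
    d_i >= 0 at upper bounds; free coordinates are unconstrained."""
    return all((fi == 0) or (fi == -1 and di <= 0) or (fi == 1 and di >= 0)
               for fi, di in zip(face, d))

def should_attempt_qn(x_prev, x_curr, g_curr):
    """Condition (10): same face twice in a row, and -grad in the
    self-projection cone of that face."""
    F_prev, F_curr = face_of_box(x_prev), face_of_box(x_curr)
    return F_prev == F_curr and in_self_projection_cone(F_curr, -g_curr)
```

Section 4 develops the analogous tests for the weighted one-norm ball, where the face is characterized by the sign pattern of the iterate.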

3.1 Quasi-Newton over a face

One way of performing quasi-Newton over a face is by maintaining an inverse Hessian approximation using the update vectors in (5), and computing the search direction \(d_i\) using (6). However, this approach has some major disadvantages. First, we may have that \(d_i \not \in \mathrm {diff\ hull}({{\mathcal {F}}_i})\), which means that \(x_i + \alpha d_i \not \in {\mathcal {F}}_i\) for all nonzero \(\alpha \). This could be partially solved by projection onto the face, but such a projected direction is no longer guaranteed to be a descent direction [12]. This too could be addressed by modifying the Hessian, but doing so would further complicate the algorithm. A second disadvantage is that we maintain the inverse Hessian approximation for the higher-dimensional ambient space, and the approximation may therefore not be very accurate along \(\mathrm {aff}({\mathcal {F}}_i)\).

Algorithm 2 The proposed hybrid algorithm

The solution of the above problems is straightforward: we simply work with a representation for the function restricted to \(\mathrm {aff}({\mathcal {F}}_i)\). Let \({\mathcal {F}}_i\) be a k-dimensional face with \(k > 0\). Then we can find an orthonormal basis \({\varPhi }\,{:}{=}\,{\varPhi }({\mathcal {F}}_i) \in {\mathbb {R}}^{n\times k}\) whose span coincides with \(\mathrm {diff\ hull}({{\mathcal {F}}_i})\). Using \({\varPhi }\) we can write any point \(x\in {\mathcal {F}}_i\) as \(x = v + {\varPhi } c\), where \(v \in {\mathbb {R}}^n\) is an arbitrary but fixed point in \({\mathcal {F}}_i\), and \(c \in {\mathbb {R}}^k\) is a coefficient in the lower-dimensional space. The function \({\hat{f}}:{\mathbb {R}}^k\rightarrow {\mathbb {R}}\), which restricts f to the current face, is then given by \({\hat{f}}(c) = f(v + {\varPhi } c)\). The idea then is to form the inverse Hessian approximation over \({\hat{f}}\), and use it to obtain a search direction \({\hat{d}} \in {\mathbb {R}}^k\), which can then be mapped back to the ambient space for the actual line search. In particular, we can form the approximate inverse Hessian \(H_i \in {\mathbb {R}}^{k\times k}\) by updating an initial positive definite H using

$$\begin{aligned} s_i = {\varPhi }^T(x_i - x_{i-1})\quad \text{ and }\quad y_i = {\varPhi }^T(\nabla f(x_i) - \nabla f(x_{i-1})). \end{aligned}$$
(11)

In order to obtain the search direction we first compute \(\nabla {\hat{f}}({\varPhi }^T(x_i - v)) = {\varPhi }^T \nabla f(x_i)\) by projecting the gradient \(\nabla f(x_i)\) onto the lower-dimensional space. We then apply the inverse Hessian followed by back projection, giving:

$$\begin{aligned} d_i = {\varPhi } H_i{\varPhi }^T(-\nabla f(x_i)). \end{aligned}$$

The resulting vector clearly lies in \(\mathrm {diff\ hull}({{\mathcal {F}}_i})\) and we therefore have that \(x_i + \alpha d_i \in {\mathcal {F}}_i\) for all step sizes \(\alpha \in [0,\alpha _{\mathrm {bnd}}]\), for some \(\alpha _{\mathrm {bnd}} > 0\), possibly with \(\alpha _{\mathrm {bnd}} = +\,\infty \). Line search is equivalently done in the projected or ambient space, and we aim to find a step size \(\alpha \) within the above range, such that the Wolfe conditions (7) are met. For the line search we could start with a unit step length, whenever \(\alpha _{\mathrm {bnd}} \ge 1\), or we could try \(\alpha = \alpha _{\mathrm {bnd}}\) first to encourage exploring lower-dimensional faces, provided of course that \(\alpha _{\mathrm {bnd}} < \infty \). If no suitable step length can be found, or a certain maximum number of trial steps is taken, we abandon the quasi-Newton step and take a spectral projected-gradient step instead. As a final remark, note that condition (10) should never be met for vertices, since that would imply not only that \(x_i = x_{i-1}\), but also that \(-\nabla f(x_{i-1}) \in \mathrm {self\ proj}({{\mathcal {F}}_{i-1}}) = {\mathcal {N}}({\mathcal {F}}_{i-1})\), which means that the optimality condition given in (4) would already have been satisfied at \(x_{i-1}\).
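The reduced-gradient construction above can be sketched numerically. Here \({\varPhi }\) is a random orthonormal basis and \(H\) an identity placeholder for the L-BFGS approximation over \({\hat{f}}\), both purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2

# Phi: stand-in for an orthonormal basis of diff hull(F_i)
Phi, _ = np.linalg.qr(rng.standard_normal((n, k)))
H = np.eye(k)                  # placeholder inverse-Hessian approximation for fhat

g = rng.standard_normal(n)     # ambient gradient  grad f(x_i)
g_hat = Phi.T @ g              # reduced gradient  grad fhat = Phi^T grad f(x_i)
d = Phi @ (H @ (-g_hat))       # search direction  d_i = -Phi H_i Phi^T grad f(x_i)

# The update pairs (11) would likewise be formed in the reduced space:
#   s_i = Phi.T @ (x_i - x_prev),  y_i = Phi.T @ (grad_i - grad_prev)
```

Since \(d\) is built from columns of \({\varPhi }\), it lies in \(\mathrm {diff\ hull}({{\mathcal {F}}_i})\) by construction, and with \(H\) positive definite it is a descent direction whenever the reduced gradient is nonzero.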

3.2 Convergence

For the convergence analysis of Algorithm 2 we rely on the results in [6] and [7], and add a step to the algorithm that resets the objective-value history used by SPG after each run of successful quasi-Newton iterations, which ensures that every subsequent iteration has a lower objective value. We use the following assumptions, which are somewhat more restrictive than those in the aforementioned two papers (see for example Assumption 1):

Assumption 3

We assume that

  1. Function f is twice continuously differentiable, and bounded below;

  2. There exist constants \(0< \mu _1 \le \mu _2 < \infty \) such that for all \(x,v \in {\mathbb {R}}^n\):

     $$\begin{aligned} \mu _1\Vert v\Vert _2^2 \le v^T \nabla ^2 f(x) v \le \mu _2\Vert v\Vert _2^2. \end{aligned}$$
     (12)

Under these assumptions, which imply strong convexity of f, we have the following result:

Theorem 4

Let f(x) satisfy Assumption 3 and let \(x_0 \in {\mathcal {C}}\). Then the sequence \(\{x_t\}\) generated by Algorithm 2 converges to the unique minimizer of (1).

Proof

Assumption 3 ensures the existence of a unique minimizer \(x^*\) to (1), which satisfies

$$\begin{aligned} -\nabla f(x^*) \in {\mathcal {N}}(x^*). \end{aligned}$$

If there exists a finite t for which \(x_t = x^*\), we are done. Suppose, therefore, that \(x_t \ne x^*\) for all t. We consider two cases. First, if there are finitely many quasi-Newton steps, there must be a \(\bar{t}\) such that all iterations \(t > \bar{t}\) are of the projected-gradient type. In this case the result follows directly from the analysis in [6]. Second, consider the case where there are infinitely many quasi-Newton steps. For a fixed face \({\mathcal {F}}\), minimizing \({\hat{f}}\) is equivalent to minimizing the objective over the affine hull of the current face \({\mathcal {F}}\):

$$\begin{aligned} \displaystyle \mathop {\hbox {minimize}}_{x}\quad f(x)\quad {\hbox {subject to}}\quad x \in \mathrm {aff}({\mathcal {F}}). \end{aligned}$$
(13)

For any successful step it follows from Theorem 2 that there exists a constant \(c > 0\) such that the quasi-Newton step satisfies

$$\begin{aligned} f(x_{t+1}) - f(x_{{\mathcal {F}}}^*) \le (1 - c)(f(x_t) - f(x_{{\mathcal {F}}}^*)), \end{aligned}$$
(14)

where \(x_{{\mathcal {F}}}^*\) denotes the minimizer of (13). Because the history of the M most recent objective values is reset after each successful quasi-Newton step, any intermediate projected-gradient step will not increase the objective. Based on this, Lemma 1 below shows that the number of quasi-Newton iterates on any \({\mathcal {F}}\) that does not contain \(x^*\) is finite. By polyhedrality of the domain, the number of faces is bounded, and we must therefore take infinitely many iterations on at least one face that contains \(x^*\). Repeated application of (14) then shows that the objective value converges to \(f(x_{{\mathcal {F}}}^*) = f(x^*)\). From Assumption 3 it then follows that \(\{x_t\}\) converges to \(x^*\). \(\square \)

Lemma 1

Let \({\mathcal {F}}\) be a face of \({\mathcal {C}}\) such that \(x^* \not \in {\mathcal {F}}\). Then the number of quasi-Newton steps on \({\mathcal {F}}\) taken by Algorithm 2 is finite.

Proof

The quasi-Newton method is applied to \({\hat{f}}\), which is equivalent to f restricted to \({\mathcal {F}}\). We therefore focus on (13). Let \(x_{{\mathcal {F}}}^*\) be the solution to (13), and denote by \(x_{[j]}\) and \(x_{[j]+1}\) the starting, respectively ending, point for the jth quasi-Newton step on \({\mathcal {F}}\). It follows from Theorem 2 that

$$\begin{aligned} f(x_{[j]}) - f(x_{{\mathcal {F}}}^*) \le f(x_{[j-1]+1}) - f(x_{{\mathcal {F}}}^*) \le (1-c)(f(x_{[j-1]}) - f(x_{{\mathcal {F}}}^*)), \end{aligned}$$
(15)

for \(j \ge 2\). This holds since any intermediate quasi-Newton iteration can only reduce the objective, and likewise for projected-gradient steps, as a consequence of resetting the function-value history.

We consider two cases. First assume that \(x_{{\mathcal {F}}}^* \not \in {\mathcal {F}}\). Let \(\bar{f}\) be the minimum of f(x) over \(x\in {\mathcal {F}}\). Repeated application of (15) gives

$$\begin{aligned} f(x_{[j]+1}) - f(x_{{\mathcal {F}}}^*) \le (1-c)^j(f(x_{[1]}) - f(x_{{\mathcal {F}}}^*)). \end{aligned}$$
(16)

For sufficiently large, but finite j, the right-hand side in (16) must fall below \(\bar{f} - f(x_{{\mathcal {F}}}^*)\), which is strictly positive. Since every successful quasi-Newton step results in a vector \(x_{[j]+1} \in {\mathcal {F}}\) by construction, it follows that the number of quasi-Newton iterates on \({\mathcal {F}}\) must be bounded.

For the second case, assume that \(x_{{\mathcal {F}}}^* \in {\mathcal {F}}\). Because optimization is done over \(\mathrm {aff}({\mathcal {F}})\), it holds that \(-\nabla f(x_{{\mathcal {F}}}^*) \perp \mathrm {diff\ hull}({{\mathcal {F}}})\). For \(-\nabla f(x_{{\mathcal {F}}}^*) \in \mathrm {self\ proj}({{\mathcal {F}}})\), we must therefore have \(-\nabla f(x_{{\mathcal {F}}}^*) \in {\mathcal {N}}(x_{{\mathcal {F}}}^*)\), but this cannot be the case since it would imply that \(x_{{\mathcal {F}}}^*\) is a global minimizer. (The same holds when \(x_{{\mathcal {F}}}^*\) lies on a lower-dimensional subface on the boundary of \({\mathcal {F}}\).) Since f is continuously differentiable by assumption, it follows that \(-\nabla f(x) \not \in \mathrm {self\ proj}({{\mathcal {F}}})\) for all points \(x \in {\mathcal {F}}\) sufficiently close to \(x_{{\mathcal {F}}}^*\). Assumption 3 then allows us to define a sufficiently close neighborhood as the level set \(f(x) \le \bar{f}\) over \(x\in {\mathcal {F}}\), where \(\bar{f} > f(x_{{\mathcal {F}}}^*)\). Applying the same argument we used above shows that the right-hand side of (16) again falls below \(\bar{f} - f(x_{{\mathcal {F}}}^*)\) for sufficiently large j. Once this happens all following iterates \(x_t \in {\mathcal {F}}\) must have \(f(x_t) \le \bar{f}\). Since the self-projection cone condition \(-\nabla f(x) \in \mathrm {self\ proj}({{\mathcal {F}}})\) does not hold at these points, no more quasi-Newton steps are taken on \({\mathcal {F}}\). \(\square \)

A similar analysis holds when the spectral projected-gradient method is replaced by another convergent algorithm, provided that the iterates do not exceed the initial objective value.

4 Application to Lasso

The proposed algorithm depends on a number of operations on the constraint set. In particular, it has to determine the face in which the current iterate lies, check membership of the self-projection cone, and determine an orthonormal basis for the current face. For the algorithm to be of practical use, the constraint set must therefore be simple enough to allow efficient evaluation of these operations. As this work was motivated by improving the Lasso problem, we focus on the weighted one-norm ball (which for unit weights is also known as the cross-polytope or n-octahedron [13]):

$$\begin{aligned} {\mathcal {C}}_{w,1} = \{x\in {\mathbb {R}}^n \mid \Vert x\Vert _{w,1} \le \tau \}, \end{aligned}$$

where \(\Vert x\Vert _{w,1}\,{:}{=}\,\sum _i w_i|x_i|\) with positive weights \(w_i\). The proposed framework also applies naturally to bound- or simplex-constrained problems, but these are outside the scope of this paper. The objective function we consider throughout this section is

$$\begin{aligned} f(x) = {\textstyle {\frac{1}{2}}}\Vert Ax-b\Vert _2^2 + {\textstyle \frac{\mu }{2}}\Vert x\Vert _2^2 + c^Tx, \end{aligned}$$
(17)

which can also be written in the form \({\textstyle {\frac{1}{2}}}\Vert Ax-b\Vert _2^2 + c^Tx\), with A and b appropriately redefined. The benefit of having an objective function of the form (17) is that it permits closed-form expressions for step lengths satisfying certain conditions. In the remainder of this section we discuss practical considerations for the line-search conditions and look at the specific structure and properties of the set \({\mathcal {C}}_{w,1}\).

4.1 Line search

For most objective functions the line search is done by evaluating \(f({\mathcal {P}}(x + \alpha d))\) or \(f(x + \alpha d)\) for a series of \(\alpha \) values until all required conditions, such as Armijo and Wolfe, are satisfied. The objective function in (17) has closed-form solutions for some of the problems arising in the line search, thereby allowing us to simplify the algorithms and improve their performance.

4.1.1 Optimal unconstrained step size

As a start, we look at the step length that minimizes the objective along the ray \(x + \alpha d\):

$$\begin{aligned} \alpha _{\mathrm {opt}} = \mathop {\hbox {arg min}}_{\alpha }\quad f(x + \alpha d). \end{aligned}$$

Differentiating f with respect to \(\alpha \) and equating to zero leads to the following expression:

$$\begin{aligned} \alpha _{\mathrm {opt}} = -\,\frac{(A^Tr + \mu x + c)^Td}{\Vert Ad\Vert _2^2 + \mu \Vert d\Vert _2^2}, \end{aligned}$$

with \(r = Ax-b\). When \(\mu = 0\) and \(c = 0\) this reduces to \(\alpha _{\mathrm {opt}} = -\,r^TAd / \Vert Ad\Vert _2^2\).

4.1.2 Wolfe line search conditions

The maximum step length for which the Armijo condition (7a) is satisfied can be found by expanding the terms and simplifying. Doing so gives the following bound:

$$\begin{aligned} \alpha \le \alpha _{\max } = 2(1-\gamma _1) \alpha _{\mathrm {opt}}. \end{aligned}$$

Likewise, the gradient condition (7b) reduces to

$$\begin{aligned} \alpha \ge \alpha _{\min } = (1-\gamma _2)\alpha _{\mathrm {opt}}. \end{aligned}$$
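These closed-form quantities are direct transcriptions of the formulas above; the following sketch computes the exact minimizer along \(d\) together with the Wolfe-feasible step interval (the default values for \(\gamma _1,\gamma _2\) are hypothetical):

```python
import numpy as np

def step_bounds(A, b, c, mu, x, d, gamma1=1e-4, gamma2=0.9):
    """For f(x) = 0.5||Ax-b||^2 + (mu/2)||x||^2 + c^T x, return the exact
    minimizer along d and the step interval satisfying the Wolfe conditions."""
    r = A @ x - b
    Ad = A @ d
    denom = Ad @ Ad + mu * (d @ d)                    # curvature along d
    a_opt = -((A.T @ r + mu * x + c) @ d) / denom     # exact line minimizer
    a_max = 2.0 * (1.0 - gamma1) * a_opt              # largest step meeting (7a)
    a_min = (1.0 - gamma2) * a_opt                    # smallest step meeting (7b)
    return a_opt, a_min, a_max
```

Because f is quadratic along \(d\), any \(\alpha \in [\alpha _{\min }, \alpha _{\max }]\) satisfies both Wolfe conditions without further function evaluations.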

4.1.3 Projection arc

Line search in gradient-projection methods is often done by backtracking along the segment between x and a single projection, \( p(\alpha ) = x + \alpha ({\mathcal {P}}_{{\mathcal {C}}}(x + d) - x), \) with \(\alpha \in [0,1]\), or by searching over the projection arc \( p(\alpha ) = {\mathcal {P}}_{{\mathcal {C}}}(x + \alpha d), \) with \(\alpha \ge 0\). The trajectory of the first method depends strongly on the scaling of d and is more likely to generate points on the interior of the feasible set. The second method better captures the structure of the domain, but can also be more expensive computationally. In an earlier version of the present work [14] we studied line search based on the entire projection arc (that is, over all values \(\alpha \ge 0\)) along with its properties. Given the high computational complexity of this line search in practice and the lengthy treatment it requires, we omit the details here and refer the interested reader to [14].

4.2 Properties of the weighted one-norm ball

4.2.1 Facial structure

The weighted one-norm ball of radius \(\tau \) is the convex hull of the vertices \(\left\{ \pm \tau /w_i\cdot e_i\right\} _i\). Every proper k-face \({\mathcal {F}}\) of the weighted one-norm ball \({\mathcal {C}}_{w,1}\) can be written as the convex hull of \(\{\tau \sigma _i/w_i\cdot e_i\}_{i\in {\mathcal {I}}}\), where \({\mathcal {I}}\) is a subset of \(\{1,\ldots ,n\}\) with cardinality \(k+1\), and \(\sigma _i \in \{-1,+1\}\). Throughout this section we assume that \(\tau > 0\).

Given an \(x\in {\mathcal {C}}\) we can determine \({\mathcal {F}}(x)\) as follows. First, we need to check whether \(\Vert x\Vert _{w,1} < \tau \), in which case \({\mathcal {F}}(x) = {\mathcal {C}}\). Otherwise, x lies on a proper face, which can be uniquely characterized by the sign vector \({\text {sgn}}(x)\) whose ith entry is given by \({\text {sgn}}(x_i)\). Determining \({\mathcal {F}}(x)\) and checking equality of faces can therefore be done in \({\mathcal {O}}(n)\) time.
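A sketch of this face computation, where a `None` return encodes \({\mathcal {F}}(x) = {\mathcal {C}}\) and the sign tuple otherwise characterizes the proper face (the function name is ours):

```python
import numpy as np

def face_signature(x, w, tau):
    """Identify F(x) for the weighted one-norm ball {x : sum w_i|x_i| <= tau}.
    Returns None for the interior (F(x) = C), else the sign vector of x."""
    if w @ np.abs(x) < tau:
        return None                          # strict inequality: interior
    return tuple(np.sign(x).astype(int))     # boundary: face <-> sign pattern
```

Comparing two signatures for equality is the \({\mathcal {O}}(n)\) face-equality test used by condition (10).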

4.2.2 Projection onto the feasible set

Projection onto the weighted one-norm ball is discussed in [15] and is based on the solution of the prox function

$$\begin{aligned} x_{\lambda }(u) = \mathrm {prox}_{\lambda \Vert \cdot \Vert _{w,1}}(u)\,{:}{=}\,\mathop {\hbox {arg min}}_x\ {\textstyle {\frac{1}{2}}}\Vert x-u\Vert _2^2 + \lambda \Vert x\Vert _{w,1} = {\text {sign}}(u)\cdot \left[ |u| - \lambda w\right] _+, \end{aligned}$$
(18)

where \([\cdot ]_+ = \max (0,\cdot )\), and the absolute value, sign function, and multiplication on the right-hand side are evaluated elementwise. Projection onto \({\mathcal {C}}_{w,1}\) then amounts to finding the smallest \(\lambda \ge 0\) for which \(\Vert x_{\lambda }(u)\Vert _{w,1}\le \tau \). The entries in \(x_{\lambda }(u)\), and therefore \(\Vert x_{\lambda }(u)\Vert _{w,1}\), are continuous and piecewise linear in \(\lambda \), with break points occurring at \(\lambda = |u_i| / w_i\). We can obtain an \({\mathcal {O}}(n\log n)\) algorithm that finds the optimal \(\lambda \) and the subsequent projection by sorting the break points [15]. This can be reduced to an expected \({\mathcal {O}}(n)\) algorithm [16] by avoiding the explicit sorting step.
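A sort-based sketch following the break-point argument above (an expected \({\mathcal {O}}(n)\) variant [16] would avoid the full sort):

```python
import numpy as np

def proj_weighted_l1(u, w, tau):
    """Projection onto {x : sum_i w_i |x_i| <= tau}, w > 0, by sorting the
    break points |u_i| / w_i of lambda -> ||x_lambda(u)||_{w,1}."""
    if w @ np.abs(u) <= tau:
        return u.copy()                          # already feasible
    z = np.abs(u) / w                            # break points in lambda
    order = np.argsort(z)[::-1]                  # descending break points
    wu = (w * np.abs(u))[order]
    ww = (w ** 2)[order]
    a, s2 = np.cumsum(wu), np.cumsum(ww)
    lam_cand = (a - tau) / s2                    # candidate lambda per support size
    k = np.nonzero(z[order] > lam_cand)[0][-1]   # largest consistent support
    lam = lam_cand[k]
    return np.sign(u) * np.maximum(np.abs(u) - lam * w, 0.0)
```

For unit weights this reduces to the familiar projection onto the one-norm ball.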

4.2.3 Self-projection cone of a face

Given \(x \in {\mathcal {C}}_{w,1}\) and search direction \(d \in {\mathbb {R}}^n\), we want to know if \(d \in \mathrm {self\ proj}({{\mathcal {F}}(x)})\). When \(\Vert x\Vert _{w,1} < \tau \) it follows that x lies in the interior of \({\mathcal {C}}_{w,1}\) meaning that \({\mathcal {F}}(x) = {\mathcal {C}}_{w,1}\) and \(d \in \mathrm {self\ proj}({{\mathcal {C}}_{w,1}}) = {\mathbb {R}}^n\), trivially. For the case where \(\Vert x\Vert _{w,1} = \tau \), consider the support \({\mathcal {I}} = \{i \mid x_i \ne 0\}\).

We first show that \({\mathcal {I}}\) is included in the support of \(x(\alpha ) = {\mathcal {P}}(x+\alpha d)\) for all \(\alpha \in [0,{\bar{\alpha }})\), for some \({\bar{\alpha }} > 0\). Since

$$\begin{aligned} \Vert x+\alpha d\Vert _{w,1} \le \Vert x\Vert _{w,1} + \alpha \Vert d\Vert _{w,1} = \tau + \alpha \Vert d\Vert _{w,1}, \end{aligned}$$
(19)

it follows that the projection threshold satisfies \(\lambda (\alpha ) \le \alpha \Vert d\Vert _{w,1}/\min _i\{w_i^2\}\,{=}{:}\,\nu \alpha \). At the same time, a necessary condition for \(i\in {\mathcal {I}}\) to be removed from the support is that \(\lambda (\alpha ) \ge (|x_i| - \alpha |d_i|)/w_i\). Combining these two bounds gives the necessary condition \((\nu + |d_i|/w_i)\alpha \ge |x_i|/w_i\), which allows us to choose

$$\begin{aligned} {\bar{\alpha }} = \min _{i\in {\mathcal {I}}}\Big \{|x_i| / (w_i\nu +|d_i|)\Big \} > 0. \end{aligned}$$

For d to be in the self-projection cone we therefore need to show that (1) \(x+\alpha d\) does not move into the polytope, and (2) the support does not increase. It can be verified that the first condition is satisfied if and only if the directional derivative \((\Vert \cdot \Vert _{w,1})'(x; d)\) of \(\Vert x\Vert _{w,1}\) in direction d is nonnegative:

$$\begin{aligned} \sum _{i\in {\mathcal {I}}} {\text {sign}}(x_i)d_iw_i + \sum _{i\not \in {\mathcal {I}}}|d_i|w_i \ge 0. \end{aligned}$$
(20)

For the second condition to be satisfied we require for all \(i\not \in {\mathcal {I}}\) and sufficiently small \(\alpha \) that the absolute value of the entries remains less than or equal to the threshold value, namely \(\alpha |d_i| \le w_i\lambda (\alpha ) \). When the support remains the same we find \(\lambda (\alpha )\) by solving

$$\begin{aligned} \sum _{i\in {\mathcal {I}}} w_i(|x_i + \alpha d_i| - w_i\lambda (\alpha )) =\tau ,\quad \text{ which } \text{ gives }\quad \lambda (\alpha ) = \alpha \cdot \frac{ \sum _{i\in {\mathcal {I}}} w_i{\text {sign}}(x_i)d_i}{\sum _{i\in {\mathcal {I}}}w_i^2}, \end{aligned}$$

after writing \(|x_i + \alpha d_i| = |x_i| + \alpha {\text {sign}}(x_i)d_i\) and recalling that \(\Vert x\Vert _{w,1} = \tau \). A necessary (and sufficient) condition for the support to remain the same is therefore that

$$\begin{aligned} \max _{i\not \in {\mathcal {I}}} |d_i|/w_i \le \frac{ \sum _{i\in {\mathcal {I}}} w_i{\text {sign}}(x_i)d_i}{\sum _{i\in {\mathcal {I}}}w_i^2}. \end{aligned}$$
(21)

Summarizing the above we have:

Theorem 5

Given \(x\in {\mathcal {C}}_{w,1}\) with support \({\mathcal {I}} = \{i \mid x_i \ne 0\}\), then \(d \in \mathrm {self\ proj}({{\mathcal {F}}(x)})\) if and only if \(\Vert x\Vert _{w,1}<\tau \), or \(\Vert x\Vert _{w,1} = \tau \) and

$$\begin{aligned} \sum _{i\in {\mathcal {I}}} {\text {sign}}(x_i)d_iw_i + \sum _{i\not \in {\mathcal {I}}}|d_i|w_i \ge 0,\quad \mathrm {and}\quad \max _{i\not \in {\mathcal {I}}} |d_i|/w_i \le \frac{ \sum _{i\in {\mathcal {I}}} w_i{\text {sign}}(x_i)d_i}{\sum _{i\in {\mathcal {I}}}w_i^2}. \end{aligned}$$
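The conditions of Theorem 5 are inexpensive to verify; a minimal numpy sketch (the function name and tolerance handling are our own):

```python
import numpy as np

def in_self_projection_cone(x, d, w, tau, tol=1e-12):
    # Test the conditions of Theorem 5 for the face of the weighted
    # one-norm ball {x : ||x||_{w,1} <= tau} containing x.
    x, d, w = (np.asarray(v, dtype=float) for v in (x, d, w))
    if np.sum(w * np.abs(x)) < tau - tol:
        return True                    # x interior: the cone is all of R^n
    I = x != 0                         # support of x
    # (20): directional derivative of ||.||_{w,1} at x along d is nonnegative
    ddir = np.sum(np.sign(x[I]) * d[I] * w[I]) + np.sum(np.abs(d[~I]) * w[~I])
    if ddir < -tol:
        return False
    # (21): the support does not increase
    rate = np.sum(w[I] * np.sign(x[I]) * d[I]) / np.sum(w[I] ** 2)
    if np.any(~I) and np.max(np.abs(d[~I]) / w[~I]) > rate + tol:
        return False
    return True
```

For example, on the unweighted ball with \(\tau = 2\) and \(x = (1,1,0)\), the direction \((1,-1,0)\) moves along the face, while \((0,0,1)\) would enlarge the support after projection.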

4.2.4 Orthogonal basis for a face

For the construction of a quadratic approximation of the objective function f over a face \({\mathcal {F}}\), we require an orthogonal basis \({\varPhi }\) for \(\mathrm {diff\ hull}({{\mathcal {F}}})\). For simplicity, consider the facet of the unit cross polytope lying in the positive orthant in \({\mathbb {R}}^3\). In other words, consider the unit simplex given by \(\mathrm {conv}\{e_1, e_2, e_3\}\). A first vector for the basis can be obtained by normalizing \(e_2 - e_1\). A second vector orthogonal to the first can be obtained by connecting the midpoint of the line segment between \(e_1\) and \(e_2\) to \(e_3\), that is, \(e_3 - (e_1+e_2)/2\), followed again by normalization. Extending this to general k-simplices we find \((k+1)\times k\) bases \(Q = [q_1,q_2,\ldots ,q_k]\) with

$$\begin{aligned} q_j = \left( e_{j+1} - \frac{1}{j}\sum _{i=1}^{j}e_i\right) \Big {/}\sqrt{1 + 1/j}. \end{aligned}$$

In other words

$$\begin{aligned} Q_{i,j} = {\left\{ \begin{array}{ll} -\,\sqrt{1 / (j^2 + j)} &{}\quad i \le j \\ \sqrt{j / (j+1)} &{}\quad i = j+1 \\ 0 &{}\quad \text{ otherwise }. \end{array}\right. } \end{aligned}$$

It can be seen that the above procedure amounts to taking a QR factorization of the \((k+1)\times k\) matrix \([-e,\ I]^T\) of differences between the first vertex of the k-simplex and all others, and discarding the last column in Q, whose entries are all equal to \(1/\sqrt{k+1}\). The special structure of Q allows us to evaluate matrix-vector products with Q and its transpose in \({\mathcal {O}}(k)\) time, without having to form the matrix explicitly. For the general case, let \({\mathcal {F}} = {\mathcal {F}}(x)\). For the case where \({\mathcal {F}} = {\mathcal {C}}\) no projection is needed and we can simply choose \({\varPhi } = I\). Otherwise, let \({\mathcal {I}} = \{i \mid x_i \ne 0\}\) denote the support of x. The desired basis can then be obtained by applying the sign pattern of \(x_{{\mathcal {I}}}\) to Q and embedding the result on the support, thus giving:

$$\begin{aligned} {\varPhi } = I_{{\mathcal {I}}}\cdot \mathrm {diag}(\mathrm {sgn}(x_{{\mathcal {I}}}))\cdot Q. \end{aligned}$$

Matrix-vector products with \({\varPhi }\) can be evaluated in \({\mathcal {O}}(n)\) time, again without forming the matrix.
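For illustration, the \({\mathcal {O}}(k)\) products with Q and \(Q^T\) need only one running sum each; a sketch based on the explicit entries of Q given above (names are our own):

```python
import numpy as np

def simplex_Q_matvec(v):
    # y = Q v for the (k+1) x k simplex basis with entries (1-based)
    # Q[i,j] = -sqrt(1/(j^2+j)) for i <= j and sqrt(j/(j+1)) for i = j+1.
    # Suffix sums give O(k) time without forming Q.
    v = np.asarray(v, dtype=float)
    k = len(v)
    j = np.arange(1.0, k + 1)
    c = v / np.sqrt(j * j + j)                     # v_j / sqrt(j^2 + j)
    suffix = np.append(np.cumsum(c[::-1])[::-1], 0.0)
    y = -suffix                                    # -sum over c_j with j >= i
    y[1:] += np.sqrt(j / (j + 1)) * v              # sub-diagonal entries
    return y

def simplex_Qt_matvec(u):
    # y = Q^T u in O(k), using prefix sums of u.
    u = np.asarray(u, dtype=float)
    k = len(u) - 1
    j = np.arange(1.0, k + 1)
    prefix = np.cumsum(u)[:k]                      # sums of u_i with i <= j
    return np.sqrt(j / (j + 1)) * u[1:] - prefix / np.sqrt(j * j + j)
```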

Generic weighted one-norm ball For the weighted one-norm ball we consider a simplicial face given by \(\mathrm {conv}(w_0e_1,w_1e_2,\ldots ,w_{n}e_{n+1})\), with nonzero weights \(w_0\) to \(w_{n}\). (Throughout this paragraph it is more convenient to work with a weight vector whose elements are the inverses of the weights appearing in the weighted one-norm; the actual vertices of the weighted one-norm ball are \(\pm \, w_i^{-1}e_i\), not \(\pm \,w_i e_i\).) The orthonormal basis corresponding to the face can again be found by applying the standard QR algorithm to the matrix of differences between the vertices, taking advantage of the special structure of the matrix. The initial setup is illustrated in Fig. 2a with \(v_1 = -w_0\). The two operations in this process are projecting the current column out of all subsequent columns and normalizing the columns to unit norm. We do not normalize until the very end, but do keep track of the squared two-norm of the completed columns. Given vectors a and b we obtain the component of b orthogonal to a by evaluating \(b - \frac{\langle a,b\rangle }{\langle a,a\rangle } a\). In the first step of the factorization (we are interested only in Q) we orthogonalize with respect to the first column a. The inner product of each remaining column with a is identical and equal to \(\alpha _1 = \langle v_1,v_1\rangle = \Vert v_1\Vert _2^2\). Using this we also compute the squared two-norm of the first column as \(\gamma _1 = \alpha _1 + w_1^2\). After the sweep with the first column we are left with the matrix shown in Fig. 2b where

$$\begin{aligned} v_2 = \left[ \begin{array}{l} v_1 - \frac{\alpha _1}{\gamma _1}v_1\\ -\frac{\alpha _1}{\gamma _1}w_1 \end{array}\right] = \left[ \begin{array}{l}\frac{\gamma _1 - \alpha _1}{\gamma _1}v_1\\ -\frac{\alpha _1}{\gamma _1}w_1 \end{array}\right] = \left[ \begin{array}{l}\frac{w_1^2}{\gamma _1}v_1\\ -\frac{\alpha _1}{\gamma _1}w_1 \end{array}\right] . \end{aligned}$$
Fig. 2
figure 2

Stages of the orthogonalization process

The next step is to sweep with the updated second column. For this we compute the inner product with the remaining columns and itself, yielding \(\alpha _2 = \Vert v_2\Vert _2^2\) and \(\gamma _2 = \alpha _2 + w_2^2\), respectively. After this sweep we arrive at the matrix given in Fig. 2c. Proceeding like this we successively sweep with each of the columns until we are done. Consider now the sweep with some column k, illustrated in Fig. 2d. Given \(\alpha _k = \Vert v_k\Vert _2^2\) and \(\gamma _k = \alpha _k + w_k^2\) we find

$$\begin{aligned} v_{k+1} = \left[ \begin{array}{l}v_k - \frac{\alpha _k}{\gamma _k}v_k\\ -\frac{\alpha _k}{\gamma _k}w_k \end{array}\right] = \left[ \begin{array}{l}\frac{w_k^2}{\gamma _k}v_k\\ -\frac{\alpha _k}{\gamma _k}w_k \end{array}\right] , \end{aligned}$$

from which we derive recurrence relations \(\gamma _{k+1} = \alpha _{k+1} + w_{k+1}^2\) and

$$\begin{aligned} \alpha _{k+1}= & {} \left( \frac{w_k^2}{\gamma _k}\right) ^{2}\cdot \Vert v_k\Vert _2^2 + \left( \frac{\alpha _k}{\gamma _k}\right) ^{2}\cdot w_k^2 = \frac{w_k^2w_k^2}{\gamma _k^2}\alpha _k + \frac{\alpha _k^2}{\gamma _k^2}w_k^2\nonumber \\= & {} \alpha _kw_k^2 \frac{\alpha _k + w_k^2}{\gamma _k^2} = \frac{\alpha _k w_k^2}{\gamma _k}. \end{aligned}$$
(22)

With \(\alpha _1\) and \(\gamma _1\) as given above, this allows us to compute all \(\alpha \) and \(\gamma \) values. Ultimately we are interested in the final orthonormal Q matrix. Defining scaling factors

$$\begin{aligned} \mu _{i,j} = \prod _{k=i}^{j} \frac{w_k^2}{\gamma _k}\ \ \text{ for }\ \ 1 \le i \le j \le n, \end{aligned}$$
(23)

as well as factors \(u_i = -\alpha _i / \gamma _i\) for \(1 \le i \le n\) and \(u_0\,{:}{=}\,-1\), it can be seen from the structure of the vectors v that

$$\begin{aligned} Q = {\text {diag}}(w)\cdot \left[ \begin{array}{llllll} u_0 &{}\quad \mu _{1,1}u_0 &{}\quad \mu _{1,2}u_0 &{}\quad \mu _{1,3}u_0 &{} \quad \ldots &{} \quad \mu _{1,n-1}u_0\\ 1 &{} \quad u_1 &{} \quad \mu _{2,2}u_1 &{} \quad \mu _{2,3}u_1 &{}\quad \ldots &{} \quad \mu _{2,n-1}u_1\\ &{}\quad 1 &{}\quad u_2 &{}\quad \mu _{3,3}u_2 &{}\quad \ldots &{}\quad \mu _{3,n-1}u_2 \\ &{}\quad &{} \quad \ddots &{} \quad \ddots &{} \quad \ddots &{} \quad \vdots \\ &{} \quad &{} \quad &{} \quad 1 &{} \quad u_{n-2} &{} \quad \mu _{n-1,n-1}u_{n-2}\\ &{}\quad &{} \quad &{} \quad &{} \quad 1 &{} \quad u_{n-1}\\ &{} \quad &{}\quad &{} \quad &{}\quad &{}\quad 1 \end{array}\right] \cdot {\text {diag}}(1/\sqrt{\gamma }). \end{aligned}$$

Multiplication with this matrix and its transpose may still seem expensive but we now show how the structure enables \({\mathcal {O}}(n)\) algorithms for both operations. Multiplication with the diagonal matrices is trivial so we focus only on multiplication with the central part of the matrix. Looking at a small example we can decompose this matrix as

$$\begin{aligned} \left[ \begin{array}{lll} u_0 &{}\quad \mu _{1,1} u_0 &{}\quad \mu _{1,2}u_0 \\ 1 &{}\quad u_1 &{}\quad \mu _{2,2}u_1\\ &{}\quad 1 &{}\quad u_2\\ &{}\quad &{}\quad 1 \end{array}\right] = \left[ \begin{array}{l} 0 \\ I \end{array}\right] + {\text {diag}}(u)\left[ \begin{array}{lll} 1 &{}\quad \mu _{1,1} &{}\quad \mu _{1,2} \\ &{}\quad 1 &{}\quad \mu _{2,2} \\ &{}\quad &{}\quad 1 \\ \\ \end{array}\right] = \left[ \begin{array}{l} 0 \\ I \end{array}\right] + {\text {diag}}(u)\left[ \begin{array}{l}M\\ 0 \end{array}\right] . \end{aligned}$$

The key part is multiplication with the last matrix M. To evaluate \(y = Mv\) we initialize \(y_3 = v_3\) and then work upwards. Direct evaluation gives \(y_2 = v_2 + \mu _{2,2}v_3\), which can be rewritten as \(y_2 = v_2 + \mu _{2,2}y_3\). A pattern emerges when looking at the computation of \(y_1\):

$$\begin{aligned} y_1= & {} v_1 + \mu _{1,1} v_2 + \mu _{1,2}v_3 \\= & {} v_1 + \mu _{1,1}(v_2 + \mu _{2,2}v_3) \\= & {} v_1 + \mu _{1,1}y_2, \end{aligned}$$

where \(\mu _{1,2} = \mu _{1,1}\mu _{2,2}\), or more generally \(\mu _{i,k} = \mu _{i,j}\mu _{j+1,k}\) for \(i \le j < k\), follows from the definition of \(\mu \) in (23). Given \(y_{n} = v_{n}\), we therefore obtain the recurrence \(y_k = v_k + \mu _{k,k}y_{k+1}\), which allows us to evaluate \(y = Mv\) in linear time. With v appropriately redefined we now look at \(y = M^Tv\):

$$\begin{aligned} \left[ \begin{array}{l}y_1\\ y_2 \\ y_3 \end{array}\right] = \left[ \begin{array}{lll} 1 &{}\quad &{}\quad \\ \mu _{1,1} &{}\quad 1 &{}\quad \\ \mu _{1,2} &{}\quad \mu _{2,2} &{}\quad 1 \end{array}\right] \left[ \begin{array}{l}v_1\\ v_2 \\ v_3 \end{array}\right] . \end{aligned}$$

Starting with \(y_1 = v_1\) we find \(y_2 = \mu _{1,1} y_1 + v_2\) and \(y_3 = \mu _{2,2}y_2 + v_3\), using \(\mu _{1,2} = \mu _{1,1}\mu _{2,2}\). This gives the recurrence \(y_{k+1} = \mu _{k,k}y_k + v_{k+1}\). We summarize the initialization and multiplication with Q and \(Q^T\) in Algorithm 3. Note that these algorithms use a different indexing scheme for a convenient implementation. For practical implementations we can precompute and store \(1 / \sqrt{\gamma _k}\) instead of \(\gamma _k\) and avoid storing \(\alpha \) since it is not used during the evaluation of matrix-vector products. Alternatively, we can reduce the memory footprint at the cost of increased computation by storing only \(\alpha \) and re-computing \(\mu _k\), \(u_k\), and \(\gamma _k\) whenever they are needed.

figure e
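Algorithm 3 is not reproduced here, but the two linear-time recurrences at its core can be sketched as follows (`mu` holds the diagonal factors \(\mu _{k,k}\); function names are our own):

```python
import numpy as np

def mv_M(mu, v):
    # y = M v in O(n). M is unit upper triangular with
    # M[i,j] = mu_i * mu_{i+1} * ... * mu_{j-1} (1-based, i < j),
    # so y_k = v_k + mu_k * y_{k+1}, computed bottom-up.
    y = np.array(v, dtype=float)
    for k in range(len(y) - 2, -1, -1):
        y[k] = v[k] + mu[k] * y[k + 1]
    return y

def mv_Mt(mu, v):
    # y = M^T v in O(n), via y_{k+1} = mu_k * y_k + v_{k+1}, top-down.
    y = np.array(v, dtype=float)
    for k in range(len(y) - 1):
        y[k + 1] = mu[k] * y[k] + v[k + 1]
    return y
```

For instance, with diagonal factors \((0.5, 0.25)\) the dense matrix is \(M = \bigl [\begin{smallmatrix} 1 &{} 0.5 &{} 0.125\\ &{} 1 &{} 0.25\\ &{} &{} 1\end{smallmatrix}\bigr ]\), and both recurrences reproduce the dense products.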

4.2.5 Maximum step length along a face

Given a feasible search direction d it is useful to know the maximum \(\alpha \) for which \(x+\alpha d \in {\mathcal {C}}_{w,1}\). When x lies in the interior of \({\mathcal {C}}_{w,1}\) or when (20) is violated and \(x+\alpha d\) moves into the interior, we need to compute the first intersection with the boundary. When x lies on a proper face of \({\mathcal {C}}_{w,1}\) and d moves along the face or onto a higher dimensional face, the maximum step length is determined by the first element to reach zero. More details can be found in [14].

4.3 Stopping criteria

We now look at stopping criteria for optimizing f(x) as defined in (17) over the weighted one-norm ball. A common stopping criterion for problems of this type is the relative norm of the projected gradient:

$$\begin{aligned} \rho (x)\,{:}{=}\,\frac{\Vert {\mathcal {P}}_{{\mathcal {C}}}(x - \nabla f(x)) - x\Vert _2}{\max \{1,\Vert \nabla f(x)\Vert \}}, \end{aligned}$$

which is zero if and only if x is optimal. In addition to this we can look at the relative duality gap, which we define as the difference \(\delta \) between f(x) and any dual feasible objective, divided by \(\max \{1,f(x)\}\). Given that the dual feasible point will typically not be optimal, the relative duality gap merely provides an upper bound on the actual gap. Improving the estimate of the duality gap can cause the iterates to satisfy the stopping criteria earlier and thereby help reduce the runtime.
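As a sketch, the projected-gradient criterion requires only the projector \({\mathcal {P}}_{{\mathcal {C}}}\) and the gradient (the generic helper below is our own):

```python
import numpy as np

def rel_proj_gradient(x, grad, project):
    # rho(x) = ||P(x - grad(x)) - x||_2 / max(1, ||grad(x)||_2),
    # which vanishes exactly at a minimizer of f over C.
    step = project(x - grad) - x
    return np.linalg.norm(step) / max(1.0, np.linalg.norm(grad))
```

For example, for \(f(x) = {\textstyle {\frac{1}{2}}}(x-2)^2\) over the box \([-1,1]\) the minimizer is \(x=1\), where the criterion is zero.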

For the derivation of the dual problem we follow [2, 15] and rewrite the original problem as:

$$\begin{aligned} \displaystyle \mathop {\hbox {minimize}}_{x,r}\quad {\textstyle {\frac{1}{2}}}\Vert r\Vert _2^2 + c^Tx + {\textstyle \frac{\mu }{2}}\Vert x\Vert _2^2\quad {\hbox {subject to}}\quad Ax+r-b = 0,\ \Vert x\Vert _{w,1}\le \tau . \end{aligned}$$

The dual of this problem is given by

$$\begin{aligned} \displaystyle \mathop {\hbox {maximize}}_{y,\lambda }\quad {\mathcal {L}}(y,\lambda )\quad {\hbox {subject to}}\quad \lambda \ge 0, \end{aligned}$$

where the Lagrange dual function \({\mathcal {L}}\) is given by

$$\begin{aligned}&{\mathcal {L}}(y,\lambda )\,{:}{=}\,\inf _{x,r} \Big \{{\textstyle {\frac{1}{2}}}\Vert r\Vert _2^2 + c^Tx + {\textstyle \frac{\mu }{2}}\Vert x\Vert _2^2 - y^T(Ax+r-b) + \lambda (\Vert x\Vert _{w,1}-\tau ) \Big \} \nonumber \\&\quad = y^Tb - \tau \lambda + \inf _{r}\Big \{ {\textstyle {\frac{1}{2}}}\Vert r\Vert _2^2 - y^Tr\Big \} + \inf _{x}\Big \{ (c - A^Ty)^Tx + {\textstyle \frac{\mu }{2}}\Vert x\Vert _2^2 + \lambda \Vert x\Vert _{w,1}\Big \} \nonumber \\&\quad = y^Tb - \tau \lambda -{\textstyle {\frac{1}{2}}}\Vert y\Vert _2^2+ \inf _{x}\left\{ (c - A^Ty)^Tx + {\textstyle \frac{\mu }{2}}\Vert x\Vert _2^2 + \lambda \Vert x\Vert _{w,1}\right\} . \end{aligned}$$
(24)

Here, the infimum over r is solved by equating the gradient to zero, giving \(y=r\) and \(y^Tr = \Vert y\Vert _2^2\). For the infimum over x we consider two cases, based on the value of \(\mu \).

Dual when \(\mu = 0\). Following the derivation given in [15], with minor changes to account for the c term, it can be shown that

$$\begin{aligned} \inf _{x} \{ (c - A^Ty)^Tx + \lambda \Vert x\Vert _{w,1} \} = {\left\{ \begin{array}{ll} 0 &{}\quad \Vert A^Ty - c\Vert _{\frac{1}{w},\infty } \le \lambda \\ -\,\infty &{}\quad \mathrm {otherwise}. \end{array}\right. } \end{aligned}$$

From this we then obtain the dual problem:

$$\begin{aligned} \displaystyle \mathop {\hbox {maximize}}_{y,\ \lambda \ge 0}\quad y^Tb-\tau \lambda -{\textstyle {\frac{1}{2}}}\Vert y\Vert _2^2\quad {\hbox {subject to}}\quad \Vert A^Ty - c\Vert _{\frac{1}{w},\infty } \le \lambda . \end{aligned}$$
(25)

As a dual-feasible point we can choose \(y=r\). For any given y it can be verified that choosing \(\lambda = \Vert A^Ty - c\Vert _{\frac{1}{w},\infty }\) always gives the largest dual objective value. Given x and the corresponding residual \(r = b - Ax\) we therefore obtain the following duality gap:

$$\begin{aligned} \delta = \Vert r\Vert _2^2 + c^Tx - r^Tb + \tau \Vert A^Tr-c\Vert _{\frac{1}{w},\infty } \end{aligned}$$

Dual when \(\mu > 0\). The simplest way of dealing with \(\mu > 0\) is to rewrite the problem as:

$$\begin{aligned} \displaystyle \mathop {\hbox {minimize}}_{x}\quad {\textstyle {\frac{1}{2}}}\Vert {\tilde{A}} x - {\tilde{b}}\Vert _2^2\quad {\hbox {subject to}}\quad \Vert x\Vert _{w,1}\le \tau \end{aligned}$$

with \({\tilde{A}} = [A;\ \sqrt{\mu }I]\), and \({\tilde{b}} = [b;\ 0]\). This reduces the problem to the form where \(\mu =0\) and we can therefore directly use the dual formulation (25). Choosing \(y = {\tilde{r}}\), with \({\tilde{r}} = [r;\ -\sqrt{\mu }x]\), and applying the same derivation as given above, we obtain a dual objective value of

$$\begin{aligned} r^Tb -\tau \lambda -{\textstyle {\frac{1}{2}}}\Vert r\Vert _2^2 -{\textstyle \frac{\mu }{2}}\Vert x\Vert _2^2,\quad \text{ with }\quad \lambda =\Vert A^Tr-\mu x-c\Vert _{\frac{1}{w},\infty }, \end{aligned}$$
(26)

and a corresponding duality gap of

$$\begin{aligned} \delta = \Vert r\Vert _2^2 + c^Tx - r^Tb + \mu \Vert x\Vert _2^2 + \tau \Vert A^Tr - \mu x - c\Vert _{\frac{1}{w},\infty }. \end{aligned}$$
(27)
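For concreteness, the gap (27) can be computed directly from an iterate x (a sketch; recall that \(\Vert v\Vert _{\frac{1}{w},\infty } = \max _i |v_i|/w_i\)):

```python
import numpy as np

def duality_gap(A, b, c, w, x, tau, mu):
    # Duality gap (27) for 1/2||Ax-b||^2 + c'x + mu/2||x||^2
    # subject to ||x||_{w,1} <= tau, using the dual-feasible choice
    # y = r = b - Ax and lambda = ||A'r - mu*x - c||_{1/w,inf}.
    r = b - A @ x
    lam = np.max(np.abs(A.T @ r - mu * x - c) / w)
    return r @ r + c @ x - r @ b + mu * (x @ x) + tau * lam
```

At a solution the gap vanishes; at any other feasible point it is positive.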

Another approach is to solve the original infimum over x in (24) for the case where \(\mu > 0\). For a fixed y and \(\lambda \) we have

$$\begin{aligned}&m(y,\lambda )\,{:}{=}\,\inf _{x}\left\{ (c - A^Ty)^Tx + {\textstyle \frac{\mu }{2}}\Vert x\Vert _2^2 + \lambda \Vert x\Vert _{w,1}\right\} \nonumber \\&\quad = \mu \inf _{x} \left\{ -{\textstyle \frac{1}{\mu }}(A^Ty - c)^Tx + {\textstyle {\frac{1}{2}}}\Vert x\Vert _2^2 + {\textstyle \frac{\lambda }{\mu }}\Vert x\Vert _{w,1}\right\} \end{aligned}$$
(28)

When \(\lambda = 0\) it is easily seen that \(x^* = {\textstyle \frac{1}{\mu }}(A^Ty-c)\), thus giving \(m(y,0) = -{\textstyle \frac{1}{2\mu }}\Vert A^Ty - c\Vert _2^2\). For the more general case where \(\lambda > 0\), we first reformulate (28) as

$$\begin{aligned} m(y,\lambda ) = \mu \inf _{x}\left\{ -v^Tx + {\textstyle {\frac{1}{2}}}{\Vert x\Vert }_2^2 + h(x)\right\} . \end{aligned}$$
(29)

with \(h(x) = \frac{\lambda }{\mu }\Vert x\Vert _{w,1} = \Vert x\Vert _{\frac{\lambda w}{\mu },1}\) and \(v = \frac{1}{\mu }(A^Ty-c)\). For problems of the form (29) we have:

Theorem 6

Let \(h(\cdot )\) be any norm. Then

$$\begin{aligned} \inf _{x}\ -v^Tx + {\textstyle {\frac{1}{2}}}\Vert x\Vert _2^2 + h(x) = -{\textstyle {\frac{1}{2}}}\Vert \mathrm {prox}_{h}(v)\Vert _2^2 \end{aligned}$$

Proof

Note that the objective is coercive and therefore attains its minimum. Completing the square, \(-v^Tx + {\textstyle {\frac{1}{2}}}\Vert x\Vert _2^2 = {\textstyle {\frac{1}{2}}}\Vert x - v\Vert _2^2 - {\textstyle {\frac{1}{2}}}\Vert v\Vert _2^2\), so the minimizer of the objective is

$$\begin{aligned} u = \mathop {\hbox {arg min}}_{x}\ {\textstyle {\frac{1}{2}}}\Vert x-v\Vert _2^2 + h(x) = \mathrm {prox}_h(v). \end{aligned}$$

We then need to show that

$$\begin{aligned} -v^Tu + {\textstyle {\frac{1}{2}}}\Vert u\Vert _2^2 + h(u) = -{\textstyle {\frac{1}{2}}}\Vert u\Vert _2^2 \end{aligned}$$

From the Moreau decomposition [17] we have \(v =\mathrm {prox}_h(v) + \mathrm {prox}_{h^*}(v)\), where \(h^*\) is the conjugate of h. Using \(\mathrm {prox}_{h^*}(v) = v-u\) we have

$$\begin{aligned}&-\,v^Tu + {\textstyle {\frac{1}{2}}}\Vert u\Vert _2^2 + h(u) = -(u + (v-u))^Tu + {\textstyle {\frac{1}{2}}}u^T u + h(u)\\&\quad = -{\textstyle {\frac{1}{2}}}\Vert u\Vert _2^2 - (v-u)^Tu + h(u) \\&\quad = -{\textstyle {\frac{1}{2}}}\Vert \mathrm {prox}_h(v)\Vert _2^2 - \mathrm {prox}_{h^*}(v)^T\mathrm {prox}_{h}(v) + h(\mathrm {prox}_h(v)) \\&\quad {\mathop {=}\limits ^{(a)}} -{\textstyle {\frac{1}{2}}}\Vert \mathrm {prox}_h(v)\Vert _2^2 - h^*(\mathrm {prox}_{h^*}(v)) \\&\quad {\mathop {=}\limits ^{(b)}} -{\textstyle {\frac{1}{2}}}\Vert \mathrm {prox}_h(v)\Vert _2^2, \end{aligned}$$

where (a) follows from [18, Lemma 2.10], and (b) follows from the fact that \(h^*\) is the indicator function for the dual norm of h, which implies that \(\mathrm {prox}_{h^*}(v)\) is an element of the dual norm ball, and \(h^*(\mathrm {prox}_{h^*}(v)) = 0\). \(\square \)
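Theorem 6 is easy to verify numerically for a concrete norm. For \(h = \beta \Vert \cdot \Vert _1\) the proximal operator is soft thresholding; a small sketch (names are our own):

```python
import numpy as np

def prox_l1(v, beta):
    # Proximal operator of h(x) = beta * ||x||_1 (soft thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - beta, 0.0)

def theorem6_sides(v, beta):
    # Left side: objective -v'x + 1/2 ||x||^2 + h(x) at x = prox_h(v).
    # Right side: -1/2 ||prox_h(v)||^2. Theorem 6 states they coincide.
    u = prox_l1(v, beta)
    lhs = -v @ u + 0.5 * (u @ u) + beta * np.sum(np.abs(u))
    rhs = -0.5 * (u @ u)
    return lhs, rhs
```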

Applying Theorem 6 to (29) with proximal operator [see also (18)]

$$\begin{aligned} \mathrm {prox}_h(v) = \mathrm {sign}(v)\big [|v| - {\textstyle \frac{\lambda w}{\mu }}\big ]_+, \end{aligned}$$

we obtain

$$\begin{aligned} m(y,\lambda )= & {} -\frac{\mu }{2} \left\| {\text {sign}}({\textstyle \frac{1}{\mu }}(A^Ty - c))\left[ \big \vert {{\textstyle \frac{1}{\mu }}(A^Ty - c)}\big \vert - \frac{\lambda w}{\mu }\right] _+\right\| _2^2\\= & {} -\frac{1}{2\mu }\left\| \left[ \big \vert A^Ty - c\big \vert - \lambda w\right] _+\right\| _2^2. \end{aligned}$$

The same expression holds for \(\lambda = 0\) and substitution into (24) therefore gives the following dual problem:

$$\begin{aligned} \displaystyle \mathop {\hbox {maximize}}_{y,\ \lambda \ge 0}\quad y^Tb - \tau \lambda -{\textstyle {\frac{1}{2}}}\Vert y\Vert _2^2 -{\textstyle \frac{1}{2\mu }}\left\| \left[ \big \vert A^Ty - c\big \vert - \lambda w\right] _+\right\| _2^2. \end{aligned}$$
(30)

Even when restricting y to the current residual r in the primal formulation, it can be seen that the value of (30) is never smaller than that of (26) and, consequently, that the duality gap never exceeds the value in (27). In particular, by choosing \(\lambda = \Vert A^Ty - \mu x - c\Vert _{\frac{1}{w},\infty }\), it follows that for any index i we have

$$\begin{aligned} \lambda \ge {\textstyle \frac{1}{w_i}}\left| [A^Ty - c]_i - \mu x_i\right| \ge {\textstyle \frac{1}{w_i}}\left| [A^Ty-c]_i\right| - {\textstyle \frac{\mu }{w_i}}|x_i|. \end{aligned}$$

Multiplying both sides by \(w_i\) and rearranging gives

$$\begin{aligned} \left| [A^Ty - c]_i\right| - \lambda w_i\le \mu |x_i|. \end{aligned}$$

Because the right-hand side is always nonnegative, this continues to hold when applying the \([\cdot ]_+\) operator on the left-hand side, and as a result we have

$$\begin{aligned} {\textstyle \frac{1}{2\mu }}\left\| \left[ \big \vert A^Ty - c\big \vert - \lambda w\right] _+\right\| _2^2 \le {\textstyle \frac{\mu }{2}}\Vert x\Vert _2^2, \end{aligned}$$

from which the desired result immediately follows.

Finding a dual-feasible solution It follows from Slater’s condition and strong duality that, at the primal solution (x, r), the optimal variables for the dual (25) are \(y = r\) and, without loss of generality, \(\lambda = \Vert A^Tr-c\Vert _{\frac{1}{w},\infty }\). When r is not optimal, we can still choose \(y=r\) and obtain a dual-feasible solution. For (30) we can also take \(y=r\), but finding \(\lambda \) requires some more work. In general, given any y we want to find a \(\lambda \) that maximizes the objective. Writing \(z = |A^Ty-c|\) and ignoring constant terms, this is equivalent to solving

$$\begin{aligned} \lambda ^*\,{:}{=}\,\mathop {\hbox {arg min}}_{\lambda \ge 0}\quad \tau \lambda + {\textstyle \frac{1}{2\mu }}\Vert [z-\lambda w]_+\Vert _2^2. \end{aligned}$$

With \({\mathcal {I}}(\lambda )\,{:}{=}\,\{i \mid z_i \ge \lambda w_i\}\) we can write the objective as

$$\begin{aligned} f(\lambda ) = \tau \lambda + {\textstyle \frac{1}{2\mu }}\sum _{i\in {\mathcal {I}}(\lambda )}(z_i - \lambda w_i)^2 \end{aligned}$$

Discarding all terms with \(z_i = 0\), this function is piecewise quadratic with breakpoints at \(\lambda _i = z_i / w_i\). We can write the sequence of breakpoints in non-decreasing order as \(\lambda _{[i]}\) for \(i=0,\ldots ,n\), with \(\lambda _{[0]}\,{:}{=}\,0\). The gradient is linear between successive breakpoints and continuously increases from \(f'(0) = \tau - \frac{1}{\mu }\sum _i w_iz_i\) to \(f'(\lambda _{[n]}) = \tau \). In order to find the optimal point \(\lambda ^*\), we consider two cases. In the first case we have \(f'(0) \ge 0\), or equivalently \(\tau \ge \frac{1}{\mu }\sum _i w_iz_i\), which means that the function is non-decreasing and we find \(\lambda ^* = 0\). In the second case we need to find the \(\lambda \) at which the gradient vanishes. This can be done by traversing the breakpoints in increasing order until the first breakpoint is found at which the gradient is nonnegative. The desired solution \(\lambda ^*\) is then found by linear interpolation over the last segment. Including the sorting step this can be done in \({\mathcal {O}}(n\log n)\) time. The problem is very similar to projection onto the one-norm ball, and can also be solved in expected \({\mathcal {O}}(n)\) time using an algorithm similar to that proposed in [16].
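The breakpoint traversal described above can be sketched as follows (an \({\mathcal {O}}(n\log n)\) numpy sketch; names are our own):

```python
import numpy as np

def best_lambda(z, w, tau, mu):
    # Minimize f(lam) = tau*lam + 1/(2*mu) * ||[z - lam*w]_+||_2^2
    # over lam >= 0 by traversing the breakpoints z_i / w_i.
    z = np.asarray(z, dtype=float)
    w = np.asarray(w, dtype=float)
    keep = z > 0                        # discard terms with z_i = 0
    z, w = z[keep], w[keep]
    if tau >= np.sum(w * z) / mu:
        return 0.0                      # f'(0) >= 0: f is nondecreasing
    wz, ww = np.sum(w * z), np.sum(w * w)
    for i in np.argsort(z / w):         # breakpoints in increasing order
        grad = tau - (wz - (z[i] / w[i]) * ww) / mu
        if grad >= 0:                   # gradient vanishes on this segment
            break
        wz -= w[i] * z[i]               # entry i leaves the active set
        ww -= w[i] * w[i]
    return (wz - mu * tau) / ww         # solve tau - (wz - lam*ww)/mu = 0
```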

Primal-dual pairs Using the methods described above, we can compute an upper bound on the duality gap given a feasible x and the corresponding residual r. We can do at least as well, and often better, by maintaining the maximum dual objective found so far, and using this to determine the relative duality gap. This way it is possible for the primal objective at the current iterate x to attain the desired optimality tolerance even when the corresponding dual estimate is far from optimal. Within the root-finding framework this means that we cannot simply use the latest residual r to evaluate the gradient of the Pareto curve. Instead we should maintain the value of \(\lambda \) corresponding to the best dual solution found at any point, and use this as the gradient approximation. In other words, we need to keep track of the best primal and dual variables separately.

4.4 Convergence

In many practical situations, the \(m\times n\) matrix A in (\({{\hbox {LS}}_{\tau }}\)) will have more columns than rows (\(m < n\)). Due to the presence of a null space, it follows that Assumption 3 is not satisfied, and the convergence result in Theorem 4 does not apply as is. For the result to apply it suffices to ensure that condition (12) applies to the restriction \({\hat{f}}\) of f to each of the faces of \({\mathcal {C}}\). A sufficient condition for this is that the columns of A are in general position, such that no subset of m or fewer columns of A is rank deficient. The hybrid algorithm can be updated to check for the self-projection cone and perform quasi-Newton steps only when the number of nonzeros in the current iterate is at most m.

5 Numerical experiments

In this section we evaluate the performance of the hybrid approach on the Lasso problem (\({{\hbox {LS}}_{\tau }}\)) both independently and within the spgl1 root-finding framework [2] described in the introduction. The spgl1 solver can be used both for stand-alone Lasso problems, as well as for basis-pursuit denoise (\({{\hbox {BP}}_{\sigma }}\)) problems. For the hybrid method we are mostly concerned with the performance of the former and we therefore changed spgl1 in two stages. First we modified the stopping criteria used in the Lasso mode, now declaring a solve successful only if the relative duality gap falls below a certain tolerance level. We then added all modifications needed for the implementation of the hybrid approach. To distinguish between the different algorithms, we use the convention that spgl1 refers only to the existing implementation provided by [2]. We refer to the version of spgl1 with the more stringent stopping criteria as the spg method, which is then extended with the techniques described in this paper to obtain the hybrid method. For the hybrid method we use an L-BFGS memory size of eight for all experiments.

When used in the root-finding mode to solve (\({{\hbox {BP}}_{\sigma }}\)), spgl1 uses several different criteria to decide when to update \(\tau \). Each subproblem in spgl1 is considered solved when the relative change in objective is small, and at least one iteration was taken within the current subproblem. The overall problem is declared solved when \(\Vert A^Ty\Vert _{\infty }\), the relative difference between \(\Vert r\Vert _2\) and \(\sigma \), or the relative duality gap is sufficiently small. For the basis-pursuit denoise experiments based on the spg and hybrid algorithms, we use a separate implementation of the root-finding framework in which each Lasso subproblem is fully solved before updating \(\tau \). The differences in stopping criteria, and especially the lack of guarantees on the duality gap for the final subproblem in spgl1, make it difficult to compare the performances directly. We therefore focus predominantly on how the performance of the hybrid method differs from that of the reference spg method.

Fig. 3
figure 3

Comparison of qpOASES with spg and spg-hybrid on random gaussian problems of size \(m\times 2m\). The different curves from bottom to top correspond to increasing \(\tau \) values. In qpOASES the curves overlap

5.1 Lasso problems

5.1.1 Active-set type method

Given their similarity, we first compare the performance of the hybrid method with an active-set type method. For this, we rewrite the Lasso subproblem (\({{\hbox {LS}}_{\tau }}\)) as a quadratic program (QP) by splitting x into its positive and negative components:

$$\begin{aligned} \min _{\bar{x}}\quad {\textstyle {\frac{1}{2}}}\Vert [A,-A]\bar{x} - b\Vert _2^2\quad {\hbox {subject to}}\quad \bar{x}\ge 0,\ e^T\bar{x} \le \tau . \end{aligned}$$
(31)

As a solver we choose qpOASES [19], a parametric active-set method that relies on solving linear systems of equations, and therefore generally works best for small to medium scale problems.
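The variable splitting behind (31) is straightforward: with \(\bar{x} = [x^+;\ x^-]\) and \(x = x^+ - x^-\), we have \([A,\,-A]\bar{x} = Ax\) and \(e^T\bar{x} = \Vert x\Vert _1\) at the complementary splitting. A small sketch of the reformulated problem data (our own helper, not part of qpOASES):

```python
import numpy as np

def lasso_as_qp(A, b):
    # Data for min 1/2 ||[A,-A] xbar - b||^2 s.t. xbar >= 0, e'xbar <= tau:
    # expanding the quadratic gives Hessian H and linear term g.
    Abar = np.hstack([A, -A])
    H = Abar.T @ Abar          # Hessian of the QP objective
    g = -Abar.T @ b            # linear term of the QP objective
    return Abar, H, g

def split(x):
    # xbar = [x_+; x_-], so that x = x_+ - x_- and e'xbar = ||x||_1.
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)])
```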

For the comparison we generate 20 random \(m\times 2m\) matrices A and vectors b with i.i.d. random gaussian elements and columns normalized to unit length. For each problem instance we use spgl1 to solve the basis-pursuit denoise problem with \(\sigma \in \{0.2, 0.1, 0.05, 0.02, 0.01\}\) to get corresponding \(\tau \) values. We then solve the resulting Lasso problems with qpOASES, as well as with spg and the hybrid method, both with relative duality gap tolerance \(10^{-6}\). In preliminary runs it was found that the runtime for qpOASES is insensitive to the problem instance and we therefore ran only the first problem instance for each \(\tau \) value. For the spg and hybrid methods we ran all 20 instances. The (average) runtime for the three methods is plotted against problem size m in Fig. 3. The first thing to notice is that the runtime of qpOASES is insensitive to the value of \(\tau \), whereas the runtime for spg and the hybrid method increases with \(\tau \) (corresponding to smaller \(\sigma \)). As expected, qpOASES performs well on small problems. However, it scales poorly with increasing problem size, and becomes uncompetitive beyond \(m\approx 200\). Using the procedure described in Sect. 4.3 we generated dual-feasible solutions for each of the primal solutions obtained with qpOASES. This showed that the relative duality gap for solutions by qpOASES was around \(10^{-12}\), which is much smaller than the threshold used for spg and the hybrid method. To ensure the comparison is fair we modified qpOASES to keep track of the relative duality gap at each iteration to see if early stopping was possible. This turned out not to be the case, since the one-norm of the iterates was found to substantially exceed \(\tau \) for all but the last several iterations, thereby prohibiting early stopping.

The total runtime over all problem instances using qpOASES amounted to 24,725 s. Even with additional problem instances for \(m=1536\) and \(m=2048\), the totals of the (average) runtimes for spg and the hybrid method were only 1492 s and 1206 s, respectively. The total runtime for the hybrid method is thus 19% less than that of spg.

5.1.2 Comparison with pnopt

The pnopt algorithm, introduced by Lee et al. [20], is a proximal Newton-type method for solving problems of the form

$$\begin{aligned} \min _{x}\quad f(x) + h(x), \end{aligned}$$

where f(x) is convex and continuously differentiable, and h(x) is convex, but not necessarily differentiable. By choosing \(f(x) = {\textstyle {\frac{1}{2}}}\Vert Ax-b\Vert _2^2\) and setting h(x) to the indicator function of the one-norm ball of radius \(\tau \), pnopt can solve the Lasso formulation (\({{\hbox {LS}}_{\tau }}\)). We configure pnopt to use an L-BFGS quadratic model for f(x) with the default memory size of 50 (using a memory size of eight, as used in the hybrid method, gave similar results). Differences in the stopping criteria make it difficult to directly compare the performance of pnopt with spg and the hybrid method. We determine relative duality gap tolerance values and pnopt runtimes using the following steps:

1. We run pnopt on each of the problem instances from Sect. 5.1.1 using solver-specific tolerance values of \(10^{-4}\), \(10^{-5}\), and \(10^{-6}\).

2. Given the solutions for each of the three settings, we determine the relative duality gap, as described in Sect. 4.3, and choose this value as the reference tolerance. The medians of the resulting tolerance values are plotted in Fig. 4a.

  3. 3.

    In order to ensure that pnopt did not reach the stopping criterion much earlier, we created a modified version of pnopt to evaluate the duality gap at each iteration to determine the first iteration at which the reference tolerance is attained. In most cases this number is equal to the number of iterations in step 1, in the remaining cases it is less due to the choice of the threshold value.

  4. 4.

    Finally, we rerun the original pnopt solver with the number of iterations limited to the value found in the previous step. This is the first iteration at which the relative duality gap tolerance is attained.

  5. 5.

    The time reported for each problem instance and setting for pnopt is then chosen as the minimum of the runs in steps 1, 3, and 4. Note that this approach is advantageous to pnopt because it effectively excludes the time needed to evaluate the relative duality gap.
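The choice of h(x) as the indicator function of the one-norm ball means that the key operation it induces, both in pnopt's proximal steps and in the projected-gradient methods compared here, is the Euclidean projection onto the ball of radius \(\tau \). A minimal sketch of the standard sorting-based projection (an illustration only, not the implementation used in the experiments; the function name is ours):

```python
import numpy as np

def project_l1_ball(x, tau):
    """Euclidean projection of x onto the ball {z : ||z||_1 <= tau}."""
    if np.sum(np.abs(x)) <= tau:
        return x.copy()                          # already feasible
    u = np.sort(np.abs(x))[::-1]                 # magnitudes, descending
    css = np.cumsum(u)
    # largest index k such that u[k] > (css[k] - tau) / (k + 1)
    k = np.nonzero(u * np.arange(1, x.size + 1) > css - tau)[0][-1]
    theta = (css[k] - tau) / (k + 1.0)           # soft-threshold level
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)
```

The projection reduces to soft thresholding at a level \(\theta \) chosen so that the result has one-norm exactly \(\tau \).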

For spg and the hybrid method we use the reference tolerance found in step 2 above as the stopping criterion. Figure 4b plots the median runtimes over 10 problem instances for the problem sizes in Sect. 5.1.1 and the relative duality tolerances corresponding to the three original tolerance values (we plot the median rather than the average runtime due to the presence of problem instances that take excessive time to solve using pnopt; for spg and the hybrid method the mean and median are quite similar). Both spg and the hybrid method are much faster than pnopt for all problem sizes. Between the two, the hybrid method is slightly slower than spg, especially on the small problem sizes. This can be attributed to the large tolerance values for the relative duality gap and the low number of iterations needed to reach them, which means that the quasi-Newton method either never activates or activates only several iterations before spg finishes. In this case, the overhead associated with maintaining the support cannot be offset by the benefits of any quasi-Newton steps.

Fig. 4
figure 4

Plots of a the median relative duality gap for ten problem instances solved with pnopt, and b the median runtime over the problem instances using pnopt, spg, and the hybrid method with a relative duality gap threshold corresponding to the settings in plot (a)

5.1.3 Lasso on sparse problems

We now take a closer look at the occurrence of line-search errors in spg and the speed up obtained using the hybrid method. For this we generate test problems following a conventional compressed-sensing scenario and choose A to be a random \(1024\times 2048\) matrix with i.i.d. normal entries and columns normalized to unit norm. We set \(b = Ax_0\) for k-sparse vectors \(x_0\) with random support and non-zero entries generated i.i.d. from (1) the normal distribution; (2) the uniform distribution over \([-\,1,1]\); and (3) the discrete set \(\{-\,1,+1\}\) with equal probability. We set \(\tau = 0.99\cdot \Vert x_0\Vert _1\) and terminate the algorithm whenever the relative duality gap, defined here as \((f(x) - f_{\mathrm {dual}}) / \max \{f(x), 10^{-3}\}\), falls below a given tolerance level. We run 50 instances for each of the three distributions and report in Table 1 the results obtained with an optimality tolerance of \(10^{-6}\). In general we see that the runtime goes up considerably as k increases. Moreover, the results show clear differences in the runtime for the three distributions, with a much higher runtime for problems based on sparse vectors with \(\pm \, 1\) entries. For sparsity levels up to around one hundred the number of iterations in the spg method is relatively small (between 25 and 50). For these problems the hybrid method may complete before or soon after the first quasi-Newton step is taken. The slight overhead of the method, and occasionally a small number of additional iterations, makes the hybrid method somewhat slower on average than the spg method for these problems. For larger values of k, the number of iterations goes up and the effect of the quasi-Newton steps in the hybrid method becomes apparent, with average speed ups between 20 and 30%.
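As an illustration of the stopping criterion, the relative duality gap above can be computed from any primal-feasible iterate using the standard dual-feasible point \(y = b - Ax\) of the Lasso dual (a sketch under that assumption; names are ours):

```python
import numpy as np

def rel_duality_gap(A, b, x, tau):
    """Relative duality gap for min 1/2||Ax-b||^2 s.t. ||x||_1 <= tau,
    with the dual objective evaluated at the residual y = b - Ax."""
    r = b - A @ x
    f_primal = 0.5 * (r @ r)
    # dual objective b'y - 1/2||y||^2 - tau*||A'y||_inf, evaluated at y = r
    f_dual = b @ r - 0.5 * (r @ r) - tau * np.max(np.abs(A.T @ r))
    return (f_primal - f_dual) / max(f_primal, 1e-3)
```

By weak duality the returned value is non-negative for any feasible x, and it vanishes at a solution.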
Aside from reduced runtime we see from Table 1 that the hybrid method also manages to solve many more problems to the desired accuracy level compared to the spg method; the number of solved problems steadily falls to around 9% with increasing k for the spg method, but remains at 100% for all but the largest k for the hybrid method. The median relative duality gap provides further information about the level of accuracy reached before the algorithm completes or terminates with a line-search error. For the largest values of k, the spg method fails to complete with a relative duality gap of even \(10^{-5}\) for at least half of the problems.

Table 1 Comparison between the spg and the hybrid method on exact sparse problems with k non-zero entries

5.2 Root finding

Given that most of the runtimes appearing in Table 1 are of the order of seconds, it is valid to question whether these problems are too idealized and well behaved to give a good idea about the practical performance of the algorithms. In this section we therefore look at two different types of problems. First we introduce a class of random problems that better reflect conditions found in practical problems. Second we evaluate the performance on the Sparco [21] collection of test problems for sparse reconstruction.

5.2.1 Coherent test problem generation

In the compressed-sensing literature it is well known that, with high probability, all sufficiently small subsets of columns of a random Gaussian matrix form a near-orthogonal basis for the subspace they span, a property known as the restricted isometry [22]. Another quantity used to characterize matrices is the mutual coherence, defined as the maximum absolute pairwise cosine distance between the columns. In practical applications the matrix A is often quite coherent [23], and although there are no theoretical results on how this affects the complexity of one-norm minimization, it has been observed empirically that more coherent problems are harder to solve. The construction we propose for generating such problems is a random walk on the \((m-1)\)-sphere with a step size parameterized by \(\gamma \). Starting with a unit-norm column \(a_1\), we construct successive columns by sampling a vector \(v_k\) with i.i.d. Gaussian entries and setting \(a_{k+1} = \alpha _1 a_k + \alpha _2 v_k\), where \(\alpha _1\) and \(\alpha _2\) are chosen such that \(\Vert a_{k+1}\Vert _2 = 1\) and \(\langle a_k, a_{k+1}\rangle = 1-\gamma \). In other words, \(a_{k+1}\) lies on the boundary of a spherical cap with center \(a_k\) and angle \(\theta \) such that \(\cos (\theta ) = 1-\gamma \). The mutual coherence of the resulting matrix is lower bounded by \(1-\gamma \), and an example of the distribution of the pairwise cosine distance between the columns is given in Fig. 5a. An example Gram matrix, plotted in Fig. 5b, shows that aside from the banded structure, there are regions of increased coherence whenever the random walk approaches earlier locations. From Fig. 5c we see that lowering \(\gamma \) while keeping \(a_1\) and \(v_k\) fixed leads to an increase of the top singular value \(\sigma _1\) as the columns become more and more similar.
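The random-walk construction can be sketched as follows, assuming the coefficients \(\alpha _1, \alpha _2\) are realized by first orthogonalizing the Gaussian direction against the current column (an illustration only; the function name is ours):

```python
import numpy as np

def coherent_matrix(m, n, gamma, rng=None):
    """Columns a_1..a_n on the unit (m-1)-sphere with <a_k, a_{k+1}> = 1 - gamma.

    Each column is a fixed-angle step from the previous one:
    a_{k+1} = (1-gamma)*a_k + sqrt(1-(1-gamma)^2)*u_k, with u_k a random
    unit vector orthogonal to a_k, so ||a_{k+1}||_2 = 1 holds exactly.
    """
    rng = np.random.default_rng(rng)
    A = np.empty((m, n))
    a = rng.standard_normal(m)
    a /= np.linalg.norm(a)
    A[:, 0] = a
    c = 1.0 - gamma
    s = np.sqrt(1.0 - c * c)
    for k in range(1, n):
        v = rng.standard_normal(m)
        v -= (v @ a) * a          # orthogonalize against the current column
        v /= np.linalg.norm(v)
        a = c * a + s * v         # unit norm by construction
        A[:, k] = a
    return A
```

Successive columns then have inner product exactly \(1-\gamma \), while non-adjacent columns drift apart at a rate governed by \(\gamma \).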
Figure 5d illustrates that the maximum pairwise coherence \(\mu \) does not necessarily have a relationship with the top singular value.

5.2.2 Highly coherent measurement matrices

We apply spg and the hybrid method to solve (\({{\hbox {BP}}_{\sigma }}\)) using the root-finding framework explained in Sect. 1. Each Lasso subproblem (\({{\hbox {LS}}_{\tau }}\)) is optimized to a certain optimality tolerance, and the overall problem is considered solved whenever the relative misfit \(|\sigma - \Vert r\Vert _2| / \max (\sigma , 10^{-3})\) falls below \(10^{-5}\). For completeness we also compare the performance with the spgl1 algorithm as provided by [2].

Fig. 5
figure 5

a Distribution of pairwise mutual coherence between vectors of two types of \(64\times 2048\) matrices with unit-normalized columns generated as: (i) random vectors with i.i.d. normal entries, and (ii) a random walk over the sphere with the mutual coherence between successive columns equal to \(1-\gamma \); b Gram matrix of a \(200\times 2000\) matrix generated with \(\gamma = 0.01\); c the top singular value of a \(200\times 500\) matrix generated with the same \(a_1\) and \(v_k\) for different values of \(\gamma \); and d the mutual coherence and top singular value for 1000 random Gaussian \(200\times 500\) matrices with columns scaled to unit norm

For the first set of experiments we use the highly coherent matrices described in Sect. 5.2.1. As before we create a k-sparse vector \(x_0\) with non-zero entries sampled from different distributions, and set \(b = Ax_0 + v\), where the entries in v are zero in the noiseless case, and sampled i.i.d. from the normal distribution and scaled to the desired noise level otherwise. For the noiseless results in Table 2a–c we run ten instances for each of the three distributions and report the average runtime over all thirty runs. The percentage reduction in runtime is computed based on the total runtime and was found to closely match the percentage obtained for each of the three signal classes independently. For the root-finding columns we solve (\({{\hbox {BP}}_{\sigma }}\)) with \(\sigma = 0.01\Vert b\Vert _2\) and optimality tolerance levels of \(10^{-4}\) and \(10^{-6}\). For the Lasso columns we solve (\({{\hbox {LS}}_{\tau }}\)) on equivalent problems with \(\tau \) set to the value obtained using the root-finding procedure. The results in Table 2d–f apply to noisy problems where \(\Vert v\Vert _2\) is scaled to the given percentage of \(\Vert Ax_0\Vert _2\), and \(\sigma \) is set accordingly. For these experiments we only consider sparse \(x_0\) with random \(\pm \,1\) entries. The percentage decrease in runtime relative to spg, for spgl1 and the hybrid method, for all problem instances in Table 2 is given in Fig. 6. A summary of the total runtime for the different solvers, along with the percentage of solutions with relative duality gap within given ranges, is provided by Table 3.

Table 2 Comparison between spgl1 and root finding with strict tolerance levels using the spg and hybrid method
Fig. 6
figure 6

Percentage runtime reduction relative to spg for solvers spgl1 and the hybrid method on a set of highly coherent problems. The hybrid method is run on both the root-finding and lasso problems with two optimality tolerance levels each

Table 3 Total runtime for the coherent problems with different methods and optimality tolerances, along with the percentage of instances that attain a relative duality gap in the given intervals at the final iteration

The first thing to note from the results in Table 2 is that the problems generated with lower values of \(\gamma \) are indeed more difficult to solve for both spg and the hybrid method. Compared to the spg method, the hybrid method reduces the average runtime for nearly all problems, and does so by a percentage that increases as the problems get harder, a trend clearly seen in Fig. 6. From Table 3 we see that the hybrid method with optimality tolerances of \(10^{-4}\) and \(10^{-6}\) reduces the total runtime respectively by 34% and 43% for the basis-pursuit problems, and 20% and 24% for the Lasso problems. The larger relative reduction in runtime for basis pursuit is due to the use of warm starting in the root-finding procedure, which removes a substantial number of iterations that would otherwise be identical for the hybrid and spg methods. Despite the improvements, the hybrid method still has a larger runtime than spgl1 on most problems. However, from Table 3 we see that spgl1 does not even reach a relative duality gap of \(10^{-3}\) for nearly 80% of the problems, as a result of the relaxed stopping criteria. Tightening these criteria, as done in what we label the spg method, increases the number of solutions that attain the desired optimality tolerance. Nevertheless, the spg method still fails to reach an optimality of \(10^{-6}\) for some 60% of the problems. Finally, we see that the hybrid method not only improves the runtime of the spg method, but also manages to reach the requested optimality on all problems from Table 2.

5.2.3 Sparco test problems

Sparco [21] provides a standard collection of test problems for compressed sensing and sparse recovery. The problems in Sparco are of the form \(b = Ax+v\), where A is represented as a linear operator rather than an explicit matrix. After excluding problems that are too easy to solve or require access to third-party software, we obtain the problem selection listed in Table 4. For some problems we scale the original b to avoid a very small objective value at the solution, which would otherwise cause the stopping criterion on the duality gap relative to \(\max (f(x),1)\) to be satisfied too easily. The table also lists the one-norm of the solutions found when solving with \(\sigma = 0.01\Vert b\Vert _2\) and \(\sigma = 0.001\Vert b\Vert _2\), respectively, for the scaled b.

Table 4 Selected Sparco problems
Table 5 Sparco problems with \(\sigma =0.01\Vert b\Vert _2\), (a) runtime in seconds, (b) relative duality gap, (c) number of root-finding iterations

We run the spg and hybrid methods with optimality tolerances ranging from \(10^{-2}\) down to \(10^{-4}\). Beyond that, some of the problems simply took too long to finish. For spgl1 we use optimality tolerance values set to \(10^{-6}\) and \(10^{-9}\). By comparison these may seem excessively small, and we certainly do not expect the relative duality gap to reach these levels. Instead, we choose the small values to help control the other stopping criteria, such as the relative change in the objective value, which are parameterized using the same tolerance parameter. The results of the experiments with the two choices of \(\sigma \), are summarized in Tables 5 and 6. The hybrid method reduces the runtime of the spg method in 42 out of the 56 settings, often considerably so. For a tolerance level of \(10^{-4}\) the hybrid method consistently outperforms the spg method with an average time reduction of 38%. The required optimality level is reached on all problems except for problem 903 with the smaller \(\sigma \) and optimality tolerance \(10^{-4}\). For this problem the spg method stops with a relative duality gap of \(2\times 10^{-4}\) following a line-search error. The runtime for spgl1 with optimality tolerance \(10^{-6}\) is very low overall, but comes at the cost of a rather large relative duality gap at the solution. Lowering the tolerance to \(10^{-9}\) reduces the gap, but also leads to a considerable increase in runtime. In either case the number of root-finding iterations can be very large, especially if the target value of \(\tau \) is exceeded and gradual reduction follows. The lowest relative duality gap reached by spgl1 over all problems in Tables 5 and 6 is \(4\times 10^{-3}\). The varying optimality levels make it difficult to compare results, so of special interest are problem instances where spgl1 simultaneously has a lower runtime and relative duality gap with either the spg or hybrid method, or vice versa. 
From the tables we see that spgl1 outperforms the spg method on both instances of problem 702. For problem 401 in Table 5, spgl1 with an optimality tolerance of \(10^{-6}\) is also better; aside from these problems, spgl1 consistently has the lowest runtime but also the largest duality gap. The spg method with the more stringent root-finding tolerances dominates spgl1 with a tolerance level of \(10^{-9}\) on all remaining problems aside from the instance of problem 701 in Table 5. As we saw earlier, the hybrid method performs especially well when the desired relative duality gap is small. Nevertheless, even for large duality gaps it still dominates spgl1 on nine out of the fourteen problem instances and is dominated on only one.

5.2.4 Primal-dual gap

We now consider the formulation

$$\begin{aligned} \displaystyle \mathop {\hbox {minimize}}_{x}\quad {\textstyle {\frac{1}{2}}}\Vert Ax-b\Vert _2^2 + {\textstyle \frac{\mu }{2}}\Vert x\Vert _2^2 \quad {\hbox {subject to}}\quad x\in {\mathcal {C}}, \end{aligned}$$
(32)

for \(\mu > 0\). In Sect. 4.3 we described two different ways of deriving a dual formulation. In the first approach we augment A and b to account for the \({\textstyle \frac{\mu }{2}}\Vert x\Vert _2^2\) term and reduce the problem to the standard Lasso formulation. The derivation of the dual for this formulation in [2, 15] provides a way of generating a dual-feasible point \((\bar{y},{\bar{\lambda }})\) from a primal-feasible x by choosing \(\bar{y} = \bar{A}x - \bar{b}\) and solving a trivial optimization problem for \({\bar{\lambda }}\). In the second approach we deal with formulation (32) directly and obtain a dual problem parameterized in \((y,\lambda )\). As before we can choose y to be equal to the residual, now in terms of the original A and b, and are left with a non-trivial optimization problem for \(\lambda \) that is nevertheless easily solved using the algorithm described in Sect. 4.3. We refer to the two derivations as the augmented derivation and the optimized derivation. The term ‘optimized’ refers to the need to solve for \(\lambda \), but more importantly, to the fact that the dual objective generated from any x using the optimized derivation is never smaller than that using the augmented derivation, as shown in Sect. 4.3.
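The augmented derivation rests on the standard stacking identity, which a small sketch can make concrete (our own illustration; names are ours):

```python
import numpy as np

def augment(A, b, mu):
    """Stack A and b so that for every x
    1/2||A_bar x - b_bar||^2 = 1/2||Ax - b||^2 + mu/2 ||x||_2^2."""
    n = A.shape[1]
    A_bar = np.vstack([A, np.sqrt(mu) * np.eye(n)])   # append sqrt(mu) * I
    b_bar = np.concatenate([b, np.zeros(n)])          # append zeros
    return A_bar, b_bar
```

With this substitution, any solver for the standard Lasso objective handles (32) unchanged; only the dual-point construction differs between the two derivations.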

Fig. 7
figure 7

Plots a, b show the time required to reach the desired relative optimality gap of \(10^{-4}\) using the augmented formulation, and the corresponding speed up obtained using the optimized formulation; each point indicates a single problem instance. Plots c, d show the relative distance of the primal (gray) and dual objectives (red for the augmented formulation and blue for the optimized formulation) to the optimal objective as a function of iterations for the two circled problem instances (color figure online)

Table 6 Sparco problems with \(\sigma = 0.001\Vert b\Vert _2\), (a) runtime in seconds, (b) relative duality gap, (c) number of root-finding iterations

To evaluate the practical difference between the two approaches we generate a large number of randomized test problems of the form \(b = Ax_0 + v\), where \(x_0\) are random vectors with sparsity levels ranging from 50 to 350 in steps of 50 and on-support entries drawn i.i.d. from the normal distribution. The \(m\times n\) measurement matrices A have entries drawn i.i.d. from \({\mathcal {N}}(0,1/m)\), with \(m=1000\) and \(n=2000\). Finally, the additive noise vectors v are drawn from the normal distribution and scaled to have Euclidean norm equal to 0, 0.01, 0.1, and 1% of that of the clean observation \(Ax_0\). For the problem formulation we set \(\tau \) to \(\Vert x_0\Vert _1\) scaled by 0.7–1.2 in 0.1 increments, and vary \(\mu \) log-linearly from \(10^{-1}\) to \(10^{-4}\) in four steps. As a result of the additive \({\textstyle \frac{\mu }{2}}\Vert x\Vert _2^2\) term in the objective, the solutions are no longer sparse. Consequently, the hybrid method tends to coincide with the spg method, and we therefore only consider the latter for these experiments.
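The instance generation described above might be sketched as follows (an illustrative reconstruction; the function name and interface are ours):

```python
import numpy as np

def make_instance(m, n, k, noise_frac, rng=None):
    """Sparse recovery instance b = A x0 + v with ||v||_2 = noise_frac * ||A x0||_2."""
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((m, n)) / np.sqrt(m)     # entries ~ N(0, 1/m)
    x0 = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)   # random support
    x0[support] = rng.standard_normal(k)             # on-support entries ~ N(0, 1)
    clean = A @ x0
    v = rng.standard_normal(m)
    v *= noise_frac * np.linalg.norm(clean) / np.linalg.norm(v)
    return A, x0, clean + v
```

For each instance, \(\tau \) and \(\mu \) would then be swept over the grids given above.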

Fig. 8
figure 8

Relative distance of primal (solid) and dual iterates (dashed) to the optimum as a function of time using the spg (red) and hybrid method (blue) on four Sparco problems for fixed \(\tau \) corresponding to the root for \(\sigma \) equal to the given multiple of \(\Vert b\Vert _2\) (color figure online)

For each of the settings we evaluate the time required by the augmented and optimized formulations to reach a relative duality gap of \(10^{-4}\). Figure 7a, b plots the speed up obtained using the optimized formulation along with the runtime of the augmented formulation for different problem instances with two levels of \(\mu \). Despite the slightly more expensive evaluation of the dual, we see that the optimized formulation is around 1.5–4 times faster for \(\mu =0.01\), and up to 7 times faster for \(\mu =0.001\). For \(\mu =0.1\) (not shown in the plot) the speed up ranges from 1.2 to 3, and for \(\mu =10^{-4}\) the speed up exceeds 10 on many problem instances and reaches a maximum of around 30.

We now take a closer look at the relative distance of the primal and dual objective to the optimum for the two circled problems in Fig. 7c,d. The progress of the primal objective over the iterations, indicated by the gray line, is the same for both formulations. For the dual objective there is a marked difference between the two. Notably, the augmented formulation converges much slower than the optimized formulation, thereby preventing the stopping criterion from being satisfied for many more iterations.

We now take another look at the Sparco problems from Tables 5 and 6. For each of the settings we record the optimal \(\tau \) and then run the hybrid solver with a target optimality tolerance of \(10^{-8}\) to obtain a best-effort optimum (for some problems the line search failed before reaching the desired tolerance). We then run the spg and hybrid solvers with a target accuracy of \(10^{-5}\) and record the relative distance of the primal and dual objective to the optimum at every iteration. The results for four representative problems are plotted in Fig. 8. From the plots we see that the iterates of the hybrid method initially coincide or otherwise closely follow those of the spg method. Once the hybrid method starts using quasi-Newton iterates increasingly often we see a sharp decrease in the relative distance to the optimum of the primal and dual iterates. The iterates of the spg method, by contrast, continue to decrease very slowly. Indeed, of the fourteen problem settings, the spg method managed to solve only two to the desired level of accuracy. Of the remaining problems, two reach the default iteration limit of ten times the number of rows in A, while all other problems fail with a line-search error. The hybrid method manages to solve all problems except for problem 401 with multiplier \(10^{-3}\). This problem reached the iteration limit, but could otherwise be solved successfully to a tolerance level of even \(10^{-8}\).

As before, we see that the dual objective converges to the optimum much more slowly than the primal, and unfortunately there is no clear way to extend the optimized dual formulation from Sect. 4.3 to the standard Lasso formulation where \(\mu =0\). Given that satisfaction of the optimality condition depends almost entirely on the dual objective value, it makes sense to look at the speed up that would be attained if the optimal objective value were known and satisfaction of the optimality condition depended only on the primal objective. In Table 7 we provide the attainable speed up for the different Sparco problems with varying optimality tolerance levels. Clearly, both the spg and hybrid methods would benefit greatly from an improved dual, although the effect is smaller for the hybrid method, due to the already fast convergence of the dual objective in its final iterations.

Table 7 Speed up for different tolerance levels when the optimal objective value is given and satisfaction of the optimality condition depends only on the primal objective value

6 Conclusions

In this paper we have presented a hybrid algorithm for the minimization of quadratic functions over weighted one-norm balls. The method extends the spectral projected gradient method with L-BFGS iterations applied to reparameterizations of the objective function over active faces of the one-norm ball. To decide which type of iteration to take, we introduce the self-projection cone of a face and provide a complete characterization of this cone for weighted one-norm balls. The reparameterization uses an implicit orthonormal basis for the current face, and we provide an efficient algorithm for matrix-vector multiplication with this basis and its transpose.

As part of the numerical experiments we propose a challenging class of test problems in which the columns of the \(m\times n\) measurement matrix A are generated based on a random walk over the \((m-1)\)-sphere. Based on extensive numerical experiments on these and other test problems we showed that the hybrid method outperforms the original spectral projected gradient method on a large fraction of the problems. Especially for medium- to high-accuracy solves and more challenging problems, the spg method was found to either take much more time to reach the desired level of accuracy, or to fail prematurely due to line-search errors. Both methods performed favorably against the parametric active-set solver qpOASES [19] and the proximal Newton-type solver pnopt [20].

The current stopping criterion used in both spg and the hybrid method relies on the generation of a dual feasible point from the primal iterate to determine the relative optimality of the iterate. From the experiments we found that the primal objective converges to the optimum much faster than the dual objective, and that satisfaction of the stopping criterion therefore depends entirely on the dual objective reaching the critical threshold. The performance of both methods could therefore be improved substantially given a better dual estimate.

In this paper we have studied the application of the hybrid method to the Lasso problem. Other important problems that may benefit from the approach but were not discussed in this paper include box-constrained optimization and minimization of quadratic functions over the simplex.