Neurocomputing, Volume 364, 28 October 2019, Pages 280–296

Faster constrained linear regression via two-step preconditioning

https://doi.org/10.1016/j.neucom.2019.07.070

Abstract

In this paper, we study the large scale constrained linear regression problem and propose a two-step preconditioning method, which is based on recent developments in random projection, sketching techniques and convex optimization. Combining the method with (accelerated) mini-batch SGD, we achieve an approximate solution with a time complexity lower than that of the state-of-the-art techniques for the low precision case. Our idea also extends to the high precision case, where it gives an alternative implementation of the Iterative Hessian Sketch (IHS) method with significantly improved time complexity. Experiments on benchmark and synthetic datasets suggest that our methods indeed outperform existing ones considerably in both the low and high precision cases.

Introduction

Linear regression with convex constraints is a fundamental problem in Machine Learning, Statistics and Signal Processing, since many other problems, such as SVM, LASSO and signal recovery [1], can all be formulated as constrained linear regression problems. Thus, the problem has received a great deal of attention from both the Machine Learning and Theoretical Computer Science communities. The problem can be formally defined as follows:
$$\min_{x\in \mathcal{W}} f(x)=\|Ax-b\|_2^2,$$
where $A$ is a matrix in $\mathbb{R}^{n\times d}$ with $d < n < e^d$, $\mathcal{W}$ is a closed convex set and $b\in\mathbb{R}^n$ is the response vector. The goal is to find an $x'\in\mathcal{W}$ such that $f(x')\leq(1+\epsilon)\min_{x\in\mathcal{W}}f(x)$ or $f(x')-\min_{x\in\mathcal{W}}f(x)\leq\epsilon$, where $\epsilon$ is the approximation error.
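To make the problem setup concrete, the following minimal sketch (ours, purely for illustration and not one of the algorithms proposed in this paper) solves a constrained least-squares instance by projected gradient descent, using the nonnegative orthant as the example constraint set $\mathcal{W}$; the function name and parameter choices are hypothetical.

```python
import numpy as np

def projected_gradient_ls(A, b, project, step=None, iters=500):
    """Minimize ||Ax - b||_2^2 over a convex set W given its projection operator."""
    n, d = A.shape
    x = project(np.zeros(d))
    if step is None:
        # 1/L step size, where L = 2 * sigma_max(A)^2 is the gradient's Lipschitz constant
        step = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)
    for _ in range(iters):
        grad = 2 * A.T @ (A @ x - b)          # gradient of ||Ax - b||_2^2
        x = project(x - step * grad)          # gradient step followed by projection onto W
    return x

# Example: W is the nonnegative orthant, whose projection is coordinate-wise clipping.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))
b = rng.standard_normal(1000)
x_hat = projected_gradient_ls(A, b, project=lambda v: np.maximum(v, 0.0))
```

Each full-gradient iteration costs $O(nd)$ time, which is exactly what the stochastic and sketching-based methods discussed next aim to reduce.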

Roughly speaking, there are two types of methods for solving the problem. The first type is based on Stochastic Gradient Descent (SGD). Recent developments in first-order stochastic methods, such as Stochastic Dual Coordinate Ascent (SDCA) [2], Stochastic Variance Reduced Gradient (SVRG) [3] and Katyusha [4], have significantly improved the convergence speed of large scale optimization problems in both theory and practice [5], which provides the potential for obtaining faster solutions to our problem.

The second type of techniques is based on randomized linear algebra. Among them, random projection and sketching are commonly used theoretical tools that serve as preconditioners, dimension reduction or sampling techniques to reduce the time complexity of many optimization problems, including low rank approximation [6], SVM [7], column subset selection [8] and $\ell_p$ regression (for p ∈ [1, 2]) [9], [10]. Thus, it is very tempting to combine these two types of techniques to develop faster methods with theoretical or statistical guarantees for a broader class of constrained optimization problems. Recently, quite a number of works have successfully combined them. For example, [11] proposed faster methods for Ridge Regression and Empirical Risk Minimization by using SVRG, Stochastic Gradient Descent (SGD) and low rank approximation, and [12] obtained guarantees for Empirical Risk Minimization by using random projection on the dual problem.

In this paper, we revisit preconditioning methods for solving the large-scale constrained linear regression problem, and propose a new method called two-step preconditioning. Combining this method with some recent developments on large scale convex optimization, we are able to achieve faster algorithms for both the low ($\epsilon \approx 10^{-1}$–$10^{-4}$) and high ($\epsilon \approx 10^{-10}$) precision cases. Specifically, our main contributions can be summarized as follows.

  • 1.

    For the low precision case, we first propose a novel algorithm called HDpwBatchSGD (i.e., Algorithm 2) by combining the two-step preconditioning method with mini-batch SGD. Mini-batch SGD is a popular way of improving the efficiency of SGD: it uses several samples, instead of one, in each iteration and runs the gradient descent update on all of these samples simultaneously (a minimal mini-batch SGD sketch is given after this list). Ideally, one would hope for a factor of r speed-up in convergence when using a batch of size r. However, this is not always possible in the general case; in fact, in some cases there is no speed-up at all when a large batch is used [19], [20], [21]. A unique feature of our method is its optimal speed-up with respect to the batch size, i.e., the iteration complexity decreases by a factor of b if the batch size is increased by a factor of b.

    We also use the two-step preconditioning method and the Multi-epoch Stochastic Accelerated mini-batch SGD of [22] to obtain a slightly different algorithm called HDpwBatchAccSGD (i.e., Algorithms 3 and 4), whose time complexity is lower than that of the state-of-the-art technique [14] and of HDpwBatchSGD.

  • 2.

    The optimal speed-up of HDpwBatchSGD and HDpwBatchAccSGD further inspires us to consider how the method performs when the whole gradient is used, i.e., projected Gradient Descent, which leads to another algorithm called pwGradient (i.e., Algorithm 6). We find that it actually allows an alternative implementation of the Iterative Hessian Sketch (IHS) method [17], which is arguably the state-of-the-art technique for the high precision case. In particular, we are able to show that a single sketch is sufficient for IHS, instead of the sequence of sketches used in its current form. This enables us to considerably improve the time complexity of IHS.

  • 3.

    Finally, we implement our algorithms and test them on both large synthetic and real benchmark datasets. The numerical results confirm our theoretical analysis of HDpwBatchSGD, HDpwBatchAccSGD and pwGradient, and show that our methods outperform existing ones in both the low and high precision cases.
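As referenced in contribution 1 above, the following is a minimal sketch of projected mini-batch SGD for the constrained least-squares objective. It is not Algorithm 2 (in particular, it omits the two-step preconditioning); the function name, sampling scheme and fixed step size are our own illustrative choices.

```python
import numpy as np

def minibatch_sgd_ls(A, b, project, batch_size=64, epochs=10, step=1e-3, seed=0):
    """Projected mini-batch SGD for min_{x in W} ||Ax - b||_2^2.

    Each iteration samples `batch_size` rows, averages their gradients
    (an unbiased estimate of the full gradient up to a constant factor),
    and projects the iterate back onto W.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = project(np.zeros(d))
    for _ in range(epochs * (n // batch_size)):
        idx = rng.choice(n, size=batch_size, replace=False)
        A_batch, b_batch = A[idx], b[idx]
        grad = 2 * A_batch.T @ (A_batch @ x - b_batch) / batch_size
        x = project(x - step * grad)
    return x
```

Increasing `batch_size` reduces the variance of the gradient estimate; the point of contribution 1 is that, after the two-step preconditioning, the iteration complexity provably shrinks by the same factor.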

This paper is a substantially extended version of our previous work that appeared in AAAI’18 [23]. The main additions are as follows. Firstly, we add the detailed algorithms HDpwBatchAccSGD and pwSVRG, which were not discussed in [23] (see Algorithms 3, 4 and 7). Secondly, we provide the proofs of all theorems and lemmas. Thirdly, we expand the previous work by validating our results on additional synthetic and real world datasets. More specifically, for the low precision case, we add experimental results for HDpwBatchAccSGD and show its superiority over other existing methods and HDpwBatchSGD. For the high precision case, we provide comparisons on more large scale real datasets and show that our method is faster. We also conduct experimental studies of the relative error under different sketch sizes.

The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 gives background on random projection and stochastic gradient descent. Section 4 describes our proposed algorithms for the low precision case, and Section 5 presents our algorithm for the high precision case. Finally, we experimentally evaluate our methods in Section 6 and conclude the paper in Section 7.


Related work

There is a vast number of papers studying the large scale constrained linear regression problem from different perspectives, such as [24], [25]. We mainly focus on results that have theoretical time complexity guarantees (note that the time complexity should not depend on the condition number of A, unlike, e.g., [26]), due to their similar nature to ours. We summarize these methods in Table 1.

For the low precision case, [13] directly uses sketching with a very large sketch size of $\mathrm{poly}(\frac{1}{\epsilon^2})$, which

Preliminaries

Let $A$ be a matrix in $\mathbb{R}^{n\times d}$ with $e^d > n > d$ and $d=\mathrm{rank}(A)$ (note that our proposed methods can be easily extended to the case of $d > \mathrm{rank}(A)$), and let $A_i$ and $A^j$ be its $i$-th row (i.e., $A_i\in\mathbb{R}^{1\times d}$) and $j$-th column, respectively. Let $\|A\|_2$ and $\|A\|_F$ be the spectral norm and Frobenius norm of $A$, respectively, and $\sigma_{\min}(A)$ be the minimal singular value of $A$.
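For concreteness, these quantities correspond to standard NumPy calls; the snippet below is ours and only illustrates the notation.

```python
import numpy as np

A = np.random.default_rng(0).standard_normal((500, 30))

spectral_norm = np.linalg.norm(A, 2)                   # ||A||_2, the largest singular value
frobenius_norm = np.linalg.norm(A, 'fro')              # ||A||_F
sigma_min = np.linalg.svd(A, compute_uv=False).min()   # sigma_min(A), the smallest singular value
```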

We first give the formal definition of the problem to be studied throughout the paper.

Large Scale Constrained Linear Regression. Let $A\in\mathbb{R}^{n\times d}$ be a dataset of

Main idea

The main idea of our algorithm is to use two steps of preconditioning to reformulate the problem in the following way:
$$\min_{x\in\mathcal{W}}\|Ax-b\|_2^2=\min_{y\in\mathcal{W}'}\|Uy-b\|_2^2=\min_{y\in\mathcal{W}'}\|HDUy-HDb\|_2^2=\min_{y\in\mathcal{W}'}\frac{1}{n}\sum_{i=1}^{n}n\,\|(HDU)_i y-(HDb)_i\|_2^2,$$
where $\mathcal{W}'$ in the first equality is the convex set corresponding to $\mathcal{W}$ and the second equality is due to the fact that the matrix $HD$ is orthogonal. Below we discuss the idea behind these reformulations.

The first step of the preconditioning (6) is to obtain $U$, an $(O(d),O(1),2)$-conditioned basis of $A$ (i.e., $U=AR^{-1}$; see
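The following sketch (ours, not the paper's exact algorithm) illustrates the two preconditioning steps under simplifying assumptions: a dense Gaussian sketch stands in for CountSketch, the randomized Hadamard transform is formed as a dense matrix, and $n$ is assumed to be a power of two.

```python
import numpy as np
from scipy.linalg import hadamard, qr

def two_step_precondition(A, b, sketch_size, seed=0):
    """Illustrative two-step preconditioning (simplified; not the paper's exact algorithm).

    Step 1: sketch A, take the QR factor R of S A, and form U = A R^{-1},
            which is well-conditioned with high probability.
    Step 2: apply a randomized Hadamard transform H D, which is orthogonal and
            spreads out the row norms so that uniform row sampling works well.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape

    # Step 1: Gaussian sketch for readability (CountSketch or SRHT would be faster).
    S = rng.standard_normal((sketch_size, n)) / np.sqrt(sketch_size)
    _, R = qr(S @ A, mode='economic')
    U = np.linalg.solve(R.T, A.T).T           # U = A R^{-1}

    # Step 2: randomized Hadamard transform (requires n to be a power of two here).
    D = rng.choice([-1.0, 1.0], size=n)       # random diagonal signs
    H = hadamard(n) / np.sqrt(n)              # orthonormal Walsh-Hadamard matrix
    return H @ (D[:, None] * U), H @ (D * b), R
```

One would then run (mini-batch) SGD on the rows of the returned pair and map the solution $y$ back to the original variable via $x = R^{-1}y$.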

High precision case: Improved iterative Hessian sketch

Now, we go back to our results on time complexity in (13) and (14). One benefit of these results is that the $\epsilon$ term in the time complexity is independent of $n$ and depends only on $\mathrm{poly}(d)$ and $\log n$. Thus, if we directly apply the Variance Reduced methods developed in recent years (such as [3]) to the constrained linear regression problem, we can obtain a time complexity of $O((n+\kappa)\,\mathrm{poly}(d)\log\frac{1}{\epsilon})$, where the $\epsilon$ term is $\log(\frac{1}{\epsilon})$ instead of $\mathrm{poly}(\frac{1}{\epsilon})$ and $\kappa$ is the condition number of $A$. Comparing with the $\epsilon$
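To illustrate the idea behind pwGradient, and why a single sketch can suffice, here is a simplified sketch (ours rather than the paper's exact implementation): precondition once via a sketch, then run full projected gradient descent in the transformed variable $y = Rx$. The projection onto the transformed constraint set is assumed to be supplied by the caller.

```python
import numpy as np

def pw_gradient_sketch(A, b, project_y, sketch_size, iters=50, seed=0):
    """Simplified preconditioned projected gradient descent (illustration only).

    One sketch S A yields R; the change of variables y = R x makes U = A R^{-1}
    well-conditioned, so projected gradient descent on min_y ||U y - b||_2^2
    converges linearly at a rate that does not depend on the condition number of A.
    `project_y` projects onto the transformed constraint set {R x : x in W}.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    S = rng.standard_normal((sketch_size, n)) / np.sqrt(sketch_size)
    _, R = np.linalg.qr(S @ A)                # a single sketch, computed once
    U = np.linalg.solve(R.T, A.T).T           # U = A R^{-1}

    y = project_y(np.zeros(d))
    step = 1.0 / (2 * np.linalg.norm(U, 2) ** 2)
    for _ in range(iters):
        y = project_y(y - step * 2 * U.T @ (U @ y - b))
    return np.linalg.solve(R, y)              # recover x = R^{-1} y
```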

Numerical experiments

In this section we present experimental results for our proposed methods, focusing mainly on the iteration complexity and running time. The experiments confirm that our proposed algorithms are indeed faster than the existing ones. The algorithms are implemented using CountSketch as the sketch matrix $S\in\mathbb{R}^{s\times n}$ in the step for computing $R^{-1}$, due to its fast construction time. The Matlab code of CountSketch can be found in [35]; a simple re-implementation is sketched below for illustration. We then briefly recap our methods.
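The snippet below is a plain NumPy re-implementation of CountSketch for illustration (it is not the Matlab routine from [35]): each row of A is hashed into one of s buckets with a random sign, so forming SA costs O(nnz(A)) time.

```python
import numpy as np

def countsketch(A, s, seed=0):
    """Apply a CountSketch matrix S of shape (s, n) to A without forming S explicitly."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    buckets = rng.integers(0, s, size=n)        # hash h: [n] -> [s]
    signs = rng.choice([-1.0, 1.0], size=n)     # random signs sigma: [n] -> {-1, +1}
    SA = np.zeros((s, d))
    np.add.at(SA, buckets, signs[:, None] * A)  # SA[h(i)] += sigma(i) * A[i]
    return SA
```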

  • HDpwBatch, i.e. Algorithm 2. We

Conclusion and discussion

In this paper, we studied the large scale constrained linear regression problem, and presented new methods for both the low and high precision cases, using some recent developments in random projection, sketching and optimization. For the low precision case, our proposed methods have lower time complexity than the state-of-the-art technique. For the high precision case, our method considerably improves the time complexity of the Iterative Hessian Sketch method. Experiments on synthetic and

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was supported in part by National Science Foundation (NSF) through grants IIS-1422591, CCF-1422324, and CCF-1716400. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF.

Di Wang is a fourth year Ph.D. student in the Department of Computer Science and Engineering at the State University of New York (SUNY) at Buffalo, under the supervision of Dr. Jinhui Xu. Before that, he received his Master's degree in Mathematics from the University of Western Ontario in 2015 and his Bachelor's degree in Mathematics and Applied Mathematics from Shandong University in 2014. He is interested in Private Data Analysis, Machine Learning and Robust Estimation.

References (41)

  • M. Pilanci et al., Randomized sketches of convex programs with sharp guarantees, IEEE Trans. Inf. Theory (2015).
  • S. Shalev-Shwartz et al., Stochastic dual coordinate ascent methods for regularized loss minimization, J. Mach. Learn. Res. (2013).
  • R. Johnson et al., Accelerating stochastic gradient descent using predictive variance reduction, Advances in Neural Information Processing Systems (2013).
  • Z. Allen-Zhu, Katyusha: the first direct acceleration of stochastic gradient methods, Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (2017).
  • D. Wang et al., Differentially private empirical risk minimization revisited: faster and more general, Advances in Neural Information Processing Systems (2017).
  • C. Musco et al., Randomized block Krylov methods for stronger and faster approximate singular value decomposition, Advances in Neural Information Processing Systems (2015).
  • S. Paul et al., Random projections for support vector machines, AISTATS (2013).
  • C. Boutsidis et al., Near-optimal column-based matrix reconstruction, SIAM J. Comput. (2014).
  • A. Dasgupta et al., Sampling algorithms and coresets for ℓp regression, SIAM J. Comput. (2009).
  • D. Durfee et al., ℓ1 regression using Lewis weights preconditioning and stochastic gradient descent.
  • A. Gonen et al., Solving ridge regression using sketched preconditioned SVRG, International Conference on Machine Learning (2016).
  • L. Zhang et al., Recovering the optimal solution by dual random projection, COLT (2013).
  • P. Drineas et al., Faster least squares approximation, Numerische Mathematik (2011).
  • J. Yang et al., Weighted SGD for ℓp regression with randomized preconditioning, Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (2016).
  • V. Rokhlin et al., A fast randomized algorithm for overdetermined linear least-squares regression, Proc. Natl. Acad. Sci. (2008).
  • H. Avron et al., Blendenpik: supercharging LAPACK’s least-squares solver, SIAM J. Scientif. Comput. (2010).
  • M. Pilanci et al., Iterative Hessian sketch: fast and accurate solution approximation for constrained least-squares, J. Mach. Learn. Res. (2016).
  • J.A. Tropp, Improved analysis of the subsampled randomized Hadamard transform, Adv. Adaptive Data Anal. (2011).
  • M. Takác et al., Mini-batch primal and dual methods for SVMs, ICML (2013).
  • R.H. Byrd et al., Sample size selection in optimization methods for machine learning, Math. Program. (2012).

Jinhui Xu is currently a professor of Computer Science and Engineering at the University at Buffalo (the State University of New York). He received his B.S. and M.S. degrees in Computer Science from the University of Science and Technology of China (USTC), and his Ph.D. degree in Computer Science and Engineering from the University of Notre Dame in 2000. His research interests lie in the fields of Algorithms, Computational Geometry, Combinatorial Optimization, Machine Learning, Differential Privacy and their applications in several applied areas.
