Faster constrained linear regression via two-step preconditioning
Introduction
Linear regression with convex constraints is a fundamental problem in Machine Learning, Statistics and Signal Processing, since many other problems, such as SVM, LASSO and signal recovery [1], can all be formulated as constrained linear regression problems. Thus, the problem has received a great deal of attention from both the Machine Learning and Theoretical Computer Science communities. The problem can be formally defined as follows:

min_{x∈C} f(x) := ‖Ax − y‖_2^2,

where A is a matrix in ℝ^{n×d} with d < n < e^d, C ⊆ ℝ^d is a closed convex set and y ∈ ℝ^n is the response vector. The goal is to find an x ∈ C such that f(x) ≤ (1 + ϵ)f(x*) or ‖A(x − x*)‖_2^2 ≤ ϵ‖Ax*‖_2^2, where x* is an optimal solution and ϵ is the approximation error.
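As a concrete instance of this formulation, the following sketch (illustrative, not from the paper; C is taken to be the nonnegative orthant as one example of a closed convex set) solves the problem by projected gradient descent:

```python
import numpy as np

def projected_gradient(A, y, n_iter=500):
    """Minimize f(x) = ||Ax - y||_2^2 over C = {x : x >= 0},
    one example of a closed convex constraint set."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2      # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - y)       # gradient of f
        x = np.maximum(x - grad / L, 0.0)    # gradient step, then project onto C
    return x
```

Each iteration costs O(nd) for the gradient plus the cost of projecting onto C (trivial here); the methods in this paper aim precisely at reducing this dependence on n and on the conditioning of A.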
Roughly speaking, there are two types of methods for solving the problem. The first type is based on Stochastic Gradient Descent (SGD). Recent developments in first-order stochastic methods, such as Stochastic Dual Coordinate Ascent (SDCA) [2], Stochastic Variance Reduced Gradient (SVRG) [3] and Katyusha [4], have significantly improved the convergence speed of large scale optimization in both theory and practice [5], which offers the potential for faster solutions to our problem.
The second type is based on randomized linear algebra. Among these techniques, random projection and sketching are commonly used in many optimization problems as preconditioners, dimension-reduction or sampling tools to reduce the time complexity. Examples include low rank approximation [6], SVM [7], column subset selection [8] and ℓp regression (for p ∈ [1, 2]) [9], [10]. It is therefore tempting to combine the two types of techniques to develop faster methods with theoretical or statistical guarantees for more constrained optimization problems. Recently, quite a number of works have successfully done so. For example, [11] proposed faster methods for Ridge Regression and Empirical Risk Minimization by using SVRG, SGD and low rank approximation, and [12] achieved guarantees for Empirical Risk Minimization by using random projection in the dual problem.
In this paper, we revisit preconditioning methods for solving the large-scale constrained linear regression problem, and propose a new method called two-step preconditioning. Combining this method with some recent developments on large scale convex optimization, we obtain faster algorithms for both the low precision case (where the running time grows polynomially in 1/ϵ) and the high precision case (where it grows as log(1/ϵ)). Specifically, our main contributions can be summarized as follows.
- 1.
For the low precision case, we first propose a novel algorithm called HDpwBatchSGD (i.e., Algorithm 2) by combining the method of two-step preconditioning with mini-batch SGD. Mini-batch SGD is a popular way to improve the efficiency of SGD. It uses several samples, instead of one, in each iteration and runs the gradient descent update on all of them simultaneously. Ideally, we would hope for a factor of r speed-up in convergence when using a batch of size r. However, this is not always possible in the general case; in fact, in some cases there is no speed-up at all when a large batch is used [19], [20], [21]. A unique feature of our method is its optimal speed-up with respect to the batch size, i.e., the iteration complexity decreases by a factor of b whenever the batch size increases by a factor of b.
We also use the two-step preconditioning method and the Multi-epoch Stochastic Accelerated mini-batch SGD proposed in [22] to obtain another, slightly different algorithm called HDpwBatchAccSGD (i.e., Algorithms 3 and 4), whose time complexity is lower than that of both the state-of-the-art technique [14] and HDpwBatchSGD.
- 2.
The optimal speed-up of HDpwBatchSGD and HDpwBatchAccSGD further inspires us to consider how the method performs when using the full gradient, i.e., projected Gradient Descent, which leads to another algorithm called pwGradient (i.e., Algorithm 6). We find that it actually yields an alternative implementation of the Iterative Hessian Sketch (IHS) method [17], which is arguably the state-of-the-art technique for the high precision case. In particular, we are able to show that a single sketch is sufficient for IHS, instead of the sequence of sketches used in its current form. This enables us to considerably improve the time complexity of IHS.
- 3.
Finally, we implement our algorithms and test them on both large synthetic and real benchmark datasets. Numerical results confirm our theoretical analysis of HDpwBatchSGD, HDpwBatchAccSGD and pwGradient. Also, our methods outperform existing ones in both low and high precision cases.
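The mini-batch updates described in contribution 1 can be sketched as follows (an illustrative NumPy rendering, not the paper's Algorithm 2: no preconditioning is applied, the constraint set is dropped for brevity, and the conservative step size is our own heuristic):

```python
import numpy as np

def minibatch_sgd(A, y, batch=16, n_iter=5000, seed=0):
    """Mini-batch SGD for min_x (1/n)||Ax - y||_2^2: each iteration samples
    `batch` rows and steps along their averaged gradient."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    # conservative step size, stable even for batch = 1
    step = 1.0 / (4.0 * np.max(np.sum(A ** 2, axis=1)))
    x = np.zeros(d)
    for _ in range(n_iter):
        idx = rng.integers(0, n, size=batch)                     # sampled batch
        grad = (2.0 / batch) * A[idx].T @ (A[idx] @ x - y[idx])  # unbiased estimate
        x -= step * grad
    return x
```

With a batch of size b, each iteration touches b rows, so the optimal speed-up discussed above (iteration count shrinking by a factor of b) keeps the total work constant while exposing more parallelism per iteration.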
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 gives some background on random projection and stochastic gradient descent. Section 4 describes our proposed algorithms for the low precision case, and Section 5 presents our algorithm for the high precision case. Finally, we experimentally evaluate our methods in Section 6 and conclude in Section 7.
Related work
A vast number of papers have studied the large scale constrained linear regression problem from different perspectives, such as [24], [25]. We mainly focus on results with theoretical time complexity guarantees (note that the time complexity should not depend on the condition number of A, as it does in [26]), due to their similar nature to ours. We summarize these methods in Table 1.
For the low precision case, [13] directly uses sketching with a very large sketch size of which
Preliminaries
Let A be a matrix in ℝ^{n×d} with e^d > n > d and rank(A) = d (note that our proposed methods can easily be extended to the case of d > rank(A)), and let A_i and A^j be its i-th row and j-th column, respectively. Let ‖A‖2 and ‖A‖F be the spectral norm and Frobenius norm of A, respectively, and σmin(A) be the minimal singular value of A.
We first give the formal definition of the problem to be studied throughout the paper.
Large Scale Constrained Linear Regression. Let {(A_i, y_i)}_{i=1}^{n} be a dataset of
Main idea
The main idea of our algorithm is to use two steps of preconditioning to reformulate the problem in the following way:

min_{x∈C} ‖Ax − y‖_2^2 = min_{z∈C′} ‖Uz − y‖_2^2 = min_{z∈C′} ‖HDUz − HDy‖_2^2,

where C′ = {Rx : x ∈ C} in the first equality is the convex set corresponding to C, and the second equality is due to the fact that the matrix HD is orthogonal. Below we discuss the idea behind these reformulations.
The first step of the preconditioning (6) is to obtain U, an (α, β)-conditioned basis of A (i.e., U = AR^{−1}; see
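The two reformulation steps above can be illustrated as follows (a sketch under assumed specifics: S is a Gaussian sketching matrix, R comes from a QR factorization of SA, the sketch size is an arbitrary illustrative choice, and n is assumed padded to a power of two for the Hadamard transform):

```python
import numpy as np

def hadamard(m):
    """Unnormalized Walsh-Hadamard matrix of size m (m a power of two)."""
    H = np.ones((1, 1))
    while H.shape[0] < m:
        H = np.block([[H, H], [H, -H]])
    return H

def two_step_precondition(A, y, seed=0):
    """Step 1: U = A R^{-1} is well-conditioned, with R from a QR of S A.
    Step 2: multiply by the orthogonal matrix H D, which leaves the objective
    unchanged while spreading the row norms (leverage scores) of U out evenly."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    assert n & (n - 1) == 0, "pad A with zero rows so that n is a power of two"
    m = 10 * d                                   # sketch size (illustrative)
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    _, R = np.linalg.qr(S @ A)                   # step 1: preconditioner R
    U = A @ np.linalg.inv(R)
    D = rng.choice([-1.0, 1.0], size=n)          # random diagonal signs
    H = hadamard(n) / np.sqrt(n)                 # orthonormal Hadamard matrix
    return H @ (D[:, None] * U), H @ (D * y), R  # solve min_z ||HDUz - HDy||_2^2
```

After solving for z over C′ = {Rx : x ∈ C}, one recovers x = R^{−1}z; since HD is orthogonal, ‖HDUz − HDy‖ = ‖Uz − y‖, so nothing is lost in the second step.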
High precision case: Improved iterative Hessian sketch
Now, we go back to our results on time complexity in (13) and (14). One benefit of these results is that the ϵ term in the time complexity is independent of n and depends only on poly(d) and log n. Thus, if we directly apply the variance reduced methods developed in recent years (such as [3]) to the constrained linear regression problem, we can obtain a time complexity whose ϵ term is log(1/ϵ) instead of 1/ϵ, but which depends on the condition number κ of A. Comparing with the ϵ
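The point about a single sketch sufficing can be illustrated by the following pwGradient-style solver (an illustrative sketch, not the paper's Algorithm 6: unconstrained for brevity, with an assumed Gaussian sketch, sketch size and step size of our own choosing). One sketch of A yields a preconditioner R, after which plain gradient descent on A R^{−1} converges at a condition-number-free linear rate:

```python
import numpy as np

def pw_gradient(A, y, n_iter=100, seed=0):
    """One sketch S A -> QR -> R; then gradient descent on the
    well-conditioned system U = A R^{-1}, whose kappa is O(1)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    m = 10 * d                                    # a single sketch, built once
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    _, R = np.linalg.qr(S @ A)
    U = A @ np.linalg.inv(R)
    z = np.zeros(d)
    for _ in range(n_iter):
        z -= 0.5 * U.T @ (U @ z - y)  # fixed step; in the constrained case a
                                      # projection onto R C would follow here
    return np.linalg.solve(R, z)      # map back: x = R^{-1} z
```

By contrast, the original IHS [17] draws a fresh sketch at every iteration to approximate the Hessian A^T A; the observation above is that the one sketch used to build R already suffices, which removes the repeated sketching cost.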
Numerical experiments
In this section we present experimental results for our proposed methods. We mainly focus on the iteration complexity and the running time. The experiments confirm that our proposed algorithms are indeed faster than the existing ones. The algorithms are implemented using CountSketch as the sketch matrix in the preconditioning step, due to its fast construction time. The Matlab code of CountSketch can be found in [35]. Below is a brief recap of our methods.
- •
HDpwBatchSGD, i.e., Algorithm 2. We
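CountSketch itself admits a compact NumPy rendering (an illustrative version, not the Matlab code of [35]): each row of A is hashed into one of m buckets with a random sign, so SA is formed in a single pass, in time proportional to nnz(A).

```python
import numpy as np

def countsketch(A, m, seed=0):
    """Apply a CountSketch matrix S (m x n) to A without materializing S:
    row i of A is added, with a random sign, to bucket h(i) of the output."""
    rng = np.random.default_rng(seed)
    n, _ = A.shape
    buckets = rng.integers(0, m, size=n)        # hash h : [n] -> [m]
    signs = rng.choice([-1.0, 1.0], size=n)     # sign  s : [n] -> {-1, +1}
    SA = np.zeros((m, A.shape[1]))
    np.add.at(SA, buckets, signs[:, None] * A)  # one pass over the rows of A
    return SA
```

A QR factorization of SA then yields the preconditioner; a sketch size of m = O(d^2) suffices for CountSketch to be a subspace embedding, which is what makes A R^{−1} well-conditioned.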
Conclusion and discussion
In this paper, we studied the large scale constrained linear regression problem, and presented new methods for both the low and high precision cases, using some recent developments in random projection, sketching and optimization. For the low precision case, our proposed methods have lower time complexity than the state-of-the-art technique. For the high precision case, our method considerably improves the time complexity of the Iterative Hessian Sketch method. Experiments on synthetic and real-world datasets confirm our theoretical analysis.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This research was supported in part by National Science Foundation (NSF) through grants IIS-1422591, CCF-1422324, and CCF-1716400. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF.
Di Wang is a fourth year Ph.D. student in the Department of Computer Science and Engineering at The State University of New York (SUNY) at Buffalo under supervision of Dr. Jinhui Xu. Before that he got his Master degree in Mathematics at University of Western Ontario in 2015, and he got his Bachelor degree in Mathematics and Applied Mathematics at Shandong University in 2014. He is interested in Private Data Analysis, Machine Learning and Robust Estimation.
References (41)
- et al., Randomized sketches of convex programs with sharp guarantees, IEEE Trans. Inf. Theory (2015)
- et al., Stochastic dual coordinate ascent methods for regularized loss minimization, J. Mach. Learn. Res. (2013)
- et al., Accelerating stochastic gradient descent using predictive variance reduction, Advances in Neural Information Processing Systems (2013)
- Katyusha: the first direct acceleration of stochastic gradient methods, Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (2017)
- et al., Differentially private empirical risk minimization revisited: faster and more general, Advances in Neural Information Processing Systems (2017)
- et al., Randomized block Krylov methods for stronger and faster approximate singular value decomposition, Advances in Neural Information Processing Systems (2015)
- et al., Random projections for support vector machines, AISTATS (2013)
- et al., Near-optimal column-based matrix reconstruction, SIAM J. Comput. (2014)
- et al., Sampling algorithms and coresets for ℓp regression, SIAM J. Comput. (2009)
- et al., ℓ1 regression using Lewis weights preconditioning and stochastic gradient descent
- Solving ridge regression using sketched preconditioned SVRG, International Conference on Machine Learning
- Recovering the optimal solution by dual random projection, COLT
- Faster least squares approximation, Numerische Mathematik
- Weighted SGD for ℓp regression with randomized preconditioning, Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms
- A fast randomized algorithm for overdetermined linear least-squares regression, Proc. Natl. Acad. Sci.
- Blendenpik: supercharging LAPACK's least-squares solver, SIAM J. Sci. Comput.
- Iterative Hessian sketch: fast and accurate solution approximation for constrained least-squares, J. Mach. Learn. Res.
- Improved analysis of the subsampled randomized Hadamard transform, Adv. Adaptive Data Anal.
- Mini-batch primal and dual methods for SVMs, ICML (3)
- Sample size selection in optimization methods for machine learning, Math. Program.
Jinhui Xu is currently a professor of Computer Science and Engineering at the University at Buffalo (the State University of New York). He received his B.S. and M.S. degrees in Computer Science from the University of Science and Technology of China (USTC), and his Ph.D. degree in Computer Science and Engineering from the University of Notre Dame in 2000. His research interest lies in the fields of Algorithms, Computational Geometry, Combinatorial Optimization, Machine Learning, Differential Privacy and their applications in several applied areas.