A randomized least squares solver for terabyte-sized dense overdetermined systems

https://doi.org/10.1016/j.jocs.2016.09.007

Highlights

  • A scalable randomized least-squares solver based on the Blendenpik algorithm in a distributed-memory setting is implemented and evaluated.

  • The randomized solver employs four different random projection transforms that enable our algorithm to handle matrices up to three times larger and to achieve a speedup of up to 7.5× over a state-of-the-art scalable least-squares solver for tera-scale matrices.

  • The computational cost of various stages of the randomized solver is determined by how overdetermined the input matrix is.

  • The Blendenpik solver demonstrates good strong and weak scalability and excellent numerical stability in terms of relative error as matrix sizes increase. Its backward error is comparable to that achieved by the state-of-the-art solver.

Abstract

We present a fast randomized least-squares solver for distributed-memory platforms. Our solver is based on the Blendenpik algorithm, but employs multiple random projection schemes to construct a sketch of the input matrix. These sketching schemes, and in particular the randomized Discrete Cosine Transform, enable our algorithm to scale a distributed-memory vanilla implementation of Blendenpik to terabyte-sized matrices and to achieve a speedup of up to 7.5× over a state-of-the-art scalable least-squares solver based on the classic QR algorithm. Experimental evaluations on terabyte-scale matrices demonstrate excellent speedups on up to 16,384 cores of a Blue Gene/Q supercomputer.

Introduction

The explosive growth of data over the past 20 years in various domains, ranging from physics and biological sciences to economics and social sciences, has led to a need to perform efficient and scalable analysis on such massive datasets. One of the most widely and routinely used primitives in statistical data analysis is least-squares regression: given a matrix $A \in \mathbb{R}^{m \times n}$ and a vector $b \in \mathbb{R}^{m}$, we seek to compute
$$x^{*} = \arg\min_{x \in \mathbb{R}^{n}} \|Ax - b\|_{2}. \qquad (1)$$
Several algorithms have been proposed to solve large-scale least-squares problems in various distributed and parallel environments [1], returning solutions whose accuracy is close to machine precision. Recent approaches for large-scale least-squares problems include communication-avoiding factorizations [2], which scale well for a variety of shared-memory and distributed-memory platforms [3] and are based on the classic QR decomposition algorithm, an $O(mn^2)$ algorithm (assuming m ≥ n).
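For concreteness, the snippet below is a minimal NumPy/SciPy sketch of this classical QR-based approach on a small random test problem (the sizes and data are illustrative only, not the paper's experimental setup): factor A = QR and solve the triangular system R x = Qᵀb.

```python
# Baseline dense least squares via a reduced QR factorization, the O(mn^2)
# approach that randomized solvers are compared against.
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
m, n = 10_000, 50                       # tall-and-thin: m >> n
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

Q, R = np.linalg.qr(A)                  # A = Q R, Q is m x n, R is n x n upper triangular
x_qr = solve_triangular(R, Q.T @ b, lower=False)   # solve R x = Q^T b

# Cross-check against LAPACK's least-squares driver.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_qr, x_ref)
```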

Recent years have witnessed an explosion of research on so-called Randomized Numerical Linear Algebra [4] (or RandNLA for short) algorithms, which leverage the power of randomization in order to perform standard matrix computations. One of the core problems that has been extensively researched in this emerging field is the least-squares regression problem of Eq. (1). Sarlos [5] and Drineas et al. [6] introduced the first randomized algorithms for this problem. These algorithms apply the sub-sampled Randomized Hadamard Transform to the columns of the input matrix in order to create a least-squares problem of smaller size that can then be solved exactly and whose solution provably approximates the solution of the original problem with very high probability. This was followed by the work of Rokhlin and Tygert [7], who used a subsampled Randomized Fourier Transform to form a preconditioner and then used a standard iterative solver to solve the preconditioned problem. At the same time, Avron et al. [8] introduced Blendenpik, an algorithm and a software package which was the first practical implementation of a RandNLA dense least-squares solver that consistently and comprehensively outperformed state-of-the-art implementations of the traditional QR-based $O(mn^2)$ algorithms. Since then, there has been extensive research on RandNLA algorithms for regression problems; see Yang et al. [9] for a recent survey.
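To make the sketch-and-solve idea concrete, the following snippet (an illustration under our own assumptions, not the authors' code) mixes the rows of the problem with a randomized orthonormal transform, samples a small subset of the mixed rows, and solves the reduced problem exactly; the randomized DCT stands in for the sub-sampled Randomized Hadamard Transform, and the sketch size is an arbitrary choice.

```python
# Sketch-and-solve for overdetermined least squares: solve a small sampled
# problem whose solution approximates the full solution with high probability.
import numpy as np
from scipy.fft import dct

def sketch_and_solve(A, b, sketch_size, rng):
    m, n = A.shape
    signs = rng.choice([-1.0, 1.0], size=m)                 # random diagonal sign flips
    SA = dct(signs[:, None] * A, axis=0, norm="ortho")      # mix rows: F D A
    Sb = dct(signs * b, norm="ortho")                       # mix rows: F D b
    rows = rng.choice(m, size=sketch_size, replace=False)   # uniform row sample
    x, *_ = np.linalg.lstsq(SA[rows], Sb[rows], rcond=None)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((20_000, 40))
b = rng.standard_normal(20_000)
x_sketch = sketch_and_solve(A, b, sketch_size=2_000, rng=rng)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
# Residual of the sketched solution relative to the optimal residual (>= 1).
print(np.linalg.norm(A @ x_sketch - b) / np.linalg.norm(A @ x_exact - b))
```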

So far, most research on randomized least-squares regression algorithms has focused on the single processor setting, with two important exceptions. Meng et al. [10] introduced LSRN, a distributed memory algorithm for least-squares problems based on random Gaussian projections. While the algorithm is still an $O(mn^2)$ algorithm, the benefits of randomization are apparent with respect to both the constants in the asymptotic analysis and its much improved efficiency in parallel environments. Yang et al. [9] consider RandNLA in a MapReduce-like framework called Spark. This framework is less appropriate for supercomputers, as it fails to take advantage of their hardware architecture.

In this work, we explore the behavior of Blendenpik-type algorithms in a distributed memory setting. We show that variants of Blendenpik that use various batchwise transformations to compute preconditioners lead to implementations that are not only faster than state-of-the-art implementations of baseline least-squares solvers, but are also able to scale to much larger matrix sizes. In particular, we show that a Blendenpik based algorithm can solve least-squares regression problems with terabyte-sized (and larger) input matrices. Our implementation and experiments were run on AMOS,1 the high-performance Blue Gene/Q supercomputer system at Rensselaer. AMOS has five racks, 5120 nodes (81,920 cores), and 81,920 GB of main memory. AMOS has a peak performance of one PetaFLOP (10^15 floating-point operations per second), and a 5-D torus network with 2 GB/s of bandwidth per link and 512 GB/s to 1 TB/s of bisection network bandwidth per rack, depending on the torus network configuration. Due to runtime constraints imposed by the scheduling system for each partition of AMOS, we limited our experiments to partitions containing up to 1024 nodes (16,384 cores). The Blue Gene/Q architecture supports a hybrid communication framework that uses the MPI (Message Passing Interface) [11] standard for distributed communication and multithreading using OpenMP [12].

Our main contributions in this paper are: (i) implementation of and experimentation with the Blendenpik algorithm on distributed-memory platforms; (ii) implementation of four randomized sketching transforms in the context of the Blendenpik algorithm; (iii) a batchwise transformation scheme that allows a distributed vanilla implementation of the Randomized Discrete Cosine Transform within Blendenpik to handle matrices up to three times larger, and provides a speedup of up to 7.5 times over a state-of-the-art scalable least-squares solver for tera-scale matrices; (iv) a detailed evaluation of four randomized sketching transforms and their parameters in the context of the Blendenpik algorithm on the BG/Q, using up to 16,384 cores.

The full source code of our batchwise Blendenpik implementation is available for download at https://github.com/cjiyer/libskylark/tree/batchwiseblendenpik. The rest of this paper is organized as follows. Section 2 describes the Blendenpik algorithm and its stages in detail. Section 3 presents our distributed Blendenpik implementation on the Blue Gene/Q, discusses scalability issues in our implementation, and describes an approach to overcome them. Section 4 first describes experiments to tune our Blue Gene/Q environment for our evaluations and then discusses the outcome of our experimental evaluations.

Notation. Let A, B, … denote matrices and let x, y, z, … denote column vectors. Given a vector $x \in \mathbb{R}^{m}$, let $\|x\|_{2}^{2} = \sum_{i=1}^{m} x_{i}^{2}$ be (the square of) its Euclidean norm; given a matrix $A \in \mathbb{R}^{m \times n}$, let $\|A\|_{F}^{2} = \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^{2}$ be (the square of) its Frobenius norm. Let $\sigma_{1} \geq \sigma_{2} \geq \cdots \geq \sigma_{r} > 0$ be the nonzero singular values of A, where $r = \operatorname{rank}(A)$ is the rank of the matrix A. Then, the condition number of A is $\kappa(A) = \sigma_{1}/\sigma_{r}$.
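The notation maps directly onto standard linear-algebra routines; the short check below (purely illustrative) verifies the definitions numerically.

```python
# Euclidean norm, Frobenius norm, and condition number kappa(A) = sigma_1 / sigma_r.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 10))
x = rng.standard_normal(100)

euclidean_sq = np.sum(x**2)                  # ||x||_2^2
frobenius_sq = np.sum(A**2)                  # ||A||_F^2
sigma = np.linalg.svd(A, compute_uv=False)   # singular values, in descending order
r = np.linalg.matrix_rank(A)
kappa = sigma[0] / sigma[r - 1]              # sigma_1 / sigma_r

assert np.isclose(euclidean_sq, np.linalg.norm(x) ** 2)
assert np.isclose(frobenius_sq, np.linalg.norm(A, "fro") ** 2)
assert np.isclose(kappa, np.linalg.cond(A, 2))
```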

Section snippets

The Blendenpik Algorithm for dense overdetermined systems

Blendenpik (see Algorithm 1) is a solver for dense, overdetermined, full column rank least-squares problems that computes an approximate solution to the problem of Eq. (1) with a high degree of precision. Given a dense, tall-and-thin matrix $A \in \mathbb{R}^{m \times n}$ and a column vector $b \in \mathbb{R}^{m}$, the algorithm returns an approximate solution by executing the following three steps:

  • 1.

    A preconditioner is constructed by applying a randomized unitary (or approximately unitary) transform F to the input matrix A
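The snippet below is a minimal, single-node sketch of this three-step structure (randomized sketch of A, QR of the sketch to obtain a preconditioner R, then an iterative solve of the preconditioned system). It is an illustration under our own assumptions: the sampling factor gamma, the uniform row sampling, and SciPy's LSQR are chosen for simplicity and are not taken from the paper's distributed, batchwise implementation.

```python
# Sketch-and-precondition in the spirit of Blendenpik, on a single node.
import numpy as np
from scipy.fft import dct
from scipy.linalg import qr, solve_triangular
from scipy.sparse.linalg import lsqr, LinearOperator

def blendenpik_like(A, b, gamma=4.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    # Step 1: randomized DCT mixing (sign flips + orthonormal DCT),
    # then uniform sampling of roughly gamma * n mixed rows.
    signs = rng.choice([-1.0, 1.0], size=m)
    mixed = dct(signs[:, None] * A, axis=0, norm="ortho")
    rows = rng.choice(m, size=int(gamma * n), replace=False)
    # Step 2: QR of the sketch; the triangular factor R preconditions A.
    R = qr(mixed[rows], mode="economic")[1]
    # Step 3: LSQR on the well-conditioned operator A R^{-1} (R is never inverted explicitly).
    op = LinearOperator(
        (m, n),
        matvec=lambda y: A @ solve_triangular(R, y, lower=False),
        rmatvec=lambda y: solve_triangular(R, A.T @ y, lower=False, trans="T"),
    )
    y = lsqr(op, b, atol=1e-14, btol=1e-14)[0]
    return solve_triangular(R, y, lower=False)   # undo the change of variables
```

Because the sketch approximately preserves the row-space geometry of A, the preconditioned operator A R⁻¹ is well conditioned, so the iterative solver converges in a small number of iterations that is essentially independent of κ(A).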

Implementing our algorithm on the Blue Gene/Q

The algorithm is implemented on top of the Elemental library [15]. Given a distributed environment over p processes, any dense matrix $A \in \mathbb{R}^{m \times n}$ is partitioned in Elemental into rectangular grids of size r × c in a 2D cyclic distribution, such that p = r × c and both r and c are O(√p). Elemental allows a matrix to be distributed in more than one way. An overview of various data distributions available in Elemental is given in Table 1 (not exhaustive). We use the standard distribution [MC, MR]
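As a toy illustration of the element-cyclic [MC, MR] ownership rule (our reading of Elemental's default distribution; the linear rank numbering below is one possible convention chosen only for display), the snippet maps each global entry (i, j) to the process at grid position (i mod r, j mod c).

```python
# Which process owns entry (i, j) under an element-cyclic [MC, MR] distribution?
import numpy as np

def owner_grid(m, n, r, c):
    rows = np.arange(m)[:, None] % r   # MC: matrix rows cycle over the r grid rows
    cols = np.arange(n)[None, :] % c   # MR: matrix columns cycle over the c grid columns
    return rows * c + cols             # display ranks in row-major grid order

# A 6 x 6 matrix on a 2 x 3 process grid: each entry shows its owner's rank.
print(owner_grid(6, 6, r=2, c=3))
```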

Evaluation

To generate terascale-sized dense matrices, we randomly perturb sparse matrices with values generated from a standard normal distribution. We relied on the UFL Sparse matrix collection [18] for obtaining matrices of different condition numbers and coherence values. We chose to work with the Yoshiyasu Mesh matrix,2 the ESOC Springer matrix,3 and the Rucci

Conclusions and future work

We implemented and thoroughly evaluated a highly scalable, distributed memory, least-squares solver based on the Blendenpik algorithm. Our solver, which is based on an implementation of the Blendenpik algorithm in a distributed setting coupled with various batchwise transformations in order to construct an appropriate preconditioner, beats state-of-the-art least-squares solvers with respect to running time and scales to much larger matrices compared to prior work. In future work, we plan to

References (20)

  • D. Achlioptas

    Database-friendly random projections: Johnson-Lindenstrauss with binary coins

    J. Comput. Syst. Sci.

    (2003)
  • K.A. Gallivan et al.

    Parallel algorithms for dense linear algebra computations

    SIAM Rev.

    (1990)
  • J. Demmel et al.

    Communication-optimal parallel and sequential QR and LU factorizations. UC Berkeley Technical Report EECS-2008-89

    (Aug 1, 2008)
  • J. Demmel et al.

    Communication Avoiding (CA) and Other Innovative Algorithms

    (2014)
  • P. Drineas et al.

    RandNLA: Randomized numerical linear algebra

    Commun. ACM

    (2016)
  • T. Sarlos

    Improved approximation algorithms for large matrices via random projections

  • P. Drineas et al.

    Faster least squares approximation

    Numer. Math.

    (2011)
  • V. Rokhlin et al.

    A fast randomized algorithm for overdetermined linear least-squares regression

    Proc. Natl. Acad. Sci. U.S.A.

    (2008)
  • H. Avron et al.

    Blendenpik: Supercharging LAPACK's least-squares solver

    SIAM J. Sci. Comput.

    (2010)
  • J. Yang et al.

    Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments, CoRR abs/1502.03032

    (2015)

Chander Iyer is currently a 4th year Ph.D. student in the Department of Computer Science at Rensselaer Polytechnic Institute. He received his B.E. from Mumbai University, Mumbai, India, in 2003 and received his M.Tech. in 2010 from Indian Institute of Technology, Bombay. He is currently being advised by Prof. Petros Drineas, with Prof. Chris Carothers as his co-advisor. His research interests lie at the intersection of Randomized Algorithms for large scale datasets, High Performance Computing and Machine Learning.

Haim Avron did his Ph.D. at the School of Computer Science at Tel Aviv University under the supervision of Prof. Sivan Toledo. Afterwards he spent two years as a Postdoctoral Researcher in the Business Analytics & Mathematical Sciences department at the IBM T.J. Watson Research Center. From 2012 to 2015 he was a Research Staff Member in the Mathematical Sciences & Analytics department at the IBM T.J. Watson Research Center. He joined the Department of Applied Mathematics, School of Mathematical Sciences at Tel Aviv University as a Senior Lecturer (equivalent to assistant professor) in 2015. His research focuses on numerical computing and high performance computing and their applications in scientific computing and machine learning. His interests and work range from mathematical and computational foundations to end-to-end implementation aspects.

Georgios Kollias received the B.Sc. in Physics in 2000 and the M.Sc. in Computational Science in 2002 from the University of Athens, Greece, and the Ph.D. in Computer Science from the University of Patras, Greece, in 2009. He moved to Purdue University, USA in October 2009 and worked as a Postdoctoral Researcher in the Computer Science Department and the Center for Science of Information until May 2013. He then joined the IBM T.J. Watson Research Center, USA, and in August 2014 he moved to the IBM Zurich Research Lab. He returned to the IBM T.J. Watson Research Center in April 2015 as a Research Staff Member in the area of Big Data Management and Analytics. His research interests include Parallel, Distributed and High Performance Computing, Numerical Linear Algebra and Matrix Computations, Graph Mining, Data Analytics and Problem Solving Environments.

Yves Ineichen received his M.Sc. in Computer Science in 2008, and the Ph.D. in Computer Science in 2013, from the Federal Institute of Technology Zurich (ETHZ), Switzerland. At the beginning of 2013 he joined the IBM Research Center in Rueschlikon, Switzerland. His research interests include High Performance Computing, Optimization, Numerical Linear Algebra, Compiler Design, and Programming Languages. Yves is a recipient of the PRACE (2012) and ACM Gordon Bell (2015) awards.

Professor Chris Carothers is a faculty member in the Computer Science Department at Rensselaer Polytechnic Institute. He received the Ph.D., M.S., and B.S. from Georgia Institute of Technology in 1997, 1996, and 1991, respectively. Prior to joining RPI in 1998, he was a research scientist at the Georgia Institute of Technology. His research interests are focused on massively parallel computing, involving the creation of high fidelity models of extreme-scale networks and computer systems. These models have been executed on nearly 2,000,000 processing cores on the largest leadership class supercomputers in the world. Professor Carothers is an NSF CAREER Award winner as well as a Best Paper award winner at the ACM-SIGSIM PADS Conference in 1999, 2003 and 2009. Since joining Rensselaer, he has developed a world-class research portfolio which includes funding from the NSF, the U.S. Department of Energy, Army Research Laboratory, Air Force Research Laboratory, as well as several companies, including IBM, General Electric, and AT&T.

Additionally, Professor Carothers serves as the Director for the Rensselaer Center for Computational Innovations (CCI). CCI is a partnership between Rensselaer and IBM. The center provides computation and storage resources to a diverse network of researchers, faculty, and students from Rensselaer, government laboratories, and companies across a number of science and engineering disciplines. The flagship supercomputer is a 1 petaFLOP IBM Blue Gene/Q system with 80 terabytes of memory, 81,920 processing cores and over 2 petabytes of disk storage.

Professor Petros Drineas is an Associate Professor at the Computer Science Department of Rensselaer Polytechnic Institute. Prof. Drineas earned a PhD in Computer Science from Yale University in May of 2003, and a BS in Computer Engineering and Informatics from the University of Patras, Greece, in July of 1997. Prof. Drineas’ research interests lie in the design and analysis of randomized algorithms for linear algebraic problems, as well as their applications to the analysis of modern, massive datasets. Prof. Drineas received an NSF CAREER in 2006; was a Visiting Professor at the US Sandia National Laboratories during the fall of 2005; was a Visiting Fellow at the Institute for Pure and Applied Mathematics at the University of California, Los Angeles in the fall of 2007; and was a Visiting Professor at the University of California Berkeley in the fall of 2013. Prof. Drineas has also served the US National Science Foundation (NSF) as a Program Director in the Information and Intelligent Systems (IIS) Division and the Computing and Communication Foundations (CCF) Division (2010–2011). Prof. Drineas has published over 90 articles in conferences and journals in Theoretical Computer Science, Numerical Linear Algebra, and statistical data analysis.

This work was partially supported by the XDATA program of the Defense Advanced Research Projects Agency (DARPA), administered through Air Force Research Laboratory contract FA8750-12-C-0323, as well as by NSF IIS-1302231.
