A randomized least squares solver for terabyte-sized dense overdetermined systems

https://doi.org/10.1016/j.jocs.2016.09.007

Highlights

  • A scalable randomized least-squares solver based on the Blendenpik algorithm in a distributed-memory setting is implemented and evaluated.

  • The randomized solver employs four different random projection transforms that enable our algorithm to handle matrices up to three times larger and to achieve a speedup of up to 7.5× over a state-of-the-art scalable least-squares solver for tera-scale matrices.

  • The computational cost of various stages of the randomized solver is determined by how overdetermined the input matrix is.

  • The Blendenpik solver demonstrates good strong and weak scalability and excellent numerical stability in terms of relative error as matrix sizes increase. Its backward error is comparable to that achieved by the state-of-the-art solver.

Abstract

We present a fast randomized least-squares solver for distributed-memory platforms. Our solver is based on the Blendenpik algorithm, but employs multiple random projection schemes to construct a sketch of the input matrix. These sketching schemes, and in particular the randomized Discrete Cosine Transform, enable our algorithm to scale a distributed-memory vanilla implementation of Blendenpik to terabyte-sized matrices and to achieve a speedup of up to 7.5× over a state-of-the-art scalable least-squares solver based on the classic QR algorithm. Experimental evaluations on terabyte-scale matrices demonstrate excellent speedups on up to 16,384 cores of a Blue Gene/Q supercomputer.

Introduction

The explosive growth of data over the past 20 years in various domains, ranging from physics and biological sciences to economics and social sciences, has led to a need to perform efficient and scalable analysis on such massive datasets. One of the most widely and routinely used primitives in statistical data analysis is least-squares regression: given a matrix $A \in \mathbb{R}^{m \times n}$ and a vector $b \in \mathbb{R}^{m}$, we seek to compute
$$x^{*} = \arg\min_{x \in \mathbb{R}^{n}} \|Ax - b\|_{2}. \qquad (1)$$
Several algorithms have been proposed to solve large-scale least-squares problems in various distributed and parallel environments [1], returning solutions whose accuracy is close to machine precision. Recent approaches for large-scale least-squares problems include communication-avoiding factorizations [2], which scale well for a variety of shared-memory and distributed-memory platforms [3] and are based on the classic QR decomposition algorithm, an $O(mn^2)$ algorithm (assuming m ≥ n).
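For concreteness, the snippet below is a minimal NumPy/SciPy sketch of this classical QR-based approach on a small random test problem (the sizes and data are illustrative only, not the paper's experimental setup): factor A = QR and solve the triangular system R x = Qᵀb.

```python
# Baseline dense least squares via a reduced QR factorization, the O(mn^2)
# approach that randomized solvers are compared against.
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
m, n = 10_000, 50                       # tall-and-thin: m >> n
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

Q, R = np.linalg.qr(A)                  # A = Q R, Q is m x n, R is n x n upper triangular
x_qr = solve_triangular(R, Q.T @ b, lower=False)   # solve R x = Q^T b

# Cross-check against LAPACK's least-squares driver.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_qr, x_ref)
```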

Recent years have witnessed an explosion of research on so-called Randomized Numerical Linear Algebra [4] (or RandNLA for short) algorithms, which leverage the power of randomization in order to perform standard matrix computations. One of the core problems that has been extensively researched in this emerging field is the least-squares regression problem of Eq. (1). Sarlos [5] and Drineas et al. [6] introduced the first randomized algorithms for this problem. These algorithms apply the sub-sampled Randomized Hadamard Transform to the columns of the input matrix in order to create a least-squares problem of smaller size that can then be solved exactly and whose solution provably approximates the solution of the original problem with very high probability. This was followed by the work of Rokhlin and Tygert [7], who used a subsampled Randomized Fourier Transform to form a preconditioner and then used a standard iterative solver to solve the preconditioned problem. At the same time, Avron et al. [8] introduced Blendenpik, an algorithm and a software package which was the first practical implementation of a RandNLA dense least-squares solver that consistently and comprehensively outperformed state-of-the-art implementations of the traditional QR-based $O(mn^2)$ algorithms. Since then, there has been extensive research on RandNLA algorithms for regression problems; see Yang et al. [9] for a recent survey.
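To make the sketch-and-solve idea concrete, the following snippet (an illustration under our own assumptions, not the authors' code) mixes the rows of the problem with a randomized orthonormal transform, samples a small subset of the mixed rows, and solves the reduced problem exactly; the randomized DCT stands in for the sub-sampled Randomized Hadamard Transform, and the sketch size is an arbitrary choice.

```python
# Sketch-and-solve for overdetermined least squares: solve a small sampled
# problem whose solution approximates the full solution with high probability.
import numpy as np
from scipy.fft import dct

def sketch_and_solve(A, b, sketch_size, rng):
    m, n = A.shape
    signs = rng.choice([-1.0, 1.0], size=m)                 # random diagonal sign flips
    SA = dct(signs[:, None] * A, axis=0, norm="ortho")      # mix rows: F D A
    Sb = dct(signs * b, norm="ortho")                       # mix rows: F D b
    rows = rng.choice(m, size=sketch_size, replace=False)   # uniform row sample
    x, *_ = np.linalg.lstsq(SA[rows], Sb[rows], rcond=None)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((20_000, 40))
b = rng.standard_normal(20_000)
x_sketch = sketch_and_solve(A, b, sketch_size=2_000, rng=rng)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
# Residual of the sketched solution relative to the optimal residual (>= 1).
print(np.linalg.norm(A @ x_sketch - b) / np.linalg.norm(A @ x_exact - b))
```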

So far, most research on randomized least-squares regression algorithms has focused on the single processor setting, with two important exceptions. Meng et al. [10] introduced LSRN, a distributed memory algorithm for least-squares problems based on random Gaussian projections. While the algorithm is still an $O(mn^2)$ algorithm, the benefits of randomization are apparent with respect to both the constants in the asymptotic analysis and its much improved efficiency in parallel environments. Yang et al. [9] consider RandNLA in a MapReduce-like framework called Spark. This framework is less appropriate for supercomputers, as it fails to take advantage of their hardware architecture.

In this work, we explore the behavior of Blendenpik-type algorithms in a distributed memory setting. We show that variants of Blendenpik that use various batchwise transformations to compute preconditioners lead to implementations that are not only faster than state-of-the-art implementations of baseline least-squares solvers, but are also able to scale to much larger matrix sizes. In particular, we show that a Blendenpik based algorithm can solve least-squares regression problems with terabyte-sized (and larger) input matrices. Our implementation and experiments were run on AMOS,1 the high-performance Blue Gene/Q supercomputer system at Rensselaer. AMOS has five racks, 5120 nodes (81,920 cores), and 81,920 GB of main memory. AMOS has a peak performance of one PetaFLOP (10^15 floating-point operations per second), and a 5-D torus network with 2 GB/s of bandwidth per link and 512 GB/s to 1 TB/s of bisection network bandwidth per rack, depending on the torus network configuration. Due to runtime constraints imposed by the scheduling system for each partition of AMOS, we limited our experiments to partitions containing up to 1024 nodes (16,384 cores). The Blue Gene/Q architecture supports a hybrid communication framework that uses the MPI (Message Passing Interface) [11] standard for distributed communication and multithreading using OpenMP [12].

Our main contributions in this paper are: (i) implementation of and experimentation with the Blendenpik algorithm on distributed-memory platforms; (ii) implementation of four randomized sketching transforms in the context of the Blendenpik algorithm; (iii) a batchwise transformation scheme that allows a distributed vanilla implementation of the Randomized Discrete Cosine Transform within Blendenpik to handle matrices up to three times larger, and provides a speedup of up to 7.5 times over a state-of-the-art scalable least-squares solver for tera-scale matrices; (iv) a detailed evaluation of four randomized sketching transforms and their parameters in the context of the Blendenpik algorithm on the BG/Q, using up to 16,384 cores.

The full source code of our batchwise Blendenpik implementation is available for download at https://github.com/cjiyer/libskylark/tree/batchwiseblendenpik. The rest of this paper is organized as follows. Section 2 describes the Blendenpik algorithm and its stages in detail. Section 3 presents our distributed Blendenpik implementation on the Blue Gene/Q, discusses scalability issues in our implementation, and describes an approach to overcome them. Section 4 first describes experiments to tune our Blue Gene/Q environment for our evaluations and then discusses the outcome of our experimental evaluations.

Notation. Let A, B, … denote matrices and let x, y, z, … denote column vectors. Given a vector $x \in \mathbb{R}^{m}$, let $\|x\|_{2}^{2} = \sum_{i=1}^{m} x_{i}^{2}$ be (the square of) its Euclidean norm; given a matrix $A \in \mathbb{R}^{m \times n}$, let $\|A\|_{F}^{2} = \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^{2}$ be (the square of) its Frobenius norm. Let $\sigma_{1} \geq \sigma_{2} \geq \cdots \geq \sigma_{r} > 0$ be the nonzero singular values of A, where $r = \operatorname{rank}(A)$ is the rank of the matrix A. Then, the condition number of A is $\kappa(A) = \sigma_{1}/\sigma_{r}$.
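The notation maps directly onto standard linear-algebra routines; the short check below (purely illustrative) verifies the definitions numerically.

```python
# Euclidean norm, Frobenius norm, and condition number kappa(A) = sigma_1 / sigma_r.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 10))
x = rng.standard_normal(100)

euclidean_sq = np.sum(x**2)                  # ||x||_2^2
frobenius_sq = np.sum(A**2)                  # ||A||_F^2
sigma = np.linalg.svd(A, compute_uv=False)   # singular values, in descending order
r = np.linalg.matrix_rank(A)
kappa = sigma[0] / sigma[r - 1]              # sigma_1 / sigma_r

assert np.isclose(euclidean_sq, np.linalg.norm(x) ** 2)
assert np.isclose(frobenius_sq, np.linalg.norm(A, "fro") ** 2)
assert np.isclose(kappa, np.linalg.cond(A, 2))
```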

Section snippets

The Blendenpik Algorithm for dense overdetermined systems

Blendenpik (see Algorithm 1) is a solver for dense, overdetermined, full column rank least-squares problems that computes an approximate solution to the problem of Eq. (1) with a high degree of precision. Given a dense, tall-and-thin matrix $A \in \mathbb{R}^{m \times n}$ and a column vector $b \in \mathbb{R}^{m}$, the algorithm returns an approximate solution by executing the following three steps:

  • 1.

    A preconditioner is constructed by applying a randomized unitary (or approximately unitary) transform F to the input matrix A
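The snippet below is a minimal, single-node sketch of this three-step structure (randomized sketch of A, QR of the sketch to obtain a preconditioner R, then an iterative solve of the preconditioned system). It is an illustration under our own assumptions: the sampling factor gamma, the uniform row sampling, and SciPy's LSQR are chosen for simplicity and are not taken from the paper's distributed, batchwise implementation.

```python
# Sketch-and-precondition in the spirit of Blendenpik, on a single node.
import numpy as np
from scipy.fft import dct
from scipy.linalg import qr, solve_triangular
from scipy.sparse.linalg import lsqr, LinearOperator

def blendenpik_like(A, b, gamma=4.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    # Step 1: randomized DCT mixing (sign flips + orthonormal DCT),
    # then uniform sampling of roughly gamma * n mixed rows.
    signs = rng.choice([-1.0, 1.0], size=m)
    mixed = dct(signs[:, None] * A, axis=0, norm="ortho")
    rows = rng.choice(m, size=int(gamma * n), replace=False)
    # Step 2: QR of the sketch; the triangular factor R preconditions A.
    R = qr(mixed[rows], mode="economic")[1]
    # Step 3: LSQR on the well-conditioned operator A R^{-1} (R is never inverted explicitly).
    op = LinearOperator(
        (m, n),
        matvec=lambda y: A @ solve_triangular(R, y, lower=False),
        rmatvec=lambda y: solve_triangular(R, A.T @ y, lower=False, trans="T"),
    )
    y = lsqr(op, b, atol=1e-14, btol=1e-14)[0]
    return solve_triangular(R, y, lower=False)   # undo the change of variables
```

Because the sketch approximately preserves the row-space geometry of A, the preconditioned operator A R⁻¹ is well conditioned, so the iterative solver converges in a small number of iterations that is essentially independent of κ(A).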

Implementing our algorithm on the Blue Gene/Q

The algorithm is implemented on top of the Elemental library [15]. Given a distributed environment over p processes, any dense matrix $A \in \mathbb{R}^{m \times n}$ is partitioned in Elemental into rectangular grids of size r × c in a 2D cyclic distribution, such that p = r × c and both r and c are O(√p). Elemental allows a matrix to be distributed in more than one way. An overview of various data distributions available in Elemental is given in Table 1 (not exhaustive). We use the standard distribution [MC, MR]
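As a toy illustration of the element-cyclic [MC, MR] ownership rule (our reading of Elemental's default distribution; the linear rank numbering below is one possible convention chosen only for display), the snippet maps each global entry (i, j) to the process at grid position (i mod r, j mod c).

```python
# Which process owns entry (i, j) under an element-cyclic [MC, MR] distribution?
import numpy as np

def owner_grid(m, n, r, c):
    rows = np.arange(m)[:, None] % r   # MC: matrix rows cycle over the r grid rows
    cols = np.arange(n)[None, :] % c   # MR: matrix columns cycle over the c grid columns
    return rows * c + cols             # display ranks in row-major grid order

# A 6 x 6 matrix on a 2 x 3 process grid: each entry shows its owner's rank.
print(owner_grid(6, 6, r=2, c=3))
```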

Evaluation

To generate terascale-sized dense matrices, we randomly perturb sparse matrices with values generated from a standard normal distribution. We relied on the UFL Sparse matrix collection [18] for obtaining matrices of different condition numbers and coherence values. We chose to work with the Yoshiyasu Mesh matrix,2 the ESOC Springer matrix,3 and the Rucci

Conclusions and future work

We implemented and thoroughly evaluated a highly scalable, distributed memory, least-squares solver based on the Blendenpik algorithm. Our solver, which is based on an implementation of the Blendenpik algorithm in a distributed setting coupled with various batchwise transformations in order to construct an appropriate preconditioner, beats state-of-the-art least-squares solvers with respect to running time and scales to much larger matrices compared to prior work. In future work, we plan to

References (20)

  • D. Achlioptas

    Database-friendly random projections: Johnson-Lindenstrauss with binary coins

    J. Comput. Syst. Sci.

    (2003)
  • K.A. Gallivan et al.

    Parallel algorithms for dense linear algebra computations

    SIAM Rev.

    (1990)
  • J. Demmel et al.

    Communication-optimal parallel and sequential QR and LU factorizations. UC Berkeley Technical Report EECS-2008-89

    (Aug 1, 2008)
  • J. Demmel et al.

    Communication Avoiding (CA) and Other Innovative Algorithms

    (2014)
  • P. Drineas et al.

    RandNLA: Randomized numerical linear algebra

    Commun. ACM

    (2016)
  • T. Sarlos

    Improved approximation algorithms for large matrices via random projections

  • P. Drineas et al.

    Faster least squares approximation

    Numer. Math.

    (2011)
  • V. Rokhlin et al.

    A fast randomized algorithm for overdetermined linear least-squares regression

    Proc. Natl. Acad. Sci. U.S.A.

    (2008)
  • H. Avron et al.

    Blendenpik: Supercharging LAPACK's least-squares solver

    SIAM J. Sci. Comput.

    (2010)
  • J. Yang et al.

    Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments, CoRR abs/1502.03032

    (2015)

Chander Iyer is currently a 4th year Ph.D. student in the Department of Computer Science at Rensselaer Polytechnic Institute. He received his B.E. from Mumbai University, Mumbai, India, in 2003 and received his M.Tech. in 2010 from Indian Institute of Technology, Bombay. He is currently being advised by Prof. Petros Drineas, with Prof. Chris Carothers as his co-advisor. His research interests lie at the intersection of Randomized Algorithms for large scale datasets, High Performance Computing and Machine Learning.

Haim Avron did his Ph.D. at the School of Computer Science at Tel Aviv University under the supervision of Prof. Sivan Toledo. Afterwards he spent two years as a Postdoctoral Researcher in the Business Analytics & Mathematical Sciences department at the IBM T.J. Watson Research Center. From 2012 to 2015 he was a Research Staff Member in the Mathematical Sciences & Analytics department at the IBM T.J. Watson Research Center. He joined the Department of Applied Mathematics, School of Mathematical Sciences at Tel Aviv University as a Senior Lecturer (equivalent to assistant professor) in 2015. His research focuses on numerical computing and high performance computing and their applications in scientific computing and machine learning. His interests and work range from mathematical and computational foundations to end-to-end implementation aspects.

Georgios Kollias received the B.Sc. in Physics in 2000 and the M.Sc. in Computational Science in 2002 from the University of Athens, Greece, and the Ph.D. in Computer Science from the University of Patras, Greece, in 2009. He moved to Purdue University, USA in October 2009 and worked as a Postdoctoral Researcher in the Computer Science Department and the Center for Science of Information until May 2013. He then joined the IBM T.J. Watson Research Center, USA, and in August 2014 he moved to the IBM Zurich Research Lab. He returned to the IBM T.J. Watson Research Center in April 2015 as a Research Staff Member in the area of Big Data Management and Analytics. His research interests include Parallel, Distributed and High Performance Computing, Numerical Linear Algebra and Matrix Computations, Graph Mining, Data Analytics and Problem Solving Environments.

Yves Ineichen received his M.Sc. in Computer Science in 2008, and the Ph.D. in Computer Science in 2013, from the Federal Institute of Technology Zurich (ETHZ), Switzerland. At the beginning of 2013 he joined the IBM Research Center in Rueschlikon, Switzerland. His research interests include High Performance Computing, Optimization, Numerical Linear Algebra, Compiler Design, and Programming Languages. Yves is a recipient of the PRACE (2012) and ACM Gordon Bell (2015) awards.

Professor Chris Carothers is a faculty member in the Computer Science Department at Rensselaer Polytechnic Institute. He received the Ph.D., M.S., and B.S. from Georgia Institute of Technology in 1997, 1996, and 1991, respectively. Prior to joining RPI in 1998, he was a research scientist at the Georgia Institute of Technology. His research interests are focused on massively parallel computing, involving the creation of high fidelity models of extreme-scale networks and computer systems. These models have been executed on nearly 2,000,000 processing cores on the largest leadership class supercomputers in the world. Professor Carothers is an NSF CAREER Award winner as well as a Best Paper award winner at the ACM-SIGSIM PADS Conference in 1999, 2003 and 2009. Since joining Rensselaer, he has developed a world-class research portfolio which includes funding from the NSF, the U.S. Department of Energy, Army Research Laboratory, Air Force Research Laboratory, as well as several companies, including IBM, General Electric, and AT&T.

Additionally, Professor Carothers serves as the Director for the Rensselaer Center for Computational Innovations (CCI). CCI is a partnership between Rensselaer and IBM. The center provides computation and storage resources to a diverse network of researchers, faculty, and students from Rensselaer, government laboratories, and companies across a number of science and engineering disciplines. The flagship supercomputer is a 1 petaFLOP IBM Blue Gene/Q system with 80 terabytes of memory, 81,920 processing cores and over 2 petabytes of disk storage.

Professor Petros Drineas is an Associate Professor at the Computer Science Department of Rensselaer Polytechnic Institute. Prof. Drineas earned a PhD in Computer Science from Yale University in May of 2003, and a BS in Computer Engineering and Informatics from the University of Patras, Greece, in July of 1997. Prof. Drineas’ research interests lie in the design and analysis of randomized algorithms for linear algebraic problems, as well as their applications to the analysis of modern, massive datasets. Prof. Drineas received an NSF CAREER in 2006; was a Visiting Professor at the US Sandia National Laboratories during the fall of 2005; was a Visiting Fellow at the Institute for Pure and Applied Mathematics at the University of California, Los Angeles in the fall of 2007; and was a Visiting Professor at the University of California Berkeley in the fall of 2013. Prof. Drineas has also served the US National Science Foundation (NSF) as a Program Director in the Information and Intelligent Systems (IIS) Division and the Computing and Communication Foundations (CCF) Division (2010–2011). Prof. Drineas has published over 90 articles in conferences and journals in Theoretical Computer Science, Numerical Linear Algebra, and statistical data analysis.

This work was partially supported by the XDATA program of the Defense Advanced Research Projects Agency (DARPA), administered through Air Force Research Laboratory contract FA8750-12-C-0323, as well as by NSF IIS-1302231.
