Balancing task- and data-level parallelism to improve performance and energy consumption of matrix computations on the Intel Xeon Phi

https://doi.org/10.1016/j.compeleceng.2015.06.009

Abstract

The emergence of new manycore architectures, such as the Intel Xeon Phi, poses new challenges in how to adapt existing libraries and applications to this type of system. In particular, the exploitation of manycore accelerators requires a holistic solution that simultaneously addresses time-to-response, energy efficiency and ease of programming. In this paper, we adapt the SuperMatrix runtime task scheduler for dense linear algebra algorithms to the many-threaded Intel Xeon Phi, with special emphasis on the performance and energy profile of the solution. From the performance perspective, we optimize the balance between task- and data-parallelism, reporting notable results compared with Intel MKL. From the energy-aware point of view, we propose a methodology that relies on core-level event counters and aggregated power consumption samples to obtain a task-level accounting of the energy. In addition, we introduce a blocking mechanism to reduce power and energy consumption during the idle periods inherent to task-parallel executions.

Introduction

The performance of today’s CMOS-based computer architectures is constrained by the cooling capacity of this technology and the power budget [1], [2], [3]. Concretely, the end of Dennard scaling [4] in the middle of the past decade marked the end of the “GHz race”, and the shift towards multicore designs due to their more appealing performance-energy balance. Since then, the doubling of transistors on chip with each new semiconductor generation, dictated by Moore’s law [5], has only exacerbated the problem [6].

In response to the power wall, many high performance computing (HPC) facilities have deployed heterogeneous clusters, equipped with manycore accelerators, such as AMD or NVIDIA graphics processing units (GPUs) or the Intel Xeon Phi, due to their favorable energy-performance balance as well as excellent acceleration for many compute-intensive applications. Moreover, the introduction of parallel programming standards, such as CUDA, OpenACC and OpenCL, has further increased the appeal of accelerator technologies. Nevertheless, programming a heterogeneous platform consisting of one to several general-purpose multicore processors and one or more manycore accelerators is still a considerable challenge. The reason is that, in addition to facing the programming difficulties intrinsic to exploiting an ample amount of hardware concurrency, in many cases the developer also has to cope with the existence of multiple memory address spaces.

SuperMatrix is the run-time embedded in the libflame library [7] for the execution of dense linear algebra (DLA) operations on multicore desktop servers [8], heterogeneous CPU–GPU systems [9], [10], and small-scale clusters [11]. The SuperMatrix run-time follows the methodology advocated in the FLAME project, which promotes a separation of concerns between the derivation of new algorithms for DLA operations, their practical coding (implementation), and their high-performance execution on a given platform. SuperMatrix orchestrates a seamless, task-parallel execution of the full functionality of the libflame DLA library [7].

In this paper, we present an extension of SuperMatrix specifically tailored to tackle the considerable amount of hardware concurrency in the many-threaded Intel Xeon Phi processor. In doing so, our paper makes the following contributions:

  • We adapt SuperMatrix to the Intel Xeon Phi manycore/many-threaded accelerator, demonstrating the benefits of abstracting the design and implementation of DLA algorithms from their practical, architecture-aware high performance execution. For our particular scenario, we rely on the native programming model, considering the accelerator as a stand-alone platform, in charge of the execution of both the runtime and the computational workload. With a research trend pointing in the direction of integrating accelerators and conventional architectures into the same chip, we envision a scenario where the accelerator becomes the main processor and, therefore, the native programming model is natural.

  • We investigate the impact on performance of exploiting the concurrency implicit in DLA operations at two different levels: as task-parallelism only, exposed by the run-time, or as a combination of task- and data-parallelism, with the latter extracted via a multi-threaded implementation of the BLAS (Basic Linear Algebra Subprograms); the first code sketch after this list illustrates this hybrid approach.

  • From the perspective of energy efficiency, we describe a methodology that relies on core-level event counters and aggregated power consumption samples from the complete accelerator to deliver a detailed accounting of the energy dissipated by a DLA operation at the granularity of individual tasks.

  • We illustrate the positive effect on energy consumption of modifying the SuperMatrix run-time to adopt an idle-wait (blocking) approach for idle threads instead of the conventional power-hungry busy-wait (polling); the second code sketch after this list contrasts the two policies.

  • We provide an experimental evaluation and validation of the above contributions using a key DLA kernel, the Cholesky factorization, representative of the parallelism exhibited by many kernels in BLAS and LAPACK.
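As a concrete illustration of the balance between task- and data-parallelism, the sketch below combines an outer task-parallel loop with multi-threaded BLAS calls using plain OpenMP and Intel MKL. It is not the SuperMatrix code: the 16 × 15 split of the 240 hardware threads of a 60-core Xeon Phi 5110P, the gemm_task helper, and the use of an OpenMP loop in place of the runtime's ready queue are illustrative assumptions only.

    /* Hybrid task- and data-parallel sketch: NUM_WORKERS task-level worker
     * threads, each calling a BLAS kernel that runs with BLAS_THREADS MKL
     * threads. Nested parallelism must be allowed (e.g., MKL_DYNAMIC=false)
     * for the inner level to actually fan out. */
    #include <omp.h>
    #include <mkl.h>

    #define NUM_WORKERS  16   /* task-level workers                    */
    #define BLAS_THREADS 15   /* BLAS threads per task (16 * 15 = 240) */

    /* One illustrative task: C := C + A * B on n x n column-major blocks. */
    static void gemm_task(int n, const double *A, const double *B, double *C)
    {
        /* Limit MKL to BLAS_THREADS threads for calls made by this thread. */
        mkl_set_num_threads_local(BLAS_THREADS);
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 1.0, C, n);
    }

    /* Outer level: task parallelism among the workers; inner level: data
     * parallelism inside each BLAS call. */
    void run_ready_tasks(int ntasks, int n, double **A, double **B, double **C)
    {
        #pragma omp parallel for num_threads(NUM_WORKERS) schedule(dynamic)
        for (int t = 0; t < ntasks; t++)
            gemm_task(n, A[t], B[t], C[t]);
    }

Mapping each worker and its BLAS threads to disjoint physical cores (e.g., via an explicit affinity setting) is what avoids oversubscribing the 4-way multi-threaded cores.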
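The energy-aware modification of the run-time (fourth contribution) can be summarized with the following sketch, which contrasts a busy-wait loop with a blocking wait on a POSIX condition variable. The ready_queue_t type and its fields are illustrative and do not correspond to the actual SuperMatrix data structures.

    #include <pthread.h>
    #include <stdbool.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  not_empty;
        int             nready;   /* number of ready tasks (illustrative) */
        bool            done;     /* no more tasks will be produced       */
    } ready_queue_t;

    /* Busy-wait (polling): the idle thread re-checks the queue in a tight
     * loop, keeping its core active and drawing power even with no work. */
    int poll_for_task(ready_queue_t *q)
    {
        for (;;) {
            pthread_mutex_lock(&q->lock);
            if (q->nready > 0) { q->nready--; pthread_mutex_unlock(&q->lock); return 1; }
            if (q->done)       { pthread_mutex_unlock(&q->lock); return 0; }
            pthread_mutex_unlock(&q->lock);   /* spin and try again */
        }
    }

    /* Idle-wait (blocking): the thread sleeps on a condition variable until
     * a task is enqueued, letting the hardware thread go quiescent. */
    int wait_for_task(ready_queue_t *q)
    {
        pthread_mutex_lock(&q->lock);
        while (q->nready == 0 && !q->done)
            pthread_cond_wait(&q->not_empty, &q->lock);
        int got = (q->nready > 0);
        if (got) q->nready--;
        pthread_mutex_unlock(&q->lock);
        return got;
    }

    /* A producer enqueues a task and wakes one sleeping worker with:
     *   pthread_mutex_lock(&q->lock);  q->nready++;
     *   pthread_cond_signal(&q->not_empty);  pthread_mutex_unlock(&q->lock); */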

The rest of the paper is structured as follows. In Section 2 we further motivate our work, we review several related efforts, and we clarify the differences between these and our approach. In Section 3, we briefly present the SuperMatrix run-time system using the Cholesky factorization as a workhorse case study. In Sections 4 and 5, we describe and evaluate the different approaches that are employed to exploit the hardware concurrency of a 60-core Intel Xeon Phi 5110P accelerator. At this point, we note that our approach exploits the parallelism present at two different levels: at the task level via worker threads of the SuperMatrix run-time and at the data-parallel level via BLAS threads. (In addition, the codes implicitly exploit the SIMD parallelism of the floating-point units, or FPUs, of the Intel Xeon Phi by means of the vector operations embedded inside tuned implementations of the BLAS.) In Section 6, we introduce our two main energy-related contributions, namely the methodology to conduct a task-level energy accounting and the evaluation of an energy-aware scheduler. Finally, the paper is closed with a few concluding remarks in Section 7.

Section snippets

High performance and programmability

In recent years, a number of run-times have been proposed to alleviate the burden of programming multi- and many-threaded platforms. Concretely, OmpSs, StarPU, and Harmony, among others, offer implicit parallel programming models with dependence analysis. When applied to the DLA operations that lie at the bottom of the “food chain” of many scientific compute-intensive applications [13],

SuperMatrix task-parallel execution of DLA operations

In this section, we describe the internal operation of the SuperMatrix run-time task analyzer and scheduler using the Cholesky factorization. This operation decomposes a symmetric positive definite (s.p.d.) matrix A ∈ ℝ^{n×n} into the product A = LL^T, where L ∈ ℝ^{n×n} is lower triangular. Similar ideas to those exposed next underlie the task-parallel execution of the LU and QR factorizations for the solution of dense linear systems in particular, and many other DLA operations in general.
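For concreteness, the sketch below shows the right-looking Cholesky algorithm-by-blocks that the runtime decomposes into tasks: every POTRF, TRSM, SYRK, and GEMM call on a b × b block becomes a node of the task graph, with dependencies inferred from the blocks each call reads and writes. The storage-by-blocks layout (A[i][j] pointing to a contiguous column-major block of the lower triangle) and the strictly sequential ordering are simplifications of what SuperMatrix actually records and schedules.

    #include <mkl.h>   /* LAPACKE and CBLAS interfaces */

    /* Cholesky factorization by blocks of an s.p.d. matrix stored as an
     * nb x nb grid of b x b column-major blocks; only the lower triangle
     * of the block grid is referenced. */
    void chol_by_blocks(int nb, int b, double ***A)
    {
        for (int k = 0; k < nb; k++) {
            /* A[k][k] := Cholesky(A[k][k])                    (POTRF task) */
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', b, A[k][k], b);

            /* A[i][k] := A[i][k] * inv(A[k][k])^T             (TRSM tasks) */
            for (int i = k + 1; i < nb; i++)
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                            CblasNonUnit, b, b, 1.0, A[k][k], b, A[i][k], b);

            for (int i = k + 1; i < nb; i++) {
                /* A[i][i] := A[i][i] - A[i][k] * A[i][k]^T    (SYRK task)  */
                cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                            b, b, -1.0, A[i][k], b, 1.0, A[i][i], b);

                /* A[i][j] := A[i][j] - A[i][k] * A[j][k]^T    (GEMM tasks) */
                for (int j = k + 1; j < i; j++)
                    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                                b, b, b, -1.0, A[i][k], b, A[j][k], b,
                                1.0, A[i][j], b);
            }
        }
    }

In the actual runtime the calls above are not executed immediately: the analyzer records the blocks each task reads and writes to build the dependency graph, and the scheduler later dispatches ready tasks to worker threads.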

The Intel Xeon Phi architecture

In the current generation, the Intel Xeon Phi is a manycore co-processor composed of up to 61 x86 cores, each extended with a 512-bit vector FPU and a fully coherent private L2 cache. The cores feature a short in-order pipeline and support up to 4 hardware threads, which share separate 32-Kbyte L1 instruction and data caches and a 512-Kbyte L2 cache. The chip can integrate up to 8 memory controllers, each one with two GDDR5 channels. The co-processor is connected to the host system through a

Architecture-specific scheduling on the Intel Xeon Phi

While the benefits of the naive approach described in the previous section are obvious (easy port to the Intel manycore architecture), this strategy potentially exhibits a serious performance bottleneck, directly derived from the underlying architecture characteristics of the Intel Xeon Phi:

  • The overhead introduced by run-time task schedulers when orchestrating parallel executions has been traditionally neglected due to the reduced number of processing units (and, therefore, worker threads) in

Experimental setup for the energy evaluation

In recent years, as the power wall gained relevance, hardware vendors have started to make power consumption information available to the user. Some relevant advances include Intel’s RAPL for the Intel Xeon “Sandy/Ivy-Bridge” and “Haswell” [24] processors; NVIDIA’s NVML for some of the GPUs in the “Kepler” generation [25]; and IBM’s AMESTER for the Power7 processor [26], [27]. Unfortunately, the Intel Xeon Phi 5110P does not offer direct high-frequency access to power consumption information
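Whatever the source of the samples (an external meter or the coprocessor's own monitoring interface), the accounting step can be sketched as follows: a sampling thread builds a time-stamped trace of the aggregated power, each task records its start and end times, and the energy attributed to a task is obtained by integrating the trace over that interval. The helper read_board_power_watts() and the fixed-size trace are illustrative assumptions, and the sketch omits the per-core event counters that our methodology uses to apportion power among tasks running concurrently.

    #include <stddef.h>
    #include <time.h>

    #define MAX_SAMPLES 1000000

    static double sample_time[MAX_SAMPLES];   /* seconds (monotonic clock) */
    static double sample_power[MAX_SAMPLES];  /* watts, whole accelerator  */
    static size_t nsamples;

    double read_board_power_watts(void);      /* hypothetical helper */

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1e-9 * ts.tv_nsec;
    }

    /* Invoked periodically (e.g., every few milliseconds) by a sampling thread. */
    void take_power_sample(void)
    {
        if (nsamples < MAX_SAMPLES) {
            sample_time[nsamples]  = now_sec();
            sample_power[nsamples] = read_board_power_watts();
            nsamples++;
        }
    }

    /* Energy (joules) attributed to a task executed in [t_start, t_end]:
     * trapezoidal integration of the power trace restricted to that interval. */
    double task_energy(double t_start, double t_end)
    {
        double energy = 0.0;
        for (size_t i = 1; i < nsamples; i++) {
            double a = sample_time[i - 1] > t_start ? sample_time[i - 1] : t_start;
            double b = sample_time[i]     < t_end   ? sample_time[i]     : t_end;
            if (b > a)
                energy += 0.5 * (sample_power[i - 1] + sample_power[i]) * (b - a);
        }
        return energy;
    }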

Concluding remarks

In this paper, we have explored different solutions to profile and improve the performance and energy efficiency of the SuperMatrix runtime task scheduler adapted to the many-core Intel Xeon Phi architecture.

From the performance perspective, we have adapted the SuperMatrix task scheduler to the Intel Xeon Phi, introducing a hybrid approach to exploit task and data parallelism simultaneously. In combination with an appropriate thread-to-core affinity mapping, our experimental results for a

Acknowledgments

This research was supported by project CICYT TIN2011-23283, CICYT-TIN 2012-32180, FEDER, and the EU Project FP7 318793 “EXA2GREEN”. We thank Rafael Rodríguez, Sandra Catalán, and the members of the FLAME team for their support. This work was partially conducted while Francisco D. Igual and Enrique S. Quintana-Ortí were visiting The University of Texas at Austin, funded by the JTO visitor applications programme from the Institute for Computational Engineering and Sciences (ICES) at UT.

References (31)

  • Duranton M, et al. The HiPEAC vision for advanced computing in horizon 2020, <http://www.hipeac.net/roadmap>,...
  • Lavignon JF, et al. ETP4HPC strategic research agenda achieving HPC leadership in...
  • Lucas R, et al. Top ten Exascale research challenges,...
  • R. Dennard et al. Design of ion-implanted MOSFET’s with very small physical dimensions. IEEE J Solid-State Circuits (1974)
  • G. Moore. Cramming more components onto integrated circuits. Electronics (1965)
  • Esmaeilzadeh H, Blem E, St. Amant R, Sankaralingam K, Burger D. Dark silicon and the end of multicore scaling. In:...
  • Zee FGV. libflame. The complete reference, in preparation. <http://www.cs.utexas.edu/users/flame>,...
  • G. Quintana-Ortí et al. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Trans Math Software (2009)
  • Quintana-Ortí G, Igual FD, Quintana-Ortí ES, van de Geijn R. Solving dense linear algebra problems on platforms with...
  • P. Alonso et al. Enhancing performance and energy consumption of runtime schedulers for dense linear algebra. Concurrency Computat: Practice Exp (2014)
  • F.D. Igual et al. Scheduling algorithms-by-blocks on small clusters. Concurrency Computat: Practice Exp (2013)
  • FLAME project home page,...
  • Asanovic K, Bodik R, Catanzaro BC, Gebis JJ, Husbands P, Keutzer K, Patterson DA, Plishker WL, Shalf J, Williams SW,...
  • R.M. Badia et al. Parallelizing dense and banded linear algebra libraries using SMPSs. Concurrency Computat: Practice Exp (2009)
  • PLASMA project home page,...

Dr. Manuel F. Dolz received the BSc degree in Computer Science from the Universitat Jaume I of Castellón (Spain) in 2008, and the PhD degree from the same university in 2014. He currently works as a postdoctoral research assistant in the Scientific Computing Research Group at the Universität Hamburg (Germany). His research focuses on energy-aware high performance computing.

Francisco D. Igual received the Ph.D. in Computer Science in 2011 from the Universitat Jaume I of Castellón, Spain. In 2012, he joined the Computer Architecture and Automation Department of the Universidad Complutense de Madrid (UCM) as a postdoctoral researcher. His research interests include parallel algorithms for numerical linear algebra, task scheduling and runtime implementations.

Prof. Dr. Thomas Ludwig became Professor at Ruprecht-Karls-Universität Heidelberg in 2001, where he led the research group Parallel and Distributed Systems. Since 2009 he has been a full professor at Universität Hamburg and head of the Scientific Computing group. Additionally, he is CEO of the German Climate Computing Centre. His research interests are energy efficiency and high performance storage for high performance computing.

Luis Piñuel received his MS and PhD in Computer Science from the Universidad Complutense de Madrid (UCM) in 1996 and 2003, respectively. He joined the UCM Computer Architecture Department in 1997, where he has been an associate professor since 2006. His research interests are primarily centred on computer architecture, parallel computing and code optimisation for high performance and embedded systems.

Enrique S. Quintana-Ortí received his Ph.D. in Computer Science from the Universidad Politécnica de Valencia (Spain) in 1996. He is currently professor of Computer Architecture at the Universitat Jaume I of Castellón (Spain). He has published more than 200 papers, and contributed to libraries such as libflame. His research interests include parallel programming, linear algebra, energy efficiency and advanced architectures.

Reviews processed and recommended for publication to the Editor-in-Chief by Associate Editor Dr. Jesus Carretero.
