ABSTRACT
Modern mainstream high-performance computers adopt a Multi-Socket Multi-Core (MSMC) CPU architecture and a NUMA memory architecture. Because traditional work-stealing schedulers were designed for single-socket architectures, they incur severe shared-cache misses and remote memory accesses on these machines, which can seriously degrade the performance of memory-bound applications. To solve this problem, we propose a Locality-Aware Work-Stealing (LAWS) scheduler that better utilizes both the shared cache and the NUMA memory system. In LAWS, a load-balanced task allocator evenly splits and stores the data set of a program across all the memory nodes and allocates each task to the socket whose local memory node stores the task's data. An adaptive DAG packer then uses an auto-tuning approach to pack the execution DAG into cache-friendly subtrees, and a triple-level work-stealing scheduler schedules the subtrees and the tasks within each subtree. Experimental results show that LAWS improves the performance of memory-bound programs by up to 54.2% compared with traditional work-stealing schedulers.
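To illustrate the locality idea behind such a scheduler, the following is a minimal sketch (not the paper's implementation) of socket-aware victim selection: an idle worker first tries to steal from workers on its own socket, and falls back to remote sockets only when no local work exists. The `Worker` class and `steal` function here are hypothetical names introduced for illustration.

```python
import random
from collections import deque

class Worker:
    """A hypothetical worker thread pinned to a CPU socket."""
    def __init__(self, wid, socket):
        self.wid = wid
        self.socket = socket
        self.tasks = deque()  # this worker's task deque

def steal(thief, workers):
    """Locality-aware victim selection: prefer victims on the thief's
    own socket (shared cache, local memory node); only when the local
    socket has no work does the thief steal from a remote socket."""
    local = [w for w in workers
             if w is not thief and w.socket == thief.socket and w.tasks]
    remote = [w for w in workers
              if w.socket != thief.socket and w.tasks]
    for pool in (local, remote):
        if pool:
            victim = random.choice(pool)
            return victim.tasks.popleft()  # steal the oldest task
    return None  # no work anywhere
```

A full scheduler would add per-subtree deques and intra-subtree scheduling on top of this victim-selection order, but the two-tier preference above is the core of avoiding remote memory accesses.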
Index Terms
- LAWS: locality-aware work-stealing for multi-socket multi-core architectures