ABSTRACT
Modern mainstream high-performance computers adopt a Multi-Socket Multi-Core (MSMC) CPU architecture and a NUMA memory architecture. Because traditional work-stealing schedulers were designed for single-socket architectures, they incur severe shared-cache misses and remote memory accesses on these machines, which can seriously degrade the performance of memory-bound applications. To solve this problem, we propose a Locality-Aware Work-Stealing (LAWS) scheduler that better utilizes both the shared cache and the NUMA memory system. In LAWS, a load-balanced task allocator evenly splits and stores the data set of a program across all the memory nodes and allocates each task to the socket whose local memory node stores the task's data. An adaptive DAG packer then uses an auto-tuning approach to pack the execution DAG into cache-friendly subtrees, and a triple-level work-stealing scheduler schedules the subtrees and the tasks within each subtree. Experimental results show that LAWS improves the performance of memory-bound programs by up to 54.2% compared with traditional work-stealing schedulers.
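To illustrate the locality idea behind such a scheduler, the following is a minimal sketch (not the paper's implementation) of socket-aware victim selection: an idle worker first tries to steal from workers on its own socket, and falls back to remote sockets only when no local work exists. The `Worker` class and `steal` function here are hypothetical names introduced for illustration.

```python
import random
from collections import deque

class Worker:
    """A hypothetical worker thread pinned to a CPU socket."""
    def __init__(self, wid, socket):
        self.wid = wid
        self.socket = socket
        self.tasks = deque()  # this worker's task deque

def steal(thief, workers):
    """Locality-aware victim selection: prefer victims on the thief's
    own socket (shared cache, local memory node); only when the local
    socket has no work does the thief steal from a remote socket."""
    local = [w for w in workers
             if w is not thief and w.socket == thief.socket and w.tasks]
    remote = [w for w in workers
              if w.socket != thief.socket and w.tasks]
    for pool in (local, remote):
        if pool:
            victim = random.choice(pool)
            return victim.tasks.popleft()  # steal the oldest task
    return None  # no work anywhere
```

A full scheduler would add per-subtree deques and intra-subtree scheduling on top of this victim-selection order, but the two-tier preference above is the core of avoiding remote memory accesses.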
Index Terms
- LAWS: locality-aware work-stealing for multi-socket multi-core architectures