DOI: 10.1145/2597652.2597665
research-article

LAWS: locality-aware work-stealing for multi-socket multi-core architectures

Published: 10 June 2014

ABSTRACT

Modern mainstream computers adopt a Multi-Socket Multi-Core (MSMC) CPU architecture and a NUMA-based memory architecture. Because traditional work-stealing schedulers are designed for single-socket architectures, they incur severe shared-cache misses and remote memory accesses on these machines, which can seriously degrade the performance of memory-bound applications. To solve this problem, we propose a Locality-Aware Work-Stealing (LAWS) scheduler that better utilizes both the shared cache and the NUMA memory system. In LAWS, a load-balanced task allocator evenly splits the data set of a program across all the memory nodes and allocates each task to the socket whose local memory node stores its data. An adaptive DAG packer then uses an auto-tuning approach to optimally pack the execution DAG into many cache-friendly subtrees. Meanwhile, a triple-level work-stealing scheduler schedules the subtrees and the tasks within each subtree. Experimental results show that LAWS can improve the performance of memory-bound programs by up to 54.2% compared with traditional work-stealing schedulers.
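
The allocation idea in the abstract (split the data set evenly across memory nodes, then run each task on the socket whose local node holds its data) can be illustrated with a minimal sketch. This is not the LAWS implementation: the function names `split_evenly`, `home_socket`, and `allocate_tasks` are hypothetical, the data set is assumed to be a flat index range, one memory node per socket is assumed, and each task is assumed to be identified by the first data index it touches.

```python
# Hypothetical sketch of locality-aware task allocation (not the paper's code).
# Assumptions: one memory node per socket, data is a flat index range,
# and each task is tagged with the first data index it accesses.

def split_evenly(n_items, n_nodes):
    """Return one half-open (start, end) range per memory node.

    Sizes differ by at most one item, so the data is spread evenly.
    """
    base, extra = divmod(n_items, n_nodes)
    ranges = []
    start = 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def home_socket(item_index, ranges):
    """Return the socket whose local memory node stores item_index."""
    for socket, (start, end) in enumerate(ranges):
        if start <= item_index < end:
            return socket
    raise IndexError(f"item {item_index} is outside the data set")


def allocate_tasks(tasks, ranges):
    """Group task names by the socket that holds their data.

    tasks: iterable of (task_name, first_item_index) pairs.
    """
    per_socket = {socket: [] for socket in range(len(ranges))}
    for name, first_item in tasks:
        per_socket[home_socket(first_item, ranges)].append(name)
    return per_socket


if __name__ == "__main__":
    ranges = split_evenly(10, 4)
    print(ranges)
    print(allocate_tasks([("t0", 0), ("t1", 5), ("t2", 9)], ranges))
```

In a real runtime the ranges would correspond to pages placed on specific NUMA nodes and the per-socket lists would feed per-socket task pools; both of those steps are outside what this sketch shows.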


    • Published in

      ICS '14: Proceedings of the 28th ACM international conference on Supercomputing
      June 2014
      378 pages
      ISBN: 9781450326421
      DOI: 10.1145/2597652

      Copyright © 2014 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      ICS '14 paper acceptance rate: 34 of 160 submissions (21%). Overall acceptance rate: 584 of 2,055 submissions (28%).
