DOI: 10.1145/3392717.3392763
Research article

Chunking loops with non-uniform workloads

Published: 29 June 2020

ABSTRACT

Task-parallel languages such as X10 implement a dynamic, lightweight task-parallel execution model, in which programmers are encouraged to express the ideal parallelism in the program. Prior work has used loop chunking to extract useful parallelism from this ideal parallelism. Traditional loop-chunking techniques assume that the iterations of a loop have similar workloads, or that the behavior of the first few iterations can be used to predict the load of later iterations. However, in loops with non-uniform work distribution, such assumptions do not hold. The problem becomes more complicated in the presence of atomic blocks (critical sections).
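The uniform-workload assumption behind static chunking is easy to quantify. The sketch below is ours, not the paper's: it is written in plain Java rather than X10, and it assumes a hypothetical triangular workload in which iteration i costs i work units. It computes how much work a static block partition assigns to each thread.

```java
public class BlockImbalance {
    // Hypothetical non-uniform workload: iteration i costs i units.
    static long cost(int i) {
        return i;
    }

    // Total work assigned to thread t when n iterations are split
    // into p contiguous blocks (classic block chunking).
    static long blockWork(int n, int p, int t) {
        int lo = t * n / p;
        int hi = (t + 1) * n / p;
        long w = 0;
        for (int i = lo; i < hi; i++) w += cost(i);
        return w;
    }

    public static void main(String[] args) {
        int n = 1000, p = 4;
        for (int t = 0; t < p; t++)
            System.out.println("thread " + t + ": " + blockWork(n, p, t) + " units");
    }
}
```

With n = 1000 and p = 4, the last block carries roughly seven times the work of the first (218625 vs. 31125 units), even though every thread receives the same number of iterations; no profile of the first few iterations would reveal this to a predictive scheduler before the imbalance is locked in.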

In this paper, we propose a new optimization called deep-chunking, a mixed compile-time and runtime technique that chunks the iterations of parallel-for loops based on the runtime workload of each iteration. We propose a parallel algorithm, executed by the individual threads, that efficiently computes their respective chunks so that the overall execution time is reduced. We prove that the algorithm is correct and is a 2-factor approximation. Besides simple parallel-for loops, deep-chunking can also handle loops with atomic blocks, which lead to exciting challenges. We have implemented deep-chunking in the X10 compiler and studied its performance on benchmarks taken from IMSuite. We show that, on average, deep-chunking achieves 50.48%, 21.49%, 26.72%, 32.41%, and 28.84% better performance than the un-chunked (equivalent to work-stealing), cyclic-, block-, dynamic-, and guided-chunking versions of the code, respectively.
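For contrast, here is a minimal sketch of the dynamic-chunking baseline the paper compares against (not the proposed deep-chunking algorithm). It is plain Java rather than X10, and all names and parameters, such as the chunk size of 16, are our own choices: each thread repeatedly claims the next fixed-size chunk of iterations from a shared counter, so threads that happen to draw cheap iterations simply come back for more.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class DynamicChunkDemo {
    static final int N = 1000;   // loop trip count
    static final int CHUNK = 16; // iterations claimed per grab (our choice)

    // Non-uniform workload: iteration i costs O(i) inner steps.
    static long work(int i) {
        long acc = 0;
        for (int k = 0; k <= i; k++) acc += k;
        return acc;
    }

    // Dynamic chunking: threads claim CHUNK iterations at a time
    // from a shared counter until no iterations remain.
    static long compute(int threads) {
        AtomicInteger next = new AtomicInteger(0);
        AtomicLong sum = new AtomicLong(0);
        Thread[] pool = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            pool[t] = new Thread(() -> {
                int lo;
                while ((lo = next.getAndAdd(CHUNK)) < N) {
                    int hi = Math.min(lo + CHUNK, N);
                    for (int i = lo; i < hi; i++) sum.addAndGet(work(i));
                }
            });
            pool[t].start();
        }
        for (Thread th : pool) {
            try { th.join(); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return sum.get();
    }

    public static void main(String[] args) {
        System.out.println(compute(4));
    }
}
```

The chunk size trades counter contention against load balance: large chunks amortize the atomic fetch-and-add but can strand expensive iterations on one thread, while guided chunking addresses the same tension by shrinking the chunk size as the loop drains. Neither baseline inspects the per-iteration workload itself, which is the gap the paper's deep-chunking targets.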


Published in

ICS '20: Proceedings of the 34th ACM International Conference on Supercomputing, June 2020, 499 pages.
ISBN: 9781450379830
DOI: 10.1145/3392717
General Chairs: Eduard Ayguadé, Wen-mei Hwu. Program Chairs: Rosa M. Badia, H. Peter Hofstee.
Copyright © 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor, or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher: Association for Computing Machinery, New York, NY, United States



ICS overall acceptance rate: 629 of 2,180 submissions, 29%.
