ABSTRACT
Task-parallel languages such as X10 implement a dynamic, lightweight task-parallel execution model in which programmers are encouraged to express the ideal parallelism in the program. Prior work has used loop chunking to extract useful parallelism from this ideal parallelism. Traditional loop chunking techniques assume that the iterations of a loop perform similar amounts of work, or that the behavior of the first few iterations can be used to predict the load of later iterations. In loops with non-uniform work distribution, however, these assumptions do not hold. The problem becomes more complicated in the presence of atomic blocks (critical sections).
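To see why static chunking struggles with non-uniform loops, consider a hypothetical workload where iteration i costs i units of work (this toy example is not from the paper; `chunk_loads` is an illustrative helper, not part of any chunking runtime):

```python
def chunk_loads(weights, num_threads, strategy):
    """Total work assigned to each thread under a static chunking strategy."""
    n = len(weights)
    loads = [0] * num_threads
    for i, w in enumerate(weights):
        if strategy == "block":
            # contiguous blocks of n/num_threads iterations per thread
            t = min(i * num_threads // n, num_threads - 1)
        else:  # cyclic: iteration i goes to thread i mod num_threads
            t = i % num_threads
        loads[t] += w
    return loads

weights = list(range(1, 101))  # iteration i costs i units: non-uniform
print(chunk_loads(weights, 4, "block"))   # heavily skewed toward the last thread
print(chunk_loads(weights, 4, "cyclic"))  # nearly balanced, but only because
                                          # this workload grows monotonically
```

For this triangular workload, block chunking assigns the last thread several times the work of the first, while cyclic happens to balance well; for other distributions (e.g., a few isolated heavy iterations) cyclic fails too, which is why workload-aware chunking is needed.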
In this paper, we propose a new optimization called deep-chunking, a mixed compile-time and runtime technique that chunks the iterations of parallel-for loops based on the runtime workload of each iteration. We propose a parallel algorithm, executed by the individual threads, to efficiently compute their respective chunks so that the overall execution time is reduced. We prove that the algorithm is correct and is a 2-factor approximation. In addition to simple parallel-for loops, deep-chunking can also handle loops with atomic blocks, which pose interesting challenges. We have implemented deep-chunking in the X10 compiler and studied its performance on benchmarks taken from IMSuite. We show that, on average, deep-chunking achieves 50.48%, 21.49%, 26.72%, 32.41%, and 28.84% better performance than the un-chunked (equivalent to work-stealing), cyclic-, block-, dynamic-, and guided-chunking versions of the code, respectively.
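The core idea of workload-based chunking can be sketched as follows. This is a minimal greedy illustration, not the paper's deep-chunking algorithm (which runs in parallel, interacts with the compiler, and handles atomic blocks); it assumes per-iteration workloads are known and simply splits the iteration space into contiguous chunks of roughly equal total work:

```python
def balanced_chunks(weights, num_chunks):
    """Greedily split iterations into contiguous chunks of ~equal total work.

    Returns a list of (start, end) half-open iteration ranges.
    Illustrative sketch only; assumes weights are known up front.
    """
    total = sum(weights)
    target = total / num_chunks  # ideal work per chunk
    chunks, start, acc = [], 0, 0.0
    for i, w in enumerate(weights):
        acc += w
        # close the current chunk once it reaches the per-chunk target,
        # reserving the final chunk for all remaining iterations
        if acc >= target and len(chunks) < num_chunks - 1:
            chunks.append((start, i + 1))
            start, acc = i + 1, 0.0
    chunks.append((start, len(weights)))
    return chunks

# two heavy iterations (cost 9) among light ones (cost 1):
print(balanced_chunks([1, 1, 1, 9, 1, 1, 1, 9], 2))  # → [(0, 4), (4, 8)]
```

Here each chunk receives 12 units of work, whereas splitting the iteration count in half would not change the totals for this input but does break down as soon as the heavy iterations cluster; a workload-aware splitter adapts the chunk boundaries to the actual costs.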
- A. Agarwal, D. A. Kranz, and V. Natarajan. Automatic Partitioning of Parallel Loops and Data Arrays for Distributed Shared-Memory Multiprocessors. IEEE Trans. Parallel Distrib. Syst., 6(9):943--962, September 1995.
- G. E. Blelloch. Programming parallel algorithms. Commun. ACM, 39(3):85--97, March 1996.
- R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. SIGPLAN Not., 30(8):207--216, August 1995.
- J. M. Bull. Feedback Guided Dynamic Loop Scheduling: Algorithms and Experiments. In EUROPAR, pages 377--382, 1998.
- Chapel. The Chapel language specification version 0.4. http://chapel.cray.com/, 2005.
- P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: An Object-oriented Approach to Non-uniform Cluster Computing. In OOPSLA, pages 519--538, New York, NY, USA, 2005. ACM.
- H. Chen, W. Chen, J. Huang, B. Robert, and H. Kuhn. MPIPP: An Automatic Profile-guided Parallel Process Placement Toolset for SMP Clusters and Multiclusters. In ICS, pages 353--360. ACM, 2006.
- Q. Chen, M. Guo, and Z. Huang. CATS: Cache Aware Task-stealing Based on Online Profiling in Multi-socket Multi-core Architectures. In ICS, pages 163--172, New York, NY, USA, 2012. ACM.
- A. Duran, J. Corbalán, and E. Ayguadé. An Adaptive Cut-off for Task Parallelism. In SC, pages 36:1--36:11. IEEE Press, 2008.
- M. Durand, F. Broquedis, T. Gautier, and B. Raffin. An Efficient OpenMP Loop Scheduler for Irregular Applications on Large-Scale NUMA Machines. In IWOMP, pages 141--155. Springer Berlin, 2013.
- S. Eyerman and L. Eeckhout. Modeling Critical Sections in Amdahl's Law and Its Implications for Multicore Design. In ISCA, pages 362--370, 2010.
- S. Gupta and V. K. Nandivada. IMSuite: A benchmark suite for simulating distributed algorithms. Journal of Parallel and Distributed Computing, 75:1--19, January 2015.
- S. Gupta, R. Shrivastava, and V. K. Nandivada. Optimizing Recursive Task Parallel Programs. In ICS, pages 11:1--11:11, 2017.
- Habanero. Habanero Java. http://habanero.rice.edu/hj, Dec 2009.
- B. Hamidzadeh, L. Y. Kit, and D. J. Lilja. Dynamic task scheduling using online optimization. IEEE Trans. Parallel Distrib. Syst., 11(11):1151--1163, 2000.
- B. Hamidzadeh and D. J. Lilja. Self-Adjusting Scheduling: An On-Line Optimization Technique for Locality Management and Load Balancing. In ICPP, pages 39--46. IEEE Computer Society, 1994.
- S. F. Hummel, E. Schonberg, and L. E. Flynn. Factoring: A Practical and Robust Method for Scheduling Parallel Loops. In Supercomputing, pages 610--632, New York, NY, USA, 1991. ACM.
- D. Jackson and E. J. Rollins. Chopping: A Generalization of Slicing. Technical Report CMU-CS-94-169, Carnegie Mellon University, Pittsburgh, PA, USA, 1994.
- JGF. The Java Grande Forum benchmark suite. http://www.epcc.ed.ac.uk/javagrande/javag.html.
- A. Kejariwal, A. Nicolau, and C. D. Polychronopoulos. History-aware Self-Scheduling. In ICPP, pages 185--192, Aug 2006.
- K. Kennedy and J. R. Allen. Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kaufmann Publishers Inc., 2002.
- C. Kruskal and A. Weiss. Allocating Independent Subtasks on Parallel Processors. IEEE Trans. Softw. Eng., SE-11(10), October 1985.
- S. Lucco. A Dynamic Scheduling Method for Irregular Parallel Programs. SIGPLAN Not., 27(7):200--211, July 1992.
- E. P. Markatos and T. J. LeBlanc. Using Processor Affinity in Loop Scheduling on Shared-memory Multiprocessors. In SC, pages 104--113. IEEE Computer Society Press, 1992.
- S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
- V. K. Nandivada, J. Shirako, J. Zhao, and V. Sarkar. A Transformation Framework for Optimizing Task-Parallel Programs. ACM Trans. Program. Lang. Syst., 35(1):3:1--3:48, April 2013.
- OpenMP Application Program Interface Version 4.0. http://www.openmp.org/wp-content/uploads/OpenMP4.0.0.pdf.
- P. H. Penna, A. T. A. Gomes, M. Castro, P. D. M. Plentz, H. C. Freitas, F. Broquedis, and J.-F. Méhaut. A comprehensive performance evaluation of the BinLPT workload-aware loop scheduler. Concurrency and Computation: Practice and Experience, 31(18):e5170, 2019.
- C. D. Polychronopoulos and D. J. Kuck. Guided Self-scheduling: A Practical Scheduling Scheme for Parallel Supercomputers. IEEE Trans. Comput., 36(12):1425--1439, December 1987.
- I. K. Prabhu and V. K. Nandivada. An extended report on chunking loops with non-uniform workloads. https://www.cse.iitm.ac.in/~krishna/preprints/ics20/ics20-full.pdf.
- I. K. Prabhu and V. K. Nandivada. Deep Chunking Implementation. https://github.com/indukprabhu/deepChunking.
- J. Reinders. Intel Threading Building Blocks. O'Reilly & Associates, Inc., Sebastopol, CA, USA, first edition, 2007.
- V. Saraswat, B. Bloom, I. Peshansky, O. Tardieu, and D. Grove. X10 Language Specification Version 2.4. Technical report, IBM, 2014.
- J. Shirako, J. Zhao, V. K. Nandivada, and V. Sarkar. Chunking Parallel Loops in the Presence of Synchronization. In ICS, pages 181--192. ACM, 2009.
- R. Shrivastava and V. K. Nandivada. Energy-efficient compilation of irregular task-parallel loops. TACO, 14(4):35:1--35:29, 2017.
- M. S. Squillante and E. D. Lazowska. Using Processor-Cache Affinity Information in Shared-Memory Multiprocessor Scheduling. IEEE Trans. Parallel Distrib. Syst., 4(2):131--143, February 1993.
- S. Subramaniam and D. L. Eager. Affinity Scheduling of Unbalanced Workloads. In SC, pages 214--226. IEEE Press, 1994.
- P. Thoman, H. Jordan, S. Pellegrini, and T. Fahringer. Automatic OpenMP Loop Scheduling: A Combined Compiler and Runtime Approach. In IWOMP, pages 88--101. Springer-Verlag, 2012.
- A. Tzannes, G. C. Caragea, R. Barua, and U. Vishkin. Lazy Binary-Splitting: A Run-time Adaptive Work-stealing Scheduler. In PPoPP, pages 179--190, 2010.
- T. H. Tzen and L. M. Ni. Trapezoid Self-Scheduling: A Practical Scheduling Scheme for Parallel Compilers. IEEE Trans. Parallel Distrib. Syst., 4(1):87--98, January 1993.
- A. Utture and V. K. Nandivada. Efficient lock-step synchronization in task-parallel languages. Softw. Pract. Exper., 49(9):1379--1401, 2019.
Index Terms
- Chunking loops with non-uniform workloads