ABSTRACT
Task-parallel languages such as X10 implement a dynamic, lightweight task-parallel execution model in which programmers are encouraged to express the ideal parallelism in the program. Prior work has used loop chunking to extract useful parallelism from this ideal parallelism. Traditional loop chunking techniques assume that the iterations of a loop perform similar amounts of work, or that the behavior of the first few iterations can be used to predict the load of later iterations. In loops with non-uniform work distribution, however, these assumptions do not hold. The problem becomes more complicated in the presence of atomic blocks (critical sections).
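To see why static chunking struggles with non-uniform loops, consider a hypothetical workload where iteration i costs i units of work (this toy example is not from the paper; `chunk_loads` is an illustrative helper, not part of any chunking runtime):

```python
def chunk_loads(weights, num_threads, strategy):
    """Total work assigned to each thread under a static chunking strategy."""
    n = len(weights)
    loads = [0] * num_threads
    for i, w in enumerate(weights):
        if strategy == "block":
            # contiguous blocks of n/num_threads iterations per thread
            t = min(i * num_threads // n, num_threads - 1)
        else:  # cyclic: iteration i goes to thread i mod num_threads
            t = i % num_threads
        loads[t] += w
    return loads

weights = list(range(1, 101))  # iteration i costs i units: non-uniform
print(chunk_loads(weights, 4, "block"))   # heavily skewed toward the last thread
print(chunk_loads(weights, 4, "cyclic"))  # nearly balanced, but only because
                                          # this workload grows monotonically
```

For this triangular workload, block chunking assigns the last thread several times the work of the first, while cyclic happens to balance well; for other distributions (e.g., a few isolated heavy iterations) cyclic fails too, which is why workload-aware chunking is needed.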
In this paper, we propose a new optimization called deep-chunking, a mixed compile-time and runtime technique that chunks the iterations of parallel-for loops based on the runtime workload of each iteration. We propose a parallel algorithm, executed by the individual threads, to efficiently compute their respective chunks so that the overall execution time is reduced. We prove that the algorithm is correct and is a 2-factor approximation. In addition to simple parallel-for loops, deep-chunking can also handle loops with atomic blocks, which pose interesting challenges. We have implemented deep-chunking in the X10 compiler and studied its performance on benchmarks taken from IMSuite. We show that, on average, deep-chunking achieves 50.48%, 21.49%, 26.72%, 32.41%, and 28.84% better performance than the un-chunked (equivalent to work-stealing), cyclic-, block-, dynamic-, and guided-chunking versions of the code, respectively.
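The core idea of workload-based chunking can be sketched as follows. This is a minimal greedy illustration, not the paper's deep-chunking algorithm (which runs in parallel, interacts with the compiler, and handles atomic blocks); it assumes per-iteration workloads are known and simply splits the iteration space into contiguous chunks of roughly equal total work:

```python
def balanced_chunks(weights, num_chunks):
    """Greedily split iterations into contiguous chunks of ~equal total work.

    Returns a list of (start, end) half-open iteration ranges.
    Illustrative sketch only; assumes weights are known up front.
    """
    total = sum(weights)
    target = total / num_chunks  # ideal work per chunk
    chunks, start, acc = [], 0, 0.0
    for i, w in enumerate(weights):
        acc += w
        # close the current chunk once it reaches the per-chunk target,
        # reserving the final chunk for all remaining iterations
        if acc >= target and len(chunks) < num_chunks - 1:
            chunks.append((start, i + 1))
            start, acc = i + 1, 0.0
    chunks.append((start, len(weights)))
    return chunks

# two heavy iterations (cost 9) among light ones (cost 1):
print(balanced_chunks([1, 1, 1, 9, 1, 1, 1, 9], 2))  # → [(0, 4), (4, 8)]
```

Here each chunk receives 12 units of work, whereas splitting the iteration count in half would not change the totals for this input but does break down as soon as the heavy iterations cluster; a workload-aware splitter adapts the chunk boundaries to the actual costs.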
- A. Agarwal, D. A. Kranz, and V. Natarajan. Automatic Partitioning of Parallel Loops and Data Arrays for Distributed Shared-Memory Multiprocessors. IEEE Trans. Parallel Distrib. Syst., 6(9):943--962, September 1995.
- G. E. Blelloch. Programming parallel algorithms. Commun. ACM, 39(3):85--97, March 1996.
- R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. SIGPLAN Not., 30(8):207--216, August 1995.
- J. M. Bull. Feedback Guided Dynamic Loop Scheduling: Algorithms and Experiments. In EUROPAR, pages 377--382, 1998.
- Chapel. The Chapel language specification version 0.4. http://chapel.cray.com/, 2005.
- P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: An Object-oriented Approach to Non-uniform Cluster Computing. In OOPSLA, pages 519--538, New York, NY, USA, 2005. ACM.
- H. Chen, W. Chen, J. Huang, B. Robert, and H. Kuhn. MPIPP: An Automatic Profile-guided Parallel Process Placement Toolset for SMP Clusters and Multiclusters. In ICS, pages 353--360. ACM, 2006.
- Q. Chen, M. Guo, and Z. Huang. CATS: Cache Aware Task-stealing Based on Online Profiling in Multi-socket Multi-core Architectures. In ICS, pages 163--172, New York, NY, USA, 2012. ACM.
- A. Duran, J. Corbalán, and E. Ayguadé. An Adaptive Cut-off for Task Parallelism. In SC, pages 36:1--36:11. IEEE Press, 2008.
- M. Durand, F. Broquedis, T. Gautier, and B. Raffin. An Efficient OpenMP Loop Scheduler for Irregular Applications on Large-Scale NUMA Machines. In IWOMP, pages 141--155. Springer Berlin, 2013.
- S. Eyerman and L. Eeckhout. Modeling Critical Sections in Amdahl's Law and Its Implications for Multicore Design. In ISCA, pages 362--370, 2010.
- S. Gupta and V. K. Nandivada. IMSuite: A benchmark suite for simulating distributed algorithms. Journal of Parallel and Distributed Computing, 75:1--19, January 2015.
- S. Gupta, R. Shrivastava, and V. K. Nandivada. Optimizing Recursive Task Parallel Programs. In ICS, pages 11:1--11:11, 2017.
- Habanero. Habanero Java. http://habanero.rice.edu/hj, Dec 2009.
- B. Hamidzadeh, L. Y. Kit, and D. J. Lilja. Dynamic task scheduling using online optimization. IEEE Trans. Parallel Distrib. Syst., 11(11):1151--1163, 2000.
- B. Hamidzadeh and D. J. Lilja. Self-Adjusting Scheduling: An On-Line Optimization Technique for Locality Management and Load Balancing. In ICPP, pages 39--46. IEEE Computer Society, 1994.
- S. F. Hummel, E. Schonberg, and L. E. Flynn. Factoring: A Practical and Robust Method for Scheduling Parallel Loops. In Supercomputing, pages 610--632, New York, NY, USA, 1991. ACM.
- D. Jackson and E. J. Rollins. Chopping: A Generalization of Slicing. Technical Report CMU-CS-94-169, Carnegie Mellon University, Pittsburgh, PA, USA, 1994.
- JGF. The Java Grande Forum benchmark suite. http://www.epcc.ed.ac.uk/javagrande/javag.html.
- A. Kejariwal, A. Nicolau, and C. D. Polychronopoulos. History-aware Self-Scheduling. In ICPP, pages 185--192, Aug 2006.
- K. Kennedy and J. R. Allen. Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kaufmann Publishers Inc., 2002.
- C. Kruskal and A. Weiss. Allocating Independent Subtasks on Parallel Processors. IEEE Trans. Softw. Eng., SE-11(10), October 1985.
- S. Lucco. A Dynamic Scheduling Method for Irregular Parallel Programs. SIGPLAN Not., 27(7):200--211, July 1992.
- E. P. Markatos and T. J. LeBlanc. Using Processor Affinity in Loop Scheduling on Shared-memory Multiprocessors. In SC, pages 104--113. IEEE Computer Society Press, 1992.
- S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
- V. K. Nandivada, J. Shirako, J. Zhao, and V. Sarkar. A Transformation Framework for Optimizing Task-Parallel Programs. ACM Trans. Program. Lang. Syst., 35(1):3:1--3:48, April 2013.
- OpenMP Application Program Interface Version 4.0. http://www.openmp.org/wp-content/uploads/OpenMP4.0.0.pdf.
- P. H. Penna, A. T. A. Gomes, M. Castro, P. D. M. Plentz, H. C. Freitas, F. Broquedis, and J.-F. Méhaut. A comprehensive performance evaluation of the BinLPT workload-aware loop scheduler. Concurrency and Computation: Practice and Experience, 31(18):e5170, 2019.
- C. D. Polychronopoulos and D. J. Kuck. Guided Self-scheduling: A Practical Scheduling Scheme for Parallel Supercomputers. IEEE Trans. Comput., 36(12):1425--1439, December 1987.
- I. K. Prabhu and V. K. Nandivada. An extended report on chunking loops with non-uniform workloads. https://www.cse.iitm.ac.in/~krishna/preprints/ics20/ics20-full.pdf.
- I. K. Prabhu and V. K. Nandivada. Deep Chunking Implementation. https://github.com/indukprabhu/deepChunking.
- J. Reinders. Intel Threading Building Blocks. O'Reilly & Associates, Inc., Sebastopol, CA, USA, first edition, 2007.
- V. Saraswat, B. Bloom, I. Peshansky, O. Tardieu, and D. Grove. X10 Language Specification Version 2.4. Technical report, IBM, 2014.
- J. Shirako, J. Zhao, V. K. Nandivada, and V. Sarkar. Chunking Parallel Loops in the Presence of Synchronization. In ICS, pages 181--192. ACM, 2009.
- R. Shrivastava and V. K. Nandivada. Energy-efficient compilation of irregular task-parallel loops. TACO, 14(4):35:1--35:29, 2017.
- M. S. Squillante and E. D. Lazowska. Using Processor-Cache Affinity Information in Shared-Memory Multiprocessor Scheduling. IEEE Trans. Parallel Distrib. Syst., 4(2):131--143, February 1993.
- S. Subramaniam and D. L. Eager. Affinity Scheduling of Unbalanced Workloads. In SC, pages 214--226. IEEE Press, 1994.
- P. Thoman, H. Jordan, S. Pellegrini, and T. Fahringer. Automatic OpenMP Loop Scheduling: A Combined Compiler and Runtime Approach. In IWOMP, pages 88--101. Springer-Verlag, 2012.
- A. Tzannes, G. C. Caragea, R. Barua, and U. Vishkin. Lazy Binary-Splitting: A Run-time Adaptive Work-stealing Scheduler. In PPoPP, pages 179--190, 2010.
- T. H. Tzen and L. M. Ni. Trapezoid Self-Scheduling: A Practical Scheduling Scheme for Parallel Compilers. IEEE Trans. Parallel Distrib. Syst., 4(1):87--98, January 1993.
- A. Utture and V. K. Nandivada. Efficient lock-step synchronization in task-parallel languages. Softw. Pract. Exper., 49(9):1379--1401, 2019.
Index Terms
- Chunking loops with non-uniform workloads