Resolving the GPU responsiveness dilemma through program transformations

Zhu, Qi; Wu, Bo; Shen, Xipeng; Shen, Kai; Shen, Li; Wang, Zhiying

doi:10.1007/s11704-016-6206-y

Research Article
Published: 07 February 2018

Volume 12, pages 545–559, (2018)

Frontiers of Computer Science

Qi Zhu¹,²,³, Bo Wu⁴, Xipeng Shen³, Kai Shen⁵, Li Shen¹ & Zhiying Wang¹

Abstract

Emerging integrated CPU–GPU architectures make it practical for short computational kernels to use GPU acceleration. Evidence has shown that, on such systems, GPU control responsiveness (how soon the host program finds out about the completion of a GPU kernel) is essential for overall performance. This study identifies the GPU responsiveness dilemma: host busy polling responds quickly, but at the expense of high energy consumption and interference with co-running CPU programs; interrupt-based notification minimizes energy and CPU interference costs, but suffers from substantial response delay. We present a program-level solution that wakes up the host program in anticipation of GPU kernel completion. We systematically explore the design space of an anticipatory wakeup scheme through a timer-delayed wakeup or kernel-splitting-based pre-completion notification. Experiments show that our proposed technique can achieve the best of both worlds, high responsiveness with low power and CPU costs, for a wide range of GPU workloads.
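
The timer-delayed variant of the anticipatory wakeup can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions, not the authors' implementation: the GPU kernel is replaced by an ordinary Python thread (`fake_kernel`), the predicted kernel duration and the safety margin (`predicted_s`, `margin_s`) are hypothetical parameters, and every function name here is invented for illustration.

```python
import threading
import time


def fake_kernel(done, duration_s):
    """Stand-in for a GPU kernel (assumption): signals `done` after `duration_s`."""
    time.sleep(duration_s)
    done.set()


def anticipatory_wait(done, predicted_s, margin_s):
    """Sleep until shortly before the predicted completion, then busy-poll.

    Returns the number of polling iterations spent after waking up.
    """
    # Phase 1: the host yields the CPU instead of polling.
    time.sleep(max(0.0, predicted_s - margin_s))
    # Phase 2: a short busy-poll window close to the expected completion.
    polls = 0
    while not done.is_set():
        polls += 1
    return polls


def run_once(duration_s=0.05, predicted_s=0.05, margin_s=0.01):
    """Launch the stand-in kernel and wait for it with the anticipatory scheme."""
    done = threading.Event()
    worker = threading.Thread(target=fake_kernel, args=(done, duration_s))
    start = time.perf_counter()
    worker.start()
    polls = anticipatory_wait(done, predicted_s, margin_s)
    elapsed = time.perf_counter() - start
    worker.join()
    return elapsed, polls


if __name__ == "__main__":
    elapsed, polls = run_once()
    print(f"host noticed completion after {elapsed * 1e3:.1f} ms ({polls} polls)")
```

In this sketch a larger `margin_s` lengthens the busy-poll window (more CPU and energy spent), while an over-long predicted duration makes the host wake up after the kernel has already finished (added response delay). That trade-off is the one the abstract describes.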

Acknowledgments

We thank the anonymous referees for their constructive comments. This material is based upon work supported by a DOE Early Career Award (DE-SC0013700) and the National Science Foundation (NSF) (1455404, 1455733 (CAREER), 1525609, 1464216, and 1618912). This work is also supported in part by the National Natural Science Foundation of China (NSFC) (Grant Nos. 61272143, 61272144, 61472431) and the National Science and Technology Major Project (NSTMP) (2017ZX01028-101). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DOE, NSF, NSFC or NSTMP.

Author information

Authors and Affiliations

National Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, 410073, China
Qi Zhu, Li Shen & Zhiying Wang
Jiangnan Institute of Computing Technology, Wuxi, 214083, China
Qi Zhu
Department of Computer Science, North Carolina State University, Raleigh, NC, 27695, USA
Qi Zhu & Xipeng Shen
EECS, Colorado School of Mines, Golden, CO, 80401, USA
Bo Wu
Department of Computer Science, University of Rochester, Rochester, NY, 14627, USA
Kai Shen

Corresponding author

Correspondence to Qi Zhu.

Additional information

Qi Zhu, a doctoral candidate, received his MS degree in computer science from the National University of Defense Technology, China in 2012. His research interests include compilers and programming systems, heterogeneous computing, and emerging architectures.

Bo Wu is an assistant professor at the Colorado School of Mines, USA. He earned his PhD degree in computer science from The College of William and Mary, USA. His research lies in the broad field of compilers and programming systems, with an emphasis on program optimizations for heterogeneous computing and emerging architectures. His current focus is on high-performance graph analytics on GPUs and memory optimization for irregular applications.

Xipeng Shen is an associate professor at the Computer Science Department, North Carolina State University, USA. He has been an IBM Canada CAS Research Faculty Fellow since 2010, and is a recipient of the 2011 DOE Early Career Award and the 2010 NSF CAREER Award. His research interest lies in the broad field of programming systems, with an emphasis on enabling extreme-scale data-intensive computing and intelligent portable computing through innovations in both compilers and runtime systems. He has been particularly interested in capturing large-scale program behavior patterns, in both data accesses and code executions, and exploiting them for scalable and efficient computing in a heterogeneous, massively parallel environment. He leads the NC-CAPS research group.

Kai Shen is an associate professor at the Department of Computer Science, University of Rochester, USA. His research interests fall into the broad area of computer systems. Much of his work is driven by the complexity of modern computer systems and the need for principled approaches to understand, characterize, and manage such complexities. He is particularly interested in cross-layer work on developing software system solutions to support emerging hardware or address hardware issues, including the characterization and management of memory hardware errors, system support for Flash-based SSDs and GPUs, as well as cyber-physical systems.

Electronic supplementary material

Supplementary material, approximately 211 KB.

About this article

Cite this article

Zhu, Q., Wu, B., Shen, X. et al. Resolving the GPU responsiveness dilemma through program transformations. Front. Comput. Sci. 12, 545–559 (2018). https://doi.org/10.1007/s11704-016-6206-y

Received: 06 April 2016
Accepted: 14 October 2016
Published: 07 February 2018
Issue Date: June 2018
DOI: https://doi.org/10.1007/s11704-016-6206-y
