ABSTRACT
Pipeline is a parallel computing model underpinning a class of important applications running on CPU-GPU heterogeneous systems. The efficiency of such applications depends critically on the support for communications among pipeline stages that may reside on the CPU and on different parts of a GPU. Existing libraries of concurrent data structures do not meet these needs, due to the massive parallelism on GPUs and the complexities of CPU-GPU memory and interconnects. This work gives an in-depth study of the communication problem. It identifies three key issues: slow and error-prone detection of the end of pipeline processing, intensive queue contention on the GPU, and cumbersome inter-device data movement. This work offers a solution to each of these issues and integrates them all into a unified library named HiWayLib. Experiments show that HiWayLib significantly boosts the efficiency of pipeline communications in CPU-GPU heterogeneous applications. For real-world applications, HiWayLib produces 1.22–2.13× speedups over state-of-the-art implementations with little extra programming effort required.
HiWayLib: A Software Framework for Enabling High Performance Communications for Heterogeneous Pipeline Computations