ABSTRACT
Pipeline is a parallel computing model underpinning a class of important applications running on CPU-GPU heterogeneous systems. The efficiency of such applications depends critically on the support for communications among pipeline stages that may reside on the CPU and on different parts of a GPU. Existing libraries of concurrent data structures do not meet these needs, due to the massive parallelism on GPUs and the complexities of CPU-GPU memory and interconnects. This work gives an in-depth study of the communication problem. It identifies three key issues: slow and error-prone detection of the end of pipeline processing, intensive queue contention on the GPU, and cumbersome inter-device data movement. This work offers a solution to each of these issues and integrates them all into a unified library named HiWayLib. Experiments show that HiWayLib significantly boosts the efficiency of pipeline communications in CPU-GPU heterogeneous applications. For real-world applications, HiWayLib produces 1.22–2.13× speedups over state-of-the-art implementations with little extra programming effort required.
HiWayLib: A Software Framework for Enabling High Performance Communications for Heterogeneous Pipeline Computations