ABSTRACT
Unleashing the full potential of heterogeneous systems, consisting of multi-core CPUs and GPUs, is a challenging task due to differences in processing capabilities, memory availability, and communication latencies among the computational resources.
In this paper we propose a novel approach that automatically optimizes task partitioning for different (input) problem sizes and different heterogeneous multi-core architectures. We use the Insieme source-to-source compiler to translate a single-device OpenCL program into a multi-device OpenCL program. The Insieme Runtime System then performs dynamic task partitioning based on an offline-generated prediction model. To derive the prediction model, we use a machine learning approach based on Artificial Neural Networks (ANNs) that incorporates static program features as well as dynamic, input-sensitive features. Principal component analysis has been used to further improve the task partitioning. Our approach has been evaluated on a suite of 23 programs and achieves performance improvements of 22% and 25% compared to executing the benchmarks on a single CPU and a single GPU, respectively, which corresponds to 87.5% of the optimal performance.
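The prediction pipeline sketched in the abstract — program features reduced with principal component analysis, then mapped to a CPU/GPU task partitioning — can be illustrated as follows. This is a minimal sketch in Python/NumPy, not the paper's actual model: the feature values, the "best GPU share" labels, and the nearest-neighbour predictor (standing in for the ANN) are all hypothetical.

```python
import numpy as np

# Hypothetical feature matrix: rows = training programs, columns =
# static/dynamic features (e.g. arithmetic ops, memory accesses,
# input problem size). All values are made up for illustration.
X = np.array([
    [120.0, 40.0, 1e4],
    [300.0, 10.0, 1e6],
    [ 80.0, 90.0, 1e3],
    [250.0, 20.0, 5e5],
])

# --- PCA via SVD: center the data and keep the top-k components ---
k = 2
mean = X.mean(axis=0)
Xc = X - mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:k]           # principal axes, shape (k, n_features)
Z = Xc @ components.T         # reduced features, shape (n_samples, k)

# Made-up "best" GPU work shares for the training programs; in the
# paper an offline-trained ANN predicts the partitioning instead.
train_labels = np.array([0.2, 0.9, 0.0, 0.8])

def predict_partition(features):
    """Project a new program's features into PCA space and return the
    GPU share of its nearest training neighbour (toy predictor)."""
    z = (np.asarray(features) - mean) @ components.T
    dists = np.linalg.norm(Z - z, axis=1)
    return train_labels[np.argmin(dists)]

print(Z.shape)                             # → (4, 2)
print(predict_partition([260.0, 15.0, 6e5]))
```

The sketch only captures the shape of the approach: dimensionality reduction of mixed static/dynamic features, followed by a learned mapping from reduced features to a partitioning decision that the runtime can apply per problem size.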
Index Terms
- An automatic input-sensitive approach for heterogeneous task partitioning