ABSTRACT
Many high-performance computing applications that solve partial differential equations (PDEs) belong to the class of kernels that apply stencils on structured grids. Because of the disparity between floating-point throughput and main memory bandwidth, these codes typically achieve only a small fraction of peak performance. Unfortunately, stencil optimization techniques are often hardware dependent and significantly increase code complexity. We present STELLA, a domain-specific tool that eases the burden on the application developer by separating the architecture-dependent implementation strategy from the user code; it targets multi- and manycore processors. Using the numerical weather prediction and regional climate model COSMO as an example, we demonstrate the usefulness of STELLA for a real-world production code. The dynamical core based on STELLA achieves a speedup of 1.8x (CPU) and 5.8x (GPU) over the legacy code while reducing the complexity of the user code.
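The separation described in the abstract, between what a stencil computes and how the grid is traversed, can be illustrated with a minimal sketch. This is not STELLA's interface: the names `laplacian`, `apply_naive`, and `apply_blocked`, the tile size, and the test field are all hypothetical and chosen only for illustration. The pointwise function plays the role of the user code, and the two apply functions stand in for interchangeable, architecture-specific loop strategies.

```python
def laplacian(field, i, j):
    """Pointwise 5-point Laplacian: the 'user code', free of loop logic."""
    return (field[i - 1][j] + field[i + 1][j]
            + field[i][j - 1] + field[i][j + 1]
            - 4.0 * field[i][j])


def apply_naive(stencil, field, halo=1):
    """Strategy A (hypothetical): plain row-major sweep over interior points."""
    ni, nj = len(field), len(field[0])
    out = [[0.0] * nj for _ in range(ni)]
    for i in range(halo, ni - halo):
        for j in range(halo, nj - halo):
            out[i][j] = stencil(field, i, j)
    return out


def apply_blocked(stencil, field, halo=1, block=4):
    """Strategy B (hypothetical): same result, traversed tile by tile."""
    ni, nj = len(field), len(field[0])
    out = [[0.0] * nj for _ in range(ni)]
    for ib in range(halo, ni - halo, block):
        for jb in range(halo, nj - halo, block):
            # Clamp each tile so it never reaches into the halo region.
            for i in range(ib, min(ib + block, ni - halo)):
                for j in range(jb, min(jb + block, nj - halo)):
                    out[i][j] = stencil(field, i, j)
    return out


# Assumed test field f(i, j) = i*i + j*j; its discrete Laplacian is 4
# at every interior point, while halo points are left at 0.
field = [[float(i * i + j * j) for j in range(10)] for i in range(10)]
result_naive = apply_naive(laplacian, field)
result_blocked = apply_blocked(laplacian, field)
```

Both strategies produce the same output from the same stencil definition, so the traversal can be changed per architecture without touching the user code. A real tool would make this choice at compile time and generate specialized loops or GPU kernels; the sketch only shows the division of responsibilities.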
Index Terms
- STELLA: a domain-specific tool for structured grid methods in weather and climate models