Vectorizing Unstructured Mesh Computations for Many-core Architectures
PMAM'14: Proceedings of Programming Models and Applications on Multicores and Manycores

ABSTRACT
Achieving optimal performance on the latest multi-core and many-core architectures depends more and more on making efficient use of the hardware's vector processing capabilities. While auto-vectorizing compilers do not require the use of vector processing constructs, they are only effective on a few classes of applications with regular memory access and computational patterns. Irregular application classes require the explicit use of parallel programming models; CUDA and OpenCL are well established for programming GPUs, but it is not obvious which model to use to exploit vector units on architectures such as CPUs or the Xeon Phi. It is therefore of growing interest which programming models are available, such as Single Instruction Multiple Threads (SIMT) and Single Instruction Multiple Data (SIMD), and how they map to vector units.
This paper presents results on achieving high performance through vectorization on CPUs and the Xeon Phi on a key class of applications: unstructured mesh computations. By exploring the SIMT and SIMD execution and parallel programming models, we show how abstract unstructured grid computations map to OpenCL or vector intrinsics through the use of code generation techniques, and how these in turn utilize the hardware.
We benchmark a number of systems, including Intel Xeon CPUs and the Intel Xeon Phi, using an industrially representative CFD application, and compare the results against previous work on CPUs and NVIDIA GPUs to contrast what can be achieved on current many-core systems. Through a performance analysis study, we identify key bottlenecks due to computational, control-flow, and bandwidth limitations.
We show that the OpenCL SIMT model does not map efficiently to CPU vector units due to auto-vectorization issues and threading overheads. We demonstrate that while the use of SIMD vector intrinsics imposes some restrictions and requires more involved programming techniques, it results in efficient code and near-optimal performance, up to 2 times faster than the non-vectorized code. We observe that the Xeon Phi does not provide good performance for this class of applications, but is still on par with a pair of high-end Xeon chips. CPUs and GPUs do saturate the available resources, giving performance very close to the optimum.