Abstract
Data locality and parallelism are critical optimization objectives for performance on modern multi-core machines. Both coarse-grain parallelism (e.g., multi-core) and fine-grain parallelism (e.g., vector SIMD) must be effectively exploited, but despite decades of progress at both ends, current compiler optimization schemes that attempt to address data locality and both kinds of parallelism often fail at one of the three objectives.
We address this problem by proposing a three-step framework that aims for integrated data locality, multi-core parallelism, and SIMD execution of programs. We define the concept of vectorizable codelets, with properties tailored to achieve effective SIMD code generation for the codelets. We leverage the power of a modern high-level transformation framework to restructure a program to expose good ISA-independent vectorizable codelets, exploiting multi-dimensional data reuse. Then, we generate ISA-specific customized code for the codelets, using a collection of lower-level SIMD-focused optimizations.
We demonstrate our approach on a collection of numerical kernels that we automatically tile, parallelize and vectorize, exhibiting significant performance improvements over existing compilers.
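As a rough illustration of the idea, and not the code the framework actually generates, the sketch below shows a hand-tiled matrix multiplication in C. The outer tile loops carry locality and coarse-grain parallelism, while the inner tile is factored into a small function that plays the role of a vectorizable codelet: constant trip counts, a stride-1 innermost loop, and no dependence carried by that loop. The kernel choice, the names `codelet_tile` and `matmul_tiled`, and the sizes `N` and `TILE` are all assumptions made for this example.

```c
/* Hypothetical sizes chosen for illustration; TILE is assumed to divide N. */
#define N 64
#define TILE 16

/*
 * Stand-in for a "vectorizable codelet": a TILE x TILE x TILE block of
 * C += A * B on row-major arrays with leading dimension N.
 * The innermost j loop has a constant trip count, touches c and b with
 * stride 1, and carries no dependence, so a compiler can map it to SIMD.
 * The restrict qualifiers assume c does not alias a or b.
 */
static void codelet_tile(float *restrict c,
                         const float *restrict a,
                         const float *restrict b)
{
    for (int i = 0; i < TILE; i++) {
        for (int k = 0; k < TILE; k++) {
            const float aik = a[i * N + k];   /* scalar reused across j */
            for (int j = 0; j < TILE; j++)
                c[i * N + j] += aik * b[k * N + j];
        }
    }
}

/*
 * Tiled driver: the tile loops provide data reuse, and the outermost
 * loop is parallel because each iteration writes a distinct band of
 * rows of C. The pragma is ignored when OpenMP is not enabled.
 */
void matmul_tiled(float *C, const float *A, const float *B)
{
    #pragma omp parallel for
    for (int it = 0; it < N; it += TILE)
        for (int jt = 0; jt < N; jt += TILE)
            for (int kt = 0; kt < N; kt += TILE)
                codelet_tile(&C[it * N + jt],
                             &A[it * N + kt],
                             &B[kt * N + jt]);
}
```

Keeping the codelet ISA-independent at this level leaves the choice of vector width, alignment handling, and register-level unrolling to a later, target-specific step, which mirrors the separation the abstract describes.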
When polyhedral transformations meet SIMD code generation
PLDI '13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation