Abstract
Power consumption and energy efficiency have become major bottlenecks in the design of new systems for high performance computing. The path to exascale computing requires new strategies that decrease the energy consumption of modern many-core architectures without sacrificing scalability or performance. Developing these strategies demands scalable models of energy consumption and a reorientation of optimization techniques toward energy efficiency, with their trade-offs against performance evaluated explicitly.
In this paper, we investigate several optimization techniques to reduce energy consumption on many-core architectures with a software-managed memory hierarchy. We study the impact of these techniques on the Static Energy and the Dynamic Energy of the LU factorization benchmark using a scalable energy consumption model. The main contributions of this paper are: (1) the modeling and analysis of energy consumption and energy efficiency for LU factorization; (2) the study and design of instruction-level and task-level optimizations that reduce the Static and Dynamic Energy; (3) the design and implementation of an energy-aware tiling that decreases the Dynamic Energy of power-hungry instructions in the LU factorization benchmark; and (4) the experimental evaluation, on the IBM Cyclops-64 many-core architecture, of the scalability of the proposed optimizations and of their improvement in energy consumption and power efficiency. We also study the trade-offs between performance and power efficiency for the proposed optimizations. Our results for the LU factorization benchmark, using 156 hardware thread units, show an improvement in power efficiency between 1.68X and 4.87X for different matrix sizes. In addition, we point out examples of optimizations that scale in performance but not necessarily in power efficiency.
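The split between Static Energy and Dynamic Energy mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the paper's actual model, so the code below assumes a common first-order decomposition: static energy proportional to execution time, and dynamic energy as a sum of per-instruction costs. The class `EnergyModel`, its fields, the instruction class names, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class EnergyModel:
    """First-order energy model (illustrative; every parameter is an assumption)."""
    static_power_w: float              # assumed constant static power, in watts
    energy_per_op_j: Dict[str, float]  # assumed dynamic energy per instruction class, in joules

    def static_energy(self, runtime_s: float) -> float:
        # Static energy grows with execution time.
        return self.static_power_w * runtime_s

    def dynamic_energy(self, op_counts: Dict[str, int]) -> float:
        # Dynamic energy grows with the number of executed instructions of each class.
        return sum(self.energy_per_op_j[op] * n for op, n in op_counts.items())

    def total_energy(self, runtime_s: float, op_counts: Dict[str, int]) -> float:
        return self.static_energy(runtime_s) + self.dynamic_energy(op_counts)


def lu_flops(n: int) -> float:
    # Leading-order floating-point operation count for LU factorization of an n x n matrix.
    return 2.0 * n**3 / 3.0


def power_efficiency(n: int, runtime_s: float, op_counts: Dict[str, int],
                     model: EnergyModel) -> float:
    # FLOPs per joule, which is equivalent to FLOPS per watt.
    return lu_flops(n) / model.total_energy(runtime_s, op_counts)


if __name__ == "__main__":
    # Hypothetical costs: an off-chip load is assumed far more expensive than a flop.
    model = EnergyModel(static_power_w=10.0,
                        energy_per_op_j={"fp": 1e-9, "offchip_load": 50e-9})
    n = 1024
    flops = int(lu_flops(n))
    # Same runtime, fewer off-chip loads: only the dynamic term changes.
    baseline = {"fp": flops, "offchip_load": flops // 4}
    tiled = {"fp": flops, "offchip_load": flops // 32}
    for name, ops in (("baseline", baseline), ("tiled", tiled)):
        print(name, power_efficiency(n, 0.5, ops, model))
```

Under these assumptions, an optimization that only shortens runtime reduces the static term, while one that replaces expensive instructions (here, the hypothetical `offchip_load` class) reduces the dynamic term. This is one way to see why a technique can scale in performance without scaling equally in power efficiency.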
Acknowledgements
This material is based upon work supported by the Department of Energy [Office of Science] under Award Number DE-SC0008717. This work was partly supported by European FP7 project TERAFLUX, id. 249013. We also thank ET International, Inc. for its support during the course of experiments. Finally, we thank the reviewers for their valuable suggestions.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Garcia, E., Arteaga, J., Pavel, R., Gao, G.R. (2014). Optimizing the LU Factorization for Energy Efficiency on a Many-Core Architecture. In: Cașcaval, C., Montesinos, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2013. Lecture Notes in Computer Science, vol 8664. Springer, Cham. https://doi.org/10.1007/978-3-319-09967-5_14
DOI: https://doi.org/10.1007/978-3-319-09967-5_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09966-8
Online ISBN: 978-3-319-09967-5
eBook Packages: Computer Science, Computer Science (R0)