Abstract
Continuous enhancements and growing diversity in modern multi-core hardware, such as wider and deeper core pipelines and memory subsystems, make it hard to model upper-bound capabilities and to identify the main application bottlenecks. Insightful roofline models are widely used for this purpose, but existing approaches abstract away too much of the micro-architectural complexity, yielding unrealistic performance bounds that lead to a misleading characterization of real-world applications. To address this problem, the Mansard Roofline Model (MaRM), proposed in this work, uncovers a minimum set of architectural features that must be considered to provide insightful, yet accurate and realistic, modeling of performance upper bounds for modern processors. By encapsulating the retirement constraints imposed by the number of retirement slots and by the Reorder Buffer and Physical Register File sizes, the proposed model accurately captures the capabilities of a real platform (average rRMSE of 5.4%) and characterizes 12 application kernels from standard benchmark suites. By following the MaRM interpretation methodology and guidelines proposed herein, speed-ups of up to 5× are obtained when optimizing a real-world bioinformatics application, as well as a super-linear speed-up of 18.5× when it is parallelized.
Index Terms
- Mansard Roofline Model: Reinforcing the Accuracy of the Roofs