research-article

Open Access

Mansard Roofline Model: Reinforcing the Accuracy of the Roofs

Authors:
Diogo Marques, INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal
Aleksandar Ilic, INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal
Leonel Sousa, INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal

ACM Transactions on Modeling and Performance Evaluation of Computing Systems, Volume 6, Issue 2, Article No.: 7, pp 1–23, https://doi.org/10.1145/3475866

Published: 04 October 2021

Abstract

Continuous enhancements and diversity in modern multi-core hardware, such as wider and deeper core pipelines and memory subsystems, pose hard-to-solve challenges when modeling upper-bound capabilities and identifying the main application bottlenecks. Insightful roofline models are widely used for this purpose, but existing approaches over-abstract micro-architectural complexity and therefore provide unrealistic performance bounds that lead to a misleading characterization of real-world applications. To address this problem, the Mansard Roofline Model (MaRM) proposed in this work uncovers a minimum set of architectural features that must be considered to provide insightful yet accurate and realistic modeling of performance upper bounds for modern processors. By encapsulating the retirement constraints imposed by the number of retirement slots and by the Reorder Buffer and Physical Register File sizes, the proposed model accurately captures the capabilities of a real platform (average rRMSE of 5.4%) and characterizes 12 application kernels from standard benchmark suites. Following the MaRM interpretation methodology and guidelines proposed herein, speed-ups of up to 5× are obtained when optimizing a real-world bioinformatics application, as well as a super-linear speed-up of 18.5× when parallelized.
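
For orientation, the sketch below shows the classic roofline bound (Williams, Waterman, and Patterson) that the abstract describes as overly abstract. It is not MaRM itself, whose equations are not given on this page. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
def classic_roofline(ai, peak_gflops, bandwidth_gbs):
    """Classic roofline upper bound in GFLOP/s.

    ai            -- arithmetic intensity in FLOP/byte
    peak_gflops   -- peak compute throughput (hypothetical value)
    bandwidth_gbs -- peak memory bandwidth in GB/s (hypothetical value)
    """
    if ai < 0:
        raise ValueError("arithmetic intensity must be non-negative")
    # Performance is capped by the lower of the compute roof and the
    # memory roof (bandwidth multiplied by arithmetic intensity).
    return min(peak_gflops, bandwidth_gbs * ai)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which the memory and compute roofs meet."""
    return peak_gflops / bandwidth_gbs


if __name__ == "__main__":
    # Hypothetical machine: 100 GFLOP/s peak, 50 GB/s memory bandwidth.
    peak, bw = 100.0, 50.0
    print(ridge_point(peak, bw))            # 2.0
    print(classic_roofline(0.5, peak, bw))  # 25.0 (memory-bound)
    print(classic_roofline(4.0, peak, bw))  # 100.0 (compute-bound)
```

This bound considers only peak throughput and bandwidth. According to the abstract, MaRM additionally accounts for retirement slots and for the Reorder Buffer and Physical Register File sizes.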

Index Terms

1. Computer systems organization
  1. Architectures
2. Computing methodologies
  1. Modeling and simulation

Published in
ACM Transactions on Modeling and Performance Evaluation of Computing Systems, Volume 6, Issue 2
June 2021
114 pages
ISSN: 2376-3639
EISSN: 2376-3647
DOI: 10.1145/3481701
Editor: Leana Golubchik, University of Southern California, United States
Copyright © 2021 Copyright held by the owner/author(s).
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher
Association for Computing Machinery, New York, NY, United States
Publication History
- Published: 4 October 2021
- Revised: 1 July 2021
- Accepted: 1 July 2021
- Received: 1 April 2021

Author Tags
Performance modeling
roofline modeling
application characterization
Qualifiers
- research-article
- Refereed

Article Metrics
- Total Citations: 1
- Total Downloads: 607
- Downloads (Last 12 months): 182
- Downloads (Last 6 weeks): 20
