research-article
Open Access

Mansard Roofline Model: Reinforcing the Accuracy of the Roofs

Published: 04 October 2021

Abstract

Continuous enhancements and growing diversity in modern multi-core hardware, such as wider and deeper core pipelines and memory subsystems, pose hard-to-solve challenges when modeling upper-bound capabilities and identifying the main application bottlenecks. Insightful roofline models are widely used for this purpose, but existing approaches overly abstract the micro-architectural complexity, providing unrealistic performance bounds that lead to a misleading characterization of real-world applications. To address this problem, the Mansard Roofline Model (MaRM), proposed in this work, uncovers a minimum set of architectural features that must be considered to provide insightful, yet accurate and realistic, modeling of performance upper bounds for modern processors. By encapsulating the retirement constraints imposed by the number of retirement slots and by the Reorder-Buffer and Physical Register File sizes, the proposed model accurately captures the capabilities of a real platform (average rRMSE of 5.4%) and characterizes 12 application kernels from standard benchmark suites. By following the MaRM interpretation methodology and guidelines proposed herein, speed-ups of up to 5× are obtained when optimizing a real-world bioinformatics application, as well as a super-linear speed-up of 18.5× when it is parallelized.

      • Published in

        ACM Transactions on Modeling and Performance Evaluation of Computing Systems  Volume 6, Issue 2
        June 2021
        114 pages
        ISSN:2376-3639
        EISSN:2376-3647
        DOI:10.1145/3481701

        Copyright © 2021 held by the owner/author(s).

        This work is licensed under a Creative Commons Attribution 4.0 International License.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 4 October 2021
        • Revised: 1 July 2021
        • Accepted: 1 July 2021
        • Received: 1 April 2021

        Qualifiers

        • research-article
        • Refereed
