Abstract
With exascale computing forthcoming, performance metrics such as memory traffic and arithmetic intensity are increasingly important for codes that rely heavily on numerical kernels. On different CPU architectures, performance metrics can be monitored by counting the occurrences of various hardware events. However, from one architecture to the next, it is increasingly unclear which native performance events are indexed by which event names, making it difficult for users to understand what specific events actually measure. This ambiguity is particularly pronounced for events related to hardware that resides beyond the compute core, such as events related to memory traffic. Still, memory traffic is a necessary quantity for determining arithmetic intensity. To alleviate this difficulty, PAPI’s Counter Analysis Toolkit measures the occurrences of events through a series of benchmarks, allowing its users to discover the high-level meaning of native events. We (i) leverage the capabilities of the Counter Analysis Toolkit to identify the names of hardware events for read and write bandwidth utilization in addition to floating-point operations, (ii) measure the occurrences of the events they index during the execution of important numerical kernels, and (iii) verify their identities by comparing these occurrence patterns to the expected arithmetic intensity of the numerical kernels.
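As a rough illustration of step (iii), the sketch below computes the arithmetic intensity (floating-point operations per byte of memory traffic) expected for a simple kernel and compares it with a value derived from event counts. It is a minimal sketch under stated assumptions, not the chapter's method: the function names are hypothetical and not part of PAPI or the Counter Analysis Toolkit, the kernel is a double-precision dot product chosen for simplicity, each operand is assumed to be read from memory exactly once, and the "measured" counts are made-up placeholders rather than real measurements.

```python
# Hypothetical helpers for illustration only; none of these names belong to
# PAPI or the Counter Analysis Toolkit.

def arithmetic_intensity(flops, bytes_read, bytes_written):
    """Return floating-point operations per byte of memory traffic."""
    traffic = bytes_read + bytes_written
    if traffic <= 0:
        raise ValueError("memory traffic must be positive")
    return flops / traffic


def expected_dot_product_intensity(n, word_bytes=8):
    """Expected intensity of a length-n dot product.

    Assumes each input element is read from memory exactly once and the
    scalar result is written once (an idealized traffic model).
    """
    flops = 2 * n                    # one multiply and one add per element
    bytes_read = 2 * n * word_bytes  # two input vectors
    bytes_written = word_bytes       # single scalar result
    return arithmetic_intensity(flops, bytes_read, bytes_written)


def matches_expectation(measured, expected, rel_tol=0.1):
    """True if the measured intensity is within rel_tol of the expected one."""
    return abs(measured - expected) <= rel_tol * expected


if __name__ == "__main__":
    n = 1_000_000
    expected = expected_dot_product_intensity(n)

    # Placeholder counts standing in for values read from candidate events;
    # they are invented for this example, not measured.
    measured = arithmetic_intensity(
        flops=2_000_000, bytes_read=16_100_000, bytes_written=64
    )

    print(f"expected: {expected:.4f} FLOPs/byte")
    print(f"measured: {measured:.4f} FLOPs/byte")
    print("consistent:", matches_expectation(measured, expected))
```

If a candidate floating-point event and candidate read and write traffic events really measure what their names suggest, the ratio built from their counts should land near the kernel's expected intensity. A large mismatch suggests that at least one of the events measures something else.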
Notes
1. We perform this normalization on the raw data produced by this benchmark for presentation purposes only: we have observed that the measurements are either around 1.5 or extremely large, so they cannot be visualized readably, not even on a logarithmic scale.
Acknowledgements
This material is based upon work supported in part by the National Science Foundation under award No. 1450429 “PAPI-EX.”
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Barry, D., Danalis, A., Jagode, H. (2021). Effortless Monitoring of Arithmetic Intensity with PAPI’s Counter Analysis Toolkit. In: Mix, H., Niethammer, C., Zhou, H., Nagel, W.E., Resch, M.M. (eds) Tools for High Performance Computing 2018 / 2019. Springer, Cham. https://doi.org/10.1007/978-3-030-66057-4_11
DOI: https://doi.org/10.1007/978-3-030-66057-4_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66056-7
Online ISBN: 978-3-030-66057-4
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)