ABSTRACT
Performance analysis tools are frequently used to support the development of parallel MPI applications. They facilitate the detection of errors, bottlenecks, and inefficiencies but differ substantially in their instrumentation, measurement, and type of feedback. Tools that provide visual feedback are especially helpful for educational purposes: they offer a visual abstraction of program behavior that helps learners identify and understand performance issues and write more efficient code. However, existing professional tools for performance analysis are very complex, and using them in beginner courses can be demanding. Above all, their instrumentation and measurement require deep knowledge and take a long time, whereas immediate and straightforward feedback is essential to motivate learners. This paper provides an extensive overview of performance analysis tools for parallel MPI applications that are widely used by experienced developers today. It also gives an overview of existing educational tools for parallel programming with MPI and shows their shortcomings compared to professional tools. Using performance analysis tools for MPI programs in educational scenarios can promote the understanding of program behavior on large HPC systems and support learning parallel programming. At the same time, the complexity of the programs and the lack of infrastructure in educational institutions are barriers. These aspects are considered and discussed in detail.
Index Terms
- Performance Analysis Tools for MPI Applications and their Use in Programming Education