Abstract
Current hardware trends place increasing pressure on programmers and tools to optimize scientific code. Numerous tools and techniques exist, but no single tool is a panacea; instead, different tools have different strengths. Therefore, an assortment of performance tuning utilities and strategies is necessary to best utilize scarce resources (e.g., bandwidth, functional units, cache).
This paper describes a combined methodology for the optimization process. The strategy combines static assembly analysis using MAQAO with dynamic information from hardware performance monitoring (HPM) and memory traces. We introduce a new technique, decremental analysis (DECAN), to iteratively identify the individual instructions responsible for performance bottlenecks. We present case studies on applications from several independent software vendors (ISVs) on an SMP Xeon Core 2 platform. These strategies help discover problems related to memory access locality and loop unrolling; addressing them leads to a sequential performance improvement of a factor of 2.
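To make the decremental idea concrete, the following toy sketch may help. It assumes that "decremental" means removing one candidate instruction at a time from a loop body and re-measuring the variant; the abstract does not spell out the procedure, so this is an illustration and not the DECAN implementation. The instruction names, the cost table `TOY_COSTS`, and the functions `toy_measure` and `decremental_ranking` are all hypothetical, and the "measurement" is a simulated cost model rather than a real timing run.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical per-instruction costs (arbitrary units) for a toy loop body.
# In a real setting these would come from timing patched binaries.
TOY_COSTS: Dict[str, float] = {
    "load a[i]": 4.0,
    "load b[i]": 12.0,  # pretend this access has poor locality
    "mul": 1.0,
    "add": 1.0,
    "store c[i]": 3.0,
}


def toy_measure(body: List[str]) -> float:
    """Stand-in for timing a loop variant: sums the toy costs."""
    return sum(TOY_COSTS[ins] for ins in body)


def decremental_ranking(
    body: List[str], measure: Callable[[List[str]], float]
) -> List[Tuple[str, float]]:
    """Remove each instruction in turn and rank by the time saved."""
    baseline = measure(body)
    savings: List[Tuple[str, float]] = []
    for idx, ins in enumerate(body):
        variant = body[:idx] + body[idx + 1:]  # loop body without one instruction
        savings.append((ins, baseline - measure(variant)))
    # Largest saving first: the most likely bottleneck candidate.
    savings.sort(key=lambda pair: pair[1], reverse=True)
    return savings


ranking = decremental_ranking(list(TOY_COSTS), toy_measure)
for instruction, saved in ranking:
    print(f"{instruction:12s} saves {saved:5.1f}")
```

In this toy model the variant without `load b[i]` shows the largest saving, which would point at that memory access as the first thing to investigate. A variant with an instruction removed no longer computes the original result, so such variants are useful only for timing, not for producing output.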
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Koliai, S. et al. (2010). A Balanced Approach to Application Performance Tuning. In: Gao, G.R., Pollock, L.L., Cavazos, J., Li, X. (eds) Languages and Compilers for Parallel Computing. LCPC 2009. Lecture Notes in Computer Science, vol 5898. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13374-9_8
DOI: https://doi.org/10.1007/978-3-642-13374-9_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13373-2
Online ISBN: 978-3-642-13374-9
eBook Packages: Computer Science, Computer Science (R0)