Abstract
Today's computing systems rely on massively parallel, heterogeneous architectures that promise very high peak performance. Yet most applications achieve only a small fraction of this performance. While both programmers and architects have clear opinions about the causes of this performance gap, finding and quantifying the real problems remains a task for performance modeling tools. In this paper, we sketch the landscape of performance limiters and optimization opportunities on modern GPUs, and examine in detail the modeling attempts for GPU-based systems. We highlight the specific features of the relevant contributions in this field, along with the optimization and design spaces they explore. We further use a typical kernel example (tiled dense matrix multiplication) to assess the efficacy and usability of a set of promising approaches. We conclude that the available GPU performance modeling solutions are very sensitive to application and platform changes, and require significant tuning and calibration effort whenever new analyses are required.
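For readers unfamiliar with the kernel named in the abstract, the tiling pattern can be sketched as follows. This is a minimal CPU-side illustration in plain Python, not the kernel evaluated in the paper; the function name `tiled_matmul` and the `tile` parameter are assumptions introduced here for illustration only.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), processing one tile at a time.

    Hypothetical sketch: on a GPU, each (i0, j0) output tile would typically
    be assigned to one thread block, and the sub-blocks of a and b touched in
    each p0 step would be staged in on-chip shared memory.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # rows of the output tile
        for j0 in range(0, m, tile):      # columns of the output tile
            for p0 in range(0, k, tile):  # step along the shared dimension
                # Accumulate the partial product of the two current sub-blocks.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

The tile size controls how much data is reused per load, which is one reason such a kernel's performance depends on both the application parameters and the platform.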
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Madougou, S., Varbanescu, A.L., de Laat, C., van Nieuwpoort, R. (2014). An Empirical Evaluation of GPGPU Performance Models. In: Lopes, L., et al. Euro-Par 2014: Parallel Processing Workshops. Euro-Par 2014. Lecture Notes in Computer Science, vol 8805. Springer, Cham. https://doi.org/10.1007/978-3-319-14325-5_15
DOI: https://doi.org/10.1007/978-3-319-14325-5_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14324-8
Online ISBN: 978-3-319-14325-5
eBook Packages: Computer Science, Computer Science (R0)