Abstract
We show empirically that some of the issues that affected the design of linear algebra libraries for distributed memory architectures will also likely affect such libraries for shared memory architectures with many simultaneous threads of execution, including SMP architectures and future multicore processors. The always-important matrix-matrix multiplication is used to demonstrate that a simple one-dimensional data partitioning is suboptimal in the context of dense linear algebra operations and hinders scalability. In addition we advocate the publishing of low-level interfaces to supporting operations, such as the copying of data to contiguous memory, so that library developers may further optimize parallel linear algebra implementations. Data collected on a 16 CPU Itanium2 server supports these observations.
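To make the partitioning issue concrete, the following is a minimal sketch (not the authors' implementation) of the simple one-dimensional partitioning of C = A*B that the abstract identifies as suboptimal: each OpenMP thread owns a contiguous block of rows of A and C while reading all of B. The function names gemm_serial and gemm_1d_rows are illustrative placeholders; a real library would dispatch to a tuned serial kernel such as GotoBLAS.

```c
/* Illustrative sketch, not the paper's code: 1D row-block partitioning of
 * C = A*B + C across OpenMP threads, row-major storage.                    */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Simple serial kernel: C(m x n) += A(m x k) * B(k x n), row-major.
 * Stands in for a tuned dgemm kernel.                                      */
static void gemm_serial(int m, int n, int k,
                        const double *A, int lda,
                        const double *B, int ldb,
                        double *C, int ldc)
{
    for (int i = 0; i < m; i++)
        for (int p = 0; p < k; p++) {
            double a = A[i * lda + p];
            for (int j = 0; j < n; j++)
                C[i * ldc + j] += a * B[p * ldb + j];
        }
}

/* One-dimensional partitioning: thread t computes rows [t*mb, t*mb+mb) of C
 * from the corresponding rows of A, while every thread reads all of B.
 * As the thread count grows each row block becomes thin, so the per-thread
 * ratio of computation to data touched degrades -- the scalability problem
 * the paper measures against two-dimensional partitionings.                */
static void gemm_1d_rows(int m, int n, int k,
                         const double *A, const double *B, double *C)
{
    #pragma omp parallel
    {
        int nth = omp_get_num_threads();
        int t   = omp_get_thread_num();
        int mb  = (m + nth - 1) / nth;             /* rows per thread        */
        int i0  = t * mb;
        int mi  = (i0 + mb <= m) ? mb : (m > i0 ? m - i0 : 0);
        if (mi > 0)
            gemm_serial(mi, n, k, A + (size_t)i0 * k, k, B, n,
                        C + (size_t)i0 * n, n);
    }
}

int main(void)
{
    int m = 512, n = 512, k = 512;
    double *A = calloc((size_t)m * k, sizeof *A);
    double *B = calloc((size_t)k * n, sizeof *B);
    double *C = calloc((size_t)m * n, sizeof *C);
    for (int i = 0; i < m * k; i++) A[i] = 1.0;
    for (int i = 0; i < k * n; i++) B[i] = 1.0;

    gemm_1d_rows(m, n, k, A, B, C);
    printf("C[0] = %f (expected %d)\n", C[0], k);   /* each entry equals k */

    free(A); free(B); free(C);
    return 0;
}
```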
Keywords
- Matrix Multiplication Algorithm
- Shared Memory Architecture
- Distributed Memory Architecture
- Linear Algebra Operation
- Linear Algebra Package
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Marker, B., Van Zee, F.G., Goto, K., Quintana-Ortí, G., van de Geijn, R.A. (2007). Toward Scalable Matrix Multiply on Multithreaded Architectures. In: Kermarrec, AM., Bougé, L., Priol, T. (eds) Euro-Par 2007 Parallel Processing. Euro-Par 2007. Lecture Notes in Computer Science, vol 4641. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74466-5_79
DOI: https://doi.org/10.1007/978-3-540-74466-5_79
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74465-8
Online ISBN: 978-3-540-74466-5