
Way of Measuring Data Transfer Delays among Graphics Processing Units at Different Nodes of a Computer Cluster

Moscow University Computational Mathematics and Cybernetics

Abstract

The design of load tests for a computer cluster with a large number of GPUs (graphics processing units) distributed over the cluster's nodes is presented and implemented as program code. The tests collect the time delays observed when transferring data of various sizes between every pair of GPUs in the system. Two test modes are developed: ‘‘all to all,’’ in which every GPU transfers data to every other GPU simultaneously, and ‘‘one to one,’’ in which only a single pair of GPUs exchanges data at any moment in time. Test results obtained on the K60 computer cluster at the Keldysh Institute of Applied Mathematics, Russian Academy of Sciences, show that the interconnect medium of the supercomputer is inhomogeneous with respect to data transfer among GPUs, not only for transfers across the network but also between GPUs within a single node of the cluster.
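To make the ‘‘one to one’’ mode concrete, the following is a minimal sketch in MPI + CUDA (the technologies the tests are built on), not the authors' actual implementation. It assumes a CUDA-aware MPI build so that MPI_Send/MPI_Recv accept device pointers, one GPU per MPI rank, and an arbitrarily chosen message size and repeat count; all names and parameters are illustrative.

```c
/*
 * Minimal sketch of the ''one to one'' mode (not the authors' code).
 * Rank 0 and rank 1 bounce a GPU-resident buffer back and forth;
 * half of the averaged round-trip time estimates the one-way delay.
 * Assumes a CUDA-aware MPI build, so MPI_Send/MPI_Recv may be given
 * device pointers directly.
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    if (nranks < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    const int msg_size = 1 << 20;   /* 1 MiB message, chosen arbitrarily */
    const int repeats  = 100;       /* averaging suppresses timer noise  */

    /* One GPU per rank is assumed; a real test would select the device
       from the local rank instead of hard-coding device 0. */
    char *dev_buf;
    cudaSetDevice(0);
    cudaMalloc((void **)&dev_buf, msg_size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < repeats; ++i) {
        if (rank == 0) {
            MPI_Send(dev_buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(dev_buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(dev_buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(dev_buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("estimated one-way delay for %d bytes: %g s\n",
               msg_size, (t1 - t0) / (2.0 * repeats));

    cudaFree(dev_buf);
    MPI_Finalize();
    return 0;
}
```

In the ‘‘all to all’’ mode, the same kind of timed exchange would be initiated by all GPU pairs at once (for example, with nonblocking MPI_Isend/MPI_Irecv followed by MPI_Waitall), so that contention on the shared interconnect is reflected in the measured delays.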



Author information

Correspondence to A. A. Begaev or A. N. Salnikov.

Additional information

Translated by E. Oborin

About this article


Cite this article

Begaev, A.A., Salnikov, A.N. Way of Measuring Data Transfer Delays among Graphics Processing Units at Different Nodes of a Computer Cluster. Moscow Univ. Comput. Math. Cybern. 44, 1–10 (2020). https://doi.org/10.3103/S0278641920010021

