
Design of an adaptive GPU sharing and scheduling scheme in container-based cluster

Published in: Cluster Computing

Abstract

Container-based virtualization is an innovative technology that accelerates software development by providing portability and maintainability of applications. Recently, a growing number of workloads, such as high-performance computing (HPC) and deep learning (DL), have been deployed in container-based environments. However, GPU resource management, and in particular the GPU memory oversubscription problem, remains challenging in container-based clusters and can cause substantial performance loss. This paper proposes an adaptive fair-share method for sharing GPUs effectively in a container-based virtualization environment, together with an execution rescheduling method that manages the execution order of containers to maximize performance gain. We also propose a checkpoint-based mechanism, designed for DL workloads running on TensorFlow, that efficiently resolves the GPU memory oversubscription problem. We demonstrate that our approach improves overall performance and resource utilization compared with the default and static fair-share methods under both homogeneous and heterogeneous workloads. Compared with these two baselines, the proposed method reduces average execution time by 16.37% and 15.61% and increases average GPU memory utilization by approximately 52.46% and 10.3%, respectively. We also evaluated the checkpoint-based mechanism by running multiple CNN workloads concurrently with TensorFlow; the results show that it ensures each workload executes safely without out-of-memory (OOM) errors.
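The two ideas in the abstract can be illustrated with a minimal sketch: an adaptive fair-share allocator that scales each container's GPU memory allocation in proportion to its demand when the device is oversubscribed, and an admission check that defers (conceptually, checkpoints) a workload whose demand would not fit. This is a hypothetical illustration, not the authors' implementation; the function names `adaptive_fair_share` and `admit_or_defer`, and the proportional-scaling policy, are assumptions made here for clarity.

```python
def adaptive_fair_share(capacity_mb, demands_mb):
    """Divide GPU memory among containers in proportion to demand.

    When total demand fits, every container gets exactly what it asks for;
    under contention, all demands are scaled by the same factor so the
    sum never exceeds device capacity (avoiding oversubscription).
    """
    total = sum(demands_mb.values())
    if total <= capacity_mb:
        return dict(demands_mb)
    scale = capacity_mb / total
    return {c: d * scale for c, d in demands_mb.items()}


def admit_or_defer(capacity_mb, running_mb, request_mb):
    """Admit a new workload only if it fits alongside running ones.

    Returns False when admitting the workload would oversubscribe GPU
    memory, signalling that it should be checkpointed and queued rather
    than risk an out-of-memory (OOM) failure.
    """
    used = sum(running_mb.values())
    return used + request_mb <= capacity_mb


# Example: 12 GB demanded on an 8 GB GPU, so each container
# is scaled to 2/3 of its demand.
demands = {"cnn-train": 6000, "hpc-sim": 3000, "cnn-infer": 3000}
alloc = adaptive_fair_share(8000, demands)
```

A real implementation would obtain demands from runtime measurements (e.g. per-container GPU memory profiling) and enforce the resulting limits inside each container; the proportional policy above is only one possible fairness rule.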



Acknowledgements

The authors would like to thank all students who contributed to this study. We are grateful to Sejin Kim, who assisted with the evaluation. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2017R1A2B4005681).

Author information


Corresponding author

Correspondence to Yoonhee Kim.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Chen, Q., Oh, J., Kim, S. et al. Design of an adaptive GPU sharing and scheduling scheme in container-based cluster. Cluster Comput 23, 2179–2191 (2020). https://doi.org/10.1007/s10586-019-02969-3

