
Design of an adaptive GPU sharing and scheduling scheme in container-based cluster

Published in: Cluster Computing

Abstract

Container-based virtualization is an innovative technology that accelerates software development by providing portability and maintainability of applications. Recently, a growing number of workloads, such as high-performance computing (HPC) and deep learning (DL), have been deployed in container-based environments. However, GPU resource management, and in particular the GPU memory oversubscription problem, remains challenging in container-based clusters and can cause substantial performance loss. This paper proposes an adaptive fair-share method for sharing GPUs effectively in a container-based virtualization environment, together with an execution rescheduling method that manages the execution order of containers to maximize performance gain. We also propose a checkpoint-based mechanism, designed for DL workloads running on TensorFlow, that efficiently resolves the GPU memory oversubscription problem. We demonstrate that our approach improves overall performance and resource utilization compared with the default and static fair-share methods under both homogeneous and heterogeneous workloads. Compared with these two baselines, the proposed method reduces average execution time by 16.37% and 15.61% and increases average GPU memory utilization by approximately 52.46% and 10.3%, respectively. We also evaluated the checkpoint-based mechanism by running multiple CNN workloads concurrently with TensorFlow; the results show that it ensures each workload executes safely without out-of-memory (OOM) errors.
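The two ideas in the abstract can be illustrated with a minimal sketch: an adaptive fair-share allocator that scales each container's GPU memory allocation in proportion to its demand when the device is oversubscribed, and an admission check that defers (conceptually, checkpoints) a workload whose demand would not fit. This is a hypothetical illustration, not the authors' implementation; the function names `adaptive_fair_share` and `admit_or_defer`, and the proportional-scaling policy, are assumptions made here for clarity.

```python
def adaptive_fair_share(capacity_mb, demands_mb):
    """Divide GPU memory among containers in proportion to demand.

    When total demand fits, every container gets exactly what it asks for;
    under contention, all demands are scaled by the same factor so the
    sum never exceeds device capacity (avoiding oversubscription).
    """
    total = sum(demands_mb.values())
    if total <= capacity_mb:
        return dict(demands_mb)
    scale = capacity_mb / total
    return {c: d * scale for c, d in demands_mb.items()}


def admit_or_defer(capacity_mb, running_mb, request_mb):
    """Admit a new workload only if it fits alongside running ones.

    Returns False when admitting the workload would oversubscribe GPU
    memory, signalling that it should be checkpointed and queued rather
    than risk an out-of-memory (OOM) failure.
    """
    used = sum(running_mb.values())
    return used + request_mb <= capacity_mb


# Example: 12 GB demanded on an 8 GB GPU, so each container
# is scaled to 2/3 of its demand.
demands = {"cnn-train": 6000, "hpc-sim": 3000, "cnn-infer": 3000}
alloc = adaptive_fair_share(8000, demands)
```

A real implementation would obtain demands from runtime measurements (e.g. per-container GPU memory profiling) and enforce the resulting limits inside each container; the proportional policy above is only one possible fairness rule.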



Acknowledgements

The authors would like to thank all students who contributed to this study. We are grateful to Sejin Kim, who assisted with the evaluation. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2017R1A2B4005681).

Author information


Corresponding author

Correspondence to Yoonhee Kim.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Chen, Q., Oh, J., Kim, S. et al. Design of an adaptive GPU sharing and scheduling scheme in container-based cluster. Cluster Comput 23, 2179–2191 (2020). https://doi.org/10.1007/s10586-019-02969-3

