ABSTRACT
Datacenters use accelerators to provide the significant compute throughput required by emerging user-facing services. The diurnal access pattern of these services provides a strong incentive to co-locate applications for better accelerator utilization, and prior work has focused on enabling co-location on multicore processors and traditional non-preemptive accelerators. However, current accelerators are evolving towards spatial multitasking, which introduces a new set of challenges for eliminating QoS violations. To address this open problem, we explore the underlying causes of QoS violations on spatial multitasking accelerators. In response to these causes, we propose Laius, a runtime system that carefully allocates computation resources to co-located applications to maximize the throughput of batch applications while guaranteeing the required QoS of user-facing services. Our evaluation on an Nvidia RTX 2080Ti GPU shows that Laius improves the utilization of spatial multitasking accelerators by 20.8% while achieving the 99%-ile latency target for user-facing services.
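The allocation goal stated in the abstract can be illustrated with a small sketch. This is not Laius's actual algorithm, which the abstract does not describe. It only shows the general idea of giving the user-facing service the smallest compute share that is predicted to meet its latency target and leaving the remainder to batch applications. The function names, the 10% step, and the toy latency model are all hypothetical.

```python
from typing import Callable, Optional


def smallest_share_meeting_qos(
    predict_latency_ms: Callable[[int], float],
    qos_target_ms: float,
    step: int = 10,
) -> Optional[int]:
    """Return the smallest compute share (in percent) whose predicted
    latency meets the QoS target, or None if even 100% is not enough."""
    for share in range(step, 101, step):
        if predict_latency_ms(share) <= qos_target_ms:
            return share
    return None


def toy_latency_model(share_percent: int) -> float:
    # Hypothetical model (an assumption, not measured data): 5 ms of fixed
    # overhead plus 20 ms of work that scales inversely with the share.
    return 5.0 + 20.0 * 100.0 / share_percent


# The user-facing service gets the minimal share; batch jobs get the rest.
user_share = smallest_share_meeting_qos(toy_latency_model, qos_target_ms=50.0)
batch_share = 0 if user_share is None else 100 - user_share
```

A real runtime would replace the toy model with a latency predictor trained on profiled kernels and would adjust the share online as load changes.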