DOI: 10.1145/3330345.3330351

Laius: Towards latency awareness and improved utilization of spatial multitasking accelerators in datacenters

Published: 26 June 2019

ABSTRACT

Datacenters use accelerators to provide the significant compute throughput required by emerging user-facing services. The diurnal user access pattern of user-facing services provides a strong incentive to co-locate applications for better accelerator utilization, and prior work has focused on enabling co-location on multicore processors and traditional non-preemptive accelerators. However, current accelerators are evolving towards spatial multitasking, which introduces a new set of challenges for eliminating QoS violations. To address this open problem, we explore the underlying causes of QoS violations on spatial multitasking accelerators. In response to these causes, we propose Laius, a runtime system that carefully allocates computational resources to co-located applications, maximizing the throughput of batch applications while guaranteeing the required QoS of user-facing services. Our evaluation on an NVIDIA RTX 2080Ti GPU shows that Laius improves the utilization of spatial multitasking accelerators by 20.8% while achieving the 99%-ile latency target for user-facing services.
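The allocation idea described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a hypothetical latency profile mapping each compute-resource share (as a percentage) to the predicted latency of the user-facing service at that share, then gives the service the smallest QoS-safe share and hands the remainder to the batch application.

```python
# Hypothetical sketch of QoS-aware spatial resource partitioning.
# The profile below is assumed data, not measurements from the paper.

QOS_TARGET_MS = 10.0  # assumed 99%-ile latency target

# Assumed profile: predicted latency (ms) of the user-facing service
# at each percentage of the accelerator's compute resources.
latency_profile = {30: 14.2, 40: 11.1, 50: 9.3, 60: 8.0, 70: 7.1}

def allocate(profile, qos_target):
    """Return (user_share, batch_share): the smallest share that still
    meets the QoS target goes to the user-facing service, and the rest
    goes to the batch application. Fall back to (100, 0) if no profiled
    share satisfies the target."""
    for share in sorted(profile):
        if profile[share] <= qos_target:
            return share, 100 - share
    return 100, 0

user_share, batch_share = allocate(latency_profile, QOS_TARGET_MS)
print(user_share, batch_share)  # prints "50 50" for the profile above
```

With the assumed profile, a 50% share is the smallest that keeps predicted latency (9.3 ms) under the 10 ms target, so the batch application receives the remaining 50%; maximizing the batch share is what drives the utilization gains the abstract reports.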


Published in

ICS '19: Proceedings of the ACM International Conference on Supercomputing
June 2019, 533 pages
ISBN: 9781450360791
DOI: 10.1145/3330345

Copyright © 2019 ACM


Publisher

Association for Computing Machinery, New York, NY, United States



    Qualifiers

    • research-article

Acceptance Rates

Overall acceptance rate: 584 of 2,055 submissions, 28%
