ABSTRACT
On multi-core processors, contention for shared resources such as the last-level cache (LLC) and memory bandwidth can cause serious performance degradation, which makes efficient resource allocation a critical issue in data centers. Intel recently introduced Memory Bandwidth Allocation (MBA) technology on its Xeon Scalable processors, which makes it possible to allocate memory bandwidth in a real system. However, how to make the most of MBA to improve system performance remains an open question. In this work, we (1) formulate a quantitative relationship between a program's performance and its LLC occupancy and memory request rate on commodity processors; (2) guided by this performance formula, propose a heuristic bound-aware throttling algorithm to improve system performance; (3) develop a hierarchical clustering method to improve the algorithm's efficiency; and (4) implement these algorithms in EMBA, a low-overhead dynamic memory bandwidth scheduling system for Intel commodity processors. The results show that, when multiple programs run simultaneously on a multi-core processor whose memory bandwidth is saturated, programs with high memory bandwidth demand usually use bandwidth less efficiently, in terms of CPU performance, than programs with medium memory bandwidth demand. By slightly throttling the bandwidth of the former, we can significantly improve the performance of the latter. On average, we improve system performance by 36.9% at the expense of 8.6% in bandwidth utilization rate.
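To make the throttling idea concrete, the following is a minimal sketch of one scheduling step, not the paper's bound-aware algorithm or its clustering method. It is a simplified threshold-based stand-in: when total bandwidth reaches a saturation threshold, every high-demand program is throttled by one step. The names `Program` and `throttle_step`, the thresholds, the GB/s units, and the 10% step with a 10% floor are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed throttle settings, in percent of unthrottled bandwidth.
MBA_STEP = 10   # assumed granularity of one throttling step
MBA_MIN = 10    # assumed lowest allowed level
MBA_MAX = 100   # no throttling


@dataclass
class Program:
    name: str
    bandwidth_gbps: float      # measured memory bandwidth (hypothetical units)
    mba_level: int = MBA_MAX   # current throttle level in percent


def throttle_step(programs, saturation_gbps, high_demand_gbps):
    """Run one scheduling step and return the names of throttled programs.

    If the summed bandwidth reaches the saturation threshold, lower the
    level of each high-demand program by one step, never below MBA_MIN.
    """
    total = sum(p.bandwidth_gbps for p in programs)
    if total < saturation_gbps:
        return []  # bandwidth is not saturated, so nothing is throttled
    throttled = []
    for p in programs:
        if p.bandwidth_gbps >= high_demand_gbps and p.mba_level > MBA_MIN:
            p.mba_level = max(MBA_MIN, p.mba_level - MBA_STEP)
            throttled.append(p.name)
    return throttled


if __name__ == "__main__":
    # Hypothetical workload: one high-, one medium-, one low-demand program.
    workload = [Program("high", 40.0), Program("medium", 12.0), Program("low", 2.0)]
    names = throttle_step(workload, saturation_gbps=50.0, high_demand_gbps=30.0)
    print(names, [p.mba_level for p in workload])
```

In a real deployment the chosen levels would have to be applied through an MBA interface and the bandwidth re-measured before the next step; both are outside this sketch.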