DOI: 10.1145/3337821.3337863
research-article

EMBA: Efficient Memory Bandwidth Allocation to Improve Performance on Intel Commodity Processor

Published: 05 August 2019

ABSTRACT

On multi-core processors, contention for shared resources such as the last-level cache (LLC) and memory bandwidth can cause serious performance degradation, which makes efficient resource allocation a critical issue in data centers. Intel recently introduced Memory Bandwidth Allocation (MBA) technology on its Xeon Scalable processors, which makes it possible to allocate memory bandwidth in a real system. However, how to make the most of MBA to improve system performance remains an open question. In this work, (1) we formulate a quantitative relationship between a program's performance and its LLC occupancy and memory request rate on commodity processors; (2) guided by the performance formula, we propose a heuristic bound-aware throttling algorithm to improve system performance; (3) we further develop a hierarchical clustering method to improve the algorithm's efficiency; and (4) we implement these algorithms in EMBA, a low-overhead dynamic memory bandwidth scheduling system that improves performance on Intel commodity processors. The results show that, when multiple programs run simultaneously on a multi-core processor whose memory bandwidth is saturated, programs with high memory bandwidth demand usually use bandwidth less efficiently, from the perspective of CPU performance, than programs with medium memory bandwidth demand. By slightly throttling the former's bandwidth, we can significantly improve the performance of the latter. On average, we improve system performance by 36.9% at the expense of 8.6% of the bandwidth utilization rate.
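
The intuition in the abstract, that throttling a high-demand program under saturated bandwidth can raise overall performance, can be illustrated with a toy model. The sketch below is not EMBA's algorithm or its performance formula. It assumes, purely for illustration, that saturated bandwidth is shared in proportion to each program's (capped) demand, that a program's performance is the fraction of its demand it receives, and that throttle levels come in 10% steps. All names (`MBA_LEVELS`, `shares`, `system_performance`, `throttle_heaviest`) are hypothetical.

```python
from typing import List, Tuple

# Assumed throttle levels, as a percentage of a program's unthrottled demand.
MBA_LEVELS = [100, 90, 80, 70, 60, 50, 40, 30, 20, 10]


def shares(demands: List[float], caps: List[float], capacity: float) -> List[float]:
    """Split `capacity` in proportion to each program's capped demand (toy model)."""
    effective = [min(d, c) for d, c in zip(demands, caps)]
    total = sum(effective)
    scale = min(1.0, capacity / total) if total > 0 else 0.0
    return [e * scale for e in effective]


def system_performance(demands: List[float], caps: List[float], capacity: float) -> float:
    """Toy metric: sum over programs of (bandwidth received / bandwidth demanded)."""
    received = shares(demands, caps, capacity)
    return sum(b / d for b, d in zip(received, demands))


def throttle_heaviest(demands: List[float], capacity: float) -> Tuple[int, int, float]:
    """Try every level on the highest-demand program and keep the best one.

    Returns (index of throttled program, chosen level in percent, toy metric).
    """
    heavy = max(range(len(demands)), key=lambda i: demands[i])
    best_level = 100
    best_perf = system_performance(demands, demands, capacity)
    for level in MBA_LEVELS:
        caps = list(demands)
        caps[heavy] = demands[heavy] * level / 100
        perf = system_performance(demands, caps, capacity)
        if perf > best_perf + 1e-12:  # accept strict improvements only
            best_level, best_perf = level, perf
    return heavy, best_level, best_perf


if __name__ == "__main__":
    # One heavy and two medium programs (GB/s) on a 30 GB/s memory system.
    demands = [40.0, 10.0, 10.0]
    capacity = 30.0
    baseline = system_performance(demands, demands, capacity)
    heavy, level, perf = throttle_heaviest(demands, capacity)
    used = sum(shares(demands, [d * level / 100 if i == heavy else d
                                for i, d in enumerate(demands)], capacity))
    print(f"baseline metric: {baseline:.3f}")
    print(f"throttle program {heavy} to {level}%: metric {perf:.3f}, "
          f"bandwidth used {used:.1f} of {capacity:.1f} GB/s")
```

In this toy setting, throttling the heavy program lets the two medium programs receive their full demand, while total bandwidth use falls slightly below capacity. That is the same qualitative trade-off the abstract reports, though the numbers here carry no relation to the paper's measurements.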


Published in

    ICPP '19: Proceedings of the 48th International Conference on Parallel Processing
    August 2019
    1107 pages
    ISBN: 9781450362955
    DOI: 10.1145/3337821

    Copyright © 2019 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 91 of 313 submissions, 29%
