EMBA: Efficient Memory Bandwidth Allocation to Improve Performance on Intel Commodity Processor

Authors:
Yaocheng Xiang
Shenzhen Key Lab for Cloud Computing Technology & Applications, SECE, Peking University, Shenzhen, China and Pengcheng Laboratory, Shenzhen, China

Chencheng Ye
National Engineering Research Center for Big Data Technology and System, Service Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China

Xiaolin Wang
Pengcheng Laboratory, Shenzhen, China

Yingwei Luo
Pengcheng Laboratory, Shenzhen, China

Zhenlin Wang
Michigan Technological University

ICPP '19: Proceedings of the 48th International Conference on Parallel Processing, August 2019, Article No. 16, Pages 1–12. https://doi.org/10.1145/3337821.3337863

Published: 05 August 2019

ABSTRACT

On multi-core processors, contention for shared resources such as the last-level cache (LLC) and memory bandwidth can cause serious performance degradation, which makes efficient resource allocation a critical issue in data centers. Intel recently introduced Memory Bandwidth Allocation (MBA) technology on its Xeon Scalable processors, which makes it possible to allocate memory bandwidth in a real system. However, how to make the most of MBA to improve system performance remains an open question. This work makes four contributions. (1) We formulate a quantitative relationship between a program's performance and its LLC occupancy and memory request rate on commodity processors. (2) Guided by this performance formula, we propose a heuristic bound-aware throttling algorithm to improve system performance. (3) We further develop a hierarchical clustering method to improve the algorithm's efficiency. (4) We implement these algorithms in EMBA, a low-overhead dynamic memory bandwidth scheduling system that improves performance on Intel commodity processors. The results show that, when multiple programs run simultaneously on a multi-core processor whose memory bandwidth is saturated, programs with high memory bandwidth demand usually use bandwidth less efficiently, from the perspective of CPU performance, than programs with medium memory bandwidth demand. By slightly throttling the bandwidth of the former, we can significantly improve the performance of the latter. On average, we improve system performance by 36.9% at the expense of an 8.6% lower bandwidth utilization rate.
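The observation that slightly throttling high-demand programs can help medium-demand ones can be illustrated with a minimal sketch. This is not the paper's bound-aware throttling algorithm or its performance formula, neither of which is given in the abstract. The greedy loop, the function names (`greedy_throttle`, `toy_measure`), the toy performance model with its `DEMAND`, `ALPHA`, and `CAPACITY` values, and the assumed throttling range of 10% to 100% in steps of 10 are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Levels = Dict[str, int]  # program name -> bandwidth throttle level in percent


def greedy_throttle(
    programs: List[str],
    measure: Callable[[Levels], float],
    min_level: int = 10,  # assumed lowest throttle setting (percent)
    step: int = 10,       # assumed throttle granularity (percent)
) -> Tuple[Levels, float]:
    """Lower one program's level by one step at a time, keeping a change
    only if the measured system metric strictly improves."""
    levels = {p: 100 for p in programs}
    best = measure(levels)
    improved = True
    while improved:
        improved = False
        for p in programs:
            if levels[p] - step < min_level:
                continue  # already at the lowest allowed level
            trial = dict(levels)
            trial[p] -= step
            score = measure(trial)
            if score > best:
                levels, best, improved = trial, score, True
    return levels, best


# Hypothetical toy model, invented for this sketch (not from the paper):
# each program requests DEMAND * level / 100 GB/s, the memory system grants
# at most CAPACITY GB/s in proportion to the requests, and a program's
# normalized performance is (granted / DEMAND) ** ALPHA. A small ALPHA means
# the program gains little performance per unit of bandwidth.
DEMAND = {"stream": 60.0, "medium": 30.0}
ALPHA = {"stream": 0.2, "medium": 1.0}
CAPACITY = 60.0


def toy_measure(levels: Levels) -> float:
    requested = {p: DEMAND[p] * levels[p] / 100 for p in levels}
    total = sum(requested.values())
    scale = min(1.0, CAPACITY / total)  # proportional share when saturated
    return sum((requested[p] * scale / DEMAND[p]) ** ALPHA[p] for p in levels)


baseline = toy_measure({"stream": 100, "medium": 100})
levels, best = greedy_throttle(["stream", "medium"], toy_measure)
print(levels, round(baseline, 3), round(best, 3))
```

In this toy setting the loop throttles only the high-demand program and leaves the medium-demand one untouched, which mirrors the qualitative trend the abstract reports. On real hardware, `measure` would have to be replaced by measurements taken from performance counters after applying each setting.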


Published in

ICPP '19: Proceedings of the 48th International Conference on Parallel Processing
August 2019
1107 pages
ISBN: 9781450362955
DOI: 10.1145/3337821

Copyright © 2019 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Memory bandwidth allocation
Multi-core architectures
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 91 of 313 submissions, 29%