ABSTRACT
As chip multiprocessors (CMPs) become increasingly mainstream, architects have likewise become more interested in how best to share a cache hierarchy among multiple simultaneous threads of execution. The complexity of this problem is exacerbated as the number of simultaneous threads grows from two or four to the tens or hundreds. However, there is no consensus in the architectural community on what "best" means in this context. Some papers in the literature seek to equalize each thread's performance loss due to sharing, while others emphasize maximizing overall system performance. Furthermore, the specific effect of these goals varies depending on the metric used to define "performance". In this paper we label equal-performance targets as Communist cache policies and overall-performance targets as Utilitarian cache policies. We compare both of these models to the most common current model of a free-for-all cache (a Capitalist policy). We consider various performance metrics, including miss rates, bandwidth usage, and IPC, using both absolute and relative values of each metric. Using analytical models and behavioral cache simulation, we find that the optimal partitioning of a shared cache can vary greatly as different but reasonable definitions of optimality are applied. We also find that, although Communist and Utilitarian targets are generally compatible, each policy has workloads for which it provides poor overall performance or poor fairness, respectively. Finally, we find that simple policies like LRU replacement and static uniform partitioning are not sufficient to provide near-optimal performance under any reasonable definition, indicating that some thread-aware cache resource allocation mechanism is required.
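The distinction between the two partitioning targets can be made concrete with a small sketch, not taken from the paper: given hypothetical per-thread IPC-versus-allocation curves for two threads sharing an 8-way cache, a Utilitarian policy picks the static split that maximizes aggregate IPC, while a Communist policy picks the split that equalizes each thread's slowdown relative to running with the whole cache alone. The curve values and thread names below are illustrative assumptions.

```python
# Hypothetical IPC for each thread as a function of cache ways allocated (0..8).
# Thread A is cache-sensitive; thread B is streaming-like and nearly insensitive.
ipc = {
    "A": [0.20, 0.50, 0.70, 0.80, 0.85, 0.88, 0.90, 0.91, 0.92],
    "B": [0.60, 0.62, 0.64, 0.65, 0.66, 0.66, 0.67, 0.67, 0.67],
}
TOTAL_WAYS = 8
# Baseline: each thread's IPC when it owns the entire cache.
alone = {t: curve[TOTAL_WAYS] for t, curve in ipc.items()}

def utilitarian(ipc, total):
    """Maximize total IPC over all static two-thread splits of the ways."""
    best = max(range(total + 1), key=lambda a: ipc["A"][a] + ipc["B"][total - a])
    return best, total - best

def communist(ipc, total):
    """Minimize the gap between the two threads' relative slowdowns."""
    def gap(a):
        rel_a = ipc["A"][a] / alone["A"]           # A's fraction of its solo IPC
        rel_b = ipc["B"][total - a] / alone["B"]   # B's fraction of its solo IPC
        return abs(rel_a - rel_b)
    best = min(range(total + 1), key=gap)
    return best, total - best

print("Utilitarian split (ways A, ways B):", utilitarian(ipc, TOTAL_WAYS))  # (6, 2)
print("Communist split   (ways A, ways B):", communist(ipc, TOTAL_WAYS))    # (5, 3)
```

Even on these toy curves the two definitions of "best" disagree (a 6/2 split versus a 5/3 split), which is the abstract's point that the optimal partition depends on the chosen notion of optimality.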
Index Terms
- Communist, utilitarian, and capitalist cache policies on CMPs: caches as a shared resource