Abstract
Optimization of the replacement policy used for Shared Last-Level Cache (SLLC) management in a Chip-MultiProcessor (CMP) is critical for avoiding off-chip accesses. Temporal locality, although exploited by the first levels of private cache memories, is only weakly exhibited by the stream of references arriving at the SLLC. Traditional recency-based replacement algorithms are therefore poor choices for governing SLLC replacement. Recent proposals involve SLLC replacement policies that attempt to exploit reuse either by segmenting the replacement list or by improving the rereference interval prediction.
On the other hand, inclusive SLLCs are commonplace in the CMP market, but the interaction between replacement policy and the enforcement of inclusion has barely been discussed. After analyzing that interaction, this article introduces two simple replacement policies exploiting reuse locality and targeting inclusive SLLCs: Least Recently Reused (LRR) and Not Recently Reused (NRR). NRR has the same implementation cost as NRU, and LRR only adds one bit per line to the LRU cost.
After considering reuse locality and its interaction with the invalidations induced by inclusion, the proposals are evaluated by simulating multiprogrammed workloads in an 8-core system with two private cache levels and an SLLC. LRR outperforms LRU by 4.5% (performing better in 97 out of 100 mixes) and NRR outperforms NRU by 4.2% (performing better in 99 out of 100 mixes). We also show that our mechanisms outperform rereference interval prediction, a recently proposed SLLC replacement policy, and that similar conclusions hold when the associativity or the SLLC size is varied.
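As a rough illustration of the idea, and not the article's actual algorithms, the Python sketch below models one cache set that keeps a single "reused" bit per line and prefers victims that have never been hit at this level. The names `ReuseBitSet`, `Line`, and `state_bits_per_line` are hypothetical, the victim-selection and bit-reset rules are assumptions borrowed from the usual NRU scheme, and the interaction with inclusion-induced invalidations that the abstract emphasizes is deliberately left out. The cost helper assumes LRU stores a log2(ways)-bit position per line and NRU stores one bit per line; the "same as NRU" and "one bit more than LRU" relations come from the abstract.

```python
from dataclasses import dataclass
from math import ceil, log2


@dataclass
class Line:
    tag: int
    reused: bool = False  # the single per-line state bit (assumed encoding)


class ReuseBitSet:
    """One cache set managed by a one-bit, reuse-based policy (illustrative only)."""

    def __init__(self, ways: int) -> None:
        self.ways = ways
        self.lines: list[Line] = []

    def access(self, tag: int) -> bool:
        """Return True on a hit; on a miss, fill the line and return False."""
        for line in self.lines:
            if line.tag == tag:
                line.reused = True  # a hit at this level counts as a reuse
                return True
        if len(self.lines) == self.ways:
            self._evict()
        self.lines.append(Line(tag))  # new lines start as not reused
        return False

    def _evict(self) -> None:
        # Prefer the first line that has not shown reuse.
        for i, line in enumerate(self.lines):
            if not line.reused:
                del self.lines[i]
                return
        # Every line has been reused: clear all bits (NRU-style reset, an
        # assumption here) and evict the first line.
        for line in self.lines:
            line.reused = False
        del self.lines[0]


def state_bits_per_line(policy: str, ways: int) -> int:
    """Replacement-state bits per line under the assumptions stated above."""
    lru_bits = ceil(log2(ways))
    return {"NRU": 1, "NRR": 1, "LRU": lru_bits, "LRR": lru_bits + 1}[policy]
```

With a 2-way set, accessing tags 1, 2, 1, 3 leaves lines 1 and 3 resident: line 1 was hit once and is protected, so the never-reused line 2 is evicted to make room for 3. A recency-only policy would protect line 1 in this short trace too; the difference appears on longer streams of single-use lines, which can never displace a line whose reuse bit is set unless every line in the set has been reused.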