Exploiting reuse locality on inclusive shared last-level caches

Published: 20 January 2013

Abstract

Optimizing the replacement policy used to manage the Shared Last-Level Cache (SLLC) in a Chip-MultiProcessor (CMP) is critical for avoiding off-chip accesses. Temporal locality, although exploited by the first levels of private cache, is only weakly exhibited by the stream of references arriving at the SLLC. Traditional recency-based replacement algorithms are therefore poorly suited to governing SLLC replacement. Recent proposals involve SLLC replacement policies that attempt to exploit reuse either by segmenting the replacement list or by improving the rereference interval prediction.
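The filtering effect described above can be illustrated with a minimal sketch. It assumes a single fully associative private cache with LRU replacement in front of the SLLC; the function name `filter_through_private_lru`, the toy trace, and the capacity are illustrative assumptions, not taken from the article.

```python
from collections import OrderedDict

def filter_through_private_lru(trace, capacity):
    """Return the references that miss in a private LRU cache (hypothetical helper)."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    forwarded = []
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)        # hit: absorbed by the private cache
        else:
            forwarded.append(addr)         # miss: this reference reaches the SLLC
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[addr] = True
    return forwarded

# Toy trace (assumed): short-distance reuse of lines 1 and 2, plus one access to 3.
trace = [1, 2, 1, 2, 3, 1, 2, 1]
to_sllc = filter_through_private_lru(trace, capacity=2)
```

In this toy run the three short-distance reuses hit in the private cache, so the SLLC only sees `[1, 2, 3, 1, 2]`: the repetitions that remain are separated by more intervening references than in the original trace.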

Inclusive SLLCs are commonplace in the CMP market, yet the interaction between the replacement policy and the enforcement of inclusion has barely been discussed. After analyzing that interaction, this article introduces two simple replacement policies that exploit reuse locality and target inclusive SLLCs: Least Recently Reused (LRR) and Not Recently Reused (NRR). NRR has the same implementation cost as Not Recently Used (NRU), and LRR adds only one bit per line to the cost of Least Recently Used (LRU).
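The abstract does not spell out how NRR works, so the following is only a rough sketch of what a one-bit-per-line, reuse-based policy could look like. It assumes the bit marks lines that have not been hit since they were filled, that victims are taken from those lines first, and that the bits are reset NRU-style when every line has been reused. The class name `ReuseBitSet`, the deterministic lowest-way victim choice, and the omission of any inclusion or private-cache presence information are all assumptions, not the article's specification.

```python
class ReuseBitSet:
    """One cache set with a single 'not reused' bit per line (hypothetical sketch)."""

    def __init__(self, ways):
        self.tags = [None] * ways
        self.not_reused = [True] * ways

    def access(self, tag):
        """Return True on a hit, False on a miss (the line is then filled)."""
        if tag in self.tags:
            self.not_reused[self.tags.index(tag)] = False  # a hit counts as reuse
            return True
        way = self._victim()
        self.tags[way] = tag
        self.not_reused[way] = True  # a newly filled line has shown no reuse yet
        return False

    def _victim(self):
        if None in self.tags:
            return self.tags.index(None)  # use an empty way first
        if True not in self.not_reused:
            # Every line has been reused: reset all bits, as NRU does
            # when its own bits saturate (assumed behaviour).
            self.not_reused = [True] * len(self.tags)
        return self.not_reused.index(True)  # lowest-numbered non-reused way

cache_set = ReuseBitSet(ways=2)
hits = [cache_set.access(t) for t in "ABACAD"]
```

In this two-way example, line A is reused and stays resident while the single-use lines B and C are evicted in turn. The storage is one bit per line, which is the same budget NRU needs.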

After considering reuse locality and its interaction with the invalidations induced by inclusion, the proposals are evaluated by simulating multiprogrammed workloads on an 8-core system with two private cache levels and an SLLC. LRR outperforms LRU by 4.5% (performing better in 97 out of 100 mixes) and NRR outperforms NRU by 4.2% (performing better in 99 out of 100 mixes). We also show that our mechanisms outperform rereference interval prediction, a recently proposed SLLC replacement policy, and that similar conclusions hold when the associativity or the SLLC size is varied.

    • Published in

      ACM Transactions on Architecture and Code Optimization  Volume 9, Issue 4
      Special Issue on High-Performance Embedded Architectures and Compilers
      January 2013
      876 pages
      ISSN:1544-3566
      EISSN:1544-3973
      DOI:10.1145/2400682

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 20 January 2013
      • Accepted: 1 November 2012
      • Revised: 1 September 2012
      • Received: 1 June 2012

      Qualifiers

      • research-article
      • Research
      • Refereed
