ABSTRACT
Cache compression is a promising technique to increase on-chip cache capacity and to decrease on-chip and off-chip bandwidth usage. Unfortunately, directly applying well-known compression algorithms (usually implemented in software) leads to high hardware complexity and unacceptable decompression/compression latencies, which in turn can negatively affect performance. Hence, there is a need for a simple yet efficient compression technique that can effectively compress common in-cache data patterns and has minimal effect on cache access latency.
In this paper, we introduce a new compression algorithm called Base-Delta-Immediate (BΔI) compression, a practical technique for compressing data in on-chip caches. The key idea is that, for many cache lines, the values within the cache line have a low dynamic range, i.e., the differences between values stored within the cache line are small. As a result, a cache line can be represented using a base value and an array of differences whose combined size is much smaller than the original cache line (we call this the base+delta encoding). Moreover, many cache lines intersperse such base+delta values with small values; our BΔI technique efficiently incorporates such immediate values into its encoding.
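The base+delta idea with immediates can be illustrated with a minimal software sketch. It is not the hardware design from the paper: the function names, the single fixed delta width, the choice of the first non-immediate value as the base, and the one-flag-bit-per-value size estimate are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def fits_signed(delta: int, num_bytes: int) -> bool:
    """Return True if delta fits in a signed integer of num_bytes bytes."""
    bound = 1 << (8 * num_bytes - 1)
    return -bound <= delta < bound


def compress_line(
    values: List[int], delta_size: int
) -> Optional[Tuple[int, List[int], List[bool]]]:
    """Try to encode values as (base, deltas, uses_base flags).

    Each value is stored either as a small immediate (a delta from an
    implicit zero base) or as a delta from one shared base. Returns None
    if some value fits neither way, i.e. the line is left uncompressed.
    """
    base: Optional[int] = None
    deltas: List[int] = []
    uses_base: List[bool] = []
    for v in values:
        if fits_signed(v, delta_size):
            # Small value: store it directly as an immediate.
            deltas.append(v)
            uses_base.append(False)
            continue
        if base is None:
            # Assumption: the first non-immediate value becomes the base.
            base = v
        if not fits_signed(v - base, delta_size):
            return None
        deltas.append(v - base)
        uses_base.append(True)
    return (base if base is not None else 0), deltas, uses_base


def decompress_line(base: int, deltas: List[int], uses_base: List[bool]) -> List[int]:
    """Rebuild the original values: add the base only where the flag is set."""
    return [d + base if flag else d for d, flag in zip(deltas, uses_base)]


def compressed_bytes(num_values: int, value_size: int, delta_size: int) -> int:
    """Base + deltas + one flag bit per value, rounded up to whole bytes."""
    return value_size + num_values * delta_size + (num_values + 7) // 8


# Hypothetical 32-byte line: eight 4-byte words mixing nearby pointers
# with small integers.
line = [0xC04039C0, 3, 0xC04039C8, 0, 0xC04039D0, 1, 0xC04039D8, 2]
encoded = compress_line(line, delta_size=1)
if encoded is not None:
    base, deltas, flags = encoded
    restored = decompress_line(base, deltas, flags)
    size = compressed_bytes(len(line), value_size=4, delta_size=1)
```

For this hypothetical line, the sketch stores one 4-byte base, eight 1-byte deltas, and one byte of flags, so 13 bytes stand in for the original 32. Decompression is a single addition per value, which is why this kind of encoding suits low-latency use.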
Compared to prior cache compression approaches, our studies show that BΔI strikes a sweet spot in the tradeoff between compression ratio, decompression/compression latencies, and hardware complexity. Our results show that BΔI compression improves performance for both single-core (8.1% improvement) and multi-core workloads (9.5% / 11.2% improvement for two/four cores). For many applications, BΔI provides the performance benefit of doubling the cache size of the baseline system, effectively increasing average cache capacity by 1.53X.