research-article

Base-delta-immediate compression: practical data compression for on-chip caches

Authors:
Gennady Pekhimenko, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Vivek Seshadri, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Onur Mutlu, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Phillip B. Gibbons, Intel Labs Pittsburgh, Pittsburgh, Pennsylvania, USA
Michael A. Kozuch, Intel Labs Pittsburgh, Pittsburgh, Pennsylvania, USA
Todd C. Mowry, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA

PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, September 2012, Pages 377–388, https://doi.org/10.1145/2370816.2370870

Published: 19 September 2012

ABSTRACT

Cache compression is a promising technique to increase on-chip cache capacity and to decrease on-chip and off-chip bandwidth usage. Unfortunately, directly applying well-known compression algorithms (usually implemented in software) leads to high hardware complexity and unacceptable decompression/compression latencies, which in turn can negatively affect performance. Hence, there is a need for a simple yet efficient compression technique that can effectively compress common in-cache data patterns, and has minimal effect on cache access latency.

In this paper, we introduce a new compression algorithm called Base-Delta-Immediate (BΔI) compression, a practical technique for compressing data in on-chip caches. The key idea is that, for many cache lines, the values within the cache line have a low dynamic range - i.e., the differences between values stored within the cache line are small. As a result, a cache line can be represented using a base value and an array of differences whose combined size is much smaller than the original cache line (we call this the base+delta encoding). Moreover, many cache lines intersperse such base+delta values with small values - our BΔI technique efficiently incorporates such immediate values into its encoding.
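
The base+delta idea with immediates can be illustrated with a minimal Python sketch. It is not the hardware design evaluated in the paper: the function names `bdi_compress` and `bdi_decompress`, the 8-byte value size, the 1-byte signed delta size, the little-endian layout, and the choice of the first non-small value as the base are all assumptions made here for illustration. Small values are stored as deltas from an implicit zero base, and a per-value flag records which base applies.

```python
def bdi_compress(line: bytes, word_size: int = 8, delta_size: int = 1):
    """Try to encode `line` as one base plus small signed deltas.

    Returns (base, mask, deltas), or None if some value does not fit.
    mask[i] is True when value i is an "immediate" (a delta from an
    implicit zero base) and False when it is a delta from `base`.
    Sizes and base selection are illustrative assumptions.
    """
    assert len(line) % word_size == 0
    values = [int.from_bytes(line[i:i + word_size], "little")
              for i in range(0, len(line), word_size)]
    lo = -(1 << (8 * delta_size - 1))      # smallest signed delta
    hi = (1 << (8 * delta_size - 1)) - 1   # largest signed delta
    base = None
    mask, deltas = [], []
    for v in values:
        if lo <= v <= hi:
            # Small value: store it directly, relative to zero.
            mask.append(True)
            deltas.append(v)
            continue
        if base is None:
            base = v  # assumption: first non-small value becomes the base
        d = v - base
        if not (lo <= d <= hi):
            return None  # dynamic range too large for this delta size
        mask.append(False)
        deltas.append(d)
    return (base if base is not None else 0), mask, deltas


def bdi_decompress(base, mask, deltas, word_size: int = 8) -> bytes:
    """Rebuild the original line from the base, flags, and deltas."""
    out = bytearray()
    for is_immediate, d in zip(mask, deltas):
        v = d if is_immediate else base + d
        out += v.to_bytes(word_size, "little")
    return bytes(out)


if __name__ == "__main__":
    # Three nearby pointer-like values interspersed with a zero.
    words = [0x7F0000001000, 0x7F0000001008, 0, 0x7F0000001010]
    line = b"".join(w.to_bytes(8, "little") for w in words)
    encoded = bdi_compress(line)
    print(encoded)
    print(bdi_decompress(*encoded) == line)
```

Under these assumptions, a 32-byte line of four 8-byte values is held as one 8-byte base and four 1-byte deltas, plus one flag bit per value, instead of 32 bytes. A line whose values are too far apart makes `bdi_compress` return `None`, meaning it would be kept uncompressed in this sketch.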

Compared to prior cache compression approaches, our studies show that BΔI strikes a sweet spot in the tradeoff between compression ratio, decompression/compression latencies, and hardware complexity. Our results show that BΔI compression improves performance for both single-core (8.1% improvement) and multi-core workloads (9.5% / 11.2% improvement for two/four cores). For many applications, BΔI provides the performance benefit of doubling the cache size of the baseline system, effectively increasing average cache capacity by 1.53X.

Index Terms

1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory
2. Information systems
  1. Data management systems
    1. Data structures
      1. Data layout
        1. Data compression

Published in
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques
September 2012
512 pages
ISBN: 9781450311823
DOI: 10.1145/2370816
General Chairs: Pen-Chung Yew (University of Minnesota), Sangyeun Cho (University of Pittsburgh)
Program Chairs: Luiz DeRose (Cray, Inc.), David J. Lilja (University of Minnesota)
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cache compression
caching
memory
Acceptance Rates
Overall Acceptance Rate: 121 of 471 submissions, 26%