Abstract
Main memory bandwidth is a critical bottleneck for modern GPU systems due to limited off-chip pin bandwidth. 3D-stacked memory architectures provide a promising opportunity to significantly alleviate this bottleneck by directly connecting a logic layer to the DRAM layers with high-bandwidth connections. Recent work has shown promising potential performance benefits from an architecture that connects multiple such 3D-stacked memories and offloads bandwidth-intensive computations to a GPU in each of the logic layers. A key unsolved challenge in such a system is how to enable computation offloading and data mapping to multiple 3D-stacked memories without burdening the programmer, so that any application can transparently benefit from near-data processing capabilities in the logic layer.
Our paper develops two new mechanisms to address this key challenge. The first is a compiler-based technique that automatically identifies code to offload to a logic-layer GPU based on a simple cost-benefit analysis. The second is a software/hardware cooperative mechanism that predicts which memory pages will be accessed by offloaded code, and places those pages in the memory stack closest to the offloaded code, to minimize off-chip bandwidth consumption. We call the combination of these two programmer-transparent mechanisms TOM: Transparent Offloading and Mapping.
Our extensive evaluations across a variety of modern memory-intensive GPU workloads show that, without requiring any program modification, TOM significantly improves performance (by 30% on average, and up to 76%) compared to a baseline GPU system that cannot offload computation to 3D-stacked memories.
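The offloading decision described above can be illustrated with a minimal sketch. The function and its parameters below are hypothetical and not the paper's exact model: the idea is simply that offloading a code block is profitable when the off-chip traffic it would otherwise generate (its loads and stores) exceeds the traffic needed to ship the block's live-in context to the memory stack and its live-out results back.

```python
def should_offload(bytes_loaded, bytes_stored, live_in_bytes, live_out_bytes):
    """Hypothetical cost-benefit check for offloading a code block
    to a logic-layer GPU in a 3D-stacked memory.

    Without offloading, the block's loads and stores cross the
    off-chip link. With offloading, only the live-in registers
    (sent to the stack) and live-out registers (sent back) do.
    Offload when the latter traffic is smaller.
    """
    traffic_without_offload = bytes_loaded + bytes_stored
    traffic_with_offload = live_in_bytes + live_out_bytes
    return traffic_with_offload < traffic_without_offload


# A streaming loop touching 8 KB of data with 16 B of live-in/out
# context is worth offloading; a tiny block with large context is not.
print(should_offload(4096, 4096, 16, 16))   # expected: True
print(should_offload(8, 8, 64, 64))         # expected: False
```

A real compiler pass would estimate these byte counts statically per candidate block, which is why the paper describes the analysis as a simple cost-benefit comparison rather than a precise traffic model.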