Abstract
Main memory bandwidth is a critical bottleneck for modern GPU systems due to limited off-chip pin bandwidth. 3D-stacked memory architectures provide a promising opportunity to significantly alleviate this bottleneck by directly connecting a logic layer to the DRAM layers with high-bandwidth connections. Recent work has shown promising potential performance benefits from an architecture that connects multiple such 3D-stacked memories and offloads bandwidth-intensive computations to a GPU in each of the logic layers. A key unsolved challenge in such a system is how to enable computation offloading and data mapping to multiple 3D-stacked memories without burdening the programmer, so that any application can transparently benefit from near-data processing capabilities in the logic layer.
Our paper develops two new mechanisms to address this key challenge. The first is a compiler-based technique that automatically identifies code to offload to a logic-layer GPU based on a simple cost-benefit analysis. The second is a software/hardware cooperative mechanism that predicts which memory pages will be accessed by offloaded code, and places those pages in the memory stack closest to the offloaded code, to minimize off-chip bandwidth consumption. We call the combination of these two programmer-transparent mechanisms TOM: Transparent Offloading and Mapping.
Our extensive evaluations across a variety of modern memory-intensive GPU workloads show that, without requiring any program modification, TOM significantly improves performance (by 30% on average, and up to 76%) compared to a baseline GPU system that cannot offload computation to 3D-stacked memories.
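The offloading decision described above can be illustrated with a minimal sketch. The function and its parameters below are hypothetical and not the paper's exact model: the idea is simply that offloading a code block is profitable when the off-chip traffic it would otherwise generate (its loads and stores) exceeds the traffic needed to ship the block's live-in context to the memory stack and its live-out results back.

```python
def should_offload(bytes_loaded, bytes_stored, live_in_bytes, live_out_bytes):
    """Hypothetical cost-benefit check for offloading a code block
    to a logic-layer GPU in a 3D-stacked memory.

    Without offloading, the block's loads and stores cross the
    off-chip link. With offloading, only the live-in registers
    (sent to the stack) and live-out registers (sent back) do.
    Offload when the latter traffic is smaller.
    """
    traffic_without_offload = bytes_loaded + bytes_stored
    traffic_with_offload = live_in_bytes + live_out_bytes
    return traffic_with_offload < traffic_without_offload


# A streaming loop touching 8 KB of data with 16 B of live-in/out
# context is worth offloading; a tiny block with large context is not.
print(should_offload(4096, 4096, 16, 16))   # expected: True
print(should_offload(8, 8, 64, 64))         # expected: False
```

A real compiler pass would estimate these byte counts statically per candidate block, which is why the paper describes the analysis as a simple cost-benefit comparison rather than a precise traffic model.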