
Transparent offloading and mapping (TOM): enabling programmer-transparent near-data processing in GPU systems

Published: 18 June 2016

Abstract

Main memory bandwidth is a critical bottleneck for modern GPU systems due to limited off-chip pin bandwidth. 3D-stacked memory architectures provide a promising opportunity to significantly alleviate this bottleneck by directly connecting a logic layer to the DRAM layers with high-bandwidth connections. Recent work has shown promising potential performance benefits from an architecture that connects multiple such 3D-stacked memories and offloads bandwidth-intensive computations to a GPU in each of the logic layers. A key unsolved challenge in such a system is how to enable computation offloading and data mapping to multiple 3D-stacked memories without burdening the programmer, so that any application can transparently benefit from near-data processing capabilities in the logic layer.

Our paper develops two new mechanisms to address this key challenge. The first is a compiler-based technique that automatically identifies code to offload to a logic-layer GPU based on a simple cost-benefit analysis. The second is a software/hardware cooperative mechanism that predicts which memory pages will be accessed by offloaded code, and places those pages in the memory stack closest to the offloaded code to minimize off-chip bandwidth consumption. We call the combination of these two programmer-transparent mechanisms TOM: Transparent Offloading and Mapping.
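The offload decision described above can be sketched as a traffic comparison: a block is worth offloading when the off-chip memory traffic it would otherwise generate exceeds the traffic needed to ship its live-in/live-out context to the logic layer. The following Python fragment is an illustrative model only; the function name, register size, and example numbers are our own assumptions, not the paper's actual compiler analysis.

```python
# Illustrative cost-benefit model for deciding whether to offload a
# candidate code block to a logic-layer GPU. All names and constants
# here are hypothetical; the paper's compiler performs its own analysis.

def should_offload(bytes_loaded, bytes_stored, live_in_regs, live_out_regs,
                   reg_size=4):
    """Return True if offloading reduces off-chip traffic.

    If the block runs on the main GPU, its loads and stores cross the
    off-chip link. If it runs in the memory stack, only the live-in
    values (sent to the stack) and live-out values (sent back) do.
    """
    traffic_if_local = bytes_loaded + bytes_stored
    traffic_if_offloaded = (live_in_regs + live_out_regs) * reg_size
    return traffic_if_offloaded < traffic_if_local

# Example: a block that streams 4 KB from memory but carries only
# 8 live registers across the offload boundary is worth offloading.
print(should_offload(bytes_loaded=4096, bytes_stored=0,
                     live_in_regs=6, live_out_regs=2))  # True
```

A block with little memory traffic relative to its live context (say, 16 bytes of loads/stores against 8 live registers) would fail this test, which is the intuition behind restricting offloading to bandwidth-intensive code.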

Our extensive evaluations across a variety of modern memory-intensive GPU workloads show that, without requiring any program modification, TOM significantly improves performance (by 30% on average, and up to 76%) compared to a baseline GPU system that cannot offload computation to 3D-stacked memories.



Published in

ACM SIGARCH Computer Architecture News, Volume 44, Issue 3 (ISCA'16), June 2016, 730 pages. ISSN: 0163-5964. DOI: 10.1145/3007787.

Also appears in ISCA '16: Proceedings of the 43rd International Symposium on Computer Architecture, June 2016, 756 pages. ISBN: 9781467389471.

Copyright © 2016 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

• research-article
