Abstract
Modern graphics processing units (GPUs) employ a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access in terms of both energy and latency. We present two complementary techniques for reducing energy on massively threaded processors such as GPUs. First, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Reducing the number of threads that the scheduler must consider each cycle improves the scheduler’s energy efficiency. Second, we propose replacing the monolithic register file found on modern designs with a hierarchical register file. We explore various trade-offs for the hierarchy, including the number of levels and the number of entries at each level. We consider both a hardware-managed caching scheme and a software-managed scheme, in which the compiler is responsible for orchestrating all data movement within the register file hierarchy. Combined with a hierarchical register file, our two-level thread scheduler provides a further reduction in energy by allocating entries in the upper levels of the register file hierarchy only for active threads. Averaged across a variety of real-world graphics and compute workloads, the active thread count can be reduced by a factor of 4 with minimal impact on performance, and our most efficient three-level software-managed register file hierarchy reduces register file energy by 54%.
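The two-level scheduling idea in the abstract can be illustrated with a small toy model. The sketch below is not the authors' implementation: the class name `TwoLevelScheduler`, the round-robin issue policy, the `is_long_latency` callback, and the default `mem_latency` of 400 cycles are all assumptions chosen for illustration. It shows only the core property that the scheduler scans the small active set each cycle, while threads waiting on main memory sit in the larger pending set.

```python
from collections import deque


class TwoLevelScheduler:
    """Toy model (hypothetical): only `active` threads are scanned per cycle."""

    def __init__(self, thread_ids, active_size):
        ids = list(thread_ids)
        self.active_size = active_size
        self.active = deque(ids[:active_size])   # small set considered each cycle
        self.pending = deque(ids[active_size:])  # larger set, not scanned per cycle
        self.waiting = {}                         # thread id -> cycle its data returns

    def issue(self, cycle, is_long_latency, mem_latency=400):
        """Issue from the next active thread (round-robin, an assumed policy).

        `is_long_latency(tid, cycle)` is a caller-supplied stand-in for
        "this instruction accesses main memory". `mem_latency` is an
        arbitrary placeholder, not a figure from the paper.
        Returns the issuing thread id, or None if no thread is active.
        """
        self._promote(cycle)
        if not self.active:
            return None
        tid = self.active.popleft()
        if is_long_latency(tid, cycle):
            # Demote: the thread waits on main memory outside the active set.
            self.waiting[tid] = cycle + mem_latency
            self.pending.append(tid)
            self._promote(cycle)
        else:
            # Short-latency work keeps the thread in the active set.
            self.active.append(tid)
        return tid

    def _promote(self, cycle):
        """Fill free active slots with pending threads whose data has returned."""
        for _ in range(len(self.pending)):
            if len(self.active) >= self.active_size:
                break
            tid = self.pending.popleft()
            if self.waiting.get(tid, 0) <= cycle:
                self.waiting.pop(tid, None)
                self.active.append(tid)
            else:
                self.pending.append(tid)  # still waiting; keep it pending


if __name__ == "__main__":
    sched = TwoLevelScheduler(range(8), active_size=2)
    # Every fourth cycle, pretend the issuing thread misses to main memory.
    for cycle in range(12):
        tid = sched.issue(cycle, lambda t, c: c % 4 == 3, mem_latency=5)
        print(cycle, tid, list(sched.active))
```

In this model the per-cycle selection touches at most `active_size` entries regardless of the total thread count, which is the intuition behind the scheduler energy savings described above. A fuller model would also tie upper-level register file entries to active threads only, as the abstract describes.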
Index Terms
- A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors