research-article

A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors

Authors:
Mark Gebhart, The University of Texas at Austin
Daniel R. Johnson, University of Illinois at Urbana-Champaign
David Tarjan, NVIDIA
Stephen W. Keckler, NVIDIA and The University of Texas at Austin
William J. Dally, NVIDIA and Stanford University
Erik Lindholm, NVIDIA
Kevin Skadron, University of Virginia

ACM Transactions on Computer Systems, Volume 30, Issue 2, Article No.: 8, pp 1–38, https://doi.org/10.1145/2166879.2166882

Published: 01 April 2012

Abstract

Modern graphics processing units (GPUs) employ a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access in terms of both energy and latency. We present two complementary techniques for reducing energy on massively threaded processors such as GPUs. First, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Reducing the number of threads that the scheduler must consider each cycle improves the scheduler’s energy efficiency. Second, we propose replacing the monolithic register file found on modern designs with a hierarchical register file. We explore various trade-offs for the hierarchy, including the number of levels in the hierarchy and the number of entries at each level. We consider both a hardware-managed caching scheme and a software-managed scheme, where the compiler is responsible for orchestrating all data movement within the register file hierarchy. Combined with a hierarchical register file, our two-level thread scheduler provides a further reduction in energy by only allocating entries in the upper levels of the register file hierarchy for active threads. Averaging across a variety of real-world graphics and compute workloads, the active thread count can be reduced by a factor of 4 with minimal impact on performance, and our most efficient three-level software-managed register file hierarchy reduces register file energy by 54%.
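The two-level scheduling idea in the abstract can be illustrated with a toy simulation. The Python sketch below is not the authors' implementation: the `Warp` class, the `run_two_level` function, the `"alu"`/`"mem"` instruction encoding, and the latency values are all hypothetical choices made for illustration. It assumes one instruction issues per cycle, that a thread group is demoted to the pending set when it issues a long-latency memory operation, and that a group retires when it issues its last instruction.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(eq=False)
class Warp:
    """A hypothetical thread group with a toy instruction stream."""
    wid: int
    program: list        # sequence of "alu" or "mem" ops (illustrative encoding)
    pc: int = 0
    ready_at: int = 0    # first cycle at which the next instruction may issue

    def done(self):
        return self.pc >= len(self.program)


def run_two_level(warps, active_size, alu_latency=4, mem_latency=100):
    """Return (cycles, candidates_examined) for a toy two-level scheduler.

    Only warps in the small active set are considered for issue each cycle.
    A warp that issues a "mem" op moves to the pending set until the
    assumed memory latency has elapsed.
    """
    pending = deque(warps)
    active = []
    cycle = examined = finished = 0
    while finished < len(warps):
        # Promote ready pending warps while the active set has room.
        for _ in range(len(pending)):
            if len(active) >= active_size:
                break
            w = pending.popleft()
            if w.ready_at <= cycle:
                active.append(w)
            else:
                pending.append(w)
        # The scheduler examines only the active set this cycle.
        examined += len(active)
        issue = next((w for w in active if w.ready_at <= cycle), None)
        if issue is not None:
            op = issue.program[issue.pc]
            issue.pc += 1
            active.remove(issue)
            if issue.done():
                finished += 1
            elif op == "mem":
                issue.ready_at = cycle + mem_latency
                pending.append(issue)    # demote: waits on main memory
            else:
                issue.ready_at = cycle + alu_latency
                active.append(issue)     # stays active, moves to the back
        cycle += 1
    return cycle, examined


def make_warps(n):
    """Build n identical toy warps (workload shape is an assumption)."""
    return [Warp(i, ["alu"] * 3 + ["mem"] + ["alu"] * 3) for i in range(n)]


if __name__ == "__main__":
    for size in (32, 8):
        cycles, examined = run_two_level(make_warps(32), active_size=size)
        print(f"active_size={size}: cycles={cycles}, examined={examined}")
```

Setting `active_size` equal to the number of warps approximates a single-level scheduler, so running the script with both sizes gives a rough feel for how a smaller active set changes the number of candidates examined per cycle. The numbers it prints depend entirely on the assumed workload and latencies and say nothing about the results reported in the article.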


Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures


Published in

ACM Transactions on Computer Systems Volume 30, Issue 2
April 2012
111 pages
ISSN:0734-2071
EISSN:1557-7333
DOI:10.1145/2166879

Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 April 2012
- Accepted: 1 October 2011
- Received: 1 August 2011
Author Tags
Energy efficiency
multithreading
register file organization
throughput computing
Qualifiers
- research-article
- Research
- Refereed

