A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors

Published: 01 April 2012

Abstract

Modern graphics processing units (GPUs) employ a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. We present two complementary techniques for reducing energy on massively threaded processors such as GPUs. First, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Reducing the number of threads that the scheduler must consider each cycle improves the scheduler’s energy efficiency. Second, we propose replacing the monolithic register file found on modern designs with a hierarchical register file. We explore various trade-offs for the hierarchy including the number of levels in the hierarchy and the number of entries at each level. We consider both a hardware-managed caching scheme and a software-managed scheme, where the compiler is responsible for orchestrating all data movement within the register file hierarchy. Combined with a hierarchical register file, our two-level thread scheduler provides a further reduction in energy by only allocating entries in the upper levels of the register file hierarchy for active threads. Averaging across a variety of real-world graphics and compute workloads, the active thread count can be reduced by a factor of 4 with minimal impact on performance, and our most efficient three-level software-managed register file hierarchy reduces register file energy by 54%.
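The two-level scheduling idea described in the abstract can be illustrated with a small toy model. The sketch below is not the authors' design or simulator; it only assumes that short-latency operations keep a thread in a small active set, while a long-latency memory access demotes the thread to a pending set until the access returns. All names (`Thread`, `TwoLevelScheduler`, `make_threads`, `run`), the op labels `"alu"` and `"mem"`, the 20-cycle memory latency, and the single-issue round-robin policy are hypothetical choices made for illustration. Each thread is assumed to have a non-empty program.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Thread:
    tid: int
    program: list      # hypothetical op stream: "alu" (short) or "mem" (long latency)
    pc: int = 0
    ready_at: int = 0  # first cycle at which the thread may be promoted


class TwoLevelScheduler:
    """Toy model: a small active set issues; everything else waits in pending."""

    def __init__(self, threads, active_size, mem_latency=20):
        self.active = deque()
        self.pending = deque(threads)
        self.active_size = active_size
        self.mem_latency = mem_latency  # assumed main-memory latency in cycles
        self.issued = 0
        self.considered = 0  # threads examined by the issue logic, summed over cycles

    def _refill(self, cycle):
        # Promote pending threads whose long-latency access has returned.
        for _ in range(len(self.pending)):
            if len(self.active) >= self.active_size:
                break
            t = self.pending.popleft()
            if t.ready_at <= cycle:
                self.active.append(t)
            else:
                self.pending.append(t)

    def step(self, cycle):
        self._refill(cycle)
        self.considered += len(self.active)  # only active threads are scanned
        if not self.active:
            return  # every remaining thread is waiting on memory
        t = self.active.popleft()  # round-robin among active threads
        op = t.program[t.pc]
        t.pc += 1
        self.issued += 1
        if t.pc >= len(t.program):
            return  # thread finished; its active slot is freed
        if op == "mem":
            # Demote on a long-latency access so another thread can take the slot.
            t.ready_at = cycle + self.mem_latency
            self.pending.append(t)
        else:
            self.active.append(t)


def make_threads(n, program=("alu", "alu", "mem", "alu")):
    # Hypothetical workload: every thread runs the same short op sequence.
    return [Thread(tid=i, program=list(program)) for i in range(n)]


def run(threads, active_size, mem_latency=20):
    sched = TwoLevelScheduler(threads, active_size, mem_latency)
    cycle = 0
    while sched.active or sched.pending:
        sched.step(cycle)
        cycle += 1
    return cycle, sched.issued, sched.considered
```

Calling `run(make_threads(32), active_size=8)` and `run(make_threads(32), active_size=32)` returns the cycle count, the number of issued instructions, and the total number of threads the issue logic examined. In this toy model the last figure is bounded by `active_size` times the cycle count, which is the quantity a smaller active set is meant to shrink. The model says nothing about energy or about the register file hierarchy.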

    • Published in

      ACM Transactions on Computer Systems, Volume 30, Issue 2
      April 2012
      111 pages
      ISSN: 0734-2071
      EISSN: 1557-7333
      DOI: 10.1145/2166879

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 April 2012
      • Accepted: 1 October 2011
      • Received: 1 August 2011

      Qualifiers

      • research-article
      • Research
      • Refereed
