ABSTRACT
The key to high performance on GPUs is massive multithreading, which enables fast thread switching to hide function-unit and memory-access latency. However, running with the maximum thread-level parallelism (TLP) does not necessarily yield optimal performance, because excessive threads contend for cache resources. Thread throttling techniques are therefore employed to limit the number of concurrently executing threads and preserve data locality. At the same time, GPUs are equipped with a large register file to enable fast context switches between threads, and throttling techniques designed to mitigate cache contention leave these registers underutilized. Register allocation is thus a significant factor for performance: it not only determines single-thread performance but also indirectly affects the TLP.
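The register/TLP coupling described above can be illustrated with a small occupancy calculation. The following sketch is not from the paper; the register-file size and hardware thread limit are assumed Fermi-like values used only for illustration.

```python
def max_concurrent_threads(regs_per_thread,
                           regfile_size=32768,    # 32K 32-bit registers per SM (assumed, Fermi-like)
                           hw_thread_limit=1536): # hardware cap on resident threads (assumed)
    """Threads that can be co-resident on one SM given per-thread register usage."""
    if regs_per_thread == 0:
        return hw_thread_limit
    return min(regfile_size // regs_per_thread, hw_thread_limit)

# Fewer registers per thread -> more co-resident threads (higher TLP) but
# potentially more spills; more registers -> better single-thread performance
# but lower TLP. This is the tradeoff the abstract refers to.
print(max_concurrent_threads(63))  # register-hungry kernel
print(max_concurrent_threads(16))  # lean kernel hits the hardware cap
```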
The design space spanned by register allocation and TLP presents new opportunities for performance optimization. However, the complicated correlation between the two factors leads to many performance dynamics and uncertainties. In this paper, we propose Coordinated Register Allocation and Thread-level parallelism (CRAT), a compiler-based performance optimization framework. CRAT first enables effective register allocation: given a per-thread register limit, it allocates registers by analyzing the lifetimes of variables, and to reduce spilling cost it spills registers to shared memory when possible. CRAT then explores the design space, first pruning the design points that cause serious L1 cache thrashing or register underutilization, and then employing a prediction model to find the best tradeoff between single-thread performance and TLP. We evaluate CRAT using a set of representative GPU workloads. Experimental results show that, compared to the optimal thread throttling technique, our framework improves performance by up to 1.79X (geometric mean 1.25X).
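Lifetime-based allocation under a per-thread register limit can be sketched in the spirit of classic linear-scan allocation. This is an illustrative toy, not CRAT's actual algorithm: variable names and the spill heuristic (evict the interval with the furthest end point) are assumptions, and in CRAT's setting spilled values would go to shared memory when space permits.

```python
def allocate(intervals, reg_limit):
    """intervals: {var: (start, end)} live ranges; returns (assigned, spilled)."""
    active = []                     # (end, var) pairs currently holding a register
    assigned, spilled = {}, set()
    for var, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Expire intervals whose lifetime ended before this one starts.
        active = [(e, v) for (e, v) in active if e >= start]
        if len(active) < reg_limit:
            active.append((end, var))
            assigned[var] = True
        else:
            # Spill the live interval that extends furthest into the future.
            active.sort()
            far_end, far_var = active[-1]
            if far_end > end:
                del assigned[far_var]
                spilled.add(far_var)
                active[-1] = (end, var)
                assigned[var] = True
            else:
                spilled.add(var)
    return assigned, spilled

# With a 2-register budget, the long-lived 'a' is spilled so the
# short-lived 'b' and 'c' can stay in registers.
regs, spills = allocate({"a": (0, 10), "b": (1, 3), "c": (2, 5)}, reg_limit=2)
```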