research-article

Criticality stacks: identifying critical threads in parallel programs using synchronization behavior

Authors:
Kristof Du Bois (Ghent University, Belgium)
Stijn Eyerman (Ghent University, Belgium)
Jennifer B. Sartor (Ghent University, Belgium)
Lieven Eeckhout (Ghent University, Belgium)

ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture, June 2013, Pages 511–522
https://doi.org/10.1145/2485922.2485966

Published: 23 June 2013

ABSTRACT

Analyzing multi-threaded programs is challenging, but it is necessary to obtain good multicore performance while saving energy. Because of synchronization, certain threads make others wait, either because they hold a lock or because they have yet to reach a barrier. We call these critical threads, i.e., threads whose performance determines the performance of the program as a whole. Identifying these threads can reveal numerous optimization opportunities, both for the software developer and for hardware.

In this paper, we propose a new metric for assessing thread criticality, which combines how much time a thread spends performing useful work with how many co-running threads are waiting. We show how thread criticality can be calculated online with modest hardware additions and low overhead. We use our metric to create criticality stacks that break total execution time into each thread's criticality component, allowing for easy visual analysis of parallel imbalance.
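The abstract does not state the formula, so the following is only a minimal sketch of one metric that fits the description. It assumes execution is split into intervals during which the set of running (non-waiting) threads is constant, and that each interval's duration is divided equally among the threads running in it. Under that assumption a thread gains more criticality the longer it does useful work and the more co-running threads are waiting, and the per-thread values sum to total execution time, which is what lets them be drawn as a stack. The names `criticality_stack` and `most_critical` and the input format are hypothetical.

```python
from collections import defaultdict


def criticality_stack(intervals):
    """Attribute execution time to threads (illustrative assumption, not the paper's code).

    intervals: iterable of (duration, running_thread_ids), where the set of
    running threads is constant within each interval.
    Returns a dict mapping thread id -> criticality (in time units).
    """
    crit = defaultdict(float)
    for duration, running in intervals:
        running = set(running)
        if not running:
            continue  # no thread is doing useful work; nothing to attribute
        share = duration / len(running)  # fewer running threads -> larger share each
        for tid in running:
            crit[tid] += share
    return dict(crit)


def most_critical(stack):
    """Return the thread id with the largest criticality component."""
    return max(stack, key=stack.get)


if __name__ == "__main__":
    # Four threads run for 4 time units, then threads 2 and 3 wait at a
    # barrier for 2 units, then only thread 0 runs for 3 units.
    trace = [(4.0, [0, 1, 2, 3]), (2.0, [0, 1]), (3.0, [0])]
    stack = criticality_stack(trace)
    print(stack)
    print("most critical thread:", most_critical(stack))
```

In this toy trace, thread 0 accumulates most of the total time because it keeps running while the others wait, so it is the natural candidate for acceleration.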

To validate our criticality metric, and to demonstrate that it is better than previous metrics, we scale the frequency of the most critical thread and show that doing so achieves the largest performance improvement. We then demonstrate the broad applicability of criticality stacks by using them for three types of optimization: (1) analyzing programs to remove parallel bottlenecks, (2) dynamically identifying the most critical thread and accelerating it using frequency scaling to improve performance, and (3) accelerating only the most critical thread to achieve targeted energy reduction.


Published in
ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture
June 2013
686 pages
ISBN:9781450320795
DOI:10.1145/2485922
General Chair:
Avi Mendelson
Technion
ACM SIGARCH Computer Architecture News Volume 41, Issue 3
ISCA '13
June 2013
666 pages
ISSN:0163-5964
DOI:10.1145/2508148
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
ISCA '13 Paper Acceptance Rate: 56 of 288 submissions, 19%
Overall Acceptance Rate: 543 of 3,203 submissions, 17%