ABSTRACT
Analyzing multi-threaded programs is challenging, but it is necessary to obtain good multicore performance while saving energy. Due to synchronization, certain threads make others wait, because they hold a lock or have yet to reach a barrier. We call these critical threads, i.e., threads whose performance determines the performance of the program as a whole. Identifying these threads can reveal numerous optimization opportunities, both for the software developer and for hardware.
In this paper, we propose a new metric for assessing thread criticality, which combines how much time a thread spends performing useful work with how many co-running threads are waiting. We show how thread criticality can be calculated online with modest hardware additions and with low overhead. We use our metric to create criticality stacks that break total execution time into each thread's criticality component, allowing for easy visual analysis of parallel imbalance.
To validate our criticality metric, and to demonstrate that it is better than previous metrics, we scale the frequency of the most critical thread and show that doing so achieves the largest performance improvement. We then demonstrate the broad applicability of criticality stacks by using them to perform three types of optimizations: (1) analyzing programs to remove parallel bottlenecks, (2) dynamically identifying the most critical thread and accelerating it using frequency scaling to improve performance, and (3) accelerating only the most critical thread to achieve targeted energy reduction.
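The abstract does not state the metric's formula, so the following is only a minimal sketch of one plausible formalization, and an assumption rather than the paper's confirmed definition. It splits execution into intervals and divides each interval's duration equally among the threads doing useful work in it, so a thread earns a larger share when more of its co-runners are waiting. The names `criticality_stack` and `most_critical_thread`, and the interval representation, are hypothetical.

```python
from typing import List, Set, Tuple

# ASSUMPTION: an interval is (duration, ids of threads doing useful work).
# Threads not in the set are taken to be waiting on a lock or barrier.
Interval = Tuple[float, Set[int]]


def criticality_stack(intervals: List[Interval], num_threads: int) -> List[float]:
    """Return one criticality component per thread.

    Each interval's duration is shared equally among its active threads,
    so the components add up to the total time in which at least one
    thread was doing useful work.
    """
    stack = [0.0] * num_threads
    for duration, active in intervals:
        if not active:
            continue  # nobody is doing useful work; nothing to attribute
        share = duration / len(active)
        for tid in active:
            stack[tid] += share
    return stack


def most_critical_thread(stack: List[float]) -> int:
    """Index of the thread with the largest criticality component."""
    return max(range(len(stack)), key=lambda tid: stack[tid])


if __name__ == "__main__":
    # Hypothetical trace: four threads run together, then threads 2 and 3
    # wait at a barrier, then only thread 0 is still working.
    trace = [
        (10.0, {0, 1, 2, 3}),
        (6.0, {0, 1}),
        (4.0, {0}),
    ]
    stack = criticality_stack(trace, num_threads=4)
    print(stack)                        # [9.5, 5.5, 2.5, 2.5]
    print(sum(stack))                   # 20.0, the total execution time
    print(most_critical_thread(stack))  # 0
```

In this made-up trace, thread 0 accounts for 9.5 of the 20 time units, which would make it the candidate for acceleration by frequency scaling in the sense the abstract describes.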
Criticality stacks: identifying critical threads in parallel programs using synchronization behavior
ISCA '13