DOI: 10.1145/2485922.2485966
research-article

Criticality stacks: identifying critical threads in parallel programs using synchronization behavior

Published: 23 June 2013

ABSTRACT

Analyzing multi-threaded programs is quite challenging, but is necessary to obtain good multicore performance while saving energy. Due to synchronization, certain threads make others wait, because they hold a lock or have yet to reach a barrier. We call these critical threads, i.e., threads whose performance is determinative of program performance as a whole. Identifying these threads can reveal numerous optimization opportunities, for the software developer and for hardware.

In this paper, we propose a new metric for assessing thread criticality, which combines both how much time a thread is performing useful work and how many co-running threads are waiting. We show how thread criticality can be calculated online with modest hardware additions and with low overhead. We use our metric to create criticality stacks that break total execution time into each thread's criticality component, allowing for easy visual analysis of parallel imbalance.
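The idea of combining useful-work time with the number of waiting threads can be illustrated with a small sketch. The abstract does not give the formula, so the following is an assumption chosen to be consistent with it: execution is split into intervals during which the set of running (non-waiting) threads does not change, and each interval's duration is divided equally among the threads running in it. The names `Interval` and `criticality_stack`, and the sample data, are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# One interval: (duration, ids of threads doing useful work during it).
# Threads not listed are assumed to be waiting on a lock or barrier.
Interval = Tuple[float, Set[int]]


def criticality_stack(intervals: List[Interval]) -> Dict[int, float]:
    """Assumed metric: split each interval's time evenly over running threads.

    A thread that runs while many others wait receives a large share, so
    its criticality grows faster than that of a thread running alongside
    all of its peers.
    """
    crit: Dict[int, float] = {}
    for duration, running in intervals:
        if not running:
            # No thread is doing useful work; this sketch ignores such time.
            continue
        share = duration / len(running)
        for tid in running:
            crit[tid] = crit.get(tid, 0.0) + share
    return crit


# Hypothetical trace: 4 threads run together, then 2 wait, then 3 wait.
intervals = [
    (4.0, {0, 1, 2, 3}),
    (2.0, {0, 1}),
    (2.0, {0}),
]
stack = criticality_stack(intervals)
total_time = sum(duration for duration, _ in intervals)
most_critical = max(stack, key=stack.get)
```

Under this assumption the per-thread values add up to the total execution time whenever at least one thread is running in every interval, which matches the abstract's description of a stack that breaks total execution time into per-thread components.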

To validate our criticality metric, and demonstrate it is better than previous metrics, we scale the frequency of the most critical thread and show it achieves the largest performance improvement. We then demonstrate the broad applicability of criticality stacks by using them to perform three types of optimizations: (1) program analysis to remove parallel bottlenecks, (2) dynamically identifying the most critical thread and accelerating it using frequency scaling to improve performance, and (3) showing that accelerating only the most critical thread allows for targeted energy reduction.

Published in

    ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture
    June 2013
    686 pages
    ISBN: 9781450320795
    DOI: 10.1145/2485922

    ACM SIGARCH Computer Architecture News, Volume 41, Issue 3
    ISCA '13
    June 2013
    666 pages
    ISSN: 0163-5964
    DOI: 10.1145/2508148

    Copyright © 2013 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Acceptance Rates

    ISCA '13 Paper Acceptance Rate: 56 of 288 submissions, 19%
    Overall Acceptance Rate: 543 of 3,203 submissions, 17%
