DOI: 10.1145/1250662.1250668
Article

Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures

Published: 09 June 2007

ABSTRACT

Efficient fine-grain synchronization is essential for harnessing the computational power of many-core architectures. However, designing and implementing fine-grain synchronization on such architectures presents several challenges, including synchronization-induced overhead, storage cost, scalability, and the granularity at which synchronization can be applied. This paper proposes the Synchronization State Buffer (SSB), a scalable architectural design for fine-grain synchronization that efficiently performs synchronization between concurrent threads. The design of SSB is motivated by the following observation: at any instant during parallel execution, only a small fraction of memory locations are actively participating in synchronization. Based on this observation, we present a fine-grain synchronization design that records and manages the states of frequently synchronized data using modest hardware support. We have implemented the SSB design in the context of the 160-core IBM Cyclops-64 architecture. Using detailed simulation, we present our experience with a set of benchmarks with different workload characteristics.
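
The observation in the abstract can be illustrated with a small software model. The sketch below is not the paper's hardware design, which the abstract does not detail. It is a toy under stated assumptions: the class name `SyncStateBuffer`, the `capacity` parameter, the `'ok'`/`'busy'`/`'full'` return values, and the idea that a full buffer signals the caller to fall back to another mechanism are all hypothetical. It shows only that a small table holding state for the currently synchronized addresses can serve many addresses over time, because entries are recycled on release.

```python
class SyncStateBuffer:
    """Toy model (hypothetical): a fixed-capacity table of per-address lock state."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # address -> id of the thread holding the lock

    def acquire(self, address, thread_id):
        """Return 'ok', 'busy', or 'full' (assumed: caller falls back elsewhere)."""
        if address in self.entries:
            return "busy"
        if len(self.entries) >= self.capacity:
            return "full"
        self.entries[address] = thread_id
        return "ok"

    def release(self, address, thread_id):
        """Free the entry so the slot can track another address."""
        if self.entries.get(address) != thread_id:
            raise ValueError("lock not held by this thread")
        del self.entries[address]


ssb = SyncStateBuffer(capacity=2)
results = [
    ssb.acquire(0x1000, thread_id=0),  # first lock takes a free entry
    ssb.acquire(0x1000, thread_id=1),  # same address is already locked
    ssb.acquire(0x2000, thread_id=1),  # second entry
    ssb.acquire(0x3000, thread_id=2),  # no free entry left
]
ssb.release(0x1000, thread_id=0)
results.append(ssb.acquire(0x3000, thread_id=2))  # entry was recycled
print(results)  # ['ok', 'busy', 'ok', 'full', 'ok']
```

In this model, storage grows with the number of simultaneously active synchronizations rather than with the size of memory, which is the property the abstract's observation points to.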


Published in

ISCA '07: Proceedings of the 34th annual international symposium on Computer architecture
June 2007
542 pages
ISBN: 9781595937063
DOI: 10.1145/1250662
General Chair: Dean Tullsen
Program Chair: Brad Calder

ACM SIGARCH Computer Architecture News, Volume 35, Issue 2
May 2007
527 pages
ISSN: 0163-5964
DOI: 10.1145/1273440

Copyright © 2007 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 543 of 3,203 submissions (17%)
