Article

Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures

Authors:
Weirong Zhu, University of Delaware, Newark, DE
Vugranam C Sreedhar, IBM TJ Watson Research Center, Hawthorne, NY
Ziang Hu, University of Delaware, Newark, DE
Guang R. Gao, University of Delaware, Newark, DE

ISCA '07: Proceedings of the 34th annual international symposium on Computer architecture, June 2007, Pages 35–45
https://doi.org/10.1145/1250662.1250668

Published: 09 June 2007

ABSTRACT

Efficient fine-grain synchronization is extremely important to effectively harness the computational power of many-core architectures. However, designing and implementing fine-grain synchronization in such architectures presents several challenges, including synchronization-induced overhead, storage cost, scalability, and the level of granularity to which synchronization is applicable. This paper proposes the Synchronization State Buffer (SSB), a scalable architectural design for fine-grain synchronization that efficiently performs synchronizations between concurrent threads. The design of SSB is motivated by the following observation: at any instant during parallel execution, only a small fraction of memory locations are actively participating in synchronization. Based on this observation, we present a fine-grain synchronization design that records and manages the states of frequently synchronized data using modest hardware support. We have implemented the SSB design in the context of the 160-core IBM Cyclops-64 architecture. Using detailed simulation, we present our experience for a set of benchmarks with different workload characteristics.
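The observation in the abstract, that only a small fraction of memory locations participate in synchronization at any moment, can be illustrated with a toy software model. The sketch below is not the paper's hardware design: the class name `SyncStateBuffer`, its methods, the simple exclusive-lock state, and the overflow convention (returning `None` so the caller can fall back to some other mechanism) are all assumptions made for illustration. It only shows why a small, bounded table can suffice when entries exist solely for addresses that are currently locked.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SyncStateBuffer:
    """Toy model (hypothetical): holds state only for actively locked addresses."""

    capacity: int  # number of entries; assumed small relative to memory size
    entries: Dict[int, int] = field(default_factory=dict)  # address -> owner thread id

    def try_acquire(self, address: int, thread_id: int) -> Optional[bool]:
        """Return True if acquired, False if already held, None if the buffer is full."""
        if address in self.entries:
            return False  # held by some thread (non-reentrant in this sketch)
        if len(self.entries) >= self.capacity:
            return None  # overflow: the caller must fall back to another mechanism
        self.entries[address] = thread_id
        return True

    def release(self, address: int, thread_id: int) -> None:
        """Free the entry so an idle address occupies no buffer space."""
        if self.entries.get(address) != thread_id:
            raise ValueError("release by a thread that does not hold the lock")
        del self.entries[address]


ssb = SyncStateBuffer(capacity=2)
first = ssb.try_acquire(0x1000, thread_id=0)     # acquired
conflict = ssb.try_acquire(0x1000, thread_id=1)  # already held
second = ssb.try_acquire(0x2000, thread_id=1)    # acquired, buffer now full
overflow = ssb.try_acquire(0x3000, thread_id=2)  # no free entry
ssb.release(0x1000, thread_id=0)
retry = ssb.try_acquire(0x3000, thread_id=2)     # succeeds once an entry is freed
```

Because an entry is deleted on release, the table size tracks the number of locations being synchronized at that moment rather than the size of memory, which is the storage argument the abstract makes.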

Index Terms

Computer systems organization

Published in
ISCA '07: Proceedings of the 34th annual international symposium on Computer architecture
June 2007, 542 pages
ISBN: 9781595937063
DOI: 10.1145/1250662
General Chair: Dean Tullsen, University of California, San Diego
Program Chair: Brad Calder, Microsoft & University of California, San Diego

ACM SIGARCH Computer Architecture News, Volume 35, Issue 2
May 2007, 527 pages
ISSN: 0163-5964
DOI: 10.1145/1273440

Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
SSB
fine-grain synchronization
many-core
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 543 of 3,203 submissions, 17%