ABSTRACT
Efficient fine-grain synchronization is essential to harnessing the computational power of many-core architectures. However, designing and implementing fine-grain synchronization in such architectures presents several challenges, including synchronization-induced overhead, storage cost, scalability, and the level of granularity at which synchronization is applicable. This paper proposes the Synchronization State Buffer (SSB), a scalable architectural design for fine-grain synchronization that efficiently performs synchronization between concurrent threads. The design of SSB is motivated by the following observation: at any instant during parallel execution, only a small fraction of memory locations are actively participating in synchronization. Based on this observation, we present a fine-grain synchronization design that records and manages the states of frequently synchronized data using modest hardware support. We have implemented the SSB design in the context of the 160-core IBM Cyclops-64 architecture. Using detailed simulation, we present our experience with a set of benchmarks with different workload characteristics.
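The observation above can be illustrated with a small software model. The following Python sketch is only a toy under stated assumptions, not the SSB hardware design: the class name `SyncStateBuffer`, the `try_lock`/`unlock` methods, the three return strings, and the idea of reporting "full" when no entry is free are all hypothetical choices made for illustration. It shows how a small, fixed-capacity buffer can hold state only for the locations that are being synchronized at a given moment.

```python
from dataclasses import dataclass, field


@dataclass
class SyncStateBuffer:
    """Toy model of a small buffer that tracks only actively synchronized addresses.

    Hypothetical illustration; names and behavior are assumptions,
    not the hardware design described in the paper.
    """

    capacity: int
    # Maps a memory address to the id of the thread currently holding it.
    entries: dict = field(default_factory=dict)

    def try_lock(self, address: int, thread_id: int) -> str:
        """Attempt to lock `address` on behalf of `thread_id`.

        Returns "acquired" on success, "busy" if the address already has
        an entry, or "full" if no free entry is left (a real system would
        need some fallback path in that case; none is modeled here).
        """
        if address in self.entries:
            return "busy"
        if len(self.entries) >= self.capacity:
            return "full"
        self.entries[address] = thread_id
        return "acquired"

    def unlock(self, address: int, thread_id: int) -> bool:
        """Release `address` if `thread_id` holds it, freeing the entry for reuse."""
        if self.entries.get(address) != thread_id:
            return False
        del self.entries[address]
        return True
```

In this model, storage grows with the number of locations locked at the same time rather than with the size of memory: an entry exists only between `try_lock` and the matching `unlock`, so a buffer with a handful of entries can serve an arbitrarily large address space as long as few locations are contended at once.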