research-article

DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations

Authors:
Hyojin Sung

University of Illinois at Urbana-Champaign, Urbana, IL, USA

University of Illinois at Urbana-Champaign, Urbana, IL, USA
View Profile

,
Sarita V. Adve

University of Illinois at Urbana-Champaign, Urbana, IL, USA

University of Illinois at Urbana-Champaign, Urbana, IL, USA
View Profile

ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating SystemsMarch 2015Pages 545–559https://doi.org/10.1145/2694344.2694356

Published:14 March 2015Publication History

ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems

Pages 545–559

ABSTRACT

Current shared-memory hardware is complex and inefficient. Prior work on the DeNovo coherence protocol showed that disciplined shared-memory programming models can enable more complexity-, performance-, and energy-efficient hardware than the state-of-the-art MESI protocol. DeNovo, however, severely restricted the synchronization constructs an application can support. This paper proposes DeNovoSync, a technique to support arbitrary synchronization in DeNovo. The key challenge is that DeNovo exploits race-freedom to use reader-initiated local self-invalidations (instead of conventional writer-initiated remote cache invalidations) to ensure coherence. Synchronization accesses are inherently racy and not directly amenable to self-invalidations. DeNovoSync addresses this challenge using a novel combination of registration of all synchronization reads with a judicious hardware backoff to limit unnecessary registrations. For a wide variety of synchronization constructs and applications, compared to MESI, DeNovoSync shows comparable or up to 22% lower execution time and up to 58% lower network traffic, enabling DeNovo's advantages for a much broader class of software than previously possible.

References

S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29 (12): 66--76, 1996. Google ScholarDigital Library
A. Agarwal and M. Cherian. Adaptive backoff synchronization techniques. In Proceedings of the 16th Annual International Symposium on Computer Architecture, ISCA '89, 1989. Google ScholarDigital Library
N. Agarwal, T. Krishna, L.-S. Peh, and N. Jha. Garnet: A detailed interconnection network model inside a full-system simulation framework. Technical Report CE-P08-001, Princeton University, 2008. URL http://www.princeton.edu/~niketa/garnet.Google Scholar
T. E. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Trans. Parallel Distrib. Syst., 1 (1), Jan. 1990. Google ScholarDigital Library
B. Bershad, M. Zekauskas, and W. Sawdon. The midway distributed shared memory system. In Compcon Spring '93, Digest of Papers., Feb 1993.Google ScholarCross Ref
C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, Jan. 2011. Google ScholarDigital Library
R. L. Bocchino, Jr., V. S. Adve, D. Dig, S. V. Adve, S. Heumann, R. Komuravelli, J. Overbey, P. Simmons, H. Sung, and M. Vakilian. A type and effect system for deterministic parallel java. In Proceedings of the 24th ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '09, 2009. Google ScholarDigital Library
R. L. Bocchino, Jr., S. Heumann, N. Honarmand, S. V. Adve, V. S. Adve, A. Welc, and T. Shpeisman. Safe nondeterminism in a deterministic-by-default parallel language. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '11, 2011. Google ScholarDigital Library
H.-J. Boehm and S. V. Adve. Foundations of the cGoogle Scholar
concurrency memory model. In Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '08, 2008.Google Scholar
B. Choi, R. Komuravelli, H. Sung, R. Smolinski, N. Honarmand, S. V. Adve, V. S. Adve, N. P. Carter, and C.-T. Chou. Denovo: Rethinking the memory hierarchy for disciplined parallelism. In Proceedings of the 20th International Conference on Parallel Architectures and Compilation Techniques, PACT '11, 2011. Google ScholarDigital Library
M. Elver and V. Nagarajan. Tso-cc: Consistency directed cache coherence for tso. In IEEE 20th International Symposium on High Performance Computer Architecture, HPCA-20, Feb 2014.Google ScholarCross Ref
J. R. Goodman and P. J. Woest. The wisconsin multicube: A new large-scale cache-coherent multiprocessor. In Proceedings of the 15th Annual International Symposium on Computer Architecture, ISCA '88, 1988. Google ScholarDigital Library
J. R. Goodman, M. K. Vernon, and P. J. Woest. Efficient synchronization primitives for large-scale cache-coherent multiprocessors. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS III, 1989. Google ScholarDigital Library
M. Herlihy. A methodology for implementing highly concurrent data structures. In Proceedings of the Second ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, PPOPP '90, 1990. Google ScholarDigital Library
M. D. Hill, J. R. Larus, S. K. Reinhardt, and D. A. Wood. Cooperative shared memory: Software and hardware for scalable multiprocessor. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS V, 1992. Google ScholarDigital Library
L. Iftode, J. P. Singh, and K. Li. Scope consistency: A bridge between release consistency and entry consistency. In Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '96, 1996. Google ScholarDigital Library
S. Kaxiras and G. Keramidas. SARC Coherence: Scaling Directory Cache Coherence in Performance and Power. IEEE Micro, 30 (5), Sept.-Oct. 2010. Google ScholarDigital Library
P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy Release Consistency for Software Distributed Shared Memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture, ISCA '92, 1992. Google ScholarDigital Library
J. H. Kelm, D. R. Johnson, M. R. Johnson, N. C. Crago, W. Tuohy, A. Mahesri, S. S. Lumetta, M. I. Frank, and S. J. Patel. Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator. In Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA '09, 2009. Google ScholarDigital Library
J. H. Kelm, D. R. Johnson, W. Tuohy, S. S. Lumetta, and S. J. Patel. Cohesion: A Hybrid Memory Model for Accelerators. In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, 2010. Google ScholarDigital Library
R. Komuravelli, S. V. Adve, and C.-T. Chou. Revisiting the complexity of hardware cache coherence and some implications. ACM Trans. Archit. Code Optim., Dec. 2014. Google ScholarDigital Library
D. Koufaty, X. Chen, D. Poulsen, and J. Torrellas. Data forwarding in scalable shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 7 (12), dec 1996. Google ScholarDigital Library
J. R. Larus, S. Chandra, and D. A. Wood. Cico: A practical shared-memory programming performance model. In Workshop on Portability and Performance for Parallel Processing, 1993.Google Scholar
A. R. Lebeck and D. A. Wood. Dynamic self-invalidation: Reducing coherence overhead in shared-memory multiprocessors. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, ISCA '95, 1995. Google ScholarDigital Library
P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35: 50--58, 2002. Google ScholarDigital Library
J. Manson, W. Pugh, and S. V. Adve. The java memory model. In Proceedings of the 32Nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '05, 2005. Google ScholarDigital Library
M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset. SIGARCH Computer Architecture News, 33 (4): 92--99, 2005. Google ScholarDigital Library
M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing, PODC '96, 1996. Google ScholarDigital Library
M. M. Michael and M. L. Scott. Nonblocking algorithms and preemption-safe locking on multiprogrammed shared memory multiprocessors. J. Parallel Distrib. Comput., 51 (1), May 1998. Google ScholarDigital Library
S. L. Min and J.-L. Baer. Design and analysis of a scalable cache coherence scheme based on clocks and timestamps. IEEE Trans. on Parallel and Distributed Systems, 3 (2): 25--44, January 1992. Google ScholarDigital Library
R. Rajwar, A. Kagi, and J. Goodman. Improving the throughput of synchronization by insertion of delays. In Proceedings of the Sixth International Symposium on High-Performance Computer Architecture, HPCA-6, 2000.Google Scholar
A. Ros and S. Kaxiras. Complexity-effective multicore coherence. In Proceedings of the 21st international conference on Parallel architectures and compilation techniques, PACT '12, 2012. Google ScholarDigital Library
M. Scott. Shared Memory Synchronization. Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2013. ISBN 9781608459568. URL http://books.google.com/books?id=N4YcnQEACAAJ. Google ScholarDigital Library
S. Subramaniam, S. C. Steely, W. Hasenplaugh, A. Jaleel, C. Beckmann, T. Fossum, and J. Emer. Using in-flight chains to build a scalable cache coherence protocol. ACM Trans. Archit. Code Optim., 10 (4), Dec. 2013. Google ScholarDigital Library
H. Sung, R. Komuravelli, and S. V. Adve. DeNovoND: efficient hardware support for disciplined non-determinism. In Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems, ASPLOS '13, 2013. Google ScholarDigital Library
H. Sung, R. Komuravelli, and S. V. Adve. DeNovoND: efficient hardware for disciplined nondeterminism. IEEE Micro, 34 (3), 2014.Google Scholar
S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The splash-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, ISCA '95, 1995. Google ScholarDigital Library
J. Zebchuk, V. Srinivasan, M. K. Qureshi, and A. Moshovos. A tagless coherence directory. In Proceedings of the 42Nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, 2009. Google ScholarDigital Library

Index Terms

DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
2. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Recommendations

DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations
ASPLOS'15

Current shared-memory hardware is complex and inefficient. Prior work on the DeNovo coherence protocol showed that disciplined shared-memory programming models can enable more complexity-, performance-, and energy-efficient hardware than the state-of-...
Read More
DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations
ASPLOS '15

Current shared-memory hardware is complex and inefficient. Prior work on the DeNovo coherence protocol showed that disciplined shared-memory programming models can enable more complexity-, performance-, and energy-efficient hardware than the state-of-...
Read More
SWEL: hardware cache coherence protocols to map shared data onto shared caches
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques

Snooping and directory-based coherence protocols have become the de facto standard in chip multi-processors, but neither design is without drawbacks. Snooping protocols are not scalable, while directory protocols incur directory storage overhead, ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
March 2015
720 pages
ISBN:9781450328357
DOI:10.1145/2694344
General Chairs:
Ozcan Ozturk
Bilkent University, Turkey
,
Kemal Ebcioglu
Global Supercomputing, USA
,
Program Chair:
Sandhya Dwarkadas
University of Rochester, USA
ACM SIGPLAN Notices Volume 50, Issue 4
ASPLOS '15
April 2015
676 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2775054
Editor:
Andy Gill
University of Kansas, Lawrence, KS
Issue’s Table of Contents
ACM SIGARCH Computer Architecture News Volume 43, Issue 1
ASPLOS'15
March 2015
676 pages
ISSN:0163-5964
DOI:10.1145/2786763
Editor:
Doug DeGroot
acm dot org
Issue’s Table of Contents
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 14 March 2015
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
cache coherence
consistency
shared memory
synchronization
Qualifiers
- research-article
Conference

Acceptance Rates
ASPLOS '15 Paper Acceptance Rate48of287submissions,17%Overall Acceptance Rate535of2,713submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 26
  Total Citations
  View Citations
- 465
  Total Downloads
- Downloads (Last 12 months)30
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations

ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems

ABSTRACT

References

Cited By

Index Terms

Recommendations

DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations

DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations

SWEL: hardware cache coherence protocols to map shared data onto shared caches