ABSTRACT
Current shared-memory hardware is complex and inefficient. Prior work on the DeNovo coherence protocol showed that disciplined shared-memory programming models can enable more complexity-, performance-, and energy-efficient hardware than the state-of-the-art MESI protocol. DeNovo, however, severely restricted the synchronization constructs an application can support. This paper proposes DeNovoSync, a technique to support arbitrary synchronization in DeNovo. The key challenge is that DeNovo exploits race-freedom to use reader-initiated local self-invalidations (instead of conventional writer-initiated remote cache invalidations) to ensure coherence. Synchronization accesses are inherently racy and not directly amenable to self-invalidations. DeNovoSync addresses this challenge using a novel combination of registration of all synchronization reads with a judicious hardware backoff to limit unnecessary registrations. For a wide variety of synchronization constructs and applications, compared to MESI, DeNovoSync shows comparable or up to 22% lower execution time and up to 58% lower network traffic, enabling DeNovo's advantages for a much broader class of software than previously possible.
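The interaction between read registration and backoff can be illustrated with a toy model. This is only a sketch under stated assumptions, not the protocol evaluated in the paper: it assumes that a synchronization variable has a single registrant at a time, that a synchronization read by any other core moves the registration to that core, and that a core waits a fixed number of cycles after such a miss. The names `SyncWord`, `sync_read`, and `spin`, and the fixed-delay backoff policy, are hypothetical.

```python
class SyncWord:
    """Toy model of one synchronization variable with a single registrant."""

    def __init__(self):
        self.owner = None    # core currently holding the registration (assumed model)
        self.transfers = 0   # number of times the registration moved between cores

    def sync_read(self, core):
        """Return True on a local hit; otherwise register `core` and return False."""
        if self.owner == core:
            return True
        if self.owner is not None:
            self.transfers += 1  # registration taken from another core
        self.owner = core
        return False


def spin(num_cores, cycles, backoff):
    """Let every core spin-read the same word for `cycles` cycles.

    After a miss, a core waits `backoff` extra cycles before reading again
    (a hypothetical fixed-delay policy). Returns the registration transfers.
    """
    word = SyncWord()
    ready_at = [0] * num_cores  # earliest cycle at which each core may read again
    for now in range(cycles):
        for core in range(num_cores):
            if now < ready_at[core]:
                continue  # core is still backing off
            if not word.sync_read(core):
                ready_at[core] = now + 1 + backoff
    return word.transfers


if __name__ == "__main__":
    print("no backoff:  ", spin(num_cores=2, cycles=100, backoff=0))
    print("with backoff:", spin(num_cores=2, cycles=100, backoff=9))
```

In this model, two cores spinning on the same word pass the registration back and forth on every read, and adding a delay after each miss cuts the number of transfers roughly in proportion to the delay. A real design would also have to weigh the added latency in observing a release, which this sketch does not model.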
DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations
ASPLOS '15