ABSTRACT
With transistor miniaturization providing an abundance of on-chip resources and uniprocessor designs yielding diminishing returns, the industry has moved beyond single-core microprocessors and embraced the many-core wave. Scalable cache coherence protocol implementations are necessary to allow fast sharing of data among cores and to drive the many-core revolution forward. Snoopy coherence protocols, if realizable, have the desirable properties of low storage overhead and no indirection delay on cache-to-cache accesses. Various proposals, such as Token Coherence (TokenB), Uncorq, Intel QPI, INSO, and Timestamp Snooping, tackle the ordering of requests in snoopy protocols and make them realizable on unordered networks. However, snoopy protocols still incur broadcast overhead because each coherence request goes to all cores in the system, which has substantial network bandwidth and power implications. In this work, we propose embedding small in-network coherence filters inside on-chip routers that dynamically track sharing patterns among cores. This sharing information is used to filter out redundant snoop requests traveling towards cores that do not share the data. Filtering these useless messages saves network bandwidth and power and makes snoopy protocols on many-core systems truly scalable. On 16-processor chip multiprocessor (CMP) systems running parallel applications, our in-network coherence filters reduce the total number of snoops by 41.9% on average, thereby reducing total network traffic by 25.4%. On 64-processor CMP systems, our filtering technique achieves a 46.5% average reduction in the total number of snoops, which reduces total network traffic by 27.3% on average.
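The filtering idea can be illustrated with a minimal sketch. It is not the paper's design: the abstract does not say how sharing is tracked, so everything here is an assumption made for illustration, including the coarse-grained regions, the 4 KiB region size, the per-output-port table of regions known to be unshared, the oldest-first eviction, and the names `PortFilter`, `Router`, `mark_unshared`, `mark_shared`, and `route_snoop`. The sketch only shows why such a filter can stay small and remain safe: a missing entry never suppresses a snoop, it only lets one extra snoop through.

```python
from collections import OrderedDict

REGION_BITS = 12  # assumed 4 KiB regions; not a value taken from the paper


def region_of(address: int) -> int:
    """Map a byte address to its coarse-grained region number."""
    return address >> REGION_BITS


class PortFilter:
    """Bounded table of regions assumed unshared beyond one output port."""

    def __init__(self, capacity: int = 4) -> None:
        self.capacity = capacity
        self.unshared = OrderedDict()  # region -> None, oldest entry first

    def mark_unshared(self, region: int) -> None:
        """Record that no core reachable through this port holds the region."""
        self.unshared[region] = None
        self.unshared.move_to_end(region)
        if len(self.unshared) > self.capacity:
            # Eviction is safe: the region's snoops are simply forwarded again.
            self.unshared.popitem(last=False)

    def mark_shared(self, region: int) -> None:
        """Forget the region once a core beyond this port starts caching it."""
        self.unshared.pop(region, None)

    def should_forward(self, region: int) -> bool:
        """Forward unless the region is known to be unshared downstream."""
        return region not in self.unshared


class Router:
    """Hypothetical router holding one small filter per output port."""

    def __init__(self, ports, capacity: int = 4) -> None:
        self.filters = {port: PortFilter(capacity) for port in ports}

    def route_snoop(self, address: int) -> list:
        """Return the output ports on which a snoop must still be sent."""
        region = region_of(address)
        return [p for p, f in self.filters.items() if f.should_forward(region)]
```

In this sketch, calling `router.filters["east"].mark_unshared(region_of(0x4000))` on a four-port router makes `router.route_snoop(0x4010)` skip the east port, and a later `mark_shared` on the same region restores the full broadcast. How a real router learns and invalidates this information is exactly the part the abstract leaves open.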
Index Terms
- In-network coherence filtering: snoopy coherence without broadcasts