research-article

In-network coherence filtering: snoopy coherence without broadcasts

Authors:
Niket Agarwal, Princeton University
Li-Shiuan Peh, Massachusetts Institute of Technology
Niraj K. Jha, Princeton University
MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, December 2009, Pages 232–243. https://doi.org/10.1145/1669112.1669143

Published: 12 December 2009

ABSTRACT

As transistor miniaturization yields abundant on-chip resources and uniprocessor designs deliver diminishing returns, the industry has moved beyond single-core microprocessors and embraced many-core designs. Scalable cache coherence protocol implementations are necessary to allow fast sharing of data among cores and to drive the many-core revolution forward. Snoopy coherence protocols, if realizable, have the desirable properties of low storage overhead and no indirection delay on cache-to-cache accesses. Various proposals, such as Token Coherence (TokenB), Uncorq, Intel QPI, INSO, and Timestamp Snooping, tackle the ordering of requests in snoopy protocols and make them realizable on unordered networks. However, snoopy protocols still incur broadcast overhead because each coherence request goes to all cores in the system, which has substantial network bandwidth and power implications. In this work, we propose embedding small in-network coherence filters inside on-chip routers that dynamically track sharing patterns among cores. This sharing information is used to filter out redundant snoop requests traveling toward cores that do not share the requested data. Filtering these useless messages saves network bandwidth and power and makes snoopy protocols on many-core systems truly scalable. On 16-processor chip multiprocessor (CMP) systems running parallel applications, our in-network coherence filters reduce the total number of snoops by 41.9% on average, thereby reducing total network traffic by 25.4%. On 64-processor CMP systems, our filtering technique reduces the total number of snoops by 46.5% on average, which reduces total network traffic by 27.3% on average.
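
To make the filtering idea concrete, the following is a minimal Python sketch, not the authors' design. It assumes a hypothetical n x n mesh, a broadcast tree that spreads along the source's row first and then along each column, and an idealized filter with perfect knowledge of which cores hold the line. The abstract describes filters that learn sharing patterns dynamically, so a real filter that still reaches every sharer could prune no more than this oracle does. The function name `snoop_cost` and its parameters are illustrative only.

```python
from typing import Iterable, Set, Tuple

Coord = Tuple[int, int]  # (column, row) of a core and its router


def snoop_cost(n: int, src: Coord, sharers: Iterable[Coord],
               filtering: bool) -> Tuple[int, int]:
    """Return (link_traversals, snoops_delivered) for one coherence request.

    Assumption: the snoop spreads along the source's row, and every router
    in that row forwards it up and down its own column.
    """
    sx, sy = src
    targets: Set[Coord] = set(sharers) - {src}
    links = 0
    delivered = 0

    def follow(branch: Iterable[Coord]) -> bool:
        # A router forwards down a branch unless the (idealized) filter
        # knows that no core reachable through that branch holds the line.
        return (not filtering) or any(c in targets for c in branch)

    def deliver(core: Coord) -> bool:
        # Without filtering every core is snooped; with it, only sharers.
        return (not filtering) or core in targets

    def walk_column(x: int) -> None:
        nonlocal links, delivered
        for step in (1, -1):
            y = sy
            while 0 <= y + step < n:
                stop = n if step > 0 else -1
                rest = [(x, yy) for yy in range(y + step, stop, step)]
                if not follow(rest):
                    break  # snoop is filtered at this router
                y += step
                links += 1
                delivered += deliver((x, y))

    walk_column(sx)
    for step in (1, -1):
        x = sx
        while 0 <= x + step < n:
            stop = n if step > 0 else -1
            rest = [(xx, yy) for xx in range(x + step, stop, step)
                    for yy in range(n)]
            if not follow(rest):
                break  # no sharer anywhere beyond this router
            x += step
            links += 1
            delivered += deliver((x, sy))
            walk_column(x)
    return links, delivered


# Full broadcast on a 4x4 mesh: 15 links crossed, 15 cores snooped.
full = snoop_cost(4, (0, 0), {(3, 0)}, filtering=False)
# With the idealized filter and one sharer three hops away: 3 links, 1 snoop.
pruned = snoop_cost(4, (0, 0), {(3, 0)}, filtering=True)
```

In this toy model the saving comes from pruning whole branches of the broadcast tree at the first router that knows the branch is useless, so both the number of snoops delivered and the number of links crossed fall. The figures it produces are not comparable to the measurements reported in the abstract.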

Index Terms

Computer systems organization → Architectures → Parallel architectures → Interconnection architectures

Published in
MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
December 2009
601 pages
ISBN:9781605587981
DOI:10.1145/1669112
General Chairs: David Albonesi (Cornell), Margaret Martonosi (Princeton)
Program Chairs: David August (Princeton/Parakinetics), José Martínez (Cornell)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 484 of 2,242 submissions, 22%