DOI: 10.1145/1669112.1669143
research-article

In-network coherence filtering: snoopy coherence without broadcasts

Published: 12 December 2009

ABSTRACT

With transistor miniaturization leading to an abundance of on-chip resources and uniprocessor designs providing diminishing returns, the industry has moved beyond single-core microprocessors and embraced the many-core wave. Scalable cache coherence protocol implementations are necessary to allow fast sharing of data among cores and drive the many-core revolution forward. Snoopy coherence protocols, if realizable, have the desirable properties of low storage overhead and no indirection delay on cache-to-cache accesses. Various proposals, such as Token Coherence (TokenB), Uncorq, Intel QPI, INSO and Timestamp Snooping, tackle the ordering of requests in snoopy protocols and make them realizable on unordered networks. However, snoopy protocols still incur broadcast overhead because each coherence request goes to all cores in the system, which has substantial network bandwidth and power implications. In this work, we propose embedding small in-network coherence filters inside on-chip routers that dynamically track sharing patterns among cores. This sharing information is used to filter out redundant snoop requests that are traveling towards cores that do not share the requested data. Filtering these useless messages saves network bandwidth and power and makes snoopy protocols on many-core systems truly scalable. On 16-processor chip multiprocessor (CMP) systems running parallel applications, our in-network coherence filters reduce the total number of snoops in the system by 41.9% on average, thereby reducing total network traffic by 25.4%. On 64-processor CMP systems, our filtering technique achieves a 46.5% average reduction in the total number of snoops, which reduces total network traffic by 27.3% on average.
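The filtering idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the filters are organized, so everything here is an assumption made for illustration: the per-output-port table of "unshared" regions, the 4 KiB region granularity (`REGION_BITS`), and the names `RouterFilter`, `mark_unshared`, `mark_shared` and `forward_ports` are all hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: a toy in-router snoop filter.
# The table layout and region size are assumptions, not the paper's design.

REGION_BITS = 12  # assumed 4 KiB tracking granularity (hypothetical)


def region_of(address: int) -> int:
    """Map a physical address to the coarse region used for tracking."""
    return address >> REGION_BITS


class RouterFilter:
    """Per-router table of regions known to be unshared behind each port."""

    def __init__(self, ports):
        # For each output port, the set of regions that no core reachable
        # through that port is currently sharing.
        self.unshared = {port: set() for port in ports}

    def mark_unshared(self, port, address):
        """Record that nothing behind `port` shares this address's region."""
        self.unshared[port].add(region_of(address))

    def mark_shared(self, port, address):
        """Clear the entry once a core behind `port` starts sharing."""
        self.unshared[port].discard(region_of(address))

    def forward_ports(self, address, candidate_ports):
        """Return the ports a broadcast snoop still has to be sent on."""
        region = region_of(address)
        return [p for p in candidate_ports if region not in self.unshared[p]]


router = RouterFilter(["north", "east", "south", "west"])
router.mark_unshared("east", 0x1234)

# A snoop to the same region is dropped on the east port only.
filtered = router.forward_ports(0x1000, ["north", "east"])
# A snoop to a different region is still broadcast on both ports.
unfiltered = router.forward_ports(0x2000, ["north", "east"])
```

In this sketch a snoop is suppressed only when the router has positive knowledge that a region is unshared in a given direction, so a missing entry falls back to an ordinary broadcast. That conservative default is a design choice of the sketch, picked because dropping a snoop that a sharer needed would break coherence.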


    • Published in

      MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
      December 2009
      601 pages
      ISBN: 9781605587981
      DOI: 10.1145/1669112

      Copyright © 2009 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 484 of 2,242 submissions, 22%
