DOI: 10.1145/2464996.2465443
research-article

Prefetching and cache management using task lifetimes

Published: 10 June 2013

ABSTRACT

Task-based dataflow programming models and runtimes are emerging as promising candidates for programming multicore and manycore architectures. These programming models analyze task dependencies dynamically at runtime and schedule independent tasks concurrently on the processing elements. In such models, cache locality, which is critical for performance, becomes harder to achieve in the presence of fine-grain tasks and in architectures with many simple cores.

This paper presents a combined hardware-software approach to improve cache locality and offer better performance in terms of execution time and energy in the memory system. We propose the explicit bulk prefetcher (EBP) and epoch-based cache management (ECM) to help runtimes prefetch task data and guide the replacement decisions in caches. The runtime software can use this hardware support to expose its internal knowledge about the tasks to the architecture and achieve more efficient task-based execution. Our combined scheme outperforms hardware-only prefetchers and state-of-the-art replacement policies, improves performance by an average of 17%, generates on average 26% fewer L2 misses, and consumes on average 28% less energy in the components of the memory system.
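The abstract does not describe how the runtime hints are encoded or how the hardware acts on them. As a rough illustration of the general idea only, the Python sketch below models a small fully associative cache in which a runtime tags lines with the epoch (task) that uses them, prefetches a task's data in bulk, and marks an epoch as finished so that its lines become preferred eviction victims. The class `EpochCache` and its methods `bulk_prefetch`, `access`, and `end_epoch` are hypothetical names chosen for this sketch; they are not the paper's EBP or ECM interfaces, and the fallback to LRU is an assumption.

```python
from collections import OrderedDict


class EpochCache:
    """Toy fully associative cache; evicts lines of finished epochs first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> epoch id, ordered LRU to MRU
        self.dead_epochs = set()    # epochs whose task has finished
        self.misses = 0

    def _evict(self):
        # Prefer the least recently used line owned by a finished epoch.
        victim = next(
            (a for a, e in self.lines.items() if e in self.dead_epochs), None
        )
        if victim is None:
            self.lines.popitem(last=False)  # assumed fallback: plain LRU
        else:
            del self.lines[victim]

    def _install(self, addr, epoch):
        if addr in self.lines:
            self.lines.move_to_end(addr)  # refresh recency
            self.lines[addr] = epoch      # line now belongs to this epoch
            return
        if len(self.lines) >= self.capacity:
            self._evict()
        self.lines[addr] = epoch

    def bulk_prefetch(self, addrs, epoch):
        """Hypothetical runtime hint: load a task's data before it runs."""
        for addr in addrs:
            self._install(addr, epoch)

    def access(self, addr, epoch):
        """Demand access by a running task; returns True on a hit."""
        hit = addr in self.lines
        if not hit:
            self.misses += 1
        self._install(addr, epoch)
        return hit

    def end_epoch(self, epoch):
        """Hypothetical runtime hint: the task is done, its lines may go."""
        self.dead_epochs.add(epoch)


cache = EpochCache(capacity=4)
cache.bulk_prefetch([2, 3], epoch=2)  # data for a task that runs later
cache.bulk_prefetch([0, 1], epoch=1)  # data for the task that runs now
cache.access(0, 1)
cache.access(1, 1)
cache.end_epoch(1)                    # task 1 finished
cache.bulk_prefetch([4, 5], epoch=3)  # evicts task 1's lines, not task 2's
cache.access(2, 2)
cache.access(3, 2)
print(cache.misses)  # 0
```

In this toy trace, plain LRU would have evicted addresses 2 and 3 (the oldest lines) even though their task had not yet run; knowing that epoch 1 is finished lets the cache discard addresses 0 and 1 instead.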


          • Published in

            ICS '13: Proceedings of the 27th international ACM conference on International conference on supercomputing
            June 2013
            512 pages
            ISBN:9781450321303
            DOI:10.1145/2464996

            Copyright © 2013 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Acceptance Rates

            ICS '13 Paper Acceptance Rate: 43 of 202 submissions, 21%. Overall Acceptance Rate: 584 of 2,055 submissions, 28%.
