ABSTRACT
Task-based dataflow programming models and runtimes are emerging as promising candidates for programming multicore and manycore architectures. These programming models dynamically analyze task dependencies at runtime and schedule independent tasks concurrently on the processing elements. In such models, cache locality, which is critical for performance, becomes more challenging in the presence of fine-grain tasks and in architectures with many simple cores.
This paper presents a combined hardware-software approach to improve cache locality and offer better performance in terms of execution time and energy in the memory system. We propose the explicit bulk prefetcher (EBP) and epoch-based cache management (ECM) to help runtimes prefetch task data and guide the replacement decisions in caches. The runtime software can use this hardware support to expose its internal knowledge about the tasks to the architecture and achieve more efficient task-based execution. Our combined scheme outperforms HW-only prefetchers and state-of-the-art replacement policies, improves performance by an average of 17%, generates on average 26% fewer L2 misses, and consumes on average 28% less energy in the components of the memory system.
Index Terms
- Prefetching and cache management using task lifetimes