Prefetching and cache management using task lifetimes

Authors:
Vassilis Papaefstathiou, FORTH-ICS, Heraklion, Crete, Greece
Manolis G.H. Katevenis, FORTH-ICS, Heraklion, Crete, Greece
Dimitrios S. Nikolopoulos, Queen's University of Belfast, Belfast, United Kingdom
Dionisios Pnevmatikatos, FORTH-ICS, Heraklion, Crete, Greece

ICS '13: Proceedings of the 27th international ACM conference on International conference on supercomputing, June 2013, Pages 325–334, https://doi.org/10.1145/2464996.2465443

Published: 10 June 2013

ABSTRACT

Task-based dataflow programming models and runtimes are emerging as promising candidates for programming multicore and manycore architectures. These programming models dynamically analyze task dependencies at runtime and schedule independent tasks concurrently on the processing elements. In such models, cache locality, which is critical for performance, becomes more challenging to preserve in the presence of fine-grain tasks and in architectures with many simple cores.

This paper presents a combined hardware-software approach to improve cache locality and offer better performance in terms of execution time and energy in the memory system. We propose the explicit bulk prefetcher (EBP) and epoch-based cache management (ECM) to help runtimes prefetch task data and guide the replacement decisions in caches. The runtime software can use this hardware support to expose its internal knowledge about the tasks to the architecture and achieve more efficient task-based execution. Our combined scheme outperforms HW-only prefetchers and state-of-the-art replacement policies, improves performance by an average of 17%, generates on average 26% fewer L2 misses, and consumes on average 28% less energy in the components of the memory system.
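
The abstract does not say how EBP or ECM are implemented, so the following Python sketch is only a toy software model of the general idea under assumed semantics, not the paper's hardware design. It assumes that each cache line is tagged with the epoch (task) that brought it in, that the runtime can bulk-prefetch a task's data ahead of execution, and that the runtime can tell the cache when a task's lifetime has ended so that its lines become preferred eviction victims. The class `EpochCache` and the methods `bulk_prefetch`, `access`, and `end_epoch` are hypothetical names introduced for illustration.

```python
from collections import OrderedDict


class EpochCache:
    """Toy fully associative cache whose victim choice is guided by task epochs.

    Hypothetical model for illustration; not the design evaluated in the paper.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> epoch tag, least recently used first
        self.finished = set()       # epochs the runtime has declared complete
        self.misses = 0             # demand misses only (prefetch fills not counted)

    def _evict(self):
        # Prefer the least recently used line whose owning epoch has finished.
        for addr, epoch in self.lines.items():
            if epoch in self.finished:
                del self.lines[addr]
                return
        # No dead line available: fall back to plain LRU.
        self.lines.popitem(last=False)

    def _fill(self, addr, epoch):
        if addr in self.lines:
            self.lines.move_to_end(addr)  # refresh recency
            self.lines[addr] = epoch      # line now belongs to the touching epoch
            return
        if len(self.lines) >= self.capacity:
            self._evict()
        self.lines[addr] = epoch

    def bulk_prefetch(self, epoch, addrs):
        """Runtime-initiated prefetch of a task's data, tagged with its epoch."""
        for addr in addrs:
            self._fill(addr, epoch)

    def access(self, epoch, addr):
        """Demand access issued by a running task."""
        if addr not in self.lines:
            self.misses += 1
        self._fill(addr, epoch)

    def end_epoch(self, epoch):
        """Runtime hint: the task has finished, so its lines are dead."""
        self.finished.add(epoch)


cache = EpochCache(capacity=4)
cache.bulk_prefetch(3, [10, 11])   # data of a long-running task (epoch 3)
cache.bulk_prefetch(1, [0, 1])     # data of a short task (epoch 1)
cache.access(1, 0)
cache.access(1, 1)
cache.end_epoch(1)                 # epoch 1 is over; its lines are now dead
cache.bulk_prefetch(2, [2, 3])     # evicts the dead lines 0 and 1, not 10 and 11
cache.access(3, 10)                # still resident, so no demand miss
print(cache.misses, sorted(cache.lines))  # prints: 0 [2, 3, 10, 11]
```

In this toy run, plain LRU would have evicted addresses 10 and 11 (the least recently used lines) when epoch 2 was prefetched, and the later access to address 10 would have missed. Using the end-of-task hint keeps the live task's data resident.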

Published in
ICS '13: Proceedings of the 27th international ACM conference on International conference on supercomputing
June 2013
512 pages
ISBN: 9781450321303
DOI: 10.1145/2464996
General Chair: Allen D. Malony, University of Oregon, USA
Program Chairs: Mario Nemirovsky, Barcelona Supercomputing Center, Spain; Sam Midkiff, Purdue University, USA
Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cache management
prefetching
task-based programming
Qualifiers
- research-article
Acceptance Rates
ICS '13 Paper Acceptance Rate: 43 of 202 submissions, 21%
Overall Acceptance Rate: 584 of 2,055 submissions, 28%