research-article

libPRISM: an intelligent adaptation of prefetch and SMT levels

Authors:
Cristobal Ortega, Universitat Politecnica de Catalunya (UPC) and Barcelona Supercomputing Center (BSC-CNS)
Miquel Moreto, Universitat Politecnica de Catalunya (UPC) and Barcelona Supercomputing Center (BSC-CNS)
Marc Casas, Barcelona Supercomputing Center (BSC-CNS)
Ramon Bertran, IBM T.J. Watson Research Center
Alper Buyuktosunoglu, IBM T.J. Watson Research Center
Alexandre E. Eichenberger, IBM T.J. Watson Research Center
Pradip Bose, IBM T.J. Watson Research Center

ICS '17: Proceedings of the International Conference on Supercomputing, June 2017, Article No.: 28, Pages 1–10. https://doi.org/10.1145/3079079.3079101

Published: 14 June 2017

ABSTRACT

Current microprocessors include several knobs that modify hardware behavior in order to improve performance under different workload demands. Finding the optimal knob configuration requires evaluating the design space through offline profiling, which is impractical and time-consuming. To avoid this profiling, different knobs are typically configured in a decoupled manner. This can often lead to underperforming configurations and sometimes to conflicting decisions that jeopardize system power-performance efficiency. Thus, dynamic management of the different hardware knobs is necessary to find the knob configuration that maximizes system power-performance efficiency without the burden of offline profiling.

In this paper, we propose libPRISM, an infrastructure that enables the transparent management of multiple hardware knobs in order to adapt the system to the evolving demands for hardware resources in different workloads. We use libPRISM to implement a policy that maximizes system performance without degrading energy efficiency by dynamically managing the SMT level and prefetcher hardware knobs of an IBM POWER8 system. We evaluate our solution using 24 applications from 3 different parallel benchmark suites, without the need for offline profiling or workload modification. Overall, the solution increases performance by up to 220% (15.4% on average) and reduces dynamic power consumption by up to 13% (2.0% on average) when compared to the static default knob configuration.
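As a rough illustration of the kind of policy described above, the following minimal sketch samples each (SMT level, prefetcher setting) pair and keeps the fastest one that is at least as energy-efficient as the default. It is not the libPRISM algorithm or API: the names `choose_config`, `Sample`, and `measure`, the prefetcher labels, the choice of default, and all numbers are assumptions made for illustration only.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, Tuple

# (SMT level, prefetcher setting); the string labels are illustrative only.
Config = Tuple[int, str]


@dataclass(frozen=True)
class Sample:
    perf: float   # e.g. work completed per second during a sampling interval
    power: float  # e.g. average watts over the same interval

    @property
    def efficiency(self) -> float:
        return self.perf / self.power


def choose_config(
    measure: Callable[[Config], Sample],
    smt_levels: Iterable[int],
    prefetch_settings: Iterable[str],
    default: Config,
) -> Config:
    """Return the fastest sampled configuration whose performance per watt
    is not below that of the default configuration."""
    baseline = measure(default)
    best_cfg, best = default, baseline
    for cfg in product(smt_levels, prefetch_settings):
        if cfg == default:
            continue
        sample = measure(cfg)
        if sample.efficiency >= baseline.efficiency and sample.perf > best.perf:
            best_cfg, best = cfg, sample
    return best_cfg


# Made-up measurements standing in for hardware counters and power sensors.
FAKE = {
    (4, "default"): Sample(100.0, 50.0),
    (4, "off"): Sample(90.0, 44.0),       # more efficient but slower
    (8, "default"): Sample(130.0, 70.0),  # faster but less efficient
    (8, "off"): Sample(120.0, 58.0),      # faster and more efficient
}

best = choose_config(FAKE.__getitem__, [4, 8], ["default", "off"], (4, "default"))
print(best)
```

A real runtime would replace the `measure` callable with code that applies a configuration, runs for a sampling interval, and reads performance counters and power sensors. It would also re-sample as the workload changes phase, which this one-shot exhaustive search does not attempt.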

Published in
ICS '17: Proceedings of the International Conference on Supercomputing
June 2017
300 pages
ISBN: 9781450350204
DOI: 10.1145/3079079
General Chairs: William D. Gropp (University of Illinois at Urbana-Champaign, Illinois), Pete Beckman (Argonne National Laboratory/Northwestern University, Illinois)
Program Chairs: Zhiyuan Li (Purdue University, West Lafayette, Indiana), Francisco J. Cazorla (IIIA-CSIC and Barcelona Supercomputing Center, Barcelona, Spain)
Copyright © 2017 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 584 of 2,055 submissions, 28%