ABSTRACT
Current microprocessors include several knobs that modify hardware behavior in order to improve performance under different workload demands. Finding the optimal knob configuration requires evaluating the design space through offline profiling, which is impractical and time-consuming. To avoid this profiling, the different knobs are typically configured in a decoupled manner, which often leads to underperforming configurations and sometimes to conflicting decisions that jeopardize system power-performance efficiency. Thus, dynamic management of the different hardware knobs is necessary to find the knob configuration that maximizes system power-performance efficiency without the burden of offline profiling.
In this paper, we propose libPRISM, an infrastructure that enables the transparent management of multiple hardware knobs in order to adapt the system to the evolving hardware resource demands of different workloads. We use libPRISM to implement a policy that maximizes system performance without degrading energy efficiency by dynamically managing the SMT level and prefetcher hardware knobs of an IBM POWER8 system. We evaluate our solution using 24 applications from 3 different parallel benchmark suites, without the need for offline profiling or workload modification. Overall, the solution increases performance by up to 220% (15.4% on average) and reduces dynamic power consumption by up to 13% (2.0% on average) compared to the static default knob configuration.
Index Terms
- libPRISM: an intelligent adaptation of prefetch and SMT levels