Workload Characteristics-based L1 Data Cache Switching-off Mechanism for GPUs

  • Do, Thuan Cong (Dept. of Computer Science and Engineering, Korea University) ;
  • Kim, Gwang Bok (School of Electronics and Computer Engineering, Chonnam National University) ;
  • Kim, Cheol Hong (School of Electronics and Computer Engineering, Chonnam National University)
  • Received : 2018.07.25
  • Accepted : 2018.10.17
  • Published : 2018.10.31

Abstract

Modern graphics processing units (GPUs) have become one of the most attractive platforms for exploiting high thread-level parallelism, supported by programming tools such as CUDA and OpenCL. Recent GPUs have adopted a cache hierarchy to support irregular memory access patterns; however, the L1 data cache (L1D) exhibits poor efficiency on the GPU. This paper shows that the L1D does not always benefit applications in terms of performance and energy efficiency; for many applications, GPU performance is actually harmed by using the L1D. Our proposed technique exploits the characteristics of the currently executing application to predict the performance impact of the L1D on the GPU, and then decides whether the application should continue to use the cache. Our experimental results show that the proposed technique improves GPU performance by 9.4% and saves up to 52.1% of the power consumed in the L1D.
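The decision logic the abstract describes can be illustrated with a minimal sketch: profile the running application's L1D behavior over a sampling window, and switch the cache off when the observed characteristics suggest it hurts rather than helps. The specific counters (hit rate, L1D-related stall ratio) and the threshold values below are illustrative assumptions for this sketch, not the paper's actual predictor.

```python
# Hypothetical sketch of a workload-characteristics-based L1D switch-off
# decision. The counters sampled and the threshold values are assumptions
# made for illustration; the paper's mechanism may use different metrics.

def should_switch_off_l1d(hits, accesses, l1d_stall_cycles, total_cycles,
                          hit_rate_threshold=0.2, stall_threshold=0.3):
    """Return True if the profiled window suggests the L1D harms performance."""
    if accesses == 0:
        # No L1D traffic in the window: keeping the cache on only burns
        # static power, so switching it off is the safe choice.
        return True
    hit_rate = hits / accesses
    stall_ratio = l1d_stall_cycles / total_cycles
    # A low hit rate combined with heavy stalling on L1D resources indicates
    # the cache adds latency and contention without delivering reuse.
    return hit_rate < hit_rate_threshold and stall_ratio > stall_threshold
```

In a streaming kernel with little data reuse (e.g., 10 hits out of 100 accesses while 40% of cycles stall on the L1D), this heuristic would bypass the cache; a reuse-friendly kernel with an 80% hit rate would keep it enabled.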
