research-article

SPMPool: Runtime SPM Management for Memory-Intensive Applications in Embedded Many-Cores

Published: 23 October 2016

Abstract

Distributed Scratchpad Memories (SPMs) in embedded many-core systems require careful selection of data placement to achieve good performance. Applications mapped to these platforms have varying memory requirements based on their runtime behavior, resulting in under- or overutilization of the local SPMs. We propose SPMPool to share the available on-chip SPMs on many-cores among concurrently executing applications in order to reduce the overall memory access latency. By pooling SPM resources, we can assign underutilized memory resources, due to idle cores or low memory usage, to applications dynamically. SPMPool is the first workload-aware SPM mapping solution for many-cores that dynamically allocates data at runtime—using profiled data—to address the unpredictable set of concurrently executing applications. Our experiments on workloads with varying interapplication memory intensity show that SPMPool can achieve up to 76% reduction in memory access latency for configurations ranging from 16 to 256 cores, compared to the traditional approach that limits executing cores to use their local SPMs.

    • Published in

      ACM Transactions on Embedded Computing Systems  Volume 16, Issue 1
      Special Issue on VIPES, Special Issue on ICESS2015 and Regular Papers
      February 2017
      602 pages
      ISSN:1539-9087
      EISSN:1558-3465
      DOI:10.1145/3008024

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 23 October 2016
      • Accepted: 1 July 2016
      • Revised: 1 June 2016
      • Received: 1 February 2015

      Qualifiers

      • research-article
      • Research
      • Refereed
