SPMPool: Runtime SPM Management for Memory-Intensive Applications in Embedded Many-Cores

Authors:
Hossein Tajik, University of California, Irvine, CA
Bryan Donyanavard, University of California, Irvine, CA
Nikil Dutt, University of California, Irvine, CA
Janmartin Jahn, Karlsruhe Institute of Technology, Germany
Jörg Henkel, Karlsruhe Institute of Technology, Germany

ACM Transactions on Embedded Computing Systems, Volume 16, Issue 1, Article No. 25, pp. 1–27. https://doi.org/10.1145/2968447

Published: 23 October 2016

Abstract

Distributed Scratchpad Memories (SPMs) in embedded many-core systems require careful selection of data placement to achieve good performance. Applications mapped to these platforms have varying memory requirements based on their runtime behavior, resulting in under- or overutilization of the local SPMs. We propose SPMPool to share the available on-chip SPMs on many-cores among concurrently executing applications in order to reduce the overall memory access latency. By pooling SPM resources, we can assign underutilized memory resources, due to idle cores or low memory usage, to applications dynamically. SPMPool is the first workload-aware SPM mapping solution for many-cores that dynamically allocates data at runtime—using profiled data—to address the unpredictable set of concurrently executing applications. Our experiments on workloads with varying interapplication memory intensity show that SPMPool can achieve up to 76% reduction in memory access latency for configurations ranging from 16 to 256 cores, compared to the traditional approach that limits executing cores to use their local SPMs.
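
The pooling idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it is a minimal greedy placement, assuming a 2D mesh of cores, fixed-size SPMs, and profiled per-block access counts. Every name (`Block`, `hops`, `total_latency`) and every latency constant is hypothetical and chosen only for illustration. The sketch places the most frequently accessed blocks first, either restricted to the owner's local SPM or allowed to spill into the nearest SPM with free space, and sums the resulting access latency.

```python
from dataclasses import dataclass

# All latency values are illustrative assumptions, not figures from the paper.
SPM_LATENCY = 1        # cycles for a local SPM access
HOP_LATENCY = 2        # extra cycles per mesh hop to a remote SPM
OFFCHIP_LATENCY = 100  # cycles for an off-chip memory access


@dataclass
class Block:
    core: int      # index of the core whose application owns this block
    size: int      # block size in bytes
    accesses: int  # profiled access count


def hops(a, b, width):
    """Manhattan distance between cores a and b on a mesh `width` cores wide."""
    return abs(a % width - b % width) + abs(a // width - b // width)


def total_latency(blocks, n_cores, width, spm_size, pooled):
    """Greedily place the most-accessed blocks first; return total access latency."""
    free = [spm_size] * n_cores
    total = 0
    for blk in sorted(blocks, key=lambda b: b.accesses, reverse=True):
        # Local-only mode may use just the owner's SPM; pooled mode may use any SPM.
        candidates = range(n_cores) if pooled else [blk.core]
        fitting = [c for c in candidates if free[c] >= blk.size]
        if fitting:
            # Pick the closest SPM that still has room for the block.
            target = min(fitting, key=lambda c: hops(blk.core, c, width))
            free[target] -= blk.size
            cost = SPM_LATENCY + HOP_LATENCY * hops(blk.core, target, width)
        else:
            # No SPM can hold the block, so every access goes off-chip.
            cost = OFFCHIP_LATENCY
        total += blk.accesses * cost
    return total


# One memory-intensive application on core 0 of a 2x2 mesh; cores 1-3 are idle.
blocks = [Block(0, 4, 100), Block(0, 4, 50), Block(0, 4, 10)]
local_only = total_latency(blocks, n_cores=4, width=2, spm_size=4, pooled=False)
pooled = total_latency(blocks, n_cores=4, width=2, spm_size=4, pooled=True)
print(local_only, pooled)
```

In this toy configuration the local-only policy sends two of the three blocks off-chip, while the pooled policy places them in the idle neighbors' SPMs. The size of the gap depends entirely on the assumed latencies and should not be read as a reproduction of the paper's results.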

Index Terms

1. Computer systems organization
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management
        Distributed memory
        Virtual memory

Published in
ACM Transactions on Embedded Computing Systems Volume 16, Issue 1
Special Issue on VIPES, Special Issue on ICESS2015 and Regular Papers
February 2017
602 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/3008024
Editor:
Sandeep K. Shukla
Indian Institute of Technology, India
Issue’s Table of Contents
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States

Journal Family
ACM Journals for the Design of Smart and Connected Systems
Publication History
- Published: 23 October 2016
- Accepted: 1 July 2016
- Revised: 1 June 2016
- Received: 1 February 2015
Author Tags
Scratchpad memory
many-core
memory mapping
runtime system
Qualifiers
- research-article
- Research
- Refereed