Abstract
Distributed scratchpad memories (SPMs) in embedded many-core systems require careful data placement to achieve good performance. Applications mapped to these platforms have memory requirements that vary with their runtime behavior, which leaves the local SPMs under- or overutilized. We propose SPMPool, which shares the available on-chip SPMs of a many-core among concurrently executing applications in order to reduce the overall memory access latency. Pooling SPM resources lets us dynamically assign memory that is underutilized, because of idle cores or low memory usage, to the applications that need it. SPMPool is the first workload-aware SPM mapping solution for many-cores that dynamically allocates data at runtime, using profiled data, to address the unpredictable set of concurrently executing applications. Our experiments on workloads with varying inter-application memory intensity show that SPMPool can reduce memory access latency by up to 76% for configurations ranging from 16 to 256 cores, compared to the traditional approach that restricts each executing core to its local SPM.
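The pooling idea in the abstract can be illustrated with a minimal sketch. This is not the SPMPool allocation algorithm, which the abstract does not specify. It is a simple greedy heuristic under stated assumptions: cores sit on a 2D mesh, each data block carries a profiled access count, and the names `Block`, `hops`, `place`, and `total_latency`, along with the cycle costs, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    home: int      # index of the core running the owning application
    size: int      # block size in bytes
    accesses: int  # profiled access count (assumed to come from offline profiling)


def hops(a, b, width):
    """Manhattan distance between cores a and b on a mesh that is `width` cores wide."""
    return abs(a % width - b % width) + abs(a // width - b // width)


def place(blocks, free, width):
    """Greedily map blocks to SPMs.

    `free` lists the free SPM bytes per core. Returns {block index: core},
    with None meaning the block stays in off-chip memory.
    """
    free = list(free)  # work on a copy so the caller's list is unchanged
    placement = {}
    # Serve the blocks with the most profiled accesses per byte first.
    order = sorted(range(len(blocks)),
                   key=lambda i: blocks[i].accesses / blocks[i].size,
                   reverse=True)
    for i in order:
        b = blocks[i]
        candidates = [c for c in range(len(free)) if free[c] >= b.size]
        if not candidates:
            placement[i] = None
            continue
        # Pick the closest SPM with room; ties go to the lower core index.
        target = min(candidates, key=lambda c: (hops(b.home, c, width), c))
        free[target] -= b.size
        placement[i] = target
    return placement


def total_latency(blocks, placement, width,
                  spm_cycles=1, hop_cycles=2, offchip_cycles=100):
    """Sum access latency in cycles under assumed, purely illustrative costs."""
    total = 0
    for i, b in enumerate(blocks):
        core = placement[i]
        if core is None:
            total += b.accesses * offchip_cycles
        else:
            total += b.accesses * (spm_cycles + hop_cycles * hops(b.home, core, width))
    return total


# One application on core 0 of a 2x2 mesh; cores 1 to 3 are idle.
blocks = [Block(home=0, size=4096, accesses=1000),
          Block(home=0, size=4096, accesses=500)]
pooled = place(blocks, [4096, 4096, 4096, 4096], width=2)
local_only = place(blocks, [4096, 0, 0, 0], width=2)
print(total_latency(blocks, pooled, 2), total_latency(blocks, local_only, 2))
```

In this toy setup the local-only mapping pushes the second block off-chip, while the pooled mapping places it in the idle neighbor's SPM at the cost of one network hop per access. The gap between the two totals depends entirely on the assumed cycle costs and says nothing about the paper's measured results.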
Index Terms
- SPMPool: Runtime SPM Management for Memory-Intensive Applications in Embedded Many-Cores