Skip to main content

Streamlining the OpenMP Programming Model on Ultra-Low-Power Multi-core MCUs

  • Conference paper
  • First Online:
Architecture of Computing Systems (ARCS 2021)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 12800))

Included in the following conference series:

Abstract

High-level programming models aim at exploiting hardware parallelism and reducing software development costs. However, their adoption on ultra-low-power multi-core microcontroller (MCU) platforms requires minimizing the overheads of work-sharing constructs on fine-grained parallel regions. This work tackles this challenge by proposing OMP-SPMD, a streamlined approach for parallel computing enabling the OpenMP syntax for the Single-Program Multiple-Data (SPMD) paradigm. To assess the performance improvement, we compare our solution with two alternatives: a baseline implementation of the OpenMP runtime based on the fork-join paradigm (OMP-base) and a version leveraging hardware-specific optimizations (OPM-opt). We benchmarked these libraries on a Parallel Ultra-Low Power (PULP) MCU, highlighting that hardware-specific optimizations improve OMP-base performance up to 69%. At the same time, OMP-SPMD leads to an extra improvement up to 178%.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

References

  1. Abdi, H., Williams, L.J.: Principal component analysis. Wiley Interdisciplinary Rev. Comput. Stat. 2(4), 433–459 (2010)

    Article  Google Scholar 

  2. Brigham, E.O.: The Fast Fourier Transform and its Applications. Prentice-Hall Inc., Hoboken (1988)

    Google Scholar 

  3. Chapman, B., Huang, L., Biscondi, E., Stotzer, E., Shrivastava, A., Gatherer, A.: Implementing OpenMP on a high performance embedded multicore MPSoC. In: 2009 IEEE International Symposium on Parallel and Distributed Processing, pp. 1–8. IEEE (2009)

    Google Scholar 

  4. Chen, K.C., Chen, C.H.: Enabling SIMT execution model on homogeneous multi-core system. ACM Trans. Archit. Code Optim. (TACO) 15(1), 1–26 (2018)

    Article  Google Scholar 

  5. Diaz, J., Munoz-Caro, C., Nino, A.: A survey of parallel programming models and tools in the multi and many-core era. IEEE Trans. Parallel Distrib. Syst. 23(8), 1369–1386 (2012)

    Article  Google Scholar 

  6. Gaster, B., Howes, L., Kaeli, D.R., Mistry, P., Schaa, D.: Heterogeneous computing with OpenCL. Newnes (2012)

    Google Scholar 

  7. Gautschi, M., et al.: Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices. IEEE Trans. Very Large Scale Integration (VLSI) Syst. 25(10), 2700–2713 (2017)

    Google Scholar 

  8. Glaser, F., Tagliavini, G., Rossi, D., Haugou, G., Huang, Q., Benini, L.: Energy-efficient hardware-accelerated synchronization for shared-L1-memory multiprocessor clusters. IEEE Trans. Parallel Distrib. Syst. 32(3), 633–648 (2021)

    Article  Google Scholar 

  9. GNU Foundation: libgomp runtime. https://gcc.gnu.org/onlinedocs/libgomp/

  10. LLVM Project: LLVM OpenMP runtime. https://openmp.llvm.org/Reference.pdf

  11. Mallat, S.G.: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 11(7), 674–693 (1989)

    Article  Google Scholar 

  12. Mitra, G., Stotzer, E., Jayaraj, A., Rendell, A.P.: Implementation and optimization of the OpenMP accelerator model for the TI keystone II architecture. In: DeRose, L., de Supinski, B.R., Olivier, S.L., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2014. LNCS, vol. 8766, pp. 202–214. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11454-5_15

    Chapter  Google Scholar 

  13. Montagna, F., Benatti, S., Rossi, D.: Flexible, scalable and energy efficient bio-signals processing on the pulp platform: a case study on seizure detection. J. Low Power Electron. Appl. 7(2), 16 (2017)

    Article  Google Scholar 

  14. Nickolls, J., Buck, I., Garland, M., Skadron, K.: Scalable parallel programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for? Queue 6(2), 40–53 (2008)

    Article  Google Scholar 

  15. Noble, W.S.: What is a support vector machine? Nat. Biotechnol. 24(12), 1565–1567 (2006)

    Article  Google Scholar 

  16. Pereira, M.M., Sousa, R.C.F., Araujo, G.: Compiling and optimizing OpenMP 4.X programs to OpenCL and SPIR. In: de Supinski, B.R., Olivier, S.L., Terboven, C., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2017. LNCS, vol. 10468, pp. 48–61. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-65578-9_4

    Chapter  Google Scholar 

  17. Pullini, A., Rossi, D., Loi, I., Tagliavini, G., Benini, L.: Mr. Wolf: an energy-precision scalable parallel ultra low power SoC for IoT edge processing. IEEE J. Solid-State Circuits 54(7), 1970–1981 (2019)

    Google Scholar 

  18. PULP Project: RI5CY Manual. https://www.pulp-platform.org/docs/ri5cy_user_manual.pdf

  19. PULP Project: Setup of Xilinx FPGA boards. https://github.com/pulp-platform/pulp/tree/master/fpga/pulpissimo-zcu104

  20. Quinlan, D.: ROSE: compiler support for object-oriented frameworks. Parallel Process. Lett. 10(02n03), 215–226 (2000)

    Google Scholar 

  21. Rossi, D., et al.: PULP: a parallel ultra low power platform for next generation IoT applications. In: 2015 IEEE Hot Chips 27 Symposium (HCS), pp. 1–39. IEEE (2015)

    Google Scholar 

  22. Sony Corporation: Sony Spresense multicore microcontroller. https://developer.sony.com/develop/spresense/

  23. Stratton, J.A., et al.: Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs. In: Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, pp. 111–119 (2010)

    Google Scholar 

  24. Tagliavini, G., Cesarini, D., Marongiu, A.: Unleashing fine-grained parallelism on embedded many-core accelerators with lightweight OpenMP tasking. IEEE Trans. Parallel Distrib. Syst. 29(9), 2150–2163 (2018)

    Article  Google Scholar 

  25. Taylor, B., Marco, V.S., Wang, Z.: Adaptive optimization for OpenCL programs on embedded heterogeneous systems. ACM SIGPLAN Notices 52(5), 11–20 (2017)

    Article  Google Scholar 

Download references

Acknowledgments

This work has been partially supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement numbers 732631 (OPRECOMP), 863337 (WiPLASH), and 857191 (IOTWINS).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Giuseppe Tagliavini .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Montagna, F., Tagliavini, G., Rossi, D., Garofalo, A., Benini, L. (2021). Streamlining the OpenMP Programming Model on Ultra-Low-Power Multi-core MCUs. In: Hochberger, C., Bauer, L., Pionteck, T. (eds) Architecture of Computing Systems. ARCS 2021. Lecture Notes in Computer Science(), vol 12800. Springer, Cham. https://doi.org/10.1007/978-3-030-81682-7_11

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-81682-7_11

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-81681-0

  • Online ISBN: 978-3-030-81682-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics