PULP: A Ultra-Low Power Parallel Accelerator for Energy-Efficient and Flexible Embedded Vision

Journal of Signal Processing Systems

Abstract

Novel pervasive devices such as smart surveillance cameras and autonomous micro-UAVs could greatly benefit from the availability of a computing device supporting embedded computer vision at a very low power budget. To this end, we propose PULP (Parallel processing Ultra-Low Power platform), an architecture built on clusters of tightly-coupled OpenRISC ISA cores, with advanced techniques for fast performance and energy scalability that exploit the capabilities of the STMicroelectronics UTBB FD-SOI 28 nm technology. We show that PULP performance can be scaled over a 1x-354x range, with a peak theoretical energy efficiency of 211 GOPS/W. We present performance results for several demanding kernels from the image processing and vision domain, with post-layout power modeling: a motion detection application that can run at an efficiency of up to 192 GOPS/W (90% of the theoretical peak); a ConvNet-based detector for smart surveillance that can be switched between 0.7 and 27 fps operating modes, scaling energy consumption per frame between 1.2 and 12 mJ on a 320 × 240 image; and FAST + Lucas-Kanade optical flow on a 128 × 128 image at the ultra-low energy budget of 14 μJ per frame at 60 fps.
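
The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only illustrative: the function and constant names are hypothetical, and the average-power estimate assumes that the quoted per-frame energy is the only energy spent during each frame period, which the abstract does not state explicitly.

```python
# Figures quoted in the abstract; all names here are illustrative, not from the paper.
PEAK_GOPS_PER_W = 211.0          # peak theoretical energy efficiency
MOTION_GOPS_PER_W = 192.0        # motion detection efficiency
FLOW_ENERGY_PER_FRAME_J = 14e-6  # FAST + Lucas-Kanade, 14 uJ per frame
FLOW_FPS = 60.0                  # frame rate for the optical flow figure


def fraction_of_peak(achieved_gops_per_w, peak_gops_per_w):
    """Ratio of achieved to peak energy efficiency."""
    return achieved_gops_per_w / peak_gops_per_w


def average_power_w(energy_per_frame_j, frames_per_second):
    """Average power, ASSUMING the per-frame energy is all that is spent."""
    return energy_per_frame_j * frames_per_second


def energy_per_op_pj(gops_per_w):
    """Energy per operation in picojoules (1 GOPS/W is 1e9 operations per joule)."""
    return 1e12 / (gops_per_w * 1e9)


if __name__ == "__main__":
    ratio = fraction_of_peak(MOTION_GOPS_PER_W, PEAK_GOPS_PER_W)
    power = average_power_w(FLOW_ENERGY_PER_FRAME_J, FLOW_FPS)
    print(f"motion detection reaches {ratio:.1%} of the peak efficiency")
    print(f"optical flow average power: {power * 1e6:.0f} uW")
    print(f"energy per operation at peak: {energy_per_op_pj(PEAK_GOPS_PER_W):.2f} pJ")
```

Under that assumption, 14 μJ per frame at 60 fps corresponds to an average power well below one milliwatt, and 192 GOPS/W is roughly nine tenths of the 211 GOPS/W peak, consistent with the figure quoted in the abstract.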


Notes

  1. We used the or1k-elf-gcc compiler (build 4.9.0 20140308), with the following flags: -O2 -nostdlib -mhard-mul -msoft-div.

  2. We will refer to this network as the small network from this point on.

  3. As our CNNs work on grayscale images, the training and test samples were converted from RGB to grayscale. Training consisted of 500 epochs of mini-batch stochastic gradient descent with momentum \(\mu = 0.9\) and starting learning rate \(\lambda_{0} = 0.01\) (dropping exponentially as \(\lambda = \lambda_{0} \cdot 0.995^{n_{\text{epoch}}}\)), with 20% dropout layers [39] for better regularization.
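
The training schedule in note 3 can be sketched as follows. This is a minimal illustration, not the authors' training code: the function names are hypothetical, the epoch index is assumed to start at 0, and the momentum update uses the classical "velocity" form, which is one common variant that the note does not specify.

```python
def learning_rate(epoch, lr0=0.01, decay=0.995):
    """Exponentially decayed learning rate: lr0 * decay ** epoch.

    Assumes epochs are counted from 0, so epoch 0 uses lr0 itself.
    """
    return lr0 * decay ** epoch


def momentum_step(weights, velocity, grads, lr, momentum=0.9):
    """One SGD-with-momentum update on plain Python lists.

    Classical form (an assumption): v <- momentum * v - lr * g, then w <- w + v.
    Returns the new weights and the new velocity.
    """
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


if __name__ == "__main__":
    # Learning rate at a few points of the 500-epoch schedule.
    for epoch in (0, 100, 499):
        print(epoch, learning_rate(epoch))

    # A single update on a toy one-parameter problem.
    w, v = momentum_step([1.0], [0.0], [2.0], lr=learning_rate(0))
    print(w, v)
```

With these values the learning rate falls by a bit more than an order of magnitude over the 500 epochs, since 0.995 raised to the 499th power is roughly 0.08.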

References

  1. Ambiq Apollo website. http://ambiqmicro.com/low-power-microcontroller.

  2. Analog Devices Blackfin Dual Core Processor. http://www.analog.com/en/processors-dsp/blackfin/adsp-bf608/products/product.html.

  3. CentEye Stonyman. Hawksbill Silicon Documentation.

  4. CrazyFlie Nano-Quadcopter website. http://www.bitcraze.se/crazyflie/.

  5. NVidia Tegra K1 website. http://www.nvidia.com/object/tegra-k1-processor.html.

  6. NXP LPC54100 website. http://www.nxp.com/products/microcontrollers/cortex_m4/lpc54100/.

  7. Qualcomm Snapdragon 810 website. https://www.qualcomm.com/products/snapdragon/processors/810.

  8. SiliconLabs EFM32 Microcontroller. http://www.silabs.com/products/mcu/lowpower/pages/efm32g-gecko.aspx.

  9. STM32F401xB, STM32F401xC. Data sheet.

  10. Texas Instruments MSP430 Low-Power MCUs. http://www.ti.com/lsds/ti/microcontrollers_16-bit_32-bit/msp/overview.page.

  11. OpenRISC 1000 Architecture Manual (2012).

  12. Benini, L., Flamand, E., Fuin, D., & Melpignano, D. (2012). P2012: Building an ecosystem for a scalable, modular and high-efficiency embedded computing accelerator. In 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 983–987. IEEE. doi:10.1109/DATE.2012.6176639.

  13. Bol, D., De Vos, J., Hocquet, C., Botman, F., Durvaux, F., Boyd, S., Flandre, D., & Legat, J.D. (2013). SleepWalker: A 25-MHz 0.4-V sub-mm² 7-μW/MHz microcontroller in 65-nm LP/GP CMOS for low-carbon wireless sensor nodes. IEEE Journal of Solid-State Circuits, 48(1), 20–32. doi:10.1109/JSSC.2012.2218067. http://ieeexplore.ieee.org/articleDetails.jsp?arnumber=6332542.

  14. Botman, F., Vos, J.D., Bernard, S., Stas, F., Legat, J.D., & Bol, D. (2014). Bellevue: a 50 MHz variable-width SIMD 32-bit microcontroller at 0.37 V for processing-intensive wireless sensor nodes. In Proceedings of 2014 IEEE Symposium on Circuits and Systems, pp. 1207–1210.

  15. Brockmeyer, E., Nachtergaele, L., Catthoor, F., Bormans, J., & de Man, H. (1999). Low power memory storage and transfer organization for the MPEG-4 full pel motion estimation on a multimedia processor. IEEE Transactions on Multimedia, 1(2), 202–216. doi:10.1109/6046.766740.

  16. Carey, S.J., Lopich, A., Barr, D.R.W., Wang, B., & Dudek, P. (2013). A 100,000 fps vision sensor with embedded 535 GOPS/W 256×256 SIMD processor array. pp. C182–C183. IEEE, Kyoto.

  17. Chen, P., Hong, K., Naikal, N., Sastry, S.S., Tygar, D., Yan, P., Yang, A.Y., Chang, L.C., Lin, L., Wang, S., Lobatón, E., Oh, S., & Ahammad, P. (2013). A low-bandwidth camera sensor platform with applications in smart camera networks. ACM Transactions on Sensor Networks, 9(2), 21:1–21:23. doi:10.1145/2422966.2422978.

  18. Chua, L., Gulak, G., Pierzchala, E., & Rodríguez-Vázquez, A. (2013). Cellular neural networks and analog VLSI. Springer Science & Business Media.

  19. Clemons, J., Zhu, H., Savarese, S., & Austin, T. (2011). MEVBench: A mobile computer vision benchmarking suite. In 2011 IEEE International Symposium on Workload Characterization (IISWC), pp. 91–102. IEEE. doi:10.1109/IISWC.2011.6114206.

  20. Codrescu, L., Anderson, W., Venkumahanti, S., Zeng, M., Plondke, E., Koob, C., Ingle, A., Tabony, C., & Maule, R. (2014). Hexagon DSP: an architecture optimized for mobile multimedia and communications. IEEE Micro, 34(2), 34–43. doi:10.1109/MM.2014.12.

  21. Conti, F., Pilkington, C., Marongiu, A., & Benini, L. (2014). He-P2012: architectural heterogeneity exploration on a scalable many-core platform. In Proceedings of 25th IEEE Conference on Application-Specific Architectures and Processors.

  22. Conti, F., Pullini, A., & Benini, L. (2014). Brain-inspired Classroom Occupancy Monitoring on a Low-Power Mobile Platform.

  23. Deng, L., Li, J., Huang, J.T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G., He, X., Williams, J., Gong, Y., & Acero, A. (2013). Recent advances in deep learning for speech research at Microsoft. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8604–8608. doi:10.1109/ICASSP.2013.6639345.

  24. de Dinechin, B.D., Ayrignac, R., Beaucamps, P.E., Couvert, P., Ganne, B., de Massas, P.G., Jacquet, F., Jones, S., Chaisemartin, N.M., Riss, F., & Strudel, T. (2013). A clustered manycore processor architecture for embedded and accelerated applications. In 2013 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6. IEEE. doi:10.1109/HPEC.2013.6670342.

  25. Dogan, A. Y., Atienza, D., Burg, A., Loi, I., & Benini, L. (2011). Power/performance exploration of single-core and multi-core processor approaches for biomedical signal processing. In Integrated Circuit and System Design. Power and Timing Modeling, Optimization, and Simulation, pp. 102–111. Springer.

  26. Farabet, C., Martini, B., Corda, B., Akselrod, P., Culurciello, E., & LeCun, Y. (2011). NeuFlow: A runtime reconfigurable dataflow processor for vision. In CVPR 2011 Workshops, pp. 109–116. IEEE. doi:10.1109/CVPRW.2011.5981829.

  27. Fick, D., Dreslinski, R.G., Giridhar, B., Kim, G., Seo, S., Fojtik, M., Satpathy, S., Lee, Y., Kim, D., Liu, N., Wieckowski, M., Chen, G., Mudge, T., Blaauw, D., & Sylvester, D. (2013). Centip3De: A cluster-based NTC architecture with 64 ARM Cortex-M3 cores in 3D stacked 130 nm CMOS. IEEE Journal of Solid-State Circuits, 48(1), 104–117. doi:10.1109/JSSC.2012.2222814. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6399548.

  28. Gautschi, M., Rossi, D., & Benini, L. (2014). Customizing an open source processor to fit in an ultra-low power cluster with a shared L1 memory. In Proceedings of the 24th edition of the Great Lakes Symposium on VLSI - GLSVLSI ’14, pp. 87–88. ACM Press, New York, New York, USA. doi:10.1145/2591513.2591569.

  29. Gokhale, V., Jin, J., Dundar, A., Martini, B., & Culurciello, E. (2014). A 240 G-ops/s mobile coprocessor for deep neural networks. In CVPR Workshops.

  30. Heng, L., Honegger, D., Lee, G.H., Meier, L., Tanskanen, P., Fraundorfer, F., & Pollefeys, M. (2014). Autonomous visual mapping and exploration with a micro aerial vehicle. Journal of Field Robotics, 31(4), 654–675. doi:10.1002/rob.21520.

  31. Hsieh, C.H., & Lin, T.P. (1992). VLSI architecture for block-matching motion estimation algorithm. IEEE Transactions on Circuits and Systems for Video Technology, 2(2), 169–175. doi:10.1109/76.143416.

  32. Ickes, N., Sinangil, Y., Pappalardo, F., Guidetti, E., & Chandrakasan, A.P. (2011). A 10 pJ/cycle ultra-low-voltage 32-bit microprocessor system-on-chip. In 2011 Proceedings of the ESSCIRC (ESSCIRC), pp. 159–162. IEEE. doi:10.1109/ESSCIRC.2011.6044889.

  33. Jacquet, D., Hasbani, F., Flatresse, P., Wilson, R., Arnaud, F., Cesana, G., Di Gilio, T., Lecocq, C., Roy, T., Chhabra, A., Grover, C., Minez, O., Uginet, J., Durieu, G., Adobati, C., Casalotto, D., Nyer, F., Menut, P., Cathelin, A., Vongsavady, I., & Magarshack, P. (2014). A 3 GHz dual core processor ARM Cortex™-A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization. IEEE Journal of Solid-State Circuits, 49(4), 812–826. doi:10.1109/JSSC.2013.2295977.

  34. Jeon, D., Kim, Y., Lee, I., Zhang, Z., Blaauw, D., & Sylvester, D. (2013). A 470 mV 2.7 mW feature extraction-accelerator for micro-autonomous vehicle navigation in 28 nm CMOS, pp. 166–168.

  35. Karpathy, A., & Leung, T. (2014). Large-scale video classification with convolutional neural networks.

  36. Kim, G., Kim, Y., Lee, K., & Park, S. (2014). A 1.22 TOPS and 1.52 mW/MHz augmented reality multi-core processor with neural network NoC for HMD applications. In Proceedings of 2014 IEEE International Solid-State Circuits Conference, pp. 182–184.

  37. Knag, P., Kim, J.K., Chen, T., & Zhang, Z. (2015). A sparse coding neural network ASIC with on-chip learning for feature extraction and encoding, Vol. 50.

  38. Krizhevsky, A. (2009). Learning Multiple Layers of Features from Tiny Images.

  39. Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. In Pereira, F., Burges, C.J.C., Bottou, L., & Weinberger, K.Q. (Eds.), Advances in Neural Information Processing Systems 25 (pp. 1097–1105). Curran Associates, Inc.

  40. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. doi:10.1109/5.726791.

  41. Lin, Z., Sankaran, J., & Flanagan, T. (2013). Empowering automotive vision with TI's Vision AccelerationPac. TI White Paper.

  42. Lucas, B.D., & Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’81, pp. 674–679. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. http://dl.acm.org/citation.cfm?id=1623264.1623280.

  43. Ma, K.Y., Chirarattananon, P., Fuller, S.B., & Wood, R.J. (2013). Controlled flight of a biologically inspired, insect-scale robot. Science, 340(6132), 603–607. doi:10.1126/science.1231806. PMID: 23641114.

  44. Mahesri, A., Johnson, D., Crago, N., & Patel, S.J. (2008). Tradeoffs in designing accelerator architectures for visual computing. In 2008 41st IEEE/ACM International Symposium on Microarchitecture, pp. 164–175. IEEE. doi:10.1109/MICRO.2008.4771788.

  45. Marongiu, A., Capotondi, A., Tagliavini, G., & Benini, L. (2013). Improving the programmability of STHORM-based heterogeneous systems with offload-enabled OpenMP. In Proceedings of the First International Workshop on Many-core Embedded Systems - MES ’13, pp. 1–8. ACM Press, New York, New York, USA. doi:10.1145/2489068.2489069.

  46. Meinerzhagen, P., Sherazi, S.M.Y., Burg, A., & Rodrigues, J.N. (2011). Benchmarking of standard-cell based memories in the sub-VT domain in 65-nm CMOS technology. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 1(2), 173–182. doi:10.1109/JETCAS.2011.2162159. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5976987.

  47. Melpignano, D., Benini, L., Flamand, E., Jego, B., Lepley, T., Haugou, G., Clermidy, F., & Dutoit, D. (2012). Platform 2012, a many-core computing accelerator for embedded SoCs.

  48. Merolla, P.A., Arthur, J.V., Alvarez-Icaza, R., Cassidy, A.S., Sawada, J., Akopyan, F., Jackson, B.L., Imam, N., Guo, C., Nakamura, Y., Brezzo, B., Vo, I., Esser, S.K., Appuswamy, R., Taba, B., Amir, A., Flickner, M.D., Risk, W.P., Manohar, R., & Modha, D.S. (2014). A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197), 668–673. doi:10.1126/science.1254642.

  49. Miro-Panades, I., Beigné, E., Thonnart, Y., Alacoque, L., Vivet, P., Lesecq, S., Puschini, D., Molnos, A., Thabet, F., Tain, B., Chehida, K.B., Engels, S., Wilson, R., & Fuin, D. (2014). A fine-grain variation-aware dynamic Vdd-hopping AVFS architecture on a 32 nm GALS MPSoC. IEEE Journal of Solid-State Circuits, 49(7), 1475–1486. doi:10.1109/JSSC.2014.2317137. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6807524.

  50. Moloney, D. (2011). 1TOPS/W software programmable media processor, HotChips HC23. Stanford.

  51. Oh, J., Lee, S., & Yoo, H.J. (2013). 1.2-mW Online learning mixed-mode intelligent inference engine for low-power real-time object recognition processor. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 21(5), 921–933. doi:10.1109/TVLSI.2012.2198249. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6210400.

  52. Park, S., Maashri, A.A., Irick, K.M., Chandrashekhar, A., Cotter, M., Chandramoorthy, N., Debole, M., & Narayanan, V. (2012). System-on-chip for biologically inspired vision applications. IPSJ Transactions on System LSI Design Methodology, 5, 71–95. doi:10.2197/ipsjtsldm.5.71. http://www.cse.psu.edu/nic5090/HMAP/sldm.pdf.

  53. Patterson, D. (2009). The top 10 innovations in the new NVIDIA Fermi architecture, and the top 3 next challenges. NVIDIA Whitepaper.

  54. Pennebaker, W.B., Mitchell, J.L., Fogg, C., & LeGall, D. (1997). MPEG Digital Video Compression Standard. Chapman & Hall.

  55. Rahimi, A., Loi, I., Kakoee, M.R., & Benini, L. (2011). A fully-synthesizable single-cycle interconnection network for Shared-L1 processor clusters. In 2011 Design, Automation & Test in Europe, pp. 1–6. doi:10.1109/DATE.2011.5763085.

  56. Rossi, D., Loi, I., Haugou, G., & Benini, L. (2014). Ultra-low-latency lightweight DMA for tightly coupled multi-core clusters. In Proceedings of the 11th ACM Conference on Computing Frontiers - CF ’14, pp. 1–10. ACM Press, New York, New York, USA. doi:10.1145/2597917.2597922.

  57. Rossi, D., Mucci, C., Campi, F., Spolzino, S., Vanzolini, L., Sahlbach, H., Whitty, S., Ernst, R., Putzke-Roming, W., & Guerrieri, R. (2013). Application space exploration of a heterogeneous run-time configurable digital signal processor. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 21(2), 193–205. doi:10.1109/TVLSI.2012.2185963.

  58. Rosten, E., & Drummond, T. (2005). Fusing points and lines for high performance tracking. In Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1 pp. 1508–1515 Vol. 2. doi:10.1109/ICCV.2005.104. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1544896.

  59. Rosten, E., & Drummond, T. (2006). Machine learning for high-speed corner detection. In Computer Vision – ECCV 2006, pp. 1–14. doi:10.1007/11744023_34.

  60. Sabarad, J., Kestur, S., Dantara, D., Narayanan, V., & Khosla, D. (2012). A reconfigurable accelerator for neuromorphic object recognition. In 17th Asia and South Pacific Design Automation Conference, pp. 813–818. IEEE. doi:10.1109/ASPDAC.2012.6165067.

  61. Scaramuzza, D., Achtelik, M., Doitsidis, L., Friedrich, F., Kosmatopoulos, E., Martinelli, A., Achtelik, M., Chli, M., Chatzichristofis, S., Kneip, L., Gurdan, D., Heng, L., Lee, G.H., Lynen, S., Pollefeys, M., Renzaglia, A., Siegwart, R., Stumpf, J., Tanskanen, P., Troiani, C., Weiss, S., & Meier, L. (2014). Vision-controlled micro flying robots: from system design to autonomous navigation and mapping in GPS-denied environments. IEEE Robotics & Automation Magazine, 21(3), 26–40. doi:10.1109/MRA.2014.2322295.

  62. Seo, S., Dreslinski, R.G., Woh, M., Chakrabarti, C., Mahlke, S., & Mudge, T. (2010). Diet SODA: A Power-Efficient Processor for Digital Cameras. In Proceedings of the 16th ACM/IEEE international symposium on Low power electronics and design - ISLPED ’10, p. 79. ACM Press, New York, New York, USA. doi:10.1145/1840845.1840862.

  63. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y. (2013). OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229 [cs].

  64. Toshev, A., & Szegedy, C. (2014). DeepPose: human pose estimation via deep neural networks. In Proceedings of 2014 IEEE conference on computer vision and pattern recognition. doi:10.1109/CVPR.2014.214.

  65. Wood, R.J., Finio, B., Karpelson, M., Ma, K., Pérez-Arancibia, N.O., Sreetharan, P.S., Tanaka, H., & Whitney, J.P. (2012). Progress on ‘pico’ air vehicles. The International Journal of Robotics Research, 31(11), 1292–1302. doi:10.1177/0278364912455073.

  66. Yoon, J.S., Kim, J.H., Kim, H.E., Lee, W.Y., Kim, S.H., Chung, K., Park, J.S., & Kim, L.S. (2013). A unified graphics and vision processor with a 0.89 μW/fps pose estimation engine for augmented reality. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 21(2), 206–216. doi:10.1109/TVLSI.2012.2186157. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6156811.

  67. Zhang, N., Paluri, M., Ranzato, M.A., Darrell, T., & Bourdev, L. (2014). PANDA: pose aligned networks for deep attribute modeling. In Proceedings of 2014 IEEE conference on computer vision and pattern recognition, vol. 2. doi:10.1109/CVPR.2014.212.

Author information

Corresponding author

Correspondence to Francesco Conti.

Additional information

This work was funded by the project IcySoC, evaluated by the Swiss NSF and funded by Nano-Tera.ch with Swiss Confederation financing, and by the EU FP7 project PHIDIAS (g.a. 318013). We also thank STMicroelectronics for granting access to the FDSOI 28nm technology libraries.

About this article

Cite this article

Conti, F., Rossi, D., Pullini, A. et al. PULP: A Ultra-Low Power Parallel Accelerator for Energy-Efficient and Flexible Embedded Vision. J Sign Process Syst 84, 339–354 (2016). https://doi.org/10.1007/s11265-015-1070-9

  • DOI: https://doi.org/10.1007/s11265-015-1070-9
