Abstract
Although inference with artificial neural networks (ANNs) was until quite recently considered essentially compute-intensive, the emergence of deep neural networks, coupled with the evolution of integration technology, has turned inference into a memory-bound problem. Given this observation, many recent works have focused on minimizing memory accesses, either by enforcing and exploiting sparsity in the weights or by representing activations and weights with few bits, so that ANN inference becomes usable in embedded devices. In this work, we detail an architecture dedicated to inference with ternary {−1, 0, 1} weights and activations. The architecture is configurable at design time, offering a range of throughput vs. power trade-offs to choose from. It is also generic, in the sense that it uses information drawn from the target technology (memory geometries and costs, number of available cuts, etc.) to best adapt to the FPGA resources. This allows it to achieve up to 5.2k frames per second per Watt for classification on a VC709 board while using approximately half of the FPGA's resources.
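The abstract's central point, that ternary {−1, 0, 1} weights and activations reduce every multiplication to a sign flip or a skip, can be illustrated with a minimal software sketch. This is a hypothetical C model for illustration only, not the paper's FPGA datapath; the two-threshold re-ternarization step and the threshold value are assumptions, not details taken from the paper.

```c
#include <stdio.h>

/* Hypothetical model (not the paper's RTL): with ternary weights and
 * activations in {-1, 0, 1}, a "multiplication" is a sign flip or a skip,
 * so a neuron reduces to an add/subtract accumulation. */
static int ternary_dot(const signed char *w, const signed char *a, int n)
{
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        if (w[i] == 0 || a[i] == 0)      /* zero operand: nothing to add */
            continue;
        acc += (w[i] == a[i]) ? 1 : -1;  /* matching signs -> +1, mixed -> -1 */
    }
    return acc;
}

/* The sum is then mapped back to {-1, 0, 1}; a two-threshold activation is
 * a common choice in ternary networks (threshold value assumed here). */
static signed char ternarize(int x, int th)
{
    return (x > th) ? 1 : (x < -th) ? -1 : 0;
}

int main(void)
{
    const signed char w[] = { 1, -1, 0, 1, -1, 1 };
    const signed char a[] = { 1,  1, -1, 0, -1, 1 };
    int s = ternary_dot(w, a, 6);
    printf("sum = %d, output = %d\n", s, ternarize(s, 1)); /* sum = 2, output = 1 */
    return 0;
}
```

In hardware terms, the same observation lets the multiplier array of a classical multiply-accumulate datapath collapse into an adder tree with conditional negation, which is the setting such an architecture targets.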