Abstract
The computational complexity of neural networks for large-scale or real-time applications necessitates hardware acceleration. Most approaches assume that the network architecture and parameters are unknown at design time, so that a single accelerator can serve many applications. This article demonstrates, for the case where the neural network architecture and ternary weight values are known a priori, that extremely high-throughput implementations of neural network inference can be achieved by customizing the datapath and routing to remove unnecessary computations and data movement. This approach is ideally suited to FPGAs: specializing the implementation to a trained network improves efficiency, while the reconfigurability of the FPGA retains generality. A VGG-style network with ternary weights and fixed-point activations is implemented for the CIFAR-10 dataset on Amazon's AWS F1 instance. This article demonstrates how to remove 90% of the operations in convolutional layers by exploiting sparsity and compile-time optimizations. The hardware implementation achieves 90.9 ± 0.1% accuracy and 122k frames per second, with a latency of only 29 µs, which is the fastest CNN inference implementation reported so far on an FPGA.
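The key idea above can be illustrated with a small sketch. With ternary weights in {-1, 0, +1}, a dot product requires no multiplications at all: each +1 weight becomes an addition, each -1 a subtraction, and each 0 weight is removed entirely, which in a fully unrolled hardware design means no logic is generated for it. The function and sparsity level below are hypothetical stand-ins for the paper's convolutional layers, assuming roughly 90% of weights are zero:

```python
import numpy as np

rng = np.random.default_rng(0)

def ternary_matvec(W, x):
    """Matrix-vector product with ternary W using only add/sub.

    Returns the result and the number of add/sub operations actually
    performed; zero weights are skipped, mimicking an unrolled datapath
    where they consume no hardware.
    """
    y = np.zeros(W.shape[0], dtype=x.dtype)
    ops = 0
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            if W[i, j] == 1:
                y[i] += x[j]
                ops += 1
            elif W[i, j] == -1:
                y[i] -= x[j]
                ops += 1
            # W[i, j] == 0: known at compile time, nothing is generated
    return y, ops

# A weight matrix that is ~90% zero, matching the sparsity in the paper.
W = rng.choice([-1, 0, 1], size=(64, 128), p=[0.05, 0.90, 0.05])
x = rng.standard_normal(128).astype(np.float32)

y, ops = ternary_matvec(W, x)
dense_ops = 2 * W.size  # one multiply + one add per weight in a dense MAC
print(f"add/sub ops: {ops} vs dense MAC ops: {dense_ops} "
      f"({1 - ops / dense_ops:.0%} removed)")
```

Because the weights are fixed before synthesis, this pruning happens at compile time rather than at runtime, which is what distinguishes the unrolled approach from accelerators that must handle arbitrary weights.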
Index Terms
- Unrolling Ternary Neural Networks