
Unrolling Ternary Neural Networks

Published: 18 October 2019

Abstract

The computational complexity of neural networks for large-scale or real-time applications necessitates hardware acceleration. Most approaches assume that the network architecture and parameters are unknown at design time, permitting usage in a large number of applications. This article demonstrates, for the case where the neural network architecture and ternary weight values are known a priori, that extremely high throughput implementations of neural network inference can be made by customising the datapath and routing to remove unnecessary computations and data movement. This approach is ideally suited to FPGA implementations as a specialized implementation of a trained network improves efficiency while still retaining generality with the reconfigurability of an FPGA. A VGG-style network with ternary weights and fixed-point activations is implemented for the CIFAR10 dataset on Amazon’s AWS F1 instance. This article demonstrates how to remove 90% of the operations in convolutional layers by exploiting sparsity and compile-time optimizations. The implementation in hardware achieves 90.9 ± 0.1% accuracy and 122k frames per second, with a latency of only 29 µs, which is the fastest CNN inference implementation reported so far on an FPGA.
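To make the core idea concrete, below is a minimal software sketch (illustrative only; the paper builds specialized FPGA datapaths, and the function and variable names here are hypothetical) of how fixing ternary weights {-1, 0, +1} at design time lets each dot product be unrolled into a multiplier-free add/subtract structure from which every zero-weight term is simply omitted. This is the mechanism by which roughly 90% of the convolutional operations can be removed when the weights are about 90% sparse.

```python
# Minimal sketch of "unrolling" a ternary dot product at compile time.
# Illustrative only: the paper generates hardware, not Python, and the
# names below are hypothetical.
import numpy as np

def unroll_ternary_dot(weights):
    """Specialize a dot product for a fixed ternary weight vector.

    Multiplications vanish: +1 weights become additions, -1 weights
    become subtractions, and 0 weights are dropped entirely, so a
    90%-sparse kernel retains only ~10% of its operations.
    """
    plus = [i for i, w in enumerate(weights) if w == +1]
    minus = [i for i, w in enumerate(weights) if w == -1]

    def dot(x):
        # Only the nonzero taps are ever touched.
        return sum(x[i] for i in plus) - sum(x[i] for i in minus)

    return dot

# A 3x3 kernel flattened to 9 taps, with only 2 nonzero weights.
w = np.array([0, 0, +1, 0, 0, 0, -1, 0, 0])
dot = unroll_ternary_dot(w)
x = np.arange(9)
assert dot(x) == np.dot(w, x)  # only x[2] - x[6] is computed
```

In the hardware setting described in the abstract, the analogous specialization instantiates wires and adders only for the nonzero weights, so the unnecessary computation and data movement never exist in the circuit at all.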



Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 12, Issue 4 (December 2019), 163 pages
ISSN: 1936-7406; EISSN: 1936-7414
DOI: 10.1145/3361265
Editor: Deming Chen

              Copyright © 2019 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 December 2018
• Revised: 1 August 2019
• Accepted: 1 August 2019
• Published: 18 October 2019

Published in TRETS Volume 12, Issue 4
