Abstract
Deep neural networks have evolved remarkably over the past few years, and they are now fundamental tools in many intelligent systems. At the same time, the computational complexity and resource consumption of these networks continue to increase, which poses significant challenges for deploying them in real-time applications or on resource-limited devices. Network acceleration has therefore become a hot topic within the deep learning community. On the hardware side, a number of accelerators based on field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) have been proposed in recent years. In this paper, we provide a comprehensive survey of recent advances in network acceleration, compression, and accelerator design from both the algorithm and hardware points of view. Specifically, we provide a thorough analysis of each of the following topics: network pruning, low-rank approximation, network quantization, teacher–student networks, compact network design, and hardware accelerators. Finally, we introduce and discuss a few possible future directions.
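The survey names pruning and quantization among the techniques it analyzes. As a rough illustration only (not code from the paper), the Python sketch below shows two of the simplest variants, magnitude-based unstructured pruning and symmetric uniform weight quantization, applied to a random weight matrix; the function names, the 0.7 sparsity target, and the 8-bit width are illustrative assumptions rather than settings from any surveyed method.

import numpy as np

# Illustrative sketch (not from the paper): two basic compression steps
# that the survey discusses in far more refined forms.

def prune_by_magnitude(weights, sparsity=0.5):
    # Zero out the smallest-magnitude fraction of weights (unstructured pruning).
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

def quantize_uniform(weights, num_bits=8):
    # Map float weights onto a symmetric uniform grid with 2**num_bits levels,
    # then return the dequantized approximation used at inference time.
    scale = np.max(np.abs(weights)) / (2 ** (num_bits - 1) - 1)
    q = np.round(weights / scale)
    return q * scale

if __name__ == "__main__":
    w = np.random.randn(64, 64).astype(np.float32)
    w_pruned, mask = prune_by_magnitude(w, sparsity=0.7)
    w_quant = quantize_uniform(w, num_bits=8)
    print("achieved sparsity:", 1.0 - mask.mean())
    print("8-bit quantization MSE:", np.mean((w - w_quant) ** 2))

In practice both steps are followed by fine-tuning to recover accuracy, which is the common pattern across the pruning and quantization methods the survey reviews.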
Cite this article
Cheng, J., Wang, Ps., Li, G. et al. Recent advances in efficient computation of deep convolutional neural networks. Frontiers Inf Technol Electronic Eng 19, 64–77 (2018). https://doi.org/10.1631/FITEE.1700789