
Differentiable Joint Pruning and Quantization for Hardware Efficiency

  • Conference paper
Computer Vision – ECCV 2020 (ECCV 2020)

Abstract

We present a differentiable joint pruning and quantization (DJPQ) scheme. We frame neural network compression as a joint gradient-based optimization problem, trading off between model pruning and quantization automatically for hardware efficiency. DJPQ incorporates variational information bottleneck based structured pruning and mixed-bit precision quantization into a single differentiable loss function. In contrast to previous works which consider pruning and quantization separately, our method enables users to find the optimal trade-off between both in a single training procedure. To utilize the method for more efficient hardware inference, we extend DJPQ to integrate structured pruning with power-of-two bit-restricted quantization. We show that DJPQ significantly reduces the number of Bit-Operations (BOPs) for several networks while maintaining the top-1 accuracy of original floating-point models (e.g., 53\(\times \) BOPs reduction in ResNet18 on ImageNet, 43\(\times \) in MobileNetV2). Compared to the conventional two-stage approach, which optimizes pruning and quantization independently, our scheme outperforms it in terms of both accuracy and BOPs. Even when considering bit-restricted quantization, DJPQ achieves larger compression ratios and better accuracy than the two-stage approach.
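As a back-of-the-envelope illustration of the BOP metric (not the paper's exact accounting), the bit-operations of a convolutional layer scale as MACs × weight bit-width × activation bit-width, so channel pruning and bit-width reduction compound multiplicatively:

```python
def conv_bops(c_in, c_out, k, h_out, w_out, bw=32, ba=32):
    """Estimate Bit-Operations of a conv layer: MACs x weight bits x activation bits."""
    macs = c_in * c_out * k * k * h_out * w_out
    return macs * bw * ba

# Floating-point baseline vs. a layer pruned to half the channels and quantized to 4 bits.
base = conv_bops(64, 64, 3, 56, 56)                     # 32-bit weights and activations
compressed = conv_bops(32, 32, 3, 56, 56, bw=4, ba=4)   # pruned + quantized
print(base / compressed)  # 256.0: 4x from pruning times 64x from quantization
```

This toy layer shows why jointly tuning pruning ratios and bit-widths can yield the large overall reductions (53×, 43×) reported in the abstract.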

Y. Lu—Work done during internship at Qualcomm AI Research.

Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.


References

  1. Baskin, C., et al.: UNIQ: uniform noise injection for the quantization of neural networks (2018). CoRR abs/1804.10969. http://arxiv.org/abs/1804.10969

  2. Bengio, Y., Léonard, N., Courville, A.C.: Estimating or propagating gradients through stochastic neurons for conditional computation (2013). CoRR abs/1308.3432. http://arxiv.org/abs/1308.3432

  3. Bethge, J., Bartz, C., Yang, H., Chen, Y., Meinel, C.: MeliusNet: can binary neural networks achieve mobilenet-level accuracy? (2020). arXiv:2001.05936

  4. Dai, B., Zhu, C., Guo, B., Wipf, D.: Compressing neural networks using the variational information bottleneck. In: International Conference on Machine Learning, pp. 1135–1144 (2018)


  5. Dong, Z., et al.: HAWQ-V2: hessian aware trace-weighted quantization of neural networks (2019). arXiv:1911.03852

  6. Dong, Z., Yao, Z., Gholami, A., Mahoney, M.W., Keutzer, K.: HAWQ: hessian aware quantization of neural networks with mixed-precision. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 293–302 (2019)


  7. Esser, S.K., McKinstry, J.L., Bablani, D., Appuswamy, R., Modha, D.S.: Learned step size quantization. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=rkgO66VKDS

  8. Gysel, P., Pimentel, J., Motamedi, M., Ghiasi, S.: Ristretto: A framework for empirical study of resource-efficient inference in convolutional neural networks. IEEE Trans. Neural Netw. Learn. Syst. 29(11), 5784–5789 (2018)


  9. Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural network with pruning, trained quantization and Huffman coding. In: 4th International Conference on Learning Representations (2016)


  10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)


  11. He, Y., Kang, G., Dong, X., Fu, Y., Yang, Y.: Soft filter pruning for accelerating deep convolutional neural networks (2018). arXiv:1808.06866

  12. He, Y., Lin, J., Liu, Z., Wang, H., Li, L.J., Han, S.: AMC: AutoML for model compression and acceleration on mobile devices. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800 (2018)


  13. He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397 (2017)


  14. Ignatov, A., et al.: AI benchmark: all about deep learning on smartphones in 2019 (2019). arXiv:1910.06663

  15. Krishnamoorthi, R.: Quantizing deep convolutional networks for efficient inference: a whitepaper (2018). arXiv:1806.08342

  16. Kuzmin, A., Nagel, M., Pitre, S., Pendyam, S., Blankevoort, T., Welling, M.: Taxonomy and evaluation of structured compression of convolutional neural networks (2019). arXiv:1912.09802

  17. Li, F., Zhang, B., Liu, B.: Ternary weight networks (2016). arXiv:1605.04711

  18. Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient ConvNets. In: International Conference on Learning Representations (2017). https://openreview.net/pdf?id=rJqFGTslg

  19. Louizos, C., Reisser, M., Blankevoort, T., Gavves, E., Welling, M.: Relaxed quantization for discretized neural networks. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=HkxjYoCqKX

  20. Louizos, C., Ullrich, K., Welling, M.: Bayesian compression for deep learning. In: Advances in Neural Information Processing Systems, pp. 3288–3298 (2017)


  21. Louizos, C., Welling, M., Kingma, D.P.: Learning sparse neural networks through \(L_0\) regularization. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=H1Y8hhg0b

  22. Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural networks for resource efficient inference. In: International Conference on Learning Representations (2017). https://openreview.net/pdf?id=SJGCiw5gl

  23. Nagel, M., Baalen, M.v., Blankevoort, T., Welling, M.: Data-free quantization through weight equalization and bias correction. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1325–1334 (2019)


  24. Peng, H., Wu, J., Chen, S., Huang, J.: Collaborative channel pruning for deep networks. In: International Conference on Machine Learning, pp. 5113–5122 (2019)


  25. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)


  26. Theis, L., Korshunova, I., Tejani, A., Huszár, F.: Faster gaze prediction with dense networks and fisher pruning (2018). arXiv:1801.05787

  27. Tishby, N., Pereira, F., Bialek, W.: The information bottleneck method. In: Proceedings of the 37th Allerton Conference on Communication, Control and Computation, vol. 49 (2001)


  28. Tung, F., Mori, G.: CLIP-Q: Deep network compression learning by in-parallel pruning-quantization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7873–7882 (2018)


  29. Uhlich, S., et al.: Differentiable quantization of deep neural networks (2019). CoRR abs/1905.11452. http://arxiv.org/abs/1905.11452

  30. Wang, K., Liu, Z., Lin, Y., Lin, J., Han, S.: HAQ: hardware-aware automated quantization with mixed precision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612–8620 (2019)


  31. Wu, H., Judd, P., Zhang, X., Isaev, M., Micikevicius, P.: Integer quantization for deep learning inference: Principles and empirical evaluation (2020). arXiv:2004.09602

  32. Wu, S., Li, G., Chen, F., Shi, L.: Training and inference with integers in deep neural networks. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=HJGXzmspb

  33. Yang, H., Gui, S., Zhu, Y., Liu, J.: Automatic neural network compression by sparsity-quantization joint learning: a constrained optimization-based approach. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2178–2188 (2020)


  34. Yazdanbakhsh, A., Elthakeb, A.T., Pilligundla, P., Mireshghallah, F., Esmaeilzadeh, H.: ReLeQ: an automatic reinforcement learning approach for deep quantization of neural networks (2018). arXiv:1811.01704

  35. Ye, S., et al.: A unified framework of DNN weight pruning and weight clustering/quantization using ADMM (2018). arXiv:1811.01907


Acknowledgments

We would like to thank Jilei Hou and Joseph Soriaga for consistent support, and thank Jinwon Lee, Kambiz Azarian and Nojun Kwak for their great help in revising this paper and providing valuable feedback.

Author information

Corresponding author

Correspondence to Ying Wang.

Appendices

A Quantization Scheme in DJPQ

Figure 9 illustrates the proposed quantization scheme. First, a non-linear function is applied to map any weight input x to \(\tilde{x}\) (shown as the blue curve). Then uniform quantization is applied to \(\tilde{x}\); the quantized value \(x_q\) is shown in red.

Fig. 9. Illustration of the quantization scheme. The blue curve gives the nonlinear mapping function; the red curve corresponds to the quantized value. (Color figure online)
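The two-step structure can be sketched in a few lines; the power-law companding function and its `gamma` below are illustrative stand-ins for the paper's learned nonlinear mapping, not the authors' exact parametrization:

```python
import numpy as np

def quantize(x, num_bits=4, x_max=1.0, gamma=0.5):
    """Quantize x in two steps: a nonlinear mapping into [-1, 1], then a uniform grid.

    The power-law mapping (exponent gamma) is a hypothetical stand-in for the
    learned nonlinearity shown as the blue curve in Fig. 9.
    """
    s = np.sign(x)
    x_tilde = s * np.minimum(np.abs(x) / x_max, 1.0) ** gamma  # nonlinear mapping
    levels = 2 ** (num_bits - 1) - 1                           # symmetric signed grid
    return np.round(x_tilde * levels) / levels                 # uniform quantization

x = np.linspace(-1.0, 1.0, 9)
x_q = quantize(x, num_bits=3)  # at most 2*levels + 1 = 7 distinct output values
```

Because the mapping is applied before the uniform grid, small-magnitude inputs get finer effective resolution, which is the point of combining a nonlinearity with uniform quantization.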

B Experimental Details

B.1 Comparison of DJPQ with DQ

To show that joint optimization of pruning and quantization outperforms a quantization-only scheme such as DQ, Fig. 10 compares the weight and activation bit-widths of DQ and DJPQ for ResNet18 on ImageNet. The pruning-ratio curve of DJPQ is plotted in both figures to better show the pruning effect in the joint optimization scheme. As seen in the figure, the weight bit-widths of DQ and DJPQ differ little, but the activation bit-widths differ significantly: layers 0 to 6 in DJPQ have much larger activation bit-widths than in DQ, and those are exactly the layers with high pruning ratios in DJPQ. This provides clear evidence that pruning and quantization trade off against each other in DJPQ, resulting in lower redundancy in the compressed model.

B.2 DJPQ Results for MobileNetV2

Figures 11 and 12 show the optimized bit-width and pruning-ratio distributions, respectively. Earlier layers tend to have larger pruning ratios than later layers, and for many layers in MobileNetV2, DJPQ is able to quantize to small bit-widths.

Fig. 10. Comparison of bit-width distributions of DQ and DJPQ for ResNet18 on ImageNet. The top figure plots the weight bit-width for each layer; the bottom plots the corresponding activation bit-width. The green curve is the pruning ratio of DJPQ in each layer. (Color figure online)

B.3 Experimental Setup

The inputs for all experiments are uniformly quantized to 8 bits. For the two-stage schemes, 'VIBNet+fixed quant.' and 'VIBNet+DQ', VIBNet pruning is first optimized with SGD at a learning rate of 1e-3, with a pruning learning-rate scaling of 5. The strength \(\gamma \) is set to 5e-6. In DQ for the pruned model, \(\beta \) is set to 1e-11 and the learning rate to 5e-4; the learning-rate scaling for quantization is 0.05. The pruning threshold \(\alpha _{th}\) is set to 1e-3 in all cases. Each stage is trained for 20 epochs.
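The per-group learning-rate scaling mentioned above can be sketched with a toy SGD step; the parameter representation below is purely illustrative and not the authors' code:

```python
def sgd_step(groups):
    """One SGD update where each parameter group carries its own learning rate."""
    for g in groups:
        for p in g["params"]:
            p["value"] -= g["lr"] * p["grad"]

base_lr = 1e-3
weight = {"value": 1.0, "grad": 0.2}  # a regular network weight
gate = {"value": 0.5, "grad": 0.1}    # stand-in for a VIB pruning gate parameter

groups = [
    {"params": [weight], "lr": base_lr},
    {"params": [gate], "lr": base_lr * 5},  # pruning learning-rate scaling of 5
]
sgd_step(groups)
```

In practice this corresponds to passing separate parameter groups with scaled learning rates to the optimizer, so pruning gates move faster than the base weights.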

Fig. 11. Bit-width for MobileNetV2 on ImageNet with the DJPQ scheme

Fig. 12. Pruning ratio for MobileNetV2 on ImageNet with the DJPQ scheme

For DJPQ experiments on VGG7, the learning rate is set to 1e-3 with an Adam optimizer. The initial bit-width is 6. The strengths \(\gamma \) and \(\beta \) are 1e-6 and 1e-9, respectively, and the learning-rate scalings for pruning and quantization are 10 and 0.05. For DJPQ on ResNet18, we choose a learning rate of 1e-3 with SGD. The initial bit-widths for weights and activations are 6 and 8, respectively. The learning-rate scalings for pruning and quantization are 5 and 0.05, and the strengths \(\gamma \) and \(\beta \) are 1e-5 and 1e-10. For DJPQ on MobileNetV2, the learning rate is set to 1e-4 with SGD. The initial bit-width is 8. The strengths \(\gamma \) and \(\beta \) are set to 1e-8 and 1e-11, and the learning-rate scalings for pruning and quantization are 1 and 0.005, respectively.
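For easier reference, the per-model settings above can be collected in one place; the values are transcribed from this section, while the field names are our own:

```python
# DJPQ hyperparameters per model, transcribed from Sect. B.3.
# init_bits is (weight bits, activation bits); lr_scale multiplies the base lr.
DJPQ_CONFIGS = {
    "vgg7": {
        "optimizer": "adam", "lr": 1e-3, "init_bits": (6, 6),
        "gamma": 1e-6, "beta": 1e-9, "lr_scale": {"prune": 10, "quant": 0.05},
    },
    "resnet18": {
        "optimizer": "sgd", "lr": 1e-3, "init_bits": (6, 8),
        "gamma": 1e-5, "beta": 1e-10, "lr_scale": {"prune": 5, "quant": 0.05},
    },
    "mobilenetv2": {
        "optimizer": "sgd", "lr": 1e-4, "init_bits": (8, 8),
        "gamma": 1e-8, "beta": 1e-11, "lr_scale": {"prune": 1, "quant": 0.005},
    },
}
```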


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, Y., Lu, Y., Blankevoort, T. (2020). Differentiable Joint Pruning and Quantization for Hardware Efficiency. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12374. Springer, Cham. https://doi.org/10.1007/978-3-030-58526-6_16


  • DOI: https://doi.org/10.1007/978-3-030-58526-6_16


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58525-9

  • Online ISBN: 978-3-030-58526-6

