
Knowledge Distillation: A Survey

Published in: International Journal of Computer Vision

Abstract

In recent years, deep neural networks have been very successful in both industry and academia, especially for computer vision tasks. The great success of deep learning is largely due to its ability to encode large-scale data with models containing billions of parameters. However, deploying these cumbersome models on resource-limited devices such as mobile phones and embedded systems is challenging, not only because of their high computational complexity but also because of their large storage requirements. To this end, a variety of model compression and acceleration techniques have been developed. As a representative type of model compression and acceleration, knowledge distillation effectively learns a small student model from a large teacher model, and it has received rapidly increasing attention from the community. This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher–student architectures, distillation algorithms, performance comparisons and applications. Furthermore, challenges in knowledge distillation are briefly reviewed, and comments on future research are discussed.
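To make the core idea concrete, the widely used response-based form of knowledge distillation trains the student to match the teacher's temperature-softened output distribution in addition to the ground-truth labels. The following is a minimal, purely illustrative PyTorch-style sketch of such a loss; the function name and hyperparameter values are assumptions for illustration and are not taken from the paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Minimal sketch of a vanilla (response-based) distillation loss.

    Combines standard cross-entropy on ground-truth labels with a KL-divergence
    term between the temperature-softened teacher and student output
    distributions. T and alpha are illustrative hyperparameters.
    """
    # Hard-label term: ordinary supervised cross-entropy for the student.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: match the teacher's softened class distribution.
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T ** 2)
    return (1.0 - alpha) * ce + alpha * kd
```

In a typical setup the teacher is frozen and run in evaluation mode, so only the student's parameters receive gradients from this loss.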




Notes

  1. A hint is the output of a teacher’s hidden layer, which is used to supervise the student’s learning.
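As a concrete, purely illustrative example of how such a hint is commonly used, the sketch below implements a simple feature-level hint loss: a 1x1 convolutional regressor projects the student's feature map to the teacher's channel dimension, and an L2 penalty pulls it toward the teacher's hidden-layer output. The class name and channel sizes are assumptions for illustration, not taken from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """Illustrative feature-level hint loss (assumes matching spatial sizes)."""

    def __init__(self, student_channels=64, teacher_channels=256):
        super().__init__()
        # Regressor mapping student features to the teacher's channel width.
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_hint):
        # Detach the hint so gradients flow only into the student and regressor.
        return F.mse_loss(self.regressor(student_feat), teacher_hint.detach())
```

In practice the hint term is added to the task loss (and possibly a soft-target loss) with a weighting coefficient chosen on a validation set.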

Acknowledgements

This work was supported in part by Australian Research Council Projects FL-170100117, IH-180100002 and IC-190100031, and by the National Natural Science Foundation of China (61976107).

Author information

Corresponding author

Correspondence to Dacheng Tao.

Additional information

Communicated by Minsu Cho.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Gou, J., Yu, B., Maybank, S.J. et al. Knowledge Distillation: A Survey. Int J Comput Vis 129, 1789–1819 (2021). https://doi.org/10.1007/s11263-021-01453-z

