A survey on self-supervised learning methods for domain adaptation in deep neural networks focusing on the optimization problems

Document Type: Review Article

Authors

Department of Mathematics, Faculty of Sciences, University of Qom, Qom, Iran

Abstract

Deep convolutional neural networks have been widely and successfully used for various computer vision tasks. The main bottleneck in developing these models has been the lack of large datasets labeled by human experts. Self-supervised learning approaches have been used to address this challenge, making it possible to develop models for domains with only small labeled datasets. Another challenge is that the performance of deep learning models degrades when they are deployed on a target domain that differs from the source domain used for training. Domain adaptation refers to methods that adjust a model trained on a source domain, or its output, so that it achieves higher performance when applied to a target domain. This paper reviews the most commonly used self-supervised learning approaches and highlights their utility for domain adaptation.
