Abstract
Image synthesis is a process of converting the input text, sketch, or other sources, i.e., another image or mask, into an image. It is an important problem in the computer vision field, where it has attracted the research community to attempt to solve this challenge at a high level to generate photorealistic images. Different techniques and strategies have been employed to achieve this purpose. Thus, the aim of this paper is to provide a comprehensive review of various image synthesis models covering several aspects. First, the image synthesis concept is introduced. We then review different image synthesis methods divided into three categories: image generation from text, sketch, and other inputs, respectively. Each sub-category is introduced under the proper category based upon the general framework to provide a broad vision of all existing image synthesis methods. Next, brief details of the benchmarked datasets used in image synthesis are discussed along with specifying the image synthesis models that leverage them. Regarding the evaluation, we summarize the metrics used to evaluate the image synthesis models. Moreover, a detailed analysis based on the evaluation metrics of the results of the introduced image synthesis is provided. Finally, we discuss some existing challenges and suggest possible future research directions.












Similar content being viewed by others
Explore related subjects
Discover the latest articles and news from researchers in related subjects, suggested using machine learning.Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
References
Adiban M, Safari A, Salvi G (2020) Step-gan: A step-by-step training for multi generator gans with application to cyber security in power systems. arXiv [eess.SP].
. Arjovsky M, Chintala S, Bottou L (2017) Wasserstein gan. arXiv [stat.ML]
Baraheem SS, Nguyen TV (2020b) Aesthetic-aware text to image synthesis. In 2020b 54th Annual Conference on Information Sciences and Systems (CISS), p 1–6
Baraheem SS, Nguyen TV (2020) Text-to-image via mask anchor points. Pattern Recognition Lett 133:25–32
Caesar H, Uijlings J, Ferrari V (2018) Coco-stuff: Thing and stuff classes in context. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, p 1209–1218
Cai L, Gao H, Ji S (2017) Multi-stage variational auto-encoders for coarse- to-fine image generation. arXiv [cs.CV]
Chalechale A, Mertins A, Naghdy G (2004) Edge image description us- ing angular radial partitioning. IEE Proc - Vis. Image Signal Process 151(2):93
Chen T, Cheng M-M, Tan P, Shamir A, Hu S-M (2009) Sketch2photo: internet image montage. ACM Trans Graph 28(5):1–10
Chen W, Hays J (2018) Sketchygan: towards diverse and realistic sketch to image synthesis. arXiv [cs.CV]
Chen H, Jiang L (2019) Efficient gan-based method for cyber-intrusion detection. arXiv [cs.LG
Chen X, Kingma DP, Salimans T, Duan Y, Dhariwal P, Schulman J, Sutskever I, Abbeel P (2016) Abbeel. Variational lossy autoencoder. arXiv [cs.LG]
Chen J, Shen Y, Gao J, Liu J, Liu X (2017a) Language-based image editing with recurrent attentive models. arXiv [cs.CV]
Chen L, Srivastava S, Duan Z, Xu C (2017b) Deep cross-modal audio- visual generation arXiv [cs.CV].
Chicco D (2021) Siamese neural networks: An overview. Methods Mol Biol 2190:73–94
Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Comaniciu D, Meer P (2002) obust analysis of feature spaces: color image segmentation. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
“Common problems,” Google Developers. https://developers.google.com/machine-learning/gan/problems (accessed Jan. 10, 2023)
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The cityscapes dataset for semantic ur- ban scene understanding. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Dalal N, Triggs B (2005) Histograms of oriented gradients for human de- tection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol 1, p 886–893
Dalmaz O, Yurt M, Cukur T (2022) Resvit: Residual vision transformers for multimodal medical image synthesis. IEEE Trans Med Imaging 41(10):2598–2614
Das A, Kottur S, Moura JM, Lee S, Batra D (2017) Visual dialog. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Davenport RK, Rogers CM, Russell IS (1973) Cross modal perception in apes. Neuropsychologia 11(1):21–28
Deng L (2012) The mnist database of handwritten digit images for ma- chine learning research [best of the web. IEEE Signal Process Mag 29(6):141–142
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, p 248–255
Denton EL, Chintala S, Fergus R 2015 Deep generative image models using a laplacian pyramid of adversarial networks. arXiv [cs.CV]
Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language under- standing. arXiv [cs.CL]
Dokmanic I, Parhizkar R, Ranieri J, Vetterli M (2015) Euclidean distance matrices: essential theory, algorithms and applications. IEEE Signal Process Mag 32(6):12–30
Dosovitskiy A, Springenberg JT, Brox T (2015) Learning to generate chairs with convolutional neural networks. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p 1538–1546
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J (2020) An image is worth 16x16 words: Transformers for image recognition at scale. arXiv [cs.CV]
Dumitrescu B (2017) Gram matrix representation. Signals and Communication Technology. Springer International Publishing, Cham, pp 23–69
Dumoulin V, Shlens J, Kudlur M (2016) A learned representation for artistic style. arXiv [cs.CV]
Eitz M, Richter R, Hildebrand K, Boubekeur T, Alexa M (2011) Photo- sketcher: interactive sketch-based image synthesis. IEEE Comput Graph Appl 31(6):56–66
Eitz M, Hays J, Alexa M (2012) How do humans sketch objects? ACM Trans Graph 31(4):1–10
Eitz M, Hildebrand K, Boubekeur T, Alexa M (2009) A descriptor for large scale image retrieval based on sketched feature lines. In Proceed- ings of the 6th Eurographics Symposium on Sketch-Based Interfaces and Modeling - SBIM
Elgammal A, Liu B, Elhoseiny M, Mazzone M (2017) Can: creative adversarial networks, generating ‘art’ by learning about styles and deviating from style norms. arXiv [cs.AI]
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compo- sitionality. arXiv [cs.CL]
He K, Zhang X, Ren S, Sun J. (2015) Deep residual learning for image recognition. arXiv [cs.CV]
Perarnau G, Weijer J, Raducanu B, A´lvarez JM (2016) Invertible conditional gans for image editing. arXiv [cs.CV]
Liu Y, Qin Z, Luo Z, Wang H (2017) Auto-painter: cartoon image generation from sketch by using conditional generative adversarial networks. arXiv [cs.CV]
Feng F, Li R, Wang X (2014) Cross-modal retrieval with correspondence autoencoder. In Proceedings of the ACM International Conference on Multimedia, vol MM 14
Feng Z, Xu C, Tao D (2019) Self-supervised representation learning by rotation feature decoupling. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p 10364–10374
Finlayson SG, Lee H, Kohane IS, Oakden-Rayner L (2018) Towards generative adversarial networks as a new paradigm for radiology education. arXiv [cs.CV]
Gadde R, Karlapalem K (2011) Aesthetic guideline driven photography by robots. Proceedings of the Twenty-Second international joint conference on Artificial Intelligence, p 2060–2065
Gao L, Chen D, Zhao Z, Shao J, Shen HT (2021) Lightweight dynamic conditional gan with pyramid attention for text-to-image synthesis. Pattern Recogn 110(107384):107384
Gao C, Liu Q, Xu Q, Wang L, Liu J, Zou C (2020) Sketchycoco: Image generation from freehand scene sketches,. arXiv [cs.CV].
Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems vol 2, p 2672–2680
Gou Y, Wu Q, Li M, Gong B, Han M (2020) Segattngan: Text to image generation with segmentation attention. arXiv [cs.CV]
. Gregor K, Danihelka I, Graves A, Rezende DJ, Wierstra D (2015) Draw: a recurrent neural network for image generation. arXiv [cs.CV]
Grother P (1995) Nist special database 19 handprinted forms and characters database.
Gulrajani I, Kumar K, Ahmed F, Taiga AA, Visin F, Vazquez D, Courville A (2016) Pixelvae: A latent variable model for natural images. arXiv [cs.LG]
Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC (2017) Improved training of wasserstein gans. arXiv [cs.LG]
Güngör A, Dar SU, Öztürk Ş, Korkmaz Y, Elmas G, Özbey M, Güngör A, Çukur T (2022) Adaptive diffusion priors for accelerated mri reconstruction. arXiv [eess.IV]
Hao W, Zhang Z, Guan H (2018) Cmcgan: a uniform framework for cross-modal visual-audio mutual generation. Proc. Conf. AAAI Artif. Intell, 32(1)
Harris Zellig S (1981) Distributional Structure. Springer Netherlands, Dordrecht, pp 3–22
He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition. arXiv [cs.CV]
Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. arXiv [cs.LG].
Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. arXiv [cs.LG]
Huang X, Belongie S (2017) Arbitrary style transfer in real-time with adaptive instance normalization. arXiv [cs.CV]
Huang GB, Mattar M, Berg T, Learned-Miller E (2008) Labeled faces in the wild: A database for studying face recognition in un- constrained environments. In Workshop on Faces in “Real-Life” Images: Detection, Alignment, and Recognition
Huang X, Liu M-Y, Belongie S, Kautz J (2018a) Multimodal unsupervised image-to-image translation, arXiv [cs.CV]
Huang H, Yu PS, Wang C (2018b) An introduction to image synthesis with generative adversarial nets, 2018b. arXiv [cs.CV]
Huiskes MJ, Lew MS (2008) Lew. The mir flickr retrieval evaluation. In Proceed- ing of the 1st ACM international conference on Multimedia information retrieval - MIR
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv [cs.LG]
Isola P, Zhu J-Y, Zhou T, Efros AA (2016) Image-to-image translation with conditional adversarial networks. arXiv [cs.CV]
Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: Convolutional architecture for fast feature embedding. arXiv [cs.CV]
Jinzhen Mu, Chen C, Zhu W, Li S, Zhou Y (2022) Taming mode collapse in generative adversarial networks using cooperative realness discriminators. IET Image Proc 16(8):2240–2262
Johnson J, Alahi A, Li Fei-Fei (2016) Perceptual Losses for Real-time Style Transfer and Super-resolution. Springer International Publishing, Cham
Jolicoeur-Martineau A, Piché-Taillefer R, Combes RT, Mitliagkas I (2020) Adversarial score matching and im- proved sampling for image generation. arXiv [cs.LG]
Amit Kamran S, Fariha Hossain K, Tavakkoli A, Zuckerbrod SL, Baker SA (2021) Vtgan: semi-supervised retinal image synthesis and disease prediction using vision transformers. arXiv [eess.IV]
Karras T, Aila T, Laine S, Lehtinen J (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv [cs.NE]
Kingma DP, Welling M (2013) Auto-encoding variational bayes. arXiv [stat.ML]
Kingma DP, Salimans T, Jozefowicz R, Chen X, Sutskever I, Welling M (2016) Welling. Improving variational inference with inverse autoregressive flow. arXiv [cs.LG].
Kiros R, Zhu Y, Salakhutdinov RR, Zemel R, Urtasun R, Torralba A, Fidler S (2015) Skip-thought vectors. arXiv [cs.CL]
Kolesnikov A, Zhai X, Beyer L (2019) Beyer. Revisiting self-supervised visual representation learning. arXiv [cs.CV]
Kong Z, Ping W, Huang J, Zhao K, Catanzaro B (2020) Diffwave: a versatile diffusion model for audio synthesis. arXiv [eess.AS]
Kramer MA (1991) Nonlinear principal component analysis using autoassocia- tive neural networks. AIChE J 37(2):233–243
Krizhevsky A (2009) Learning multiple layers of features from tiny images. Technical report, Journal of Software Engineering and Applications.
Krizhevsky A, Sutskever I, Hinton GE (2017) Imagenet classification with deep convolutional neural networks. Commun ACM 60(6):84–90
Kruskal JB (1964) Nonmetric multidimensional scaling: A numerical method. Psychometrika 29(2):115–129
Kumar N, Berg AC, Belhumeur PN, Nayar SK (2009) Attribute and simile classifiers for face verification. In 2009 IEEE 12th International Conference on Computer Vision. IEEE p 365–372
Li C, Wand M (2016) Precomputed Real-time Texture Synthesis with Markovian Generative Adversarial Networks. Springer International Publishing, Cham
Li B, Liu X, Dinesh K, Duan Z, Sharma G (2016) Creating a multi- track classical musical performance dataset for multimodal music analysis: challenges, insights, and applications. EEE Trans Multimedia 21(2):522–535
Li L, Sun Y, Hu F, Zhou T, Xi X, Ren J (2020) Text to realistic image generation with attentional concatenation generative adversarial networks. Discrete Dyn Nat Soc 2020(1):10
Li JG, Zhang XF, Jia CM, Xu JZ, Zhang L, Wang Y, Ma SW, Gao W (2020) Direct speech-to-image translation. IEEE Journal of Selected Topics in Signal Processing 14(3):517–529
Li Z, Deng C, Yang E, Tao D (2021) Staged sketch-to-image synthesis via semi-supervised generative adversarial networks. IEEE Trans Multi- Media 23:2694–2705
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: Common objects in context. arXiv [cs.CV]
Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2016) Feature pyramid networks for object detection. arXiv [cs.CV]
Lin YJ, Wu PW, Chang CH, Chang EY, Liao SW (2019) Liao. Relgan: Multi-domain image-to-image translation via relative attributes. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
Liu L, Chen R, Wolf L, Cohen-Or D (2010) Optimizing photo composition. Comput. Graph. Forum 29(2):469–478
Liu Y, Dellaert F (2002) A classification based similarity metric for 3D image retrieval, in Proceedings. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No.98CB36231), p. 800–805
Liu R, Yu Q, Yu S (2019) Unsupervised sketch-to-photo synthesis. arXiv [cs.CV]
Liu B, Zhu Y, Song K, Elgammal A (2020) Self-supervised sketch-to- image synthesis. arXiv [cs.CV]
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput vis 60(2):91–110
Lu Y, Wu S, Tai Y-W, C.-K. (2018) Tang. Image generation from sketch constraint using contextual gan. Computer Vision – ECCV 2018. Springer International Publishing, Cham, pp 213–228
Manjunath BS, Salembier P, Sikora T (2002) Introduction to mpeg-7: Multimedia content description interface. In: Manjunath BS, Salembier P, Sikora T (eds) Introduction to mpeg-7: Multimedia content description interface. John Wiley and Sons, Chichester
Mansimov E, Parisotto E, Ba J.L, Salakhutdinov R (2015) Generating images from captions with attention. arXiv [cs.LG]
Mao X, Li Q, Xie H, Lau RY, Wang Z, Paul Smolley S (2016) Least squares generative adversarial networks, 2016. arXiv [cs.CV]
Miller GA, Beckwith R, Fellbaum C, Gross D, Miller KJ (1990) In- troduction to wordnet: An on-line lexical database. Int j Lexicogr 3(4):235–244
Mirza M, Osindero S (2014) Conditional generative adversarial nets. arXiv [cs.LG].
Özbey M, Dar SU, Bedel HA, Dalmaz O, Özturk Ş, Güngör A, Çukur T (2022) Unsupervised medical image translation with adversarial diffusion models. arXiv [eess.IV]
Chen N, Zhang Y, Zen H, Ron Weiss J, Norouzi M, Chan W (2020) Wavegrad: Estimating gradients for waveform gener- ation. arXiv [eess.AS]
Nazeri K, Ng E, Joseph T, Qureshi FZ, Ebrahimi M (2019) Edgeconnect: Generative image inpainting with adversarial edge learning. arXiv [cs.CV]
Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models, p 18–24. arXiv [cs.LG]
Nilsback ME, Zisserman A (2008) Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision. Graphics and Image Processing.
Odena A, Olah C, Shlens J (2016) Conditional image synthesis with auxiliary classifier gans. arXiv [stat.ML]
Ojala T, Pietikäinen M, Harwood D (1996) A comparative study of tex- ture measures with classification based on featured distributions. Pattern Recognit 29(1):51–59
Osahor U, Kazemi H, Dabouei A, Nasrabadi N (2020) Quality guided sketch-to-photo image synthesis. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
Otsu N (1979) A threshold selection method from gray-level histograms. IEEE Trans Syst Man Cybern 9(1):62–66
Park H, Yoo Y, Kwak N (2018) Mc-gan: Multi-conditional generative adversarial network for image synthesis. arXiv [cs.CV]
Park T, Liu M-Y, Wang T-C, Zhu J-Y (2019a) Semantic image synthesis with spatially-adaptive normalization. arXiv [cs.CV]
Park T, Liu MY, Wang TC, Zhu JY (2019b) Semantic image synthesis with spatially-adaptive normalization. In 2019b IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p 2337–2346
Pereira JC, Coviello E, Doyle G, Rasiwasia N, Lanckriet G, Levy R, Vasconcelos N (2014) On the role of correlation and abstraction in cross- modal multimedia retrieval. IEEE Trans Pattern Anal Mach Intell 36(3):521–535
Qiao T, Zhang J, Xu D, Tao D (2019) Mirrorgan: Learning text-to-image generation by redescription. arXiv [cs.CL]
Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv [cs.LG]
Rajput GG, Prashantha (2019) Sketch based image retrieval using grid approach on large scale database. Procedia Comput. Sci 165:216–223
Rasmussen CE (1999) The infinite gaussian mixture model. In Proceedings of the 12th International Conference on Neural Information Processing Systems, p 554–560
Reed S, Akata Z, Yan X, Logeswaran L, Schiele B, Lee H (2016a) Generative adversarial text to image synthesis. arXiv [cs.NE]
Reed S, Akata Z, Lee H, Schiele B (2016a) Learning deep representations of fine-grained visual descriptions. arXiv [cs.CV]
Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. Springer International Publishing, Cham
Rother C, Kolmogorov V, Blake A (2004) Grabcut: Interactive foreground extraction using iterated graph cuts. ACM Trans Graph 23(3):309–314
Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen (2016) Improved techniques for training gans. In Proceedings of the 30th International Conference on Neural Information Processing Systems, p 2234–2242.
Sangkloy P, Burnell N, Ham C, Hays J (2016) The sketchy database: learning to retrieve badly drawn bunnies. ACM Trans Graph 35(4):1–12
Sangkloy P, Lu J, Fang C, Yu F, Hays J (2016b) Scribbler: controlling deep image synthesis with sketch and color. arXiv [cs.CV]
Sasaki H, Willcocks CG, Breckon TP (2021) Unit-ddpm: Unpaired image translation with denoising diffusion probabilistic models. arXiv [cs.CV]
Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing: a Publication of the IEEE Signal Processing Society 45(11):2673–2681
Sharma S, Suhubdy D, Michalski V, Kahou SE, Bengio Y (2018) Chat- painter: Improving text to image generation using dialogue. arXiv [cs.CV]
Sohl-Dickstein J, Weiss EA, Maheswaranathan N, Ganguli S (2015) Deep unsupervised learning using nonequilibrium thermodynam- ics. arXiv [cs.LG].
Song Y, Ermon S (2020) Improved techniques for training score- based generative models. Adv Neural Inf Process Syst 33(12438):12448
Souza DM, Wehrmann J, Ruiz DD (2020) Efficient neural architecture for text-to-image synthesis. arXiv [cs.LG]
Springenberg JT, Dosovitskiy A, Brox T, Riedmiller M (2014) Striving for simplicity: The all convolutional net. arXiv [cs.LG]
Stein BE, Meredith MA (1993) The merging of the senses. The MIT Press, Cambridge
Sushko V, Schönfeld E, Zhang D, Gall J, Schiele B, Khoreva A (2020) You only need adversarial supervision for semantic image synthesis. arXiv [cs.CV]
Szanto B, Pozsegovics P, Vamossy Z, Sergyan S (2011) Sketch4match — content-based image retrieval system using sketches. In 2011 IEEE 9th In- ternational Symposium on Applied Machine Intelligence and Informatics (SAMI), p 183–188
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p 1–9
Taigman Y, Yang M, Ranzato MA, Wolf L (2014) Deepface: closing the gap to human-level performance in face verification. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, p 1701–1708
Tanveer MI, Liu J, Hoque ME (2015) Unsupervised extraction of human- interpretable nonverbal behavioral cues in a public speaking scenario. In Proceedings of the 23rd ACM international conference on Multimedia - MM 15
Thaung L (2020) Advanced data augmentation: With generative adversarial networks and computer-aided design.
Torralba A, Fergus R, Freeman WT (2008) 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Trans Pattern Anal Mach Intell 30(11):1958–1970
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. arXiv [cs.CL]
Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol P-A (2010) Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11:3371–3408
Vroomen J, Gelder B (2000) Sound enhances visual perception: cross-modal effects of auditory organization on vision. J Exp Psychol Hum Percept Perform 26(5):1583–1590
Wah C, Branson S, Welinder P, Perona P, Belongie S (2011) The caltech-ucsd birds-200–2011 dataset.
Wah C, Branson S, Welinder P, Perona P, Belongie S (2011) The caltech-ucsd birds200–2011 dataset. Advances in Water Rerces - ADV WATER RESOUR.
Wang Z, Simoncelli EP, Bovik A (2003) Multi-scale structural similarity for image quality assessment. Ieee, New York
Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error measurement to structural similarity. IEEE Trans Image Processing 13(4):600–612
Wang X, Qiao T, Zhu J, Hanjalic A, Scharenborg O (2021) Generating images from spoken descriptions. IEEE ACM Trans Audio Speech Lang Process 29:850–865
Wang TC, Liu MY, Zhu JY, Tao A, Kautz J, Catanzaro B (2017) Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. ArXiv [Cs.CV]
Wang M,. Lang C,. Liang L,. Lyu G,. Feng S, and. Wang T (2020) Attentive generative adversarial network to bridge multi-domain gap for image syn- thesis. In 2020 IEEE International Conference on Multimedia and Expo (ICME,).
Welinder P, Branson S, Mita T, Wah C, Schroff F, Belongie S, Perona P (2010a) Caltech-ucsd birds 200”. Technical report cns-tr-2010a-001, California Institute of Technology.
Welinder P, Branson S, Perona P (2010b) The multidimensional wisdom of crowds. NIPS.
Wu W, Cao K, Li C, Qian C, Loy CC (2019) Transgaga: Geometry- aware unsupervised image-to-image translation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p 8012–8021
Xian W, Sangkloy P, Agrawal V, Raj A, Lu J, Fang C, Yu F, Hays J (2018) Texturegan: Controlling deep image synthesis with texture patches. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p 8456–8465
Xiao J, Hays J, Ehinger KA, Oliva A, Torralba A (2010) Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, p 3485–3492
Xie S, Tu Z (2015) Holistically-nested edge detection. In 2015 IEEE Inter- national Conference on Computer Vision (ICCV), p 1395–1403
Xu T, Zhang P, Huang Q, Zhang H, Gan Z, Huang X, He X (2017) At- tngan: Fine-grained text to image generation with attentional generative adversarial networks. arXiv [cs.CV]
Yan Z, Zhang H, Wang B, Paris S, Yu Y (2014) Automatic photo adjustment using deep neural networks. arXiv [cs.CV]
Yan X, Yang J, Sohn K, Lee H (2015) Attribute2image: conditional image generation from visual attributes. arXiv [cs.LG]
Yu Q, Yang Y, Song YZ, Xiang T (2015) Hospedales. Sketch-a-net that beats humans. arXiv [cs.CV]
Yu Q, Liu F, Song Y-Z, Xiang T, Hospedales TM, Loy CC (2016) Sketch me that shoe. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p 799–807
Yu J, Lin Z, Yang J, Shen X, Lu X, Huang TS (2018) Generative image inpainting with contextual attention. arXiv [cs.CV]
Zhang Z, Luo P, Loy CC, Tang X (2014) Deep learning face attributes in the wild. arXiv [cs.CV]
Zhang H, Xu T, Li H, Zhang S, Wang X, Huang X, Metaxas D (2016) Stackgan: text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv [cs.CV]
Zhang H, Xu T, Li H,. Zhang S, Wang X, Huang X, Metaxas D (2017) Stackgan++: Realistic image synthesis with stacked generative adversar- ial networks,. arXiv [cs.CV]
Zhang R, Isola P, Efros AA, Shechtman E, Wang O (2018) The unreasonable effectiveness of deep features as a perceptual metric. arXiv [cs.CV]
Zhang P, Zhang B, Chen D, Yuan L, Wen F (2020) Cross-domain cor- respondence learning for exemplar-based image translation. arXiv [cs.CV]
Zhang H, Koh JY, Baldridge J, Lee H, Yang Y (2021a) Cross-modal contrastive learning for text-to-image generation. arXiv [cs.CV]
Zhang J, Li K, Lai YK, Yang J (2021b) Pise: Person image synthesis and editing with decoupled gan. In 2021b IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) p 7978–7986
Zhao T, Chen C, Liu Y, Zhu X (2021) Guigan: Learning to generate gui designs using generative adversarial networks. arXiv [cs.HC]
Zhou B, Zhao H, Puig X, Fidler S, Barriuso A, Torralba A (2016) Se- mantic understanding of scenes through the ade20k dataset. Int J Comput Vision 127:302–321
Zhou B, Lapedriza A, Khosla A, Oliva A, Torralba A (2018) Places: A 10 million image database for scene recognition. IEEE Trans Pattern Anal Mach Intell 40(6):1452–1464
Zhu X, Goldberg AB, Eldawy M, Dyer CR, Strock B (2007) A text-to- picture synthesis system for augmenting communication. In Proceedings of the 22nd national conference on Artificial intelligence, vol 2, p 1590–1595
Zhu J-Y, Park T, Isola P, Efros AA (2017) Unpaired image-to- image translation using cycle-consistent adversarial networks. arXiv [cs.CV]
Zhu P, Abdal R, Qin Y, Wonka (2020) Wonka. Sean: Image synthesis with semantic region-adaptive normalization. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Zou C, Mo H, Gao C, Du R, Fu H (2019) Language-based colorization of scene sketches. ACM Trans Graph 38(6):1–16
Acknowledgements
The first author would like to thank Umm Al-Qura University, in Saudi Arabia, for the continuous support. This work has been supported in part by the University of Dayton Office for Graduate Academic Affairs through the Graduate Student Summer Fellowship Program. This work was also supported in part by the National Science Foundation under Grant NSF 2025234.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflicts of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Baraheem, S.S., Le, TN. & Nguyen, T.V. Image synthesis: a review of methods, datasets, evaluation metrics, and future outlook. Artif Intell Rev 56, 10813–10865 (2023). https://doi.org/10.1007/s10462-023-10434-2
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10462-023-10434-2
Keywords
Profiles
- Samah Saeed Baraheem View author profile
- Trung-Nghia Le View author profile