1 Introduction

Over the years, great progress has been made in image generation thanks to the advances in Generative Adversarial Networks (GANs) [6, 12]. As shown in Fig. 1, the generation quality and diversity have improved substantially from the early DCGAN [16] to the very recent Alias-free GAN [11]. After the adversarial training of the generator and the discriminator, the generator can be used on its own as a pretrained feedforward network for image generation: given a vector sampled from some random distribution, it synthesizes a realistic image as the output. However, such an image generation pipeline does not allow users to customize the output image, for example by changing the lighting condition of an output bedroom image or adding a smile to an output face image. Moreover, it remains poorly understood how a realistic image is assembled from the layer-wise representations of the generator. Therefore, we need to interpret the learned representations of deep generative models, both for understanding how they work and for the practical application of interactive image editing.

This chapter introduces recent progress in explainable machine learning for deep generative models. I will show how we can identify human-understandable concepts in the generative representation and use them to steer the generator for interactive image generation. Readers might also be interested in watching a relevant tutorial talk I gave at the CVPR'21 Tutorial on Interpretable Machine Learning for Computer Vision (Footnote 1). A more detailed survey on GAN interpretation and inversion can be found in [21].

Fig. 1. Progress of image generation made by different GAN models over the years.

This chapter focuses on interpreting pretrained GAN models, but a similar methodology can be extended to other generative models such as VAEs. Recent interpretation methods can be grouped into three approaches: the supervised approach, the unsupervised approach, and the embedding-guided approach. The supervised approach uses labels or classifiers to align meaningful visual concepts with the deep generative representation; the unsupervised approach identifies steerable latent factors in the deep generative representation by solving an optimization problem; the embedding-guided approach uses the recent pretrained language-image embedding CLIP [15] to let a free-form text description guide the image generation process.

In the following sections, I will select representative methods from each approach and briefly introduce them as primers for this rapidly growing direction.

2 Supervised Approach

Fig. 2. GAN Dissection framework and interactive image editing interface. Images are extracted from [4]. The method aligns each unit's activation with the semantic mask of the output image; thus, by turning the unit activation up or down, we can add or remove the corresponding visual concept in the output image.

The supervised approach uses labels or trained classifiers to probe the representation of the generator. One of the earliest interpretation methods is GAN Dissection [4]. Derived from the earlier Network Dissection work [3], GAN Dissection aims to visualize and understand the individual convolutional filters (which we term units) in the pretrained generator. It uses semantic segmentation networks [24] to segment the output images and then measures the agreement between the spatial location of each unit's activation map and the semantic mask of the output image. This method identifies a group of interpretable units closely related to object concepts, such as sofa, table, grass, and buildings. Those units can then be used as switches: we can add or remove objects such as a tree or a lamp by turning the activation of the corresponding units up or down. The framework of GAN Dissection and the image editing interface are shown in Fig. 2. In the interface, the user selects the object to be manipulated and brushes the output image where it should be removed or added.
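To make the dissection step concrete, below is a minimal sketch of how the agreement between one unit's activation map and one semantic concept could be scored with an intersection-over-union (IoU) measure. The array names, thresholding scheme, and quantile value are illustrative assumptions; the actual GAN Dissection implementation is more elaborate.

```python
import numpy as np

def unit_concept_iou(activation, seg_mask, concept_id, thresh_quantile=0.99):
    """Score how well one convolutional unit aligns with one semantic concept.

    activation: (H, W) activation map of a single unit, upsampled to image size.
    seg_mask:   (H, W) integer semantic segmentation of the generated image.
    concept_id: class index of the concept (e.g., "tree") in seg_mask.
    """
    # Threshold the unit activation at a high quantile to get a binary mask.
    t = np.quantile(activation, thresh_quantile)
    unit_mask = activation > t
    concept_mask = seg_mask == concept_id

    intersection = np.logical_and(unit_mask, concept_mask).sum()
    union = np.logical_or(unit_mask, concept_mask).sum()
    return intersection / max(union, 1)

# Hypothetical usage: rank the units of one layer by their agreement with "tree".
# activations: (num_units, H, W); seg: (H, W) from an off-the-shelf segmenter.
# ious = [unit_concept_iou(activations[u], seg, concept_id=TREE_ID)
#         for u in range(activations.shape[0])]
```

Units with a high IoU for a concept are the candidates that can be switched on or off to add or remove that concept in the output image.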

Besides steering the filters at the intermediate convolutional layers of the generator, as GAN Dissection does, the latent space from which we sample the input vector to the generator has also been explored. The underlying interpretable subspaces that align with certain attributes of the output image can be identified. Here we denote the pretrained generator as G(.) and the random vector sampled from the latent space as \(\mathbf{z} \); the output image is then \(I = G(\mathbf{z} )\). Different vectors yield different output images, so the latent space encodes various attributes of the images. If we can steer the vector \(\mathbf{z} \) through one relevant subspace while preserving its projection onto the other subspaces, we can edit one attribute of the output image in a disentangled way.
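As a toy illustration of this idea (not the exact procedure of any particular method discussed below), the sketch here shifts a latent code along one attribute direction after removing that direction's components along other known attribute directions; names such as `n_smile` are hypothetical.

```python
import numpy as np

def disentangled_edit(z, n_target, other_dirs, strength=3.0):
    """Shift latent code z along n_target while staying (approximately)
    unchanged along a set of other known attribute directions.

    z:          (d,) latent vector.
    n_target:   (d,) unit direction for the attribute we want to change.
    other_dirs: list of (d,) directions we want to keep fixed
                (assumed roughly orthonormal for this simple sketch).
    """
    # Remove from the edit direction its components along the other attributes,
    # so moving along it should (to first order) not change those attributes.
    edit = n_target.copy()
    for n in other_dirs:
        edit -= edit.dot(n) * n
    edit /= np.linalg.norm(edit)
    return z + strength * edit

# Hypothetical usage with a pretrained generator G:
# z_edit = disentangled_edit(z, n_smile, [n_age, n_pose])
# image  = G(z_edit)
```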

Fig. 3. We can use off-the-shelf classifiers to predict various attributes from the output images and then go back to the latent space to identify the attribute boundaries. The images below show the image editing results achieved by [22].

To align the latent space with the semantic space, we can first apply off-the-shelf classifiers to extract the attributes of the synthesized images and then compute the causal relation between the attributes occurring in the generated images and the corresponding vectors in the latent space. The HiGAN method proposed in [22] follows such a supervised approach, as illustrated in Fig. 3: (1) Thousands of latent vectors are sampled and the corresponding images are generated. (2) Various levels of attributes are predicted from the generated images by applying off-the-shelf classifiers. (3) For each attribute a, a linear boundary \(\mathbf{n} _a\) is trained in the latent space using the predicted labels and the latent vectors; we treat this as a binary classification problem and train a linear SVM to recognize each attribute, whose weight vector gives \(\mathbf{n} _a\). (4) A counterfactual verification step is taken to pick out the reliable boundaries. Here we follow a linear model to shift the latent code as

$$\begin{aligned} I' = G(\mathbf{z} + \lambda \mathbf{n} _a), \end{aligned}$$
(1)

where \(\mathbf{n} _a\) denotes the normal vector of the trained attribute boundary and \(I'\) is the edited image corresponding to the original image I. The difference between the predicted attribute scores before and after manipulation then becomes

$$\begin{aligned} \varDelta a = \frac{1}{K}\sum _{k=1}^K \max (F(G(\mathbf{z} _k + \mathbf{n} _a)) - F(G(\mathbf{z} _k)), 0), \end{aligned}$$
(2)

where F(.) is the attribute predictor applied to the input image and K is the number of synthesized images. Ranking \(\varDelta a\) allows us to identify the reliable attribute boundaries out of the candidate set \(\{ \mathbf{n} _a\}\), which contains about one hundred attribute boundaries trained in step (3) of the HiGAN method. After that, we can edit the output image of the generator by adding or subtracting the normal vector of the target attribute to or from the original latent code. Some image manipulation results are shown in Fig. 3.
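Below is a minimal sketch of steps (3) and (4), assuming we already have the sampled latent codes, a pretrained generator G, and an off-the-shelf attribute predictor F that returns a scalar score; the hyperparameters and helper names are illustrative rather than HiGAN's actual implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_attribute_boundary(latents, labels):
    """Step (3): fit a linear SVM in latent space for one binary attribute.

    latents: (N, d) sampled latent codes.
    labels:  (N,) binary labels predicted by an off-the-shelf classifier.
    Returns the unit normal vector n_a of the separating hyperplane.
    """
    svm = LinearSVC(C=1.0, max_iter=10000)
    svm.fit(latents, labels)
    n_a = svm.coef_.ravel()
    return n_a / np.linalg.norm(n_a)

def verify_boundary(G, F, latents, n_a, step=1.0):
    """Step (4): counterfactual verification, mirroring Eq. (2).

    Average positive increase of the predicted attribute score after pushing
    each latent code along n_a; larger values indicate a more reliable boundary.
    """
    deltas = []
    for z in latents:
        delta = F(G(z + step * n_a)) - F(G(z))
        deltas.append(max(delta, 0.0))
    return float(np.mean(deltas))
```

Boundaries with a high verification score are kept, and editing then amounts to generating \(G(\mathbf{z} + \lambda \mathbf{n} _a)\) for a chosen strength \(\lambda \).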

Similar supervised methods have been developed to edit facial attributes [17, 18] and to improve image memorability [5]. The steerability of various attributes in GANs has also been analyzed [9]. Besides, StyleFlow [1] replaces the linear model with a nonlinear invertible flow-based model in the latent space for more precise facial editing. Some recent work uses a differentiable renderer to extract 3D information from image GANs for more controllable view synthesis [23]. Many challenges remain for future work on the supervised approach, such as expanding the annotation dictionary, achieving more disentangled manipulation, and aligning the latent space with image regions.

3 Unsupervised Approach

As generative models become more and more popular, people have started training them on a wide range of image collections, such as cats and anime. To steer generative models trained for cat or anime generation with the supervised approach above, we would have to define the attributes of those images and annotate many of them to train the classifiers, which is a very time-consuming process.

Alternatively, the unsupervised approach aims to identify the controllable dimensions of the generator without using labels or classifiers.

SeFa [19] is an unsupervised approach for discovering the interpretable representation of a generator by directly decomposing the pretrained weights. More specifically, in the pretrained generator of the popular StyleGAN [12] or PGGAN [10] models, there is an affine transformation between the latent code and the internal activation. Thus the manipulation model can be simplified as

$$\begin{aligned} \mathbf{y} ' \triangleq G_1(\mathbf{z} ')&= G_1(\mathbf{z} + \alpha \mathbf{n} ) = \mathbf{A} \mathbf{z} + \mathbf{b} + \alpha \mathbf{A} \mathbf{n} = \mathbf{y} + \alpha \mathbf{A} \mathbf{n} , \end{aligned}$$
(3)

where \(G_1(\mathbf{z} ) = \mathbf{A} \mathbf{z} + \mathbf{b} \) is the first affine transformation of the generator, \(\mathbf{y} \) is the original projected code, and \(\mathbf{y} '\) is the projected code after manipulation by \(\mathbf{n} \). From Eq. (3) we can see that the manipulation process is instance independent: given any latent code \(\mathbf{z} \) together with a particular latent direction \(\mathbf{n} \), the editing can always be achieved by adding the term \(\alpha \mathbf{A} \mathbf{n} \) to the projected code after the first step. From this perspective, the weight parameter \(\mathbf{A} \) should contain the essential knowledge of the image variation. Thus we aim to discover the important latent directions by decomposing \(\mathbf{A} \) in an unsupervised manner, which leads to the following optimization problem:

$$\begin{aligned} \mathbf {N}^* = \mathop {\arg \max }_{\{\mathbf {N}\in R^{d\times k}: \mathbf {n}_i^T\mathbf {n}_i = 1\ \forall i=1,\cdots ,k\}} \sum _{i=1}^k ||\mathbf {A}\mathbf {n}_i||_2^2, \end{aligned}$$
(4)

where the columns of \(\mathbf{N} = [\mathbf{n} _1, \mathbf{n} _2, \cdots , \mathbf{n} _k]\) correspond to the top k semantics sorted by their eigenvalues, and \(\mathbf{A} \) is the learned weight of the affine transform between the latent code and the internal activation. This objective aims at finding the directions that cause large variations after the projection by \(\mathbf{A} \). The solution is given by the eigenvectors of the matrix \(\mathbf{A} ^T\mathbf{A} \) associated with the k largest eigenvalues. The resulting directions at different layers control different attributes of the output image, so pushing the latent code \(\mathbf{z} \) along the important directions \(\{\mathbf {n}_1, \mathbf {n}_2, \cdots , \mathbf {n}_k\}\) facilitates interactive image editing. Figure 4 shows some editing results.
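A minimal sketch of this closed-form solution is given below, assuming `A` is the weight matrix of the affine layer extracted from a pretrained generator; the released SeFa code additionally handles layer selection and other StyleGAN-specific details.

```python
import numpy as np

def sefa_directions(A, k=5):
    """Closed-form solution of Eq. (4): the top-k eigenvectors of A^T A.

    A: (out_dim, d) weight of the affine transform that projects the latent
       code into the first internal activation.
    Returns N with shape (d, k); each column is one semantic direction n_i.
    """
    eigenvalues, eigenvectors = np.linalg.eigh(A.T @ A)  # ascending order
    order = np.argsort(eigenvalues)[::-1][:k]            # take the k largest
    return eigenvectors[:, order]

# Hypothetical usage: edit an image along the i-th discovered direction.
# N = sefa_directions(A, k=5)
# image = G(z + alpha * N[:, i])
```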

Fig. 4. Manipulation results from SeFa [19] on the left and the interface for interactive image editing on the right. On the left, each attribute corresponds to some \(\mathbf{n} _i\) in the latent space of the generator. In the interface, the user can simply drag the slider bar associated with each attribute to edit the output image.

Many other methods have been developed for the unsupervised discovery of interpretable latent representations. Härkönen et al. [7] perform PCA on sampled latent data to find the primary directions in the latent space. Voynov and Babenko [20] jointly learn a candidate matrix of directions and a classifier such that the classifier can properly recognize the semantic directions in the matrix. Peebles et al. [14] develop a Hessian penalty as a regularizer for improving disentanglement during training. He et al. [8] design a linear subspace with an orthogonal basis in each layer of the generator to encourage the decomposition of attributes. Many challenges remain for the unsupervised approach, such as how to evaluate the discovered dimensions, how to annotate each of them, and how to improve disentanglement in the GAN training process.

4 Embedding-Guided Approach

The embedding-guided approach aligns a language embedding with the generative representation, allowing users to use any free-form text to guide the image generation. It differs from the unsupervised approach in that it conditions the manipulation on the given text, which makes the editing more flexible, whereas the unsupervised approach discovers steerable dimensions in a bottom-up way and thus lacks fine-grained control.

Recent work on StyleCLIP [13] combines the pretrained language-image embedding CLIP [15] with the StyleGAN generator [12] for free-form text-driven image editing. CLIP is an embedding model pretrained on 400 million image-text pairs. Given an image \(I_s\), StyleCLIP first projects it back into the latent space as \(\mathbf{w} _s\) using an existing GAN inversion method. It then solves the following optimization problem:

$$\begin{aligned} \mathbf{w} ^* = \mathop {\arg \min }_{\mathbf{w} } D_{CLIP}(G(\mathbf{w} ), t) + \lambda _{L2}||\mathbf{w} - \mathbf{w} _s||_2 + \lambda _{ID}L_{ID}(\mathbf{w} , \mathbf{w} _s), \end{aligned}$$
(5)

where \(D_{CLIP}(.,.)\) measures the distance between an image and a text using the pretrained CLIP model, t is the given text description, and the second and third terms are regularizers that keep the edited result close in latent code and identity to the original input image. This optimization thus yields a latent code \(\mathbf{w} ^*\) that generates an image close to the given text in the CLIP embedding space while remaining similar to the original input image. StyleCLIP further develops architectural designs to speed up this iterative optimization. Figure 5 shows text-driven image editing results.
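The sketch below illustrates the latent optimization of Eq. (5) using the OpenAI `clip` package and a generic differentiable generator `G` that maps a latent code to an image in \([-1, 1]\); the identity term and CLIP's exact image preprocessing are omitted for brevity, and the hyperparameters are illustrative rather than StyleCLIP's actual settings.

```python
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

def text_guided_edit(G, w_s, text, steps=200, lr=0.1, lambda_l2=0.008,
                     device="cuda"):
    """Optimize a latent code so the generated image matches a text prompt
    while staying close to the inverted code w_s (identity term omitted)."""
    model, _ = clip.load("ViT-B/32", device=device)
    model = model.float()  # keep everything in fp32 for simple gradients

    tokens = clip.tokenize([text]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(tokens)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    w = w_s.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        image = G(w)                                   # (1, 3, H, W) in [-1, 1]
        image = F.interpolate(image, size=224, mode="bilinear",
                              align_corners=False)
        image = (image + 1) / 2                        # crude CLIP preprocessing
        img_feat = model.encode_image(image)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

        clip_dist = 1 - (img_feat * text_feat).sum()   # cosine distance term
        loss = clip_dist + lambda_l2 * (w - w_s).pow(2).sum()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return w.detach()

# Hypothetical usage: w_edit = text_guided_edit(G, w_s, "a smiling face")
#                     edited_image = G(w_edit)
```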

Concurrent work called Paint by Word from Bau et al. [2] combines the CLIP embedding with region-based image editing. It uses a masked optimization objective that allows the user to brush the image to provide an input mask specifying where the edit should apply.

Fig. 5. Text-driven image editing results from a) StyleCLIP [13] and b) Paint by Word [2].

5 Concluding Remarks

Interpreting deep generative models leads to a deeper understanding of how the learned representations decompose images in order to generate them. Discovering human-understandable concepts and steerable dimensions in deep generative representations also facilitates promising applications in interactive image generation and editing. We have introduced representative methods from three approaches: the supervised approach, the unsupervised approach, and the embedding-guided approach. The supervised approach achieves the best image editing quality when labels or classifiers are available, while it remains challenging for the unsupervised and embedding-guided approaches to achieve disentangled manipulation. More future work is expected on the accurate inversion of real images and on precise local and global image editing.