1 Introduction

Humans are capable of learning new concepts from only a few empirical observations. This remarkable ability is arguably accomplished via transfer learning techniques, such as the bootstrapping learning strategy, where agents learn simple tasks first before tackling more complex activities [22]. For instance, humans begin to cruise and crawl before they learn how to walk. Learning to cruise and crawl allows infants to improve their locomotion skills, body balance, control of limbs, and perception of depth, all of which are crucial prerequisites for learning the more complex activity of walking [11, 16].

In many machine learning applications, a similar transfer learning strategy is desired when labeled examples that faithfully represent the entire target set \(\mathcal {Y}\) are difficult to obtain. This is often the case, for example, in image classification and in neural image decoding [14, 17]. The transfer learning strategy typically employed in this setting is called few-shot, one-shot, or zero-shot learning, depending on how many labeled examples are available during the training stage [10, 14]. Here, a desired target set \(\mathcal {Y}\) (classes) is learned indirectly by learning semantic attributes instead. These attributes are then used to predict the classes in \(\mathcal {Y}\).

The motivation behind the attribute-based learning approach with scarce data is close in spirit to the rationale of the bootstrapping learning strategy: it helps to learn simple tasks first before attempting to learn more complex activities. In the context of classification, semantic attributes are (chosen to be) abundant, with a single attribute spanning multiple classes. Hence, labeled examples for the semantic attributes are more plentiful, which makes the task of predicting attributes relatively easy. Moreover, the target set \(\mathcal {Y}\) is embedded in the space of semantic attributes, a.k.a. the semantic space, which makes it possible, perhaps, to predict classes that were rarely seen, if ever, during the training stage.

In this paper, we focus on the attribute-based zero-shot learning setting, where a finite number of semantic attributes is used to predict novel classes that were never seen during the training stage. More formally [14]:

Definition 1

(Attribute-Based Zero-Shot Setting). In the attribute-based zero-shot setting, we have an instance space \(\mathcal {X}\), a semantic space \(\mathcal {A}\), and a target set \(\mathcal {Y}\), where \(|\mathcal {A}|<\infty \) and \(|\mathcal {Y}|<\infty \). A sample S comprises m examples \(\{(X_i, Y_i, A_i)\}_{i=1,\ldots , m}\), with \(X_i\in \mathcal {X}\), \(Y_i\in \mathcal {Y}\), and \(A_i\in \mathcal {A}\). Moreover, \(\mathcal {Y}\) is partitioned into two non-empty subsets: the set of visible classes \(\mathcal {Y}_V = \bigcup _{(X_i,Y_i,A_i)\in S} \{Y_i\}\) and the set of hidden classes \(\mathcal {Y}_H = \mathcal {Y}{\setminus } \mathcal {Y}_V\). The goal is to use S to learn a hypothesis \(f:\mathcal {X}\rightarrow \mathcal {Y}_H\) that can correctly predict the hidden classes \(\mathcal {Y}_H\).

The key part of Definition 1 is the final goal. Unlike the traditional setting of learning, we no longer assume that the sample size m is large enough for all classes in \(\mathcal {Y}\) to be seen during the training stage. In general, we allow \(\mathcal {Y}\) to be partitioned into two non-empty subsets \(\mathcal {Y}_V\) and \(\mathcal {Y}_H\), which, respectively, correspond to the visible and the hidden classes in the given sample S. The classical argument for why the goal of learning to predict the hidden classes is possible in this setting is that the hidden classes \(\mathcal {Y}_H\) are coupled with the instances \(\mathcal {X}\) and the visible classes \(\mathcal {Y}_V\) via the semantic space \(\mathcal {A}\) [14].

Fig. 1. In the polygon shape recognition problem, the instances are images of polygons and we have five disjoint classes: equilateral triangles, non-equilateral triangles, squares, non-square rectangles, and non-rectangular parallelograms.

To illustrate the traditional argument for attribute-based zero-shot learning, let us consider the simple polygon shape recognition problem shown in Fig. 1. In this problem, the instance space \(\mathcal {X}\) is the set of images of polygons, i.e. two-dimensional shapes bounded by a closed chain of a finite number of line segments, while the target set \(\mathcal {Y}\) is the set of the five disjoint classes shown in Fig. 1.

In the traditional setting of learning, a large sample of instances S would be collected and a classifier would be trained on the sample (e.g. using one-vs-all or one-vs-one). One of the fundamental assumptions in the traditional setting of learning for guaranteeing generalization is the stationarity assumption; examples in the sample S are assumed to be drawn i.i.d. from the same distribution as the future examples. Along with a few additional assumptions, learning in the traditional setting can be rigorously shown to be feasible [1, 5, 19, 23].

In the zero-shot learning setting, by contrast, it is assumed that the target set \(\mathcal {Y}\) is partitioned into two non-empty subsets \(\mathcal {Y}=\mathcal {Y}_V\cup \mathcal {Y}_H\). During the training stage, only instances from the visible classes \(\mathcal {Y}_V\) are seen. The goal is to be able to predict the hidden classes correctly. This goal is arguably achieved by introducing a coupling between \(\mathcal {Y}_V\) and \(\mathcal {Y}_H\) via the semantic space. For example, we recognize that the five classes in Fig. 1 can be completely determined by the values of the following three binary attributes:

  • \(a_1\): Does the polygon contain four sides?

  • \(a_2\): Are all sides in the polygon of equal length?

  • \(a_3\): Does the polygon contain at least one acute angle?

The set of all possible answers to these three binary questions forms a semantic space \(\mathcal {A}\) for the polygon shape recognition problem. Given an instance X with the semantic embedding \(A=(a_1,a_2,a_3)\in \{0,1\}^3\), its class can be uniquely determined. For example, any equilateral triangle has the semantic embedding (0, 1, 1), which means that the polygon (1) does not contain four sides, (2) has sides that are all of equal length, and (3) contains at least one acute angle. Among the five classes in \(\mathcal {Y}\), only the class of equilateral triangles satisfies this semantic embedding. Similarly, the four remaining classes have unique semantic embeddings as well.
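To make this coupling concrete, the following minimal sketch (ours, not code from the paper) tabulates the embeddings realizable by the five classes under the attribute order \((a_1,a_2,a_3)\); the dictionary and function names are illustrative.

```python
# Semantic embeddings (a1, a2, a3) of the five polygon classes, as
# discussed above; the lookup is an illustrative sketch, not the paper's code.
EMBEDDING_TO_CLASS = {
    (0, 1, 1): "equilateral triangle",
    (0, 0, 1): "non-equilateral triangle",
    (1, 1, 0): "square",
    (1, 0, 0): "non-square rectangle",
    (1, 0, 1): "non-rectangular parallelogram",
    (1, 1, 1): "non-rectangular parallelogram",  # a rhombus: equal sides, no right angles
}

def class_from_attributes(a1: int, a2: int, a3: int) -> str:
    """Map a (predicted) attribute vector to the unique class satisfying it."""
    return EMBEDDING_TO_CLASS[(a1, a2, a3)]

print(class_from_attributes(1, 0, 0))  # -> "non-square rectangle"
```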

Because the classes can be inferred from the values of the three binary attributes mentioned above, it is often argued that hidden classes can be predicted by first learning to predict the values of the semantic attributes based on a sample (training set) S, and then using those predicted attributes to predict the hidden classes in \(\mathcal {Y}_H\) via some hand-crafted mappings [7, 9, 13–15, 17, 20, 21]. In our example in Fig. 1, suppose the class of non-square rectangles is never seen during the training stage. If we know that a polygon has the semantic embedding (1, 0, 0), which means that it has four sides, its sides are not all of equal length, and it does not contain any acute angles, then it seems reasonable to conclude that it is a non-square rectangle even if we have not seen any non-square rectangles in the sample S. Does this imply that zero-shot learning is a well-posed approach? We will show that the answer is, in fact, negative. The key ingredient in our argument is the fact that the mechanical task of predicting an attribute is quite different from the epistemological task of learning the correct meaning of the attribute.

The rest of the paper is structured as follows. We first explain why the two tasks of “predicting” an attribute and “learning” an attribute are quite different from each other. We illustrate this overlooked fact on the simple shape recognition problem of Fig. 1 and then demonstrate it in greater depth on synthetic and real datasets. Next, we use this distinction between “predicting” and “learning” to argue that the attribute-based zero-shot learning approach is fundamentally ill-posed, which, we believe, explains why the zero-shot learning algorithms previously proposed in the literature have not performed significantly better than supervised learning with very few training examples.

2 Why Learning and Predicting Are Two Different Tasks

2.1 The Polygon Shape Recognition Problem

Let us return to the original polygon shape recognition example of Fig. 1. Suppose that the two classes of non-square rectangles and non-rectangular parallelograms are hidden from the sample S. That is:

$$\begin{aligned} \mathcal {Y}_V&= \{\text {equilateral triangles,} \quad \text {non-equilateral triangles,} \quad \text {squares}\}\\ \mathcal {Y}_H&=\{\text {non-square rectangles,} \quad \text {non-rectangular parallelograms}\} \end{aligned}$$

In the attribute-based zero-shot learning setting, we learn to predict the three semantic attributes \((a_1,a_2,a_3)\) mentioned earlier based on the sample S that only contains examples from the three visible classes. Once we learn to predict them correctly based on the sample S, we are supposed to be able to recognize the two hidden classes via their semantic embeddings. The semantic embedding for non-square rectangles is, in this example, (1, 0, 0), while the semantic embeddings for non-rectangular parallelograms form the set \(\{(1,0,1), (1,1,1)\}\).

To see why this is, in fact, an incorrect approach, we note that the task of predicting an attribute aims, by definition, at utilizing all the relevant information in the sample S that aids prediction. In our example, since only the three visible classes \(\mathcal {Y}_V\) are seen in the sample S, a good predictor should infer from S the following logical assertions:

  1. If a polygon does not contain four sides, then it contains at least one acute angle. Formally:

    $$\begin{aligned} (a_1=0)\rightarrow (a_3=1) \end{aligned}$$

    From this, the contrapositive assertion \((a_3=0)\rightarrow (a_1=1)\) is deduced as well.

  2. If the sides of a polygon are not all of equal length, then it does not contain four sides. Formally:

    $$\begin{aligned} (a_2=0)\rightarrow (a_1=0) \end{aligned}$$

    Again, its contrapositive assertion also holds.

  3. If the polygon does not contain an acute angle, then all of its sides are of equal length. Formally:

    $$\begin{aligned} (a_3=0)\rightarrow (a_2=1) \end{aligned}$$

These logical assertions and others are likely to be used by a good predictor, at least implicitly, since they are always true in the sample S. In addition, such a predictor would have a good generalization ability if the instances continued to be drawn i.i.d. from the same distribution as the training sample S, i.e. if the set of visible classes remained unchanged.

If, on the other hand, an instance is now drawn from a hidden class in \(\mathcal {Y}_H\), then some of these logical assertions would no longer hold and the original algorithm that was trained to predict the semantic attributes would fail. This follows from the fact that instances drawn from the hidden classes have a different distribution. Therefore, the fact that classes can be uniquely determined by the values of the semantic attributes is of little importance here because the semantic attributes are likely to be predicted correctly for the visible classes only. Needless to say, this violates the original goal of the attribute-based zero-shot learning setting.
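This failure mode can be caricatured symbolically. In the toy sketch below (our own illustration, not an experiment from the paper), a predictor of \(a_3\) implements the shortcut \(a_3 = 1 - a_1\), which the correlations in S license: it is perfect on every visible class, yet it silently fails on non-rectangular parallelograms.

```python
# A toy, purely symbolic illustration: a "predictor" of a3 that implements
# the shortcut a3 = NOT a1. The shortcut agrees with every example in S
# (the visible classes) but is wrong for non-rectangular parallelograms.
VISIBLE = {  # class -> semantic embedding (a1, a2, a3)
    "equilateral triangle": (0, 1, 1),
    "non-equilateral triangle": (0, 0, 1),
    "square": (1, 1, 0),
}
HIDDEN = {
    "non-square rectangle": (1, 0, 0),
    "non-rectangular parallelogram": (1, 0, 1),
}

def shortcut_a3(a1: int) -> int:
    """Exploit the correlation (a1 = 0) <-> (a3 = 1) that holds in S."""
    return 1 - a1

for name, (a1, _, a3) in {**VISIBLE, **HIDDEN}.items():
    flag = "ok" if shortcut_a3(a1) == a3 else "WRONG"
    print(f"{name:32s} predicted a3={shortcut_a3(a1)}  true a3={a3}  [{flag}]")
```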

2.2 Optical Digit Recognition

To show that the previous argument on the polygon shape recognition problem is not contrived, let us examine a real classification problem in which we can visualize the decision rule used by the predictors. We will use the optical digit recognition problem to illustrate our argument. In order to be able to interpret the decision rule used by the predictor, we will use the linear support vector machine (SVM) algorithm [4, 6], trained without the bias term using the LIBLINEAR package [8].

One way of introducing a semantic space for the ten digits is to use the seven-segment display shown in Fig. 2. That is, the instance space \(\mathcal {X}\) is the set of noisy digits, the classes are the ten digits \(\mathcal {Y}=\{0,1,2,\ldots , 9\}\), and the semantic space is \(\mathcal {A}=\{0,1\}^7\), corresponding to the seven segments. For example, using the order of segments (a,b,c,d,e,f,g) shown in Fig. 2, the digit 0 has the semantic embedding (1, 1, 1, 0, 1, 1, 1) while the digit 1 has the semantic embedding (0, 0, 1, 0, 0, 1, 0), and so on.

Fig. 2. In optical digit recognition, a semantic space can be introduced using the seven-segment display shown in this figure.

Fig. 3. The instance space in the optical digit recognition problem is the set of noisy digits, where the value of every pixel is flipped with probability 0.1.

In our implementation, we run the experiment as follows (see Footnote 1). First, a perfect digit is generated, which is later contaminated with noise. In particular, every pixel is flipped with probability 0.1. As a result, the instance space is the set of noisy digits, as depicted in Fig. 3. Then, five digits are chosen to be in the visible set of classes, and the remaining digits are hidden during the training stage. We train classifiers that predict the attributes (i.e. segments) using the visible set of classes, and use those classifiers to predict the hidden classes afterward.
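A minimal sketch of this experiment is given below. It assumes the standard seven-segment labelling (a = top, b = top-right, c = bottom-right, d = bottom, e = bottom-left, f = top-left, g = middle) and a simple 7×5 bitmap renderer of our own devising; the exact image generator and segment ordering of Fig. 2 are not reproduced here. scikit-learn's LinearSVC wraps the LIBLINEAR solver used in the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Segment truth table under the standard labelling; Fig. 2's ordering may differ.
SEGMENTS = {0: "abcdef", 1: "bc", 2: "abdeg", 3: "abcdg", 4: "bcfg",
            5: "acdfg", 6: "acdefg", 7: "abc", 8: "abcdefg", 9: "abcdfg"}

def render(d, h=7, w=5):
    """Render a digit as a flattened h-by-w binary bitmap from its lit segments."""
    img, mid = np.zeros((h, w), dtype=int), h // 2
    on = SEGMENTS[d]
    if "a" in on: img[0, :] = 1
    if "g" in on: img[mid, :] = 1
    if "d" in on: img[-1, :] = 1
    if "b" in on: img[:mid + 1, -1] = 1
    if "c" in on: img[mid:, -1] = 1
    if "f" in on: img[:mid + 1, 0] = 1
    if "e" in on: img[mid:, 0] = 1
    return img.ravel()

def noisy(d, p=0.1):
    img = render(d)
    return np.where(rng.random(img.size) < p, 1 - img, img)  # flip each pixel w.p. p

visible, hidden, n = [0, 1, 2, 3, 4], [5, 6, 7, 8, 9], 500
X_tr = np.array([noisy(d) for d in visible for _ in range(n)])
X_te = np.array([noisy(d) for d in hidden for _ in range(n)])

# Binary relevance: one linear SVM per segment, fit without a bias term.
for s in "abcdefg":
    y_tr = np.array([s in SEGMENTS[d] for d in visible for _ in range(n)])
    if len(set(y_tr)) < 2:   # e.g. segment 'b' is lit in all of 0..4
        continue             # (cf. the black panels in Fig. 4)
    clf = LinearSVC(fit_intercept=False).fit(X_tr, y_tr)
    y_te = np.array([s in SEGMENTS[d] for d in hidden for _ in range(n)])
    print(s, "accuracy on hidden classes:", (clf.predict(X_te) == y_te).mean())
```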

Now, the classical argument for attribute-based zero-shot learning goes as follows:

  1. Every digit can be uniquely determined by its seven-segment display. When an exact match is not found, one can carry out a nearest neighbor search [7, 9, 15, 17, 20, 21] or a maximum a posteriori estimation method [13, 14].

  2. Every segment in {a,b,c,d,e,f,g} is a concept class by itself that spans multiple digits. Hence, the number of training examples available for each segment is large, which makes it easier to predict.

  3. Because each of the seven segments spans multiple classes, we no longer need to see all of the ten digits during the training stage in order to learn to predict the seven segments reliably.

This argument clearly rests on the assumption that “learning” a concept is equivalent to the task of reliably “predicting” it. From the earlier discussion of the polygon shape recognition problem, this assumption is, in fact, invalid. Figure 4 shows what happens when a proper subset of the ten digits is seen during the training stage. As shown in the figure, the linear classifier trained using SVM exploits the relevant information available in the training set to maximize its prediction accuracy on the attributes.

For example, when we train a classifier to predict the segment ‘a’ using the visible classes \(\{0,1,2,3,4\}\), a good predictor would use the fact that the segment ‘a’, which is the target concept, always co-occurs with the segment ‘g’. By the contrapositive rule, the absence of ‘g’ then implies the absence of the segment ‘a’. This is clearly seen in Fig. 4 (top left corner). Of course, what the predictor learns is even more complex than this, as shown in Fig. 4. When novel instances from the hidden classes are presented, these correlations no longer hold and the algorithm fails to predict the semantic attributes correctly. To reiterate, such a failure is fundamentally due to the fact that the hidden classes constitute a different distribution of instances from the one seen during the training stage.

The results of applying a linear SVM with binary relevance to predict the seven segments are shown in Fig. 4. In this figure, the blue regions correspond to the pixels that contribute positively to the decision rule for predicting the corresponding segment, while the red regions contribute negatively. There are two key takeaways from this figure. First, the prediction rule used by the classifier does not correspond to the “true” meaning of the semantic attribute. After all, the goal of classification is to be able to “predict” the attribute as opposed to learning what it actually means. Second, changing the set of visible classes can change the prediction rule for the same attribute quite notably. Both observations challenge the rationale behind the attribute-based zero-shot learning setting.
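Continuing the sketch from this subsection, the learned weight vector can be rendered in the same style as Fig. 4. This is a minimal illustration; `clf` refers to the last classifier fitted in the sketch above, and the 7×5 shape matches our bitmap renderer.

```python
import matplotlib.pyplot as plt

# Inspect the weight vector of one segment classifier, in the spirit of Fig. 4.
# With the "bwr_r" colormap, positive weights render blue and negative red.
w = clf.coef_.reshape(7, 5)  # assumes the 7x5 bitmaps from the sketch above
plt.imshow(w, cmap="bwr_r")
plt.title("Linear SVM weights for one segment")
plt.colorbar()
plt.show()
```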

2.3 Zero-Shot Learning on Popular Datasets

Next, we examine the performance of zero-shot learning on benchmark datasets. Two of the most popular datasets for evaluating zero-shot learning algorithms are the Animals with Attributes (AwA) dataset [14] and the aPascal-aYahoo (aP-aY) dataset [9] (see Footnote 2). We briefly describe each dataset next.

The Animals with Attributes (AwA) Dataset: The AwA dataset was collected by querying search engines, such as Google and Flickr, for images of 50 animal classes. Afterward, these images were manually filtered to remove outliers and duplicates. The final dataset contains 30,475 images, where the minimum number of images per class is 92 and the maximum is 1,168. In addition, 85 attributes are introduced. In the zero-shot learning setting, 40 (visible) classes are used for training and 10 (hidden) classes are used for testing [14].

Fig. 4. Linear SVM weight vectors for predicting the seven segments of noisy optical digits. The columns correspond to the seven segments \(\{a,b,c,d,e,f,g\}\) while the rows correspond to different choices of visible classes. From top to bottom, the visible classes are \(\{0,1,2,3,4\}\), \(\{5,6,7,8,9\}\), \(\{0,2,4,6,8\}\), and \(\{1,3,5,7,9\}\). Each panel depicts the coefficient vector w learned by linear SVM, where blue regions correspond to pixels that contribute positively towards the corresponding segment and red regions contribute negatively. The black panels are not applicable because the training sample S lacks either positive or negative examples for the corresponding attribute. (Color figure online)

The aPascal-aYahoo (aP-aY) Dataset: The aP-aY dataset contains 12,695 images, which were chosen from the PASCAL VOC 2008 dataset [9]. These images are used during the training stage of the zero-shot learning setting. In addition, a total of 2,644 images were collected from the Yahoo image search engine to be used during the test stage. The two sets of images have disjoint classes. More specifically, the training dataset contains 20 classes while the test dataset contains 12 classes. Moreover, every image has been annotated with 64 binary attributes.

Results: Table 1 presents fairly recent results reported on the two datasets AwA and aP-aY. The zero-shot learning algorithms included in the table are the direct attribute prediction (DAP) algorithm proposed in [13, 14], which is one of the standard baseline methods for this task, the indirect attribute prediction (IAP) algorithm proposed in [13, 14], the embarrassingly simple zero-shot learning algorithm proposed in [18], and the zero-shot random forest algorithm proposed in [12]. The best reported prediction accuracy is 49.3% on the AwA dataset and 26.0% on the aP-aY dataset.

Table 1. Fairly recent results of zero-shot learning algorithms. The reported figures are the multiclass prediction accuracies of the four algorithms on the two datasets AwA and aP-aY, taken from the original papers [12–14, 18]

In order to properly interpret the reported results, we have also provided in Table 1 the number of training examples from the hidden classes that would suffice, in a traditional supervised learning setting, to obtain the same accuracy reported by the zero-shot learning algorithms in the literature. These latter figures are obtained from the experimental study conducted in [14]. Note that while about 600 examples per visible class are used during the training stage, the best reported zero-shot prediction accuracy on the hidden classes is equivalent to the accuracy of supervised learning with fewer than 20 training examples per hidden class. In fact, the zero-shot learning accuracy reported on the aP-aY dataset is worse than the accuracy of supervised learning when as few as 2 training examples per hidden class are used.

When the area under the curve (AUC) is used as a performance measure, which is known to be more robust to class imbalance than the prediction accuracy, the apparent merit of zero-shot learning becomes even more questionable. For instance, the popular direct attribute prediction (DAP) algorithm on the AwA dataset achieves an AUC of 0.81, which is equivalent to the performance of supervised learning using as few as 10 training examples from each hidden class (cf. Tables 4 and 7b in [14]). Recall, by contrast, that over 600 examples per visible class are used for training.

3 A Mathematical Formalism

The above argument and empirical evidence on the ill-posedness of attribute-based zero-shot learning can be formalized mathematically. Incidentally, this will allow us to identify paradigms of zero-shot learning for which the above argument no longer holds.

As stated in Sect. 2 and illustrated in Fig. 4, the fundamental problem with attribute-based zero-shot learning is that it aims at learning concept classes (semantic attributes) with respect to one distribution of instances (i.e. when conditioned on the visible set of classes) with the goal of being able to predict those concept classes for an arbitrary distribution of instances (i.e. when conditioned on the unknown hidden set of classes). Clearly, this is an ill-posed strategy that violates the core assumptions of statistical learning theory.

To remedy this problem, we can cast zero-shot learning as a domain adaptation problem [18]. In the standard domain adaptation setting, it is assumed that the training examples are drawn i.i.d. from some source distribution \(\mathcal {D}_S\) whereas future test examples are drawn i.i.d. from a different target distribution \(\mathcal {D}_T\). Let \(h:\mathcal {X}\rightarrow \mathcal {Y}\) be a predictor. Then, the average misclassification error rate of h with respect to \(\mathcal {D}_T\) is bounded by:

$$\begin{aligned} \mathbb {E}_{(X,Y)\sim \mathcal {D}_T}\, \mathbb {1}\{h(X)\ne Y\} \le \mathbb {E}_{(X,Y)\sim \mathcal {D}_S}\, \mathbb {1}\{h(X)\ne Y\} + d_{TV}(\mathcal {D}_S, \mathcal {D}_T), \end{aligned}$$
(1)

where \(d_{TV}(\mathcal {D}_S, \mathcal {D}_T)\) is the total-variation distance between the two probability distributions \(\mathcal {D}_S\) and \(\mathcal {D}_T\) [2]. Similar bounds that also hold with high probability can be found in [3]. Hence, learning a good predictor h with respect to some source distribution \(\mathcal {D}_S\) does not guarantee a good prediction accuracy with respect to an arbitrary target distribution \(\mathcal {D}_T\) unless the two distributions are nearly identical.
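For finite sample spaces, where \(d_{TV}(P,Q)=\frac{1}{2}\sum _x |P(x)-Q(x)|\), the bound in Eq. (1) is easy to check numerically. The sketch below uses invented distributions purely for illustration:

```python
# A toy numeric check of the bound in Eq. (1) over a finite sample space,
# with d_TV(P, Q) = (1/2) * sum |P - Q|. Distributions are over (x, y) pairs
# and are made up for illustration.
def tv_distance(p, q):
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

def error_rate(h, dist):
    """Expected misclassification of h under a distribution over (x, y)."""
    return sum(prob for (x, y), prob in dist.items() if h(x) != y)

D_S = {(0, 0): 0.5, (1, 1): 0.5}                # source: h is perfect here
D_T = {(0, 0): 0.2, (1, 1): 0.2, (2, 1): 0.6}   # target: the support has shifted

h = lambda x: x % 2                             # predicts the parity of x
lhs = error_rate(h, D_T)                        # = 0.6, since h(2) = 0 != 1
rhs = error_rate(h, D_S) + tv_distance(D_S, D_T)
assert lhs <= rhs
print(lhs, "<=", rhs)                           # 0.6 <= 0.0 + 0.6
```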

Therefore, in order to turn zero-shot learning into a well-posed strategy, it is imperative that a common representation R(X) be used, such that the induced distribution of R(X) remains nearly unchanged whether the instances X are conditioned on the visible or on the hidden set of classes. Then, by learning to predict semantic attributes given R(X), generalization bounds, such as the one provided in Eq. (1), guarantee a good prediction accuracy in the zero-shot setting. One method that can accomplish this goal is to divide the instances \(X_i\) into multiple local segments \(X_i\rightarrow (Z_{i,1}, Z_{i,2},\ldots )\in \mathcal {Z}^r\) such that a classifier \(h:\mathcal {Z}\rightarrow \mathcal {A}\) is trained to predict the semantic attributes in every local segment separately. If these local segments have a stable distribution across the visible and hidden sets of classes, then zero-shot learning is feasible. A prototypical example of this approach is segmenting sounds into phonemes in word recognition systems and using those phonemes to recognize words (classes) [17].
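A minimal, self-contained sketch of this remedy on the digit problem of Sect. 2.2 is given below (our own illustration, under the same bitmap assumptions as before): the middle segment ‘g’ is predicted from its local 1×5 strip only. Before noise, that strip is either all-on or all-off, so its distribution does not depend on which digit, visible or hidden, it came from.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)

# Which digits light the middle segment 'g', under the standard labelling.
HAS_G = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 0, 8: 1, 9: 1}

def strip(d, p=0.1):
    """The local 1x5 patch R(X) for segment 'g', with pixels flipped w.p. p."""
    patch = np.full(5, HAS_G[d])
    return np.where(rng.random(5) < p, 1 - patch, patch)

visible, hidden, n = [0, 1, 2, 3, 4], [5, 6, 7, 8, 9], 500
X_tr = np.array([strip(d) for d in visible for _ in range(n)])
y_tr = np.array([HAS_G[d] for d in visible for _ in range(n)])
clf = LinearSVC().fit(X_tr, y_tr)  # default bias kept: the strip can be all-off

# The same local patch has the same distribution for the hidden digits,
# so the attribute predictor transfers.
X_te = np.array([strip(d) for d in hidden for _ in range(n)])
y_te = np.array([HAS_G[d] for d in hidden for _ in range(n)])
print("accuracy on hidden classes:", (clf.predict(X_te) == y_te).mean())
```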

4 Conclusion

Attribute-based zero-shot learning is a transfer learning strategy that has been widely studied in the literature. Its aim is to learn to predict novel classes that are never seen during the training stage by learning to predict semantic attributes instead. In this paper, we argue that attribute-based zero-shot learning is an ill-posed strategy because the two tasks of “predicting” and “learning” an attribute are fundamentally different. We demonstrate our argument on synthetic and real datasets and, finally, use it to explain the poor performance results that have been reported in the literature for various zero-shot learning algorithms on popular benchmark datasets.