1 Introduction

Automatic image annotation (AIA) consists of describing an image with keywords related to its visual content. Traditional AIA methods are based on supervised learning: labeled images are used to train models in which the labels are taken as classes, and the trained models then assign those labels to new images. Recently, advances in deep learning have considerably increased performance in tasks such as image classification [1] and object recognition [2]. However, supervised approaches are still limited to assigning only the labels available in the training dataset.

As an alternative to the supervised approach to AIA, the so-called unsupervised approach has recently emerged [3]. Here, the dataset of labeled images is replaced by a reference collection of image-text pairs in which each image is embedded in textual information. An unsupervised method uses this unstructured information to annotate new images. This is a more challenging task because the labels to assign must be mined from the textual part. However, this scenario has the advantage of easy access to huge collections of images with associated text from sources such as Wikipedia, social networks and, in general, the web. Furthermore, any word mined from the text of the reference collection can be used to describe the visual content of images.

These two approaches to AIA work in different scenarios. On the one hand, supervised approaches are competitive when working with a fixed number of labels, which makes them appropriate for specific domains. However, supervised methods rely on huge labeled datasets, which are only feasible for a limited number of labels, e.g. ImageNet (http://www.image-net.org/), formed by images labeled with the presence or absence of 1000 object categories. On the other hand, unsupervised approaches are not restricted to a fixed number of words to assign, because these can be mined from the reference collection. Nevertheless, both approaches are evaluated under a supervised scenario using a fixed vocabulary of labels, e.g. see ImageCLEF (http://imageclef.org/). Figure 1 exemplifies the current evaluation: the outputs of three annotation systems are compared, two of them unsupervised, where the unsupervised methods label images with words extracted from a free vocabulary. Even though all the words assigned by the unsupervised methods seem relevant to the evaluated image, only those matching the ground truth exactly are taken into account by supervised evaluation metrics.

Fig. 1. Current evaluation framework.

Fig. 2. Semantic relatedness similarity among labels.

Due to this unfair evaluation, the effectiveness of unsupervised methods is not really known, since classic metrics do not capture the relevance of output words drawn from free vocabularies. With the aim of providing a more suitable evaluation that encourages new developments of unsupervised automatic image annotation (UAIA) methods, in this paper we propose a flexible evaluation framework that allows us to compare the coverage and relevance of labels assigned from a free vocabulary. We introduce new metrics that adapt classic ones, such as recall, precision and F1, to deal with approximate matchings between predicted labels and the ground truth. The proposed framework defines a scenario in which assignments from free vocabularies can be evaluated, labeled or unlabeled datasets can be used for the training phase, and available AIA resources can be exploited, since only a labeled test set is needed. We used the proposed framework to compare supervised and unsupervised AIA methods. The experimental results show that the proposed framework is a good alternative for evaluating supervised and unsupervised AIA methods under fair conditions by taking into account the semantic relatedness between assigned and ground-truth labels.

As far as we know, there is little work on the evaluation of UAIA methods. The most closely related contributions address automatic description generation [4]. However, that task goes beyond describing isolated parts of the image: it involves generating a textual description that verbalizes the content of the image. For evaluating descriptions, there exist measures that compute a score indicating the semantic similarity between the description generated by a system and a set of human-written reference descriptions. One of the most representative measures is Meteor (http://www.cs.cmu.edu/~alavie/METEOR/), which allows exact and synonym matchings and exhibits a high correlation with human judgments.

The remainder of the paper is organized as follows: Sect. 2 describes our evaluation framework; Sect. 3 reports experimental results. Finally, Sect. 4 presents some conclusions of this work.

2 Evaluation Framework for UAIA

To overcome the limitation in evaluation, many systems that follow an unsupervised approach have to perform a process that translates the free vocabulary extracted from the reference collection into the labels to be evaluated (those present in the ground truth). For instance, the words isle, resort, and coast from Method 2 in Fig. 1 need to be translated into island, hotel and beach, respectively. However, this kind of translation limits the diversity of words that could be mined from the reference collection. Our proposed evaluation framework overcomes this limitation by measuring the semantic relatedness between the ground truth and the words assigned by the UAIA method, even when there is no overlap between the two sets of words. In this sense, the labels assigned by the system can refer to anything related to the visual content of the image. The proposed evaluation framework follows two steps:

  1. Measuring the semantic relatedness. This is calculated between all words in the ground truth and all output words of the UAIA method. The idea is to assign a score that captures the semantic relatedness of each word pair, so that the best approximate matching for each ground-truth word can be identified (see the sketch after this list). For instance, in Fig. 2, the semantic relatedness between the output words of an unsupervised method and all words in the ground truth is calculated, where a score of 1 means an exact matching.

  2. Measuring coverage and relevance. Having the best approximate matching from the previous step, we conceived a way to measure coverage and relevance by adapting three classic measures: recall, precision and the F1 measure. More details about the adapted metrics are given in Subsect. 2.1.
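
To make the first step concrete, the following is a minimal Python sketch (not the authors' implementation) of how the pairwise relatedness scores and the best approximate match per ground-truth word could be computed; here sim stands for either of the relatedness measures of Subsect. 2.2.

```python
def relatedness_matrix(gt, outs, sim):
    """Score every (ground-truth word, output word) pair with a relatedness value in [0, 1]."""
    return {w_i: {w_j: sim(w_i, w_j) for w_j in outs} for w_i in gt}

def best_matches(gt, outs, sim):
    """For each ground-truth word, keep the output word with the highest relatedness score."""
    matrix = relatedness_matrix(gt, outs, sim)
    return {w_i: max(matrix[w_i].items(), key=lambda pair: pair[1]) for w_i in gt}

# Toy example with an exact-match similarity: only identical words score 1.
exact = lambda a, b: 1.0 if a == b else 0.0
print(best_matches(["island", "beach", "hotel"], ["isle", "beach", "resort"], exact))
```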

2.1 Measuring Coverage and Relevance in UAIA Methods

In order to decide whether a word from the ground truth is covered and which words assigned by the system are relevant, we have adapted classic metrics by establishing a threshold \(\alpha \) over the similarity between words. This \(\alpha \) regulates the strictness of the matching among words: an \(\alpha \) value of 1 requires an exact match, whereas a value of 0.2 accepts matchings with low semantic relatedness. Note that the definition of \(\alpha \) makes it possible to relax the matching among words as much as necessary. Accordingly, the new metrics for the evaluation of UAIA methods are as follows:

  • Recall\(_{\alpha }\) (coverage). Unlike classic recall, this adaptation reaches its maximum score when, for each word of the ground truth, there exists a matching with a word from the system output whose relatedness is at least \(\alpha \).

    $$\begin{aligned} R_{\alpha } = \frac{\sum _{w_{i}\in gt}1_{(\exists w_{j}:w_{j} \in outs \wedge sim(w_{i},w_{j})\ge \alpha )}}{|gt|} \end{aligned}$$
    (1)

    where gt refers to the set of ground-truth words, outs is the set of output words given by the annotation system, and \(sim(w_{i},w_{j})\) expresses the semantic relatedness between two words, which can be estimated by either of the two approaches defined in Subsect. 2.2.

  • Precision\(_\alpha \) (relevance). In this measure, the effectiveness of the output words is given by the proportion of output words whose relatedness to some ground-truth word reaches the \(\alpha \) threshold:

    $$\begin{aligned} P_{\alpha } = \frac{\sum _{w_{j}\in outs}1_{(\exists w_{i}:w_{i} \in gt \wedge sim(w_{i},w_{j})\ge \alpha )}}{|outs|} \end{aligned}$$
    (2)

    Precision\(_{\alpha }\) complements coverage with a score that expresses the quality of the words given by the system in relation to their quantity.

  • F1\(_\alpha \). This metric combines recall\(_{\alpha }\) and precision\(_{\alpha }\) by using a harmonic mean:

    $$\begin{aligned} F1_{\alpha } = 2 \cdot \frac{R_{\alpha }\cdot P_{\alpha }}{R_{\alpha } + P_{\alpha }} \end{aligned}$$
    (3)
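
The following is a minimal Python sketch (not the authors' implementation) of the three adapted metrics in Eqs. (1)-(3); it assumes that sim is any function returning a relatedness score in [0, 1], such as those of Subsect. 2.2.

```python
def recall_alpha(gt, outs, sim, alpha):
    """Eq. (1): fraction of ground-truth words matched by some output word with sim >= alpha."""
    covered = sum(1 for w_i in gt if any(sim(w_i, w_j) >= alpha for w_j in outs))
    return covered / len(gt)

def precision_alpha(gt, outs, sim, alpha):
    """Eq. (2): fraction of output words matching some ground-truth word with sim >= alpha."""
    relevant = sum(1 for w_j in outs if any(sim(w_i, w_j) >= alpha for w_i in gt))
    return relevant / len(outs)

def f1_alpha(gt, outs, sim, alpha):
    """Eq. (3): harmonic mean of recall_alpha and precision_alpha."""
    r = recall_alpha(gt, outs, sim, alpha)
    p = precision_alpha(gt, outs, sim, alpha)
    return 0.0 if r + p == 0 else 2 * r * p / (r + p)
```

With an exact-match sim (1 for identical words, 0 otherwise) and \(\alpha = 1\), these functions reduce to the classic recall, precision and F1.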

Furthermore, the relaxation defined by the \(\alpha \) threshold can be used to adapt other metrics such as MAP (Mean Average Precision), which is usually used to evaluate the ranked list of words output by AIA methods. Accordingly, the AP (Average Precision) is defined as:

  • AP\(_{\alpha }\). This adapted metric makes use of the approximate matching defined by the \(\alpha \) threshold in P\(_{\alpha }\), so, unlike the original metric, it is not necessary to define a separate relevance indicator function:

    $$\begin{aligned} AP_{\alpha } = \frac{\sum _{k=1}^{|outs|} P_{\alpha }(k)}{|gt|} \end{aligned}$$
    (4)

    where k is the rank position of the word to evaluate and

    $$\begin{aligned} P_{\alpha }(k) = \frac{\sum _{j=1}^{k}1_{(\exists w_{i}:w_{i} \in gt \wedge sim(w_{i},w_{j})\ge \alpha )}}{k} \end{aligned}$$
    (5)
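
A corresponding sketch for AP\(_\alpha \) follows, again as an illustration under the same assumptions; ranked_outs is the ranked list of output words, and MAP\(_\alpha \) is obtained by averaging AP\(_\alpha \) over all test images.

```python
def ap_alpha(gt, ranked_outs, sim, alpha):
    """Eqs. (4)-(5): average of the alpha-relaxed precision over all cut-off ranks k."""
    total = 0.0
    for k in range(1, len(ranked_outs) + 1):
        # P_alpha(k): relaxed precision of the top-k ranked output words.
        hits = sum(1 for w_j in ranked_outs[:k]
                   if any(sim(w_i, w_j) >= alpha for w_i in gt))
        total += hits / k
    return total / len(gt)
```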

Returning to the example of Fig. 1 and applying the proposed evaluation metrics at \(\alpha =0.7\), Method 1 obtains \(R_\alpha =0.66\), \(P_\alpha =1\) and \(F1_\alpha =0.79\), Method 2 obtains \(R_\alpha =1\), \(P_\alpha =1\) and \(F1_\alpha =1\), and Method 3 obtains \(R_\alpha =0.5\), \(P_\alpha =0.66\) and \(F1_\alpha =0.57\). Under this new evaluation, both unsupervised methods obtain competitive results and can be compared with the scores obtained by the supervised method, even when their words do not correspond exactly to the words of the ground truth.

2.2 Measuring the Semantic Relatedness

Measuring the semantic relatedness between two words is a well-known task in natural language processing. This relatedness is generally established from the context around the words, and it can be computed by knowledge-based approaches, e.g. using a lexical database like WordNet (https://wordnet.princeton.edu/), or by statistical-based approaches, e.g. distributed representations [5]. We describe these two approaches in the following paragraphs:

Knowledge-Based Approach. For instance, the WuP metric [6] calculates relatedness from the depths of the two words in the WordNet hierarchy, counted in number of nodes, together with the depth of their least common subsumer (lcs).
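
As an illustration only (not part of the original evaluation code), the WuP score between two nouns can be obtained with NLTK's WordNet interface; taking the maximum over all sense pairs is one common choice, assumed here.

```python
from nltk.corpus import wordnet as wn  # requires the WordNet data: nltk.download('wordnet')

def wup(word_a, word_b):
    """Wu-Palmer relatedness: best score over all noun-sense pairs of the two words."""
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(word_a, pos=wn.NOUN)
              for s2 in wn.synsets(word_b, pos=wn.NOUN)]
    return max(scores, default=0.0)

print(wup("island", "isle"))   # near 1: the words share a close sense
print(wup("island", "piano"))  # much lower: distant concepts
```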

Statistical-Based Approach. Distributed representations [5] represent words as vectors in an \(\mathbb {R}^r\) space. The purpose is to build representations in which semantically related words, i.e. words with similar meaning, have similar vectors, learned from the context around the words. The relatedness between two words is then given by the cosine similarity between their vectors.
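
A sketch of this approach using gensim is shown below; the vector file name is a placeholder, and any word2vec-format vectors (e.g., trained on a Wikipedia dump, as in Sect. 3) could be loaded instead.

```python
from gensim.models import KeyedVectors

# Placeholder path: any pre-trained vectors in word2vec format.
vectors = KeyedVectors.load_word2vec_format("wiki-vectors.bin", binary=True)

def w2v_sim(word_a, word_b):
    """Cosine similarity between the two word vectors; 0 if either word is out of vocabulary."""
    try:
        # Negative cosine values can be clipped to 0 to keep scores in [0, 1].
        return max(0.0, float(vectors.similarity(word_a, word_b)))
    except KeyError:
        return 0.0

print(w2v_sim("island", "isle"))
```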

3 Experiments and Results

This section is divided into two parts. The first part describes the dataset, the evaluated methods, and the experimental settings; the second part reports and discusses the obtained results.

3.1 Dataset, Methods and Experimental Settings

Dataset. We considered the benchmark used in the large-scale concept annotation task of ImageCLEF 2013 [7]. This dataset was created to be used exclusively with UAIA methods; it provides a reference collection of 250,000 documents, each composed of an image and a web page that provides textual information. We used the recently released test set for evaluation, which contains 2,000 images annotated with 116 different concepts.

Evaluated AIA Methods. We used two UAIA methods [3] that were developed to annotate images with any word extracted from a reference collection. The two methods exploit the textual information differently; both are flexible in the visual descriptors they use and allow fixing the number of words assigned as annotations. In addition, we evaluated a supervised method [8] that trains a context-dependent support vector machine (SVM) for each concept in the ground truth; this method has obtained the best results on the considered dataset. Following a traditional classification setup, the supervised method uses the development set to train a model for each concept.

Experimental Settings

  • We used the statistical-based approach for the calculation of semantic relatedness. The word vectors were obtained with word2vec [5] trained on Wikipedia. We performed the same experiments using the WuP metric over WordNet and obtained similar results.

  • In order to compensate for the low variability of the semantic relatedness values, the reported results use values normalized with the z-score, which yields a more proportional and intuitive scale, closer to human perception. These scores are then mapped to [0, 1] (a small illustrative sketch follows this list).

  • In order to report a fair comparison between supervised and unsupervised methods, the number of words assigned by the unsupervised methods was set to the maximum number of labels assigned by the supervised system (20 labels).
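
The exact normalization mapping is not fully specified above; one plausible reading, assumed here only for illustration, is a z-score followed by a min-max rescaling of all pairwise similarity values:

```python
import numpy as np

def normalize_scores(scores):
    """Z-score the raw relatedness values, then rescale them linearly to [0, 1].

    Illustrative assumption: the settings state a z-score normalization followed by a
    mapping to [0, 1] but do not detail the mapping; min-max rescaling is used here.
    """
    z = (scores - scores.mean()) / scores.std()
    return (z - z.min()) / (z.max() - z.min())

raw = np.array([0.31, 0.35, 0.36, 0.40, 0.95])  # hypothetical raw similarity values
print(normalize_scores(raw))
```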

3.2 Evaluating Proposed Metrics

Table 1 shows the results on the ImageCLEF 2013 test set using classic metrics for the different AIA methods under two scenarios. The top part reports the results obtained by considering a free vocabulary, while the bottom part reports the results obtained by considering only the labels in the test set.

Table 1. Classic metrics used in AIA - TWO SCENARIOS

In the latter case, the unsupervised methods (called Local and Global) show competitive recall. However, with the free vocabulary, the scores of the unsupervised methods drop drastically due to an almost null overlap with the ground truth. In the following, we evaluate the same systems with our proposed metrics under the free-vocabulary scenario. Our aim is to show that output words can be relevant even when they do not overlap exactly with the words of the ground truth. Figure 3 shows F1\(_{\alpha }\): at \(\alpha =0.8\) the unsupervised methods obtain similar results, whereas at the lowest levels there is a large separation, which makes a fairer comparison possible. To judge how good the words are at different levels of the \(\alpha \) threshold, see the examples in Subsect. 3.3. In Fig. 4, MAP\(_\alpha \) allows us to assess the ranking produced by the annotation systems. The figure shows that the difference between the unsupervised methods is reduced at \(\alpha =0.6\), indicating that both systems are competitive at placing the most relevant words for annotating images at the top of the ranking.

Fig. 3. Evaluation of the F1\(_{\alpha }\) measure.

Fig. 4. Evaluation of the MAP\(_\alpha \) measure.

Summing up, the last two figures show that the supervised method maintains its performance and can be compared more fairly with the unsupervised methods without losing its competitiveness.

3.3 How Relevant Are the Words at Different \(\alpha \) Thresholds?

This subsection presents a qualitative analysis of the words obtained at different levels of the \(\alpha \) threshold. Figure 5 shows images with their ground truth and some of the words generated by the unsupervised methods. Each column shows the words falling within a range of \(\alpha \) thresholds; the idea is to allow the reader to judge their relevance when a free vocabulary is used to label images.

Fig. 5. Words obtained at different levels of the \(\alpha \) threshold.

4 Conclusions

In this paper, we have proposed a flexible framework for the evaluation of UAIA methods. It aims at mitigating the unfair evaluation of UAIA methods with respect to supervised methods, and at boosting new developments. In this sense, our proposed framework allows us to compare the output labels of UAIA systems without restricting the number of assigned words or fixing the vocabulary of annotation. The evaluation scenario can be used with labeled or unlabeled datasets, taking advantage of available resources. The preliminary results show that it is possible to compare supervised and unsupervised methods. In this regard, we have shown that the performance of supervised methods is not affected. Besides, we showed that unsupervised approaches can be competitive, and that the difference between supervised and unsupervised methods is reduced under the proposed evaluation. Furthermore, the relevance criterion on the output words of unsupervised methods can be relaxed in order to increase the diversity of image annotations.