Qualitative evaluation of automatic assignment of keywords to images

https://doi.org/10.1016/j.ipm.2004.11.001

Abstract

In image retrieval, most systems lack user-centred evaluation since they are assessed against some chosen ground truth dataset. Precision and recall measured against this ground truth are treated as an acceptable surrogate for the judgment of real users. Much current research focuses on automatically assigning keywords to images to enhance retrieval effectiveness. However, evaluation methods are usually based on system-level assessment, e.g. classification accuracy against some chosen ground truth dataset. In this paper, we present a qualitative evaluation methodology for automatic image indexing systems. The automatic indexing task is formulated as one of image annotation, or automatic metadata generation for images. The evaluation is composed of two individual methods. First, the automatic annotation results are assessed by human subjects. Second, the subjects are asked to annotate a chosen set of test images, and their annotations are used as ground truth; the system is then run on this test set and its annotation results are judged against that ground truth. Most systems on which user-centred evaluation is conducted report only one of these methods; we believe that both need to be considered for a full evaluation. We also provide an example evaluation of our system based on this methodology. According to this study, the proposed evaluation methodology is able to provide a deeper understanding of a system’s performance.

Introduction

Evaluation is a critical issue for Information Retrieval (IR). Assessment of the performance or the value of an IR system for its intended task is one of the distinguishing features of the subject. The type of evaluation to be considered depends on the objectives of the retrieval system. In general, retrieval performance evaluation is based on a test reference collection, e.g. TREC, and on an evaluation measure, e.g. precision and recall (Baeza-Yates & Ribeiro-Neto, 1999).
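For reference, writing A for the set of items retrieved (or, in the annotation setting, the keywords assigned to an image) and R for the set of relevant, ground truth items, these standard measures are defined as

    \mathrm{precision} = \frac{|A \cap R|}{|A|}, \qquad \mathrm{recall} = \frac{|A \cap R|}{|R|}

so that precision penalises spurious assignments and recall penalises missed ones.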

Saracevic (1995) reviews the history and nature of evaluation in IR and describes six different levels of IR evaluation, from system to user levels. However, most IR evaluations are based only on the system levels and lack user-centred evaluation. To achieve a more comprehensive picture of IR performance and users’ needs, both system- and user-centred evaluations are needed. That is, we need to evaluate at different levels as appropriate and/or against different types of relevance (Dunlop, 2000). Examples of recent studies focusing on user judgments are Belkin et al. (2001), Hersh et al. (2001), and Spink (2002).

Due to advances in computing and multimedia technologies, the size of image collections is increasing rapidly. Content-Based Image Retrieval (CBIR), whose main goal is to design mechanisms for searching large image collections, has been an active research area for the last decade. As in traditional IR, studies of user issues in image retrieval are lacking (Fidel, 1997; Rasmussen, 1997).

Current CBIR systems index and retrieve images based on low-level features such as colour, texture, and shape, but it is difficult to find desired images using these features because they have no direct correspondence to the high-level concepts in humans’ minds. This is the so-called semantic gap problem. Bridging the semantic gap in image retrieval has attracted much work, generally focused on making systems more intelligent so that they can automatically understand image content in terms of high-level concepts (Eakins, 2002). Image annotation systems, i.e. systems that automatically assign one or more keywords to an image, have been developed for this purpose (Barnard et al., 2003; Kuroda & Hagiwara, 2002; Li & Wang, 2003; Park et al., 2004; Tsai et al., 2003; Vailaya et al., 2001).

To evaluate their annotation results, most of these systems rely only on some chosen dataset with ground truth, such as Corel. However, there is currently no standard image dataset for evaluation comparable to the web track of TREC for IR (Craswell, Hawking, Wilkinson, & Wu, 2003). Moreover, just as IR systems need to consider human subjects in evaluation, quantitative evaluation of current annotation systems is insufficient to validate their performance. Therefore, user-centred evaluation of image annotation systems is also necessary.
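To make the distinction concrete, the sketch below illustrates the purely quantitative, ground-truth style of evaluation referred to here: system-assigned keywords are compared with ground truth annotations image by image, and precision and recall are averaged over the test set. The data layout and function names are illustrative assumptions on our part, not the authors’ implementation.

    # Minimal sketch of ground-truth-based evaluation of image annotation:
    # compare system-assigned keywords with ground-truth keywords per image
    # and report the averaged precision and recall over the test set.

    def evaluate_annotations(system, ground_truth):
        """Both arguments map an image id to a set of keywords."""
        precisions, recalls = [], []
        for image_id, truth in ground_truth.items():
            assigned = system.get(image_id, set())
            hits = assigned & truth  # keywords that match the ground truth
            precisions.append(len(hits) / len(assigned) if assigned else 0.0)
            recalls.append(len(hits) / len(truth) if truth else 0.0)
        n = len(ground_truth)
        return sum(precisions) / n, sum(recalls) / n

    # Hypothetical two-image test set
    truth = {"img1": {"beach", "sky", "sea"}, "img2": {"building", "tree"}}
    system = {"img1": {"sky", "sea", "sand"}, "img2": {"tree", "car"}}
    print(evaluate_annotations(system, truth))  # (average precision, average recall)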

This paper is organised as follows. Section 2 reviews related work on qualitative evaluation of image retrieval algorithms and systems. Section 3 presents our qualitative evaluation methodology for image annotation systems. Section 4 shows an example of assessing our image annotation system based on the proposed methodology. Section 5 provides some discussion of the user-centred evaluation. Finally, some conclusions are drawn in Section 6.

Section snippets

Related work

For human assessment of image retrieval systems, the general approach is to ask human subjects to evaluate the systems’ outputs directly. For example, a questionnaire can be devised to ask the judges to rank their level of preference for each retrieved image, or the ease with which they were able to find desired images. For image annotation, the keywords associated with each image can be marked as relevant or irrelevant by the judges. Then, conclusions can be drawn from the analysis of

The evaluation methodology

The conclusion of Section 2 motivates a user-centred evaluation methodology for existing image annotation systems in terms of effectiveness, i.e. the quality and accuracy of image annotation. Fig. 1 shows the evaluation procedure. It is composed of the Type I and Type II evaluation methods described above. Both types of evaluation comprise three steps (research question formulation, data collection, and data analysis) and can provide different kinds of understanding of an image

The judges

We asked five judges (PhD research students), none of whom were experts in image indexing and retrieval, to decide whether the keywords assigned by our system and by the random-guessing approach were relevant to each image. There were three male and two female judges, all of whom were native English speakers.
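For illustration only, the sketch below shows one way such relevance judgements could be aggregated once collected: each judge marks every system-assigned keyword as relevant or irrelevant to its image, and a keyword counts as relevant when a majority of the five judges agree. The data layout and the majority rule are our assumptions, not necessarily the analysis used in this study.

    # Illustrative aggregation of per-keyword relevance judgements from five judges.
    # 1 = the judge considered the keyword relevant to the image, 0 = irrelevant.
    judgements = {
        ("img1", "sky"):  [1, 1, 1, 0, 1],
        ("img1", "sand"): [0, 1, 0, 0, 1],
        ("img2", "tree"): [1, 1, 1, 1, 1],
    }

    def proportion_relevant(judgements, n_judges=5):
        # A keyword counts as relevant when more than half of the judges say so.
        relevant = sum(1 for votes in judgements.values() if sum(votes) > n_judges / 2)
        return relevant / len(judgements)

    print(proportion_relevant(judgements))  # 2 of the 3 keywords pass -> about 0.67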

The test set

We considered two datasets: one was the Corel image collection and the other was supplied by Washington University.

Discussion

These results, especially those of Section 4.2.5, show that even a state-of-the-art automatic image indexing system such as CLAIRE cannot match the performance of human annotators in terms of annotation accuracy, especially as the classification scale (i.e. the number of words in the indexing vocabulary) increases. However, a closer inspection of these results indicates that the performance actually achieved may be useful in the context of building practical image retrieval systems which can take initial

Conclusion

Evaluation is a critical issue for information retrieval, and to fully understand the performance of IR systems it is necessary to consider both system- and user-centred evaluations. Image retrieval has become an active research area, and much of the current research effort is focused on automatically annotating or indexing images to facilitate search in image databases. Most existing automatic image annotation systems are evaluated against their annotation or

Acknowledgement

The authors would like to thank Chris Stokoe, James Malone, Sheila Garfield, Mark Elshaw, and Jean Davison for participating in the system evaluation.

References (58)

  • B.M. Mehtre et al. (1998). Content-based image retrieval using a composite color-shape approach. Information Processing and Management.
  • T.P. Minka et al. (1997). Interactive learning with a “society of models”. Pattern Recognition.
  • S.B. Park et al. (2004). Content-based image classification using a neural network. Pattern Recognition Letters.
  • D. Sánchez et al. (2003). Modelling subjectivity in visual perception of orientation for image retrieval. Information Processing and Management.
  • A. Spink (2002). A user-centered approach to evaluating human interaction with Web search engines: an exploratory study. Information Processing and Management.
  • D.M. Squire et al. (1998). Assessing agreement between human and machine clustering of image databases. Pattern Recognition.
  • J.K. Wu et al. (1998). Fuzzy content-based retrieval in image databases. Information Processing and Management.
  • R. Applegate (1993). Models of user satisfaction: understanding false positives. Reference Quarterly.
  • L. Armitage et al. (1997). Analysis of user need in image archives. Journal of Information Science.
  • R. Baeza-Yates et al. (1999). Modern information retrieval.
  • K. Barnard et al. (2003). Matching words and pictures. Journal of Machine Learning Research.
  • Barnard, K., & Shirahatti, N. V. (2003). A method for comparing content based image retrieval methods. In Proceedings...
  • Black Jr., J. A., Fahmy, G., & Panchanathan, S. (2002). A method for evaluating the performance of content-based image...
  • Conniss, L. R., Ashford, A. J., & Graham, M. E. (2000). Information seeking behaviour in image retrieval: VISOR I final...
  • I.J. Cox et al. (2000). The Bayesian image retrieval system, PicHunter: theory, implementation and psychophysical experiments. IEEE Transactions on Image Processing.
  • Craswell, N., Hawking, D., Wilkinson, R., & Wu, M. (2003). Overview of the TREC 2003 web track. In Proceedings of the...
  • M. Dunlop (2000). Reflections on Mira: interactive evaluation in information retrieval. Journal of the American Society for Information Science.
  • Efthimiadis, E. N., & Fidel, R. (2000). The effect of query type on subject searching behavior of image databases: an...
  • R. Fidel (1997). The image retrieval task: implications for the design and evaluation of image databases. New Review of Hypermedia and Multimedia.