Qualitative evaluation of automatic assignment of keywords to images
Introduction
Evaluation is a critical issue for Information Retrieval (IR). Assessment of the performance or the value of an IR system for its intended task is one of the distinguishing features of the subject. The type of evaluation to be considered depends on the objectives of the retrieval system. In general, retrieval performance evaluation is based on a test reference collection, e.g. TREC, and on an evaluation measure, e.g. precision and recall (Baeza-Yates & Ribeiro-Neto, 1999).
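To make these measures concrete, the following minimal sketch (our illustration, not drawn from any of the systems cited here; the keyword sets are hypothetical) computes set-based precision and recall for a single query:

```python
# Minimal sketch of set-based precision and recall for one query.
# The retrieved/relevant keyword sets below are hypothetical examples.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) given retrieved and relevant item sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                      # correct retrievals
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["sky", "water", "tree", "car"]   # system output for one image
relevant = ["sky", "tree", "grass"]           # ground-truth annotation
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f}, recall={r:.2f}")   # precision=0.50, recall=0.67
```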
Saracevic (1995) reviews the history and nature of evaluation in IR and describes six different levels of IR evaluation, from the system to the user level. However, most IR evaluations address only the system levels and lack user-centred evaluation. To achieve a more comprehensive picture of IR performance and of users’ needs, both system- and user-centred evaluations are needed. That is, we need to evaluate at different levels as appropriate and/or against different types of relevance (Dunlop, 2000). Examples of recent studies focusing on user judgments include Belkin et al. (2001), Hersh et al. (2001), and Spink (2002).
Due to advances in computing and multimedia technologies, the size of image collections is increasing rapidly. Content-Based Image Retrieval (CBIR), whose main goal is to design mechanisms for searching large image collections, has been an active research area for the last decade. As in traditional IR, however, studies of user issues in image retrieval are lacking (Fidel, 1997; Rasmussen, 1997).
Current CBIR systems index and retrieve images based on their low-level features, such as colour, texture, and shape. It is difficult to find desired images based on these low-level features because they have no direct correspondence to the high-level concepts in people’s minds; this is the so-called semantic gap problem. Bridging the semantic gap in image retrieval has attracted much work, generally focused on making systems more intelligent so that they automatically understand image content in terms of high-level concepts (Eakins, 2002). Image annotation systems, i.e. systems that automatically assign one or more keywords to an image, have been developed for this purpose (Barnard et al., 2003; Kuroda & Hagiwara, 2002; Li & Wang, 2003; Park et al., 2004; Tsai et al., 2003; Vailaya et al., 2001).
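To illustrate what "low-level" means here, the sketch below (our own illustration, not code from any cited system) computes a coarse, quantised RGB colour histogram from raw pixel values; a representation of this kind carries no direct information about high-level concepts such as "beach" or "sunset", which is precisely the semantic gap:

```python
# Minimal sketch: a quantised RGB colour histogram, one of the low-level
# features mentioned above. Pixels are plain (r, g, b) tuples in 0..255;
# in practice they would come from an image library such as Pillow.

def colour_histogram(pixels, bins_per_channel=4):
    """Assign each pixel to a coarse colour bin and return normalised counts."""
    step = 256 // bins_per_channel
    hist = [0] * bins_per_channel ** 3
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[idx] += 1
    total = len(pixels) or 1
    return [count / total for count in hist]

# A hypothetical four-pixel "image": two blue-ish and two green-ish pixels.
pixels = [(10, 20, 200), (15, 30, 220), (20, 180, 30), (25, 200, 40)]
features = colour_histogram(pixels)  # 64-dimensional feature vector
```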
To evaluate their annotation results, most of these systems rely only on some chosen dataset with ground truth, such as Corel. The problem is that there is currently no standard image dataset for evaluation comparable to the web track of TREC for IR (Craswell, Hawking, Wilkinson, & Wu, 2003). Moreover, since IR evaluation also needs to consider human subjects, quantitative evaluation alone is insufficient to validate the performance of current annotation systems. Therefore, user-centred evaluation of image annotation systems is also necessary.
This paper is organised as follows. Section 2 reviews related work on qualitative evaluation of image retrieval algorithms and systems. Section 3 presents our qualitative evaluation methodology for image annotation systems. Section 4 shows an example of assessing our image annotation system based on the proposed methodology. Section 5 discusses the user-centred evaluation. Finally, some conclusions are drawn in Section 6.
Related work
For human assessment of image retrieval systems, the general approach is to ask human subjects to evaluate the systems’ outputs directly. For example, a questionnaire can be devised to ask the human judges to rank their level of preference for each retrieved image, or the ease with which they were able to find desired images. For image annotation, the keywords associated with images can be judged relevant or irrelevant by the judges. Then, conclusions can be drawn from the analysis of …
The evaluation methodology
The conclusions of Section 2 motivate a user-centred evaluation methodology for existing image annotation systems in terms of effectiveness, i.e. the quality and accuracy of image annotation. Fig. 1 shows the evaluation procedure, which is composed of the Type I and Type II evaluation methods described above. Both types of evaluation comprise three steps: research question formulation, data collection, and data analysis. These steps can provide different kinds of understanding of an image …
The judges
We asked five judges (PhD research students), none of whom are experts in image indexing and retrieval, to decide whether the keywords assigned by our system and by the random-guessing approach are relevant to the image in question. There were three male judges and two female judges, all of whom were native English speakers.
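One simple way to summarise such judgments (a minimal sketch with hypothetical data, not the paper’s actual analysis procedure) is the average pairwise agreement among the judges:

```python
# Minimal sketch: mean pairwise agreement across judges. Each row holds the
# five judges' decisions for one assigned keyword (1 = relevant,
# 0 = irrelevant); the values below are hypothetical.

from itertools import combinations

def mean_pairwise_agreement(judgments):
    """Average, over items, the fraction of judge pairs giving the same decision."""
    per_item = []
    for item in judgments:
        pairs = list(combinations(item, 2))
        per_item.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_item) / len(per_item)

judgments = [
    [1, 1, 1, 0, 1],  # keyword judged relevant by four of five judges
    [0, 0, 1, 0, 0],  # keyword judged irrelevant by four of five judges
    [1, 1, 1, 1, 1],  # unanimous
]
print(f"{mean_pairwise_agreement(judgments):.2f}")  # 0.73
```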
The test set
We considered two datasets: the Corel image collection and a collection supplied by the University of Washington.
Discussion
These results, especially those of Section 4.2.5, show that even a state-of-the-art automatic image indexing system like CLAIRE cannot match the performance of human annotators in terms of annotation accuracy, especially as the classification scale (i.e. the number of words in the indexing vocabulary) increases. However, a closer inspection of these results indicates that the performance actually achieved may be useful in the context of building practical image retrieval systems which can take initial …
Conclusion
Evaluation is a critical issue for information retrieval, and to fully understand the performance of IR systems it is necessary to consider both system- and user-centred evaluations. Image retrieval has become an active research area, and much of the current research effort is focused on automatically annotating or indexing images to facilitate search in image databases. Most of the existing automatic image annotation systems are evaluated against their annotation or …
Acknowledgement
The authors would like to thank Chris Stokoe, James Malone, Sheila Garfield, Mark Elshaw, and Jean Davison for participating in the system evaluation.
References
- Belkin et al. Iterative exploration, design and evaluation of support for query reformulation in interactive information retrieval. Information Processing and Management (2001).
- An analysis of image retrieval tasks in the field of art history. Information Processing and Management (2001).
- Users’ relevance criteria in image retrieval in American history. Information Processing and Management (2002).
- A relevance feedback mechanism for content-based image retrieval. Information Processing and Management (1999).
- Eakins. Towards intelligent image retrieval. Pattern Recognition (2002).
- Image searching on the Excite Web search engine. Information Processing and Management (2001).
- Modeling and retrieving images by content. Information Processing and Management (1997).
- Comparison of edge detectors: a methodology and initial study. Computer Vision and Image Understanding (1998).
- Hersh et al. Challenging conventional assumptions of automated information retrieval with real users: Boolean searching and batch retrieval evaluations. Information Processing and Management (2001).
- Kuroda & Hagiwara. An image retrieval system by impression words and specific object names—IRIS. Neurocomputing (2002).
- Content-based image retrieval using a composite color-shape approach. Information Processing and Management.
- Interactive learning with a “society of models”. Pattern Recognition.
- Park et al. Content-based image classification using a neural network. Pattern Recognition Letters (2004).
- Modelling subjectivity in visual perception of orientation for image retrieval. Information Processing and Management.
- Spink. A user-centered approach to evaluating human interaction with Web search engines: an exploratory study. Information Processing and Management (2002).
- Assessing agreement between human and machine clustering of image databases. Pattern Recognition.
- Fuzzy content-based retrieval in image databases. Information Processing and Management.
- Models of user satisfaction: understanding false positives. Reference Quarterly.
- Analysis of user need in image archives. Journal of Information Science.
- Baeza-Yates & Ribeiro-Neto. Modern information retrieval (1999).
- Barnard et al. Matching words and pictures. Journal of Machine Learning Research (2003).
- The Bayesian image retrieval system, PicHunter: theory, implementation and psychophysical experiments. IEEE Transactions on Image Processing.
- Dunlop. Reflections on Mira: interactive evaluation in information retrieval. Journal of the American Society for Information Science (2000).
- Fidel. The image retrieval task: implications for the design and evaluation of image databases. New Review of Hypermedia and Multimedia (1997).