1 Introduction

Images and illustrations are sources of essential information within the biomedical domain. For example, images can be found in the articles appearing in biomedical publications and in the case reports contained in electronic health records. Within these resources, images are informative for a variety of tasks, and they often convey information not otherwise mentioned in surrounding text. Following the rapid progress in science and medicine, the volume of biomedical knowledge that is represented visually is constantly growing, and it is increasingly important that we provide a means for quickly accessing the most relevant images for a given information need. Not surprisingly, biomedical image retrieval systems have been developed to address this challenge.

Generally, image retrieval systems enable users to access images using one of several strategies. First, text-based image retrieval methods represent images with their associated descriptions or annotations. Using traditional text-based retrieval techniques, users search for images by providing a system with a description of the image content they desire to retrieve. Second, content-based image retrieval (CBIR) methods represent images with numeric feature vectors describing their appearance. Following a “query-by-example” paradigm, users query a CBIR system with some example image, and the system ranks retrieved images according to their visual similarity with the example. Finally, in an attempt to combine the strengths of these two approaches, multimodal image retrieval systems represent images with both descriptive text and content-based features. Footnote 1 This approach allows users to construct multimodal information requests consisting of a textual description of the image content they desire to retrieve that is augmented by visual features extracted from one or more representative images.

Figure 1 shows an example multimodal information request taken from the 2010 ImageCLEF Footnote 2 medical retrieval track data set (Müller et al. 2010). The textual description of this retrieval topic asks for “CT images containing a fatty liver,” and the example images are visual depictions of this request: the large gray mass in each of the abdominal CT scans is a liver and the white arrows indicate areas of fat accumulation (Hamer et al. 2006). A multimodal image retrieval system must process the textual description of the topic and extract content-based features from the example images in order to generate a multimodal query.

Fig. 1 Example multimodal topic taken from the 2010 ImageCLEF medical retrieval track data set

As evidence of the significant contribution the combination of text-based and content-based features can provide, a variety of multimodal image retrieval strategies have been proposed. Unfortunately, developing a multimodal retrieval system becomes challenging when the usability of the system and the quality of the results are primary and equal concerns. The limitations of existing methods can be attributed to deficiencies in the following areas:

  • Practicality Comparing content-based features across images generally requires significantly more computation time than comparing text-based features across documents. Because of this expense, some multimodal systems are unable to retrieve relevant images from large collections in an amount of time consistent with traditional text-based retrieval, the efficiency of which many users have come to expect.

  • Precision Effectively combining content-based and text-based features in a way that actually improves retrieval precision has proven to be challenging. In particular, for the task of retrieving images from biomedical articles, many multimodal systems are unable to significantly improve upon the precision of simple text-based methods.

The above drawbacks were recognized by Datta et al. (2008) in their survey of image retrieval trends when they remarked that “the future lies in harnessing as many channels of information as possible, and fusing them in smart, practical ways to solve real problems.”

We describe in this work global feature mapping (GFM), a practical solution for performing multimodal biomedical image retrieval. GFM advances the state of the art by enabling efficient access to the images in biomedical articles while simultaneously improving upon the average retrieval precision of existing methods. Recognizing the importance of text in determining image relevance, GFM combines a predominantly text-based image representation with a limited amount of visual information through the following process:

  1. Our system extracts a set of global content-based features from a collection of images and groups them into clusters.

  2. Our system maps each cluster to a unique alphanumeric code word that it then assigns to all images whose features are members of the cluster.

  3. Our system combines the code words assigned to an image with other text related to the image in a multimodal surrogate document that is indexable with a traditional text-based information retrieval system.

  4. Our system searches the index using a textual query generated from a multimodal topic by first assigning code words to the topic’s example images and then combining these words with the topic’s textual description (a minimal end-to-end sketch of these four steps follows this list).
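The following sketch illustrates the four steps above under simplifying assumptions: a single global feature, no descriptor partitioning, NumPy arrays for descriptors, scikit-learn k-means for clustering, and a toy in-memory inverted index standing in for a full text-based retrieval system. All names and sizes are illustrative; this is not the authors' implementation.

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

# Step 1: extract global descriptors and cluster them (one feature, no partitioning).
descriptors = np.random.rand(1000, 64)                  # hypothetical 64-D global feature
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(descriptors)

# Step 2: map each cluster to a code word and assign it to member images.
def code_word(feature, cluster, partition=1):
    return f"{feature}:k{cluster}p{partition}"

image_words = [code_word("cld", c) for c in kmeans.labels_]

# Step 3: combine code words with image-related text in surrogate documents and index them.
captions = [f"caption text for image {j}" for j in range(len(image_words))]
inverted_index = defaultdict(set)
for j, (word, caption) in enumerate(zip(image_words, captions)):
    for term in caption.split() + [word]:
        inverted_index[term].add(j)

# Step 4: query with the topic text plus the code word assigned to an example image.
example = np.random.rand(1, 64)
query_terms = ["fatty", "liver"] + [code_word("cld", kmeans.predict(example)[0])]
candidates = set().union(*(inverted_index.get(t, set()) for t in query_terms))
print(f"{len(candidates)} candidate images retrieved")
```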

We experimentally validated the success of GFM using the 2010 and 2012 (Müller et al. 2012) ImageCLEF medical retrieval track data sets. Our results show that on both collections, GFM achieves statistically significant improvements in mean average precision (averaging 6.31 %, p < 0.05) over a competitive text-based retrieval approach. When configured for performing content-based retrieval, GFM also demonstrates a significant improvement in precision compared with standard methods. As evidence of its practicality, our results show that GFM requires a response time comparable to that of text-based retrieval, suggesting that it is an appropriate technique for indexing large image collections.

To further demonstrate its practicality, we implemented GFM in two information retrieval systems: a general purpose system based on the vector space retrieval model and a biomedical system that utilizes a probabilistic retrieval model. We obtained statistically significant improvements using both systems. Finally, we have incorporated GFM into the OpenI system (Demner-Fushman et al. 2012). OpenI is a multimodal biomedical image retrieval platform that currently indexes over one million images taken from the articles included in the open access subset of PubMed Central\(^{\circledR}\). OpenI is a publicly accessible service Footnote 3 developed by the U.S. National Library of Medicine.

The remainder of this article is organized as follows. We review in Sect. 2 existing image retrieval strategies. We present the details of GFM in Sect. 3, and we discuss the reuse of text-based systems for performing multimodal retrieval in Sect. 4. We describe our evaluation of GFM on the ImageCLEF data sets in Sect. 5, our two prototype implementations of GFM in Sect. 6, and our experimental results in Sect. 7. Finally, we discuss the significance of our results in Sect. 8.

2 Background and related work

Image retrieval is a broad and well-researched topic whose scope is far greater than our immediate task of efficiently retrieving biomedical images. In this section, we first briefly review the general strengths and weaknesses of well-known text-based and content-based image retrieval strategies. We then discuss current work related to multimodal retrieval.

2.1 Text-based image retrieval

Text-based image retrieval systems represent images using descriptive text. For example, the images contained in biomedical articles can be represented by their associated captions. Using this text, a collection of surrogate documents is created to represent a given set of images. These documents are indexed with a traditional text-based information retrieval system, and they are searched using text-based queries.

There are several advantages to indexing and retrieving images using text. First, text-based retrieval is a well-understood topic, and the knowledge gained in this area is easily applied to the retrieval of images when they are represented by related text. Second, text-based retrieval is efficient. Because words are discrete data, image surrogates can be indexed in data structures that allow for low latency retrieval, such as inverted file indices. Additionally, because text-based image queries are typically sparse, only a fraction of the surrogates in an index must be scored and ranked for a given query. Finally, a text-based representation allows for semantic image retrieval, enabling us to search for images by providing a system with a description of the content we desire. By “semantic image retrieval” we are referring to the ability of a system to reason beyond the surface form of an information request. For example, it is common for text-based biomedical retrieval systems to perform query expansion using ontological resources such as the Unified Medical Language System\(^{\circledR}\) (UMLS\(^{\circledR}\)) (Lindberg et al. 1993). Footnote 4 A query for the term “heart attack” could then be used to retrieve documents mentioning the term “myocardial infarction” since these terms, although having different surface representations, refer to the same concept.

Unfortunately, text can often be a poor substitute for image content. For example, authors sometimes do not write meaningful captions for the images they include in their articles. For a text-based image retrieval system to be effective, the surrogate documents with which it represents images must adequately reflect the content that is requested of it.

2.2 Content-based image retrieval

CBIR systems represent images as numeric vectors. These multidimensional visual descriptors characterize features of the images’ content, such as their color or texture patterns. A CBIR system queries a collection of images using an example image, and it ranks the images according to their visual similarity with the example.

The advantage of CBIR systems compared to text-based image retrieval systems is their ability to perform searches based on visual similarity. Such an ability is useful, for example, for finding within a collection of images all images that are nearly identical to one another, regardless of the context in which they appear. Müller et al. (2004) survey the use of CBIR systems in medical applications.

However, there are several disadvantages to retrieving images using their content. First, the visual similarity of a retrieved image with some example image is not always indicative of its relevance to a query. Whereas text-based image retrieval systems provide a means to access relevant images using descriptive text, semantic retrieval is difficult to achieve using visual similarity alone. Footnote 5 Second, CBIR systems are usually not as efficient as text-based image retrieval systems. Because visual descriptors can be highly dimensional, dense, and continuous-valued, computing the similarity between any two images can often be a computationally intensive task. Footnote 6

CBIR systems judge the similarity of images using a distance measure computed between their extracted visual descriptors. Although many distance measures have been proposed (e.g., Rubner et al. 2000), below we illustrate visual similarity using Euclidean distance. Assume the vectors \({\bf f}_{q}^{x}\) and \({\bf f}_{j}^{x}\) represent the visual descriptors of some feature x extracted for images \(I_q\) and \(I_j\). The similarity of these two images for feature x is defined as:

$$\hbox{sim}(I_q, I_j) = 1-\frac{\Vert {\bf f}_{q}^{x} - {\bf f}_{j}^{x} \Vert}{\max\limits_{m,n} \Vert {\bf f}_{m}^{x} - {\bf f}_{n}^{x} \Vert}$$
(1)

Thus, their similarity is equal to one minus the normalized Euclidean distance between their visual descriptors. The denominator of the above function computes the maximum distance between the descriptors extracted for all images I m and I n within some collection. This min–max normalization ensures the computed value is always defined on the interval [0,1].

A naïve content-based retrieval approach involves using the above equation to compute the visual similarity between an example image and every image within some collection. The images are then ranked by sorting them in decreasing order of their similarity with the example. We refer to this approach as the brute-force retrieval strategy. Although the brute-force strategy is adequate for small image collections, it does not scale to large collections, and its use is impractical for many retrieval tasks. A variety of techniques exist for reducing the cost associated with the brute-force retrieval approach. Below, we briefly review some of these general approaches before discussing in Sect. 2.3 work specifically related to GFM’s multimodal retrieval strategy.
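As a point of reference, the following sketch implements the brute-force strategy with the normalized Euclidean similarity of Eq. (1), assuming NumPy descriptors for a single feature; the collection-wide maximum distance in the denominator is itself a quadratic-time computation, which illustrates why the approach does not scale. Names and sizes are illustrative.

```python
import numpy as np

def brute_force_rank(query_desc, collection_descs):
    """Rank collection images by Eq. (1): one minus the normalized Euclidean distance."""
    # Maximum pairwise distance over the collection (the denominator of Eq. 1).
    diffs = collection_descs[:, None, :] - collection_descs[None, :, :]
    max_dist = np.linalg.norm(diffs, axis=-1).max()

    # Distance from the query descriptor to every collection descriptor.
    dists = np.linalg.norm(collection_descs - query_desc, axis=1)
    sims = 1.0 - dists / max_dist

    # Indices of the collection images in decreasing order of similarity.
    return np.argsort(-sims)

descs = np.random.rand(200, 32)   # hypothetical 32-D global descriptors for 200 images
query = np.random.rand(32)
print(brute_force_rank(query, descs)[:10])
```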

2.2.1 Exact representations

Spatial data structures provide an efficient means of storing and retrieving visual descriptors, and they are well-understood within the metric space approach to similarity search (Zezula et al. 2006). These data structures are commonly organized as search trees created by first recursively partitioning a space into regions and then assigning objects to these regions. While many spatial data structures share similar organizations, they often differ in the way in which they partition a space. Examples of spatial data structures include vantage-point trees (Yianilos 1993), generalized hyperplane trees (Uhlmann 1991), geometric nearest-neighbor access trees (Brin 1995), and M-trees (Ciaccia et al. 1997). Within Euclidean spaces, k-d trees (Bentley 1975) and R-trees (Guttman 1984) are common.

Unfortunately, the use of such data structures for indexing image content is not always appropriate. We frequently represent the content of an image collection using more than one feature. Because the descriptors of these features are often highly dimensional, the improvement in response time realized through the use of spatial data structures does not always justify their use. Spatial data structures generally perform no better than brute-force algorithms for finding nearest neighbors in highly dimensional spaces (Indyk 2004).

2.2.2 Inexact representations

Dimensionality reduction techniques, especially when they are used in combination with spatial data structures, can effectively reduce the cost associated with maintaining multidimensional data. By representing visual descriptors with fewer attributes, the response time incurred by spatial data structures can be significantly improved. Examples of dimensionality reduction techniques include principal component analysis (Ng and Sedighian 1996), singular value decomposition (Pham et al. 2007), self-organizing maps (Kohonen 2001), multidimensional scaling (Beatty and Manjunath 1997), and locality-sensitive hashing (Indyk and Motwani 1998).

However, inexactly representing the visual descriptors of a collection of images forces a trade-off between retrieval precision and efficiency. Because descriptors must be transformed to enable their efficient storage and retrieval, we can no longer rank images according to their exact similarity with a query. Instead, we must rely on an approximate similarity, which may result in a significant reduction in retrieval precision. Moreover, as we increase the number of descriptors with which we represent images, or the dimensionality of these descriptors, we can expect the precision of an approximate similarity search to worsen. The lower-dimensional approximation becomes increasingly inexact as we increase the descriptiveness of the original representation.

2.2.3 Bag of visual words representations

Of the existing techniques for reducing the response time of brute-force CBIR, the use of “visual words” (e.g., Yang et al. 2007) is most similar to GFM’s processing of content-based image features. Using the bag of visual words (BVW) approach, an image is first segmented into a set of regions, commonly by overlaying a regular grid onto the image. Alternatively, an interest point detector can be used to detect salient local patches within an image (Nowak et al. 2006). Widely used interest point detectors include the Harris affine region detector (Harris and Stephens 1988), Lowe’s difference of Gaussians detector (Lowe 2004), and the Kadir–Brady saliency detector (Kadir and Brady 2001). Once an image has been segmented into regions, local features, especially SIFT (Lowe 1999) features, are extracted from each region, and these features are mapped to visual words. An image is represented as a collection of the words assigned to its constituent regions, which is a description of the image that can be efficiently maintained in an inverted file index.

Having been inspired by the success of text-based retrieval, numerous content-based retrieval strategies have been proposed that implement the BVW approach. A well-known example is the Video Google system (Sivic and Zisserman 2003), which utilizes the BVW strategy and inverted file indices to efficiently retrieve all occurrences of user-outlined objects in videos. Another example demonstrating the use of text retrieval models for performing content-based retrieval is the Mirror DBMS (de Vries 1999), which generates visual words for independent local feature spaces and then applies a retrieval model based on the INQUERY system (Callan et al. 1992). de Vries and Westerveld (2004) describe a similar content-based retrieval system based on the language modeling approach of information retrieval. Finally, the Viper project (Squire et al. 2000) demonstrated that inverted file indices permit the use of extremely high-dimensional feature spaces for performing content-based image retrieval. The MedGIFT (Müller et al. 2003) system is a more recent incarnation of this work that has been adapted to the biomedical domain.

The difference between BVW representations and the content-based feature processing of GFM can be summarized as follows. BVW representations decompose images spatially into patches, representing the local features extracted from each patch with a word. Conversely, GFM decomposes images conceptually into complementary “views” of the images’ content according to various global features. Each view is then represented as a set of words. The two models are orthogonal, with the former mapping local patches within an image to words and the latter mapping global views of an image to words.

BVW approaches provide a convenient representation for region-based computer vision tasks where the spatial orientation of an image’s local features is not an essential consideration. For example, BVW models are commonly used for categorizing the objects within images. However, for retrieval tasks in which the overall appearance of images is important, BVW models often do not perform well in isolation, and the use of global features can improve performance. Within the biomedical domain, images of a particular medical imaging modality commonly exhibit a similar global appearance. For example, physicians usually acquire chest X-rays using standard medical imaging equipment with patients oriented in the same direction. A result of this uniform examination procedure is that chest X-ray images can be distinguished from other medical imaging modalities using global features, such as color and texture. For multimodal retrieval tasks within the biomedical domain, it is often not necessary to consider the local features of images within a particular modality. Detecting nodules within a chest X-ray query, for example, is not needed if the query text already mentions the concept “tuberculoma,” a pulmonary nodule found in patients having tuberculosis. For such retrieval tasks, GFM is an appropriate technique for efficiently combining the exact representation of an image’s global content-based features with descriptive text so as to improve average retrieval precision.

2.3 Multimodal image retrieval

Multimodal image retrieval systems, the last of the three image retrieval methods we will discuss, represent images as a “fusion” of descriptive text and numeric feature vectors. Fusion can either be performed early in the analysis process by creating a unified data representation or late in the process, after each data type has been analyzed independently. GFM is an instance of early fusion because it combines an image’s text-based and content-based features into a single indexable representation. Retrieval strategies that filter or re-rank images retrieved using a text-based query based on their visual similarity with some example image are instances of late fusion. These methods perform retrieval separately for each modality and later merge the results into a single ranked list of images. Atrey et al. (2010) survey fusion methods that have been proposed for a variety of different data types.

Because multimodal image retrieval systems combine the aforementioned text-based and content-based image retrieval approaches, these systems inherit the strengths and weaknesses of each method. Advantages of multimodal retrieval include the ability to search for images both semantically and by visual similarity. A disadvantage is the inherent difficulty in determining an effective fusion strategy that simultaneously improves retrieval precision while remaining practical for use in real systems.

One of the most active areas of multimodal image retrieval research has been the biomedical domain. Although an exhaustive account of the multimodal biomedical image retrieval strategies is not feasible, a popular topic has been the retrieval of images from biomedical articles. An appropriate starting point for surveying this work is Müller et al.’s (2010a) retrospective of the ImageCLEF evaluations. This volume describes the various ImageCLEF tracks and the evolution of strategies used by the ImageCLEF participants. A recurring theme—not only of the medical retrieval track, but of the other tracks as well—is the difficulty encountered by the participants in meaningfully combining text-based and content-based image features. The prototype implementation of GFM is based on our own past experiences, documented by Simpson et al. (2009, 2010, 2011, 2012a), at developing multimodal retrieval strategies for these evaluations.

While many multimodal fusion-based retrieval strategies have been proposed within the biomedical domain, we review the following as being representative methods that have also been evaluated on the ImageCLEF data sets. Kalpathy-Cramer and Hersh (2010) demonstrate an effective late fusion approach for improving the early precision of a medical image retrieval system. The method first assigns image modality labels (e.g., X-ray) to a collection of images based on their content-based features, and it then uses these labels to re-rank images retrieved using text-based queries. Clinchant et al. (2010) and Alpkocak et al. (2012) also describe techniques that use image modality to re-rank results obtained by a text-based image search. Whereas the above approaches perform late fusion using image modality, Demner-Fushman et al. (2009) describe a medical image retrieval system that first performs a text-based query to retrieve an initial set of images and then re-ranks the retrieved images according to their visual similarity with an example query image. Similarly, Gkoufas et al. (2011) perform brute-force CBIR to re-rank the one thousand highest ranked images retrieved using a text-based retrieval approach. Caicedo et al. (2010) utilize latent semantic kernels to construct combined text-based and content-based feature vectors, which they then use for performing a brute-force retrieval strategy. Finally, Rahman et al. (2010) describe a fusion-based query expansion method, and Zhou et al. (2010) evaluate the effectiveness of classical information fusion techniques for biomedical image retrieval.

GFM is distinct from the above multimodal approaches in several ways. First, whereas the above methods primarily rely on late fusion techniques to filter or re-rank the results of text-based retrieval, GFM is an early fusion approach. GFM’s combination of image code words with image-related text enables the creation of multimodal surrogate documents that are indexable by traditional text-based retrieval systems. The reuse of text-based systems for performing multimodal retrieval contributes to GFM’s low search latency and suggests that it is an appropriate technique for indexing large image collections. Second, our results demonstrate that GFM consistently achieves statistically significant improvements in retrieval precision over text-based approaches whereas existing methods show mixed results.

3 Global feature mapping

GFM is a practical solution for enabling the retrieval of biomedical images using both descriptive text and visual similarity. By mapping the exact representation of an image’s global content to code words, and then by combining these words with other image-related text, GFM creates a multimodal image representation that is efficiently indexed and retrieved using a traditional text-based information retrieval system. Reusing a text-based system for performing multimodal image retrieval ensures the efficiency of GFM and improves upon the average retrieval precision of existing methods.

Below, we detail GFM’s image indexing and retrieval process. The primary components of this process include (1) a method for generating a “codebook” of words with which to represent the global features extracted from a collection of images, (2) a method for assigning these code words to images, (3) a method for indexing the images’ assigned code words with their related descriptive text, and (4) a method for querying the resulting multimodal index. We follow this section with a discussion of the treatment of image code words in a traditional text-based information retrieval system.

3.1 Codebook generation

GFM’s codebook generation process defines a mapping of the global content of a collection of images to a set of indexable code words. Assume a collection of images \(\{I_1, I_2, \ldots, I_m\}\) and a set F of global content-based features, such as color and texture. Footnote 7 We represent each image in the collection as a set of numeric vectors extracted for each of the features:

$$I_j = \left\{ {\bf f}_{j}^{x} \colon x \in F \right\}$$
(2)

The vector \({\bf f}_{j}^{x}\) is a visual descriptor of feature x of image \(I_j\), 1 ≤ j ≤ m. For each of the global features, the codebook generation process clusters the corresponding vectors extracted from the images in the collection and then maps the resulting cluster centroids to unique code words. The complete mapping of cluster centroids to code words defines GFM’s codebook.

The codebook generation process proceeds as follows. First, GFM optionally partitions each vector \({\bf f}_{j}^{x}\) into p lower-dimensional vectors of equal dimensionality. \({\bf f}_{j}^{x}\) is written in terms of its constituent partitions as:

$${\bf f}_{j}^{x} = \left[\begin{array}{llll} {\bf f}_{1,j}^{x} &{\bf f}_{2,j}^{x} & \cdots & {\bf f}_{p,j}^{x} \end{array}\right]$$
(3)

where the row vector \({\bf f}_{l,j}^{x}\) is partition l of \({\bf f}_{j}^{x}\), 1 ≤ l ≤ p. The dimensionality of each of these lower-dimensional vectors is equal to 1/p times the dimensionality of the original descriptor. Thus, the lower-dimensional vector \({\bf f}_{l,j}^{x}\) contains the dimensions of the original vector that are within the range [lp − p + 1, lp]. GFM partitions descriptors when their dimensionality and the number of images in the collection combine to make clustering the vectors prohibitively expensive with available resources. Also, because partitioning increases the number of vectors representing each image, it increases the number of ways in which these images can differ, thereby improving the intracluster ranking of the retrieved images (Sect. 7.2.4).
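A minimal sketch of the partitioning step of Eq. (3), assuming a NumPy descriptor whose dimensionality is divisible by p; the feature name and sizes are illustrative.

```python
import numpy as np

def partition_descriptor(descriptor, p):
    """Split a global descriptor into p equal-length sub-vectors (Eq. 3)."""
    if descriptor.size % p != 0:
        raise ValueError("descriptor dimensionality must be divisible by p")
    return np.split(descriptor, p)

cld = np.random.rand(120)                 # hypothetical 120-D color layout descriptor
parts = partition_descriptor(cld, p=4)    # four 30-D sub-vectors
print([part.shape for part in parts])     # [(30,), (30,), (30,), (30,)]
```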

GFM’s use of lower-dimensional feature vectors is related to the notion of product quantization (Jégou et al. 2011) and the dimensionality reduction technique proposed by Ferhatosmanoglu et al. (2001) that partitions vectors after having transformed them using the Karhunen–Loève transform (KLT). However, unlike these methods, GFM does not partition visual descriptors so as to improve the performance of approximate nearest neighbor search. Instead, GFM partitions the vectors and maps them to code words as a practical means of combining visual and textual information in a form indexable by a traditional text-based retrieval system. We retrieve images not by approximating the Euclidean distance between their descriptors, but by relying upon the underlying text-based retrieval model. For example, the well-known vector space model computes the cosine similarity between tf–idf term vectors. Using GFM, each of these vectors is constructed from the combined term statistics of an image’s code words as well as its related text.

After GFM partitions the extracted visual descriptors, the clustering process begins. GFM requires a centroid-based algorithm, such as k-means (Lloyd 1982), to cluster each set of lower-dimensional vectors corresponding to a given partition and feature. Assume \(\{{\bf f}_1, {\bf f}_2, \ldots, {\bf f}_m\}_l^x\) is the set of lower-dimensional vectors representing partition l of the visual descriptors extracted from the collection of images for feature x. We denote the clustering of these vectors as \(\{C_1, C_2, \ldots, C_k\}_l^x\), where \(C_{i,l}^x\) is the set of vectors belonging to cluster i, 1 ≤ i ≤ k. We denote the centroid of cluster \(C_{i,l}^x\) as the vector \({\bf c}_{i,l}^x\).

Once the clustering process is complete, GFM generates the codebook. GFM stores in the codebook a mapping from each cluster centroid to a unique code word. Because each centroid \({\bf c}_{i,l}^x\) is uniquely identified by the feature x and the tuple (i, l), GFM combines these values to construct textual code words of the form “x:kipl” (the feature name followed by the cluster and partition indices). For example, having extracted the color layout descriptor (CLD) (Chang et al. 2001) for all images in the collection, GFM maps the centroid \({\bf c}_{1,2}^{\rm cld}\) to the text string “cld:k1p2”. In this way, each code word in GFM’s codebook is uniquely associated with a given feature, partition, and cluster. Although various other techniques can be envisioned for generating unique code words, GFM follows the aforementioned strategy to map images to sets of words.
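A minimal sketch of codebook generation for a single feature, assuming scikit-learn's k-means implementation; the feature name, k, and p are illustrative and not the authors' configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(partitioned_vectors, feature, k):
    """Cluster each partition's vectors and map the centroids to code words.

    `partitioned_vectors` maps a 1-based partition index l to an (m x d/p) array
    holding partition l of every image's descriptor for the given feature.
    """
    codebook = {}
    for l, vectors in partitioned_vectors.items():
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
        for i, centroid in enumerate(km.cluster_centers_, start=1):
            codebook[f"{feature}:k{i}p{l}"] = centroid
    return codebook

# Hypothetical collection: 1000 images, 120-D CLD descriptors split into p = 2 partitions.
descs = np.random.rand(1000, 120)
partitioned = {l + 1: part for l, part in enumerate(np.split(descs, 2, axis=1))}
codebook = build_codebook(partitioned, feature="cld", k=5)
print(sorted(codebook)[:4])   # e.g. ['cld:k1p1', 'cld:k1p2', 'cld:k2p1', 'cld:k2p2']
```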

The diagram shown in Fig. 2 summarizes GFM’s codebook generation process. Assume \(\{a, b, c\} \subseteq F, 1 \leq l \leq p\), and 1 ≤ i ≤ k. A collection of images \(\{I_1, I_2, \ldots, I_m\}\) first undergoes feature extraction (FE) to produce a set of visual descriptors \(\{{\bf f}_1, {\bf f}_2, \ldots, {\bf f}_m\}^{x}\) for a feature x. This set of descriptors then undergoes feature partitioning (FP) to produce p sets of lower-dimensional vectors. For a partition l, the set of lower-dimensional vectors \(\{{\bf f}_1, {\bf f}_2, \ldots, {\bf f}_m\}_l^{x}\) is grouped into k clusters, resulting in a set of centroids \(\{ {\bf c}_1, {\bf c}_2, \ldots, {\bf c}_k \}_l^{x}\). GFM stores in the codebook a mapping from each centroid \({\bf c}_{i,l}^x\) to a unique code word of the form “x:kipl”. In Fig. 2, the clustering process is represented with the k-means algorithm (KM), but any centroid-based clustering algorithm is sufficient for generating the codebook.

Fig. 2 Codebook generation. Visual descriptors representative of a set of global features are first extracted from a set of images via some feature extractor (FE). Then, the descriptors of each feature are partitioned via some feature partitioner (FP) to form several sets of lower-dimensional vectors. Finally, the sets of lower-dimensional vectors are clustered via the k-means algorithm (KM), and the resulting cluster centroids are mapped to unique textual code words in the codebook

3.2 Code word assignment

After generating and storing in the codebook unique words representative of the cluster centroids, GFM then assigns the words to each image in the collection. GFM assigns the code word “x:kipl” to all images whose partition l of the descriptor for feature x lies within the cluster whose centroid is \({\bf c}_{i,l}^x\). Specifically, the set of images to which GFM assigns this word is given by \(\{I_j \colon {\bf f}_{l,j}^{x} \in C^x_{i,l}\}\).

While it is useful to know the set of images to which GFM assigns a given code word, we often must consider the set of all code words assigned to a given image. Recall that the code word representing centroid \({\bf c}_{i,l}^x\) is defined by the feature x and the tuple (i, l). The set of defining tuples for all code words assigned to an image \(I_j\) for feature x is given by:

$$W_j^x = \left\{ \left( \mathop{\hbox{argmin}}\limits_{i} \Vert {\bf f}_{l,j}^{x} - {\bf c}_{i, l}^x \Vert, l \right) \colon 1 \leq l \leq p \right\}$$
(4)

\(W_j^x\) identifies the centroids that are nearest to each of the p lower-dimensional vectors representing feature x of image \(I_j\). The set of all code words representing \(I_j\) is then given by:

$$D_j^c = \left\{{``}x{\rm :k}i{\rm p}l{\text{''}} \colon x \in F \wedge (i, l) \in W_j^x\right\}$$
(5)

Intuitively, \(D_j^c\) can be thought of as a text document containing the code words for the content-based features extracted for image \(I_j\).
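A minimal sketch of the assignment in Eqs. (4) and (5) for one feature, assuming a small hypothetical codebook; in practice the codebook entries come from the clustering step described above.

```python
import numpy as np

# Hypothetical codebook for feature "cld" with k = 3 clusters and p = 2 partitions.
rng = np.random.default_rng(0)
codebook = {f"cld:k{i}p{l}": rng.random(60) for i in range(1, 4) for l in range(1, 3)}

def assign_code_words(partitions, codebook, feature):
    """Return D_j^c for one feature: the nearest code word for each partition l."""
    words = set()
    for l, vector in enumerate(partitions, start=1):
        # Restrict the nearest-centroid search to entries for this feature and partition.
        entries = {w: c for w, c in codebook.items()
                   if w.startswith(f"{feature}:") and w.endswith(f"p{l}")}
        nearest = min(entries, key=lambda w: np.linalg.norm(vector - entries[w]))
        words.add(nearest)
    return words

image_desc = rng.random(120)             # hypothetical 120-D descriptor for image I_j
partitions = np.split(image_desc, 2)     # p = 2 partitions of 60 dimensions each
print(assign_code_words(partitions, codebook, "cld"))   # e.g. {'cld:k2p1', 'cld:k3p2'}
```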

The diagram shown in Fig. 3 summarizes the code word assignment process of GFM for an image \(I_j\) and a single global feature x. Assume \(x \in F, 1 \leq l \leq p\), and 1 ≤ i ≤ k. The image first undergoes feature extraction to produce a visual descriptor for feature x. The extracted descriptor then undergoes feature partitioning (FP) to produce p lower-dimensional vectors \(\{{\bf f}_{1}, {\bf f}_{2}, \ldots, {\bf f}_{p}\}_j^{x}\). Feature extraction and partitioning are performed in the same manner as during the codebook generation process. The codebook entries for partition l of feature x are given by \(\{{\bf c}_1, {\bf c}_2, \ldots, {\bf c}_k\}_l^x\). For each lower-dimensional vector \({\bf f}_{l,j}^{x}\), the code word assignment process performs a nearest-neighbor search (NN) to select among these entries the cluster centroid to which the vector is nearest. Each selected centroid \({\bf c}_{i,l}^x\) is represented in the set \(W_j^x\) as a tuple (i, l). The textual representation of the tuples for all the content-based features extracted for an image \(I_j\) is then given by \(D_j^c\).

Fig. 3 Code word assignment. The visual descriptor representative of a global feature is extracted from a query image and partitioned via some feature partitioner (FP), forming several lower-dimensional vectors. For each lower-dimensional vector, a nearest-neighbor search (NN) is performed to select among the codebook entries the cluster centroid to which the vector is nearest. The query image is then represented by a set of tuples that uniquely define the selected centroids. Note that this diagram only depicts the code word assignment process for one of the features used for codebook generation

3.3 Multimodal image representation

Because images are seldom self-evident, they are frequently accompanied by text. In general, this text can provide meaning to the visual characteristics of the images and can place the images within a broader context. For example, the images found in biomedical articles are usually accompanied by descriptive captions, and their relevance is commonly discussed in passages within the articles’ full text that mention the images.

GFM provides efficient access to both the meaning of images as well as their visual characteristics by representing images as a combination of their text-based and content-based features. Assume GFM has completed the codebook generation process for a collection of images and a set of global content-based features. Furthermore, assume GFM has assigned each image in the collection one or more code words corresponding to each of the global features. To prepare the collection of images for indexing, GFM combines the code words assigned to the images with natural words taken from their related text. Footnote 8 Let \(D_j^t\) represent a document containing descriptive text related to an image \(I_j\). We define a multimodal surrogate document \(D_j\) for image \(I_j\) as the following:

$$D_j = D_j^t \cup D_j^c$$
(6)

After constructing multimodal surrogates for each image in the collection, we index the resulting documents with a traditional text-based information retrieval system.
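A minimal sketch of surrogate construction and indexing (Eq. 6), assuming the code words and caption terms are already available; the toy inverted index stands in for a production text-based retrieval system, and the identifiers and terms are illustrative.

```python
from collections import defaultdict

def build_surrogate(text_terms, code_words):
    """D_j is the union of the image's natural words (D_j^t) and code words (D_j^c)."""
    return set(text_terms) | set(code_words)

# Hypothetical images: caption terms plus the code words assigned by GFM.
surrogates = {
    "img-001": build_surrogate(["ct", "fatty", "liver"], ["cld:k1p1", "cedd:k7p1"]),
    "img-002": build_surrogate(["chest", "x-ray", "nodule"], ["cld:k4p1", "cedd:k2p1"]),
}

# Index every surrogate term in a simple inverted file.
inverted_index = defaultdict(set)
for image_id, terms in surrogates.items():
    for term in terms:
        inverted_index[term].add(image_id)

print(inverted_index["cld:k1p1"])   # {'img-001'}
```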

The image representation used by GFM is best understood with an example. Figure 4 shows a multimodal surrogate document created for an image in the 2010 ImageCLEF medical retrieval track data set. The image, taken from an article by Helbich et al. (1999), is a CT scan depicting morphologic abnormalities in a 9-year-old boy with cystic fibrosis. The image is represented by both its text-based and content-based features in a document that is indexable with a traditional text-based information retrieval system. The image’s text-based features include its caption, passages from the full text of the article that mention the image (i.e., Fig. 2 in this example), the title of the article, the article’s abstract, and the article’s assigned medical subject headings (MeSH\(^{\circledR}\) terms). Footnote 9 The image’s global content-based features are represented as code words derived from five visual descriptors, each of which is described in Sect. 6.1. Note that in this particular example, the descriptors have not been partitioned into lower-dimensional vectors (i.e., p = 1). Thus, the image is only assigned one code word for each of the five features.

Fig. 4 Multimodal image representation. The image from Fig. 2 of the article “Cystic fibrosis: CT assessment of lung involvement in children and adults” by Helbich et al. (1999) is shown represented by a combination of text-based and content-based features

3.4 Multimodal image retrieval

Having utilized a traditional text-based information retrieval system to index the collection of multimodal surrogate documents produced by GFM, we can efficiently retrieve images from the collection. Assume we would like to retrieve the most relevant images for a multimodal topic, such as the one shown in Fig. 1. In order to formulate a query for this retrieval task, we must first assign code words to the topic’s example images and process the textual description of the topic. Then, we can combine the images’ code words with natural words taken from the topic description and use this text to search the collection of multimodal image surrogates. We describe GFM’s query formulation process below.

As it does for images in the collection, GFM represents queries as multimodal surrogate documents containing both text-based and content-based features. For a given topic or information request, GFM processes the topic’s example images in the same way as it does images in the indexed collection, extracting visual descriptors for an identical set of global content-based features and partitioning these descriptors into the same number of lower-dimensional vectors. GFM assigns words to a query image \(I_q\) following the code word assignment process, which produces a set of code words \(D_q^c\). GFM combines \(D_q^c\) with a set of natural words \(D_q^t\) taken from the topic’s textual description Footnote 10 in order to form a multimodal query \(D_q\). Finally, we submit \(D_q\) as a query to the text-based information retrieval system we used to index the collection of multimodal surrogates and retrieve a set of images.
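A minimal sketch of query formulation, assuming the example image's code words have already been assigned; the query-string syntax is generic and the terms are illustrative.

```python
def build_multimodal_query(topic_text, example_code_words):
    """Combine the topic's natural words (D_q^t) with the example image's code words (D_q^c)."""
    text_terms = topic_text.lower().split()
    # Group the code words disjunctively so that any visually similar image can match.
    visual_clause = " OR ".join(sorted(example_code_words))
    return " ".join(text_terms) + (f" ({visual_clause})" if visual_clause else "")

topic = "CT images containing a fatty liver"
code_words = {"cld:k1p1", "cedd:k7p1", "fcth:k3p1"}   # hypothetical assignments
print(build_multimodal_query(topic, code_words))
# ct images containing a fatty liver (cedd:k7p1 OR cld:k1p1 OR fcth:k3p1)
```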

3.5 Relation with semantic image annotation

Much recent work in the CBIR community has dealt with bridging the so-called “semantic gap” between an image’s content and its meaning. The idea is that by automatically labeling an image or the interesting regions of an image with semantically meaningful concepts, such concepts could then be leveraged in order to retrieve conceptually similar images. Our assumption in this work has been that semantic descriptions of the images in our collections are already accessible: the images found in biomedical articles are surrounded by meaningful text (e.g., their captions), and we are using meaningful text (e.g., topic descriptions) as the primary means of retrieving them. Thus, the semantic gap associated with our collection is narrow if it exists at all, and bridging it is not a problem that GFM attempts to solve. Although the code words GFM assigns to images can be thought of as annotations, they convey no obvious meaning beyond cluster membership. Even so, our experiments have shown that, for our data sets, we can improve the retrieval of relevant images by incorporating these image code words into a text-based retrieval process. However, it is beneficial for us to briefly survey some representative work related to semantic annotation.

Many approaches to image annotation attempt to create joint probabilistic models of text-based and content-based features. Typical of these approaches, image-related text is represented as a bag of words, and image content is represented as “blobs,” which are quantized content-based feature vectors extracted from important image regions. Conceptually, blobs are similar to GFM’s code words, but whereas GFM may represent the global content of a single image with several code words, a single blob represents the local content of one region. The goal of an annotation model is then to learn joint word-blob probabilities from a collection of images and their associated text. Perhaps inspired by techniques from natural language processing, Duygulu et al. (2006) formulate the modeling problem as an instance of machine translation, and Lavrenko et al. (2003) apply the language modeling framework of information retrieval to learn the semantics of images. Barnard et al. (2003) investigate various correspondence models as well as a multimodal extension of Latent Dirichlet allocation (LDA). Blei and Jordan (2003) also propose the use of LDA for modeling associations between words and images. Finally, though not directly related to annotation, Rasiwasia et al. (2010) model the correlations between images’ content-based features and their related text in support of cross-modal retrieval. The authors demonstrate that their cross-modal model can outperform systems when evaluated on unimodal retrieval tasks.

Instead of modeling the associations between text-based and content-based image features, semantic annotation can also be achieved using supervised machine learning techniques. Datta et al. (2007) describe a structure-composition model for categorizing image regions. The authors annotate images with tags corresponding to recognized regions and use the annotations for retrieving semantically similar images. They use a bag of words distance measure based on WordNet (Miller 1995) for computing semantic similarity. Li and Wang (2008) present ALIPR (automatic linguistic indexing of pictures—real time), a real time image annotator that uses hidden Markov models to capture the spatial dependencies of content-based features associated with a given set of semantic categories. A related approach is described by Chang et al. (2003), who use Bayes point machines (Herbrich et al. 2001) to assign “soft” annotations to images based on category confidence measures estimated from a training set of labeled images.

Within the biomedical domain, region classification has been a popular approach for improving image retrieval. Lacoste et al. (2007) index images using a combination of UMLS concepts extracted from image-related text and VisMed (Lim and Chevallet 2005) terms derived from image content. VisMed terms are semantic labels generated by classifying the appearance of image regions. The authors demonstrate that a multimodal fusion approach that utilizes VisMed terms is capable of outperforming systems evaluated on the 2005 ImageCLEF medical retrieval track data set. However, unlike GFM’s unsupervised method of generating code words, semantic annotation using VisMed terms is an instance of supervised learning and requires a sufficient set of training data from which to derive the terms. Additionally, Simpson et al. (2012b) discuss the creation of a “visual ontology” of biomedical imaging entities using supervised learning. The authors utilize natural language and image processing techniques to automatically create a training set of annotated image regions by pairing the visible arrows in images with the caption text describing their pointed-to regions. They then use this data set to train a classifier to label regions in images having no associated text. This approach has yet to be evaluated for its use in improving medical image retrieval.

Finally, Wang et al. (2008) discuss a search-based approach to image annotation. To annotate an image, the method first performs a content-based search to retrieve visually similar images, and it then uses text related to the retrieved images to form a list of candidate annotations for the original.

4 Images as words

When represented as code words, images become subject to the underlying models used by traditional text-based information retrieval systems. While this may not seem immediately desirable, the well-understood concepts of text-based retrieval are easily adapted for use with image code words, and they prove to be beneficial for improving upon the retrieval performance and efficiency of existing content-based and multimodal image retrieval systems. Below, we discuss how common text-based retrieval techniques—namely, query expansion and relevance ranking—operate when we represent images as words.

4.1 Code word expansion

Text-based retrieval systems often perform query expansion in an effort to improve retrieval performance. A commonly used technique involves expanding a query to include the synonyms and morphological variants of existing terms. Text-based query expansion methods, however, are not directly applicable to image code words because, as they are not natural words, they do not have conventional synonyms or variants. Instead, we define the relatedness of two code words as the distance between their representative cluster centroids, and we expand a query of code words to include those corresponding to nearby centroids.

A problem with traditional centroid-based clustering algorithms is the requirement that each element belongs to exactly one cluster. This restriction is unfortunate for GFM because it implies that images that may be similar in appearance can be assigned different code words. Consider a Voronoi diagram representing the clustering of the visual descriptors extracted from a collection of images for a particular global feature. Descriptors lying close to and on either side of the boundary between two adjacent cells are more similar to each other than either one is to its respective cell center. Thus, because GFM assigns different code words to images whose descriptors lie within different cells, it may not retrieve the most visually similar set of images to a given query image if the query image’s descriptor lies far away from a cell’s center.

The goal of code word expansion is to minimize the negative impact rigid cluster membership has on retrieval performance. Similar to fuzzy cluster analysis (Bezdek et al. 1999), code word expansion allows GFM to assign more than one code word to a query image for a given feature and partition. Recall that the set \(W_q^x\) contains all tuples (i, l) that define the code words GFM assigns to a query image \(I_q\) for feature x. For each partition l of feature x, \(W_q^x\) identifies the single centroid \({\bf c}_{i,l}^x\) to which \({\bf f}_{l,q}^{x}\) is closest. In order to expand a code word query, we parameterize \(W_q^x\) with a code word expansion factor \(\varepsilon\). \(W_q^x(\varepsilon)\) identifies the \(\varepsilon\) nearest cluster centroids to each \({\bf f}_{l,q}^{x}\) and is defined by:

$$W_q^x(\varepsilon) = \left\{\left(e_i, l\right) \colon 1 \leq i \leq \varepsilon \wedge 1 \leq l \leq p \wedge e_i \in E_{l,q}^x\right\}$$
(7)

\(E_{l,q}^x\) is a set of identifiers corresponding to the cluster centroids after they have been sorted in order of increasing distance from \({\bf f}_{l,q}^x\):

$$E_{l,q}^x = \{e_1, e_2, \ldots, e_k\}\quad \hbox{ordered by}\quad\lVert{\bf f}_{l,q}^x-{\bf c}_{e_n,l}^x\rVert\leq\lVert{\bf f}_{l,q}^x -{\bf c}_{e_{n+1},l}^x\rVert$$
(8)

Thus, the set \(W_q^x(\varepsilon)\) contains all tuples (i, l) representative of the \(\varepsilon\) nearest centroids to each \({\bf f}_{l,q}^{x}\) for 1 ≤ l ≤ p. Similarly, \(D_q^c(\varepsilon)\) contains the actual expanded set of code words for a query image \(I_q\). The code words assigned to an image are subject to the term weighting strategy of the underlying text-based retrieval model. However, because many retrieval systems allow terms to be weighted manually, a weighting strategy that allocates less weight to the expanded code words could be devised to simulate the probabilistic cluster membership obtainable with fuzzy clustering techniques.
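A minimal sketch of code word expansion (Eqs. 7 and 8) for one feature and partition, assuming a small hypothetical codebook; with ε = 1 it reduces to the standard assignment.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical codebook entries for feature "cld", partition 1: code word -> centroid.
codebook = {f"cld:k{i}p1": rng.random(60) for i in range(1, 9)}

def expand_code_words(vector, codebook, epsilon):
    """Return the code words of the epsilon centroids nearest to the query vector."""
    ranked = sorted(codebook, key=lambda w: np.linalg.norm(vector - codebook[w]))
    return ranked[:epsilon]

query_partition = rng.random(60)
print(expand_code_words(query_partition, codebook, epsilon=1))   # nearest centroid only
print(expand_code_words(query_partition, codebook, epsilon=3))   # expanded code word query
```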

4.2 Image similarity

The most apparent consequence of using a traditional text-based retrieval system to index images is that retrieved images are ranked according to some text-based similarity measure. Whereas existing content-based and multimodal image retrieval systems commonly rank images by the Euclidean distance between their extracted visual descriptors, this ranking is not directly possible when we instead represent images as words. Modern text-based retrieval systems implement a variety of set-theoretic, algebraic, and probabilistic retrieval models. Though not always the best-performing approaches, many text-based systems, such as Apache Lucene, Footnote 11 implement a combination of the Boolean and vector space models, especially variants of these models that utilize tf–idf term weighting. Below, we briefly discuss the treatment of image code words within these well-known models.

4.2.1 Boolean model

The Boolean model (Lancaster and Fayen 1973) was one of the first and most widely adopted information retrieval strategies, and many modern retrieval systems provide a mechanism for constructing queries that utilize standard Boolean operators. If we assume query images to be the disjunction of their code words, then the set of images retrieved by the model for a query image I q is given by \(\{I_j \colon D_j^c \cap D_q^c(\varepsilon) \neq \emptyset \}\). Thus, the model retrieves all images from the collection that are represented by a code word contained in the set of expanded code words GFM assigns to the query image. Alternatively, if we assume query images to be the conjunction of their code words, the set of images retrieved by the model is given by \(\{I_j \colon D_j^c \subseteq D_q^c(\varepsilon) \}\).

The use of Boolean operators is especially useful for creating queries for topics containing more than one example image, such as the one shown in Fig. 1. For such topics, we might like to retrieve all images that are visually similar to at least one of the example images, or we might, instead, prefer to retrieve images that are similar to all of the examples. We can construct result sets for complex image queries by first retrieving a set of images for each example image according to either the disjunctive or conjunctive query formulation strategy and then applying the Boolean operators to the retrieved sets of images.
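A minimal sketch of these Boolean formulations for topics with several example images, assuming a generic Boolean query-string syntax; the code words are hypothetical.

```python
def boolean_image_query(images_code_words, inner="OR", outer="OR"):
    """Build a Boolean query over the code words of several example images.

    `inner` combines the code words of a single example image (disjunctive or
    conjunctive), and `outer` combines the per-image clauses.
    """
    clauses = []
    for words in images_code_words:
        clauses.append("(" + (" " + inner + " ").join(sorted(words)) + ")")
    return (" " + outer + " ").join(clauses)

examples = [{"cld:k1p1", "cedd:k7p1"}, {"cld:k4p1", "cedd:k7p1"}]
print(boolean_image_query(examples, inner="OR", outer="OR"))
# (cedd:k7p1 OR cld:k1p1) OR (cedd:k7p1 OR cld:k4p1)
print(boolean_image_query(examples, inner="AND", outer="AND"))
# (cedd:k7p1 AND cld:k1p1) AND (cedd:k7p1 AND cld:k4p1)
```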

4.2.2 Vector space model

The vector space model (Salton et al. 1975) is a well-known algebraic model of information retrieval where documents and queries are represented as term vectors. We can construct code word vectors for images following the classical formulation. \(D_j^c\), the code words GFM assigns to an image \(I_j\), is represented as a set of term vectors \(\{{\bf v}_j^x \colon x \in F\}\), where each \({\bf v}_j^x\) corresponds to the codebook entries for feature x. Furthermore, each \({\bf v}_j^x\) is partitioned into p lower-dimensional term vectors:

$${\bf v}_j^x = \left[\begin{array}{llll} {\bf v}_{1,j}^x & {\bf v}_{2,j}^x & \ldots & {\bf v}_{p,j}^x \end{array}\right]$$
(9)

Each \({\bf v}_{l,j}^x\) corresponds only to those codebook entries of feature x that are defined for partition l. The attributes of these lower-dimensional vectors are weights corresponding to the codebook entries they represent:

$${\bf v}_{l,j}^x = \left[\begin{array}{llll} w_{1,l,j}^x& w_{2,l,j}^x & \cdots & w_{k,l,j}^x \end{array}\right]$$
(10)

Vector space retrieval systems commonly implement the tf–idf term weighting strategy. Because GFM only assigns images one word per feature and partition combination, the term frequency of each codebook entry contained in document \(D_j^c\) is equal to one. Thus, code words are weighted by their inverse document frequency, which is defined by:

$$\begin{aligned} w_{i,l,j}^x = \left\{\begin{array}{ll} \log\frac{m}{\lvert C_{i,l}^x \rvert} & \hbox{if } (i, l) \in W_j^x(\varepsilon) \\ 0 & \hbox{otherwise} \end{array}\right. \end{aligned}$$
(11)

The inverse document frequency of a given code word is related to the number of images whose visual descriptors are members of the cluster the code word represents. Because code words are uniquely defined by feature x and tuple (i, l), this number is equal to \(\lvert C_{i,l}^x \rvert\). The tf–idf weighting strategy implies that code words representing clusters having few members are weighted more heavily than those representing clusters with many members. Thus, the retrieval system favors images that are more unique within the collection.

Code word expansion can either be performed when indexing images or when mapping query images to their associated code words; it is not necessary to perform code word expansion during both the indexing and retrieval steps. By convention, we perform code word expansion during the retrieval process. Thus, we let \(\varepsilon = 1\) for the m images in the collection and \(\varepsilon \geq 1\) for query images.

Having defined the term vectors used by the vector space model and the weights assigned to each code word, we can compute the similarity between two images. The similarity between a query image I q and an image I j from the collection is given by the average cosine similarity between the code word vectors representing I q and I j for all features. For a given query image, the retrieval system computes its similarity with each image in the collection and then ranks the collection of images accordingly.
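A minimal sketch of this scoring scheme, assuming binary term frequencies and the idf weights of Eq. (11), with the per-feature cosine similarities averaged as described above; the vocabulary, cluster sizes, and code word assignments are illustrative.

```python
import math
import numpy as np

def idf_vector(assigned_words, vocabulary, cluster_sizes, m):
    """Binary-tf, idf-weighted code word vector over one feature's vocabulary (Eq. 11)."""
    return np.array([math.log(m / cluster_sizes[w]) if w in assigned_words else 0.0
                     for w in vocabulary])

def average_cosine(query_words, image_words, vocab_by_feature, cluster_sizes, m):
    """Average the per-feature cosine similarities between query and image vectors."""
    sims = []
    for vocab in vocab_by_feature.values():
        q = idf_vector(query_words, vocab, cluster_sizes, m)
        d = idf_vector(image_words, vocab, cluster_sizes, m)
        denom = np.linalg.norm(q) * np.linalg.norm(d)
        sims.append(q.dot(d) / denom if denom else 0.0)
    return sum(sims) / len(sims)

# Hypothetical setup: two features, k = 3 clusters each, p = 1, m = 100 images.
vocab_by_feature = {"cld": [f"cld:k{i}p1" for i in range(1, 4)],
                    "cedd": [f"cedd:k{i}p1" for i in range(1, 4)]}
cluster_sizes = dict(zip(sum(vocab_by_feature.values(), []), [10, 40, 50, 25, 25, 50]))
query = {"cld:k1p1", "cedd:k2p1"}
image = {"cld:k1p1", "cedd:k3p1"}
print(average_cosine(query, image, vocab_by_feature, cluster_sizes, m=100))   # 0.5
```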

5 Data and evaluation

The medical retrieval track of ImageCLEF has been an important catalyst for advancing the science of image retrieval within the biomedical domain (Hersh et al. 2009). For the ad hoc retrieval task, participants are provided with a set of topics, and they are challenged with retrieving for each topic the most relevant images from a collection of biomedical articles. As was shown in Fig. 1, each topic is multimodal, consisting of a textual description of some information need as well as one or more example images. Although the best-performing systems at ImageCLEF evaluations have historically relied upon text-based retrieval methods, recent systems have shown encouraging progress towards combining these methods with content-based approaches, especially since the introduction of an image modality classification task (Müller et al. 2010b). For the classification task, the goal is to classify images according to medical imaging modalities such as “Computerized Tomography” or “X-ray.”

We chose the 2010 and 2012 ImageCLEF medical retrieval track data sets for the evaluation of GFM. The 2010 collection contains 77,479 images taken from a subset of the articles appearing in the Radiology and Radiographics journals, and the 2012 collection contains 306,539 images taken from a portion of the articles in the open access subset of PubMed Central. Each image is associated with its caption and the title, identifier, and URL of the article in which it appears. The 2010 collection identifies each article by its PubMed identifier (PMID) whereas the 2012 collection uses its PubMed Central identifier (PMCID). There are sixteen multimodal ad hoc topics in the 2010 collection and twenty-two topics in the 2012 data set. The organizers of the ImageCLEF evaluations categorize the topics in roughly equal proportions as being “Visual,” “Mixed,” or “Semantic” according to their expected benefit from content-based or text-based retrieval techniques.

Following the TREC evaluation methodology (Voorhees and Harman 2005), the highest ranked images retrieved by each ImageCLEF participant for a given topic were pooled and manually judged as either being relevant to the topic or not relevant. Using these judgements, we report a system’s performance for a topic as binary preference (bpref), judged mean average precision (MAP′), and judged precision-at-ten (P′@10) over the one thousand highest-ranked images. Although they are highly correlated, because bpref and MAP′ do not always agree, we assume these metrics to be complementary and use them both as an indicator of average system performance. However, Sakai (2007) has determined that average precision, when computed on only the images having relevance judgements, is at least as robust to incomplete judgements as bpref but more discriminative. To measure the statistical significance of differences between the average performance of two or more systems, we applied Fisher’s two-sided, paired randomization test (Smucker et al. 2007), which is a recommended statistical test for evaluating information retrieval systems.
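A minimal sketch of a paired randomization test of this kind, assuming per-topic scores for two systems; the scores and the number of permutations are illustrative. The test randomly flips the sign of each per-topic difference to estimate a two-sided p value.

```python
import numpy as np

def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization test on per-topic score differences."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    observed = abs(diffs.mean())
    # Randomly flip the sign of each per-topic difference and recompute the mean.
    signs = rng.choice([-1.0, 1.0], size=(trials, diffs.size))
    permuted = np.abs((signs * diffs).mean(axis=1))
    return float((permuted >= observed).mean())

# Hypothetical per-topic MAP' scores for two systems on a 16-topic collection.
system_a = [0.41, 0.35, 0.52, 0.28, 0.61, 0.44, 0.39, 0.47,
            0.33, 0.55, 0.29, 0.48, 0.51, 0.36, 0.42, 0.58]
system_b = [0.38, 0.33, 0.50, 0.27, 0.57, 0.40, 0.38, 0.45,
            0.31, 0.50, 0.28, 0.44, 0.49, 0.33, 0.40, 0.55]
print(f"p = {paired_randomization_test(system_a, system_b):.4f}")
```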

We evaluate the efficiency of each retrieval system by measuring the time in milliseconds needed to produce a ranked list of results for each topic. To conduct the experiments, we organized the retrieval systems in a client/server architecture networked via a Gigabit Ethernet connection. The GNU/Linux server had two Intel Xeon 5160 processors (2 cores, 3 GHz, 4 MB L2 cache) and 10 GB of memory. The Microsoft Windows XP client had a single Intel Xeon W3520 processor (4 cores, 2.66 GHz, 8 MB L3 cache) and 3 GB of memory.

6 Implementation

To demonstrate its practicality, we have implemented GFM within two text-based information retrieval frameworks. The first, Apache Lucene, is a general-purpose vector space system widely recognized for its ease of use and reasonable performance. The second, Essie (Ide et al. 2007), is a biomedical retrieval system developed by the U.S. National Library of Medicine. Essie scores documents using a probabilistic retrieval model and automatically expands query terms along the synonymy relationships in the UMLS. The retrieval models of both Lucene and Essie support queries that utilize standard Boolean operators. We evaluate our Lucene implementation of GFM on the 2010 ImageCLEF collection and our Essie implementation on the 2012 data set. Because the medical retrieval track of ImageCLEF is a domain-specific retrieval task, we have also implemented UMLS synonymy expansion for Lucene. Both retrieval systems allow documents to be composed of multiple fields and provide low-latency access to documents using inverted file indices. Below we describe how we represent images as multi-field documents indexable by these two systems.

6.1 Image representation

We represent the images in the ImageCLEF collections using a combination of text-based and content-based features. Our text-based features include an image’s caption and mentions as well as the title, abstract, and MeSH terms of the article in which it is contained. The ImageCLEF data sets provide image captions and article titles. To obtain each article’s abstract and MeSH terms, we utilize its associated PMID/PMCID with the Entrez programming utilities (NCBI 2010) to retrieve MEDLINE citations containing the required elements. To obtain image mentions, we extract passages that refer to the images from the full text articles retrieved using the provided URLs. We identify image mentions using regular expression patterns that match image labels. For example, if an image’s caption identifies it as “Fig. 2a,” we extract sentences that contain variants of this label.
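A minimal sketch of this mention extraction step is shown below; it assumes a simple sentence splitter and covers only a reduced set of label variants, so it approximates rather than reproduces the full pattern set.

```python
import re

def mention_pattern(figure_number, panel=""):
    """Regular expression matching common variants of an image label,
    e.g. 'Fig. 2a', 'Figure 2a', or 'Fig 2(a)'.  The variants covered
    here are illustrative only."""
    panel_part = rf"\s*\(?{panel}\)?" if panel else ""
    return re.compile(rf"\bFig(?:ure)?\.?\s*{figure_number}{panel_part}",
                      re.IGNORECASE)

def extract_mentions(article_text, figure_number, panel=""):
    """Return the sentences of an article that mention the labelled image."""
    # Naive sentence splitter that avoids breaking after the "Fig." abbreviation.
    sentences = re.split(r"(?<!Fig\.)(?<=[.!?])\s+", article_text)
    pattern = mention_pattern(figure_number, panel)
    return [s for s in sentences if pattern.search(s)]

text = ("Hepatic attenuation is markedly decreased in Fig. 2a. "
        "Figure 2b shows the corresponding MR image. "
        "No focal lesion was identified.")
print(extract_mentions(text, 2, "a"))   # -> the first sentence only
```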

Our content-based features primarily describe color and texture information, and they include the descriptors listed in Table 1. We used the “Core” features with our Lucene implementation and both the “Core” and “Additional” features with our Essie implementation. Although we recognize that no single combination of features is adequate for describing the content of all images, a detailed analysis of the strengths and weaknesses of these particular sets is beyond the scope of our current evaluation. However, note that the dimensionality of many of the features is prohibitively large for maintaining them in spatial data structures. To efficiently extract these features, we utilized the MapReduce framework on an eight-node Apache Hadoop Footnote 12 cluster. For convenience, we extracted the features for both collection and topic images offline, prior to performing our indexing and retrieval experiments. However, given the extracted features for a topic image, our GFM implementations compute the associated code words online, and this computation time is accounted for in our results.

Table 1 Content-based features used for our global feature mapping implementations

Once the content-based features have been extracted, our GFM implementations cluster them using the k-means++ algorithm (Arthur and Vassilvitskii 2007), which we chose for its simplicity, efficiency, and accuracy. Additionally, because k-means++ uses Euclidean distance as its clustering metric, GFM’s retrieval results can be compared with those obtained by other retrieval systems using Euclidean distance without the need for considering potential differences in similarity metrics. However, GFM’s indexing and retrieval method is compatible with other centroid-based clustering techniques, and we have experimented with some of these algorithms, such as hierarchical k-means. In addition to k-means and its variants, many other clustering techniques have been proposed for image retrieval tasks. Datta et al. (2008) survey the strengths and weaknesses of several popular algorithms.

We experimentally determined reasonable values for the number of partitions and clusters for each feature based on preliminary observations. Due to the computational complexity associated with clustering the higher-dimensional feature vectors we used for the 2012 ImageCLEF collection, we let the maximum number of feature partitions equal six (p = 6) for this data set, whereas we let the number of partitions equal two (p = 2) for the 2010 ImageCLEF collection. To ensure the scalability of our method, we let the number of clusters for each partition be logarithmic in the total number of images. Thus, the number of clusters k for each of the p partitions of a feature is given by:

$$k = \left\lceil \frac{d}{p} \times \log m \right\rceil$$
(12)

where d is the dimensionality of a content-based feature shown in Table 1, and m is the total number of images in the collection.
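For example, the following sketch evaluates Eq. (12) for a hypothetical 64-dimensional feature split into two partitions over the 77,479 images of the 2010 collection; because the base of the logarithm is not specified above, the sketch assumes the natural logarithm.

```python
import math

def clusters_per_partition(d, p, m, log=math.log):
    """Number of k-means++ clusters for one partition of a content-based feature.

    d: dimensionality of the feature
    p: number of feature partitions
    m: total number of images in the collection
    log: the base of the logarithm is an assumption (natural log by default)."""
    return math.ceil((d / p) * log(m))

# Hypothetical 64-dimensional feature, two partitions, 2010 collection size.
print(clusters_per_partition(64, 2, 77_479))   # -> 361
```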

We include the images’ text-based features and the code words corresponding to their content-based features as unique fields in multi-field text documents. In this way, each image in the ImageCLEF collections is represented as a surrogate document indexable by a typical text-based retrieval system. Figure 4 shows an example multimodal image document.
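The sketch below illustrates the general shape of such a surrogate document; the field names, code word naming scheme, and text values are hypothetical stand-ins rather than a reproduction of Fig. 4.

```python
# A schematic surrogate image document (all names and values are hypothetical).
surrogate_document = {
    # Text-based fields drawn from the article and its MEDLINE citation.
    "caption":  "Fig. 2a Axial CT image showing diffuse fatty infiltration of the liver.",
    "mentions": "As shown in Fig. 2a, attenuation of the liver is markedly decreased.",
    "title":    "Imaging of hepatic steatosis",
    "abstract": "...",
    "mesh":     "Fatty Liver; Tomography, X-Ray Computed",
    # Content-based fields: one code word per feature partition, e.g.
    # "f0p1c042" = feature 0, partition 1, cluster 42.
    "color_codewords":   "f0p0c017 f0p1c042",
    "texture_codewords": "f1p0c231 f1p1c008",
}
print(surrogate_document["color_codewords"])
```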

6.2 Image retrieval

Because GFM is a multimodal image retrieval method, our Lucene and Essie implementations support three distinct retrieval paradigms. In addition to multimodal retrieval, these search strategies also include text-based and content-based approaches. We present implementation details related to each of these uses of GFM in the remainder of this section.

Because one of the primary objectives of our current work is to demonstrate that visual information can be used to improve upon a competitive text-based approach, it is important that our textual baseline be a state-of-the-art retrieval method. For automatically generating queries, our textual baseline first organizes the textual description of a topic into the well-formed clinical question (i.e., PICO Footnote 13) framework (Richardson et al. 1995) following the method described by Demner-Fushman and Lin (2007). Accordingly, it extracts from the topic UMLS concepts related to problems, interventions, age, anatomy, drugs, and image modality. In addition to automatically expanding these extracted concepts using the UMLS synonymy, it also expands identified modalities using a thesaurus manually constructed by Demner-Fushman et al. (2008) based on the RadLex (Langlotz 2006) ontology. Footnote 14 Our textual baseline then constructs a disjunctive query consisting of all the expanded terms. To ensure the early precision of our retrieval results, the textual baseline weights term occurrences in image captions and article titles more heavily than occurrences in other text-based fields. It also requires that any modality terms identified in the query occur in a retrieved image’s caption or mentions. Finally, in order to improve recall, our textual baseline pads the initially retrieved results with images retrieved using the verbatim topic description as the query.
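The sketch below conveys the general flavor of the queries the textual baseline generates; the expansion helpers, field names, boost values, and query syntax are illustrative assumptions and do not reproduce the exact behavior of our Lucene or Essie implementations.

```python
def expand_umls(concept):
    """Hypothetical stand-in for UMLS synonymy expansion."""
    return {"fatty liver": ["fatty liver", "hepatic steatosis"]}.get(concept, [concept])

def expand_modality(modality):
    """Hypothetical stand-in for the RadLex-based modality thesaurus."""
    return {"ct": ["ct", "computed tomography"]}.get(modality, [modality])

def build_text_query(problem_concepts, modalities):
    """Disjunction of expanded concepts, with caption/title occurrences boosted
    and modality terms required to occur in the caption or mentions fields."""
    optional = []
    for concept in problem_concepts:
        for synonym in expand_umls(concept):
            optional.append(f'caption:"{synonym}"^3 OR title:"{synonym}"^2 '
                            f'OR abstract:"{synonym}"')
    required = []
    for modality in modalities:
        variants = " OR ".join(f'caption:"{v}" OR mentions:"{v}"'
                               for v in expand_modality(modality))
        required.append(f"({variants})")
    query = "(" + " OR ".join(optional) + ")"
    if required:
        query += " AND " + " AND ".join(required)
    return query

print(build_text_query(["fatty liver"], ["ct"]))
```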

We refer to the use of our GFM implementations for content-based image retrieval as content-based GFM. Content-based GFM is an approximation of a typical CBIR system that represents images using the content-based features shown in Table 1 and compares them using Euclidean distance. In contrast with our textual baseline, content-based GFM only searches the fields of our indices that correspond to content-based features and only processes the example images of a multimodal topic to construct a query. For automatically generating queries, content-based GFM first concurrently extracts the content-based features for all the example images in a topic. It then maps the extracted features to code words using a default code word expansion factor of one (\(\varepsilon = 1\)). Finally, content-based GFM constructs a disjunctive query consisting of the mapped code words for all example images, enabling it to retrieve images visually similar to any of the examples.
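A minimal sketch of this query construction is shown below; it assumes precomputed cluster centroids and a hypothetical code word naming scheme, with the expansion factor ε interpreted as the number of nearest centroids retained per feature.

```python
import numpy as np

def codewords(feature_vector, centroids, feature_id, expansion=1):
    """Map one content-based feature vector to the code words of its
    `expansion` nearest cluster centroids."""
    distances = np.linalg.norm(centroids - feature_vector, axis=1)
    nearest = np.argsort(distances)[:expansion]
    return [f"f{feature_id}c{int(c):03d}" for c in nearest]  # naming is an assumption

def content_query(example_images, centroids_per_feature, expansion=1):
    """Disjunctive query over the code words of all example images in a topic."""
    words = set()
    for image in example_images:                # image = list of feature vectors
        for fid, vector in enumerate(image):
            words.update(codewords(vector, centroids_per_feature[fid], fid, expansion))
    return " OR ".join(f"codewords:{w}" for w in sorted(words))

# Toy example: two example images, one 4-dimensional feature, five centroids.
rng = np.random.default_rng(1)
centroids_per_feature = [rng.random((5, 4))]
example_images = [[rng.random(4)], [rng.random(4)]]
print(content_query(example_images, centroids_per_feature, expansion=1))
```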

The last search paradigm our GFM implementations support is multimodal image retrieval, and we refer to this use as multimodal GFM. Multimodal GFM is the combination of our textual baseline with content-based GFM. It searches all the fields of our indices, and it processes both a topic’s textual description as well as its example images to construct a multimodal query. Based on our preliminary experiments, multimodal GFM weights the images’ text-based features significantly more than their content-based features and uses a default code word expansion factor of two \((\varepsilon = 2)\).
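A minimal sketch of the resulting query combination is given below; the boost values are illustrative only, since the actual field weights were tuned in our preliminary experiments.

```python
def multimodal_query(text_query, codeword_query, text_boost=10.0, visual_boost=1.0):
    """Combine the textual baseline query with the code word query, weighting
    the text-based clause well above the content-based clause (boosts are
    illustrative, not the tuned values)."""
    return f"({text_query})^{text_boost} OR ({codeword_query})^{visual_boost}"

print(multimodal_query('caption:"hepatic steatosis"^3 OR title:"hepatic steatosis"^2',
                       "codewords:f0c002 OR codewords:f0c004"))
```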

7 Results

In this section we present experimental results for the evaluation of our GFM implementations. Because GFM seeks to improve retrieval precision by providing a practical means of efficiently incorporating visual information into a predominantly text-based image retrieval strategy, we discuss here results for both retrieval time and retrieval performance. In our evaluation, we demonstrate that, although GFM makes use of content-based image features, it requires a retrieval time that is roughly equal to that of a traditional text-based retrieval system, providing evidence that GFM is capable of indexing large-scale image collections. We also show that our multimodal and content-based GFM implementations achieve statistically significant improvements in retrieval precision on both the 2010 and 2012 ImageCLEF collections.

Before presenting our complete set of results, we show in Fig. 5 example retrieval results obtained with our Lucene GFM implementation for a topic taken from the 2010 ImageCLEF medical retrieval track data set. Depicted are (1) the textual description of the topic and its example images, (2) relevance scores and retrieval times obtained with our three retrieval approaches, and (3) the top five ranked images retrieved using each method. The compared retrieval methods include the textual baseline (TB) as well as content-based and multimodal GFM (CB-GFM and M-GFM, respectively). The results in Fig. 5 show that for this topic and among our methods, multimodal GFM achieves the best performance by successfully combining and improving upon our text-based and content-based approaches. Since GFM utilizes traditional inverted file indices for indexing and retrieval, the search latency achieved by content-based and multimodal GFM is comparable to that of the textual baseline. Note that we first introduced this particular multimodal topic when describing Fig. 1. We explore these and additional results in more detail in Sects. 7.1 and 7.2.

Fig. 5
figure 5

Example retrieval results for topic six of the 2010 ImageCLEF medical retrieval track data set. Relevance scores are given for the 1,000 highest ranked images with metrics including binary preference (bpref), judged mean average precision (MAP′), and judged precision-at-ten (P′@10). Results are shown for content-based global feature mapping (CB-GFM), multimodal global feature mapping (M-GFM), and the textual baseline. All reported times are the lowest of ten retrieval runs and are given in milliseconds

7.1 Retrieval time

Table 2 shows the retrieval time required by our Lucene GFM implementation for each topic taken from the 2010 ImageCLEF medical retrieval track data set. Retrieval times are reported in milliseconds and reflect the lowest search latency obtained from ten retrieval runs. We report the lowest retrieval time for each method, as opposed to the average, to avoid including the cost of any other processes running on our evaluation system that may have preempted the retrieval process. In addition to content-based and multimodal GFM, we provide for comparison the retrieval times for the textual baseline as well as the brute-force CBIR approach (BF-CBIR). For brute-force CBIR, all visual descriptors were loaded into memory prior to timing; therefore, the search latencies reported for this method reflect only the time needed for computing Euclidean distances between the query and collection descriptors and then sorting these distances. For all other methods, the reported latencies include the time required for generating code words and parsing the queries in addition to retrieving the images. The time needed for extracting content-based features from the example query images is not included in our reported search latencies.

Table 2 Retrieval time

The results depicted in Table 2 show that content-based and multimodal GFM require a retrieval time that is roughly comparable to that of a traditional text-based information retrieval system. They also show that, for the 2010 ImageCLEF collection, brute-force CBIR takes approximately two orders of magnitude longer to retrieve the highest-ranked images than the comparable content-based GFM. That brute-force CBIR requires so much more time than the other methods is not surprising, as it is a naïve retrieval strategy; nevertheless, it remains a remarkably common approach for searching small-to-medium sized collections. Table 2 demonstrates that the efficiency of inverted file indices is easily obtainable for the content-based and multimodal image retrieval paradigms, which is especially significant for managing large image collections.

The retrieval times presented in Table 2 vary slightly for each topic. For the GFM-based results, the variation in retrieval time generally reflects differences in the length of the queries and the number of images the queries retrieve. Longer queries require additional time to parse, and queries that retrieve many images require more time to score the results. The length of a query depends on the length of the topic’s textual description as well as the number of example images it has. The number of images retrieved for each topic depends on the queries. For image-based queries, this is related to the number of feature vectors in each cluster. For example, a query containing a code word representing many images will result in more images being scored because that code word is more common within the collection. For the brute-force approach, the number of scored images and the length of the feature vectors remain constant across the topics.

7.2 Retrieval performance

Having demonstrated that content-based and multimodal GFM achieve response times comparable to what is obtained by our textual baseline, we now show that GFM is capable of improving upon the average retrieval precision of existing methods. In doing so, we also demonstrate the effectiveness of code word expansion and show that intracluster image ranking—the relative ranking of images mapped to identical sets of code words—is improved when indexing a sufficient number of features or feature partitions.

7.2.1 Multimodal retrieval

Tables 3 and 4 show the retrieval results obtained by our multimodal Lucene and Essie GFM implementations for each topic taken from the 2010 and 2012 ImageCLEF medical retrieval track data sets. For comparison, we also include in the tables the results obtained by our textual baseline and the multimodal systems that achieved the highest average bpref at the ImageCLEF evaluations. Taken together, these results show that (1) our textual baseline is statistically indistinguishable from the best performing systems evaluated at ImageCLEF and that (2) the performance of multimodal GFM is significantly better than that of our textual baseline. The observed improvement in retrieval precision is especially encouraging because it is consistent across two different data sets and GFM implementations and, as we saw in Table 2, requires a negligible increase in retrieval latency over our textual baseline.

Table 3 Multimodal retrieval results for ImageCLEF 2010
Table 4 Multimodal retrieval results for ImageCLEF 2012

For the 2010 ImageCLEF results shown in Table 3, we see that our Lucene implementation of multimodal GFM achieved a statistically significant increase in both MAP′ (10.03 %, p = 0.02) and bpref (7.46 %, p = 0.02) compared to our textual baseline. Although the average P′@10 obtained by multimodal GFM is also greater than that of our textual baseline, this improvement did not reach statistical significance at the 0.05 level. These results show that incorporating a limited amount of visual information into the retrieval process can provide a slight but consistent performance improvement over text-based retrieval.

The potential for multimodal GFM to create synergistic combinations of text-based and content-based features is perhaps best demonstrated by topic 9. Topic 9 asks for MR images of papilledema (swelling of the optic disc), and both of the provided example images are MR images of the head. For this topic, our textual baseline achieved a bpref of 0.3819, which is consistent with its average bpref over all sixteen topics (0.3834). However, multimodal GFM dramatically improved upon this result by obtaining a bpref of 0.7778. We will see in Sect. 7.2.2 that when configured for performing content-based retrieval, GFM is unable to retrieve a single relevant image for topic 9, demonstrating that it is through the combination of features that multimodal GFM improves performance. The inability of content-based GFM to retrieve relevant images for this topic may be due to the dissimilarity of the two example images: one depicts a sagittal view of the head whereas the other shows a coronal view. Although both topic images are used for constructing a query, the lack of a single consistent visual depiction of the concept likely contributes to content-based GFM retrieving many irrelevant images. For topics such as this one, the semantic information provided by text-based features significantly improves the performance of GFM.

The 2012 ImageCLEF results depicted in Table 4 show that, like our Lucene implementation, our Essie implementation of multimodal GFM also achieved a statistically significant increase in MAP′ (2.61 %, p = 0.02) compared to our textual baseline. However, we did not find its improvement in bpref or P′@10 to be statistically significant. The overall trend for our Essie implementation on the 2012 ImageCLEF collection is similar to that of our Lucene implementation on the 2010 data set, with multimodal GFM providing a slight but consistent improvement in performance for many of the topics, although we did observe a decrease in bpref and P′@10 on topic 6. Although the average performance of the three systems shown in Table 4 differs somewhat, their similarity is not a coincidence: the best performing system at the 2012 ImageCLEF evaluation was an earlier implementation of multimodal GFM. In addition to reducing the overall weight Essie allocates to the content-based fields of our indices when scoring documents, our current implementation of multimodal GFM also dynamically reduces the weight given to image code words for topics the ImageCLEF organizers categorized as “Semantic” topics.

7.2.2 Content-based retrieval

Although GFM is primarily intended to be a practical means of performing multimodal image retrieval, we can also evaluate its use for content-based retrieval. Table 5 shows retrieval results obtained by our Lucene implementation of content-based GFM for each topic taken from the 2010 ImageCLEF medical retrieval track data set. For comparison, we also include in the table results obtained by brute-force CBIR and the content-based system that achieved the highest average bpref at the 2010 ImageCLEF evaluation.

Table 5 Content-based retrieval results for ImageCLEF 2010

The results depicted in Table 5 show that content-based GFM achieved a statistically significant increase in bpref (195.65 %, p < 0.01) over brute-force CBIR, but it did not perform significantly better than the brute-force method in terms of MAP′ or P′@10. The performance of content-based GFM is also comparable with that of the best content-based system at the 2010 ImageCLEF evaluation. The improvement over brute-force CBIR is especially interesting because the two methods are conceptually similar: both utilize the same set of content-based visual descriptors and compare these descriptors with Euclidean distance. However, our Lucene implementation of content-based GFM additionally applies tf-idf term weighting to the image code words. As we described in Sect. 4.2.2, because tf = 1 for all code words in an image’s surrogate document, code words with a greater idf are weighted more heavily. Thus, code words corresponding to clusters containing fewer visual descriptors are given more weight than code words corresponding to larger clusters. This difference favors images that are more distinctive within the collection, and it contributes to the average increase in retrieval performance obtained by content-based GFM. The increase in bpref is also significant because, as was shown in Table 2, the response time of content-based GFM is a fraction of that required by brute-force CBIR.
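The effect of this weighting can be seen with the textbook idf formulation below; Lucene’s similarity formula differs in detail, so the numbers are indicative only.

```python
import math

def codeword_idf(num_images, cluster_size):
    """Textbook inverse document frequency of a code word whose cluster
    contains `cluster_size` images (Lucene's exact formula differs slightly)."""
    return math.log(num_images / cluster_size)

N = 77_479   # images in the 2010 collection
for size in (10, 100, 1_000, 10_000):
    print(f"cluster of {size:>6} images -> code word weight {codeword_idf(N, size):.2f}")
```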

Immediately apparent from the results depicted in Table 5 is the poor performance of content-based retrieval in relation to the text-based and multimodal strategies shown in Tables 3 and 4. However, it is well-known that CBIR generally does not perform as well as textual methods for literature-based image retrieval tasks such as those encountered through participation in the ImageCLEF evaluations (Müller et al. 2010a). Because the performance of CBIR systems can be so poor, it is not surprising that the community has had difficulty developing multimodal retrieval strategies that improve upon the performance of text-based approaches. In this regard, multimodal GFM is significant for its ability to consistently demonstrate an increase in performance over our textual baseline.

7.2.3 Code word expansion

Table 6 shows retrieval results obtained by our Lucene implementation of content-based GFM with varying code word expansion factors for each topic taken from the 2010 ImageCLEF medical retrieval track data set. We include in the table results obtained without code word expansion \((\varepsilon=1)\), with an expansion factor of two \((\varepsilon=2)\), and with an expansion factor of three \((\varepsilon=3)\). For each topic and expansion factor, we report the bpref obtained by content-based GFM, the number of images retrieved as a percentage of the total number of images in the collection (ret), the number of relevant images retrieved as a percentage of the total number of images relevant to the topic (rel_ret), and the time taken in milliseconds to obtain the results. Unlike the relevance scores presented in Tables 3, 4 and 5, here we report results for all retrieved images—instead of only the one thousand highest ranked images—to clearly demonstrate the impact of code word expansion on image ranking.

Table 6 Usefulness of query expansion for content-based global feature mapping

Table 6 demonstrates that, although the overall performance of content-based GFM is low, with a code word expansion factor of one it retrieves nearly all relevant images while returning less than one percent of the total number of images in the collection. While increasing the expansion factor results in the retrieval of additional relevant images for some topics, it does not significantly improve retrieval precision, and in some cases it actually worsens performance. For example, an expansion factor of three allows content-based GFM to retrieve several additional relevant images for topic 1 compared with no code word expansion, but it decreases the bpref of topic 1 from 0.0219 to 0.0100. The limited effectiveness of code word expansion provides evidence that the k-means++ algorithm, despite its rigid cluster membership, already produces a clustering of content-based features adequate for our retrieval experiments. Because the number of images in each cluster is small, increasing the expansion factor does not significantly affect the number of images retrieved as a percentage of the total number of images in the collection. However, code word expansion causes a modest increase in response time because a larger number of images must be scored, and longer queries require additional time to parse.

7.2.4 Intracluster ranking

Figure 6 shows the average number of images retrieved by our Lucene implementation of content-based GFM at each retrieval rank under various configurations. Because GFM represents with a single code word all images whose visual descriptors for a given feature lie within the same cluster, it lacks the ability to discriminate among images mapped to the same code word. For example, with a query consisting of a single code word, content-based GFM will retrieve all images whose visual descriptors are in the cluster the code word represents. However, because each of the retrieved images is given the same score, their relative similarity to the example query image is lost, and any ranking of the images is meaningless. We seek to characterize this behavior by presenting in Fig. 6 the number of images given the same score by content-based GFM, averaged over all retrieval ranks, for all sixteen topics taken from the 2010 ImageCLEF medical retrieval track data set.

Fig. 6
figure 6

Average number of images at each retrieval rank as the number of indexed features (a) is increased and, for a single feature, as the number of indexed subspace partitions (b) is increased

In Fig. 6a, we show the average number of images retrieved by content-based GFM per rank as the number of indexed features is increased from one to five. The results show that as we increase the number of features describing the images, we drastically decrease the average number of images retrieved per rank. Thus, increasing the number of indexed features improves the ability of content-based GFM to discriminate among visually similar images. In Fig. 6a, the number of feature partitions is one (p = 1), and the number of images per rank is averaged over all possible combinations of the given number of features. For example, the average number of images retrieved per rank with three features (1.20) is averaged over all sixteen ImageCLEF 2010 topics, over all retrieval ranks, and over all possible combinations of three features. Among five total features there are ten possible combinations of three features.

In Fig. 6b, we show the average number of images retrieved by content-based GFM per rank as the number of feature partitions is increased from one to five. Similar to Fig. 6a, this figure demonstrates that as we partition the visual descriptors of the images’ content-based features into an increasing number of lower-dimensional vectors, we quickly decrease the average number of images retrieved per rank. Thus, increasing the number of feature partitions also improves the ability of content-based GFM to discriminate among visually similar images. Because increasing the number of feature partitions only impacts intracluster rankings, recall-based measures that are computed over all retrieved images, such as MAP, are generally not sensitive to feature partitioning. However, metrics computed on partial ranked lists, such as P@10, may be affected by the number of feature partitions. In Fig. 6b, the number of features representing each image is one, and the number of images retrieved per rank for a given number of partitions is averaged over all sixteen ImageCLEF 2010 topics, over all retrieval ranks, and over all five image representations consisting of a single content-based feature.
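The quantity plotted in Fig. 6 can be computed as in the sketch below, which, for each retrieval rank, counts how many images share that rank’s score and then averages over ranks; the score lists are toy examples, and the per-rank averaging is one reading of the figure.

```python
from collections import Counter

def avg_images_per_rank(scores):
    """For each retrieval rank, count the images sharing that rank's score,
    then average the tie-group sizes over all ranks."""
    counts = Counter(scores)
    return sum(counts[score] for score in scores) / len(scores)

# With one coarse feature many images tie; with more features scores spread out.
one_feature    = [3.0, 3.0, 3.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0]
three_features = [5.1, 4.8, 4.8, 4.2, 3.9, 3.6, 3.1, 2.7, 2.7, 1.4]
print(avg_images_per_rank(one_feature), avg_images_per_rank(three_features))
```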

8 Conclusion

The images found within biomedical articles are sources of essential information to which we must provide efficient access. Not surprisingly, various image retrieval strategies have been proposed for use in the biomedical domain. Unfortunately, although they demonstrate considerable empirical success, traditional text-based image retrieval methods are often unable to retrieve images whose relevance is not explicitly mentioned in the article text. Additionally, content-based retrieval methods are unable to produce meaningful results for many literature-based information needs because visual similarity can be a poor indicator of image relevance. Due to the limitations of these unimodal strategies, practical retrieval techniques capable of fusing information from multiple modalities are desirable.

Global feature mapping (GFM) is a multimodal strategy for retrieving images from biomedical articles. The approach seeks to improve upon the precision of text-based image retrieval methods by providing a practical and efficient means of incorporating a limited amount of visual information into the retrieval process. GFM operates by (1) grouping the global content-based features extracted from an image collection into clusters, (2) assigning images alphanumeric code words indicative of the clusters in which their features reside, (3) indexing a combination of image code words and descriptive text using a text-based information retrieval system, and (4) searching the image index using textual queries derived from multimodal topics.

We evaluated the performance of GFM on the 2010 and 2012 ImageCLEF medical retrieval track data sets. Our multimodal retrieval approach utilizing GFM demonstrated a statistically significant improvement in mean average precision over our text-based strategy, a baseline retrieval method competitive with the best performing systems evaluated at the ImageCLEF forums. Additionally, when configured for performing content-based retrieval, our approach outperformed the highest ranked content-based systems.

Although GFM’s improvements in retrieval precision were small, its performance validates our intuition that visual similarity can play a small yet significant role in multimodal literature-based image retrieval tasks. Key to its success were GFM’s use of an inexact representation of content-based features and its weighting of these features, in conjunction with image-related text, according to an underlying text-based retrieval model. The advantages of these two qualities are perhaps best demonstrated by the comparison of content-based GFM to brute-force CBIR, where GFM outperformed the brute-force method using the same set of features and the same similarity metric for clustering. Because we did not evaluate our particular choice of content-based features, it remains to be seen if the use of a more sophisticated image representation would result in similar improvements in retrieval precision.

To demonstrate GFM’s practicality, we implemented it in two information retrieval systems: a general purpose system based on the vector space retrieval model and a biomedical system that utilizes a probabilistic retrieval model. We obtained statistically significant improvements using both systems. As further evidence of its practicality, we demonstrated that the response time of our multimodal approach is comparable to that of our text-based strategy. Owing to its empirical success, we have incorporated GFM into OpenI, a biomedical image retrieval system currently indexing over one million images from the articles included in the open access subset of PubMed Central.