Adaptive features selection for expert datasets: A cultural heritage application

https://doi.org/10.1016/j.image.2018.06.011

Highlights

  • Adaptive and efficient framework for small expert datasets without prior knowledge.

  • Improving the image representation by using an information gain model.

  • Using a model based on the human visual system.

  • Mixing deep features with classical local descriptors.

  • Experimental results showing the interest of the proposed framework.

Abstract

Image Retrieval is still a very active field of image processing as the number of available image datasets continuously increases. One of the principal objectives of Content-Based Image Retrieval (CBIR) is to return to the user the images most similar to a given query with respect to their visual content. Our work fits in a very specific application context: indexing small expert image datasets, e.g. cultural heritage images, with no prior knowledge about the images. Because of the image complexity, one of our contributions is the choice of effective descriptors from the literature placed in direct competition. Two strategies are used to combine features: a psycho-visual one and a statistical one. In this context, we propose an automatic and adaptive framework, based on the well-known bags of visual words and phrases models, that selects relevant visual descriptors for each keypoint to construct a more discriminative image representation. Experimental results show the adaptiveness and the performance of our framework on “generic” benchmark datasets and on two cultural heritage datasets.

Introduction

In the last decades, the success of smartphones and other mobile devices capable of taking and sharing photos instantaneously has been closely linked to the exponential increase in the number of available image datasets. A huge number of those images are everyday-life photos whose end user could be any one of us. But there are also expert images. In this paper, we focus on “expert image datasets”, which are of interest for domain-expert end-users. These expert end-users can be clinicians, historians, digital curators, numismatists, etc. Such expert datasets may have quite heterogeneous contents (e.g. historic images of persons, constructions, etc.) or more specific contents (e.g. datasets of old coins, butterflies, etc.). The particularities of such datasets have to be taken into account in the indexing tools that will help these experts manage their data for further image exploitation stages such as retrieval or browsing.

This paper focuses on indexing expert image collections, and particularly cultural heritage image datasets, which have become a topic of major interest for experts and researchers. Indeed, implementing digital and long-term preservation strategies and supporting open cultural heritage data are major topics for numerous countries.

Cultural heritage datasets contain very heterogeneous content such as paintings, sculptures, etc. Picard et al. [1] have shown that classical CBIR approaches do not provide satisfying results on this type of data, so there is an urgent need for new methodologies to help expert users manage their data.

In this specific application context, we focus our research on image datasets with no prior knowledge. Thus, the scope of this paper is indexing image collections with Content-Based Image Retrieval (CBIR) methods. One objective of CBIR research is to create a discriminative visual signature describing the visual content of each image. To do so, the Bags of Visual Words model [2] (BoVW) has become popular. It aims at representing image visual features as a simple multidimensional vector. These vectors are the image signatures, i.e. histograms of visual word occurrences, but they lack discriminative power [[3], [4]]. More recently, models inspired by the human brain have obtained very good results on computer vision tasks such as image classification or object recognition. These models, called Convolutional Neural Networks (CNNs), outperform results obtained with “classical” schemes [[5], [6]]. Other recent papers include semantic knowledge in addition to visual content; thus, cross-modal retrieval using deep learning has become very popular [[7], [8]]. These new methodologies have a tremendous appetite for training data, which cannot be satisfied in our specific context of small expert datasets with no prior knowledge.
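A minimal sketch of the BoVW pipeline, written here in Python for illustration only (SIFT descriptors, a MiniBatchKMeans vocabulary and a vocabulary size of 1000 are arbitrary assumptions, not the specific descriptors and settings evaluated in this paper), shows how local descriptors are quantised into the visual-word histogram that serves as the image signature:

    import numpy as np
    import cv2
    from sklearn.cluster import MiniBatchKMeans

    def extract_descriptors(image_paths):
        # Detect keypoints and compute SIFT descriptors for every image.
        sift = cv2.SIFT_create()
        per_image = []
        for path in image_paths:
            img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
            _, desc = sift.detectAndCompute(img, None)
            per_image.append(desc if desc is not None
                             else np.empty((0, 128), dtype=np.float32))
        return per_image

    def build_vocabulary(per_image_descriptors, k=1000):
        # Cluster all local descriptors of the dataset into k visual words.
        stacked = np.vstack(per_image_descriptors)
        return MiniBatchKMeans(n_clusters=k, random_state=0).fit(stacked)

    def bovw_signature(descriptors, vocabulary):
        # Quantise each descriptor to its nearest visual word and build an
        # L1-normalised histogram of word occurrences: the image signature.
        k = vocabulary.n_clusters
        if len(descriptors) == 0:
            return np.zeros(k)
        words = vocabulary.predict(descriptors)
        hist = np.bincount(words, minlength=k).astype(float)
        return hist / hist.sum()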

In this paper, we present our unsupervised framework, which aims at automatically combining the information from various local descriptors. Our main contribution consists in the smart selection of the most relevant visual features for each keypoint. Indeed, we select some of the keypoints according to their importance to both the human visual system and the dataset itself. Thus, the chosen set of descriptors should provide relevant texture, shape, edge and colour information to obtain a more discriminative image representation. To strengthen the discriminative power of keypoints, we introduce two strategies in our unsupervised approach. First, we introduce a psycho-visual model in two different steps of our image representation framework: (i) to discard irrelevant keypoints before starting the indexing step and (ii) to weight the importance of each keypoint during the signature construction step. This process gives more importance to salient keypoints and discredits the others. Secondly, a statistical approach based on a particular weighting scheme chooses, for each keypoint, the combination of local descriptors providing the best information with respect to the dataset. Weighting schemes and information gain models have been widely used in the Information Retrieval field [9]. They increase the importance of a term within a document (in our case, an image) by weighting each visual feature with a value that evolves with its number of occurrences within the dataset.
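The two strategies can be pictured with the following sketch, under simplifying assumptions: the normalised saliency map, the fixed threshold and the idf-style formula below are illustrative stand-ins for the psycho-visual model and the weighting scheme detailed in Section 3.

    import numpy as np

    def filter_and_weight_keypoints(keypoints, saliency_map, threshold):
        # Keep only keypoints whose saliency exceeds the threshold, and use the
        # saliency value itself as the keypoint weight during signature
        # construction. `saliency_map` is assumed to be an HxW array in [0, 1].
        kept, weights = [], []
        for kp in keypoints:
            x, y = int(round(kp.pt[0])), int(round(kp.pt[1]))
            s = float(saliency_map[y, x])
            if s >= threshold:
                kept.append(kp)
                weights.append(s)
        return kept, np.asarray(weights)

    def idf_weights(signatures):
        # Information-gain style weighting: a visual word occurring in many
        # images of the dataset is less discriminative and is down-weighted.
        signatures = np.asarray(signatures)
        document_frequency = (signatures > 0).sum(axis=0)
        n_images = signatures.shape[0]
        return np.log((n_images + 1) / (document_frequency + 1)) + 1.0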

Another contribution of the proposed framework is its efficiency. Indeed, our framework reduces the number of keypoints used according to the visual saliency information. Furthermore, the obtained image signatures are sparser, which reduces the retrieval complexity.
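As a small illustration of the retrieval side (a sketch only, assuming signatures are stored as SciPy sparse rows and compared with cosine similarity, which is one common choice rather than necessarily the matching scheme used here), sparse signatures keep both storage and similarity computation cheap:

    import numpy as np
    from scipy import sparse

    def rank_by_cosine(query_signature, signature_matrix):
        # Rank database images by cosine similarity to the query signature.
        # `signature_matrix` is a CSR matrix (n_images x vocabulary size);
        # the sparser the rows, the cheaper the storage and the dot products.
        q = sparse.csr_matrix(query_signature)
        q_norm = np.sqrt(q.multiply(q).sum())
        db_norms = np.asarray(
            np.sqrt(signature_matrix.multiply(signature_matrix).sum(axis=1))
        ).ravel()
        scores = np.asarray(signature_matrix.dot(q.T).todense()).ravel()
        scores = scores / (np.maximum(db_norms, 1e-12) * max(q_norm, 1e-12))
        return np.argsort(-scores)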

To evaluate the performance of our contribution, we compare results obtained on well-known “generic” datasets. The BoVW model with the different descriptors, including deep features, and their concatenation serve as the baseline approaches. We also compare our performance against other deep learning frameworks. Linked to our specific context and the motivation of this work, we present an evaluation of the proposed framework on two cultural heritage datasets: ROMANE 1K, which is a collection of Romanesque art images with heterogeneous content (paintings, sculptures, …), and the Coin Collection Online Catalogue, which is a numismatic dataset of Byzantine coins.

This article is structured as follows: first, we present the state of the art in Section 2. It describes the following processes and models: BoVW and selected improvements, CNN-based literature, and a brief overview of weighting schemes in image retrieval and visual saliency methodologies. Then, Section 3 gives an overview of our proposal. Section 4 presents the experiments on “generic” image datasets and on two different cultural heritage image collections, and discusses the findings of our study. Finally, Section 5 concludes and gives some perspectives.

Section snippets

Related works

In this section, we first present the BoVW model, a few improvements it inspired, and Bags of Visual Phrases approaches. After introducing CNN models and transfer learning approaches, we present a brief study of literature approaches concerning visual saliency and weighting schemes in the context of CBIR.

Proposed method

In this section, we give an overview of our unsupervised CBIR framework based on a locally adapted selection of visual features. Our main contribution is to ensure effective and efficient performance by taking advantage of weighting schemes and visual saliency models. In our proposal, visual words are selected according to their relevance in the vocabularies and their importance for the human visual system. Indeed, combining those two models gives a better characterisation of each image keypoint
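One way to picture this per-keypoint selection is sketched below; it is a hypothetical simplification in which each keypoint simply keeps the single descriptor type whose assigned visual word carries the highest weight, whereas the actual combination strategy is the one detailed in this section.

    def select_descriptor_per_keypoint(word_assignments, word_weights):
        # For each keypoint, keep the descriptor type whose assigned visual
        # word is the most informative according to per-vocabulary weights.
        # `word_assignments[d][i]` is the visual-word index of keypoint i under
        # descriptor type d; `word_weights[d][w]` is the weight of word w in
        # vocabulary d (e.g. the idf-style weights sketched earlier).
        n_keypoints = len(next(iter(word_assignments.values())))
        selected = []
        for i in range(n_keypoints):
            best = max(word_assignments,
                       key=lambda d: word_weights[d][word_assignments[d][i]])
            selected.append((i, best, word_assignments[best][i]))
        return selected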

Experimental results

This section presents the datasets used to evaluate our work and explains our technical choices. Then, we analyse the results obtained and discuss the findings.

Conclusion

In this paper, we propose to adapt the discriminative power of CBIR models to the specific context of expert datasets. Based on the well-known Bags of Visual Words and Visual Phrases models, we propose an automatic and adaptive framework that selects relevant visual descriptors for each keypoint to construct the image representation. The visual saliency information is first used to discard non-relevant keypoints by computing a threshold on the image. To construct the image signature, a

References (45)

  • Y. Peng, X. Huang, Y. Zhao, An overview of cross-media retrieval: Concepts, methodologies, benchmarks and challenges,...
  • K. Wang et al., A comprehensive survey on cross-modal retrieval, CoRR (2016)
  • H. Chatoux et al., Comparative study of descriptors with dense key points
  • Q. Chen et al., Contextualizing object detection and classification, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
  • H. Jégou et al., Aggregating local descriptors into a compact image representation
  • J. Delhumeau et al., Revisiting the VLAD image representation
  • C. Eggert et al., Improving VLAD: Hierarchical coding and a refined local coordinate system
  • T. Chen et al., Discriminative soft bag-of-visual phrase for mobile landmark recognition, IEEE Trans. Multimed. (2014)
  • Y. Lecun et al., Gradient-based learning applied to document recognition, Proc. IEEE (1998)
  • A. Krizhevsky et al., Imagenet classification with deep convolutional neural networks
  • O. Russakovsky et al., ImageNet large scale visual recognition challenge, Internat. J. Comput. Vis. (2015)
  • K. Simonyan et al., Very deep convolutional networks for large-scale image recognition, CoRR (2014)