Adaptive features selection for expert datasets: A cultural heritage application
Introduction
Over the last decades, the success of smartphones and other mobile devices capable of instantly taking and sharing photos has gone hand in hand with an exponential growth of image datasets. A huge number of these images are everyday-life photos whose end user could be any one of us. But there are also expert images. In this paper, we focus on “expert image datasets”, which are of interest to domain-expert end users such as clinicians, historians, digital curators, or numismatists. These expert datasets may have quite heterogeneous contents (e.g. historic images of people, buildings, etc.) or more specific contents (e.g. datasets of old coins, butterflies, etc.). The particularities of such datasets must be taken into account by the indexing tools that help experts manage their data for subsequent exploitation stages such as retrieval or browsing.
This paper focuses on indexing expert image collections, and particularly cultural heritage image datasets, which have become a topic of major interest for experts and researchers. Indeed, implementing digital long-term preservation strategies and supporting open cultural heritage data are major topics for numerous countries.
Cultural heritage datasets contain very heterogeneous content such as paintings, sculptures, etc. Picard et al. [1] have shown that classical CBIR approaches do not provide satisfying results on this type of data. There is therefore an urgent need for new methodologies that help expert users manage their data.
In this specific application context, we focus our research on image datasets with no prior knowledge. The scope of this paper is thus indexing image collections with Content-Based Image Retrieval (CBIR) methods. One objective of CBIR research is to create a discriminative visual signature describing the visual content of each image. To that end, the Bag of Visual Words (BoVW) model [2] has become popular. It represents the visual features of an image as a simple multidimensional vector: the image signature, a histogram of visual-word occurrences. These signatures, however, lack discriminative power [[3], [4]]. More recently, models loosely inspired by the human brain, Convolutional Neural Networks (CNNs), have obtained very good results on computer vision tasks such as image classification and object recognition, outperforming “classical” schemes [[5], [6]]. Other recent papers include semantic knowledge in addition to visual content; cross-modal retrieval using deep learning has thus become very popular [[7], [8]]. These new methodologies, however, require large amounts of training data, which are unavailable in our specific context of small expert datasets with no prior knowledge.
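As a rough illustration of the BoVW signature mentioned above (a minimal sketch, not the authors' implementation; the function name and toy data are hypothetical, and the vocabulary is assumed to come from an offline clustering such as k-means):

```python
import numpy as np

def bovw_signature(descriptors, vocabulary):
    """Histogram of visual-word occurrences (the BoVW signature).

    descriptors: (n, d) local descriptors extracted from one image
    vocabulary:  (k, d) visual words, e.g. k-means centroids learned offline
    """
    # assign each descriptor to its nearest visual word
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    # count occurrences of each visual word and L1-normalise
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1)

# toy example: 4 descriptors in 2-D, vocabulary of 3 visual words
vocab = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
desc = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 0.9], [5.2, 4.8]])
sig = bovw_signature(desc, vocab)
# → [0.25, 0.5, 0.25]
```

The histogram discards spatial layout and descriptor identity, which is exactly the loss of discriminative power the paper sets out to compensate.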
In this paper, we present our unsupervised framework, which automatically combines information from various local descriptors. Our main contribution is the selection of the most relevant visual features per keypoint. We retain keypoints according to their importance both to the human visual system and to the dataset itself, so that the chosen set of descriptors provides relevant texture, shape, edge and colour information and yields a more discriminative image representation. To strengthen the discriminative power of keypoints, we introduce two strategies in our unsupervised approach. First, a psycho-visual model is applied in two different steps of our image representation framework: (i) to discard irrelevant keypoints before the indexing step, and (ii) to weight the importance of each keypoint during the signature construction step. This process gives more importance to salient keypoints and downweights the others. Secondly, a statistical approach chooses, for each keypoint, the combination of local descriptors providing the best information with respect to the dataset, using a particular weighting scheme. Weighting schemes and Information Gain models have been widely used in the Information Retrieval field [9]. They modulate the importance of a term within a document (in our case, an image) by weighting each visual feature with a value that depends on its number of occurrences within the dataset.
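The occurrence-based weighting borrowed from Information Retrieval can be sketched as a tf-idf-style scheme over visual words (an illustrative sketch under assumed inputs; the function name, the smoothing, and the toy counts are hypothetical, not the paper's exact weighting):

```python
import numpy as np

def tfidf_weights(histograms):
    """tf-idf style weighting of visual words across a dataset.

    histograms: (n_images, k) raw visual-word counts per image.
    Words occurring in many images get a low idf; rare, more
    discriminative words get a high one.
    """
    n = histograms.shape[0]
    df = (histograms > 0).sum(axis=0)        # images containing each word
    idf = np.log((n + 1) / (df + 1)) + 1.0   # smoothed inverse document frequency
    tf = histograms / np.maximum(histograms.sum(axis=1, keepdims=True), 1)
    return tf * idf

# toy dataset: 2 images, 2 visual words; word 0 appears everywhere
counts = np.array([[2.0, 0.0],
                   [3.0, 1.0]])
weighted = tfidf_weights(counts)
```

Here the ubiquitous word 0 keeps a flat weight while the rarer word 1 is boosted above its raw frequency, which is the behaviour the weighting scheme is meant to exploit.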
Another contribution of the proposed framework is its efficiency. Our framework reduces the number of keypoints used according to visual saliency information. Furthermore, the resulting image signatures are sparser, which reduces retrieval complexity.
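Why sparser signatures cut retrieval cost can be seen in a minimal sparse similarity routine (an illustrative sketch, not the paper's retrieval code; the dict representation and function name are assumptions):

```python
import numpy as np

def sparse_cosine(sig_a, sig_b):
    """Cosine similarity touching only the non-zero entries.

    sig_a, sig_b: dicts mapping visual-word index -> weight. With sparser
    signatures, fewer shared keys have to be visited at query time, so the
    per-comparison cost scales with the number of non-zeros, not with the
    vocabulary size.
    """
    shared = sig_a.keys() & sig_b.keys()
    dot = sum(sig_a[i] * sig_b[i] for i in shared)
    na = np.sqrt(sum(v * v for v in sig_a.values()))
    nb = np.sqrt(sum(v * v for v in sig_b.values()))
    return dot / (na * nb) if na and nb else 0.0

# toy signatures over a large vocabulary, only a handful of non-zero words
query = {0: 1.0, 3: 0.5}
candidate = {3: 0.5, 7: 1.0}
sim = sparse_cosine(query, candidate)
# → 0.2
```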
To evaluate the performance of our contribution, we compare results obtained on well-known “generic” datasets. The BoVW model with the different descriptors, including deep features, and their concatenation serves as the baseline. We also compare our performance against other deep learning frameworks. In line with the specific context and motivation of this work, we then evaluate the proposed framework on two cultural heritage datasets: ROMANE 1K, a collection of Romanesque art images with heterogeneous content (paintings, sculptures, …), and the Coin Collection Online Catalogue, a numismatic dataset of Byzantine coins.
This article is structured as follows. Section 2 presents the state of the art: the BoVW model and selected improvements, CNN-based literature, and a brief overview of weighting schemes in image retrieval and of visual saliency methodologies. Section 3 gives an overview of our proposal. Section 4 presents the experiments on “generic” image datasets and on two different cultural heritage image collections, and discusses the findings of our study. Finally, Section 5 concludes and gives some perspectives.
Section snippets
Related works
In this section, we first present the BoVW model, a few improvements it inspired, and Bag of Visual Phrases approaches. After introducing CNN models and transfer learning, we briefly review the literature on visual saliency and weighting schemes in the context of CBIR.
Proposed method
In this section, we give an overview of our unsupervised CBIR framework based on a locally adapted selection of visual features. Our main contribution is to ensure effective and efficient performance by taking advantage of weighting schemes and visual saliency models. In our proposal, visual words are selected according to their relevance in vocabularies and their importance for the human visual system. Indeed, combining these two models gives a better characterisation of each image keypoint…
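The saliency-driven keypoint selection described above can be sketched as a per-image threshold on a saliency map (an illustrative sketch only; the quantile threshold, function name, and toy data are assumptions, since the snippet does not specify how the paper's threshold is computed):

```python
import numpy as np

def filter_keypoints(keypoints, saliency_map, quantile=0.5):
    """Discard keypoints falling in low-saliency regions.

    keypoints:    list of (x, y) integer pixel coordinates
    saliency_map: 2-D array in [0, 1] from any saliency model
    The threshold is computed per image (here, a quantile of the map),
    mirroring the idea of an image-adaptive saliency threshold.
    """
    thresh = np.quantile(saliency_map, quantile)
    kept, weights = [], []
    for (x, y) in keypoints:
        s = saliency_map[y, x]
        if s >= thresh:
            kept.append((x, y))
            weights.append(s)  # later reused to weight this keypoint's vote
    return kept, weights

# toy 2x2 saliency map and three detected keypoints
smap = np.array([[0.1, 0.9],
                 [0.2, 0.8]])
kept, weights = filter_keypoints([(0, 0), (1, 0), (1, 1)], smap)
# the low-saliency keypoint (0, 0) is discarded
```

The surviving saliency values double as the weights used during signature construction, so one pass over the map serves both steps (i) and (ii) of the framework.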
Experimental results
This section presents the datasets used to evaluate our work and explains our technical choices. Then, we analyse the results obtained and discuss the findings.
Conclusion
In this paper, we propose to adapt the discriminative power of CBIR models to the specific context of expert datasets. Based on the well-known Bag of Visual Words and Bag of Visual Phrases models, we propose an automatic and adaptive framework that selects relevant visual descriptors for each keypoint to construct the image representation. Visual saliency information is first used to discard irrelevant keypoints by computing a per-image threshold. To construct the image signature, a…
References (45)
- et al., “Term-weighting approaches in automatic text retrieval”
- et al., “Fine-tuning deep convolutional neural networks for distinguishing illustrations from photographs”, Expert Syst. Appl. (2016)
- et al., “Deep network aided by guiding network for pedestrian detection”, Pattern Recognit. Lett. (2017)
- et al., “A feature-integration theory of attention”, Cogn. Psychol. (1980)
- et al., “Challenges in content-based image indexing of cultural heritage collections”, IEEE Signal Process. Mag. (2015)
- et al., “Visual categorization with bags of keypoints”
- et al., “From bag-of-visual-words to bag-of-visual-phrases using n-grams”
- et al., “A comparative study of irregular pyramid matching in bag-of-bags of words model for image retrieval”
- et al., “Deep learning”, Nature (2015)
- et al., “Wide residual networks”