
1 Introduction

The processing of visual information in the human visual system begins in a thin layer of neural tissue called the retina. The architecture of our retina allows us to receive up to 10 billion bits of information every second, while our cerebral cortex contains about 10 billion neurons. Because of the limited storage capability of our brain, we cannot simultaneously perform complex analysis on all the incoming visual information [1]. One of the most important tasks of the human visual system is therefore to detect the important visual subset, i.e. the salient subset. When a person performs any visual task (watching TV, driving a car), the eyes flick rapidly from place to place to inspect the visual scene. These eye movements are not random: each movement brings the central part of vision (the fovea) onto a region of interest of the picture, which is why vision is not uniform across our field of view and acuity decreases with eccentricity [2].

There is an intimate connection between visual attention and eye movements; for this reason, how, why and when we move our eyes has become a major topic of scientific research in the last decades.

Two main factors drive visual attention: bottom-up factors and top-down factors. Bottom-up factors are derived solely from the visual scene [2]: the regions that attract human attention are those sufficiently discriminative with respect to the surrounding features. This attentional mechanism is called exogenous [3]. Top-down attention, on the other hand, is driven by cognitive factors such as knowledge, expectations and current goals [4]. Other terms for top-down attention are endogenous [5], voluntary, or centrally cued attention.

The objective of Visual Saliency is to detect the most important regions of an image from a perceptual point of view, i.e. to imitate the behaviour of the human visual system. Visual Saliency is studied in several research areas, such as Psychology, Neurobiology, Computer Science, Artificial Intelligence and Medicine; our work considers Visual Saliency from a Computer Science point of view. The goal is to build a saliency map revealing the salient subset of an image. It is usually a grayscale map whose pixels fall in the dynamic range [0, 255], where the highest intensity values correspond to the most salient pixels of the picture.

As with Visual Attention, saliency approaches are based on bottom-up factors and top-down factors. More in detail, Visual Saliency methods can be grouped into three main approaches: Bottom-up, Top-down and Hybrid.

Bottom-up methods are stimulus-driven: they seek the so-called “visual pop-out” saliency. In these approaches, human attention is considered a cognitive process that concentrates on the most unusual aspects of an environment while ignoring the common ones. This idea is implemented by several techniques, such as center-surround operations [6] and graph-based activation maps [7].

Top-down methods for saliency detection are based on high-level visual tasks such as Object Detection or Face Detection. In these methods the predefined task is given by the object class to be detected [8].

Hybrid methods are generally structured in two levels: a bottom-up layer produces a noisy saliency map, and a top-down layer filters out the noisy regions of that map.

Eye tracking has typically been adopted to examine human visual attention. The location of an eye fixation reflects attention, while its duration reflects processing difficulty and the amount of attention [9]. Specifically, fixation duration varies depending on the type of information (e.g. text or graphics) and the type of task (e.g. reading or problem solving).

In our work we used an eye tracker to record the gaze paths of thirty observers while they viewed each image of a dataset [10]. Each image was shown at full resolution for three seconds, separated by one second of viewing a gray screen (we adopted the same experimental protocol as Judd et al. [11]). The dataset [10] consists of several images with a single object in the foreground and a homogeneous background color, but any dataset with a single main object (target) and a limited number of distractors in each image would have been equally appropriate. While observing the images, the viewers sat at a distance of 70 cm from a 22 in. computer screen with \(1920 \times 1080\) resolution.

We used the eye tracking data to create a ground truth made of fixation point maps showing where viewers look in the first three seconds of observation.

Our contributions in this work are threefold: a new ground truth of fixation maps, made available to the community; a new saliency method extending our previous study on visual saliency; and a correlation study between saliency maps and the object attention process.

2 State of the Art

Models and approaches for visual saliency detection are inspired by the mechanisms of the human visual system. As reported in the first section of this paper, saliency methods can be divided into three main groups: Bottom-up, Top-down and Hybrid. In [6] the authors proposed a bottom-up approach based on a multi-scale analysis of the image: multi-scale image features are combined into a topographical saliency map, and a dynamical neural network then selects the attended locations according to their saliency values. The principle of center-surround difference is adopted in [12] for the parallel extraction of different feature maps. In [7] Harel et al. propose a saliency method (well known as GBVS) based on a biologically plausible graph-based model, in which the leading models of visual saliency are organized into three stages: extraction, activation and normalization. Wang et al. [13] survey the literature on low-level methods for visual saliency.

An effective method [14] for visual saliency detection, based on multi-scale and multi-channel means, has been proposed by Sun et al. The image is decomposed and reconstructed using the wavelet transform, and a bicubic interpolation algorithm is applied to shrink the filtered image at multiple scales. The saliency values are the distances between the shrunk images and the means of their channels. SIFT keypoint density maps have been proposed in our previous works to extract saliency maps and texture scale [15,16,17].

In Top-down approaches [8, 18], the visual attention process is considered task dependent: the observer’s expectations and goals while analyzing the scene determine why one point is fixated rather than another. In [19] the authors perform saliency detection with a Top-down model that jointly learns a Conditional Random Field (CRF) and a visual dictionary.

Hybrid systems for saliency generally combine bottom-up and top-down stimuli. In many hybrid approaches [20, 21], a Top-down layer is used to refine the noisy map extracted by the Bottom-up layer. For example, the Top-down component in [20] is face detection. Chen et al. [21] used a combination of face and text detection and found the optimal solutions through a branch-and-bound technique. A well-known state-of-the-art hybrid approach was proposed by Judd et al. [11], together with a database [22] of eye tracking data from 15 viewers. Low-, middle- and high-level features of these data have been used to train a saliency model.

Yu et al. in [23] used a paradigm based on Gestalt grouping cues for object-based saliency detection.

In recent years several researchers have focused their attention on deep learning approaches for saliency prediction, because high-quality visual saliency models can be learned using deep convolutional neural networks (CNNs). For instance, in [24] the authors introduced a neural network architecture with fully connected layers on top of a CNN responsible for feature extraction at different scales.

The authors of [25] reported a comparative study that evaluates the performance of 13 state-of-the-art saliency models. A new metric is also proposed and compared with previous ones. In [26] the authors give formal definitions of the three different types of approaches (Bottom-up, Top-down, Hybrid) and an overview of existing methods. Furthermore, they describe the publicly available datasets and the performance metrics in use.

3 Proposed Methods

3.1 Eye Tracking Data Acquisition

We chose 5 different objects from the Object Pose Estimation Database (OPED) [10, 27], and for each object we selected 19 views with a fixed vertical angle of 130\(^{\circ }\) and a horizontal angle ranging from 0\(^{\circ }\) to 180\(^{\circ }\) in 10\(^{\circ }\) increments. The resulting 95 images were thoroughly filtered to attenuate noise that could shift human attention and padded to fill the \(22''\) screen at \(1920 \times 1080\) resolution. We showed the images to 24 users (males and females between 21 and 34 years old) seated at approximately 70 cm from the screen.

The acquisition procedure for each user was as follows: the Tobii EyeX [28], running at a 60 Hz refresh rate, was calibrated to the user; then each image was shown for 3 s while capturing all of the user’s saccades and fixations, separated by 1 s of neutral gray screen, to keep the results consistent with those of previous works in the literature [11] (Fig. 1).
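Below is a minimal sketch of this presentation loop, shown for illustration only: the images are displayed full screen with OpenCV, while the tracker object and its start_recording()/stop_recording() calls are hypothetical placeholders for the Tobii EyeX client.

```python
# Sketch of the stimulus-presentation loop: each image for 3 s, then 1 s of
# neutral gray. The `tracker` object and its start_recording()/stop_recording()
# methods are hypothetical placeholders for the Tobii EyeX client.
import cv2
import numpy as np

IMAGE_MS, GRAY_MS = 3000, 1000      # 3 s stimulus, 1 s gray screen
SCREEN_W, SCREEN_H = 1920, 1080

def run_session(image_paths, tracker):
    gray = np.full((SCREEN_H, SCREEN_W, 3), 128, dtype=np.uint8)
    cv2.namedWindow("stimulus", cv2.WINDOW_NORMAL)
    cv2.setWindowProperty("stimulus", cv2.WND_PROP_FULLSCREEN, cv2.WINDOW_FULLSCREEN)
    gaze_log = {}
    for path in image_paths:
        img = cv2.imread(path)
        tracker.start_recording()           # hypothetical tracker call
        cv2.imshow("stimulus", img)
        cv2.waitKey(IMAGE_MS)               # image shown for ~3 s
        gaze_log[path] = tracker.stop_recording()
        cv2.imshow("stimulus", gray)
        cv2.waitKey(GRAY_MS)                # neutral gray screen for ~1 s
    cv2.destroyAllWindows()
    return gaze_log
```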

Fig. 1. The setup used for eye fixation data acquisition.

Fig. 2. An image from the OPED and its fixation map before and after Gaussian blurring.

Fixation Map Generation. The data acquired from the Tobii EyeX comprise three arrays of the same length: the positions looked at by the left eye, the positions looked at by the right eye, and the sampling times. The fixation points are computed by averaging the data of both eyes and converting the result to screen coordinates; full-resolution maps are then built by adding one to a pixel’s value each time a user has looked at that pixel. The maps are subsequently smoothed with a Gaussian convolution kernel and normalized to [0, 1] (Fig. 2).
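The following sketch illustrates this accumulation-and-blurring step, assuming the per-eye gaze samples have already been converted to pixel coordinates; the kernel width sigma used here is only an illustrative value.

```python
# Sketch of the fixation-map construction: accumulate averaged binocular gaze
# samples into a hit map, blur with a Gaussian kernel, normalize to [0, 1].
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_map(left_xy, right_xy, shape=(1080, 1920), sigma=30):
    """left_xy, right_xy: (N, 2) arrays of per-eye gaze positions in pixels.
    sigma is only an illustrative kernel width."""
    fmap = np.zeros(shape, dtype=np.float64)
    gaze = (np.asarray(left_xy) + np.asarray(right_xy)) / 2.0   # average both eyes
    for x, y in np.round(gaze).astype(int):
        if 0 <= y < shape[0] and 0 <= x < shape[1]:
            fmap[y, x] += 1.0                                   # one hit per sample
    fmap = gaussian_filter(fmap, sigma=sigma)                   # Gaussian blurring
    if fmap.max() > 0:
        fmap /= fmap.max()                                      # normalize to [0, 1]
    return fmap
```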

3.2 Proposed Saliency Map Generation Method

We aimed to improve the method we (Ardizzone et al.) developed in 2011 [15] by adding chroma information to the saliency map generation algorithm based on SIFT [29] Density Maps (SDMs). An SDM is built by counting the number of detected SIFT keypoints inside a sliding window of size \(k \times k\) centered on each pixel of the image. To obtain a valid saliency map, the SDM is further processed by taking the absolute difference of each pixel from the most frequent value (mode) of the map, rescaling the values to [0, 1] and blurring the result with an average filter whose window size is half of that used to build the map (k).
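A minimal sketch of this construction is shown below; it relies on OpenCV’s SIFT detector and an unnormalized box filter to obtain the per-pixel keypoint counts, and the sdm_saliency helper is reused in the later sketches.

```python
# Sketch of the SDM construction and its statistical post-processing:
# per-pixel keypoint counts in a k x k window, absolute difference from the
# mode, rescaling to [0, 1] and smoothing with a k/2 average filter.
import cv2
import numpy as np

def sdm_saliency(gray, k):
    """gray: 8-bit single-channel image; k: SDM window size."""
    sift = cv2.SIFT_create()
    hits = np.zeros(gray.shape, dtype=np.float32)
    for kp in sift.detect(gray, None):
        x, y = int(round(kp.pt[0])), int(round(kp.pt[1]))
        if 0 <= y < hits.shape[0] and 0 <= x < hits.shape[1]:
            hits[y, x] += 1.0
    # keypoint count inside a k x k window centered on each pixel
    sdm = cv2.boxFilter(hits, -1, (k, k), normalize=False)
    # absolute difference from the most frequent value (mode) of the map
    vals, counts = np.unique(sdm, return_counts=True)
    sal = np.abs(sdm - vals[np.argmax(counts)])
    # rescale to [0, 1] and blur with an average filter of size k/2
    sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)
    return cv2.blur(sal, (k // 2, k // 2))
```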

Color-based saliency has been implemented in two ways, by exploiting the HSV and CIE L*a*b* color spaces. We soon found that the optimal SDM window size we used in [15]:

$$\begin{aligned} k = 2^{\lfloor \log _2{\left( \frac{\min {(M, \ N)}}{4} \right) }\rfloor } \end{aligned}$$
(1)

is unsuitable for object attention because it is computed on the entire image size, while the object occupies only a small central portion of it, causing an excessive loss of detail in the generated saliency maps. We overcame the problem by first taking the mean of the dimensions of the object bounding boxes over all images used during the data acquisition phase, and then applying (1) to the resulting values.
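A short sketch of this computation, applying (1) to the mean bounding-box dimensions rather than to the full image size, could look as follows.

```python
# Sketch of the adapted window-size computation: Eq. (1) applied to the mean
# bounding-box dimensions of the objects instead of the full image size.
import numpy as np

def sdm_window_size(box_widths, box_heights):
    """box_widths, box_heights: per-image object bounding-box sizes in pixels."""
    m, n = np.mean(box_heights), np.mean(box_widths)    # mean object dimensions
    return 2 ** int(np.floor(np.log2(min(m, n) / 4)))   # Eq. (1)
```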

HSV Color Space Saliency. In HSV an image is expressed using cylindrical coordinates, where hue is an angular dimension that goes from 0\(^{\circ }\) to 360\(^{\circ }\) and then back to 0\(^{\circ }\), while saturation and value are linear dimensions. 8-bit RGB images can easily be converted to HSV by projecting the RGB cube onto a chromaticity plane in such a way that a hexagon is formed:

$$\begin{aligned} \begin{aligned} C_{max}&= \max (R, G, B)\\ C_{min}&= \min (R, G, B)\\ \varDelta&= C_{max} - C_{min}\\ H&= {\left\{ \begin{array}{ll} 0 &{} \text { if } \varDelta = 0\\ \left( 60 \times \frac{G-B}{\varDelta } + 360\right) \ \bmod 360 &{} \text { if } R = C_{max}\\ 60 \times \frac{B-R}{\varDelta } + 120 &{} \text { if } G = C_{max}\\ 60 \times \frac{R-G}{\varDelta } + 240 &{} \text { if } B = C_{max}\\ \end{array}\right. }\\ S&= {\left\{ \begin{array}{ll} 0 &{} \text { if } C_{max} = 0\\ \frac{\varDelta }{C_{max}} &{} \text { otherwise}\\ \end{array}\right. }\\ V&= \frac{C_{max}}{255}\\ \end{aligned} \end{aligned}$$
(2)

We convert hue and saturation from polar to Cartesian coordinates, in order to eliminate the discontinuity of the hue values around zero:

$$\begin{aligned} \begin{aligned} X&= S \circ \cos (H)\\ Y&= S \circ \sin (H) \end{aligned} \end{aligned}$$
(3)

then we rescale the X, Y and V channels to the [0, 1] range for convenience of processing and compute a statistically processed SDM for each of them separately. The three maps are combined into the final saliency map (Fig. 3):

$$\begin{aligned} SM_{HSV} = \frac{1}{3}(SM_H + SM_S + SM_V). \end{aligned}$$
(4)
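The HSV pipeline can be summarized with the following sketch, which reuses the sdm_saliency helper introduced above; note that OpenCV stores hue as H/2 in 8-bit images, which is compensated for before applying (3) and (4).

```python
# Sketch of the HSV-based saliency of Eq. (4), reusing sdm_saliency() above.
# OpenCV stores hue as H/2 for 8-bit images, so it is doubled before the
# polar-to-Cartesian conversion of Eq. (3).
import cv2
import numpy as np

def hsv_saliency(bgr, k):
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    h = np.deg2rad(hsv[..., 0] * 2.0)          # hue angle in radians
    s = hsv[..., 1] / 255.0
    v = hsv[..., 2] / 255.0
    x, y = s * np.cos(h), s * np.sin(h)        # Eq. (3)
    maps = []
    for ch in (x, y, v):
        ch = (ch - ch.min()) / (ch.max() - ch.min() + 1e-12)   # rescale to [0, 1]
        maps.append(sdm_saliency((ch * 255).astype(np.uint8), k))
    return (maps[0] + maps[1] + maps[2]) / 3.0  # Eq. (4)
```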
Fig. 3. An image from the OPED and its SIFT saliency map calculated in HSV space.

CIE L*a*b* Color Space Saliency. The HSV space still shows some shortcomings: the hue and saturation channels are dominated by noise when brightness is low; furthermore, it is not biologically inspired and does not model the color opponent process of the HVS [30]. Therefore, we decided to implement the SDM calculation also in the CIE L*a*b* space, which is perceptually uniform and designed with color opponency in mind [31].

The processing steps are essentially the same as in the previous method: RGB \(\rightarrow \) L*a*b* conversion (D65 illuminant used as reference), channel range rescaling, SDM calculation, statistical processing and fusion. The coordinate transformation is omitted because this color space has no mathematical discontinuities (Fig. 4).
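A corresponding sketch for the L*a*b* variant is given below; OpenCV's BGR \(\rightarrow \) Lab conversion assumes an sRGB input with the D65 white point, matching the reference illuminant mentioned above.

```python
# Sketch of the L*a*b* variant: same pipeline, no coordinate transformation.
# OpenCV's BGR -> Lab conversion assumes sRGB input with the D65 white point.
import cv2
import numpy as np

def lab_saliency(bgr, k):
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    maps = []
    for i in range(3):                                          # L*, a*, b*
        ch = lab[..., i]
        ch = (ch - ch.min()) / (ch.max() - ch.min() + 1e-12)    # rescale to [0, 1]
        maps.append(sdm_saliency((ch * 255).astype(np.uint8), k))
    return (maps[0] + maps[1] + maps[2]) / 3.0
```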

Fig. 4. An image from the OPED and its SIFT saliency map calculated in L*a*b* space.

4 Experimental Results

We generated saliency maps using several methods: our legacy work [15], Itti-Koch-Niebur [6], GBVS [7], Judd [11], our two new color-based methods and a fixed centered Gaussian distribution as a baseline [11]. We ran the tests on our 95-image dataset and its related fixation point and fixation map database, on an Intel Core i7-4770 computer with 4 cores (8 threads) and 16 GB of RAM. For the calculation of the GBVS and Itti-Koch-Niebur saliency maps the GBVS Toolbox [32] has been used, as it includes an enhanced implementation of Itti’s algorithm; the Judd saliency maps were instead generated by running Judd’s code [22] with its original trained parameters. We binarized the saliency and fixation maps at various percentiles [11, 15] (between 0.95 and 0.5) and evaluated the performance of our method in terms of F-measure:

$$\begin{aligned} \begin{gathered} P = \frac{n(M_D \cap M_R)}{n(M_D)}; R = \frac{n(M_D \cap M_R)}{n(M_R)}\\ F_1 = 2 \frac{P \times R}{P + R} \end{gathered} \end{aligned}$$
(5)

where \(M_D\) is the binary version of the detected saliency map, \(M_R\) is the binary version of the reference fixation map, and \(n(\cdot )\) denotes the number of nonzero pixels of a binary map. We also calculated Normalized Scanpath Saliency (NSS) values, a well balanced, binarization-independent metric [33].
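A compact sketch of both metrics follows, assuming the saliency map, the reference fixation map and the list of fixation points are available as NumPy arrays; the percentile-based binarization mirrors the thresholds mentioned above.

```python
# Sketch of the evaluation metrics: F-measure on percentile-binarized maps
# (Eq. 5) and Normalized Scanpath Saliency on the raw saliency map.
import numpy as np

def binarize(m, pct):
    """Keep the pixels above the given percentile (pct=0.95 keeps the top 5%)."""
    return m >= np.quantile(m, pct)

def f_measure(sal, fix_map, pct):
    md, mr = binarize(sal, pct), binarize(fix_map, pct)
    inter = np.logical_and(md, mr).sum()
    p = inter / max(md.sum(), 1)                 # precision
    r = inter / max(mr.sum(), 1)                 # recall
    return 2 * p * r / (p + r + 1e-12)

def nss(sal, fix_points):
    """fix_points: iterable of (row, col) fixation locations."""
    z = (sal - sal.mean()) / (sal.std() + 1e-12)  # zero mean, unit variance
    return float(np.mean([z[r, c] for r, c in fix_points]))
```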

Fig. 5. Performance graphs of various saliency models in terms of F-measure vs. threshold (top) and NSS values (bottom).

From Fig. 5, we note a trend opposite to that of the natural image saliency models reported in other works: in object attention, as the saliency threshold decreases the F-measure tends to decrease as well, instead of increasing. The performance of both our models increases slightly with the threshold until it reaches a plateau at the 90% saliency level. Our CIE L*a*b*-based method always achieves the best results in both metrics, while the HSV-based method underperforms at high saliency levels with respect to GBVS and our previous work.

The execution time required to calculate an HSV or L*a*b* saliency map is about 12 s for a \(1920 \times 1080\) image.

5 Conclusion and Future Works

In this paper we presented a new scale- and color-based visual saliency method that generates accurate saliency maps when only one object is present in the stimulus scene, and we proposed a reference dataset for evaluating algorithm results. Our method, although it takes into account only bottom-up features and is unsupervised, performs better than various reference algorithms, some of which exploit top-down features and trained models (Judd et al.). We expect this method to also give good results on natural images depicting a main object and a limited number of distractors, especially if these are dissimilar in size to the main object itself.

We believe that the effectiveness of our method comes from its ability to adapt to the effective object size, therefore correctly keeping track of the saliency of small object features.

In future works, we plan to implement these improvements in our saliency algorithms for natural images and crowded scenes. The extension of this method to crowded scenes is not trivial and will probably require additional segmentation and object detection steps to identify the sizes of all relevant objects in the image and compute the optimal SDM window size. We are also investigating the feasibility of multi-scale approaches that use different window sizes on different parts of the image.