A behaviorally inspired fusion approach for computational audiovisual saliency modeling

https://doi.org/10.1016/j.image.2019.05.001

Highlights

  • Audiovisual bottom-up human attention modeling via computational audiovisual saliency

  • Three different generic audiovisual fusion schemes, each resulting in a 2-D saliency map

  • Audiovisual eye-tracking data collected for the SumMe and ETMD databases and publicly released

  • Evaluation by comparison with human experimental findings from behavioral experiments

  • Evaluation with 6 eye-tracking databases: DIEM, AVAD, Coutrot1, Coutrot2, SumMe, ETMD

Abstract

Human attention is highly influenced by multi-modal combinations of perceived sensory information, especially audiovisual information. Although systematic behavioral experiments have provided evidence that human attention is multi-modal, most bottom-up computational attention models, namely saliency models for fixation prediction, focus on visual information and largely ignore auditory input. In this work, we aim to bridge the gap between findings from neuroscience concerning audiovisual attention and computational attention modeling by creating a 2-D bottom-up audiovisual saliency model. We experiment with various fusion schemes, based on behavioral findings, for integrating state-of-the-art auditory and visual saliency models into a single audiovisual attention/saliency model, which we validate at two experimental levels: (1) against results from behavioral experiments, aiming to reproduce them in a mostly qualitative manner and to ensure that our modeling is in line with behavioral findings, and (2) against 6 different databases with audiovisual human eye-tracking data. For the latter purpose, we have also collected eye-tracking data for two databases: ETMD, a movie database that contains highly edited videos (movie clips), and SumMe, a database that contains unstructured and unedited user videos. Experimental results indicate that, in most cases, our proposed audiovisual fusion schemes improve performance compared to visual-only models, without any prior knowledge of the video/audio content. Moreover, they are generic and can be applied to any auditory saliency model and any visual spatio-temporal saliency model.

Introduction

Attention can be defined as the behavioral and cognitive process of selectively concentrating on a specific aspect of information while ignoring other perceivable input. The role of attention is vital to humans, and its mechanism has been a research focus for many decades. Computational modeling of human attention could not only be exploited in applications such as robot navigation, human–robot interaction, advertising, and summarization, but could also offer additional insight into our understanding of human attention functions.

Although visual and auditory stimuli often attract attention in isolation, most of the time stimuli are multi-sensory and multi-modal, resulting in multi-modal human attention, e.g., audiovisual attention. The influence that multi-modal stimuli exert on human attention and behavior [1], [2], [3], [4] can be observed both in everyday life and in targeted behavioral experiments. When multi-modal stimuli are incongruent, they can lead to illusory perception of the multi-modal event, as in the ventriloquist or McGurk effect [5], whereas when the stimuli are synchronized/aligned, they can effectively enhance both perception and performance.

In this work, we focus on investigating how multi-sensory, and specifically audiovisual, stimuli can influence human bottom-up attention, namely saliency [6]. For example, [7] describes a series of behavioral experiments that highlights the influence of multi-modal stimuli on saliency through an effect called “pip and pop”: in a visual search task consisting of a cluttered image that contains a target (to be identified by human participants) and distractors that change dynamically, the insertion of a non-localized auditory pip synchronized with the target changes can significantly shorten reaction times. These task-irrelevant pips make the target more salient (i.e., “pop out”). This is just one example of strong audiovisual interaction, the mechanisms of which have been in the focus of cognitive research for years.

In parallel, visual and auditory saliency mechanisms have been well studied separately, and the related findings have been integrated into individual computational models that have already been employed in real-world applications. Some of them have been inspired and validated by behavioral experiments, such as the seminal works of [8], [9] for visual saliency and [10] for auditory saliency. Motivated and validated by behavioral observations in psycho-sensory experiments, these models have inspired variations and improvements, which have been used in applications such as object recognition in images and prominence detection in speech.

Despite the simultaneous development of visual and auditory saliency models, few efforts have focused on creating a joint audiovisual model [11], [12]. The majority of models that attempt to predict human attention in videos rely only on visual information and exclude auditory input. On the other hand, audiovisual fusion has been found to boost performance in applications where the audio and visual modalities are correlated and refer to the same event, e.g., speech recognition [13], movie summarization [14], and human–robot interaction [15], [16], [17].

We aim to bridge the gap between behavioral research, where audiovisual integration has been well investigated, and computational modeling of attention, which is mostly based on the visual modality. Our goal is to investigate ways to fuse existing visual and auditory saliency models in order to create a 2-D audiovisual saliency model that is in line both with behavioral findings and with human eye-tracking data. The model should capture audiovisual correspondences well, but its performance should not degrade when there is no audio or when the audio is unrelated to the video. In our preliminary work [18] we introduced such an audiovisual model, based on the models of Itti et al. and Kayser et al., and carried out preliminary experiments to validate it against behavioral findings from a particular experiment. In the present paper, we investigate more fusion schemes for integrating auditory and visual saliency, we carry out more experiments with behavioral findings, and we also present an evaluation strategy that involves eye-tracking data and comparisons with various models. Some of these data have been collected for the purposes of this paper and will be publicly released. The contributions of this paper can be summarized as follows:

  • Audiovisual bottom-up human attention modeling via computational audiovisual saliency modeling, inspired and validated by behavioral experiments.

  • Investigation of three different audiovisual fusion schemes between visual saliency and non-localized auditory saliency, resulting in a 2-D audiovisual saliency map rather than fusion at the decision or feature level. The proposed audiovisual fusion schemes for attention/saliency modeling are generic, since they can be applied to any visual spatio-temporal saliency method (an illustrative sketch of such generic fusion is given after this list).

  • Audiovisual eye-tracking data collection for two databases, SumMe and ETMD, which contain unedited user videos and highly edited movie clips, respectively, with unconstrained audio. The collected eye-tracking data will be publicly released.

  • Two-level evaluation of the proposed model:

    1. Comparison against human experimental findings from behavioral experiments in a qualitative way, aspiring to build a computational model able to explain and reproduce aspects of human attention.

    2. Comparison against human eye-tracking data from six audiovisual databases of variable complexity: DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD.
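As a concrete illustration of the generic fusion idea referenced above, the sketch below shows three simple ways a 2-D visual saliency frame can be modulated by a scalar, non-localized auditory saliency value: additive, multiplicative, and max-based combination. This is a minimal sketch under assumed conventions (the function names, weighting parameter, and min-max normalization are illustrative choices), not the exact schemes defined in Section 3.

```python
import numpy as np

def normalize(x, eps=1e-8):
    """Min-max normalize a saliency map or signal to [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + eps)

def fuse_additive(S_v, s_a, w=0.5):
    """Weighted additive fusion: audio uniformly raises the saliency level."""
    return normalize((1.0 - w) * normalize(S_v) + w * s_a)

def fuse_multiplicative(S_v, s_a):
    """Multiplicative fusion: audio acts as a gain on the visual map."""
    return normalize(normalize(S_v) * (1.0 + s_a))

def fuse_max(S_v, s_a):
    """Max fusion: keep the stronger of the two cues at every pixel."""
    return np.maximum(normalize(S_v), float(s_a) * np.ones_like(S_v, dtype=float))
```

Given a visual saliency frame `S_v` of shape (H, W) and an auditory saliency value `s_a` in [0, 1] for the same time instant, each function returns a 2-D audiovisual saliency map of the same shape, which is the output format shared by the proposed schemes.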

The rest of the paper is organized as follows: Section 2 is dedicated to an extensive review of audiovisual saliency models, behavioral findings related to audiovisual interactions, and state-of-the-art visual and auditory saliency models. Section 3 presents the main aspects of computational audiovisual modeling and the proposed fusion schemes. Section 4 contains a description of the evaluation metrics as well as a detailed description of the stimuli and the conducted experiments, both for the behavioral findings and for the human eye-tracking databases; the newly collected audiovisual eye-tracking databases are also described, and at the end of the section the results and performance across methods and datasets are analyzed and discussed. The last section concludes our work.

Section snippets

Related work

Several attempts to model audiovisual attention exist in the literature, but most of them are application-specific or use spatial audio in order to fuse it with visual information.

Audiovisual attention models: Probably the first attempt at modeling audiovisual saliency appears in [19], where eye fixation predictions in an audiovisual scene served as cues for guiding a humanoid robot. Here, the model of Itti et al. [9] is employed for visual saliency, while auditory saliency is only

Computational audiovisual saliency modeling

The main focus of this work is to fuse, in a behaviorally-inspired way, individual auditory and visual saliency models in order to form a 2-D audiovisual saliency model and investigate its plausibility. We essentially try to combine several theoretical and experimental findings from neuroscience with signal processing techniques. A high-level overview of the model is presented in Fig. 2. An auditory and a visual stimulus serve as input to an auditory and a visual spatio-temporal saliency model
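One practical step implied by this pipeline is temporal alignment: the auditory model yields a 1-D saliency curve over time, whereas the visual model yields one saliency map per video frame, so the auditory curve must be brought to the video frame rate before any per-frame fusion. The sketch below shows one plausible way to do this; the function name, rate parameters, and the choice of linear interpolation are assumptions made for illustration rather than the paper's implementation.

```python
import numpy as np

def align_audio_saliency(audio_sal, audio_rate, video_fps, n_frames):
    """Resample a 1-D auditory saliency curve to one value per video frame."""
    audio_sal = np.asarray(audio_sal, dtype=float)
    t_audio = np.arange(len(audio_sal)) / float(audio_rate)  # auditory sample times (s)
    t_video = np.arange(n_frames) / float(video_fps)         # video frame times (s)
    per_frame = np.interp(t_video, t_audio, audio_sal)       # linear interpolation
    # Rescale to [0, 1] so the curve can directly modulate a visual saliency map.
    rng = per_frame.max() - per_frame.min()
    return (per_frame - per_frame.min()) / (rng + 1e-8)
```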

Evaluation metrics

Since we are addressing a fixation prediction problem, which is primarily a visual task in which the auditory influence has been incorporated into a visual saliency map, we adopt widely used visual saliency evaluation metrics [45], [110].

We denote the output of our model by Estimated Saliency Map (ESM). In eye-tracking experiments, the Ground-truth Saliency Map (GSM) is the map built from eye movement data. In behavioral experiments, inspired by [6], and due to
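For concreteness, the sketch below implements two metrics that are standard in the fixation-prediction literature and representative of the widely used metrics referred to above: the linear correlation coefficient (CC) between ESM and GSM, and the Normalized Scanpath Saliency (NSS) evaluated at fixation locations. These illustrative implementations may differ in detail from the benchmark code behind the reported results.

```python
import numpy as np

def cc(esm, gsm):
    """Pearson correlation between estimated (ESM) and ground-truth (GSM) maps."""
    e = (esm - esm.mean()) / (esm.std() + 1e-8)
    g = (gsm - gsm.mean()) / (gsm.std() + 1e-8)
    return float(np.mean(e * g))

def nss(esm, fixation_map):
    """Normalized Scanpath Saliency: mean z-scored ESM value at fixated pixels.
    `fixation_map` is a binary map with 1 at recorded fixation locations."""
    e = (esm - esm.mean()) / (esm.std() + 1e-8)
    return float(e[fixation_map.astype(bool)].mean())
```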

Conclusion

We have developed a computational audiovisual saliency model based on behaviorally-inspired fusion schemes between well-known individual saliency models and aspire to validate its plausibility via human behavioral experiments and eye-tracking data. We propose three fusion schemes and subsequently evaluate them. Our first validation effort concerns the “pip and pop” and “sine vs. square” effects, where our model exhibits a similar behavior to the experimental results compared to visual-only

Acknowledgments

This work was co-financed by the European Regional Development Fund of the EU and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call ‘Research - Create - Innovate’ (T1EDK-01248, “i-Walk”).

The authors wish to thank all the members of the NTUA CVSP Lab who participated in the audiovisual eye-tracking data collection. Special thanks to Efthymios Tsilionis for sharing his code and his advice during eye-tracking database collection.

References (120)

  • Riche, N., et al., Rare2012: A multi-scale rarity-based saliency detection with its comparative statistical analysis, Signal Process., Image Commun. (2013)
  • Koutras, P., et al., A perceptually based spatio-temporal computational framework for visual saliency estimation, Signal Process., Image Commun. (2015)
  • Bordier, C., et al., Sensory processing during viewing of cinematographic material: Computational modeling and functional neuroimaging, Neuroimage (2013)
  • Meredith, M.A., et al., Interactions among converging sensory inputs in the superior colliculus, Science (1983)
  • Meredith, M.A., et al., Visual, auditory, and somatosensory convergence on cells in superior colliculus results in multisensory integration, J. Neurophysiol. (1986)
  • Vatakis, A., et al., Crossmodal binding: Evaluating the “unity assumption” using audiovisual speech stimuli, Percept. Psychophys. (2007)
  • Maragos, P., et al., Cross-modal integration for performance improving in multimedia: A review
  • McGurk, H., et al., Hearing lips and seeing voices, Nature (1976)
  • Van der Burg, E., et al., Pip and pop: Nonspatial auditory signals improve spatial visual search, J. Exp. Psychol. Hum. Percept. Perform. (2008)
  • Koch, C., et al., Shifts in selective visual attention: Towards the underlying neural circuitry, Hum. Neurobiol. (1985)
  • Itti, L., et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. (1998)
  • Min, X., et al., Fixation prediction through multimodal analysis, ACM Trans. Multimed. Comput. Commun. Appl. (2017)
  • Coutrot, A., Guyader, N., An audiovisual attention model for natural conversation scenes, in: Proc. IEEE Int. Conf. on...
  • Potamianos, G., et al., Recent advances in the automatic recognition of audiovisual speech, Proc. IEEE (2003)
  • Evangelopoulos, G., et al., Multimodal saliency and fusion for movie summarization based on aural, visual, textual attention, IEEE Trans. Multimed. (2013)
  • Schillaci, G., et al., Evaluating the effect of saliency detection and attention manipulation in human-robot interaction, Int. J. Soc. Robot. (2013)
  • Rodomagoulakis, I., Kardaris, N., Pitsikalis, V., Mavroudi, E., Katsamanis, A., Tsiami, A., Maragos, P., Multimodal human...
  • Tsiami, A., Koutras, P., Efthymiou, N., Filntisis, P.P., Potamianos, G., Maragos, P., Multi3: Multi-sensory perception system...
  • Tsiami, A., Katsamanis, A., Maragos, P., Vatakis, A., Towards a behaviorally-validated computational audiovisual saliency...
  • Ruesch, J., Lopes, M., Bernardino, A., Hornstein, J., Santos-Victor, J., Pfeifer, R., Multimodal saliency-based bottom-up...
  • Schauerte, B., et al., Multimodal saliency-based attention for object-based scene analysis
  • Ramenahalli, S., Mendat, D.R., Dura-Bernal, S., Culurciello, E., Nieburt, E., Andreou, A., Audio-visual saliency map:...
  • Ratajczak, R., Pellerin, D., Labourey, Q., Garbay, C., A fast audiovisual attention model for human detection and...
  • Chen, Y., et al., Audio matters in visual attention, IEEE Trans. Circuits Syst. Video Technol. (2014)
  • Evangelopoulos, G., Zlatintsi, A., Skoumas, G., Rapantzikos, K., Potamianos, A., Maragos, P., Avrithis, Y., Video event...
  • Koutras, P., Zlatintsi, A., Iosif, E., Katsamanis, A., Maragos, P., Potamianos, A., Predicting audio-visual salient events...
  • Coutrot, A., et al., How saliency, faces, and sound influence gaze in dynamic social scenes, J. Vis. (2014)
  • Coutrot, A., et al., Multimodal saliency models for videos
  • Song, G., Effect of Sound in Videos on Gaze: Contribution to Audio-Visual Saliency Modeling (2013)
  • Min, X., et al., Sound influences visual attention discriminately in videos
  • Min, X., Zhai, G., Hu, C., Gu, K., Fixation prediction through multimodal analysis, in: Proc. IEEE Int. Conf. on Visual...
  • Fujisaki, W., et al., Recalibration of audiovisual simultaneity, Nature Neurosci. (2004)
  • Van der Burg, E., et al., Rapid recalibration to audiovisual asynchrony, J. Neurosci. (2013)
  • Van der Burg, E., et al., Rapid temporal recalibration is unique to audiovisual stimuli, Exp. Brain Res. (2015)
  • Keetels, M., et al., Sound affects the speed of visual processing, J. Exp. Psychol. Hum. Percept. Perform. (2011)
  • Gleiss, S., et al., Eccentricity dependent auditory enhancement of visual stimulus detection but not discrimination, Front. Integr. Neurosci. (2013)
  • Li, Q., et al., Spatiotemporal relationships among audiovisual stimuli modulate auditory facilitation of visual target discrimination, Percept. Abstr. (2015)
  • Burr, D., et al., Auditory dominance over vision in the perception of interval duration, Exp. Brain Res. (2009)
  • Chen, L., et al., Intersensory binding across space and time: A tutorial review, Atten. Percept. Psychophys. (2013)
  • Ernst, M.O., A Bayesian view on multimodal cue integration, Hum. Body Percept. Inside Out (2006)