Cognition, Volume 114, Issue 3, March 2010, Pages 389-404

When hearing the bark helps to identify the dog: Semantically-congruent sounds modulate the identification of masked pictures

https://doi.org/10.1016/j.cognition.2009.10.012

Abstract

We report a series of experiments designed to assess the effect of audiovisual semantic congruency on the identification of visually-presented pictures. Participants made unspeeded identification responses concerning a series of briefly-presented, and then rapidly-masked, pictures. A naturalistic sound was sometimes presented together with the picture at a stimulus onset asynchrony (SOA) that varied between 0 and 533 ms (auditory lagging). The sound could be semantically congruent, semantically incongruent, or else neutral (white noise) with respect to the target picture. The results showed that when the onset of the picture and sound occurred simultaneously, a semantically-congruent sound improved, whereas a semantically-incongruent sound impaired, participants’ picture identification performance, as compared to performance in the white-noise control condition. A significant facilitatory effect was also observed at SOAs of around 300 ms, whereas no such semantic congruency effects were observed at the longest interval (533 ms). These results therefore suggest that the neural representations associated with visual and auditory stimuli can interact in a shared semantic system. Furthermore, this crossmodal semantic interaction is not constrained by the need for the strict temporal coincidence of the constituent auditory and visual stimuli. We therefore suggest that audiovisual semantic interactions likely occur in a short-term buffer which rapidly accesses, and temporarily retains, the semantic representations of multisensory stimuli in order to form a coherent multisensory object representation. These results are explained in terms of Potter’s (1993) notion of conceptual short-term memory.

Introduction

In everyday life, the majority of perceptual events provide information to multiple sensory modalities simultaneously (e.g., see the chapters in Calvert, Spence, & Stein, 2004). For example, when watching a ball bounce, we normally also hear the synchronized sound of the ball hitting the ground. In such cases, coherent spatial, temporal, and semantic information is conveyed to both the visual and auditory modalities. The spatiotemporal constraints on crossmodal interactions and multisensory integration have been explored extensively by researchers in recent years (see Spence, 2007, for a review). By contrast, the role of semantic congruency in multisensory object identification has, to date, received far less attention. The goal of the present study was therefore to investigate whether the identification of a visually-presented picture can be influenced by the presentation of a semantically-congruent stimulus in another sensory modality (in this case, audition).

Researchers have been studying the processing of visually-presented pictures and how it can be influenced by the presentation of relevant visual information (such as the visual background) for nearly 40 years (e.g., Biederman, 1972, Biederman et al., 1982, Boyce and Pollatsek, 1992, Boyce et al., 1989, Davenport and Potter, 2004, Glaser and Glaser, 1989, Henderson and Hollingworth, 1999, Hollingworth and Henderson, 1998, Intraub, 1981, Intraub, 1984, Palmer, 1975, Potter, 1975, Potter, 1976). For example, researchers have shown that the theme of a picture can be detected easily in an on-line detection task and yet prove difficult to remember in a subsequently-presented recognition task (Intraub, 1981, Potter, 1975, Potter, 1976). In addition, picture processing can be influenced by the semantic congruency of any supplementary visual information that happens to be presented at the same time. For example, it has been shown that it is easier to detect or report a visual target when it is presented against a semantically congruent background than when it is presented against a semantically incongruent background (Biederman et al., 1982, Boyce et al., 1989, Davenport and Potter, 2004, Palmer, 1975; though see Hollingworth & Henderson, 1998). Similarly, when a word is superimposed on top of a picture, the identification of that picture can either be facilitated by the presentation of a semantically congruent word or else inhibited by the presentation of a semantically incongruent word (Glaser & Düngelhoff, 1984). Furthermore, the identification or recognition of a picture can either be improved when the subsequently-presented item (either a picture or a word) is semantically congruent with it (Glaser and Glaser, 1989, O’Connor and Potter, 2002), or else impaired when the subsequent item happens to be semantically incongruent (Glaser and Glaser, 1989, Intraub, 1984, Loftus and Ginn, 1984). Potter (1976) labeled the latter phenomenon conceptual masking.

These results have three important implications for our understanding of the processes underlying the identification of visually-presented pictures. First, the semantic information associated with a picture can be accessed within 100 ms of stimulus onset, maintained for a further 300 ms, and then either consolidated in memory or else rapidly forgotten (Intraub, 1999, Potter, 1975). Second, irrespective of the nature of the visual stimuli (e.g., picture or word), a shared abstract representation of the semantic information may be accessed if the stimuli are associated with a common concept (Potter & Faulconer, 1975). Therefore, the interactions that are seen at the level of semantic congruency between picture and word presumably rely on a common semantic system (see Glaser & Glaser, 1989). Third, the effects of supplementary visual information on participants’ picture identification performance are not constrained by the need for strict temporal coincidence; instead, a subsequently-presented stimulus (e.g., lagging by as much as 350 ms in Loftus and Ginn’s, 1984, study) can also influence performance. Potter (1993, 1999) proposed a module, termed conceptual short-term memory (CSTM), to account for the semantic processing of visual stimuli. Potter conceived of CSTM as a short-term buffer that serves as the interface between perception and memory. Once a visual stimulus has been identified, its meaning is accessed rapidly and maintained temporarily in CSTM. The maintained semantic representations are then structured into a coherent context in order to be consolidated in memory; unstructured (usually semantically incongruent) information is instead rapidly forgotten (Potter, 1999).

In the present study, we investigated semantic processing in the case of multisensory interactions between visual and auditory stimuli. We hypothesized that a common semantic system can be accessed by a given picture and a semantically-congruent stimulus presented in another sensory modality, such as audition (see Barsalou, 1999, Versace et al., 2009, on this issue). If this hypothesis is correct, the semantic congruency of an auditory stimulus may be expected to influence the identification of a visually-presented picture. Furthermore, we suggest that audiovisual semantic interactions should still occur even when the sound lags behind the picture, provided that the lag falls within a certain interval (around 300 ms).

Jackson (1953) was perhaps the first researcher to demonstrate the effect of semantic congruency on human information processing in a multisensory setting. He showed that the spatial ventriloquism effect (whereby the perceived source of a sound is pulled toward the location of a visual stimulus) occurred over larger spatial disparities for realistic pairs of stimuli (e.g., the whistling sound normally associated with a steam kettle and the sight of a steaming kettle) than for artificial pairings of stimuli (e.g., the sound of a bell and the illumination of a light). That is, it would appear that people can tolerate a larger spatial (or, for that matter, temporal) separation between congruent (as compared to incongruent or unrelated) pairs of audiovisual stimuli and still bind them. However, this explicit measure of audiovisual integration (namely, pointing to the apparent source of a sound) may unfortunately have reflected response bias or the particular decision strategies adopted by participants (see Bertelson and Aschersleben, 1998, Choe et al., 1975, Vatakis and Spence, 2007). This is probably the reason why, in subsequent studies in which the possible influence of such biases on performance has been ruled out, similar effects of audiovisual semantic congruency have not always been observed (e.g., Vatakis et al., 2008, Vatakis and Spence, 2008; though see Parise and Spence, 2009, Vatakis and Spence, 2007).

In addition, Koppen, Alsius, and Spence (2008) have recently shown that audiovisual semantic congruency does not modulate the Colavita visual dominance effect (Colavita, 1974). In this paradigm, participants respond accurately to visual and auditory targets when they are presented alone; however, they sometimes fail to respond to the auditory targets when these are presented together with visual targets (see also Hecht and Reiner, 2009, Spence et al., submitted for publication). Across three experiments, Koppen et al. demonstrated that the semantic congruency between the visual and auditory stimuli (congruent versus incongruent, such as the picture of a dog accompanied by the sound of a dog “woofing” or a cat “meowing”, respectively) had no effect on the magnitude of the Colavita visual dominance effect. That is, congruent sounds were detected no more accurately than incongruent sounds on those trials in which they were accompanied by a visual target.

On the other hand, there is some evidence from other experimental paradigms to show that the semantic congruency of visual and auditory stimuli can influence the latency of participants’ behavioral responses. For example, target detection latencies tend to be shorter when a visual target is accompanied by a semantically-congruent sound than when it is accompanied by a semantically-incongruent sound (Laurienti et al., 2004, Molholm et al., 2004, Suied et al., 2009). Visual search performance has also been shown to be facilitated when the display is accompanied by a sound that is semantically congruent with the visual target, as compared to when it is accompanied by a target-incongruent sound (see Iordanescu, Guzman-Martinez, Grabowecky, & Suzuki, 2008).

It should be noted here that the recent studies that have demonstrated an audiovisual semantic congruency effect have always done so in terms of the speed of participants’ behavioral responses. As a consequence, it is unclear whether the audiovisual semantic congruency effect reported in these studies should be attributed primarily to a perceptual or to a decisional level of processing. In the studies of Laurienti et al. (2004), Molholm et al. (2004), and Suied et al. (2009), the congruent visual and auditory stimuli corresponded to the same response key. It is therefore possible that, when information was provided both visually and auditorily (i.e., in the bimodal condition), the participant’s response might have been triggered by whichever stimulus happened to be perceived first (i.e., a statistical facilitation of responses; see Raab, 1962). All three of these studies ruled out statistical facilitation as the sole account of their data by comparing the cumulative distribution function of response times (RTs) in the bimodal condition to the predictions of the race model (see Miller, 1982). A further concern, though, is whether the facilitatory effect reported in the bimodal condition in these studies was truly due to semantic congruency or, alternatively, due to the visual and auditory stimuli both being linked with the same response (i.e., response redundancy; see Experiment 3 in Miller, 1982, on this point).
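
For reference, the race model inequality tested in these studies (Miller, 1982) can be stated as follows; this is a standard textbook formulation rather than a quotation from the original paper, with each term denoting the cumulative distribution function (CDF) of the RTs in the bimodal (AV) or unimodal (A, V) conditions:

P(\mathrm{RT} \le t \mid \mathrm{AV}) \;\le\; P(\mathrm{RT} \le t \mid \mathrm{A}) + P(\mathrm{RT} \le t \mid \mathrm{V}), \quad \text{for all } t.

If the bimodal CDF exceeds this bound at any t, then a race between independent unimodal processes (i.e., statistical facilitation) cannot, by itself, account for the observed speed-up.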

Suied et al. (2009, in the Appendix of their paper), for example, tried to disentangle semantic congruency from response redundancy by mapping semantically-incongruent stimuli (e.g., the picture of a phone and the “ribbit” sound made by a frog) to the same response key. Under this manipulation, the presentation of the telephone ringing sound did not speed up participants’ responses to the picture of the telephone (i.e., no semantic congruency effect was observed). Evidence supporting the view that response redundancy is critical to the shortened RTs observed in speeded tasks comes from the work of Sinnett, Soto-Faraco, and Spence (2008). In their study, even though the visual and auditory targets were always semantically incongruent (e.g., a picture of a traffic light paired with the sound of a cat “meowing”), when they were mapped to the same response key, RTs in the bimodal condition were nevertheless still significantly faster than the predictions of the race model. That is, even though the visual and auditory inputs were processed independently, their activations could still sum to reach the criterion for initiating a response (a result in line with the independent coactivation model; Miller, 1991). Therefore, one needs to exercise caution before explaining the RT data from speeded tasks in terms of semantic congruency effects, since such results are easily confounded with response redundancy (i.e., a decision-level processing effect; Santee and Egeth, 1982, Watt, 1991).

Another problem with the above-mentioned studies is that the visual target that participants had to detect, or search for, was clearly defined beforehand (Iordanescu et al., 2008, Koppen et al., 2008, Laurienti et al., 2004, Molholm et al., 2004, Suied et al., 2009). What this means in practice is that the participants in these studies may have been able to accomplish their task simply by recognizing the salient features of the target picture (e.g., four legs in the case of the picture of a dog) and then inferring the rest. Moreover, because the target was well defined in advance, a goal-directed control mechanism may have been active, guiding participants’ attention toward the task-contingent stimuli (e.g., Folk, Remington, & Johnston, 1992). That is, a semantically-congruent sound may have been selected for further processing, whereas a semantically-incongruent sound would have been ignored. The prediction here is that the facilitation elicited by the presentation of a semantically-congruent sound should be observed, whereas the interference elicited by a semantically-incongruent sound should be reduced or even eliminated, consistent with the results of both Iordanescu et al.’s (2008) and Molholm et al.’s (2004) studies.

In the present study, a picture identification task was used to test for the existence of an audiovisual semantic congruency effect. We consider this task to be less sensitive to the influence of response-level processing and goal-directed control than the speeded tasks used in previous research. On each trial, a picture from a corpus of at least 36 pictures was presented briefly and then immediately masked, and the participants had to try to identify the visual target. The accuracy of participants’ unspeeded identification responses was recorded as the dependent variable. One advantage of this task is that measuring the accuracy of participants’ performance with briefly-presented targets provides a more sensitive measure of perceptual processing than does measuring RTs to targets presented for a longer time (e.g., Norman and Bobrow, 1975, Prinzmetal et al., 2005, Santee and Egeth, 1982). This is because any guessing bias can be estimated and separated from participants’ performance, and hence the actual sensitivity of participants’ responding can be revealed (see Green & Swets, 1966). Another advantage of using a target identification task is that participants do not know the identity of the upcoming target on any given trial. The increased uncertainty regarding the identity of the target should therefore help to reduce the kind of attentional selection resulting from goal-directed control that may have been operating in previous studies (e.g., Yeh & Liao, 2008).
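
To illustrate how a guessing bias might be separated from sensitivity, the classical high-threshold correction for guessing (in the spirit of Green & Swets, 1966) for an identification task with n response alternatives can be written as follows; this is a textbook formula offered purely as an illustration under an assumed forced-choice format, not the analysis actually reported by the authors:

p_{\text{true}} = \frac{p_{\text{observed}} - 1/n}{1 - 1/n},

where p_{\text{observed}} is the observed proportion of correct identification responses and 1/n is the chance rate; the correction removes the contribution that random guessing would otherwise make to measured accuracy.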

The primary goal of the present study was therefore to provide convergent evidence regarding the influence of audiovisual semantic congruency on the identification of masked visual targets, and further to examine whether the source of any congruency effect was due to the facilitation of performance in the congruent condition, the impairment of performance in the incongruent condition, or to both effects operating simultaneously. Furthermore, in previous studies of audiovisual semantic congruency, the visual and auditory stimuli have always been presented simultaneously (Iordanescu et al., 2008, Koppen et al., 2008, Laurienti et al., 2004, Molholm et al., 2004, Suied et al., 2009). Given that the semantic congruency effect that has been reported within the visual modality can be induced by the presentation of both simultaneous and asynchronous stimuli, it seemed worthwhile to investigate whether the same was true in a crossmodal (i.e., audiovisual) setting as well. The stimulus onset asynchrony (SOA) between the picture and the sound was therefore manipulated; however, because we were primarily interested in the processing of a picture after it had been presented, rather than the semantic priming of a picture by sound (e.g., Schneider, Engel, & Debener, 2008), the auditory stimulus was only presented simultaneously or else delayed with respect to the onset of the picture. In Experiments 1 and 2, the onset of the sound occurred simultaneously with that of the picture. In Experiments 3 and 4, the onset of the sound was delayed by 307 and 533 ms with respect to the picture. In Experiment 5, the SOAs (0, 333, 533 ms) were randomized on a trial-by-trial basis and tested within the same group of participants. Based on the results of semantic congruency studies in visual perception, we predicted that any audiovisual semantic congruency effect should be observed at SOAs shorter than around 300 ms, but not at SOAs longer than 500 ms.


Experiment 1

Experiment 1 was designed to investigate whether the semantic congruency (congruent versus incongruent) of a sound would influence the ability of participants to identify briefly-presented line-drawing pictures. We predicted that the presentation of a semantically-congruent sound would facilitate, whereas the presentation of a semantically-incongruent sound would impair, participants’ performance relative to a control condition in which the picture was accompanied by a burst of white noise.

Experiment 2

There were two principal goals in Experiment 2: first, to try to replicate the audiovisual semantic congruency effect in the masked picture identification task reported in Experiment 1; and second, to compare the semantic congruency effect elicited by the presentation of a sound against a unimodal (i.e., visual-only) baseline condition.

Experiment 3

In this experiment, the onset of the sound occurred 307 ms after that of the picture. The prediction was that the semantically-congruent sound would still likely facilitate, while the semantically-incongruent sound would still inhibit, participants’ picture identification performance, as compared to the performance seen in the control noise condition.

Experiment 4

In this experiment, the onset of the sound was delayed by 533 ms with respect to that of the picture. We predicted that the semantic congruency of the sound would no longer have any influence on participants’ picture identification performance.

Experiment 5

In this experiment, we attempted to replicate the results of the previous experiments, but now using a within-participants experimental design. The SOAs (0, 333, or 533 ms; see Fig. 1B) were randomized on a trial-by-trial basis. This randomization procedure ensured that the participants were unable to predict (and, more importantly, to adapt to) the constantly-varying SOA between the picture and the sound.

General discussion

The results of the five experiments reported in the present study demonstrate that the semantic congruency of a sound can influence participants’ identification of a masked line-drawing picture. Specifically, when the picture and the sound were presented at the same time, participants’ picture identification performance was facilitated by the presentation of the semantically-congruent sound, as compared to when the picture was presented together with a burst of white noise, or in the absence of any sound (i.e., the unimodal visual baseline of Experiment 2).

Acknowledgment

Yi-Chuan Chen is supported by the Ministry of Education in Taiwan (SAS-96109-1-US-37).

References (138)

  • D. Alais et al. Separate attentional resources for vision and audition. Proceedings of the Royal Society B (2006)
  • K.M. Arnell et al. Cross-modality attentional blinks without preparatory task-set switching. Psychonomic Bulletin & Review (2002)
  • J.A. Ballas. Common factors in the identification of an assortment of brief everyday sounds. Journal of Experimental Psychology: Human Perception and Performance (1993)
  • L.W. Barsalou. Perceptual symbol systems. Behavioral and Brain Sciences (1999)
  • M.S. Beauchamp et al. Unraveling multisensory integration: Patchy organization within human STS multisensory cortex. Nature Neuroscience (2004)
  • M.O. Belardinelli et al. Audio-visual crossmodal interactions in environmental perception: An fMRI investigation. Cognitive Processing (2004)
  • P. Bertelson et al. Automatic visual bias of perceived auditory location. Psychonomic Bulletin & Review (1998)
  • I. Biederman. Perceiving real-world scenes. Science (1972)
  • H. Bowman et al. The simultaneous type, serial token model of temporal attention and working memory. Psychological Review (2007)
  • S.J. Boyce et al. Effect of background information on object identification. Journal of Experimental Psychology: Human Perception and Performance (1989)
  • S.J. Boyce et al. Identification of objects in scenes: The role of scene background in object naming. Journal of Experimental Psychology: Learning, Memory, and Cognition (1992)
  • M. Brand-D’Abrescia et al. Task coordination between and within sensory modalities: Effects on distraction. Perception & Psychophysics (2008)
  • J.S. Chan et al. Behavioral evidence for task-dependent “what” versus “where” processing within and across modalities. Perception & Psychophysics (2008)
  • C.S. Choe et al. The “ventriloquist effect”: Visual dominance or response bias? Perception & Psychophysics (1975)
  • F.B. Colavita. Human sensory dominance. Perception & Psychophysics (1974)
  • P. Dalton et al. Attentional capture in serial audiovisual search tasks. Perception & Psychophysics (2007)
  • J.L. Davenport et al. Scene consistency in object and background perception. Psychological Science (2004)
  • H.C. Dijkerman et al. Somatosensory processes subserving perception and action. Behavioral and Brain Sciences (2007)
  • J. Duncan et al. Restricted attentional capacity within but not between sensory modalities. Nature (1997)
  • C.L. Folk et al. Involuntary covert orienting is contingent on attentional control settings. Journal of Experimental Psychology: Human Perception and Performance (1992)
  • W. Fujisaki et al. Recalibration of audiovisual simultaneity. Nature Neuroscience (2004)
  • A. Gazzaley et al. Functional connectivity during working memory maintenance. Cognitive, Affective, & Behavioral Neuroscience (2004)
  • M.H. Giard et al. Auditory-visual integration during multimodal object recognition in humans: A behavioral and electrophysiological study. Journal of Cognitive Neuroscience (1999)
  • W.R. Glaser et al. The time course of picture-word interference. Journal of Experimental Psychology: Human Perception and Performance (1984)
  • W.R. Glaser et al. Context effects in Stroop-like word and picture processing. Journal of Experimental Psychology: General (1989)
  • M. Kubovy et al. Auditory and visual objects. Cognition (2001)
  • N. Lavie. Distracted and confused? Selective attention under load. Trends in Cognitive Sciences (2005)
  • A.J. Marcel. Conscious and unconscious perception: Experiments on visual masking and word recognition. Cognitive Psychology (1983)
  • A.J. Marcel. Conscious and unconscious perception: An approach to the relations between phenomenal experience and perceptual processes. Cognitive Psychology (1983)
  • J. Miller. Divided attention: Evidence for coactivation with redundant signals. Cognitive Psychology (1982)
  • M. Mishkin et al. Object vision and spatial vision: Two central pathways. Trends in Neurosciences (1983)
  • E.A. Murray et al. Role of perirhinal cortex in object perception, memory, and associations. Current Opinion in Neurobiology (2001)
  • M.M. Murray et al. Rapid discrimination of visual and multisensory memories revealed by electrical neuroimaging. NeuroImage (2004)
  • J. Navarra et al. Exposure to asynchronous audiovisual speech extends the temporal window for audiovisual integration. Cognitive Brain Research (2005)
  • D.A. Norman et al. On data-limited and resource-limited processes. Cognitive Psychology (1975)
  • Z.W. Pylyshyn. The role of location indexes in spatial perception: A sketch of the FINST spatial index model. Cognition (1989)
  • Z.W. Pylyshyn. Visual indexes, preconceptual objects, and situated vision. Cognition (2001)
  • C.L. Reed et al. What vs. Where in touch: An fMRI study. NeuroImage (2005)
  • G. Rees et al. Processing of irrelevant visual motion during performance of an auditory attention task. Neuropsychologia (2001)