1 Definition and Overview

Face perception is a critical skill for survival: Among other information gathered from the face, specifying identity is necessary for deciding whether an individual is a known ally or enemy. Consequently, the cognitive demands of face perception differ from most instances of object recognition: Unlike objects, which are typically identified at the category level (e.g., “chair”; Rosch et al. 1976), recognizing faces as individuals (e.g., “Bob”) is essential in day-to-day interactions. But all faces consist of the same kinds of features (eyes, nose, and mouth) in the same configuration (eyes above nose, nose above mouth). Thus, one challenge of face recognition is to successfully individuate a large number of visually similar objects while at the same time generalizing across perceptual features that are not critical to identity, such as differences in illumination or emotional expression. Ultimately, we master this task, but we continue to improve over many years of experience, with recent work suggesting that face recognition abilities do not peak until after 30 years of age (Germine et al. 2011).

The sociobiological necessity of individuating and differentiating group members combined with the differences in cognitive demands for faces compared with other object categories has led to specialization for face processing. For example, despite the limitations of the newborn visual system in terms of visual acuity, contrast sensitivity, and spatial frequency range (Nelson 2001), infants are able to discriminate their mother’s face from the face of strangers on the basis of visual information alone (e.g., Bushnell et al. 1989; Field et al. 1984). Infants also exhibit a more general visual preference for faces and spend more time looking at or tracking faces—including schematic faces—compared with other highly salient visual stimuli (e.g., Goren et al. 1975; Johnson et al. 1991; Maurer and Young 1983; Morton and Johnson 1991; Valenza et al. 1996; but see Easterbrook et al. 1999). These findings are sometimes taken as evidence that there is an innate face module; infants are born with a subcortical mechanism that contains structural information about faces and is responsible for orienting responses to objects that match this structure (Slater and Kirby 1998; but see Simion et al. 2007 for an alternative account proposing that infant preferences reflect general properties of the developing visual system).

Specialization for face perception is also supported by the discovery of neurons in the inferotemporal cortex (IT), particularly the superior temporal sulcus (STS), of non-human primates that respond preferentially to faces (Bruce et al. 1981; Gross et al. 1972; Perrett et al. 1982). These “face cells” are systematically organized within visual areas (e.g., cells within the same cortical column, across the six cortical layers, respond to similar head views); they generally do not respond to other visual or arousing stimuli, and their responses to faces are largely unaffected by image transformations (e.g., gray scale, size). In general, as long as a face is easy to perceive, face cells respond with little modulation; when faces are difficult or impossible to see, face cell activity is greatly reduced or eliminated. One notable exception is profile views, which reduce or eliminate face cell responses despite being easily recognized as faces. Individual face parts also elicit responses from face cells, although most face cells respond more strongly to stimuli containing multiple face features than to stimuli containing a single face feature (Perrett et al. 1982). Indeed, cell tuning to individual features is enhanced when those features are presented within a whole-face context that includes other face features (Freiwald et al. 2009). Thus, there is an underlying organization of cells in the primate brain that respond selectively to faces.

Evidence for cortical areas specialized for face perception is also provided by adults suffering from prosopagnosia, an impairment in face recognition. Cases of prosopagnosia differ greatly in etiology: Prosopagnosia can be present from childhood (developmental or congenital prosopagnosia; McConachie 1976), or it can be the result of brain injury, stroke, or degenerative disease that affects occipitotemporal brain regions, specifically the fusiform gyrus, after normal face recognition skills have been acquired (Farah 1990). Critically, prosopagnosia is characterized by impaired face recognition while object recognition abilities are relatively spared and elementary visual processing remains intact. The existence of an impairment that disproportionately affects face recognition and at least one documented case of the opposite impairment (spared face recognition in the presence of object recognition deficits; Moscovitch et al. 1997) suggests that there is an anatomically segregated system dedicated to face recognition; it is possible to incur damage to this specific region without damaging brain areas necessary for general object recognition, and vice versa.

In summary, there is strong neurophysiological and neuropsychological evidence for specialization of parts of the visual system for face processing. Next, we turn to behavioral and cognitive neuroscience methods that afford more flexibility and experimental control and which have been used to study the nature of the mechanisms underlying such specialization.

2 Cognitive and Neural Mechanisms of Face Perception

Consistent with the location of brain lesions in prosopagnosic patients and face-selective cells identified by single-cell recordings in non-human primates, brain imaging studies in healthy adults have revealed several distinct regions in the brain that show more activity in response to faces relative to other objects, including scrambled faces: the STS, regions in the occipital fusiform area (OFA), the lateral fusiform gyrus (Puce et al. 1995, 1998; Sergent et al. 1992; Kanwisher et al. 1997), and the anterior temporal lobe (aIT) (Gauthier et al. 1999b; Sergent et al. 1992). The lateral fusiform gyrus has received the most attention, and the selectivity in this region is so robust that it has been named the fusiform face area (FFA; see Weiner and Grill-Spector 2011, for a review). Note that these regions are defined functionally, most often using a comparison of activation in response to images of faces versus a baseline of other objects or scenes. The label FFA, for instance, often maps onto two separate areas of activity, about 15 mm apart along the posterior–anterior axis of the fusiform gyrus (Pinsk et al. 2009; see Fig. 11.1). Recent advances in the spatial resolution of fMRI, from cubic voxels 3 mm on each side down to 1.5 or even 1 mm, have led to the proposal of a topography of relatively face-selective and body part-selective areas in high-level visual cortex that could help standardize the labeling of functional areas across studies (Weiner and Grill-Spector 2011). The applicability of this scheme remains difficult to evaluate because most studies do not localize body part-responsive areas. This is not the place to discuss all of the evidence relevant to the functional role of these different regions (see for instance Haxby et al. 2000 for an influential model), but it is generally suggested that the FFA represents an intermediate stage of processing in a ventral temporal route for face perception that is critical to the representation of individual faces (Gauthier et al. 2000), between the OFA, which seems to represent facial features (Pitcher et al. 2011), and the aIT, where individual faces elicit even more distinct response patterns (Kriegeskorte et al. 2007). However, new evidence based on dynamic causal modeling of fMRI data suggests that inputs may reach the OFA and FFA in parallel, with the two regions reciprocally connected (Ewbank et al. 2012). Our understanding of the anatomical and functional organization of high-level visual areas will no doubt continue to evolve in the coming decades.

Fig. 11.1

Example of functional localization of face-selective areas in an individual subject, using a comparison of faces to various objects, in a task where subjects detect 1-back repetitions of identical images presented foveally, one per second. Other face-selective areas in the superior temporal sulcus and anterior temporal pole are also found but are not visible in this slice. Image courtesy of Rankin McGugin

Moving on to human studies using methods with even better temporal resolution, there is also evidence that an ERP component measured at occipitotemporal electrodes approximately 170 ms after stimulus presentation is face-specific. This component, called the N170, is found for faces but not for other categories of objects, such as cars or butterflies (Bentin et al. 1996; Eimer 1998), and is significantly reduced in prosopagnosic individuals (Eimer and McCarthy 1999). Similar to the pattern of single-cell recordings in non-human primates, the N170 is sensitive to the presentation of an intact face, showing a smaller amplitude and longer latency when face parts are presented in isolation (Bentin et al. 1996).

Neural activity in the FFA and the N170 ERP component are associated with two key behavioral signatures of face perception: holistic processing and the inversion effect. Holistic processing refers to the fact that faces are processed as unified wholes rather than as collections of features. The strongest evidence that faces are processed holistically comes from studies based on the composite illusion (see Fig. 11.2), in which participants are unable to selectively attend to a single face half and ignore information in the rest of the face, even when instructed to do so, and even when a failure to do so is disadvantageous for performance (Young et al. 1987; Farah et al. 1998): participants cannot ignore irrelevant information in a face because faces are processed as wholes. Holistic processing facilitates the extraction of information about spatial relations that goes beyond the shape of individual parts or their coarse configuration, enabling more rapid identification of visually similar objects, consistent with the unique goals of face perception. Indeed, holistic processing is observed for faces but not for non-face objects (Farah et al. 1998; Richler et al. 2011). Supporting the role of holistic processing in successful face recognition, recent work has shown that people who process faces more holistically are better at recognizing faces (Richler et al. 2011).

Fig. 11.2

Composite illusion. Participants are slower to name the top face half (“George Clooney”) when it is aligned with a bottom face half belonging to a different individual (e.g., Brad Pitt) compared to when the parts are misaligned

The inversion effect refers to the finding that inversion disrupts memory for faces more so than it does for other objects that have a clear canonical orientation (e.g., houses; Yin 1969; Carey and Diamond 1977; Valentine and Bruce 1986). In other words, although all mono-oriented objects show a processing advantage when upright, the difference in performance between upright and inverted is more pronounced for faces. One explanation for this phenomenon is that inversion disrupts the perception of metric distances between features (e.g., interocular distance) more so than the perception of individual local features (Leder and Bruce 2000; Searcy and Bartlett 1996; Rhodes et al. 1993). Because information about precise spatial relations between features is especially critical to face perception, inversion is particularly disruptive to performance.

An inversion effect is also observed in the FFA: Although the FFA responds preferentially to both upright and inverted faces (Kanwisher et al. 1998), FFA activity is reduced for inverted relative to upright faces (Gauthier et al. 1999b; Yovel and Kanwisher 2005). These results are consistent with behavioral work showing that both upright and inverted faces are processed holistically, but overall performance is reduced for inverted faces (Richler et al. 2011; Sekuler et al. 2004). Moreover, longer presentation times are required to obtain holistic effects (Richler et al. 2011) and to achieve above-chance identification performance (Curby and Gauthier 2009) for inverted versus upright faces, findings that map remarkably well onto the delay in the N170 response when faces are inverted (Bentin et al. 1996; Rossion et al. 2000). Taken together, these results suggest that upright and inverted faces are processed in a qualitatively similar manner, but our more extensive experience with upright faces leads to an advantage in processing efficiency over inverted faces, promoting better performance.

In summary, behavioral and neural evidence link the mechanism specialized for face perception to holistic processing and reveal that this mechanism operates most efficiently for upright faces, although its action also generalizes to inverted faces, for which it is less effective. While such work attempts to capture what differs between faces and non-face objects, the results on inversion illustrate that the domain of operation of this mechanism is not all or none. Next, we review efforts to understand when non-face objects can be processed using the same mechanism as faces, and what this suggests about the nature of the phenotype.

3 Face Perception as a Behavioral Phenotype

Decades of research have established that face perception is supported by specialized cognitive and neural mechanisms. More recent work has shown that performance on face processing tasks is more strongly correlated between monozygotic than between dizygotic twins (Wilmer et al. 2010; Zhu et al. 2010). Together with growing evidence that developmental prosopagnosia is hereditary (de Haan 1999; Grueter et al. 2008), these results suggest that face perception is a heritable cognitive ability. But what exactly is being inherited?
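The inferential logic behind such twin comparisons can be sketched with Falconer's classic ACE decomposition, in which heritability is estimated as twice the difference between monozygotic and dizygotic twin correlations. The correlations below are hypothetical placeholders for a face recognition test, not values from the studies cited above:

```python
def falconer_ace(r_mz, r_dz):
    """Falconer's estimates from twin correlations (classical ACE model).

    Returns (a2, c2, e2): additive-genetic, shared-environment, and
    nonshared-environment (plus measurement error) variance components.
    Assumes equal environments across zygosity and no dominance effects.
    """
    a2 = 2 * (r_mz - r_dz)  # MZ twins share twice the additive genetics of DZ twins
    c2 = r_mz - a2          # MZ similarity not explained by genetics
    e2 = 1 - r_mz           # everything not shared by MZ twins, including error
    return a2, c2, e2

# Hypothetical twin correlations on a face recognition test
a2, c2, e2 = falconer_ace(r_mz=0.60, r_dz=0.35)  # roughly (0.50, 0.10, 0.40)
```

Under this logic, the larger the gap between MZ and DZ correlations, the larger the estimated genetic contribution; a reported dissociation between face and object tasks then amounts to the claim that the genetic component is substantial for faces but not for the control category.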

On the one hand, individual differences in face perception have been shown to be dissociable from object perception, suggesting that face perception is a domain-specific heritable skill. In twin studies, performance on face processing tasks is unrelated to performance on tasks with other visual objects (Wilmer et al. 2010) or more general cognitive abilities (Zhu et al. 2010). Variability in performance on face versus object processing tasks is sometimes found to be independent (Furl et al. 2011; Garrido et al. 2009), and variability associated with performance with faces, but not objects, predicts overall activity (Furl et al. 2011) and gray matter volume (Garrido et al. 2009) in the FFA.

But such evidence does not exhaust all the possible ways that face perception could be related to more domain-general skills. Indeed, while “faces” can approximately be considered one domain, “objects” cannot: Performance with one object category (e.g., cars) can be relatively independent from performance with another object category (e.g., birds; Gauthier et al. 2000; Bukach et al. 2010). The fact that recent twin studies (Wilmer et al. 2010; Zhu et al. 2010) base their conclusion that face recognition is a domain-specific heritable skill on a comparison with a single category of non-face objects is therefore problematic; an approach that compares faces to a single object category does not reveal potential differences among non-face categories themselves. As an example, the Thatcher illusion, in which it is difficult to detect that local features (e.g., eyes, mouth) have been inverted when the entire face is presented upside down (see Fig. 11.3), was believed to be face-specific because the illusion was larger for faces than for a single non-face category (Thompson 1980). However, the illusion for faces is not exceptionally large compared to the distribution obtained when many non-face categories are used (Wong et al. 2010). Therefore, it is not sufficient to claim that a face-specific phenotype has been found based solely on evidence that performance with faces differs from that with a single non-face category, because such a contrast does not capture the regular variability that exists between different non-face object categories. Thus, a unique challenge that arises when attempting to measure a potential domain-specific phenotype is to properly characterize domain specificity itself.

Fig. 11.3

Thatcher illusion. It is difficult to detect that local features (e.g., eyes, mouth) have been inverted when the entire face is presented upside down compared to when the face is presented upright
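The statistical point about single-category comparisons can be made concrete with a toy computation: rather than asking whether the face effect exceeds one control category's effect, ask where it falls within the distribution of effects across many categories. All effect sizes below are invented for illustration and are not data from Wong et al. (2010):

```python
import statistics

# Invented Thatcher-illusion effect sizes for ten non-face object categories
object_effects = [0.12, 0.18, 0.25, 0.31, 0.22, 0.27, 0.15, 0.35, 0.29, 0.20]
face_effect = 0.33  # invented face effect

mu = statistics.mean(object_effects)
sd = statistics.stdev(object_effects)
z = (face_effect - mu) / sd  # where faces fall in the between-category distribution

# Faces beat an arbitrary single control category, yet are not an
# outlier relative to the spread across categories (|z| < 2 here):
exceeds_single_category = face_effect > object_effects[0]  # True
is_exceptional = abs(z) > 2                                # False
```

With these numbers, the face effect is larger than the first control category's effect yet sits only a little more than one standard deviation above the category mean, illustrating how a single-category contrast can overstate face specificity.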

Of particular relevance to determining the domain specificity of face recognition is the distinction between objects and objects of expertise. In contrast to faces, which tend to be processed at the individual level, objects are typically categorized at the basic level (Rosch et al. 1976). However, this is not always, nor does it have to be, the case. For example, an avid birder’s goal is not simply to spot a bird, but rather to identify its species. Accordingly, bird experts can be as fast to categorize birds at the subordinate level (e.g., robin) as at the basic level (e.g., bird; Tanaka and Taylor 1991). Moreover, individuals with extensive real-world experience individuating non-face objects within a visually homogeneous category (e.g., cars or birds) process them more like faces: Objects of expertise are processed holistically (Bukach et al. 2012), and quantitative measures of expertise predict the magnitude of several neural and behavioral signatures of face perception, such as the inversion effect (Curby and Gauthier 2009), activity in the FFA (Gauthier et al. 2000; Xu 2005; Engel et al. 2009; McGugin et al., submitted A), and the magnitude of the N170 ERP component (Gauthier et al. 2003). Therefore, while individual differences in face recognition may dissociate from individual differences in object recognition, with the latter being less heritable (Furl et al. 2011; Garrido et al. 2009; Wilmer et al. 2010; Zhu et al. 2010), these results may not hold when using objects of expertise.

Of course, these similarities do not necessarily mean that the perception of faces and objects of expertise are related abilities. However, recent research suggests that the perception of faces and the perception of objects of expertise do not merely occupy brain real estate in roughly the same neighborhood, but are in fact not functionally independent. For example, when face targets are interspersed among task-irrelevant cars, interference from the car distractors is observed as a function of car expertise, with car experts showing more interference than car novices (McKeeff et al. 2010; McGugin et al. 2011a; see also Gauthier et al. 2003; Rossion et al. 2004). Put more simply, processing non-face objects of expertise disrupts face perception. Such evidence suggests that performance with both faces and objects of expertise reflects a common mechanism that supports holistic processing.

Research suggests that holistic processing may develop in response to the individuation demands that are similar in these domains. That is, provided sufficient practice at the task of individuating visually similar objects, holistic processing mechanisms, to the extent that they are available to a given individual, appear to be recruited. Indeed, in addition to face recognition deficits, individuals with prosopagnosia often exhibit difficulty discriminating between items within visually homogeneous non-face categories, such as cars or birds (Bornstein 1963; Damasio et al. 1982; see also Gauthier et al. 1999a). The idea that face perception is closely related to individuation is supported by training studies, where, unlike with real-world experts, the precise kind and amount of experience can be carefully controlled. These studies demonstrate that behavioral and neural signatures of face perception are obtained for novel objects following individuation training (Gauthier et al. 1998; Wong et al. 2009a, b), while training regimens that teach categorization based on simple dimensions or local features, and thus do not require individuation, do not produce face-like outcomes (McGugin et al. 2011b; Wong et al. 2009a). These results suggest that one important property of high-level visual areas, including putative face-selective regions, is their functional flexibility: they can be tuned by experience.

Thus, the alternative to a face-specific phenotype is that the observed genetic differences in face perception are the result of a more general aptitude for a particular kind of visual learning that happens to be critical in face perception. In fact, this phenotype may be most fully realized in face perception: Faces are a category for which sufficient exposure, coupled with motivating factors, may lead most individuals to realize their full potential. For this very reason, while face recognition ability may be one good measure of this phenotype, it is not sufficient for interpreting what the phenotype is about; it is difficult to find a non-face domain where the motivation and opportunity to develop expertise is universally high, and any relationship between faces and objects of expertise will break down if there is significant variability in subjects’ experience. For example, a birder’s individuation ability for birds will reflect a combination of the individual’s aptitude for learning subtle visual distinctions, their motivation to do so, and the intensity and duration of their efforts. Indeed, face recognition deficits in autism have been attributed to a breakdown in the normal acquisition of face expertise due to a lack of social motivation to attend to faces (Schultz 2005). One case study of a boy with autism revealed that while his FFA was not responsive to faces, FFA activity was elicited in response to Digimon characters, a category of objects in which the boy showed intense interest (Grelotti et al. 2005). These results support the notion that although individuals with autism show impaired face recognition, the underlying cognitive phenotype, individuation learning, remains intact. Therefore, training studies present an optimal approach to studying whether there is a domain-general behavioral phenotype related to individuation learning.

Having identified a potential domain-general ability that may underlie the heritability of face recognition, we must now turn to the issue of how to actually measure this ability. To properly characterize a behavioral phenotype, a measure that successfully taps into the construct of interest, and that does so reliably within an individual, is essential. Unfortunately, important constructs in face recognition that have since been applied to objects of expertise have been poorly measured in the past, in ways that do not capture the underlying trait of interest. For example, the composite task is the most popular measure of holistic processing. Yet one often-used version of this task (the partial design) has been shown to track response biases that are not stable and that can be influenced by task factors independently of the construct of interest, namely perceptual interference due to holistic processing (e.g., Cheung et al. 2008; Richler et al. 2011b, c). Although an alternative version (the complete design) does not suffer from issues related to response biases and has been successfully used in studies of individual differences (McGugin et al., submitted B; Richler et al. 2011a), it too is not ideal for an individual differences approach: Holistic processing is operationalized as a difference of differences, and difference scores can often be less reliable than their component parts (Thorndike et al. 1991; see Zhu et al. 2010, supplemental information, for an example). Thus, measuring the trait of interest is difficult even with faces, a category with which people have many years of experience (e.g., Germine et al. 2011). These issues will become even more important in research on a more domain-general aptitude, where the goal is to find meaningful variance in individual performance that can be captured in brief learning studies.
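The concern about difference scores follows from classical test theory: the reliability of a difference between two correlated measures can be far lower than the reliability of either component. A minimal sketch, using the standard equal-variance formula with illustrative numbers (not estimates from any cited study):

```python
def difference_score_reliability(r_xx, r_yy, rho):
    """Reliability of the difference D = X - Y under classical test theory,
    assuming X and Y have equal variances. r_xx and r_yy are the component
    reliabilities; rho is the correlation between X and Y."""
    return (r_xx + r_yy - 2 * rho) / (2 - 2 * rho)

# Two reliable (.80) condition scores that correlate .60 yield a
# markedly less reliable difference score:
rel_d = difference_score_reliability(0.80, 0.80, 0.60)  # approximately 0.50
```

Note that this applies to a simple difference; the complete-design measure of holistic processing is a difference of differences, which compounds the problem, since the intermediate difference scores are themselves the unreliable components of the final score.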

Additionally, recent work has found striking sex differences in object recognition, with females showing superior performance with some categories and males showing superior performance with others. Critically, the relationship between individual differences in face recognition and object recognition is mediated by an interaction with sex (McGugin et al., submitted B): In other words, face recognition performance is correlated with object recognition performance only for sex-congruent categories (e.g., cars in men, birds in women). Although the causes of these sex differences are unknown (and, intriguingly, they are unrelated to self-reported experience and interest; see also Dennett et al. 2011), these results demonstrate that using a single category of control non-face objects to draw conclusions about the domain specificity of what is heritable in face recognition is unlikely to be sufficient.

4 Conclusion

Face recognition is a task for which there is strong evidence of specialization in the primate brain. A great deal of work in cognitive psychology and cognitive neuroscience has linked face processing to a particular perceptual strategy, holistic processing, which reflects observers’ inability to selectively attend to parts of a face while ignoring the other parts. There is evidence suggesting that the system supporting holistic processing is not unique to faces, as objects from a non-face category can be processed holistically and engage face-selective responses in the brain when individuals have had extensive experience individuating objects from that category. At least some of the abilities supporting face recognition appear to be heritable, but existing data do not yet allow us to conclude whether the ability is specific to faces or more general, reflecting the ability to learn to process objects holistically. The case of face recognition illustrates special difficulties inherent in establishing evidence for a domain-specific phenotype, such as the need to compare the putative domain to more than one control object category and the importance of considering large differences in experience when comparing performance across domains.