Introduction

The visual system has limited capacity (Broadbent, 1958; Luck & Vogel, 1997). One way of using this limited capacity efficiently is ensemble representation. By summarizing redundant and complex information in a scene, the visual system can rapidly assess the scene's overall properties and extract its gist (Cavanagh, 2001; Chong & Treisman, 2003), despite the limits of focused attention (Cavanagh & Alvarez, 2005; Dux & Marois, 2009; Simons & Levin, 1997). Another way of coping with limited capacity is focused attention, which selects and processes relevant information (Carrasco, 2011; Chun, Golomb, & Turk-Browne, 2011), thereby reducing the load on the visual system.

These two separate modes of processing to deal with our capacity limit have been proposed before (Chong & Evans, 2011; Treisman, 2006) because of the following differences. First, the two modes serve different purposes: Ensemble perception is used for extracting the gist of a scene, while focused attention is used for recognizing a few relevant object(s). Second, they deal with the capacity limitation in different ways: Ensemble perception summarizes complex and redundant information, whereas focused attention filters out irrelevant information. Third, they have been suggested to use functionally different pathways: a non-selective pathway for ensemble perception versus a selective pathway for focused attention (Wolfe, Võ, Evans, & Greene, 2011).

Nevertheless, there have been some doubts about whether the ensemble processing mode, separate from focused attention, is necessary. For example, Myczek and Simons (2008) have shown that findings attributed to ensemble processing could be explained by a focused attention mode of sampling a few items. The noise-and-selection model of averaging also showed that observers’ averaging performance could well be described by averaging a few selected items within the limited capacity of focused attention (Allik, Toom, Raidvee, Averin, & Kreegipuu, 2013). Some studies have found that there are limits to the number of averages computed using the simultaneous-sequential paradigm (Attarha, Moore, & Vecera, 2014) and the pre-cueing method (Huang, 2015). Other studies have found that there are also limits to the number of items included in an average using the ideal observer analysis (Maule & Franklin, 2016) and the set-size manipulation (Ji & Pourtois, 2018).

These findings of capacity-limited ensemble processing led many researchers to investigate how many items contribute to averaging (Allik et al., 2013; Im & Halberda, 2013; Solomon, 2010; Solomon, Morgan, & Chubb, 2011; for a review, see Whitney & Yamanashi Leib, 2018). The estimated number of items varied widely depending on the types of stimuli to be averaged and other variations across studies. It was three in the case of size averaging (Allik et al., 2013; Solomon et al., 2011), but nearly 40 sizes in Lee, Baek, and Chong (2016); more than four for facial expressions (Haberman & Whitney, 2010); and up to 90 orientations (Dakin, 2001). To find a trend in these widely varied estimations, Whitney and Yamanashi Leib (2018) plotted the estimated number of items from 21 studies and found that observers average approximately the square root of the number of items in a display.
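As a rough illustration of the square-root trend reported by Whitney and Yamanashi Leib (2018), the arithmetic can be sketched as follows. The function name `approx_items_averaged` is hypothetical, and the rule is treated here as a simple approximation rather than a fitted model.

```python
import math


def approx_items_averaged(set_size):
    """Square-root rule of thumb: about sqrt(N) items for a display of N items.

    This is only the approximate trend described in the text, not a model
    fitted to any data set.
    """
    return math.sqrt(set_size)


if __name__ == "__main__":
    # Hypothetical display sizes chosen so that the square roots are whole numbers.
    for n in (4, 16, 64):
        print(n, approx_items_averaged(n))
```

Under this approximation, the proportion of items used shrinks as the display grows (half of a 4-item display, but only one eighth of a 64-item display).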

Thus, the visual system seems not to use all the available information for averaging. This could be because (1) only selected items contribute to averaging, (2) the averaging process is inaccurate or imprecise, or (3) both. Note that inaccurate averaging could arise from noise in the representation of individual items, noise in the averaging process, or both. Since most studies aimed to determine the capacity limit of ensemble processing, they focused on finding the maximum number of items included in averaging. This led previous studies to conclude that only some items contribute to averaging, whereas others do not contribute at all (Allik et al., 2013; Myczek & Simons, 2008; Solomon, 2010; Solomon et al., 2011). If we assume that attention selects the items that contribute to averaging (Allik et al., 2013), there is no need to assume two modes of coping with our limited capacity (ensemble perception and focused attention). Focused attention alone would then suffice as the strategy for coping with our limited capacity.

However, there are reasons to believe that the visual system has an ensemble processing mode, separate from focused attention. First, as we mentioned before, ensemble perception and focused attention serve different purposes, using different methods to cope with limited capacity. Second, some studies have found an improvement in the precision of averaging with increased set-sizes (Allik et al., 2013, see Footnote 1; Baek & Chong, 2020; Haberman & Whitney, 2010; Lee et al., 2016; Parkes, Lund, Angelucci, Solomon, & Morgan, 2001; Robitaille & Harris, 2011; but see also Ji & Pourtois, 2018). This is presumably because the noise of individual items could be cancelled out during the averaging task (Baek & Chong, 2020; Galton, 1907; Parkes et al., 2001; Sun & Chong, 2019). On the other hand, performance usually dropped with increased set-sizes in other tasks that required focused attention (e.g., conjunction search: Treisman & Gelade, 1980; visual working memory: Luck & Vogel, 1997), if a set-size exceeded the capacity limit. Indeed, using the same display, Robitaille and Harris (2011) found that averaging performance improved with increased set-sizes, whereas search performance deteriorated. This opposite trend of the set-size effect suggests a separate mode of processing (i.e., ensemble perception).
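The noise-cancellation argument can be illustrated with a minimal simulation. It is a sketch under simplifying assumptions that are not taken from the cited studies: every item is perceived with independent Gaussian noise of equal standard deviation, all items are averaged with equal weight, and no noise is added by the averaging process itself. The function name and parameter values are hypothetical.

```python
import random
import statistics


def sd_of_mean_estimates(set_size, item_noise_sd=1.0, n_trials=2000, seed=0):
    """Return the SD, across simulated trials, of the estimated mean.

    Assumptions (illustrative only): independent Gaussian noise on each item,
    equal weighting of all items, and no additional noise in the averaging step.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    true_value = 0.0  # every item shares the same true value in this toy display
    estimates = []
    for _ in range(n_trials):
        # Each item is perceived with its own independent noise sample.
        noisy_items = [rng.gauss(true_value, item_noise_sd) for _ in range(set_size)]
        estimates.append(statistics.fmean(noisy_items))
    return statistics.stdev(estimates)


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(n, round(sd_of_mean_estimates(n), 3))
```

Under these assumptions, the variability of the estimated mean falls roughly as the item noise divided by the square root of the set-size, so precision improves as more items are included. A task limited by a fixed selection capacity would not be expected to show this benefit.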

Third, we think that the visual system is not likely to use only selected items for averaging because even unselected information can contribute to visual processing (Treisman, 1969; Wolford & Morrison, 1980). Consistent with this claim, previous studies have shown that nearly all items contributed to averaging. Chong and his colleagues (Chong, Joo, Emmanouil, & Treisman, 2008) showed that averaging performance dropped when small samples, rather than the entire display, were given. The averaging performance depended on the number of visible items (Joo, Shin, Chong, & Blake, 2009) and on the quality of to-be-averaged items (Jacoby, Kamke, & Mattingley, 2013; Sun & Chong, 2019), and improved with the number of items included (Allik et al., 2013; Baek & Chong, 2020; Haberman & Whitney, 2010; Lee et al., 2016; Parkes et al., 2001; Robitaille & Harris, 2011). Alvarez and Oliva (2008) even found that all presented items had to be included in their average to achieve the observed precision of observers’ averaging performance. Finally, to-be-ignored items also contributed to averaging (Oriet & Brand, 2013). These results suggest that the visual system includes far more items for averaging than the limit of focused attention.

Finally, some averaging models (Allik et al., 2013; Dakin et al., 2005; Solomon, 2010; Solomon et al., 2011) and a capacity-estimation method (Rodriguez-Cintron, Wright, Chubb, & Sperling, 2019) assumed a variable capacity when estimating how many items contribute to averaging. In other words, the number of items contributing to averaging can vary across different set-sizes in these studies. We think that observers’ intrinsic capacity limit should not vary depending on set-sizes. Baek and Chong (2020) showed that observers’ averaging performance can be well described by a model with a fixed attention limit across different set-sizes. In this distributed attention model of averaging, each item contributes to averaging evenly, but its contribution decreases with increasing set-sizes owing to the fixed limits of capacity. This model (Baek & Chong, 2020) outperformed the noise-and-selection model with the assumption of variable capacity (Allik et al., 2013) in predicting observers’ performance of averaging. Thus, the averaging process is likely to consider all items evenly, rather than only a few selected items.

Thus, it seems that the visual system has two different modes of visual processing to cope with its limited capacity: ensemble perception and selective attention. How then does the visual system access so many items (i.e., over the limit of focused attention) for averaging, given its limited capacity? Hierarchically organized receptive fields in visual processing (Ungerleider & Bell, 2011) and population coding of individual items (Georgopoulos, Schwartz, & Kettner, 1986) may provide an answer to this question. If individual items are represented as a population code in a relevant stage of visual processing, population responses can be easily summarized as a Gaussian-shaped activity profile over them, and we can take the peak value as representing the mean (Fig. 1 bottom left). At the same time, if selection of individual object(s) is necessary for object recognition, the visual system can narrow a Gaussian profile down to relevant responses among population responses (Fig. 1 bottom right).

Fig. 1. Two modes of attention (adapted from Cha & Chong, 2018). Each box is an imaginary map of responses to features over locations in one-dimensional space. Darker shades indicate stronger responses. Population responses are depicted below each box in a line-graph. The visual system can summarize population responses using a distributed attention mode (left) and focus on relevant responses for object recognition using a focused attention mode (right).

This idea is schematically demonstrated in Fig. 1. Incoming visual inputs can be represented as population responses that reflect their magnitude and quality depending on locations (Fig. 1 top). If the visual system requires a statistical summary, it can use the distributed attention mode to read it out from population responses (Fig. 1 bottom left). If attention is applied to a broader region in a wide Gaussian profile, responses within a region can be summarized. Likewise, the visual system can use the focused attention mode to select important responses for object recognition (Fig. 1 bottom right). If attention is applied to a specific region in a narrow Gaussian profile, selected responses will increase and thus be distinguished from others. Previous studies have also suggested the use of a population code to represent statistical summaries (Chong & Treisman, 2003; Hochstein, Pavlovskaya, Bonneh, & Soroker, 2018).
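The two readout modes can be sketched with a toy example. This is not an implementation of the model in Fig. 1: it assumes, for illustration only, that each location carries a single feature value, that attention is a Gaussian weighting over locations, and that the readout is the attention-weighted average of the feature values. All names and stimulus values below are hypothetical.

```python
import math


def attention_weights(locations, center, width):
    """Gaussian attention profile over locations, normalized to sum to 1."""
    raw = [math.exp(-0.5 * ((loc - center) / width) ** 2) for loc in locations]
    total = sum(raw)
    return [w / total for w in raw]


def readout(values, locations, center, width):
    """Attention-weighted average of the feature values (a toy readout)."""
    weights = attention_weights(locations, center, width)
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical one-dimensional display: eight locations, one size per location.
locations = [0, 1, 2, 3, 4, 5, 6, 7]
sizes = [2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 1.5, 4.5]

# Distributed mode: a very wide profile weights all items almost equally,
# so the readout approximates the mean of the whole display.
distributed = readout(sizes, locations, center=3.5, width=100.0)

# Focused mode: a narrow profile centered on location 3 is dominated by
# the item at that location.
focused = readout(sizes, locations, center=3, width=0.2)

if __name__ == "__main__":
    print("distributed readout:", distributed)
    print("focused readout:", focused)
```

In this sketch the only difference between the two modes is the width of the attention profile. The wide profile yields a value close to the display mean, whereas the narrow profile yields a value close to the single attended item.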

The two different readout mechanisms (distributed and focused attention modes) are different manifestations of a single attention system and can be hierarchically organized. Depending on the purpose of an ongoing task, the visual system flexibly deploys attention in two modes: distributed attention is used for extracting the gist of a scene, while focused attention is used for recognizing specific objects. Consistent with this idea, Cha and Chong (2018) found that observers averaged only relevant orientations for surface perception, suggesting the use of a different mode of attention to access relevant information. Since attention is involved in multiple stages of visual processing (Kastner & Ungerleider, 2000), these two modes can still interact with each other. For example, an attended object contributes to averaging more than unattended objects (Choi & Chong, 2019; De Fockert & Marchant, 2008).

In conclusion, we propose that the visual system has two separate modes of efficiently managing its limited capacity. Ensemble representation provides a summary of complex and redundant information of a scene, whereas focused attention selects important information from a scene to recognize a few objects. Based on population responses of individual objects, the visual system can read out either a statistical summary for gist perception or crucial information for object recognition.