At no two moments in time are we presented with the same world: Objects move, plants and animals are born and die, friends come and go, the sun rises and sets, and so on. More abstractly, while some of the rules that describe our world—like physical laws—are invariant over the course of our everyday experience, others—like legal rules—are not. Given some appropriate time scale, certain characteristics of an entity or class of entities can change; moreover, they may tend to change in systematic ways. For instance, the features that describe phones have changed considerably over recent decades: Not only do modern phones perform many new functions, they are also physically smaller, sleeker, and smoother. Not surprisingly, people’s expectations about category members change to suit the environment as it stands: If asked to describe a phone in 2012, few people would refer to a rotary dial, but in 1970, nearly everyone would.

In one sense, nothing is surprising about this observation. However, the changeable nature of many of the concepts and categories with which humans must interact has not been greatly emphasized in the categorization literature (but see Elliott & Anderson, 1995). In category-learning experiments, it is generally assumed that the underlying category is more or less static, and as such, the order in which one encounters category members should not matter to a rational learner. In statistics, this is referred to as the assumption of exchangeability, and for reasons of simplicity, it is generally the default assumption. Current probabilistic models of categorization make this assumption quite explicitly (e.g., Griffiths, Sanborn, Canini, & Navarro, 2008; Sanborn, Griffiths, & Navarro, 2010), and to the extent that standard exemplar and prototype models of categorization can be viewed as kinds of probabilistic models, they can also be seen to abide by this assumption (Ashby & Alfonso-Reese, 1995; Griffiths et al., 2008).

Perhaps because exchangeability is assumed in most real-world data analysis, it is generally taken to be a normative standard. However, human learners are sensitive to the order in which category members are observed, a sensitivity that appears to violate this normative standard. One way to account for order effects in cognitive models is to use learning rules that are sensitive to order. Some such rules can be viewed as modifications to standard probabilistic models. For instance, highlighting effects can be accounted for by assuming that people follow a “locally Bayesian” learning rule (Kruschke, 2006), whereas primacy effects can be captured by using a particle filtering learning rule (Sanborn et al., 2010). Another approach is to adopt connectionist, error-driven learning rules, which implicitly assume that recent items are more salient, and so are able to capture some kinds of recency effects, often better than a simple recency-weighting strategy would (e.g., Nosofsky, Kruschke, & McKinley, 1992; Sakamoto, Jones, & Love, 2008). A third approach is to alter the underlying stimulus representation: For instance, certain recency effects can be captured by the assumption that people track the differences between successive observations (Stewart, Brown, & Chater, 2002).

Although all of these approaches endeavor to account for order effects, it remains difficult to say whether it is rational to be sensitive to stimulus order. Many studies have avoided any explicit discussion of whether this sensitivity should be called normative (e.g., Stewart et al., 2002), others have argued that it reflects the cognitive limitations of the human learner (Sakamoto et al., 2008; Sanborn et al., 2010), and still others have suggested that order-sensitive learning rules are necessary if the learner is to be able to adapt to a changing world (Elliott & Anderson, 1995; Nosofsky et al., 1992). The latter perspective is mirrored rather explicitly in the literature on sequential effects (Yu & Cohen, 2009) and change detection (Brown & Steyvers, 2009).

Regardless of their views on the rationality of order effects, the literature shows a great deal of uniformity; in particular, the models that best capture human performance weight the observations by their recency. In most models, the weight assigned to a particular observation tends to decay approximately exponentially as a function of age (Brown & Steyvers, 2009; Nosofsky et al., 1992; Yu & Cohen, 2009), although in some cases the decay has been proposed to be a power function (Elliott & Anderson, 1995), which would be more in keeping with the literature on memory (Rubin, Hinton, & Wenzel, 1999; Wixted & Ebbesen, 1997). A similar outcome exists in the judgmental forecasting literature, which examines how people perceive and extrapolate time series data: Once again, an exponential weighting rule appears to account for human performance (see Goodwin & Wright, 1994, and Lawrence, Goodwin, O’Connor, & Onkal, 2006, for overviews). An exponential weighting scheme also emerges from the literature on stimulus generalization (Shepard, 1987).

The combined weight of this work suggests that weighting more recent items according to an exponential function is both theoretically and empirically justifiable. In fact, if the world changes at unpredictable times and in arbitrary ways, an exponential weighting scheme is close to optimal (e.g., Yu & Cohen, 2009). However, the world does not always change in an unpredictable fashion. For instance, the changes to the category of “phone” have been at least partially predictable: Newer phones tend to be faster, smaller, and more technologically capable. In this example, at least, it is clear that people have a strong expectation that the category will change systematically and in a particular direction. This ability to extrapolate the direction of future change cannot be explained by assuming that sensitivity to order emerges simply from weighting more recent items more highly. A recency explanation would correctly predict that, for example, the iPad 4 would be more similar to the iPad 3 than to the iPad 2. What it does not predict, however, is that people expect the iPad 4 to be systematically different from (e.g., faster than) both. In order to capture this effect, we need to move beyond simple recency toward an explanation based on the idea that people detect trends and extend them.

The goal of this article is to address these open questions by investigating how people learn categories that change in a systematic fashion over time. We begin by presenting an experiment that involves a simple “linear change” pattern, using a standard supervised classification task, and then follow it up with an experiment in which participants were required to generate new category members. In both experiments, we found evidence that participants were sensitive to a systematic pattern of change in the observations that they were shown.

Experiment 1

In this section, we present a category-learning experiment in which people were presented with fairly obvious and systematic temporal changes, in order to investigate how well people would learn to anticipate those changes.

Method

Participants

A group of 59 participants were recruited through a mailing list whose members consist primarily of current and former undergraduate psychology students. They were paid $10/h for their time. The median age was 23, and the participants were predominantly (63%) female.

Materials and procedure

The learning task was a standard supervised classification experiment, performed on a computer. The stimuli were little cartoon objects (“floaters”), which were displayed floating above a horizontal line (“the ground”). The stimuli varied from one another only in the height at which they floated. An example floater is shown in Fig. 1.

Fig. 1 Example of the “floater” stimuli used in Experiment 1

On each of 100 trials, the participants were shown a single floater and asked to predict whether it would flash red or blue. After making their prediction, they would receive feedback for 2 seconds while the floater flashed the appropriate color. As the left panel of Fig. 2 illustrates, as the experiment progressed, all of the stimuli shown to people tended to rise, regardless of which category they belonged to. In the figure, black circles correspond to items that belonged to the “high” category, and white circles correspond to floaters that belonged to the “low” category. Assignment of the flash color (red or blue) to each category (high or low) was randomized across participants.

Fig. 2 Experimental design. Black circles denote stimuli belonging to the “high” category, and white circles denote stimuli belonging to the “low” category. Although the classification rule changes over time—that is, the classification boundary is constantly rising, as is shown in the left panel—it does so in a regular fashion. The scale on the vertical axis is normalized so that the average “rise” from one trial to the next is 1 unit; onscreen, this corresponded to an average rise of approximately 2 mm per trial. Note that although it was logically possible to correctly classify all items, in practice the task was quite difficult, since most stimuli lay close to the boundary. The deviations of each stimulus from the true classification boundary are shown in the right panel.

For the purposes of our analysis, we refer to the average rise (approximately 2 mm) as 1 unit, because doing so allows us to write the true classification rule in a very simple form. Specifically, if \( x_t \) denotes the height of the stimulus on trial t, the optimal response is to select the response option corresponding to the high category whenever \( x_t > t \). Such a rule, shown as the solid line in Fig. 2, achieves 100% accuracy on the task. However, because most stimuli tended to lie quite close to the classification boundary, the task was relatively difficult, even though the general trend was clear. Consistent with this, during informal discussions, the participants indicated that they detected the upward trend early in the experiment but still found the task to be quite challenging.
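
To make the design concrete, the following sketch simulates stimuli of this kind (in average-rise units) and applies the optimal rule. It is an illustration only: the normal noise spread sigma is our assumption, not the value used to generate the actual experimental stimuli.

```python
# Illustrative simulation of the Experiment 1 design (a sketch, not the
# exact generation procedure; the noise spread sigma is an assumption).
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100
sigma = 1.5  # assumed spread of stimulus heights around the boundary

t = np.arange(1, n_trials + 1)            # trial number; the boundary sits at height t
x = t + rng.normal(0.0, sigma, n_trials)  # stimulus heights, in "average rise" units

def optimal_response(x_t, t):
    """The optimal rule from the text: respond 'high' iff x_t > t."""
    return "high" if x_t > t else "low"

labels = [optimal_response(x_i, t_i) for x_i, t_i in zip(x, t)]
```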

A model for the task

Our data analysis relied on a simple categorization model. The model was inspired by decision bound models (e.g., Ashby & Gott, 1988; Ashby & Lee, 1991; Ashby & Maddox, 1993), although unlike most decision bound models, it was not explicitly derived from general recognition theory. In this section, we describe the structure of the model, because it is central to our analysis. It is worth noting, however, that we tried a variety of other categorization models, and the qualitative pattern of the results was not affected (Footnote 1).

Recall from Fig. 2 that the stimuli were designed to be approximately normally distributed with mean \( \mu_t \), where \( \mu_t \) increased linearly over time. Moreover, if a particular stimulus lay above the mean (i.e., \( x_t > \mu_t \)), it belonged to the high category, and otherwise it belonged to the low category. In other words, tracking the category boundary over time was equivalent to tracking the value of \( \mu_t \) over trials. A simple model for estimating the value of \( \mu_t \) is as follows. Suppose that the learner has some estimate \( \widehat{\mu}_{t-1} \) of the location of the category boundary before the start of trial t – 1. When the stimulus \( x_{t-1} \) is observed, the boundary is shifted by some proportion w in the direction of that observation, yielding the following estimate for the location of the category boundary before the start of trial t:

$$ \widehat{\mu}_t = (1 - w)\,\widehat{\mu}_{t-1} + w x_{t-1} $$
(1)

where w is a “twitchiness” parameter that indicates the extent to which the learner relies on the very last observation that he or she has seen (Footnote 2). Expanding the recursion in Eq. 1, we observe that this model produces an estimate \( \widehat{\mu}_t \) that is an exponentially weighted average of the previous items:

$$ \widehat{\mu}_t = w \sum_{k=1}^{t} (1 - w)^{k-1} x_{t-k} $$
(2)

In this equation, the fictitious “zero-th stimulus” \( x_0 \) corresponds to an initial value for the category boundary. The key thing to note is that recent trials contribute more heavily to the estimate of \( \mu_t \): Large values of w imply that only a few observations are used, whereas small values of w allow multiple observations to be used. Having formed an estimate of where the category boundary lies, the learner is assumed to generate responses probabilistically, as a function of the deviation between the current stimulus and the estimated category boundary. This deviation is given by

$$ \Delta_t = x_t - \widehat{\mu}_t $$
(3)

and if \( \Delta_t > 0 \), then the item is more likely to be classified as a member of the high category. Specifically, we assume that the function relating distance to categorization probability is logistic (e.g., Navarro, 2007). If \( p_t \) denotes the probability of selecting the high category on trial t, then

$$ p_t = \frac{1}{1 + \exp(-\lambda \Delta_t)} $$
(4)

where λ governs the rate at which the classification probability changes as a function of distance from the category boundary. This model is closely related to the tracking model used by Brown and Steyvers (2009), but it has links to other models, too. For instance, in the extreme case in which w = 1, this heuristic corresponds to a relative judgment strategy in which each stimulus is compared only to the immediately preceding stimulus, and it becomes a slight simplification of the model used by Stewart et al. (2002). When w = 1 and λ is large, the model produces a very simple heuristic: Select the high category if and only if the current stimulus is higher than the previous one. Additionally, the model has links to prototype models of classification. In an equal-variance prototype model, the learner represents separate means (the prototypes) for each category, and the category boundary lies equidistant from the two prototypes. If the estimate for a category prototype takes recency into account by taking an exponentially weighted average (see, e.g., Navarro & Perfors, 2012), the classification probabilities will end up being almost identical to those produced by our model.

The problem with using a heuristic such as this one is that either it is very sensitive to random fluctuations in the data (when w is large), or else the estimate of the category boundary lags a long way behind the true one (when w is small). To see this, note that when w is large, the learner is strongly influenced by the most recent observation, and to the extent that this observation is noisy, or otherwise misleading as to the location of the category boundary, the learner will be unduly influenced by it. Decreasing the value of w allows the learner to avoid this mistake by aggregating information from multiple observations, but this comes at a price: Because the category is moving, setting w too small means that the estimate \( \widehat{\mu}_t \) will always be “lagging” a long way behind the data.

This issue can be avoided to some extent if the learner is able to detect the pattern of change and anticipate the fact that the category boundary \( \mu_t \) moves on each trial. For the sake of simplicity, we assume that this corresponds to the introduction of a simple correction factor b. This correction factor yields a slight modification of the model, in which the estimate of the category boundary on trial t is given by

$$ \widehat{\mu}_t = b + w \sum_{k=1}^{t} (1 - w)^{k-1} x_{t-k} $$
(5)

We now have a simple classification model with three parameters: the twitchiness parameter w, which governs how reliant the learner is on the last stimulus; the slope parameter λ, which describes the relationship between distance and generalization; and the bias parameter b, a correction factor that shifts the classification boundary to accommodate the fact that the task involves a clear trend over time. Although this model has not (to our knowledge) been used in the categorization literature previously, it has been used in the judgmental forecasting literature (see Goodwin & Wright, 1994) as a heuristic appropriate for modeling human extrapolation judgments for a trended time series.
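
For concreteness, here is a minimal sketch of the three-parameter model in Python, implementing Eqs. 1 and 3–5. The function name and the initial boundary estimate mu0 are our own choices for illustration; nothing in this section specifies the model’s starting value.

```python
# Minimal sketch of the three-parameter tracking model (Eqs. 1 and 3-5).
# The initial estimate mu0 is an assumption made for illustration.
import numpy as np

def model_choice_probs(x, w, lam, b, mu0=0.0):
    """Probability of a 'high' response on each trial.

    x   : stimulus heights x_1, ..., x_T
    w   : twitchiness (weight on the most recent observation)
    lam : logistic slope relating distance to choice probability
    b   : bias that shifts the boundary to anticipate the trend
    """
    mu_hat = mu0                       # estimate before the first trial
    p = np.empty(len(x))
    for i, x_t in enumerate(x):
        delta = x_t - (mu_hat + b)     # Eq. 3, with the bias from Eq. 5
        p[i] = 1.0 / (1.0 + np.exp(-lam * delta))  # Eq. 4
        mu_hat = (1 - w) * mu_hat + w * x_t        # Eq. 1 update
    return p
```

Note that setting w = 1 with a large λ in this sketch recovers the “higher than the previous stimulus” heuristic described above, since the running estimate then equals the last stimulus seen.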

Results

Human performance was significantly above chance for both categories: 76% of the high-category items and 61% of the low-category items were classified correctly, as is shown in the left panel of Fig. 3. Note that, while performance in both categories was significantly above chance, people performed better in the high category. This is not surprising: Because both categories were rising throughout the experiment, novel items from the low category tended to be close to the previous high-category items. By contrast, new items from the high category were much less confusable, since they were not close to previous low-category items.

Fig. 3 Overall performance (left) with 95% confidence intervals, as well as the individual participants’ data (right). Both plots incorporate data from all participants and all trials. Note that nine participants did not perform significantly above chance (the dashed line corresponds to the p < .05 significance threshold) and for this reason were excluded from the subsequent analyses.

When we plot the performance of all 59 participants separately, as in the right panel of Fig. 3, it is clear that the improved performance on the high-category items holds at the individual-participant level, as well. Nine of the 59 participants were not significantly above chance (the p < .05 significance threshold is plotted as a dashed line). Of the remaining 50 participants, 47 classified the high-category items more accurately than the low-category items while performing above 50% correct for both categories. This is illustrated visually by the fact that the vast majority of the dots in Fig. 3 fall within the solid black triangle.

Model fits were performed at the individual-participant level, using maximum likelihood estimation to obtain separate values of b, w, and λ for each of the 59 participants. The first five trials were excluded for the purpose of the model fitting (Footnote 3). As is shown in Fig. 4, the model produces classification behavior very similar to that of the human participants. Across participants, the correlations between the human data and the model predictions were r = .94 (p < .001) for the probability of correctly classifying low-category items, and r = .92 (p < .001) for the high-category items. At a within-participant level, the model fit was significantly better than chance (as assessed via a likelihood ratio test at the p < .05 level) for 52 of the 59 participants. Of the seven participants whose data were not well fit by the model, five were among those who did not classify the stimuli above chance. It is no surprise that the model cannot account for the performance of those participants, as their empirical data are extremely noisy.
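
A sketch of how such a per-participant fit could be implemented is shown below. It reuses the model_choice_probs function from the earlier sketch, and the optimizer, starting values, and parameter bounds here are our assumptions rather than the settings actually used in the analysis.

```python
# Sketch of per-participant maximum likelihood fitting. The optimizer,
# starting values, and bounds are assumptions for illustration.
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, x, chose_high):
    w, lam, b = params
    p = model_choice_probs(x, w, lam, b)   # model sketch defined earlier
    p = np.clip(p, 1e-9, 1 - 1e-9)         # guard against log(0)
    return -np.sum(np.where(chose_high, np.log(p), np.log(1 - p)))

def fit_participant(x, chose_high):
    """Fit (w, lam, b) to one participant's stimuli and binary choices."""
    x, chose_high = np.asarray(x)[5:], np.asarray(chose_high)[5:]  # drop first five trials
    res = minimize(neg_log_likelihood, x0=[0.5, 1.0, 0.0],
                   args=(x, chose_high),
                   bounds=[(1e-3, 1.0), (1e-3, None), (None, None)])
    return res.x                           # estimates of (w, lam, b)
```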

Fig. 4 Model performance at an individual-participant level. As in Fig. 3, each participant is characterized in terms of the probabilities of correctly classifying low-category items (left panel) and high-category items (right panel). Each panel plots each participant as a circle, showing the model probability correct against the human probability correct.

The key empirical test here related to the parameter estimates, most notably of the bias parameter b. Given this, we restricted the analysis to those participants whose data were well-fit by the model. Our inclusion criterion here was that the model needed to explain at least 25% of the variance in a participant’s choices. This criterion was met for 43 of the participants. Among these participants, the model explained 43% of the variance on average (SD = 12%), corresponding to a 71% probability that the model would make the same response as the participant on any given trial (SD = 6%).

Descriptive statistics for the parameter estimates are provided in Table 1. From a theoretical perspective, the first analysis to consider was a test of whether the b parameter is in fact necessary within the model. To that end, for all 43 participants, we fit a restricted model with the bias parameter fixed at b = 0; a likelihood ratio test rejected this null model (at p < .05) in all 43 cases, favoring the model that included bias. The second analysis considered the magnitude of the bias parameter: The mean value of b was 1.07, with a 95% confidence interval of [0.28, 1.87], implying that participants did shift their classification boundaries upward.
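
The nested comparison here is a standard one-degree-of-freedom likelihood ratio test (only b is fixed in the restricted model). A minimal sketch, assuming the per-participant log-likelihoods have already been computed:

```python
# One-degree-of-freedom likelihood ratio test of the b = 0 restriction.
from scipy.stats import chi2

def lr_test(loglik_full, loglik_restricted, df=1):
    stat = 2.0 * (loglik_full - loglik_restricted)
    return stat, chi2.sf(stat, df)   # test statistic and p-value
```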

Table 1 Means and standard deviations for model parameters, estimated from the data of all 43 participants for whom the model fit was deemed adequate

Closer examination revealed a somewhat more complicated story. There was a moderately strong negative correlation of r = −.77 (p < .001) between the bias parameter b and the twitchiness parameter w (this was the only significant correlation). This was to be expected: As noted earlier, smaller values of w mean that participants are aggregating information across more trials, which in turn implies that the uncorrected estimate of the category boundary will lag farther behind the true boundary. In other words, the optimal value of b should be higher for smaller values of w. This is illustrated in Fig. 5, which plots b against w for all participants (black dots), along with a regression line (solid line) that depicts the relationship between the two. The fact that the confidence bands for the regression line (gray region) sit above zero (dotted line) for most values of w indicates that, in general, participants were extrapolating.
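
This relationship can be made precise under a simple idealization of ours (it is not a derivation given in the text): if the stimulus mean rises exactly 1 unit per trial and the exponential weights in Eq. 2 have reached steady state, the uncorrected estimate lags the boundary by an expected

$$ t - \mathrm{E}\big[\widehat{\mu}_t\big] = w \sum_{k=1}^{\infty} (1 - w)^{k-1} k = \frac{1}{w} $$

units, so the lag-correcting bias is approximately b = 1/w: larger for smaller w, consistent with the negative correlation just described.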

Fig. 5 Scatterplot (black dots) showing the relation of the twitchiness parameter w and the bias parameter b, along with the linear regression (solid line) and corresponding 95% confidence band. The fact that the regression line is above zero (dotted line) for most values of w indicates that, in general, participants were extrapolating. However, the regression line also lies below the optimal value of b (dashed line) for most values of w, indicating that participants did not extrapolate far enough.

The fact that people extrapolate does not imply that the extent of the extrapolation is optimal. Indeed, Fig. 5 also shows the optimal value of b for all values of w (dashed line). It is clear that the participant responses lie below the optimal value, indicating that they did not extrapolate far enough. This is consistent with the raw data in Fig. 3: The simple fact that people are more accurate for high-category items than for low-category items provides strong evidence that they do not entirely anticipate the extent of the changes across trials. Moreover, when we plot the mean category boundary (taken across participants) extracted from the model, as per Fig. 6, it is clear that the model reproduces this behavior.

Fig. 6 Estimated category boundary on each trial (solid line), expressed in terms of the deviation from the true category boundary (dashed line), and averaged across all 43 participants whose data were well fit by the model. The fact that the solid line lies below the dashed line most of the time indicates that participants’ implied category boundaries still lagged behind the true rule.

Discussion

The results from Experiment 1 provided some indication that people are capable of detecting changes to a category over time and are able to extrapolate a trend that they have observed when making decisions about new items. However, the effect is subtle and relies on the assumption that, if the categories had not been changing over time, participants would have used a recency-weighted average to estimate the category representation (i.e., the model without b). This assumption seems sensible, insofar as the task was a standard supervised categorization task, and models of that kind have been highly successful in explaining human behavior in such tasks. Nevertheless, it is desirable to show that the effect can be detected without requiring detailed modeling, if we allow the task to be modified to capture the effect more directly.

Experiment 2

One of the problems with the supervised classification task used in Experiment 1 is that the participants provided only a very limited amount of data (a binary choice) on any particular trial. This made it difficult to measure subtle effects in how category representations changed at a trial-by-trial level. To remedy this, in Experiment 2 we employed a very different dependent measure: Participants were asked to generate new category members on each trial. After training participants on a category that changed over time, we asked them to generate a sequence of future category members over multiple time points. If participants were genuinely able to extrapolate their category knowledge, this sequence of responses should extend the trend that had appeared in the training data (Footnote 4).

Method

Participants

A total of 110 participants (34 female, 76 male) were recruited via Amazon Mechanical Turk, an online service through which people complete small tasks requiring human intelligence. The participants were located in 17 different countries, but the vast majority were in India (71) or the United States (17). They were required to be English speakers (assessed via a series of simple test questions) and were paid US$0.25 for their participation in the study. The structure of the experiment provided a fairly straightforward way to check whether participants had understood the task and were making a genuine attempt to complete it: examination of their responses. Exclusions are therefore discussed together with the data analysis.

Materials and procedure

Participants completed the study online through the Amazon Mechanical Turk website. After accepting the task, they were asked demographic questions and then presented with the experiment instructions. They were then quizzed on the experimental procedure, to check that they had understood the instructions (this implicitly served as a check that they understood English). If they did not answer all questions correctly, they were redirected back to the instructions and required to repeat the instruction checks until all of the questions were answered correctly. To verify that any effect observed did not depend on particular details of the stimulus or task, four versions of the task were employed: Participants were randomly allocated to groups in a 2 × 2 design defined by the Stimulus Type and Prediction Type factors. The stimulus type manipulation altered the surface representation of the stimuli, and the prediction type manipulation altered the kinds of responses that people were asked to provide.

In the cover story for the task, the participants were asked to imagine that they were helping a confectionery company to determine the future direction of some of their products. The experiment involved two phases for all conditions. Phase 1 consisted of ten trials, each of which corresponded to a different year: On each trial, participants were shown a set of ten stimuli, intended to represent different examples of confectionery on sale that year. All ten items were displayed onscreen together. The visual representation of the confectionery depended on the stimulus type: In the candy cane condition, the stimuli were displayed as striped horizontal bars of different lengths. The length of the bar was the relevant stimulus dimension. In the chocolate condition, the stimuli were boxes of chocolates in which each chocolate could be either white or dark. Each box showed a 4 × 5 grid of 20 chocolates, and the number of dark chocolates was the relevant stimulus dimension. Example stimuli are shown in Fig. 7.

Fig. 7 Examples of the stimuli used in Experiment 2. The image on the left shows a display containing three candy canes, and the image on the right shows a display containing two chocolate boxes. In both cases, the actual displays contained ten stimuli.

As in Experiment 1, the stimuli changed systematically across trials, as illustrated by the dashed lines in Fig. 8. In the candy cane condition, the mean length of the stimuli was 300 pixels on Trial 1, rising linearly to 525 pixels by Trial 10, corresponding to an average increase in length of 25 pixels per trial. In the chocolate condition, the mean number of dark chocolates in the boxes started at three on Trial 1, rising by one chocolate per trial until the average number of dark chocolates on Trial 10 was 12. On any given trial, however, participants were shown ten stimuli that varied around this mean. In the candy cane condition, the length of any one stimulus was uniformly sampled from a range of ±25 pixels around the true mean. In the chocolate condition, the number of dark chocolates in any one box was equal to the true mean plus −1, 0, or 1, with each outcome being equally likely. In our data analysis, all stimulus magnitudes were rescaled so that the mean magnitude on Trial 1 was 0, and on Trial 10 it was 1.
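
The stimulus generation just described is simple enough to state in a few lines of code. The following sketch reproduces it; the function names and the random seed are our own choices, while the sampling rules follow the text.

```python
# Sketch of the Experiment 2 training stimuli as described above.
# Function names and the seed are ours; the sampling rules follow the text.
import numpy as np

rng = np.random.default_rng(0)

def candy_cane_lengths(trial):
    """Ten candy-cane lengths (pixels) for trial = 1..10."""
    mean = 300 + 25 * (trial - 1)          # 300 px on Trial 1, 525 px on Trial 10
    return rng.uniform(mean - 25, mean + 25, size=10)

def dark_chocolate_counts(trial):
    """Ten dark-chocolate counts (out of 20 chocolates) for trial = 1..10."""
    mean = 3 + (trial - 1)                 # 3 on Trial 1, 12 on Trial 10
    return mean + rng.integers(-1, 2, size=10)   # -1, 0, or +1, equiprobable
```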

Fig. 8 Mean responses on each trial during the supervised phase. In the current-prediction condition, participants on trial t generated predictions for additional examples at trial t. In the future-prediction condition, participants on trial t generated predictions for examples that they would expect to see at trial t + 1. One would therefore expect that, in most cases, the future predictions would be higher than the current predictions. This effect is evident for the candy cane but not for the chocolate stimuli.

After being presented with a set of stimuli, people were asked to generate three new examples of confectionery. The task differed depending on the prediction type: In the current-prediction condition, participants were asked to generate three more examples that would be representative of the confectionery for the current year; in the future-prediction condition, the examples were supposed to represent predictions for the next year’s confectionery. Participants who saw candy canes made their predictions by dragging a slider bar to change the length of the candy canes, and the participants who saw chocolate boxes were able to click on each individual chocolate piece to change its color from white chocolate to dark chocolate, or vice versa.

The critical part of the experiment was Phase 2, which was identical regardless of the prediction-type condition to which participants had been assigned. During this phase, participants were asked to generate three examples of candy canes or chocolate boxes for each of the next three years (i.e., nine examples in total), and no information was given as to what kinds of candy canes or chocolate boxes would be popular in those years. Note that, because these unsupervised trials followed immediately from the supervised ones in Phase 1, we did not ask participants in the future-prediction condition to make any responses on the last trial of Phase 1. The reason for this was that, in the future-prediction condition, the last trial of Phase 1 was essentially equivalent to the first trial of Phase 2, and we felt it more important to preserve the comparability of judgments in Phase 2.

Results

Because the stimuli increased across trials in a linear fashion, one would expect that the predictions generated by participants would also rise across trials. At a minimum, one would expect the magnitude of the responses generated by participants in Phase 1 to correlate with the stimulus magnitude. This suggests a natural exclusion criterion: whenever the Pearson correlation between the stimuli and responses was less than or equal to zero, we had evidence that the participant either did not understand the task or failed to make a genuine attempt to do so. In total, 16 participants were excluded on this basis, including two participants who produced identical responses on every trial. After these exclusions were made, we ended up with 24 participants in the candy stimulus current-prediction condition, 29 in the chocolate stimulus current-prediction condition, 34 in the candy stimulus future-prediction condition, and 23 in the chocolate stimulus future-prediction condition. The mean responses on each supervised trial are plotted in Fig. 8, broken down by conditions (error bars denote 95% confidence intervals). These plots provide some evidence that participant responses differed between the conditions during Phase 1, but the important thing for our purposes was to note that in all conditions, the participants understood the task, correctly realizing that the stimulus magnitude was increasing over trials. Given that the primary interest lay in how participants would respond in Phase 2, this is the topic to which we now turn.
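
For concreteness, the exclusion rule amounts to the following check (a sketch; the variable names are ours):

```python
# Sketch of the exclusion rule described above: keep a participant only
# if the Pearson correlation between stimulus magnitude and response
# across the Phase 1 trials is positive.
import numpy as np

def keep_participant(stimulus_means, responses):
    r = np.corrcoef(stimulus_means, responses)[0, 1]
    return r > 0
```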

Mean responses and confidence intervals (Footnote 5) for the Phase 2 trials (i.e., extrapolations over a three-year range) are plotted in Fig. 9: On the left, the data are shown aggregated across conditions, and on the right, each condition is shown separately. Given that the conditions involved different stimuli, response methods, and task instructions, it is not entirely surprising that the baseline response in Phase 2 (i.e., the response on the first trial of Phase 2) differed across conditions. The key question of interest with respect to Phase 2, however, was not whether the four conditions would differ, but whether the three trials would. That is, would the responses show a significant rise from trial to trial? Would people extend the linear trend that they had observed in Phase 1 into the future?

Fig. 9 Data from Phase 2 of Experiment 2, showing future predictions made over three years, without further examples shown. The plots display the mean responses and 95% confidence intervals for participant responses in Phase 2 of the experiment. The left panel shows the mean response averaged across conditions, whereas the right panel shows the responses broken down across all four conditions. As this figure indicates, participants’ generalizations in Phase 2 reflected the linearly increasing trend that they had experienced in Phase 1 of the experiment.

To address this question, we fit a linear mixed-effects model to the Phase 2 data using the lme4 package in R (Bates, Maechler, & Bolker, 2011). The model included fixed effects of prediction type, stimulus type, and trial number (i.e., year). Individual variation was captured by including a random intercept and a random effect of trial number (i.e., a random slope) for each participant. The Bayesian information criterion (BIC) was used for model checking, testing each effect by comparing the full model to a restricted model in which the terms corresponding to that effect were removed. The BIC deteriorated substantially if we removed the random intercepts (BIC difference = 100.0) or the random slopes (BIC difference = 14.3). Removing the fixed effect of prediction condition improved model performance slightly (BIC difference = −5.0), whereas removing the fixed effect of stimulus condition caused it to decline slightly (BIC difference = 6.1). Most importantly, removing the fixed effect of trial caused a large deterioration (BIC difference = 22.4). Converting this BIC difference to approximate posterior odds yields odds of 73,000:1 in favor of the model that included the effect of trial number. This suggests that participants were extrapolating sensibly over the next three years, capturing the linear trend in the data they had been shown.
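
For readers unfamiliar with this conversion: the BIC difference approximates minus twice the log of the Bayes factor between the two models, so

$$ \mathrm{odds} \approx \exp\!\left(\frac{\Delta \mathrm{BIC}}{2}\right) = \exp\!\left(\frac{22.4}{2}\right) \approx 73{,}000 $$

which matches the 73,000:1 figure quoted above.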

Having verified that participants’ responses showed a significant rising trend in Phase 2 of the experiment, it is useful to consider how large that trend was. For simplicity, we consider the aggregated data plotted on the left-hand side of Fig. 9, for which the slope is 0.077. The linear trend in Phase 1 involved the stimuli rising (in rescaled units) from an average value of 0 on Trial 1 to an average of 1 on Trial 10: a rise of 1 unit in nine steps, or a slope of 0.111. In other words, although participants did extrapolate a linear trend in Phase 2, that trend was shallower than the original trend in the data, with a slope only about 70% as large as the empirical slope in Phase 1.
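
In rescaled units, the size of the shortfall is simply the ratio of the two slopes:

$$ \frac{\text{Phase 2 slope}}{\text{Phase 1 slope}} = \frac{0.077}{0.111} \approx 0.69 $$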

Discussion

The results of Experiment 2 are in agreement with those of the earlier experiment. In Experiment 1, we found evidence that people extrapolated their knowledge about categories in a manner that was consistent with the pattern of change in the task. However, the effect was subtle and relied on a detailed model-fitting exercise. By asking people to generate new category members that they would expect to observe several years into the future, we were able to observe the effect more directly. As is illustrated in Fig. 9, across all four conditions, participants’ judgments showed a linearly rising trend when they were asked to make predictions for the next three years, even in the absence of any explicit feedback. Similarly, just as in Experiment 1, we found that the extent of the extrapolation was suboptimal: People did the right thing qualitatively, but not to the extent required to perfectly match the trend in the raw data. Taken together, the two experiments tell a fairly consistent story about how people make sense of categories that change over time.

General discussion

The world is not a static place: The rules and regularities that characterize our environments do not remain constant over time. Sensitivity to change is a necessary characteristic of any intelligent learner operating in a changing world. As trivial as this sounds, it has important implications: Any model that is invariant to stimulus ordering, such as the original generalized context model (GCM; Nosofsky, 1984), standard decision bound models (Ashby & Lee, 1991), or the statistical model underpinning the “rational” model of categorization (Sanborn et al., 2010), is fundamentally the wrong model for some kinds of categorization problems that humans need to solve in real life—namely, those that involve change over time. As such, models that are sensitive to order information, such as the GCM with recency weighting (Nosofsky et al., 1992) or models that rely on connectionist learning rules (e.g., Kruschke, 1992; Sakamoto et al., 2008), are better suited to this sort of category-learning problem. Similarly, to the extent that learning about change is a key task facing the learner, directly encoding stimulus differences (as per Stewart et al., 2002) may be viewed as a rational thing to do. In some respects, these models are closer to the correct rational analysis of this learning problem than are most “rational” categorization models (Footnote 6).

This work shows that once we start to think of category learning as a problem that applies in a changing environment, there is much more to the problem than simply assigning more importance to recent observations. When the pattern of change is systematic rather than arbitrary, it is not sufficient merely to detect changes (e.g., Brown & Steyvers, 2009): An ideal learner is able to anticipate them by forming sensible expectations about when the world changes and how that change occurs. Although our experiments are very simple examples of this, involving regular linear change in one dimension, they provide evidence that people do form these sorts of expectations during category learning.

This research also opens up questions as to the mechanisms by which people actually form these expectations. The model that we used in Experiment 1 does not provide an answer to this question. We do not advocate any theory of how the bias parameter is learned, although it seems likely that, since b > 0 produces superior classification performance in this task, almost any sensible learning rule (e.g., error-driven learning or Bayesian updating) would infer a positive value. Indeed, it seems likely that such models would infer the optimal value for b, which would open up the question of why the human extrapolations were suboptimal in both experiments. More generally, we are not convinced that a simple “bias” is the right answer to the problem in general: The model from Experiment 1 would be highly inappropriate for a category that changed in a nonlinear fashion, for instance. These questions are left open for future work.

Although the model in Experiment 1 was used primarily as an aid to the data analysis, the bias parameter b within that model is perhaps the most explicit attempt to formalize the notion of extrapolation in this study. As such, it is worth giving some consideration to what b actually does. Viewed purely as a mathematical device, the effect that b has on the estimated category boundary is identical to the effect of a category base rate: It moves the boundary closer to one category or the other. Given this, one might wonder whether the bias learned by participants in Experiment 1 was acquired via a base-rate learning mechanism. If so, one might then question whether any genuine extrapolation has occurred. However, this concern is unfounded: It conflates the learning mechanism with the substance of what is learned. In the experiments presented in this article, the base rates for both categories were always 50%. If the learned bias b were equivalent to base-rate learning, one would expect an optimal value of b = 0 in any experiment in which the categories did not differ in base rate. Yet, as Fig. 5 shows, the optimal value of b is decidedly nonzero. Even if the mechanism used to learn b were formally equivalent to a base-rate learning rule, the thing that was learned would clearly not be a base rate. Rather, it would be a bias term whose substantive effect would be to systematically shift the learners’ expectations about future events away from their experience of past events. “Extrapolation” seems as good a name as any for this inference.

To sum up, this article illustrates that people are capable of tracking changes over time during category learning. This work also opens up a range of additional issues. Regular linear change is only one pattern of change that could characterize real-world categories. In our initial work on this topic (Navarro & Perfors, 2009), we also considered the possibility that changes could be discrete jumps, as in Brown and Steyvers (2009), or sinusoidal patterns, as in Sakamoto et al. (2008). However, these possibilities are not exhaustive, either. Understanding how people adapt to systematic changes in categories will require a broader investigation across a wider range of possible change patterns.