Direct interaction with the environment provides much of the information that informs subsequent actions. Rarely is choice made in the presence of perfect knowledge. In a multitude of domains, organisms begin by choosing almost blindly; what is learned about the environment varies according to which of the possibilities are experienced. The world often fails to reveal information about the utility of options not chosen—the route not taken, the career not selected, the product not purchased (Taleb, 2007). In a complex environment where options are many and/or variable, complete knowledge of prevailing contingencies may require very long-term exploration. Even after long experience with the prevailing contingencies, continued exploration of options with less utility may be necessary in order to adapt to change.

Under similar conditions, what leads some choosers to exploit their knowledge of differential utility and others to explore their options? Continued exploration may be an adaptive behavior learned through experience with changing environments (Rakow & Miler, 2009; Stahlman, Roberts, & Blaisdell 2010; Stahlman, Young, & Blaisdell 2010), or it may be that imperfect knowledge maintains exploration so that responding to changing conditions is a side effect rather than an adaptation. A complete study of the trade-off between exploration and exploitation will require the use of choice environments in which more than two options are available (cf. Rakow & Miler, 2009). We examined this trade-off in the present project by investigating human and pigeon behavior in an eight-option task.

In addition to contending with the real-world complexity related to large numbers of options, most species live in changing environments. Although researchers in foraging behavior have investigated decision-making mainly through familiar, stationary environments, such that the individuals are fully informed about the nature of the options (e.g., Lin & Batzli, 2002; Zach, 1979), there is increasing interest in how such information is acquired (e.g., Mettke-Hofmann, Wink, Winkler, & Leisler 2004; Plowright & Shettleworth, 1990). The introduction of environmental changes has often been used to study how animals gather information about their environment.

We took an approach that was inspired by the study of reinforcement-learning algorithms as applied to machine learning (Koulouriotis & Xanthopoulos, 2008; Sutton & Barto, 1998). In its simplest form, reinforcement-learning analyses often use the multi-armed (or “n-armed”) bandit task to evaluate various methods of distributing exploration and exploitation (e.g., Dimitrakakis & Lagoudakis, 2008; Sikora, 2008). This task provides an excellent platform to explore choice in stationary (with unchanging payoffs) and nonstationary (with changing payoffs) environments, and it has also been applied to the domains of human learning and cognition (e.g., Burns, Lee, & Vickers 2006; Plowright & Shettleworth, 1990), economics (e.g., Banks, Olson, & Porter 1997), marketing and management (e.g., Azoulay-Schwartz, Kraus, & Wilkenfeld 2004; Valsecchi, 2003), and math and computer science (e.g., Auer, Cesa-Bianchi, Freund, & Schapire 1995; Koulouriotis & Xanthopoulos, 2008).

Task description

The multi-armed bandit task (MABT) usually involves choosing among multiple possible actions that lead to immediate reward and about which nothing is initially known. The MABT took its name from the “one-armed bandit,” another term for the slot machine. Rather than the one arm of a slot machine, however, a MABT has n options. It can be thought of as a set of n slot machines, each with an independent payoff schedule. After each selection, the reinforcer is awarded based on an underlying schedule of reinforcement. A player must explicitly explore an environment in order to learn the expected payoffs for these n options, and then can later exploit this knowledge. In a four-armed bandit task similar to the one used in the present study, Steyvers, Lee, and Wagenmakers (2009) employed a Bayesian optimal-decision model derived from the softmax equation (Luce, 1963) to explore how humans balance exploration with exploitation. In addition, eight-stimulus arrays very similar to the one used in the present study have been used with nonhuman animals (Jensen, Miller, & Neuringer 2006) and humans (Rothstein, Jensen, & Neuringer 2008), and in both cases behavior came under the control of the prevailing contingencies. Thus, this MABT provides a decision task that is potentially both complex and challenging, yet at the same time simple enough that it can be used to study a wide range of decision-making in both humans and other animals.
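As a concrete illustration of this task structure, the sketch below reduces an n-armed bandit to a set of per-arm payoff probabilities that a player can only learn by sampling. It is a minimal, hedged sketch with illustrative names (e.g., NArmedBandit, pull), not the apparatus or software used in the present experiments; the example probabilities are the random-ratio values used early in Experiment 1's training.

```python
import random

class NArmedBandit:
    """Minimal n-armed bandit sketch: each of n arms pays off with a fixed,
    initially unknown probability (illustrative; not the VI-scheduled task
    used in the present experiments)."""

    def __init__(self, payoff_probs):
        self.payoff_probs = list(payoff_probs)  # one reward probability per arm

    def pull(self, arm):
        """Return 1 if the chosen arm pays off on this pull, 0 otherwise."""
        return 1 if random.random() < self.payoff_probs[arm] else 0

    def reshuffle(self):
        """Make the task nonstationary by redistributing payoffs across arms,
        analogous to reassigning schedules to disks between sessions."""
        random.shuffle(self.payoff_probs)

# The player initially knows nothing and must sample arms to estimate their values.
bandit = NArmedBandit([.61, .37, .22, .14, .08, .05, .03, .02])
reward = bandit.pull(0)
```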

Exploration versus exploitation

An arm pull is an action, and at any point an actor is expected to rely on an estimate of action values based on the sampling history with each option. Choosing the action with the highest estimated action value (the “greedy” action) is exploitation, because the actor is exploiting its current knowledge. If the actor chooses a nongreedy action, it is exploring—a behavior that potentially enhances overall knowledge by improving the estimate of a nongreedy option. Greedy actions allow the actor to maximize its chance of immediate reward for the very next action, but nongreedy actions may be preferable, in order to maximize long-term reward or value (i.e., they actually are greedy, but over an extended time horizon).

Reward may be lower in the short term when exploring, but long-term value may be greater, since the actor may discover actions that are better than the current greedy action or that provide viable alternatives if the action with the long-run highest value is currently less profitable (due to molecular aspects of the payoff schedule in which an option’s value is temporarily lower; e.g., for VI schedules) or later becomes unprofitable (due to molar changes in the payoff schedule; e.g., changing from a variable ratio 5 to variable ratio 50). Whether exploration or exploitation is best at any given choice point will depend on the expected changes in these payoffs, inter alia. For a nonstationary bandit task, option values change during the task by changing the underlying molar contingencies—as if the room full of slot machines were reprogrammed occasionally during the allotted time of play. Continued exploration is critical if an organism is to track and adapt to these changes.

The machine-learning literature provides some guidance regarding methods for action selection appropriate to the bandit task. The greedy strategy may be used to solve stationary bandit problems, and it requires that every response be made to the option with the highest value (i.e., the richest reinforcement schedule). This strategy results in quick and complete preference for one option, which is precisely what should be avoided in a nonstationary environment.

Alternatively, Luce’s (1963) decision rule (often called softmax) describes a strategy that uses the expected rewards of the options to choose them probabilistically. In other words, it assigns the highest selection probability to the greedy option, but the rest of the remaining options are chosen according to their value estimates. The probability of choosing action a is

$$ P\left( \mathrm{action}_a \right) = \frac{e^{\theta \cdot \mathrm{value}_a}}{\sum\limits_{j=1}^{n} e^{\theta \cdot \mathrm{value}_j}}, $$
(1)

where θ is the exploitation parameter, value_a denotes the current estimated value of action a, and n is the number of possible actions. When θ is zero, exploitation is absent and exploration of alternatives is predicted to be maximal, such that each action is equiprobable. Higher values of θ produce higher levels of exploitation; the option with the highest action value (the greedy response) is selected more frequently as θ increases. At very high levels of θ, Luce's decision rule becomes indistinguishable from the greedy decision strategy. The θ parameter thus allows the balance of exploitation and exploration to be adjusted to describe a particular organism's behavior, which may vary with factors such as time, satiety, and environmental uncertainty, and may differ across species, subjects, and stages of learning.
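As a minimal illustration of Eq. 1, the following sketch computes the Luce choice probabilities and samples an action accordingly. It is illustrative only (the function and variable names are ours), not the model-fitting code used in the analyses reported below.

```python
import math
import random

def luce_choice_probs(values, theta):
    """Eq. 1: choice probabilities under Luce's decision rule (softmax).
    theta = 0 yields equiprobable choice (maximal exploration); as theta
    grows, choice converges on the greedy option."""
    weights = [math.exp(theta * v) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def choose_action(values, theta):
    """Sample an action index according to the Luce probabilities."""
    probs = luce_choice_probs(values, theta)
    return random.choices(range(len(values)), weights=probs, k=1)[0]

print(luce_choice_probs([1.0, 0.5, 0.2], theta=0.0))   # uniform: pure exploration
print(luce_choice_probs([1.0, 0.5, 0.2], theta=10.0))  # nearly all weight on the greedy option
```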

People and pigeons are not Turing machines, and their estimates of action values may be imperfect. Regardless, these action values may be based simply on the overall history with each option, such as the proportion of total responses to that option that have been reinforced, or on some more complex calculation. For example, the estimates may be weighted toward more recent experience, or they may be sensitive to the changes in reinforcement probability over time that are inherent in VI schedules. For this study, we assumed these action values to be equal to the overall programmed likelihood of reinforcement represented by the VI schedule for each option. Thus, we operationally defined exploration as choosing a response with a lower molar reinforcement rate.
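Purely as an illustration of the kinds of estimates just described, the sketch below implements two simple update rules, a running proportion of reinforced responses (overall history) and a recency-weighted average; we make no claim that pigeons or people compute action values in either of these ways.

```python
def overall_history_estimate(estimate, n_choices, reward):
    """Running proportion of reinforced responses to an option (overall history)."""
    n_choices += 1
    estimate += (reward - estimate) / n_choices
    return estimate, n_choices

def recency_weighted_estimate(estimate, reward, alpha=0.1):
    """Exponentially recency-weighted estimate: older experience is discounted,
    which tracks changing payoffs more quickly than the overall history."""
    return estimate + alpha * (reward - estimate)

# Example: five responses to one option, two of them reinforced.
value, n = 0.0, 0
for r in (1, 0, 0, 1, 0):
    value, n = overall_history_estimate(value, n, r)
print(value)  # 0.4, the proportion of reinforced responses so far
```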

The present experiments examined both pigeon and human performance using a nonstationary MABT. Each species chose from among eight response options in order to provide a complex set of choices that would constrain the theoretical analysis. We were interested in testing three hypotheses. First, could Luce’s decision rule be used to assess the balance between exploitation and exploration for pigeons and humans in our choice task? Second, would both species adaptively and quickly increase their level of exploratory behavior in response to environmental cues that signal a change in choice payoffs? For pigeons, each daily session began with a new set of choice payoffs, and thus an adaptively optimal pigeon would begin each session with maximal exploration and be unaffected by the previous day’s programmed schedules. For people, a new session began every few minutes and was signaled by a discriminative cue at the top of the display that should prompt a sudden increase in exploration. Third, would exploration continue throughout a session, or would pigeons and people exhibit a higher level of exploitation later in the session, once differential choice value had been determined?

Experiment 1

Method

Subjects

A total of 6 experimentally naïve adult White Carneaux pigeons (Columba livia) participated in the experiment. The pigeons were individually housed in steel home cages with metal wire mesh floors in a vivarium, and a 12-h light:dark cycle was maintained. Testing was conducted 5–7 days/week during the light cycle. The pigeons were maintained at approximately 85% of their free-feeding weights, and were given free access to grit and water while in their home cages.

Apparatus

Testing was conducted in a flat-black Plexiglas chamber (38 cm wide × 36 cm deep × 38 cm high). All stimuli were presented by computer on a color LCD monitor (NEC MultiSync LCD1550M) visible through a 23.2 × 30.5 cm viewing window in the middle of the front panel of the chamber. Pecks to the monitor were detected by an infrared touch screen (Carroll Touch, Elotouch Systems, Fremont, CA) mounted on the front panel. A 28-V houselight located in the ceiling of the box was used for illumination, except during time outs. A food hopper (Coulbourn Instruments, Allentown, PA) was located below the monitor with an access hole situated flush with the floor. All experimental events were controlled and data recorded by a Pentium III class computer (Dell, Austin, TX). A video card controlled the monitor using the SVGA graphics mode (800 × 600 pixels).

Procedure

Preliminary training

The 6 pigeons were first trained to eat from the hopper in the chamber. Next, responses were autoshaped to a white disk that appeared in the center of the screen. Pecking to the disk resulted in the hopper rising for 3 s before lowering again. This was followed by a 60-s intertrial interval (ITI) before the next disk was displayed. Once the pigeon was consistently responding to the disk, training began.

Bandit training

The pigeons were presented with differently colored disks on the screen, with each disk approximately 2 cm in diameter. The disks were arranged in a circular array starting at the top of the screen, such that disks that were opposite each other were approximately 8 cm apart (see Fig. 1). This display was located so that the bottom of the lowest disk was 3 cm above the bottom edge of the screen. The colors used for the disks, from the left clockwise, were gray, light blue, red, yellow, pink, green, dark blue, and orange. The reward value given to a particular disk was fixed throughout the session, but the reward values were randomly redistributed across disks from one session to the next. Thus, in one session the values assigned to disks clockwise from the top may have been 6, 192, 12, 3, 384, 24, 48, and 96, but the distribution on the following session may have been 12, 192, 48, 6, 96, 384, 3, and 24. This redistribution of values was done at the beginning of each session. The relative positions of the colors were not changed from session to session. Throughout training and testing, sessions were 60 min long.

Fig. 1 The computer screen as it was presented to the pigeons. The disks were identified in the analysis by consecutive numbers in a clockwise direction, with the top disk being 0

Initial training consisted of assigning random ratios (RRs) to the disks, using the following probabilities of each peck being rewarded: .61, .37, .22, .14, .08, .05, .03, and .02. After 60 sessions of training, it became clear that the pigeons were showing strong biases to disks located in particular positions and were not pecking to disks located in other positions, even if they had the highest reward value.

One possibility was that pecking to the disks was relatively cheap to the pigeons, so the difference in reward structure was not very tangible. Another factor was that pigeons tend toward maximization (i.e., high exploitation) on RR schedules by showing nearly exclusive responding for the option with the richest experienced payoff structure (Herrnstein & Loveland, 1975). To increase sensitivity to reward and to encourage exploration by temporarily decreasing the reward value of a disk, the reward structure was changed from a random ratio to a variable-interval schedule. The variable intervals used were 3, 6, 12, 24, 48, 96, 192, and 384 s and varied by up to ±50% of the scheduled interval (e.g., for VI 3, the interval varied between 1.5 and 4.5 s). After another 60 sessions, it was clear that the pigeons were still showing strong biases to disks located in particular positions. Shifting the color assignments revealed that the bias was based on location and not color.
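For readers unfamiliar with VI schedules, the following sketch simulates one under the standard assumptions that the interval is timed from the previous reinforcer, that the jitter of up to ±50% is drawn uniformly, and that an armed reinforcer is held until the option is next chosen. It illustrates the contingency only; it is not the control software used in the experiment.

```python
import random
import time

class VariableIntervalSchedule:
    """Sketch of a variable-interval (VI) schedule. After a scheduled interval,
    jittered by up to +/-50%, the next response to this option is reinforced.
    Assumes the interval is timed from the previous reinforcer and that an
    armed reinforcer is held until the option is next chosen."""

    def __init__(self, mean_interval_s):
        self.mean_interval_s = mean_interval_s
        self._armed_at = self._schedule_next()

    def _schedule_next(self):
        jitter = random.uniform(0.5, 1.5)  # +/-50% of the scheduled interval
        return time.monotonic() + self.mean_interval_s * jitter

    def respond(self):
        """Return True if this response is reinforced; if so, start the next interval."""
        if time.monotonic() >= self._armed_at:
            self._armed_at = self._schedule_next()
            return True
        return False

# One schedule per disk, as in the later training and testing phases.
schedules = [VariableIntervalSchedule(s) for s in (3, 9, 27, 81, 243, 729, 2187, 6561)]
```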

The pigeons completed 5 sessions in which only one disk from the display was shown, and the disk had a .61 probability of reward. In this situation, the pigeons did reliably peck to the disk, regardless of its color or position.

The pigeons then completed 40 sessions with all eight disks present, one of which had a .61 probability of reward and seven of which had no reward. The pigeons still showed a strong bias to particular disk locations, even if the locations were not associated with reward in a given session.

A final attempt to equalize the perceived reward value of the disks and encourage exploration involved presenting the pigeons again with all eight disks for 24 sessions. However, the reward schedule was made more extreme, with VIs of 3, 9, 27, 81, 243, 729, 2,187, and 6,561 s (with experienced intervals again varying by up to ±50% of the scheduled interval). In addition, at the end of the 24 sessions, the disk that was pecked most was eliminated. For the subsequent 24 sessions, only the remaining seven disks were present, and the longest reward interval was no longer assigned to any disk. At the end of this set of 24 sessions, the most-pecked disk was again eliminated, along with the longest reward interval still in use. This procedure continued until the pigeons were given 24 sessions with only the three least-pecked (by location) disks remaining. To keep the pigeons at 85% of their free-feeding weights, a session was terminated once 300 rewards had been received during the session.

Testing

For the test sessions, the pigeons were presented with all eight disks for 24 sessions with VIs of 3, 9, 27, 81, 243, 729, 2,187, and 6,561 s. Assignment of VI schedule to the disks varied daily. Only the data from this final set of testing sessions were analyzed.

Results

We analyzed the data in two ways. First, we describe the frequency with which each disk was chosen as a function of its programmed payoff; this provides a general assessment of the degree of control established by the reward structure. Second, we provide an analytical assessment of the pigeons' exploratory behavior using Luce's (1963) decision rule.

From a reinforcement-learning perspective, low θ values indicate that a chooser either has not learned the differential payoffs or has maintained high exploration despite the differential payoffs. However, a sudden decrease in θ (when responding is not a function of previous disk value) indicates that a chooser has recognized that the payoffs have changed, thus prompting an increase in exploratory behavior.

The complicating factor in our analysis is that the programmed contingencies may not have been experienced equally by every organism. A pigeon may have undersampled a particular choice and thus obtained a biased estimate of its payoff. Pigeons frequently showed disk biases and failed to fully explore each of the options. Thus, in our second set of choice analyses for pigeons, we used disk location as an independent predictor of the best-fitting θ values and predicted lower θs (i.e., poor response differentiation as a function of payoff value) for less-preferred disks.

To estimate behavioral differentiation, we used the following instantiation of Luce’s decision rule:

$$ P\left( \mathrm{key}_i \right) = \frac{e^{\theta \cdot \mathrm{payoff}_i}}{\sum\limits_{j=1}^{8} e^{\theta \cdot \mathrm{payoff}_j}}, $$
(2)

in which payoff_i is the logarithm of the inverse of the programmed VI. The equation generates eight probabilities, one for each of the eight disks, that sum to 1.0.

To fit Luce's decision rule to behavior, we used nonlinear mixed-effects modeling and identified the maximum-likelihood best-fitting parameter values (Cudeck & Harring, 2007; Davidian & Giltinan, 2003). Mixed-effects modeling simultaneously generates parameter estimates for each subject and as a function of the independent variables (e.g., Laird & Ware, 1982; Pinheiro & Bates, 2004). This approach is superior to the two-stage approach, in which parameter estimates are derived independently for each subject and then used in a subsequent analysis, because the results of the first stage do not carry information about the uncertainty in the parameter estimates that are used in the second stage (Shkedy, Straetemans, & Molenberghs, 2005).

We examined changes in the maximum-likelihood estimate of θ in Eq. 1 across birds (random effect) as a function of our predictors (fixed effects). To apply Luce's decision rule, we needed to identify the best proxy for disk value (i.e., payoff). Preliminary analyses indicated that an appropriate function mapping VI to value was the logarithm of the reinforcement rate, log(1/VI). Taking the inverse translates the VI into an expected rate, so that higher values are associated with better schedules (Fig. 2 reveals that this transformed variable is a good proxy for the relative long-run probability that the pigeon was rewarded for choosing that disk). The log transformation produced a stronger fit than the untransformed reinforcement rates.
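The analysis reported below used nonlinear mixed-effects modeling across birds; purely as a simplified, single-subject illustration of the underlying computation, the sketch below finds the maximum-likelihood θ for Eq. 2 by grid search, using payoff_i = log(1/VI). The choice counts in the example are hypothetical, not data from the experiment.

```python
import math

VIS = [3, 9, 27, 81, 243, 729, 2187, 6561]
PAYOFFS = [math.log(1.0 / vi) for vi in VIS]  # payoff_i = log(1/VI), the value proxy

def log_likelihood(theta, choice_counts):
    """Log likelihood of one subject's disk-choice counts under Eq. 2."""
    weights = [math.exp(theta * p) for p in PAYOFFS]
    total = sum(weights)
    return sum(n * math.log(w / total) for n, w in zip(choice_counts, weights))

def fit_theta(choice_counts):
    """Grid-search maximum-likelihood estimate of theta for a single subject.
    A sketch only; the reported analysis used nonlinear mixed-effects models."""
    grid = [i / 100 for i in range(0, 201)]  # theta from 0 to 2 in steps of .01
    return max(grid, key=lambda t: log_likelihood(t, choice_counts))

# Hypothetical choice counts, ordered from richest to leanest schedule.
counts = [410, 230, 150, 90, 60, 30, 20, 10]
print(fit_theta(counts))
```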

Fig. 2 Log (1/VI) and experienced reinforcement probability as a function of the programmed VI for the 6 pigeons

Choice differentiation as a function of programmed payoffs

When we examined the proportion of trials on which each disk was chosen by each pigeon, the pigeons showed a marked preference for disks with the richest programmed VI schedules (see Fig. 3, solid lines). One pigeon, Cosmo, showed a strong preference for the disk with the second-best payoff schedule. A closer examination of the pigeons’ disk choices, however, revealed that despite our attempts to train out disk biases, the pigeons still showed general preferences for disks in the lower part of the display (see the peck location density plots shown in Fig. 4; these plots were produced using JMP’s nonparametric bivariate density function; SAS Institute Inc., Cary, NC). For some pigeons, certain disks were so rarely sampled that these choices are not visible in our density plots. When these less-preferred disks were associated with high payoffs for a particular session, the pigeon rarely experienced the high value of these disks.

Fig. 3 Probability of choosing a disk associated with each programmed VI for Experiment 1 (pigeons), with the best-fitting Luce function superimposed (dashed lines)

Fig. 4 Peck density plots for each pigeon in Experiment 1. Only pecks on the disks are shown

As a baseline of comparison, we initially ignored these disk biases and identified the best-fitting θ for Eq. 1 [using log (1/VI) as a proxy for payoff rate] as a function of 5-min trial block (1–12). The analysis revealed that the degree of response differentiation, θ, varied as a function of block, F(11, 8975) = 5.71, p < .0001, BIC = −4,711, R² = .40. The maximum likelihood value of θ was .10 in Block 1, reached .32 by Block 3, peaked at .34 in Block 6, and steadily decreased toward .22 in Block 12. Thus, the pigeons tended to quickly differentiate the better disks among the choice alternatives, but as the session progressed, their behavior became increasingly undifferentiated. Interestingly, this behavior was highly correlated with the number of pecks produced throughout the session: Pecking was highest during Blocks 2–4 and then gradually fell throughout the session. By Block 12, responding averaged 28% of the peak rate of responding. It appears that as the pigeons' level of satiety increased, the motivation to differentiate among the payoff disks decreased, or the motivation to exploit abated.

We have defined exploitation for this experiment as a response to the option with the richest VI schedule; thus, the decreases in θ later in the session indicate increased exploration/decreased exploitation. A molecular definition of exploitation would involve the choice of the response with the highest momentary probability of payoff. When payoffs are delivered by VI schedule, the longer it has been since a particular option has been chosen, the greater that probability is. The response option with the leanest overall VI schedule may be the richest at the moment, if enough time has passed since it was last chosen. If the increasing exploration of options later in a session were the result of pigeons learning to choose other options due to an increase in their momentary reinforcement rate, we would expect an increase in payoff rate to accompany it. This outcome did not occur. Figure 5 shows the proportion of responses reinforced for each trial block within a session and indicates that—with the exception of Estelle—decreases in differentiation were associated with decreases, not increases, in reinforcement.

Fig. 5 Proportions of responses that were rewarded as a function of trial block for each pigeon in Experiment 1

The consequences of our using a VI schedule are revealed in the likelihood of continuing to respond on a disk that has just been rewarded. Figure 6 (left column) shows a smoothed spline of the likelihood of returning to a disk as a function of time elapsed since it was last rewarded, for the disks assigned the three richest schedules. The figure reveals a temporary decrease immediately following reward for some pigeons, at least for the VI 3-s and 9-s disks. During this dip, the pigeons were more likely to choose another disk (an exploratory response) as a function of its relative payoff likelihood, as shown in Figure 3.

Fig. 6 Smoothed response likelihood plots for each pigeon or human participant, showing the relative likelihoods of choosing the VI 3-s disk (top row), the VI 9-s disk (second row), and the VI 27-s disk (bottom row) as a function of time since the disk's previous reinforcement. Pigeons are on the left, and humans are on the right. The x-axis scales differ for each figure; the scales range from 0 to 8 in the top graphs, 0 to 20 in the middle graphs, and 0 to 50 in the bottom graphs

The predicted disk choices for each pigeon are shown superimposed on Figure 3. Luce’s decision rule predicts that responding is a monotonic function of disk value, and thus the rule cannot account for the unusual data patterns observed in Cosmo when disk value was solely a function of programmed (not experienced) payoff. However, the other birds’ behavior was well approximated by Eq. 2.

Finally, we examined the degree to which disk value on a previous session lingered into the next session. In the first 5-min part of a session (Block 1), response likelihood was as much a function of a disk’s value on the previous session [t(6) = 4.33, p < .01] as of its value for the current session [t(6) = 4.07, p < .01]. Over the next five blocks, the effect of a disk’s previous value steadily decreased (ts of 2.52, 1.66, 1.28, and 0.51), whereas the effect of a disk’s current value was maintained (ts = 4.54, 3.74, 3.88, and 4.55).

Choice differentiation as a function of programmed payoffs and disk location

Because some pigeons were not showing sufficient exploration of all eight response disks, using the programmed payoff in fitting Luce's decision rule is problematic. To incorporate the effect of disk location for individual birds, we assessed θ as a function of both trial block and disk location. The analysis revealed that the degree of response differentiation, θ, varied as a function of both block, F(11, 8968) = 3.74, p < .0001, and disk location, F(7, 8968) = 7.94, p < .0001, BIC = −5,048, R² = .46. A model that included an interaction produced a poorer fit, BIC = −4,510, indicating that it was overparameterized, and thus the interaction was not included in our analysis.

The best-fitting θ values as a function of trial block and disk are shown in Figure 7, which presents the main effects of both block (line graph) and disk location (star plot). It is readily apparent that exploitation (i.e., behavioral differentiation as a function of disk payoff) peaks relatively early in a session and steadily decreases, paralleling our earlier analysis that did not include disk location as a predictor. It is also apparent that responses on disks in the upper right part of the display (disks 0, 1, and 2) produce weaker behavioral differentiation as a function of disk payoff (i.e., lower θs), confirming the behavioral patterns documented in the peck density plots of Figure 4. Although the fit was better for this analysis, the improvements were relatively minor.

Fig. 7 Best-fitting values of θ as a function of 5-min trial block (line graph) and disk location (star plot; the axis range is 0 to .5). Error bars represent ±1 standard error

Discussion

In an eight-armed bandit task, pigeons' disk choice was largely a function of the VI schedule associated with each disk. For 4 of the pigeons, behavior was broadly consistent with that predicted by Luce's decision rule as applied to the programmed reinforcement rate [log (1/VI)], suggesting that the derived θ values are good estimates of the degree of exploitation exhibited by the pigeons. The pigeons did not show high degrees of exploration at the beginning of a session cued by session onset; rather, their low θ values early in a session reflected behavior that was heavily influenced by carryover of the previous session's disk values. Within 10 min, however, their responding was largely driven by the new reinforcement contingencies. Thus, increases in exploration were likely produced by adversity: Only when preferred disks were no longer paying off at a high rate did the pigeons begin to explore other choices (see Gallistel, Mark, King, & Latham, 2001, for an alternative interpretation of matching in nonstationary environments).

Our pigeons, which were working for primary reinforcers, showed less exploitation as a session progressed. This change could have been due to an anticipated change in disk payoffs, but the evidence suggests that exploitation decreased due to an increase in satiety. Regardless of this pattern, we did not see high degrees of exploitation at any point in a session. Averaged across every session and trial block, no pigeon chose its preferred disk more than 45% of the time (see Fig. 3). When these results were averaged across sessions but broken down by trial blocks, no pigeon chose its preferred disk more than 55% of the time (not shown). The pigeons were not adopting greedy strategies in our nonstationary environment. Despite our attempts to eliminate disk biases, the birds continued to show location preferences that were independent of a disk’s programmed reinforcement schedule. We attempted to incorporate these biases into our analysis as an independent factor that allowed less behavioral differentiation (lower θ values) for certain disk locations, but the fit was only marginally better. An alternative formulation that would retain Luce’s decision rule would be to incorporate disk location into our estimates of value, thus making a disk’s value a function of both its scheduled payoff and its location. Unfortunately, this approach would require a post hoc assessment of disk preferences for each bird.

Experiment 2

In our second experiment, we used a similar design to examine exploration versus exploitation in humans. We anticipated rapid changes in θ and fewer location preferences that were independent of payoffs. The literature on risky choice and risk perception suggests that people might be well adapted to identifying and responding to changes in payoffs for decisions under uncertainty (for a discussion of various examples, see Rakow & Miler, 2009).

Method

Participants

A total of 20 undergraduates (16 female, 4 male) at the University of California, Los Angeles (UCLA), received course credit for participating in the experiment.

Apparatus

Testing was conducted on a notebook computer with a 38-cm (diagonal) color monitor set at 1,152 × 864 pixels. Participants used a mouse to guide a cursor around a screen, and a response was recorded every time the left mouse button was clicked. A built-in speaker was used to give auditory feedback when a reward was given.

Procedure

Before the experiment commenced, participants recorded their gender, age, ethnicity, and grade point average at UCLA. We told participants that they would be doing a test of intelligence, that they would be presented with eight differently colored disks on the screen, and that they would be required to click on the disks using the cursor (see Fig. 8). The instructions indicated that sometimes when they did this, a box at the bottom of the screen would light up with the text “Click for a point,” at which point they should click the box to receive a point; their objective was to earn as many points as possible. Participants then completed a sample trial where the disks were present in an identical arrangement to that used for the pigeons (however, the particular disk color assignments differed from those used in Experiment 1). Clicking on any of the disks resulted in the box at the bottom lighting up. When participants clicked on the box, they heard the sound of a penny dropping, the box went dark, and they were awarded a point.

Fig. 8 The computer screen as it was presented to the human participants in Experiment 2. For the purpose of analysis, the disks were numbered consecutively in a clockwise direction, with the top disk being disk 0

Following the sample trial, participants completed six sessions. Each session was 6 min long. We used the same reward schedule that had been used with the pigeons: VIs of 3, 9, 27, 81, 243, 729, 2,187, and 6,561 s, with ±50% variation. The assignment of variable intervals to disks was constant within a session but was rearranged from session to session. The same rearrangement from session to session was used for each participant. Counters were provided at the top of the screen giving an indication of how many points had been collected in each session, and the appropriate counter was updated every time a point was collected.

At the conclusion of each session, the participants needed to click on a button (not shown in Fig. 8) to start the next session. At the end of the fifth session, they were asked to type into the computer answers to the questions “What do you think was happening during the task?” “What strategy did you use to earn points?” “Within (not between) a given session, how did the colored discs differ from each other?” and “Was there a difference from one session to another? If so, what was the difference?” Following this, they were asked to do the final, sixth session.

Results

When we examined the proportion of trials on which each disk was chosen by each person, all of the participants showed a systematic relationship between the scheduled payoff rate and the likelihood of choosing the corresponding disk (see Fig. 9, solid lines). A closer examination of the participants’ disk choices revealed no strong disk biases, unlike the strong biases observed for the pigeons (not shown due to the large number of participants). Every participant showed sufficient sampling of each disk; the participant showing the strongest bias still chose the least preferred disk 62 times (3% of total choices) across the six sessions.

Fig. 9 Actual probability of choosing each disk with the associated VI payoff rate for each participant in Experiment 2, with the best-fitting Luce fit superimposed (dashed lines)

We initially analyzed the degree of response differentiation, θ, as a function of session (1–6) and blocked time within session (30-s Blocks 1–12) using Eq. 1 (Fig. 9 shows the data fits as dashed lines; Fig. 10 shows changes in θ as a function of session and block). The analysis revealed that the degree of response differentiation, θ, varied as a function of block, F(11, 11389) = 56.48, p < .0001, and session, F(11, 11389) = 4.48, p < .0001, with a Block × Session interaction, F(55, 11389) = 2.62, p < .0001, BIC = −27,010, R² = .89. The maximum likelihood value of θ increased steadily throughout a session, and did so more rapidly in the later sessions (unfilled symbols) than in the earlier sessions (filled symbols), with only the first session showing a prolonged and gradual increase in θ. Unlike with the pigeons, there was no indication of a loss of control as each session progressed.

Fig. 10 Best-fitting values of θ as a function of 30-s trial block (1–12) and session (1–6, as indexed in the legend). The standard errors were approximately .035

As a consequence of the use of a VI schedule, most people showed a temporary decrease in the likelihood of choosing a disk after it was rewarded. Figure 6 (right column) shows the individual smoothed likelihood splines for each participant for the three richest schedules, and the vast majority of participants developed an aversion to returning to a disk that was just rewarded; the likelihood of returning to it was a function of its VI schedule. Thus, due to the temporary decrease in the efficacy of a recently rewarded response, participants were being encouraged to explore by sampling other disks.

Finally, we examined the degree to which disk value on a previous session lingered into the next session. In the first 30 s of a session (Block 1), response likelihood was largely a function of a disk’s value for the current session [t(19) = 7.75, p < .01], but there was a small, nonsignificant effect of the disk’s value from the previous session [t(19) = 1.68, p = .11]. Over the next four blocks, the effect of a disk’s previous value remained small (ts of 2.63, 0.78, 1.23, and 1.13) and was only significant in Block 2, whereas the effect of a disk’s current value increased and leveled off (ts = 10.47, 10.87, 11.95, and 11.89). By the final block, performance was entirely a function of a disk’s value for the current session [t(19) = 14.01, p < .01], with little effect of the disk’s value for the previous session [t(19) = 0.99, p = .32].

The strategy reports were largely uninformative. Six of the participants reported that points earned was somehow a function of time or delay (the correct controlling variable), 2 reported that points were a function of the number of times chosen, 1 reported a complex geometrical relationship, and the remaining participants’ reports were either vague or equivalent to reporting that they did not know. Sex, self-reported GPA, and self-reported strategy did not significantly predict the best-fitting value of θ, but our sample size was too small to identify all but the largest individual-difference effects (a prior study had found a weak negative correlation, r = −.09, between intelligence and exploratory behavior; Steyvers et al., 2009).

Discussion

In our eight-armed bandit task, human disk choice was largely a function of the VI schedule associated with each disk. Behavior was generally consistent with that predicted by Luce’s decision rule as applied to the programmed reinforcement rate [log (1/VI)]. Exploration was high early in a session and was only weakly a function of a disk’s previous value. This lack of carryover, accompanied by a high degree of exploration in the first block of a session (see Fig. 10), likely occurred because the transition from session to session was clearly demarcated for the participants (Fig. 8 shows the highlighting of the current session at the top of the screen). Thus, our human participants showed an adaptive increase in exploration in the presence of a signal that indicated a change in disk payoffs, unlike the pigeons in Experiment 1. Finally, like the pigeons, our human participants did not demonstrate a greedy strategy (see Fig. 9). Instead, they continued to explore other alternatives late in a session.

General discussion

Both pigeons and people produced response patterns that were often well modeled by Luce’s (1963) decision rule. Although there were some exceptions (most notably the pigeon Cosmo in Exp. 1), these deviations may have been driven by differences in the programmed and experienced disk payoffs or by idiosyncratic strategies that we have not assessed. Additionally, neither species demonstrated greedy strategies in the nonstationary environments used in the present study. Whereas exclusive choice of the highest value disk would seem adaptive once a chooser has learned that disks only change their value across sessions, the use of a VI schedule likely contributed to higher exploration by producing a temporary decrease in the value of a disk (see Fig. 6). Given the clocked nature of a VI, a disk with a leaner schedule is more likely to be rewarded than a disk with a richer schedule if the lean disk has not been chosen in a long time. For example, consider the choice between a VI 3-s and a VI 9-s disk. If the VI 3-s disk was chosen 8 s into a session, it would have an average delay of 3 s until its next reward was available (i.e., 11 s into the session). By contrast, the VI 9-s disk would have an average delay of 1 s until its next reward was available (i.e., 9 s into the session). Thus, the adoption of an optimal fully informed strategy would cause a chooser to occasionally sample the leaner schedules as a function of the elapsed time since their last reinforcement. Both the pigeons’ and people’s behavior often demonstrated a temporary decrease in the likelihood of choosing a disk that was recently rewarded, along with a rapid increase soon after (Fig. 6). After a peak in likelihood, responding gradually fell, which is largely a result of responses to a disk eventually being rewarded, thus truncating the distribution.
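To make the arithmetic above concrete, the sketch below computes the probability that a VI schedule has armed its next reward after a given delay, assuming the interval is timed from the last reinforcer and is distributed uniformly over ±50% of the scheduled value (a simplifying assumption for illustration; the experiments specified only the ±50% range).

```python
def prob_armed(elapsed_s, vi_s):
    """Probability that a VI schedule (interval uniform on [0.5*VI, 1.5*VI],
    timed from its last reinforcer) has armed its next reward after elapsed_s
    seconds without a reinforced response to that option."""
    lo, hi = 0.5 * vi_s, 1.5 * vi_s
    if elapsed_s <= lo:
        return 0.0
    if elapsed_s >= hi:
        return 1.0
    return (elapsed_s - lo) / (hi - lo)

# Echoing the example above: 8 s into a session, a never-chosen VI 9-s disk has
# roughly a .39 chance of being armed, whereas a VI 3-s disk reinforced at 8 s
# has no chance of paying off on the very next response.
print(prob_armed(8, 9))  # ~0.39
print(prob_armed(0, 3))  # 0.0
```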

The greatest species differences involved (a) strong disk biases in the pigeons but not in people and (b) the weak carryover of disk value across sessions for people but the strong carryover for pigeons. The strong disk biases were quite intransigent in our pigeons. Even after extensive attempts to train out these biases, the pigeons still underexplored certain responses (see Fig. 4). We believe that there are two significant contributors to these biases. First, the upper disks may have required substantial effort to reach, thus reducing their value due to a high response cost (cf. Jensen et al., 2006). Second, the pigeons may have been content to satisfice, such that there was insufficient motivation to maximize their reward rate. Given that the response rate gradually abated later in the session, satiation may have reduced the incentive to identify the disk with the highest value.

The second large species difference involved the fact that the pigeons’ behavior early in a session was heavily influenced by the disk values from the previous session, whereas people showed little session-to-session carryover of value. This result is even more remarkable given the extensive experience that the pigeons had with daily changes during training (309 sessions) and testing (24 sessions), an ample opportunity to learn that disk value did not (except in rare instances) carry over across sessions. In contrast, our human participants received only 6 min of training before disk payoffs changed and yet showed little value carryover. Thus, the pigeons increased exploration largely in response to an experienced change in payoff rates, whereas people increased exploration when a discriminative cue dictated.

The control over performance exerted by disk values from the prior session is striking when one considers that nonstationary procedures reveal strong constraints on the duration of working memory in the pigeon. Pigeon working memory has been found to last from tens of seconds, in delayed matching-to-sample procedures (e.g., Grant, 1976; White, Ruske, & Colombo, 1996), to no more than 1 or 2 h, on open-field spatial search tasks (Spetch, 1990; Spetch & Honig, 1988). This stands in stark contrast to retention of correct responses in stationary procedures, which has been shown to last for months or years (e.g., Cook, Levison, Gillett, & Blaisdell, 2005; Vaughan & Greene, 1984). Above-chance retention of disk values over a 24-h interval after only a single session of exposure has previously been reported in two-choice situations (e.g., Grace & McLean, 2006; Kyonka & Grace, 2008; Schofield & Davison, 1997). These studies involving between-session changes in reinforcement schedules reveal some lasting influence of the prior session's reinforcement contingencies at the beginning of the next session. To our knowledge, however, ours are the first results showing similar carryover effects on schedules involving more than two choice options. This suggests that pigeons acquired some memory for the distribution of values across multiple choice options from a single session, the influence of which persisted in the following session. We can only speculate that our task contained features that better tap into processes of long-term memory than have previous working memory procedures.

Although our human participants showed adaptive increases in exploratory behavior at the beginning of a session, session onset was clearly signaled. It is not known how quickly people would increase their exploratory behavior if change was not signaled. Without a signaled change in schedule, any increase in exploration would likely be a function of the magnitude of the change in disk value and of which disks (e.g., those of previously high or low value) changed their value. If a low-value, and thus undersampled, disk suddenly became the richest option, a high exploiter would be slow to discover this change. In contrast, if a high-value, and thus heavily sampled, disk suddenly decreased in value (which was typically the case in the present experiments), this change would be apparent to both high and low exploiters.

People’s sudden increase in exploratory behavior at the onset of each session suggests a level of operant control that goes beyond merely responding to changes in the payoffs of the operanda. One possibility is that this result provides further evidence of behavioral variability as an operant (Neuringer, 2002; Page & Neuringer, 1985), but the rapidity with which our human participants responded suggests insufficient time for variability to have been reinforced during the confines of our experiment. Thus, people previously must have learned the utility of exploration in the face of a rapidly changing environment. Pigeons, on the other hand, may be better adapted to more stable environments that reward perseveration over flexibility.

Although an actor always faces uncertainty about the utility of future actions, the randomness of events underlying this uncertainty ranges from randomness conforming to well-understood, linear Gaussian distributions to randomness best described by poorly understood nonlinear power laws (Taleb, 2007). It would be very interesting to understand how actors as diverse as humans and pigeons make decisions about action in the vastly different types of stochastic contexts that characterize real-world situations. Given the importance of understanding choice and the common desire to optimize choice strategies in stationary and nonstationary environments, we hope that more researchers will consider spending less time exploiting the study of simple choice tasks with stationary payoffs, and instead allocate more effort toward exploring many-choice tasks in nonstationary environments (e.g., Davison & Baum, 2000; Ward & Odum, 2008).