doi:10.1016/j.neunet.2007.04.028
Copyright © 2007 Elsevier Ltd All rights reserved.
Multiple model-based reinforcement learning explains dopamine neuronal activity
Mathieu Bertin (a, b), Nicolas Schweighofer (c) and Kenji Doya (a, d)
(a) ATR Computational Neuroscience Labs, 2-2-2 Hikaridai, “Keihanna Science City”, Kyoto 619-0288, Japan
(b) Laboratoire d’Informatique de Paris 6, Université Paris 6 Pierre et Marie Curie, 4 place Jussieu 75005, Paris, France
(c) Department of Biokinesiology and Physical Therapy, University of Southern California, 1540 E. Alcazar St. CHP 155, Los Angeles 90089-9006, USA
(d) Neural Computation Unit, Initial Research Project Laboratory, Okinawa Institute of Science and Technology, 12-22 Suzaki, Gushikawa, Okinawa, 904-2234, Japan
Received 18 February 2005; accepted 11 April 2007. Available online 6 June 2007.
Abstract
A number of computational models have explained the behavior of dopamine neurons in terms of temporal difference learning. However, earlier models cannot account for recent results of conditioning experiments; specifically, they do not satisfactorily explain the behavior of dopamine neurons when the interval between a cue stimulus and a reward varies. We address this problem with a modular architecture in which each module consists of a reward predictor and a value estimator. A “responsibility signal”, computed from the accuracy of the predictions of the reward predictors, weights both the contributions and the learning of the value estimators. This multiple-model architecture gives an accurate account of the behavior of dopamine neurons in two specific experiments: when the reward is delivered earlier than expected, and when the stimulus–reward interval varies uniformly over a fixed range.
Keywords: Dopamine; Reinforcement learning; Multiple model; Timing prediction; Classical conditioning
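The weighting scheme described in the abstract can be illustrated with a minimal sketch. The abstract states only that the responsibility signal is computed from the accuracy of the reward predictors; the softmax over negative squared reward-prediction errors, the width `sigma`, the discount `gamma`, the learning rate `alpha`, and both function names below are assumptions made for illustration and are not taken from the article.

```python
import math


def responsibilities(reward, predicted_rewards, sigma=0.5):
    """Assumed form: softmax over negative squared reward-prediction errors.

    Modules whose reward predictor was more accurate receive a larger weight.
    """
    scores = [-((reward - p) ** 2) / (2.0 * sigma ** 2) for p in predicted_rewards]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def weighted_td_step(values_now, values_next, reward, lam, gamma=0.9, alpha=0.1):
    """One TD step with responsibility-weighted values and updates.

    The global value is the responsibility-weighted sum of the module values,
    and each module's value is updated in proportion to its responsibility.
    """
    v_now = sum(l * v for l, v in zip(lam, values_now))
    v_next = sum(l * v for l, v in zip(lam, values_next))
    delta = reward + gamma * v_next - v_now  # global TD error
    updated = [v + alpha * l * delta for l, v in zip(lam, values_now)]
    return delta, updated


# Two hypothetical modules: the first predicted the reward, the second did not.
values_now = [0.2, 0.5]
lam = responsibilities(reward=1.0, predicted_rewards=[1.0, 0.0])
delta, updated = weighted_td_step(values_now, [0.0, 0.0], reward=1.0, lam=lam)
```

In this toy case the module that predicted the reward correctly receives most of the responsibility, so it both dominates the global value and receives most of the value update.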
Fig. 1. Behavior of a dopamine neuron when a monkey expects a reward 1 s after the lever touch (Hollerman & Schultz, 1998, with permission). A response to the reward is seen in test trials when it is delivered half a second early or late. Note that following an early reward, no subsequent dip of activity is observed at the time the reward had been expected.
Fig. 2. Response of a dopamine neuron when the stimulus–reward interval varies uniformly over a 2 s range (Fiorillo & Schultz, 2001, with permission). Rasters are sorted by stimulus–reward delay, with the shortest delays at the bottom. The vertical line marks the time of the reward. More firing is seen after the reward for shorter delays.
Fig. 3. Architecture of the MMRL (Multiple Model Reinforcement Learning) model. The activity of dopamine neurons is given by the global TD error δ(t). See text for abbreviations.
Fig. 4. Evolution of the TD error in the MMRL model over the course of learning when the stimulus–reward interval is constant.
Fig. 5. Simulated dopamine response in early and delayed reward conditions, for the tapped delay line model (left) and the MMRL model (right). S: cue stimulus. R: reward. After 150 training trials with an ISI of 10 time steps, the reward is presented 5 steps earlier (top), on time (middle), or 5 steps later (bottom). The tapped delay line model wrongly produces a negative error after an early reward (left, top), at the time the reward was expected. This negative error does not occur in our MMRL model (right, top), in accord with experimental data.
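The spurious dip attributed to the tapped delay line model in Fig. 5 can be reproduced with a short simulation of that baseline only (not the MMRL model). This is a sketch under stated assumptions: one weight per time step after the cue, 20 taps, a discount factor of 1, a learning rate of 0.3, and a unit reward are illustrative choices; only the 150 training trials, the 10-step ISI, and the 5-step shifts come from the caption.

```python
def run_trial(w, reward_step, gamma=1.0, alpha=0.3, learn=True):
    """Run one trial of tapped-delay-line TD learning.

    w[k] is the predicted value k steps after the cue (one tap per step).
    Returns the TD error at every step of the trial.
    """
    deltas = []
    for k in range(len(w)):
        r = 1.0 if k == reward_step else 0.0
        v_prev = w[k - 1] if k > 0 else 0.0  # no prediction before the cue
        delta = r + gamma * w[k] - v_prev
        if learn and k > 0:
            w[k - 1] += alpha * delta  # update the tap active on the previous step
        deltas.append(delta)
    return deltas


weights = [0.0] * 20          # assumed trial length: 20 steps after the cue
for _ in range(150):          # training: reward always 10 steps after the cue
    run_trial(weights, reward_step=10)

# Test trials with learning switched off.
early = run_trial(weights, reward_step=5, learn=False)
on_time = run_trial(weights, reward_step=10, learn=False)
late = run_trial(weights, reward_step=15, learn=False)
```

Because the delay line keeps running after an early reward, `early[10]` is strongly negative at the originally expected reward time, which is the behavior the caption describes as inconsistent with the recordings.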
Fig. 6. Simulated dopamine responses when the ISI varies uniformly over a range. The tapped delay line model (left) incorrectly predicts identical excitation for all possible reward times. The semi-Markov model (center) and the MMRL model (right) predict decreasing excitation as the interval gets longer, as observed in experimental data. The semi-Markov model also predicts a negative response for rewards delivered after longer-than-average intervals, which is not evident in the experimental data.