Computational noise in reward-guided learning drives behavioral variability in volatile environments

Abstract

When learning the value of actions in volatile environments, humans often make seemingly irrational decisions that fail to maximize expected value. We reasoned that these ‘non-greedy’ decisions, instead of reflecting information seeking during choice, may be caused by computational noise in the learning of action values. Here using reinforcement learning models of behavior and multimodal neurophysiological data, we show that the majority of non-greedy decisions stem from this learning noise. The trial-to-trial variability of sequential learning steps and their impact on behavior could be predicted both by blood oxygen level-dependent responses to obtained rewards in the dorsal anterior cingulate cortex and by phasic pupillary dilation, suggestive of neuromodulatory fluctuations driven by the locus coeruleus–norepinephrine system. Together, these findings indicate that most behavioral variability, rather than reflecting human exploration, is due to the limited computational precision of reward-guided learning.
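The distinction drawn in the abstract, between noise in the learning of action values and stochasticity at the time of choice, can be illustrated with a minimal simulation sketch. It is not the authors' fitting code: the function name, the parameter values, the two-armed setting and the assumption that the update noise scales with the magnitude of the prediction error are all illustrative choices made here.

```python
import numpy as np

def simulate_noisy_rl(rewards, alpha=0.3, zeta=0.5, beta=10.0, seed=0):
    """Simulate a two-armed bandit learner with noisy value updates.

    rewards: array (n_trials, 2), payoff of each arm on each trial.
    alpha:   learning rate (hypothetical value).
    zeta:    learning noise; update s.d. = zeta * |prediction error|
             (an assumption of this sketch).
    beta:    softmax inverse temperature; large beta approaches argmax.
    """
    rng = np.random.default_rng(seed)
    n_trials = rewards.shape[0]
    q = np.full(2, 0.5)                       # initial action values
    choices = np.empty(n_trials, dtype=int)
    greedy = np.empty(n_trials, dtype=bool)
    for t in range(n_trials):
        # Choice stochasticity: softmax over the two action values.
        p_choose_1 = 1.0 / (1.0 + np.exp(-beta * (q[1] - q[0])))
        a = int(rng.random() < p_choose_1)
        choices[t] = a
        # A decision is greedy if the chosen value is at least the other one.
        greedy[t] = q[a] >= q[1 - a]
        # Learning noise: the value update itself is corrupted by noise.
        delta = rewards[t, a] - q[a]
        q[a] += alpha * delta + rng.normal(0.0, zeta * abs(delta))
    return choices, greedy

# Usage with made-up payoffs: compare an exact learner and a noisy learner.
rewards = np.random.default_rng(1).random((200, 2))
_, greedy_exact = simulate_noisy_rl(rewards, zeta=0.0)
_, greedy_noisy = simulate_noisy_rl(rewards, zeta=0.5)
print("non-greedy fraction, exact learning:", 1 - greedy_exact.mean())
print("non-greedy fraction, noisy learning:", 1 - greedy_noisy.mean())
```

In this sketch, decisions are labeled greedy relative to the simulated learner's own noisy values, which an experimenter cannot observe. Relative to the values of an exact, noise-free learner, some of those same decisions would look non-greedy even when the choice rule is deterministic.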

Fig. 1: Experimental paradigm and noisy RL model.
Fig. 2: Contributions of learning noise and choice stochasticity to non-greedy decisions.
Fig. 3: Decomposition of learning noise into ultimately predictable and unpredictable terms.
Fig. 4: Characterization of decision effects predicted by learning noise.
Fig. 5: Neural correlates of learning noise in the human brain.
Fig. 6: Neural correlates of learning noise in choice-free, cued trials.
Fig. 7: Brain–behavior and pupillometric analyses.
Fig. 8: Proposed payoff–cost trade-off on learning precision.

Data availability

The data (behavioral, neuroimaging and pupillometric) that support these findings are available from the corresponding author upon request.

Code availability

Python and C++ code for fitting all computational models described in the article is available at https://github.com/csmfindling/learning_variability. The algorithmic backbone of the Monte Carlo procedures used to fit the models is described in the Supplementary Modeling Note.

Acknowledgements

We thank C. Summerfield (University of Oxford; Google DeepMind) for comments on an earlier version of the manuscript. This work was supported by a starting grant from the European Research Council awarded to V.W. (ERC-StG-759341), a junior researcher grant from the Agence Nationale de la Recherche awarded to V.W. (ANR-14-CE13-0028) and two department-wide grants from the Agence Nationale de la Recherche (ANR-10-LABX-0087 and ANR-10-IDEX-0001-02 PSL). C.F. was supported by a graduate research fellowship from the Direction Générale de l’Armement (2015-60-0041). S.P. was supported by a CNRS-Inserm ATIP-Avenir grant (R16069JS) and a research grant from the Programme Emergence(s) of the City of Paris.

Author information

Contributions

S.P. and V.W. were responsible for conceptualization. C.F., V.W. and S.P. were responsible for the methodology. C.F., V.S. and V.W. performed the formal analysis. V.S. and R.D. carried out the investigations. C.F., V.S. and V.W. wrote the original draft. C.F., V.S., S.P. and V.W. reviewed and edited the report. V.W. supervised the study and acquired funding.

Corresponding author

Correspondence to Valentin Wyart.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Neuroscience thanks Samuel Gershman, Yonatan Loewenstein, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Additional model comparisons across experiments 1 and 2 (N = 59 participants).

(a) Knock-out procedure. The top panel shows models that varied based on the presence or absence of learning noise (ζ) in addition to the softmax choice policy (β). The bottom panel shows models that varied based on the presence or absence of a softmax choice policy (β) on top of learning noise (ζ). (b) Results of the model comparison in the partial and complete feedback conditions for models described in panel a pooled across experiment 1 (N = 29) and experiment 2 (N = 30). Similarly to the main behavioral results, these comparisons revealed that participants featured both learning noise (fixed-effects: BF ≈ 10^50.3, random-effects: exceedance p = 0.99) and choice stochasticity (fixed-effects: BF ≈ 10^82.4, random-effects: exceedance p = 0.999) in the partial feedback condition (left panel). In the complete feedback condition, the model with learning noise better explained the data than the exact model (fixed-effects: BF ≈ 10^100.3, random-effects: exceedance p = 0.999). Furthermore, a model with learning noise and an argmax action selection policy fitted the data decisively better than a model with learning noise and a softmax policy (fixed-effects: BF ≈ 10^15.8, random-effects: exceedance p = 0.999) (right panel). Error bars for model frequencies correspond to the s.d. of estimated Dirichlet distributions.
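Fixed-effects Bayes factors of this kind are conventionally reported on a log10 scale, because evidence multiplies across participants and the group-level ratio quickly becomes very large. The sketch below shows only that arithmetic. The function name and the per-participant log evidences are hypothetical, and the sketch does not cover how the evidences themselves are estimated or how random-effects exceedance probabilities are computed.

```python
import numpy as np

def fixed_effects_log10_bf(log_evidence_a, log_evidence_b):
    """Return log10 of the group Bayes factor for model A over model B.

    Inputs are per-participant natural-log model evidences. Under a
    fixed-effects assumption, evidence multiplies across participants,
    so the log evidences are summed.
    """
    diff = np.sum(np.asarray(log_evidence_a) - np.asarray(log_evidence_b))
    return diff / np.log(10.0)   # convert natural log to log10

# Made-up log evidences for three participants (not data from the article).
log_ev_noisy = [-100.0, -120.0, -90.0]
log_ev_exact = [-105.0, -123.0, -98.0]
print("log10 BF:", fixed_effects_log10_bf(log_ev_noisy, log_ev_exact))
```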

Supplementary Figure 2 Results of the parameter recovery procedure.

Implementation of the parameter recovery procedure in experiment 1 (in the partial feedback condition). For a given set of parameter values, we simulated the model 29 times (once for each of the N = 29 different realizations of the task). The simulated actions were fitted using exactly the same procedure used to fit human data, to quantify the extent to which we could recover the simulated (ground-truth) parameter values. (a) Parameter recovery for the learning noise parameter ζ, with other parameters (softmax temperature 1/β and learning rates) fixed. (b) Parameter recovery for the softmax temperature 1/β with other parameters (learning noise ζ and learning rates) fixed. Fixed parameter values were set to group-level mean estimates obtained using a fixed-effects approach. For the single parameter whose value was varied, we considered 11 values logarithmically distributed around the group-level mean estimate. Horizontal lines represent the subjects’ group mean with the 99% confidence interval. Each dot represents the recovered parameter averaged across simulations (N = 29) with vertical lines showing s.d.m. The results indicate that ground-truth parameter values are well recovered by our fitting procedure and that the fitting procedure is robust to changes in the learning noise and softmax temperature parameters. Recovered parameters do not saturate within the range of parameter values considered for learning noise (a) and start to saturate only when the softmax temperature parameter is set to about three times the group-level mean value (b).
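As a minimal sketch of the grid described in this caption, the snippet below builds 11 values spaced evenly in log units around a central estimate. The caption does not state the span of the grid, so the `max_factor` argument, the function name and the example central value are assumptions made here for illustration.

```python
import numpy as np

def log_spaced_grid(center, n_values=11, max_factor=4.0):
    """Return n_values positive values spaced evenly in log units.

    The grid runs from center / max_factor to center * max_factor.
    With an odd n_values, the middle element equals center.
    max_factor is a hypothetical choice, not taken from the article.
    """
    exponents = np.linspace(-1.0, 1.0, n_values)
    return center * max_factor ** exponents

# Hypothetical group-level mean estimate for the varied parameter.
grid = log_spaced_grid(center=0.2)
print(grid)
```

Each grid value would then be used as the ground-truth parameter for one batch of simulations, with the remaining parameters held fixed.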

Supplementary information

Supplementary Figures 1 & 2, Supplementary Tables 1 & 2, Supplementary Modeling Note, and Supplementary Note

Reporting Summary

About this article

Cite this article

Findling, C., Skvortsova, V., Dromnelle, R. et al. Computational noise in reward-guided learning drives behavioral variability in volatile environments. Nat Neurosci 22, 2066–2077 (2019). https://doi.org/10.1038/s41593-019-0518-9

  • DOI: https://doi.org/10.1038/s41593-019-0518-9
