Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

Steckelmacher, Denis; Plisnier, Hélène; Roijers, Diederik M.; Nowé, Ann

doi:10.1007/978-3-030-46133-1_2

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11908)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable, and unusually robust to its hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete, continuous and pixel-based tasks. Source code: https://github.com/vub-ai-lab/bdpi. Appendix: https://arxiv.org/abs/1903.04193.
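
The actor update described in the abstract, in which the actor slowly imitates the average greedy policy of several critics, can be illustrated with a minimal tabular sketch. This is not the authors' implementation (see the linked repository for that). The tabular Q-tables, the mixing rate `lam`, the random critic values, and the function names `greedy_policy` and `actor_step` are all assumptions made for illustration.

```python
import numpy as np


def greedy_policy(q_values):
    """Return a one-hot greedy policy for a Q-table of shape (n_states, n_actions)."""
    greedy = np.zeros_like(q_values, dtype=float)
    greedy[np.arange(q_values.shape[0]), q_values.argmax(axis=1)] = 1.0
    return greedy


def actor_step(actor, critics, lam=0.05):
    """Move the actor a small step toward the average greedy policy of the critics.

    `lam` is a hypothetical mixing rate; a small value makes the imitation slow.
    """
    target = np.mean([greedy_policy(q) for q in critics], axis=0)
    return (1.0 - lam) * actor + lam * target


# Hypothetical setup: random Q-tables stand in for critics trained off-policy.
rng = np.random.default_rng(0)
n_states, n_actions, n_critics = 4, 3, 8
critics = [rng.normal(size=(n_states, n_actions)) for _ in range(n_critics)]

# Start from a uniform policy and repeatedly apply the slow imitation step.
actor = np.full((n_states, n_actions), 1.0 / n_actions)
for _ in range(200):
    actor = actor_step(actor, critics)
```

Because each step is a convex combination of two probability distributions, every row of `actor` remains a valid distribution. In states where the critics disagree about the greedy action, the averaged target spreads probability over several actions, which is one way to read the state-specific exploration the abstract compares to Thompson sampling.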

Notes

1. Deterministic actor-critic methods are slightly different and outside the scope of this paper.
2. https://github.com/maximecb/gym-miniworld.
3. Only the number of hidden neurons changes between some environments, a trivial change.

Acknowledgments

The first and second authors are funded by the Science Foundation of Flanders (FWO, Belgium), respectively as 1129319N Aspirant, and 1SA6619N Applied Researcher.

Author information

Authors and Affiliations

Vrije Universiteit Brussel, Pleinlaan 2, 1050, Brussels, Belgium
Denis Steckelmacher, Hélène Plisnier & Ann Nowé
VU Amsterdam, De Boelelaan 1105, 1081 HV, Amsterdam, The Netherlands
Diederik M. Roijers

Corresponding author

Correspondence to Denis Steckelmacher.

Editor information

Editors and Affiliations

Leuphana University, Lüneburg, Germany
Ulf Brefeld
IRISA/Inria, Rennes, France
Elisa Fromont
University of Würzburg, Würzburg, Germany
Andreas Hotho
Leiden University, Leiden, The Netherlands
Arno Knobbe
ETH Zurich, Zurich, Switzerland
Marloes Maathuis
Institut National des Sciences Appliquées, Villeurbanne, France
Céline Robardet

Cite this paper

Steckelmacher, D., Plisnier, H., Roijers, D.M., Nowé, A. (2020). Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics. In: Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Lecture Notes in Computer Science, vol 11908. Springer, Cham. https://doi.org/10.1007/978-3-030-46133-1_2

DOI: https://doi.org/10.1007/978-3-030-46133-1_2
Published: 30 April 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46132-4
Online ISBN: 978-3-030-46133-1
eBook Packages: Computer Science, Computer Science (R0)
