Abstract
Multi-objective reinforcement learning (MORL) algorithms aim to approximate the Pareto frontier uniformly in multi-objective decision-making problems. In deep reinforcement learning (RL), gradient-based methods are often adopted to learn deep policies and value functions because they converge quickly, but pure gradient-based methods cannot guarantee a uniformly approximated Pareto frontier. Evolution strategies, by contrast, operate directly in the solution space and can achieve a well-distributed Pareto frontier, yet applying them to optimize deep networks remains challenging. To combine the advantages of both families of methods, we propose a two-stage MORL framework that couples a gradient-based method with an evolution strategy. First, an efficient multi-policy soft actor-critic algorithm learns multiple policies collaboratively, with the lower layers of all policy networks shared; this first stage can be regarded as representation learning. Second, the multi-objective covariance matrix adaptation evolution strategy (MO-CMA-ES) fine-tunes the policy-independent parameters to approach a dense and uniform estimate of the Pareto frontier. Experimental results on three benchmarks (Deep Sea Treasure, Adaptive Streaming, and Super Mario Bros) demonstrate the superiority of the proposed method.
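The shared-trunk architecture described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the layer sizes, the single-layer trunk, and the helper names (`head_params`, `set_head_params`) are all assumptions made for clarity. The point is the parameter split: the lower layers are shared across all policies and trained in stage one, while each policy's head parameters are exposed as a flat vector that a second-stage optimizer such as MO-CMA-ES could mutate independently.

```python
import numpy as np

rng = np.random.default_rng(0)

class MultiHeadPolicy:
    """Sketch of a shared-trunk multi-policy network (sizes illustrative):
    lower layers shared across K policies, one small head per policy."""

    def __init__(self, obs_dim, act_dim, hidden, n_policies):
        # Shared lower layer: learned jointly by all policies in stage one.
        self.W_shared = rng.standard_normal((hidden, obs_dim)) * 0.1
        # Policy-independent heads: candidates for evolutionary fine-tuning.
        self.heads = [rng.standard_normal((act_dim, hidden)) * 0.1
                      for _ in range(n_policies)]

    def forward(self, obs, k):
        h = np.tanh(self.W_shared @ obs)   # shared representation
        return np.tanh(self.heads[k] @ h)  # action from policy k's head

    def head_params(self, k):
        # Flat vector of policy k's independent parameters, as an
        # evolution strategy would consume them in the second stage.
        return self.heads[k].ravel()

    def set_head_params(self, k, flat):
        # Write back a mutated candidate produced by the evolution strategy.
        self.heads[k] = flat.reshape(self.heads[k].shape)

pi = MultiHeadPolicy(obs_dim=4, act_dim=2, hidden=8, n_policies=3)
action = pi.forward(np.zeros(4), k=1)
```

Because only the small head vectors are evolved, the search space handed to MO-CMA-ES stays low-dimensional even when the shared trunk is a large deep network.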
Notes
Anonymous et al., Adaptive Streaming: From Bitrate Maximization to Rate-Distortion Optimization
Acknowledgements
This work was supported by research grants 2018AAA0102004, NSFC-61625201, and NSFC-61527804.
Cite this article
Chen, D., Wang, Y. & Gao, W. Combining a gradient-based method and an evolution strategy for multi-objective reinforcement learning. Appl Intell 50, 3301–3317 (2020). https://doi.org/10.1007/s10489-020-01702-7