Abstract
Multi-objective reinforcement learning (MORL) algorithms aim to approximate the Pareto frontier uniformly in multi-objective decision-making problems. In deep reinforcement learning (RL), gradient-based methods are often adopted to learn deep policies and value functions because they converge quickly, but pure gradient-based methods cannot guarantee a uniformly approximated Pareto frontier. Evolution strategies, by contrast, operate directly in the solution space and can achieve a well-distributed Pareto frontier, yet applying them to optimize deep networks remains challenging. To combine the advantages of both families of methods, we propose a two-stage MORL framework that couples a gradient-based method with an evolution strategy. First, an efficient multi-policy soft actor-critic algorithm learns multiple policies collaboratively, with the lower layers of all policy networks shared; this first stage can be regarded as representation learning. Second, the multi-objective covariance matrix adaptation evolution strategy (MO-CMA-ES) fine-tunes the policy-independent parameters to approach a dense and uniform estimate of the Pareto frontier. Experimental results on three benchmarks (Deep Sea Treasure, Adaptive Streaming, and Super Mario Bros) demonstrate the superiority of the proposed method.
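The shared-trunk architecture described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the layer sizes, the single-layer trunk, and the helper names (`head_params`, `set_head_params`) are all assumptions made for clarity. The point is the parameter split: the lower layers are shared across all policies and trained in stage one, while each policy's head parameters are exposed as a flat vector that a second-stage optimizer such as MO-CMA-ES could mutate independently.

```python
import numpy as np

rng = np.random.default_rng(0)

class MultiHeadPolicy:
    """Sketch of a shared-trunk multi-policy network (sizes illustrative):
    lower layers shared across K policies, one small head per policy."""

    def __init__(self, obs_dim, act_dim, hidden, n_policies):
        # Shared lower layer: learned jointly by all policies in stage one.
        self.W_shared = rng.standard_normal((hidden, obs_dim)) * 0.1
        # Policy-independent heads: candidates for evolutionary fine-tuning.
        self.heads = [rng.standard_normal((act_dim, hidden)) * 0.1
                      for _ in range(n_policies)]

    def forward(self, obs, k):
        h = np.tanh(self.W_shared @ obs)   # shared representation
        return np.tanh(self.heads[k] @ h)  # action from policy k's head

    def head_params(self, k):
        # Flat vector of policy k's independent parameters, as an
        # evolution strategy would consume them in the second stage.
        return self.heads[k].ravel()

    def set_head_params(self, k, flat):
        # Write back a mutated candidate produced by the evolution strategy.
        self.heads[k] = flat.reshape(self.heads[k].shape)

pi = MultiHeadPolicy(obs_dim=4, act_dim=2, hidden=8, n_policies=3)
action = pi.forward(np.zeros(4), k=1)
```

Because only the small head vectors are evolved, the search space handed to MO-CMA-ES stays low-dimensional even when the shared trunk is a large deep network.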
Notes
Anonymous et al., Adaptive Streaming: From Bitrate Maximization to Rate-Distortion Optimization
Acknowledgements
This work was supported by research grants 2018AAA0102004, NSFC-61625201, and NSFC-61527804.
Cite this article
Chen, D., Wang, Y. & Gao, W. Combining a gradient-based method and an evolution strategy for multi-objective reinforcement learning. Appl Intell 50, 3301–3317 (2020). https://doi.org/10.1007/s10489-020-01702-7