Computer Science, 2020, Vol. 47, Issue (3): 182-191. doi: 10.11896/jsjkx.190200352
杨惟轶1,白辰甲2,蔡超1,赵英男2,刘鹏2
YANG Wei-yi1, BAI Chen-jia2, CAI Chao1, ZHAO Ying-nan2, LIU Peng2
Abstract: Reinforcement learning, an important branch of machine learning, is a class of methods for finding optimal policies through interaction with an environment. In recent years it has been widely combined with deep learning, giving rise to the research field of deep reinforcement learning. As a new machine learning approach, deep reinforcement learning can both perceive complex inputs and solve for optimal policies, so it can be applied to complex decision-making problems such as robot control. The sparse reward problem is a core difficulty that deep reinforcement learning faces when solving tasks, and it arises widely in practical applications. Addressing it improves sample efficiency and the quality of the learned policy, and promotes the broad application of deep reinforcement learning to real-world tasks. This paper first reviews the core algorithms of deep reinforcement learning; it then introduces five classes of solutions to the sparse reward problem, including reward design and learning, experience replay mechanisms, exploration and exploitation, multi-goal learning, and auxiliary tasks; finally, it summarizes related work and discusses future directions.
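To make the sparse-reward setting concrete, the experience-replay family of solutions surveyed above can be illustrated by the goal-relabeling idea of hindsight experience replay (HER): transitions that failed to reach the intended goal are stored again with a goal the agent actually achieved later, turning failures into informative, positively rewarded samples. The sketch below is a minimal, hypothetical illustration (the function names, transition layout, and the "future" goal-sampling strategy are assumptions for this example, not the authors' implementation):

```python
import random

def sparse_reward(achieved, goal):
    """Sparse reward: 0 only when the achieved state matches the goal, else -1."""
    return 0.0 if achieved == goal else -1.0

def her_relabel(episode, reward_fn, k=4):
    """Relabel each transition with up to k goals achieved later in the
    same episode (the "future" strategy of hindsight experience replay).

    episode: list of (state, action, achieved_state, goal) tuples.
    Returns a list of (state, action, goal, reward) training transitions.
    """
    relabeled = []
    for t, (state, action, achieved, goal) in enumerate(episode):
        # Keep the original transition with the intended (usually unmet) goal.
        relabeled.append((state, action, goal, reward_fn(achieved, goal)))
        # Substitute goals: states actually achieved after step t.
        future = [step[2] for step in episode[t + 1:]]
        for new_goal in random.sample(future, min(k, len(future))):
            relabeled.append((state, action, new_goal,
                              reward_fn(achieved, new_goal)))
    return relabeled
```

Even when every original transition carries reward -1, the relabeled copies whose substituted goal coincides with an achieved state receive reward 0, so the replay buffer is no longer dominated by uninformative failures.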