Computer Science, 2020, Vol. 47, Issue (3): 182-191. doi: 10.11896/jsjkx.190200352
杨惟轶1,白辰甲2,蔡超1,赵英男2,刘鹏2
YANG Wei-yi1, BAI Chen-jia2, CAI Chao1, ZHAO Ying-nan2, LIU Peng2
Abstract: Reinforcement learning, an important branch of machine learning, is a class of methods for finding optimal policies through interaction with an environment. In recent years it has been widely combined with deep learning, giving rise to the research field of deep reinforcement learning. As a new machine learning approach, deep reinforcement learning can both perceive complex inputs and solve for optimal policies, so it can be applied to complex decision-making problems such as robot control. The sparse reward problem is a core difficulty that deep reinforcement learning faces when solving tasks, and it arises widely in practical applications. Addressing it improves sample efficiency and the quality of the learned policy, and promotes the broad application of deep reinforcement learning to real-world tasks. This paper first reviews the core algorithms of deep reinforcement learning; it then introduces five classes of solutions to the sparse reward problem, including reward design and learning, experience replay mechanisms, exploration and exploitation, multi-goal learning, and auxiliary tasks; finally, it summarizes related work and discusses future directions.
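To make the sparse-reward setting concrete, the experience-replay family of solutions surveyed above can be illustrated by the goal-relabeling idea of hindsight experience replay (HER): transitions that failed to reach the intended goal are stored again with a goal the agent actually achieved later, turning failures into informative, positively rewarded samples. The sketch below is a minimal, hypothetical illustration (the function names, transition layout, and the "future" goal-sampling strategy are assumptions for this example, not the authors' implementation):

```python
import random

def sparse_reward(achieved, goal):
    """Sparse reward: 0 only when the achieved state matches the goal, else -1."""
    return 0.0 if achieved == goal else -1.0

def her_relabel(episode, reward_fn, k=4):
    """Relabel each transition with up to k goals achieved later in the
    same episode (the "future" strategy of hindsight experience replay).

    episode: list of (state, action, achieved_state, goal) tuples.
    Returns a list of (state, action, goal, reward) training transitions.
    """
    relabeled = []
    for t, (state, action, achieved, goal) in enumerate(episode):
        # Keep the original transition with the intended (usually unmet) goal.
        relabeled.append((state, action, goal, reward_fn(achieved, goal)))
        # Substitute goals: states actually achieved after step t.
        future = [step[2] for step in episode[t + 1:]]
        for new_goal in random.sample(future, min(k, len(future))):
            relabeled.append((state, action, new_goal,
                              reward_fn(achieved, new_goal)))
    return relabeled
```

Even when every original transition carries reward -1, the relabeled copies whose substituted goal coincides with an achieved state receive reward 0, so the replay buffer is no longer dominated by uninformative failures.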