基于分层强化学习的无人机空战多维决策

doi:10.12382/bgxb.2022.0711

摘要/Abstract

摘要：

针对无人机空战过程中面临的智能决策问题,基于分层强化学习架构建立无人机智能空战的多维决策模型。将空战自主决策由单一维度的机动决策扩展到雷达开关、主动干扰、队形转换、目标探测、目标追踪、干扰规避、武器选择等多个维度,实现空战主要环节的自主决策;为解决维度扩展后决策模型状态空间复杂度高、学习效率低的问题,结合Soft Actor-Critic算法和专家经验训练和建立元策略组,并改进传统的Option-Critic算法,设计优化策略终止函数,提高策略切换的灵活性,实现空战中多个维度决策的无缝切换。实验结果表明,该模型在无人机空战全流程的多维度决策问题中具有较好的对抗效果,能够控制智能体根据不同的战场态势灵活切换干扰、搜索、打击、规避等策略,达到提升传统算法性能和提高解决复杂决策效率的目的。

关键词: 无人机空战, 多维决策, 分层强化学习, Soft Actor-Critic算法, Option-Critic算法

Abstract:

To address intelligent decision-making in UAV air combat, a multi-dimensional decision-making model is established on a hierarchical reinforcement learning architecture. Autonomous air combat decision-making is extended from single-dimensional maneuver decisions to multiple dimensions, including radar switching, active jamming, formation conversion, target detection, target tracking, interference avoidance, and weapon selection, so that the main stages of air combat are decided autonomously. To cope with the higher state-space complexity and lower learning efficiency that follow from this expansion, a group of meta-strategies is trained and built by combining the Soft Actor-Critic algorithm with expert experience. The traditional Option-Critic algorithm is also improved: the strategy termination function is designed and optimized to make strategy switching more flexible and to achieve seamless switching among decision dimensions during air combat. Experimental results show that the model performs well in adversarial tests of multi-dimensional decision-making across the whole UAV air combat process. It controls the agent to switch flexibly among jamming, search, strike, and avoidance strategies according to the battlefield situation, improving on the performance of traditional algorithms and the efficiency of solving complex decision-making problems.

Key words: UAV air combat, multi-dimensional decision-making, hierarchical reinforcement learning, Soft Actor-Critic algorithm, Option-Critic algorithm
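
The abstract describes a hierarchy in which pre-trained meta-strategies are switched by a termination function. The paper's networks, state features, and improved Option-Critic (Beta) algorithm are not given on this page, so the following is only a minimal sketch of the generic call-and-return option loop used by the Option-Critic family. The option names are borrowed from the abstract's wording; the state fields (`target_visible`, `in_attack_zone`, `threat`), the hand-written scores, and the termination probabilities are all hypothetical stand-ins for learned components.

```python
import random

# Hypothetical meta-strategy names, borrowed from the abstract's wording.
OPTIONS = ["search", "jam", "strike", "evade"]


def select_option(state):
    """Toy high-level policy: pick the option with the highest hand-written score.

    In a learned system this would be a policy over options; the scores
    below are illustrative assumptions only.
    """
    scores = {
        "search": 1.0 - state["target_visible"],
        "jam": 0.5 * state["target_visible"] * (1.0 - state["in_attack_zone"]),
        "strike": state["target_visible"] * state["in_attack_zone"] * (1.0 - state["threat"]),
        "evade": state["threat"],
    }
    # max() returns the first option in OPTIONS order when scores tie.
    return max(OPTIONS, key=lambda name: scores[name])


def termination_prob(option, state):
    """Toy termination function beta(option, state), returning a value in [0, 1]."""
    if option == "evade":
        return 1.0 - state["threat"]          # stop evading once the threat is gone
    if state["threat"] > 0.8:
        return 1.0                            # any other option yields to a high threat
    if option == "search":
        return state["target_visible"]        # stop searching once a target is seen
    if option == "strike":
        return 1.0 - state["in_attack_zone"]  # stop striking when out of the zone
    return 0.1                                # jam: small chance of re-planning


def run_episode(states, rng):
    """Call-and-return execution: keep the current option until beta terminates it."""
    option = select_option(states[0])
    trace = [option]
    for state in states[1:]:
        if rng.random() < termination_prob(option, state):
            option = select_option(state)
        trace.append(option)
    return trace


if __name__ == "__main__":
    demo_states = [
        {"target_visible": 0.0, "in_attack_zone": 0.0, "threat": 0.0},
        {"target_visible": 1.0, "in_attack_zone": 1.0, "threat": 0.0},
        {"target_visible": 1.0, "in_attack_zone": 1.0, "threat": 1.0},
        {"target_visible": 1.0, "in_attack_zone": 1.0, "threat": 0.0},
        {"target_visible": 1.0, "in_attack_zone": 1.0, "threat": 0.0},
    ]
    print(run_episode(demo_states, random.Random(0)))
    # ['search', 'strike', 'evade', 'strike', 'strike']
```

The demo uses termination probabilities of exactly 0 or 1, so the trace does not depend on the random seed. In a trained model the termination function would return intermediate probabilities, and its shape would determine how readily the agent switches strategies.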

张建东, 王鼎涵, 杨啟明, 史国庆, 陆屹, 张耀中. 基于分层强化学习的无人机空战多维决策[J]. 兵工学报, 2023, 44(6): 1547-1563.

ZHANG Jiandong, WANG Dinghan, YANG Qiming, SHI Guoqing, LU Yi, ZHANG Yaozhong. Multi-Dimensional Decision-Making for UAV Air Combat Based on Hierarchical Reinforcement Learning[J]. Acta Armamentarii, 2023, 44(6): 1547-1563.

Figures/Tables 32

图1 空战全流程分析

Fig.1 Analysis of the whole air combat process

图2 雷达探测重叠区域分析

Fig.2 Overlapping area analysis of radar detection

图3 我机位于判断圆域外分析图

Fig.3 Analysis of our UAVs located outside the judgment circle

图4 我机位于判断圆域内分析图

Fig.4 Analysis of our UAVs located in the judgment circle

图5 导弹攻击区

Fig.5 Missile attack zone

图6 SAC训练模型算法伪代码

Fig.6 Pseudocode of SAC training model algorithm

图7 整体作战的分层决策结构

Fig.7 Hierarchical decision-making structure for operations

图8 改进Option-Critic算法结构图

Fig.8 Diagram of improved Option-Critic algorithm structure

图9 Beta算法伪代码

Fig.9 Pseudocode of Beta algorithm

图10 多维空战的构建方法及流程

Fig.10 Construction method and process of multi-dimensional air combat

图11 目标运动、我机固定时的追踪训练示意图

Fig.11 Schematic diagram of tracking training when the enemy UAVs are moving and our UAV is fixed

图12 SAC跟踪训练的原始回报曲线(100轮)

Fig.12 Original reward curve of SAC tracking training (100 rounds)

图13 SAC跟踪训练的原始回报曲线(1000轮)

Fig.13 Original reward curve of SAC tracking training (1000 rounds)

图14 我机移动时跟踪验证的回报曲线

Fig.14 Reward curve for tracking verification when our UAV is moving

图15 验证演示过程示意图

Fig.15 Schematic diagram of verification demonstration

图16 基于DDPG的跟踪训练原始回报曲线

Fig.16 Original reward curve of tracking training based on DDPG

图17 跟踪加规避干扰的回报曲线

Fig.17 Reward curve for tracking plus interference avoidance

图18 干扰规避与追踪演示图

Fig.18 Interference avoidance and tracking demo

图19 决策模型算法流程

Fig.19 Process of decision-making model algorithm

图20 多维决策算法执行示意图

Fig.20 Schematic diagram of multi-dimensional decision-making algorithm execution

图21 分层强化1v1的仿真验证示意图

Fig.21 Schematic diagram of 1v1 simulation verification with hierarchical reinforcement learning

表1 10回合雷达开关与否的被动发现次数统计

Table 1 Number of passive discoveries with and without the radar switch strategy in 10 rounds

Round	N₁	N₂	N₁-N₂
1	16	15	1
2	17	8	9
3	22	12	10
4	18	14	4
5	26	16	10
6	24	10	14
7	21	15	6
8	19	13	6
9	17	15	2
10	20	12	8
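
The averages compared in Fig.22 can be recomputed from the per-round counts in Table 1. The short sketch below does that arithmetic. The variable names `n1` and `n2` simply follow the table's first and second count columns. Which column corresponds to using the radar switch strategy is not stated in the table itself, so the code makes no claim about it.

```python
from statistics import mean

# Per-round passive-discovery counts copied from Table 1 (rounds 1 to 10).
n1 = [16, 17, 22, 18, 26, 24, 21, 19, 17, 20]  # first count column
n2 = [15, 8, 12, 14, 16, 10, 15, 13, 15, 12]   # second count column

# Per-round difference, matching the table's last column.
differences = [a - b for a, b in zip(n1, n2)]

mean_n1 = mean(n1)
mean_n2 = mean(n2)
mean_diff = mean(differences)

print(differences)                  # [1, 9, 10, 4, 10, 14, 6, 6, 2, 8]
print(mean_n1, mean_n2, mean_diff)  # 20 13 7
print(mean_diff / mean_n1)          # 0.35
```

On these ten rounds the second column averages 13 passive discoveries against 20 for the first, a 35% reduction relative to the first column.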

图22 使用雷达开关策略与否的平均被动发现次数对比

Fig.22 Comparison of average passive discoveries with and without radar switch strategy

图23 初始编队

Fig.23 Initial formation

表2 10回合队形转换50步内打击与损失统计

Table 2 Strikes and losses within 50 steps over 10 rounds of formation conversion

Round	S	Round	S	D
1	1	6	0	0
2	0	7	2	1
3	2	8	0	0
4	1	9	1	0
5	0	10	1	0

图24 编队重组后协同编队打击目标

Fig.24 Cooperative formation striking the target after formation reorganization

图25 友机阵亡我机搜索决策

Fig.25 Our UAV’s search decision when friendly UAVs are killed

图26 分布式搜索

Fig.26 Distributed search

图27 打击演示

Fig.27 Strike demo

表3 10回合当前距离与攻击区边缘的差距统计

Table 3 Gap between the current distance and the edge of the attack zone in 10 rounds

Round	D₁/m	Δ₁/m	Round	D₁/m	Δ₁/m
1	119.3	0.7	6	108.2	11.8
2	114.7	5.3	7	103.9	16.1
3	100.2	19.8	8	114.2	5.8
4	118.9	1.1	9	119.1	0.9
5	117.3	2.7	10	119.3	0.7
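
As a consistency check on Table 3, the sketch below verifies that the two columns of each round add up to the same value and computes the average gap. The 120 m attack-zone edge is not stated on this page; it is an assumption inferred from the fact that every row sums to 120. The names `distance_m`, `gap_m`, and `EDGE_M` are illustrative.

```python
import math

# Values copied from Table 3, rounds 1 to 10.
distance_m = [119.3, 114.7, 100.2, 118.9, 117.3, 108.2, 103.9, 114.2, 119.1, 119.3]  # D1 column
gap_m = [0.7, 5.3, 19.8, 1.1, 2.7, 11.8, 16.1, 5.8, 0.9, 0.7]                        # Delta1 column

# Assumed attack-zone edge, inferred from the rows rather than stated in the source.
EDGE_M = 120.0

# Every row should satisfy distance + gap = edge (up to floating-point rounding).
rows_consistent = all(
    math.isclose(d + g, EDGE_M, abs_tol=1e-9) for d, g in zip(distance_m, gap_m)
)

mean_gap = sum(gap_m) / len(gap_m)
mean_distance = sum(distance_m) / len(distance_m)

print(rows_consistent)          # True
print(round(mean_gap, 2))       # 6.49
print(round(mean_distance, 2))  # 113.51
```

Under that assumption, the agent fires on average about 6.5 m inside the edge of the attack zone, and in four of the ten rounds the gap is about 1 m or less.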

表4 作战结果

Table 4 Combat result

Algorithm	Rounds	Win rate/%	Average combat losses
fix_rule_no_att algorithm	10	0	3.8
Beta algorithm	10	100	1.4

图28 算法对比

Fig.28 Algorithm comparison
