Abstract
Solving the optimization problem of approaching a Nash Equilibrium plays an important role in imperfect-information games such as StarCraft and poker. Neural Fictitious Self-Play (NFSP) is an effective algorithm that learns an approximate Nash Equilibrium of imperfect-information games purely from self-play, without prior domain knowledge. However, it needs to train a neural network in an off-policy manner to approximate the action values. For games with large search spaces, the training may suffer from unnecessary exploration and sometimes fails to converge. In this paper, we propose a new Neural Fictitious Self-Play algorithm, called MC-NFSP, that combines Monte Carlo tree search with NFSP to improve performance in real-time zero-sum imperfect-information games. With experiments and empirical analysis, we demonstrate that the proposed MC-NFSP algorithm can approximate a Nash Equilibrium in games with large search depth, while NFSP cannot. Furthermore, we develop an Asynchronous Neural Fictitious Self-Play framework (ANFSP), which uses an asynchronous, parallel architecture to collect game experience and improve both training efficiency and policy quality. Experiments on games with hidden state information (Texas Hold'em) and on FPS (first-person shooter) games demonstrate the effectiveness of our algorithms.
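To make the idea concrete, the fictitious-self-play scheme sketched in the abstract can be illustrated as follows: each agent mixes a best-response policy (here a UCT-style Monte Carlo tree search stand-in, as in MC-NFSP) with an average policy that imitates its own past best responses. This is a minimal sketch under our own simplifications, not the paper's implementation; all class and parameter names (`MixedPolicyAgent`, `eta`) are illustrative.

```python
import math
import random

def uct_value(total_value, visits, parent_visits, c=1.4):
    """UCT (Upper Confidence Bound applied to Trees) score for one action."""
    if visits == 0:
        return float("inf")  # unvisited actions are tried first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

class MixedPolicyAgent:
    """NFSP-style agent: best response with prob. eta, average policy otherwise."""

    def __init__(self, actions, eta=0.1, seed=0):
        self.actions = actions
        self.eta = eta
        self.rng = random.Random(seed)
        # Running counts stand in for the supervised average-policy network.
        self.action_counts = {a: 1 for a in actions}

    def best_response(self, stats):
        # stats maps action -> (total_value, visit_count) from tree search.
        parent = sum(v for _, v in stats.values()) or 1
        return max(self.actions,
                   key=lambda a: uct_value(*stats.get(a, (0.0, 0)), parent))

    def average_policy_action(self):
        # Sample an action in proportion to how often it was chosen before.
        total = sum(self.action_counts.values())
        r = self.rng.random() * total
        for a, c in self.action_counts.items():
            r -= c
            if r <= 0:
                return a
        return self.actions[-1]

    def act(self, stats):
        if self.rng.random() < self.eta:
            a = self.best_response(stats)
            self.action_counts[a] += 1  # average policy tracks best responses
            return a
        return self.average_policy_action()

agent = MixedPolicyAgent(["fold", "call", "raise"], eta=1.0, seed=1)
stats = {"call": (5.0, 10), "fold": (1.0, 10)}
print(agent.act(stats))  # the unvisited "raise" has infinite UCT score
```

In full MC-NFSP the tabular counts above are replaced by neural networks trained from reservoir-sampled experience, and the best-response statistics come from actual tree-search rollouts.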
Acknowledgements
This work was supported by the National Key Research and Development Program of China (2017YFB1002503) and the Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2018AAA0100902) of China.
Author information
Li Zhang received the BEng and PhD degrees from Zhejiang University, China in 2007 and 2013, respectively. He is currently an assistant researcher in the Department of Computer Science, Zhejiang University, China. In 2009, he was a visiting scholar at the University of Hong Kong, China. From 2013 to 2017, he was a researcher at Works Applications Co., Ltd. His current interests include deep learning, game theory, human-machine hybrid computing, and pervasive computing. He has authored over ten refereed papers and patents.
Yuxuan Chen is a PhD student in computer science in the Pervasive and Cyborg Technology laboratory of Zhejiang University, China. His research focuses on game algorithms for incomplete-information games and Nash Equilibrium problems. He has done research on reinforcement learning and implemented related algorithms. He is also interested in human-computer interaction and contributes to open-source gaming algorithms.
Wei Wang is currently an algorithm engineer at Alibaba. She received her MS in computer science from Zhejiang University, China, and her BS in computer science from the University of Electronic Science and Technology of China. She conducted reinforcement learning research during her studies and now works on recommendation systems.
Ziliang Han is currently pursuing a master's degree at the Pervasive and Cyborg Technology laboratory of Zhejiang University, China. He holds a bachelor's degree in computer science from Jilin University, China. He has done research on reinforcement learning and implemented related algorithms, and is building a computational platform for large-scale game-algorithm training and evaluation. He also contributes to open-source gaming algorithms.
Shijian Li received the PhD degree from Zhejiang University, China in 2006. In 2010, he was a visiting scholar with Institut Telecom SudParis, France. He is currently with the College of Computer Science and Technology, Zhejiang University, China. He has published over 40 papers. His research interests include sensor networks, ubiquitous computing, and social computing. He serves as an Editor of the International Journal of Distributed Sensor Networks and as a Reviewer or PC Member of more than ten conferences.
Zhijie Pan, PhD, is a professor and director of the Intelligent Vehicle Research Center of Zhejiang University, China. He is mainly engaged in cross-disciplinary research between computer science and automotive engineering, including AI, autonomous vehicle technology, automated driving, and intelligent transportation systems (ITS). He has published more than 90 papers in international conferences and journals (IEEE, SAE, JSAE, etc.) and holds more than 100 patents.
Gang Pan received the BEng and PhD degrees from Zhejiang University, China in 1998 and 2004, respectively. He is currently a professor in the Department of Computer Science and deputy director of the State Key Lab of CAD&CG, Zhejiang University, China. His current interests include artificial intelligence, pervasive computing, brain-inspired computing, and brain-machine interfaces. He has authored over 100 refereed papers and holds 35 granted patents. He received three best paper awards (e.g., ACM UbiComp'16) and three nominations from premier international conferences. He is the recipient of the IEEE TCSC Award for Excellence (Middle Career Researcher), the CCF-IEEE CS Young Computer Scientist Award, and the State Scientific and Technological Progress Award. He serves as an Associate Editor of IEEE Transactions on Neural Networks and Learning Systems, IEEE Systems Journal, and Pervasive and Mobile Computing.
Electronic supplementary material
11704_2020_9307_MOESM1_ESM.pdf
A Monte Carlo Neural Fictitious self-play approach to approximate Nash Equilibrium in imperfect-information dynamic games
Cite this article
Zhang, L., Chen, Y., Wang, W. et al. A Monte Carlo Neural Fictitious Self-Play approach to approximate Nash Equilibrium in imperfect-information dynamic games. Front. Comput. Sci. 15, 155334 (2021). https://doi.org/10.1007/s11704-020-9307-6