Abstract
This paper studies approximate policy iteration (API) for solving undiscounted optimal control problems. A discrete-time system with a continuous state space and a finite action set is considered. Because an approximation technique is used to handle the continuous state space, approximation errors enter the computation and disturb the convergence of the original policy iteration. In this paper, we analyze and prove the convergence of API for undiscounted optimal control. We implement approximate policy evaluation with an iterative method and show that the error between the approximate and exact value functions is bounded. Since the action set is finite, the greedy policy in the policy improvement step is generated directly. Our main theorem proves that API converges to the optimal policy if a sufficiently accurate approximator is used. For implementation, we introduce a fuzzy approximator and verify its performance on the puddle world problem.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (Grant No. 61273136), the State Key Laboratory of Robotics and System (SKLRS-2015-ZD-04), and the U.S. National Science Foundation (NSF) under Grant ECCS 1053717.
Cite this article
Zhu, Y., Zhao, D., He, H. et al. Convergence Proof of Approximate Policy Iteration for Undiscounted Optimal Control of Discrete-Time Systems. Cogn Comput 7, 763–771 (2015). https://doi.org/10.1007/s12559-015-9350-z