Automatica

Volume 49, Issue 1, January 2013, Pages 82-92

A novel actor–critic–identifier architecture for approximate optimal control of uncertain nonlinear systems

https://doi.org/10.1016/j.automatica.2012.09.019

Abstract

An online adaptive reinforcement learning-based solution is developed for the infinite-horizon optimal control problem for continuous-time uncertain nonlinear systems. A novel actor–critic–identifier (ACI) architecture is proposed to approximate the Hamilton–Jacobi–Bellman equation using three neural network (NN) structures: actor and critic NNs approximate the optimal control and the optimal value function, respectively, and a robust dynamic neural network identifier asymptotically approximates the uncertain system dynamics. An advantage of the ACI architecture is that learning by the actor, critic, and identifier is continuous and simultaneous, without requiring knowledge of the system drift dynamics. Convergence of the algorithm is analyzed using Lyapunov-based adaptive control methods. A persistence of excitation condition is required to guarantee exponential convergence to a bounded region in the neighborhood of the optimal control and uniformly ultimately bounded (UUB) stability of the closed-loop system. Simulation results demonstrate the performance of the actor–critic–identifier method for approximate optimal control.

Introduction

Reinforcement learning (RL) uses evaluative feedback from the environment to take appropriate actions (Sutton & Barto, 1998). One of the most widely used architectures for implementing RL algorithms is the actor–critic architecture, in which an actor performs actions by interacting with its environment, and a critic evaluates those actions and gives feedback to the actor, leading to improved performance of subsequent actions (Barto et al., 1983, Sutton and Barto, 1998, Widrow et al., 1973). Actor–critic algorithms are pervasive in machine learning and are used to learn the optimal policy online for finite-space discrete-time Markov decision problems (Barto et al., 1983, Konda and Tsitsiklis, 2004, Prokhorov et al., 1997, Sutton and Barto, 1998, Werbos, 1990).

Similar to RL, optimal control involves selection of an optimal policy based on some long-term performance criteria. Dynamic Programming (DP) provides a means to solve optimal control problems (Kirk, 2004); however, DP is implemented backward in time, making it offline and computationally expensive for complex systems. Owing to the similarities between optimal control and RL (Sutton, Barto, & Williams, 1992), Werbos (1990) introduced RL-based actor–critic methods for optimal control, called Approximate Dynamic Programming (ADP). ADP uses neural networks (NNs) to approximately solve DP forward-in-time, thus avoiding the curse of dimensionality. A detailed discussion of ADP-based designs is found in Bertsekas and Tsitsiklis (1996), Prokhorov et al. (1997) and Si, Barto, Powell, and Wunsch (2004). The success of ADP prompted a major research effort towards designing ADP-based optimal feedback controllers. The discrete/iterative nature of the ADP formulation lends itself naturally to the design of discrete-time optimal controllers (Al-Tamimi et al., 2008, Balakrishnan and Biega, 1996, Dierks et al., 2009, Ferrari and Stengel, 2002, He and Jagannathan, 2007, Lendaris et al., 2000, Padhi et al., 2006).

Extending ADP-based controllers to continuous-time systems entails challenges in proving stability and convergence and in ensuring that the algorithm is online and model-free. Early solutions to the problem discretized time and state and then applied an RL algorithm to the discretized system. Discretizing the state space for high-dimensional systems requires a large memory space and a computationally prohibitive learning process. Baird (1993) proposed Advantage Updating, an extension of the Q-learning algorithm which can be implemented in continuous time and provides faster convergence. Doya (2000) used a Hamilton–Jacobi–Bellman (HJB) framework to derive algorithms for value function approximation and policy improvement, based on a continuous-time version of the temporal difference error. Murray, Cox, Lendaris, and Saeks (2002) also used the HJB framework to develop a stepwise stable iterative ADP algorithm for continuous-time input-affine systems with an input-quadratic performance measure. In Beard, Saridis, and Wen (1997), Galerkin’s spectral method was used to approximate the solution to the generalized HJB (GHJB) equation, from which a stabilizing feedback controller was computed offline. Similar to Beard et al. (1997), Abu-Khalaf and Lewis (2005) proposed a least-squares successive approximation solution to the GHJB, where an NN is trained offline to learn the GHJB solution.

All of the aforementioned approaches for continuous-time nonlinear systems are offline and/or require complete knowledge of the system dynamics. One of the contributions in Vrabie and Lewis (2009) is that only partial knowledge of the system dynamics is required, and a hybrid continuous-time/discrete-time sampled-data controller is developed based on policy iteration (PI), where the feedback control operation of the actor occurs at a faster time scale than the learning process of the critic. Vamvoudakis and Lewis (2010) extended the idea by designing a model-based online algorithm, called synchronous PI, which involves synchronous, continuous-time adaptation of both actor and critic neural networks. Inspired by the work in Vamvoudakis and Lewis (2010), a novel actor–critic–identifier architecture is proposed in this paper to approximately solve the continuous-time infinite-horizon optimal control problem for uncertain nonlinear systems; however, unlike Vamvoudakis and Lewis (2010), the developed method does not require knowledge of the system drift dynamics. The actor and critic NNs approximate the optimal control and the optimal value function, respectively, whereas the identifier dynamic neural network (DNN) estimates the system dynamics online. The integral RL technique in Vrabie and Lewis (2009) leads to a hybrid continuous-time/discrete-time controller with a two-time-scale actor–critic learning process, whereas the approach in Vamvoudakis and Lewis (2010), although continuous-time, requires complete knowledge of the system dynamics. A contribution of this paper is the use of a novel actor–critic–identifier architecture, which obviates the need to know the system drift dynamics, and in which the learning of the actor, critic, and identifier is continuous and simultaneous. Moreover, the actor–critic–identifier method utilizes an identification-based online learning scheme, and hence is the first ever indirect adaptive control approach to RL. The idea is similar to the Heuristic Dynamic Programming (HDP) algorithm (Werbos, 1992), where Werbos suggested the use of a model network along with the actor and critic networks. Because of the generality of the considered system and objective function, the solution approach in this paper can be used in a wide range of applications in different fields, e.g., optimal control of space/air vehicles, chemical and manufacturing processes, robotics, and financial systems.

In the developed method, the actor and critic NNs use gradient-based and least-squares-based update laws, respectively, to minimize the Bellman error, which is the difference between the exact and the approximate HJB equation. The identifier DNN is a combination of a Hopfield-type component (Hopfield, 1984), in parallel configuration with the system (Poznyak, Sanchez, & Yu, 2001), and a novel RISE (Robust Integral of the Sign of the Error) component. The Hopfield component of the DNN learns the system dynamics through online gradient-based weight tuning laws, while the RISE term robustly accounts for the function reconstruction errors, guaranteeing asymptotic estimation of the state and the state derivative. The online estimation of the state derivative allows the actor–critic–identifier architecture to be implemented without knowledge of the system drift dynamics; however, knowledge of the input gain matrix is required to implement the control policy. While the designs of the actor and critic are coupled through the HJB equation, the design of the identifier is decoupled from the actor–critic pair and can be considered a modular component of the actor–critic–identifier architecture. Convergence of the actor–critic–identifier-based algorithm and stability of the closed-loop system are analyzed using Lyapunov-based adaptive control methods, and a persistence of excitation (PE) condition is used to guarantee exponential convergence to a bounded region in the neighborhood of the optimal control and uniformly ultimately bounded (UUB) stability of the closed-loop system. The PE condition is equivalent to the exploration paradigm in RL (Sutton & Barto, 1998) and ensures adequate sampling of the system dynamics, which is required for convergence to the optimal policy.
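To fix ideas before the technical development, the following Python fragment sketches one update cycle of the architecture described above: the actor computes a control from its current weights, the critic evaluates a Bellman error using the identifier's estimate of the dynamics, and both weight vectors are adjusted. All names (aci_step, F_hat, grad_phi, etc.) and the update rules themselves are illustrative placeholders; the paper's exact gradient and least-squares laws are given in the sections that follow.

```python
import numpy as np

# Illustrative setup (not from the paper): a 2-state, 1-input system with a
# three-term polynomial basis phi(x) = [x1^2, x1*x2, x2^2]^T for the critic.
m = 1
R = np.eye(m)

def grad_phi(x):
    """Jacobian of the assumed basis phi(x), shape (N, n) = (3, 2)."""
    return np.array([[2.0 * x[0], 0.0],
                     [x[1], x[0]],
                     [0.0, 2.0 * x[1]]])

def Q(x):
    """Assumed state penalty Q(x) = x^T x."""
    return float(x @ x)

def aci_step(x, W_c, W_a, F_hat, g, eta_c=1.0, eta_a=0.5, nu=1.0):
    """One conceptual actor-critic update driven by an identifier estimate
    F_hat(x, u) of the closed-loop dynamics. Generic rules for illustration;
    not the paper's exact gradient/least-squares update laws."""
    dphi = grad_phi(x)
    # Actor: approximate optimal control from the current actor weights.
    u_hat = -0.5 * np.linalg.solve(R, g(x).T @ (dphi.T @ W_a))
    # Bellman error evaluated with the identifier's estimate of x_dot.
    omega = dphi @ F_hat(x, u_hat)                 # critic regressor
    delta_hjb = W_c @ omega + Q(x) + u_hat @ R @ u_hat
    # Normalized-gradient critic update and a simple actor-toward-critic rule.
    W_c_dot = -eta_c * omega * delta_hjb / (1.0 + nu * (omega @ omega))
    W_a_dot = -eta_a * (W_a - W_c)
    return u_hat, W_c_dot, W_a_dot
```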

Section snippets

Actor–critic–identifier architecture for HJB approximation

Consider a continuous-time nonlinear system $\dot{x}=F(x,u)$, where $x(t)\in\mathcal{X}\subseteq\mathbb{R}^{n}$ is the state, $u(t)\in\mathcal{U}\subseteq\mathbb{R}^{m}$ is the control input, and $F:\mathcal{X}\times\mathcal{U}\to\mathbb{R}^{n}$ is Lipschitz continuous on $\mathcal{X}\times\mathcal{U}$ containing the origin, such that the solution $x(t)$ of the system is unique for any finite initial condition $x_{0}$ and control $u\in\mathcal{U}$. The optimal value function can be defined as $V^{*}(x(t))=\min_{u(\tau)\in\Psi(\mathcal{X}),\; t\le\tau<\infty}\int_{t}^{\infty}r(x(s),u(x(s)))\,ds$, where $\Psi(\mathcal{X})$ is the set of admissible policies and $r(x,u)\in\mathbb{R}$ is the immediate or local cost, defined as $r(x,u)=Q(x)+u^{T}Ru$, where $Q(x)\in\mathbb{R}$
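For reference, the standard optimality conditions underlying this formulation, written for the input-affine case $\dot{x}=f(x)+g(x)u$ with $r(x,u)=Q(x)+u^{T}Ru$ considered later in the paper, are:

```latex
% Hamilton-Jacobi-Bellman (HJB) equation and the resulting optimal control
% for \dot{x} = f(x) + g(x)u with local cost r(x,u) = Q(x) + u^T R u.
\begin{align}
  0 &= \min_{u\in\Psi(\mathcal{X})}\Big[\, Q(x) + u^{T}Ru
        + \frac{\partial V^{*}}{\partial x}\big(f(x)+g(x)u\big) \Big],
        \qquad V^{*}(0)=0,\\
  u^{*}(x) &= -\tfrac{1}{2}R^{-1}g^{T}(x)\Big(\frac{\partial V^{*}}{\partial x}\Big)^{T},
\end{align}
% and, after substituting u^*(x) back, the HJB in terms of V^* alone:
\begin{equation}
  0 = Q(x) + \frac{\partial V^{*}}{\partial x}f(x)
      - \tfrac{1}{4}\frac{\partial V^{*}}{\partial x}\,g(x)R^{-1}g^{T}(x)
        \Big(\frac{\partial V^{*}}{\partial x}\Big)^{T}.
\end{equation}
```

It is the solution of this last equation that the actor–critic–identifier architecture approximates online.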

Actor–critic design

Using Assumption 3 and (4), the optimal value function and the optimal control can be represented by NNs as $V^{*}(x)=W^{T}\phi(x)+\varepsilon_{v}(x)$ and $u^{*}(x)=-\frac{1}{2}R^{-1}g^{T}(x)\big(\phi'(x)^{T}W+\varepsilon_{v}'(x)^{T}\big)$, where $W\in\mathbb{R}^{N}$ are the unknown ideal NN weights, $N$ is the number of neurons, $\phi(x)\triangleq[\phi_{1}(x)\ \phi_{2}(x)\ \cdots\ \phi_{N}(x)]^{T}\in\mathbb{R}^{N}$ and $\phi'(x)\triangleq\frac{\partial\phi}{\partial x}\in\mathbb{R}^{N\times n}$, such that $\phi_{i}(0)=0$ and $\phi_{i}'(0)=0$ for all $i=1,\ldots,N$, and $\varepsilon_{v}(\cdot)\in\mathbb{R}$ is the function reconstruction error.
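In the ACI implementation, the critic and the actor maintain separate estimates $\hat{W}_{c}$ and $\hat{W}_{a}$ of the same ideal weights $W$. A schematic statement of the resulting approximations and of the Bellman error they are tuned to minimize is sketched below; this is a summary of the structure only, and the exact expressions appear in the paper's numbered equations.

```latex
% Critic and actor approximations with separate estimates of W:
\begin{align}
  \hat{V}(x) &= \hat{W}_{c}^{T}\phi(x),\\
  \hat{u}(x) &= -\tfrac{1}{2}R^{-1}g^{T}(x)\,\phi'(x)^{T}\hat{W}_{a},
\end{align}
% and the Bellman error obtained by substituting them into the HJB,
% with F_{\hat{u}} replaced online by the identifier's estimate:
\begin{equation}
  \delta_{hjb} = \hat{W}_{c}^{T}\,\phi'(x)\,F_{\hat{u}}(x,\hat{u})
                 + Q(x) + \hat{u}^{T}R\,\hat{u}.
\end{equation}
```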

Assumption 7

The NN activation functions $\{\phi_{i}(x):i=1,\ldots,N\}$ are selected so that, as $N\to\infty$, $\phi(x)$ provides a complete independent basis for $V^{*}(x)$.

Using

Identifier design

The following assumption is made for the identifier design:

Assumption 8

The control input is bounded, i.e. $u(t)\in\mathcal{L}_{\infty}$. Using Assumptions 2 and 5 and the projection algorithm in (19), this assumption holds for the control design $u(t)=\hat{u}(x)$ in (10).
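The projection algorithm in (19) is not reproduced in this preview. For context only, a standard projection operator from adaptive control that keeps a weight estimate $\hat{\theta}$ inside the ball $\{\hat{\theta}:\|\hat{\theta}\|\le\bar{\theta}\}$, and hence keeps the resulting control bounded, has the generic form below; the operator used in the paper may differ in detail.

```latex
% Generic projection of an update direction y onto the ball of radius \bar{\theta}:
\begin{equation}
  \operatorname{proj}(\hat{\theta},y)=
  \begin{cases}
    y, & \text{if } \|\hat{\theta}\|<\bar{\theta}\ \text{ or }\ \hat{\theta}^{T}y\le 0,\\[4pt]
    \Big(I-\dfrac{\hat{\theta}\hat{\theta}^{T}}{\hat{\theta}^{T}\hat{\theta}}\Big)y,
       & \text{otherwise},
  \end{cases}
\end{equation}
% so that \|\hat{\theta}(0)\| \le \bar{\theta} and
% \dot{\hat{\theta}} = \operatorname{proj}(\hat{\theta}, y)
% imply \|\hat{\theta}(t)\| \le \bar{\theta} for all t \ge 0.
```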

Using Assumption 3, the dynamic system in (3), with control $\hat{u}(x)$, can be represented using a multi-layer NN as $\dot{x}=F_{\hat{u}}(x,\hat{u})=W_{f}^{T}\sigma(V_{f}^{T}x)+\varepsilon_{f}(x)+g(x)\hat{u}$, where $W_{f}\in\mathbb{R}^{(L_{f}+1)\times n}$ and $V_{f}\in\mathbb{R}^{n\times L_{f}}$ are the unknown ideal NN weights, $\sigma_{f}\triangleq\sigma(V_{f}^{T}x)\in\mathbb{R}^{L_{f}+1}$ is the NN activation function,
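To make the identifier structure concrete, the following Python fragment sketches a single-hidden-layer estimate of the dynamics of the form above, together with generic gradient-style weight updates driven by the state estimation error. The RISE feedback term and the exact tuning laws of the paper are omitted (replaced here by simple proportional error feedback), and all names are illustrative.

```python
import numpy as np

def sigma(z):
    """Sigmoid activation with an appended bias element (L_f + 1 outputs)."""
    s = 1.0 / (1.0 + np.exp(-z))
    return np.concatenate(([1.0], s))

def identifier_step(x, x_hat, u_hat, W_f_hat, V_f_hat, g,
                    k=10.0, gamma_w=1.0, gamma_v=1.0):
    """One conceptual identifier update (a sketch, not the paper's laws).

    x_hat   : current state estimate
    W_f_hat : (L_f + 1, n) outer-layer weight estimate
    V_f_hat : (n, L_f) inner-layer weight estimate
    g       : known input-gain function g(x)
    Returns the state-estimate derivative and the weight-estimate derivatives.
    """
    x_tilde = x - x_hat                      # state estimation error
    sig = sigma(V_f_hat.T @ x_hat)           # hidden-layer output, (L_f + 1,)
    # Estimated dynamics plus simple error feedback in place of the RISE term.
    x_hat_dot = W_f_hat.T @ sig + g(x) @ u_hat + k * x_tilde
    # Generic gradient-style weight updates driven by the estimation error.
    W_f_dot = gamma_w * np.outer(sig, x_tilde)
    V_f_dot = gamma_v * np.outer(
        x_hat, (W_f_hat[1:] @ x_tilde) * sig[1:] * (1.0 - sig[1:]))
    return x_hat_dot, W_f_dot, V_f_dot
```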

Convergence and stability analysis

The unmeasurable form of the Bellman error can be written using (5), (6), (7), (8), (11), as $\delta_{hjb}=\hat{W}_{c}^{T}\omega-W^{T}\phi' F_{u^{*}}+\hat{u}^{T}R\hat{u}-u^{*T}Ru^{*}-\varepsilon_{v}'F_{u^{*}}=-\tilde{W}_{c}^{T}\omega-W^{T}\phi'\tilde{F}_{\hat{u}}+\frac{1}{4}\tilde{W}_{a}^{T}\phi' G\phi'^{T}\tilde{W}_{a}-\frac{1}{4}\varepsilon_{v}' G\varepsilon_{v}'^{T}-\varepsilon_{v}' F_{u^{*}}$, where (9), (10) are used. The dynamics of the critic weight estimation error $\tilde{W}_{c}(t)$ can now be developed by substituting (40) into (15), as $\dot{\tilde{W}}_{c}=-\eta_{c}\Gamma\psi\psi^{T}\tilde{W}_{c}+\frac{\eta_{c}\Gamma\omega}{1+\nu\omega^{T}\Gamma\omega}\Big[-W^{T}\phi'\tilde{F}_{\hat{u}}+\frac{1}{4}\tilde{W}_{a}^{T}\phi' G\phi'^{T}\tilde{W}_{a}-\frac{1}{4}\varepsilon_{v}' G\varepsilon_{v}'^{T}-\varepsilon_{v}' F_{u^{*}}\Big]$, where $\psi(t)\triangleq\frac{\omega(t)}{\sqrt{1+\nu\omega(t)^{T}\Gamma(t)\omega(t)}}\in\mathbb{R}^{N}$ is the normalized critic regressor vector, bounded as $\|\psi\|\le\frac{1}{\sqrt{\nu\varphi_{1}}}$,

Simulation

The following nonlinear system is considered (Vamvoudakis & Lewis, 2010): $\dot{x}=\begin{bmatrix}-x_{1}+x_{2}\\ -0.5x_{1}-0.5x_{2}\big(1-(\cos(2x_{1})+2)^{2}\big)\end{bmatrix}+\begin{bmatrix}0\\ \cos(2x_{1})+2\end{bmatrix}u$, where $x(t)\triangleq[x_{1}(t)\ x_{2}(t)]^{T}\in\mathbb{R}^{2}$ and $u(t)\in\mathbb{R}$. The state and control penalties are chosen as $Q(x)=x^{T}\begin{bmatrix}1&0\\0&1\end{bmatrix}x$ and $R=1$. The optimal value function and optimal control for the system in (54) are known, and are given by (Vamvoudakis & Lewis, 2010) $V^{*}(x)=\frac{1}{2}x_{1}^{2}+x_{2}^{2}$ and $u^{*}(x)=-(\cos(2x_{1})+2)x_{2}$, which can be used to find the optimal weights $W=[0.5\ 0\ 1]^{T}$. The activation function for the critic NN is
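Since the optimal value function and control for this benchmark are known in closed form, the example can be checked independently. The short script below (a sketch assuming SciPy is available; all names are illustrative) simulates the system under the known optimal policy $u^{*}(x)=-(\cos(2x_{1})+2)x_{2}$ and verifies that the state decays and that $V^{*}$ is non-increasing along the trajectory, a useful sanity check before running the full actor–critic–identifier algorithm.

```python
import numpy as np
from scipy.integrate import solve_ivp

def f(x):
    """Drift dynamics of the Vamvoudakis-Lewis benchmark."""
    x1, x2 = x
    return np.array([-x1 + x2,
                     -0.5 * x1 - 0.5 * x2 * (1.0 - (np.cos(2.0 * x1) + 2.0) ** 2)])

def g(x):
    """Input-gain vector [0, cos(2*x1) + 2]^T."""
    return np.array([0.0, np.cos(2.0 * x[0]) + 2.0])

def u_star(x):
    """Known optimal control u*(x) = -(cos(2*x1) + 2) * x2."""
    return -(np.cos(2.0 * x[0]) + 2.0) * x[1]

def V_star(x):
    """Known optimal value function V*(x) = 0.5*x1^2 + x2^2."""
    return 0.5 * x[0] ** 2 + x[1] ** 2

def closed_loop(t, x):
    return f(x) + g(x) * u_star(x)

sol = solve_ivp(closed_loop, (0.0, 10.0), np.array([1.0, -1.0]), max_step=0.01)
print("final state:", sol.y[:, -1])            # should be close to the origin
V_traj = [V_star(sol.y[:, k]) for k in range(sol.y.shape[1])]
print("V* non-increasing along trajectory:", np.all(np.diff(V_traj) <= 1e-6))
```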

Conclusion

An actor–critic–identifier architecture is proposed to learn the approximate solution to the HJB equation for infinite-horizon optimal control of uncertain nonlinear systems. The online method is the first ever indirect adaptive control approach to continuous-time RL. The learning by the actor, critic and identifier is continuous and simultaneous, and the novel addition of the identifier to the traditional actor–critic architecture eliminates the need to know the system drift dynamics. The

References (47)

  • D. Bertsekas et al. Neuro-dynamic programming (1996)
  • S. Bradtke et al. Adaptive linear quadratic control using policy iteration
  • L. Busoniu et al. Reinforcement learning and dynamic programming using function approximators (2010)
  • F.H. Clarke. Optimization and nonsmooth analysis (1990)
  • G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (1989)
  • W.E. Dixon et al. Nonlinear control of engineering systems: a Lyapunov-based approach (2003)
  • K. Doya. Reinforcement learning in continuous time and space. Neural Computation (2000)
  • S. Ferrari, R. Stengel. An adaptive critic global controller. In Proc. Am. control conf. (2002), vol....
  • A. Filippov. Differential equations with discontinuous right-hand side. American Mathematical Society Translations (1964)
  • A.F. Filippov. Differential equations with discontinuous right-hand sides (1988)
  • P. He et al. Reinforcement learning neural-network-based controller for nonlinear discrete-time systems with input constraints. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics (2007)
  • J. Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences of the United States of America (1984)

    Multilayer feedforward networks are universal approximators

    Neural Networks

    (1985)

    Shubhendu Bhasin received his Ph.D. in 2011 from the Department of Mechanical and Aerospace Engineering at the University of Florida. He is currently Assistant Professor in the Department of Electrical Engineering at the Indian Institute of Technology, Delhi. His research interests include reinforcement learning-based feedback control, approximate dynamic programming, neural network-based control, nonlinear system identification and parameter estimation, and robust and adaptive control of uncertain nonlinear systems.

    Rushikesh Kamalapurkar received his Bachelor’s degree in mechanical engineering from Visvesvaraya National Institute of Technology, India in 2007 and his Master’s from the University of Florida in 2011. He is currently a Ph.D. student with the Nonlinear Control and Robotics group at the University of Florida. His research interests include the applications of reinforcement learning to feedback control of uncertain nonlinear systems, and differential game-based distributed control of multiple autonomous agents.

    Marcus Johnson received his Ph.D. in 2011 from the Department of Mechanical and Aerospace Engineering at the University of Florida. He is currently working as a Research Aerospace Engineer at NASA Ames Research Center and his main research interest is the development of Lyapunov-based proofs for optimality of nonlinear adaptive systems.

    Kyriakos G. Vamvoudakis was born in Athens, Greece. He received the Diploma (5-year degree) in electronic and computer engineering from the Technical University of Crete, Greece in 2006 with highest honors, and the M.Sc. and Ph.D. degrees in electrical engineering from The University of Texas at Arlington in 2008 and 2011, respectively. From May 2011 to January 2012, he was working as an Adjunct Professor and Faculty Research Associate at The University of Texas at Arlington and at the Automation and Robotics Research Institute. He is currently working as a Project Research Scientist at the Center of Control, Dynamical Systems and Computation (CCDC) at the University of California, Santa Barbara. He is coauthor of 6 book chapters, 40 technical publications, and the book Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles. His research interests include approximate dynamic programming, game theory, neural network feedback control, and optimal control. Recently, his research has focused on network security and multi-agent optimization. He is a member of Tau Beta Pi, Eta Kappa Nu and Golden Key honor societies and is listed in Who’s Who in the World, Who’s Who in Science and Engineering, and Who’s Who in America. He received the Best Paper Award for Autonomous/Unmanned Vehicles at the 27th Army Science Conference in 2010, the Best Presentation Award at the World Congress of Computational Intelligence in 2010, and the Best Researcher Award, UTA Automation & Robotics Research Institute in 2011. He has organized special sessions for several international conferences. Dr. Vamvoudakis is a registered Electrical/Computer engineer (PE) and member of the Technical Chamber of Greece.

    F.L. Lewis, IEEE Fellow, IFAC Fellow, Fellow Inst. Measurement & Control, PE Texas, U.K. Chartered Engineer. Distinguished Scholar Professor and Moncrief–O’Donnell Chair at The University of Texas at Arlington. Works in feedback control and intelligent systems. Author of 6 US patents, books, and several journal papers. Awards include the Fulbright Research Award, NSF Research Initiation Grant, ASEE Terman Award, and the International Neural Network Society Gabor Award and Neural Network Pioneer Award. Selected as Engineer of the Year by the Ft. Worth IEEE Section.

    Warren Dixon received his Ph.D. in 2000 from the Department of Electrical and Computer Engineering from Clemson University. After completing his doctoral studies he was selected as an Eugene P. Wigner Fellow at Oak Ridge National Laboratory (ORNL). In 2004, Dr. Dixon joined the faculty of the University of Florida in the Mech. and Aero. Eng. Dept. His research focus is the development and application of Lyapunov-based control techniques for uncertain nonlinear systems. He has published 3 books, an edited collection, 9 chapters, and over 250 refereed journal and conference papers. His work has been recognized by the 2011 American Society of Mechanical Engineers (ASME) Dynamics Systems and Control Division Outstanding Young Investigator Award, 2009 American Automatic Control Council (AACC) O. Hugo Schuck Award, 2006 IEEE Robotics and Automation Society (RAS) Early Academic Career Award, an NSF CAREER Award (2006–2011), 2004 DOE Outstanding Mentor Award, and the 2001 ORNL Early Career Award for Engineering Achievement. Dr. Dixon is a senior member of IEEE. He serves or has served as a member of numerous technical, conference program, and organizing committees. He served as an appointed member to the IEEE CSS Board of Governors (BoG) in 2008, and now serves as the Director of Operations for the Executive Committee of the BoG. He is currently or formerly an associate editor for ASME Journal of Dynamic Systems, Measurement and Control, Automatica, IEEE Transactions on Systems Man and Cybernetics: Part B Cybernetics, and the International Journal of Robust and Nonlinear Control.

    This research is supported in part by the NSF CAREER award no. 0547448, NSF award no. 0901491, and the Department of Energy, grant no. DE-FG04-86NE37967 as part of the DOE University Research Program in Robotics (URPR). The material in this paper was not presented at any conference. This paper was recommended for publication in revised form by Associate Editor Raul Ordonez under the direction of Editor Miroslav Krstic.

