Global optimality of approximate dynamic programming and its use in non-convex function minimization
Graphical abstract
Level curves of the Rosenbrock function subject to minimization and state trajectories for different initial conditions x0 ∈ {−2, −1, 0, 1, 2} × {−2, −1, 0, 1, 2}. The red plus signs denote the initial point of the respective trajectory.
Introduction
In the last two decades, approximate dynamic programming (ADP) has shown great promise in solving optimal control problems with neural networks (NN) [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]. In the ADP framework, solutions are obtained using a two-network synthesis called adaptive critics (ACs) [2], [3], [4]. In the heuristic dynamic programming (HDP) approach with ACs, one network, called the ‘critic’ network, maps the input states to the cost-to-go, while another network, called the ‘action’ network, outputs the control with the states of the system as its inputs [4], [5]. In the dual heuristic programming (DHP) formulation, the action network remains the same as in HDP, but the critic network outputs the costates with the current states as inputs [2], [6], [7]. The computationally efficient single network adaptive critic (SNAC) architecture consists of only one network. In [8], the action network was eliminated in a DHP-type formulation, with the control calculated from the costate values. Similarly, the J-SNAC [9] eliminates the need for the action network in an HDP scheme. Note that the developments in [1], [2], [3], [4], [5], [6], [7], [8], [9] are for infinite-horizon problems.
The use of ADP for solving finite-horizon optimal control problems was considered in [10], [11], [12], [13], [14], [15]. The authors of [10] developed a time-varying neurocontroller for solving a scalar problem with state constraints. In [11], a single NN with a single set of weights was proposed that takes the time-to-go as an input along with the states and generates the fixed-final-time optimal control for discrete-time nonlinear multi-variable systems. An HDP-based scheme for optimal control problems with soft and hard terminal constraints was presented in [12]. Finite-horizon problems with unspecified terminal times were considered in [13], [14]. For an extensive literature on adaptive critic based methods, the reader is referred to [16] and the references in its chapters.
Despite much published literature on adaptive critics, an open question remains about the nature of optimality of the adaptive critic based results: are they locally or globally optimal? A major contribution of this study is proving that the AC based solutions are globally optimal subject to the assumed basis functions. To support the proof, the ADP based algorithm for solving fixed-final-time problems developed in [11], [12] is revisited first. After describing the algorithm, a novel analysis of global optimality of the result is presented. It is shown that, for any cost function with a quadratic control-penalizing term, the resulting cost-to-go function (sometimes called the value function) is convex with respect to the control at the current time if the sampling time used for discretization of the original continuous-time system is small enough; hence, the first-order necessary optimality condition [17] leads to the globally optimal control. The second major contribution of this paper is showing that ADP can be used for function optimization, specifically, optimization of non-convex functions. Finally, through numerical simulations, two examples with varying complexities are presented and the performance of the proposed method is investigated. It is shown that, unlike gradient-based methods, starting from any initial guess of the minimizer and updating the guess using the control generated by the actor, the states move directly toward the global minimum, bypassing any local minima along the path.
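The small-sampling-time convexity claim can be checked numerically. The sketch below (an illustration constructed here, not the paper's example: the surrogate cost-to-go V, dynamics f, g, and penalties Q, R are all assumed) evaluates the one-step cost Δt·(Q(x) + Ru²) + V(x + Δt·(f(x) + g(x)u)) over a grid of controls and tests its discrete convexity: for small Δt the quadratic control penalty dominates, while for large Δt the non-convexity of V reappears.

```python
import numpy as np

# Illustrative non-convex surrogate cost-to-go and scalar dynamics
# (all chosen here for demonstration, not taken from the paper).
V = lambda x: np.cos(3 * x)   # non-convex cost-to-go surrogate
f = lambda x: 0.0             # drift
g = lambda x: 1.0             # input gain
Q = lambda x: x ** 2          # state penalty
R = 0.1                       # control penalty weight

def one_step_cost(u, x, dt):
    """Cost of applying control u at state x for one sampling interval dt."""
    return dt * (Q(x) + R * u ** 2) + V(x + dt * (f(x) + g(x) * u))

def is_convex_on_grid(x, dt, us=np.linspace(-5, 5, 201)):
    """Check convexity in u via non-negative discrete second differences."""
    vals = one_step_cost(us, x, dt)
    second_diff = vals[2:] - 2 * vals[1:-1] + vals[:-2]
    return bool(np.all(second_diff >= -1e-12))
```

With Δt = 0.01 the one-step cost is convex in u for this example; with Δt = 1.0 the oscillation of V makes it non-convex, matching the role of the sampling time in the theorem.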
The rest of this paper is organized as follows: The problem formulation is given in the next section, followed by Section ‘Approximate dynamic programming based solution’. Afterwards, the supporting theorems and analyses are presented. The use of the method in static function optimization is discussed next and followed by some conclusions.
Problem formulation
Let the control-affine dynamics of the system be given by

ẋ(t) = f(x(t)) + g(x(t))u(t),

where f : ℝⁿ → ℝⁿ and g : ℝⁿ → ℝⁿˣᵐ. The state and control vectors are denoted with x ∈ ℝⁿ and u ∈ ℝᵐ, respectively, where the positive integers n and m denote the dimensions of the respective vectors. The selected cost function J is fairly general but quadratic in the control:

J = ψ(x(t_f)) + ∫₀^{t_f} ( Q(x(t)) + u(t)ᵀRu(t) ) dt,

where the positive semi-definite smooth functions ψ(·) and Q(·) penalize the states and the positive definite matrix R
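As a concrete numerical instance of this formulation, the following sketch Euler-discretizes the control-affine dynamics and accumulates the quadratic-in-control cost. The particular f, g, Q, ψ, R, and sampling time are illustrative choices made here, not the paper's examples.

```python
import numpy as np

# Euler discretization of the control-affine dynamics x' = f(x) + g(x) u:
# x_{k+1} = x_k + dt * (f(x_k) + g(x_k) u_k).
def step(x, u, f, g, dt):
    return x + dt * (f(x) + g(x) @ u)

# Discretized cost quadratic in the control:
# J = psi(x_N) + sum_k dt * (Q(x_k) + u_k^T R u_k).
def cost(x0, controls, f, g, Q, psi, R, dt):
    x, J = x0, 0.0
    for u in controls:
        J += dt * (Q(x) + u @ R @ u)
        x = step(x, u, f, g, dt)
    return J + psi(x)

# Illustrative two-state example (all functions chosen here):
f = lambda x: -x                 # stable linear drift
g = lambda x: np.eye(2)          # identity input matrix
Q = lambda x: float(x @ x)       # quadratic state penalty
psi = lambda x: float(x @ x)     # quadratic terminal penalty
R = 0.5 * np.eye(2)              # positive definite control penalty
```

For instance, from x0 = (1, 0) with two zero-control steps and dt = 0.1, the state decays geometrically by the factor (1 − dt) per step and the cost is dominated by the terminal penalty.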
Approximate dynamic programming based solution
In this section, an ADP scheme called AC is used for solving the fixed-final-time optimal control problem in terms of the network weights and selected basis functions. The method is adopted from [11], [12]. In this scheme, two networks called critic and actor are trained to approximate the optimal cost-to-go and the optimal control, respectively. It should be noted that the optimal cost-to-go, which represents the incurred cost if optimal decisions are made from the current time to the final
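A minimal numerical sketch of this critic/actor idea for a fixed-final-time problem is given below, with a linear-in-weights critic V_k(x) ≈ w[k]·φ(x). The scalar system, penalties, basis function, and grids are illustrative choices made here (the paper trains neural networks on multi-variable problems); the structure — actor step minimizing the stage cost plus the critic's estimate of the next cost-to-go, critic step fitting the Bellman targets by least squares, backward in time — mirrors the scheme described above.

```python
import numpy as np

# Illustrative scalar fixed-final-time problem (all choices made here).
dt, N = 0.05, 40
f = lambda x: -0.5 * x            # drift of x' = f(x) + g(x) u
g = lambda x: 1.0                 # input gain
Q = lambda x: x ** 2              # state penalty
psi = lambda x: x ** 2            # terminal penalty
R = 0.1                           # control penalty weight
phi = lambda x: x ** 2            # critic basis function

xs = np.linspace(-2.0, 2.0, 41)       # training states
u_grid = np.linspace(-5.0, 5.0, 201)  # candidate controls for the actor step
w = np.zeros(N + 1)
w[N] = 1.0                            # V_N(x) = psi(x) = 1 * phi(x)

for k in range(N - 1, -1, -1):
    targets = []
    for x in xs:
        # "actor": minimize stage cost plus critic estimate of next cost-to-go
        x_next = x + dt * (f(x) + g(x) * u_grid)
        vals = dt * (Q(x) + R * u_grid ** 2) + w[k + 1] * phi(x_next)
        targets.append(vals.min())
    # "critic": least-squares fit of w[k] to the Bellman targets
    Phi = phi(xs)
    w[k] = float(Phi @ np.array(targets) / (Phi @ Phi))
```

For this linear-quadratic instance the backward recursion settles near the steady-state Riccati value (≈ 0.29 for these coefficients), well below the terminal weight of 1, as expected for a stabilizing drift.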
A. Convergence analysis
Theorem 1 The iterative relation given by Eq. (10), with any initial guess, converges, provided the sampling time Δt selected for discretization of the continuous-time dynamics (1) is small enough. Proof Let the right-hand side of (10) be denoted with a function where
The proof is complete if it is shown that the successive-approximation relation is a contraction mapping [22]. Since the underlying space, with the 2-norm denoted by ‖·‖, is a Banach
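The convergence argument rests on the Banach fixed-point theorem: if a map F satisfies ‖F(a) − F(b)‖ ≤ L‖a − b‖ with L < 1 on a Banach space, then successive approximation converges to the unique fixed point from any initial guess. A toy scalar illustration (the map F is chosen here and is unrelated to the paper's Eq. (10)):

```python
import numpy as np

def successive_approximation(F, x0, tol=1e-12, max_iter=1000):
    """Iterate x <- F(x) until the update is below tol."""
    x = x0
    for _ in range(max_iter):
        x_new = F(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# F has Lipschitz constant 0.5 < 1, so it is a contraction on the reals.
F = lambda x: 0.5 * np.cos(x)
fp_from_zero = successive_approximation(F, 0.0)
fp_from_far = successive_approximation(F, 100.0)
```

Both initial guesses reach the same fixed point, mirroring the "any initial guess" claim of Theorem 1.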
Non-convex function optimization
One of the applications of the global optimality results given in this study is using ADP for finding the global optimum of smooth but possibly non-convex functions. In other words, the ‘optimal control’ tool can be used for ‘convex or non-convex function optimization’. Considering nonlinear programming based optimization methods [17], [18], for optimizing a function ψ(x), one selects an initial guess, denoted with x0, and uses the update rule

x_{k+1} = x_k − α∇ψ(x_k),

where α > 0 is the update rate. Parameter
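The limitation of this first-order update on non-convex functions is easy to demonstrate: from different initial guesses it converges to different stationary points. The scalar function below is chosen here for illustration; it has a global minimum near x ≈ −1.30 and a local minimum near x ≈ 1.13.

```python
# Gradient descent x_{k+1} = x_k - alpha * dpsi(x_k) on a non-convex
# function (function and step size are illustrative choices).
psi = lambda x: x ** 4 - 3 * x ** 2 + x       # two minima, one global
dpsi = lambda x: 4 * x ** 3 - 6 * x + 1       # its derivative

def gradient_descent(x0, alpha=0.01, iters=2000):
    x = x0
    for _ in range(iters):
        x = x - alpha * dpsi(x)
    return x

x_right = gradient_descent(2.0)    # trapped in the local minimum
x_left = gradient_descent(-2.0)    # reaches the global minimum
```

The result depends entirely on the initial guess, which is the sensitivity the ADP-based scheme of this paper is shown to avoid.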
Numerical analysis
In order to numerically analyze the global optimality of the ADP scheme and its use in static minimization, a simple approach is to select a cost function with a non-convex terminal cost term and evaluate the performance of the ADP in finding the global optimum. To this end, two separate examples are selected: a single-variable example and a multi-variable benchmark example, namely the Rosenbrock (banana) function. The source code for the numerical analysis can be found at [33].
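For reference, the Rosenbrock function with the standard coefficients a = 1 and b = 100 (assumed here) has its global minimum at (1, 1) with value 0, lying in a long, curved, flat valley that makes it a classic non-convex benchmark:

```python
# Two-dimensional Rosenbrock ("banana") function with standard coefficients.
def rosenbrock(x, y, a=1.0, b=100.0):
    return (a - x) ** 2 + b * (y - x ** 2) ** 2
```

The level curves of this function are the ones shown in the graphical abstract, with trajectories started from the grid of initial conditions x0 ∈ {−2, −1, 0, 1, 2} × {−2, −1, 0, 1, 2}.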
Conclusions
The performance of approximate dynamic programming in finding the globally optimal solution to the fixed-final-time control problem was investigated. A sufficient condition for global optimality of the result, regardless of the convexity or non-convexity of the functions representing the dynamics of the system or the state penalizing terms in the cost function, was derived. Moreover, an approach was presented for converting a static function optimization problem into an optimal control problem and using ADP for
References (33)
- et al., Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence, Neural Netw. (2009)
- et al., A single network adaptive critic (SNAC) architecture for optimal control synthesis for a class of nonlinear systems, Neural Netw. (2006)
- et al., Fixed-final-time optimal control of nonlinear systems with terminal constraints, Neural Netw. (2013)
- et al., Fixed-final-time optimal tracking control of input-affine nonlinear systems, Neurocomputing (2014)
- et al., Approximations of functions by a multilayer perceptron: a new approach, Neural Netw. (1997)
- et al., Evolutionary-based techniques for real-life optimisation: development and testing, Appl. Soft Comput. (2002)
- et al., Convergence of nomadic genetic algorithm on benchmark mathematical functions, Appl. Soft Comput. (2013)
- Approximate dynamic programming for real-time control and neural modeling
- et al., Adaptive-critic based neural networks for aircraft optimal control, J. Guidance Control Dyn. (1996)
- et al., Adaptive critic designs, IEEE Trans. Neural Netw. (1997)
- Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof, IEEE Trans. Syst. Man Cybern. B
- Online adaptive critic flight control, J. Guidance Control Dyn.
- Comparison of heuristic dynamic programming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator, IEEE Trans. Neural Netw.
- An online nonlinear optimal controller synthesis for aircraft with model uncertainties
- State-constrained agile missile control with adaptive-critic-based neural networks, IEEE Trans. Control Syst. Technol.
- Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics, IEEE Trans. Neural Netw. Learn. Syst.