Applied Soft Computing

Volume 24, November 2014, Pages 291-303

Global optimality of approximate dynamic programming and its use in non-convex function minimization

https://doi.org/10.1016/j.asoc.2014.07.003

Abstract

This study investigates the global optimality of approximate dynamic programming (ADP) based solutions using neural networks for optimal control problems with fixed final time. Issues including whether or not the cost function terms and the system dynamics need to be convex functions with respect to their respective inputs are discussed and sufficient conditions for global optimality of the result are derived. Next, a new idea is presented to use ADP with neural networks for optimization of non-convex smooth functions. It is shown that any initial guess leads to direct movement toward the proximity of the global optimum of the function. This behavior is in contrast with gradient based optimization methods in which the movement is guided by the shape of the local level curves. Illustrative examples are provided with single and multi-variable functions that demonstrate the potential of the proposed method.

Graphical abstract

Level curves of the Rosenbrock function subject to minimization and state trajectories for different initial conditions $x_0 \in \{-2, -1, 0, 1, 2\} \times \{-2, -1, 0, 1, 2\}$. The red plus signs denote the initial point of the respective trajectory.


Introduction

In the last two decades, approximate dynamic programming (ADP) has shown great promise in solving optimal control problems with neural networks (NN) [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]. In the ADP framework, the solutions are obtained using a two-network synthesis called adaptive critics (ACs) [2], [3], [4]. In the heuristic dynamic programming (HDP) approach with ACs, one network, called the ‘critic’ network, maps the input states to the cost-to-go, and another network, called the ‘action’ network, outputs the control with the states of the system as its inputs [4], [5]. In the dual heuristic programming (DHP) formulation, the action network remains the same as in HDP; however, the critic network outputs the costates with the current states as inputs [2], [6], [7]. The computationally efficient single network adaptive critic (SNAC) architecture consists of one network only. In [8], the action network was eliminated in a DHP type formulation, with the control being calculated from the costate values. Similarly, the J-SNAC [9] eliminates the need for the action network in an HDP scheme. Note that the developments in [1], [2], [3], [4], [5], [6], [7], [8], [9] are for infinite-horizon problems.

The use of ADP for solving finite-horizon optimal control problems was considered in [10], [11], [12], [13], [14], [15]. The authors of [10] developed a time-varying neurocontroller for solving a scalar problem with state constraints. In [11], a single NN with a single set of weights was proposed, which takes the time-to-go as an input along with the states and generates the fixed-final-time optimal control for discrete-time nonlinear multi-variable systems. An HDP based scheme for optimal control problems with soft and hard terminal constraints was presented in [12]. Finite-horizon problems with unspecified terminal times were considered in [13], [14]. For an extensive literature on adaptive critic based methods, the reader is referred to [16] and the references therein.

Despite much published literature on adaptive critics, there still exists an open question about the nature of optimality of the adaptive critic based results: are they locally or globally optimal? A major contribution of this study is proving that the AC based solutions are globally optimal subject to the assumed basis functions. To help with the development of the proof, the ADP based algorithm for solving fixed-final-time problems developed in [11], [12] is revisited first. After describing the algorithm, a novel analysis of global optimality of the result is presented. It is shown that, for any cost function with a quadratic control penalizing term, the resulting cost-to-go function (sometimes called the value function) is convex with respect to the control at the current time if the sampling time used for discretization of the original continuous-time system is small enough; hence, the first order necessary optimality condition [17] leads to the global optimal control. The second major contribution of this paper is showing that ADP can be used for function optimization, specifically, optimization of non-convex functions. Finally, through numerical simulations, two examples with varying complexities are presented and the performance of the proposed method is investigated. It is shown that, in contrast with gradient based methods, selecting any initial guess for the minimum and updating the guess using the control resulting from the actor moves the states directly toward the global minimum, bypassing any local minima along the path.

The rest of this paper is organized as follows: The problem formulation is given in the next section, followed by Section ‘Approximate dynamic programming based solution’. Afterwards, the supporting theorems and analyses are presented. The use of the method in static function optimization is discussed next and followed by some conclusions.

Section snippets

Problem formulation

Let the control-affine dynamics of the system be given by

$$\dot{x}(t) = f(x(t)) + g(x(t))\,u(t)$$

where $f:\mathbb{R}^n \to \mathbb{R}^n$ and $g:\mathbb{R}^n \to \mathbb{R}^{n \times m}$. The state and control vectors are denoted by $x \in \mathbb{R}^n$ and $u \in \mathbb{R}^m$, respectively, where the positive integers $n$ and $m$ denote the dimensions of the respective vectors. The selected cost function $J$ is fairly general but quadratic in the control:

$$J = \psi\big(x(t_f)\big) + \int_{t_0}^{t_f} \Big( Q\big(x(t)\big) + u(t)^T R\, u(t) \Big)\, dt$$

where the positive semi-definite smooth functions $Q:\mathbb{R}^n \to \mathbb{R}_+$ and $\psi:\mathbb{R}^n \to \mathbb{R}_+$ penalize the states, and the positive definite matrix $R \in \mathbb{R}^{m \times m}$ penalizes the control effort.
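As a concrete illustration of this formulation, the following minimal sketch Euler-discretizes the dynamics with a sampling time `dt` and accumulates the discrete analogue of the cost. The callables `f`, `g`, `Q`, `psi`, the matrix `R`, and `dt` are all placeholder assumptions for illustration, not objects defined in the paper.

```python
import numpy as np

def rollout_cost(f, g, Q, psi, R, x0, controls, dt):
    """Euler-discretize x_dot = f(x) + g(x) u and accumulate
    J = psi(x_N) + sum_k (Q(x_k) + u_k^T R u_k) * dt."""
    x, J = np.asarray(x0, dtype=float), 0.0
    for u in controls:                   # control sequence u_0, ..., u_{N-1}
        J += (Q(x) + u @ R @ u) * dt     # running cost over one sample
        x = x + (f(x) + g(x) @ u) * dt   # Euler step toward x_{k+1}
    return J + psi(x)                    # terminal penalty at t_f
```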

Approximate dynamic programming based solution

In this section, an ADP scheme called AC is used for solving the fixed-final-time optimal control problem in terms of the network weights and selected basis functions. The method is adopted from [11], [12]. In this scheme, two networks called the critic and the actor are trained to approximate the optimal cost-to-go and the optimal control, respectively. It should be noted that the optimal cost-to-go, which represents the incurred cost if optimal decisions are made from the current time to the final time, depends on both the current state and the time-to-go in finite-horizon problems.
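A rough sketch of the linear-in-the-weights network structure commonly used in such AC schemes follows; the polynomial basis and the batch least-squares fit below are illustrative assumptions, not necessarily the exact networks or training procedure of [11], [12].

```python
import numpy as np

def fit_linear_network(basis, inputs, targets):
    """Least-squares fit of a network of the form y = W^T basis(x),
    the structure typically used for both the critic (cost-to-go)
    and the actor (control) in adaptive critic schemes."""
    Phi = np.stack([basis(x) for x in inputs])   # N x n_basis regression matrix
    W, *_ = np.linalg.lstsq(Phi, np.asarray(targets), rcond=None)
    return W                                     # weights mapping basis(x) -> output

# Hypothetical polynomial basis for a scalar state:
basis = lambda x: np.array([1.0, x, x**2, x**3])
```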

A. Convergence analysis

Theorem 1

The iterative relation given by Eq. (10), with any initial guess $u_k^0 \in \mathbb{R}^m$, $\forall k \in K$, converges, provided that the sampling time $\Delta t$ selected for discretization of the continuous-time dynamics (1) is small enough.

Proof

Let the right hand side of (10) be denoted by the function $F:\mathbb{R}^m \to \mathbb{R}^m$, where

$$F(u) = -\tfrac{1}{2}\,\bar{R}^{-1}\,\bar{g}(x_k)^T\, \nabla J^*_{k+1}\big(\bar{f}(x_k) + \bar{g}(x_k)\,u\big)$$

The proof is complete if it is shown that the relation given by the successive approximation

$$u^{i+1} = F(u^i)$$

is a contraction mapping [22]. Since $\mathbb{R}^m$ with the 2-norm, denoted by $\|\cdot\|$, is a Banach space, it suffices to show that $F$ is Lipschitz continuous with a Lipschitz constant smaller than one.
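The successive approximation in the proof amounts to a plain fixed-point iteration, sketched below. `F` is any callable implementing the right hand side of (10); the commented line shows its assumed form, with `Rbar_inv`, `g_bar`, `f_bar`, and `grad_J` as placeholder names for the paper's discretized quantities.

```python
import numpy as np

def successive_approximation(F, u0, tol=1e-10, max_iter=1000):
    """Iterate u_{i+1} = F(u_i); Theorem 1 guarantees this is a
    contraction (hence convergent) for a small enough sampling time."""
    u = np.asarray(u0, dtype=float)
    for _ in range(max_iter):
        u_next = F(u)
        if np.linalg.norm(u_next - u) < tol:   # fixed point reached
            return u_next
        u = u_next
    return u

# Assumed form of F from the proof (all names are placeholders):
# F = lambda u: -0.5 * Rbar_inv @ g_bar(x_k).T @ grad_J(f_bar(x_k) + g_bar(x_k) @ u)
```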

Non-convex function optimization

One of the applications of the global optimality results given in this study is using ADP for finding the global optimum of smooth but possibly non-convex functions. In other words, the ‘optimal control’ tool can be used for ‘convex or non-convex function optimization’. Considering nonlinear programming based optimization methods [17], [18], for optimizing a function $\psi(x)$ one selects an initial guess, denoted by $x_0$, and uses the update rule

$$x_{k+1} = x_k + \tau u_k$$

where $\tau \in \mathbb{R}\setminus\{0\}$ is the update rate and the vector $u_k$ determines the search direction; in gradient based methods, $u_k$ is the negative of the gradient of $\psi$ at $x_k$, so the movement is dictated by the shape of the local level curves.
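In code, the update rule reads as follows. The `actor` callable is a hypothetical stand-in for the trained action network; the time index `k` is passed in because the finite-horizon control is time-varying.

```python
def minimize_via_adp(actor, x0, tau, n_steps):
    """Minimize psi by treating it as the terminal cost of a
    fixed-final-time control problem: x_{k+1} = x_k + tau * u_k,
    where u_k comes from the trained actor rather than a gradient."""
    x = x0
    for k in range(n_steps):
        x = x + tau * actor(x, k)   # actor supplies the search direction u_k
    return x
```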

Numerical analysis

In order to numerically analyze the global optimality of the ADP scheme and its use in static minimization, a simple approach is to select a cost function with a non-convex terminal cost term and evaluate the performance of the ADP in finding the global optimum. To this end, two separate examples are selected: a single-variable example and a multi-variable benchmark, namely the Rosenbrock (banana) function. The source code for the numerical analysis can be found at [33].
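For reference, the multi-variable benchmark is the Rosenbrock (banana) function, whose global minimum sits at (1, 1); the grid of initial conditions below reproduces the one shown in the graphical abstract.

```python
import numpy as np
from itertools import product

# Rosenbrock function; non-convex, with global minimum at (1, 1).
rosenbrock = lambda x: (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2

# Initial conditions x0 in {-2, -1, 0, 1, 2} x {-2, -1, 0, 1, 2},
# matching the trajectories in the graphical abstract.
initial_conditions = [np.array(p, dtype=float)
                      for p in product(range(-2, 3), repeat=2)]
```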

Conclusions

The performance of approximate dynamic programming in finding the global optimal solution to the fixed-final-time control problem was investigated. A sufficient condition for global optimality of the result, regardless of the convexity or non-convexity of the functions representing the dynamics of the system or the state penalizing terms in the cost function, was derived. Moreover, an idea was presented for converting a static function optimization problem to an optimal control problem and using ADP for solving it.

References (33)

  • A. Al-Tamimi et al., Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof, IEEE Trans. Syst. Man Cybern. B (2008).

  • S. Ferrari et al., Online adaptive critic flight control, J. Guidance Control Dyn. (2004).

  • G.K. Venayagamoorthy et al., Comparison of heuristic dynamic programming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator, IEEE Trans. Neural Netw. (2002).

  • J. Ding et al., An online nonlinear optimal controller synthesis for aircraft with model uncertainties.

  • D. Han et al., State-constrained agile missile control with adaptive-critic-based neural networks, IEEE Trans. Control Syst. Technol. (2002).

  • A. Heydari et al., Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics, IEEE Trans. Neural Netw. Learn. Syst. (2013).