Abstract

An online policy learning algorithm is used to solve the optimal control problem of the power battery state of charge (SOC) observer for the first time. The design of adaptive neural network (NN) optimal control is studied for the nonlinear power battery system based on a second-order resistance-capacitance (RC) equivalent circuit model. First, the unknown uncertainties of the system are approximated by an NN, and a time-varying gain nonlinear state observer is designed to address the problem that the resistance-capacitance voltages and the SOC of the battery cannot be measured directly. Then, to realize the optimal control, a policy learning-based online algorithm is designed, in which only the critic NN is required and the actor NN widely used in most optimal control designs is removed. Finally, the effectiveness of the optimal control theory is verified by simulation.

1. Introduction

Nowadays, electric vehicles are developing at a high speed [1]. The power battery provides the required high power for vehicle start-stop, acceleration and deceleration, and other transient operations, and controlling its charging and discharging power greatly improves the service life of fuel cells [1, 2]. As an important energy storage component of fuel-cell hybrid vehicles, the power battery is of far-reaching significance for research. The state of charge (SOC) of the battery is one of the important parameters of the battery management system (BMS), but SOC cannot be directly measured by the on-board sensors. Therefore, SOC estimation is a very important problem in both theory and application. Moreover, the power battery is a highly complex nonlinear system in its working state, which greatly increases the difficulty of estimation [3].

In order to meet the requirements of accurate, fast, and real-time estimation of power battery SOC under different conditions, scholars have produced many advanced results. In [4], the authors proposed an observer-based control method for a class of unilateral Lipschitz nonlinear systems with time-varying parameter uncertainties and norm-bounded disturbances. For the state-space equation of the equivalent circuit model, a power battery SOC estimation method based on a nonlinear observer is proposed in [5]. The authors in [6] introduced the second-order resistance-capacitance (RC) model of the battery pack. Under the unilateral Lipschitz condition, a nonlinear observer based on the H∞ method is designed, but whether the optimal performance of the observer can be guaranteed remains to be verified. For the problem of optimal control design of observers, the authors proposed an adaptive neural network backstepping recursive optimal control method for nonlinear strict-feedback systems with state constraints [7]. Neural network (NN) state identification is used to approximate the unknown nonlinear dynamics, and under the actor-critic structure, the virtual and actual optimal controllers are constructed through the backstepping recursive control algorithm. Because the adaptive laws in the actor-critic structure are derived by applying gradient descent to the square of the Bellman residual error, these methods are complex and difficult to implement. In this regard, the authors in [8] proposed an optimal control method based on reinforcement learning (RL) for a class of nonlinear strict-feedback systems with unknown dynamic functions. This method eliminates the persistent excitation assumption necessary for most RL-based adaptive optimal control. On this basis, the adaptive NN output-feedback optimal control problem for a class of strict-feedback nonlinear systems with unknown internal dynamics, input saturation, and state constraints is studied in [9]. In [10, 11], the authors proposed novel optimal control algorithms based on advanced AI techniques, which further promote the development of optimal control theory.

Inspired by the abovementioned research results, a nonlinear observer with time-varying gain is designed in this paper. Based on the unilateral Lipschitz condition, the nonlinear dynamic problem contained in the system output is solved. The internal unknown dynamic function is approximated by an NN to estimate the SOC and the resistance-capacitance voltages of the power battery. Then, based on the estimated system states, we develop a policy learning-based optimal control, and the weight estimation error is proven to be uniformly ultimately bounded. Finally, the simulation results show the effectiveness of the proposed method.

The innovations of this paper are summarized as follows:
(1) The optimal control method based on the critic NN is used to solve the optimal control problem of the power battery SOC observer for the first time.
(2) Only one critic NN is used to ensure the convergence of the NN weights; thus, the actor NN widely used in most designs of optimal control methods [12–14] is removed.
(3) Unlike the existing optimal control with known state, the battery state in this paper is unknown. This leads to a more complex optimal control problem.

2. System Modeling

In this paper, we consider the second-order RC equivalent circuit model as shown in Figure 1 [15], where $U_{oc}$ is the open-circuit voltage (OCV) with respect to SOC, $I$ represents the current, $U_t$ denotes the terminal voltage, $R_0$ is the ohmic resistance, $R_1$ and $R_2$ are the electrochemical polarization resistance and the concentration polarization resistance, respectively, and $C_1$ and $C_2$ are the corresponding capacitances. $U_1$ and $U_2$ denote the voltages of the electrochemical polarization capacitor $C_1$ and the concentration polarization capacitor $C_2$, respectively.

Then, based on the Kirchhoff voltage laws, the state equation of Figure 1 can be given as
$$\dot{U}_1 = -\frac{U_1}{R_1 C_1} + \frac{I}{C_1}, \qquad \dot{U}_2 = -\frac{U_2}{R_2 C_2} + \frac{I}{C_2}, \qquad \dot{\mathrm{SOC}} = -\frac{I}{Q_n}, \tag{1}$$
where $Q_n$ is the nominal capacity of the battery.

Then, its output equation can be defined as
$$y = U_t = U_{oc}(\mathrm{SOC}) - U_1 - U_2 - R_0 I, \tag{2}$$
where $y$ is the measured output, and $U_{oc}(\cdot)$ is the nonlinear monotone increasing function mapping SOC to OCV.

Based on (1) and (2), we can obtain the state-space equation as follows:
$$\dot{x}(t) = Ax(t) + Bu(t), \tag{3}$$
$$y(t) = h(x(t)) + Du(t), \tag{4}$$
where $x = [U_1, U_2, \mathrm{SOC}]^T$, $u = I$, $A = \operatorname{diag}\left\{-1/(R_1 C_1), -1/(R_2 C_2), 0\right\}$, $B = \left[1/C_1, 1/C_2, -1/Q_n\right]^T$, $h(x) = U_{oc}(x_3) - x_1 - x_2$ with $D = -R_0$, and $x_0 = x(0)$ is the initial state.
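
To make the model concrete, the following minimal Python sketch simulates state equation (3) by forward Euler; all parameter values are illustrative placeholders rather than the values used in the paper's experiments.

```python
import numpy as np

# Illustrative parameters (placeholders, not the paper's values).
R1, C1 = 0.02, 1000.0       # electrochemical polarization pair R1, C1
R2, C2 = 0.04, 5000.0       # concentration polarization pair R2, C2
Qn = 7200.0                 # nominal capacity in coulombs (2 Ah)

# State x = [U1, U2, SOC]^T, input u = I (discharge positive).
A = np.diag([-1.0 / (R1 * C1), -1.0 / (R2 * C2), 0.0])
B = np.array([1.0 / C1, 1.0 / C2, -1.0 / Qn])

def step(x, u, dt=1.0):
    """One forward-Euler step of the linear state equation (3)."""
    return x + dt * (A @ x + B * u)

x = np.array([0.0, 0.0, 0.9])   # relaxed RC branches, 90% SOC
for _ in range(600):            # 10 minutes of a 1 A discharge
    x = step(x, u=1.0)
print(x)  # U1, U2 approach their steady values; SOC drops linearly
```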

As the power battery is a highly complex nonlinear system in its working state, there are many unknown uncertainties such as ambient temperature, battery self-discharge, battery life, and cycle interval. Therefore, the state-space expression (3) can be expressed as follows:
$$\dot{x}(t) = Ax(t) + Bu(t) + f(x(t)), \tag{5}$$
where $f(x)$ represents the unknown nonlinear characteristics.

Assumption 1. In this paper, we assume that $(A, B)$ is stabilizable and system (5) is detectable. The nonlinear term $f(x)$ is continuous and bounded.
Control objective: for the second-order RC equivalent model of the power battery, based on an adaptive observer, a policy learning algorithm-based optimal controller is designed to guarantee that all signals of the closed-loop system are uniformly ultimately bounded (UUB).
According to the second-order RC model of the power battery, we can derive its state space (3) or (5); then, the control law $u$ should be designed for the derived state-space equation. Thus, we will use the NN observer and the policy learning algorithm to design the control law $u$.

3. Optimal Control of Power Battery

3.1. Observer Design via NN

This section will design an observer to estimate the battery voltages and SOC. Thus, we assume
$$f(x) = W^{*T}\varphi(x) + \varepsilon, \tag{6}$$
where $W^*$ is the ideal NN weight matrix, $\varphi(x)$ is the activation function, and $\varepsilon$ denotes the NN approximation error.

In this paper, the function $f(x)$ is unknown and continuous; hence, the estimated function is
$$\hat{f}(\hat{x}) = \hat{W}^T\varphi(\hat{x}), \tag{7}$$
where $\hat{W}$ is the estimation of $W^*$.

Then, based on (5) and (7), the observer can be designed as
$$\dot{\hat{x}} = A\hat{x} + Bu + \hat{W}^T\varphi(\hat{x}) + L(t)\left(y - \hat{y}\right), \qquad \hat{y} = h(\hat{x}) + Du, \tag{8}$$
where $\hat{x}$ is the estimation of $x$, $L(t)$ is the time-varying observation gain matrix, $P$ is the positive definite matrix used in the analysis below, and $\hat{y}$ is the estimation of $y$.

We define the observation error
$$\tilde{x} = x - \hat{x}. \tag{9}$$

Then, from (5) and (8), we can obtain the observation error dynamic equation as
$$\dot{\tilde{x}} = A\tilde{x} - L(t)\left(y - \hat{y}\right) + \tilde{W}^T\varphi(\hat{x}) + W^{*T}\left[\varphi(x) - \varphi(\hat{x})\right] + \varepsilon, \tag{10}$$
where $\tilde{W} = W^* - \hat{W}$ is the NN weight error.

Lemma 2. For system (5), if it adopts the designed observer (8), and the NN weights satisfy the adaptive law
$$\dot{\hat{W}} = \Gamma\left[\varphi(\hat{x})\left(y - \hat{y}\right)\Lambda^T - \kappa\hat{W}\right], \tag{11}$$
where $\Gamma > 0$ is the adaptation gain matrix, $\Lambda$ is a design vector, and $\kappa > 0$ is a small constant, then this can guarantee that the errors $\tilde{x}$ and $\tilde{W}$ are UUB.

Proof. Consider a Lyapunov function
$$V_1 = \tilde{x}^T P\tilde{x} + \operatorname{tr}\left(\tilde{W}^T\Gamma^{-1}\tilde{W}\right). \tag{12}$$
From [15], we have $U_{oc}(\mathrm{SOC}) - U_{oc}(\widehat{\mathrm{SOC}}) = k(t)(\mathrm{SOC} - \widehat{\mathrm{SOC}})$ with $k_{\min} \le k(t) \le k_{\max}$, where $k_{\min}$ and $k_{\max}$ are the minimum and maximum values of the change rate of the function $U_{oc}(\cdot)$, respectively, so that $y - \hat{y} = C(t)\tilde{x}$ with $C(t) = [-1, -1, k(t)]$. Then, the derivation of (12) gives
$$\dot{V}_1 = \tilde{x}^T\left[\left(A - L(t)C(t)\right)^T P + P\left(A - L(t)C(t)\right)\right]\tilde{x} + 2\tilde{x}^T P\tilde{W}^T\varphi(\hat{x}) + 2\tilde{x}^T P\delta + 2\operatorname{tr}\left(\tilde{W}^T\Gamma^{-1}\dot{\tilde{W}}\right), \tag{13}$$
where $\delta = W^{*T}\left[\varphi(x) - \varphi(\hat{x})\right] + \varepsilon$.
According to the unilateral Lipschitz condition [9], the following inequalities can be obtained:
$$\langle\delta, \tilde{x}\rangle \le \rho\|\tilde{x}\|^2, \tag{14}$$
$$\langle\delta, \delta\rangle \le \beta\|\tilde{x}\|^2 + \gamma\langle\delta, \tilde{x}\rangle. \tag{15}$$
Taking (14) and (15) into (13), and considering $\dot{\tilde{W}} = -\dot{\hat{W}}$ with the adaptive law (11), we have
$$\dot{V}_1 \le -\lambda_1\|\tilde{x}\|^2 - \lambda_2\|\tilde{W}\|_F^2 + \mu. \tag{16}$$
Based on [8], let $\lambda = \min\left\{\lambda_1/\lambda_{\max}(P), \lambda_2\lambda_{\min}(\Gamma)\right\}$, where $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ denote the maximum and minimum eigenvalues, respectively; thus, (16) can be further written as
$$\dot{V}_1 \le -\lambda V_1 + \mu, \tag{17}$$
where $\lambda > 0$ and $\mu > 0$ is a constant determined by the bounds of the NN approximation error.
If $\lambda > 0$, then the term $e^{-\lambda t}V_1(0)$ in (17) can converge to zero. Moreover, by selecting the appropriate matrix $P$, $\lambda$ can be relatively large. According to (17), the observation error can converge to a small neighborhood containing the origin.
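
For illustration, a minimal Python sketch of observer (8) with adaptive law (11) follows. The RBF basis, the constant stand-in for the time-varying gain $L(t)$, the design vector $\Lambda$, the toy OCV curve `uoc`, and all numerical values are assumptions made for readability, not the paper's tuned design.

```python
import numpy as np

# Model matrices as in (3), with the placeholder values used earlier.
A = np.diag([-0.05, -0.005, 0.0])     # = diag(-1/(R1 C1), -1/(R2 C2), 0)
B = np.array([1e-3, 2e-4, -1.0 / 7200.0])

rng = np.random.default_rng(0)
centers = rng.uniform(0.0, 1.0, size=(3, 10))   # assumed RBF centers

def phi(xh):
    """Assumed RBF activation vector in R^10."""
    d = xh[:, None] - centers
    return np.exp(-np.sum(d * d, axis=0) / 0.5)

def uoc(soc):
    """Toy monotone OCV curve standing in for U_oc(SOC)."""
    return 3.2 + 0.8 * soc

R0 = 0.01                        # placeholder ohmic resistance
L = np.array([0.2, 0.2, 0.5])    # constant stand-in for the gain L(t)
Lam = np.array([0.0, 0.0, 1.0])  # assumed design vector in law (11)
Gamma, kappa = 5.0, 0.01         # adaptation gain and sigma-modification

def observer_step(xh, Wh, u, y, dt=0.1):
    """One Euler step of observer (8) and adaptive law (11)."""
    yh = uoc(xh[2]) - xh[0] - xh[1] - R0 * u    # output estimate, cf. (2)
    ey = y - yh                                 # output estimation error
    dxh = A @ xh + B * u + Wh.T @ phi(xh) + L * ey
    dWh = Gamma * (np.outer(phi(xh), ey * Lam) - kappa * Wh)
    return xh + dt * dxh, Wh + dt * dWh

xh, Wh = np.array([0.0, 0.0, 0.5]), np.zeros((10, 3))  # initial guesses
```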

3.2. Optimal Control Design Based on the Observer
3.2.1. Online Policy Learning Algorithm

In this section, based on the critic NN, we construct the policy learning law. Thus, system (8) can be rewritten as
$$\dot{\hat{x}} = F(\hat{x}) + Gu, \tag{18}$$
where $F(\hat{x}) = A\hat{x} + \hat{W}^T\varphi(\hat{x}) + L(t)(y - \hat{y})$, $G = B$, and $V(\hat{x})$ defined in (19) below serves as the Lyapunov (cost) function.

To realize the optimal control, we first define the cost function as
$$V(\hat{x}(t)) = \int_t^{\infty} U\left(\hat{x}(\tau), u(\tau)\right)\mathrm{d}\tau, \tag{19}$$
with $U(\hat{x}, u) = \hat{x}^T Q\hat{x} + u^T Ru$ being the utility function, where $Q \ge 0$ and $R > 0$ are the weight matrices of proper dimension.

We define the Hamiltonian function of the optimal control problem and the optimal cost function as
$$H\left(\hat{x}, u, V_{\hat{x}}\right) = U(\hat{x}, u) + V_{\hat{x}}^T\left(F(\hat{x}) + Gu\right), \tag{20}$$
$$V^*(\hat{x}) = \min_{u}\int_t^{\infty} U\left(\hat{x}(\tau), u(\tau)\right)\mathrm{d}\tau, \tag{21}$$
where $V_{\hat{x}} = \partial V/\partial\hat{x}$.

The optimal cost function is the solution of the following HJB equation:
$$0 = \min_{u} H\left(\hat{x}, u, V^*_{\hat{x}}\right). \tag{22}$$

With $\partial H/\partial u = 0$, we can obtain the optimal control action as
$$u^* = -\frac{1}{2}R^{-1}G^T V^*_{\hat{x}}, \tag{23}$$
and the HJB equation in terms of $V^*_{\hat{x}}$ as
$$0 = \hat{x}^T Q\hat{x} + V^{*T}_{\hat{x}}F(\hat{x}) - \frac{1}{4}V^{*T}_{\hat{x}}GR^{-1}G^T V^*_{\hat{x}}, \tag{24}$$
with $V^*(0) = 0$.
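
As a numerical sanity check on (23) and (24): when the uncertainty $f$ is negligible and the cost is quadratic, $V^*(\hat{x}) = \hat{x}^T P\hat{x}$ turns the HJB equation into the algebraic Riccati equation. The sketch below verifies this with SciPy; the matrices are illustrative stand-ins, not the paper's values.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# With F(xh) = A xh and G = B, V*(xh) = xh^T P xh reduces HJB (24) to
# the ARE: A^T P + P A + Q - P B R^{-1} B^T P = 0,
# and (23) gives u* = -R^{-1} B^T P xh.
A = np.diag([-0.05, -0.005, 0.0])          # illustrative stand-ins
B = np.array([[1e-3], [2e-4], [-1.39e-4]])
Q, R = np.eye(3), np.eye(1)

P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)            # optimal gain: u* = -K xh
residual = A.T @ P + P @ A + Q - P @ B @ K
print(np.abs(residual).max() < 1e-6)       # -> True: (24) is satisfied
```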

To realize the policy learning, the iteration procedure can be given as follows:
(1) Select the small positive number $\epsilon$. Set $i = 0$ and $V^{(0)}(\cdot) = 0$, and then give an initial admissible control $u^{(0)}$.
(2) Using the control $u^{(i)}$, resolve
$$0 = U\left(\hat{x}, u^{(i)}\right) + \left(V^{(i)}_{\hat{x}}\right)^T\left(F(\hat{x}) + Gu^{(i)}\right), \tag{25}$$
with $V^{(i)}(0) = 0$.
(3) Update the control action using
$$u^{(i+1)} = -\frac{1}{2}R^{-1}G^T V^{(i)}_{\hat{x}}. \tag{26}$$
(4) If $\left|V^{(i)} - V^{(i-1)}\right| \le \epsilon$, stop and apply the optimal control; else, let $i = i + 1$ and go back to step (2).

This algorithm will converge to the optimal control and the optimal cost function as $i \rightarrow \infty$. The convergence of this algorithm can be referred to [16, 17].
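
In the linear special case used above, the iteration reduces to Kleinman's algorithm: step (2) becomes a Lyapunov equation and step (3) a gain update. A runnable sketch under the same illustrative matrices:

```python
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

def policy_iteration(A, B, Q, R, K0, tol=1e-9, max_iter=50):
    """Kleinman-style policy iteration: (25) evaluates the current
    policy via a Lyapunov equation; (26) improves the gain."""
    K, P_prev = K0, np.zeros_like(Q)
    for _ in range(max_iter):
        Acl = A - B @ K
        # Policy evaluation (25): Acl^T P + P Acl + Q + K^T R K = 0
        P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
        # Policy improvement (26): K = R^{-1} B^T P, i.e., u = -K xh
        K = np.linalg.solve(R, B.T @ P)
        if np.max(np.abs(P - P_prev)) < tol:
            break
        P_prev = P
    return P, K

A = np.diag([-0.05, -0.005, 0.0])            # illustrative stand-ins
B = np.array([[1e-3], [2e-4], [-1.39e-4]])
Q, R = np.eye(3), np.eye(1)
K0 = np.array([[0.0, 0.0, -1.0]])            # admissible (stabilizing) gain
P, K = policy_iteration(A, B, Q, R, K0)
print(np.allclose(P, solve_continuous_are(A, B, Q, R)))  # -> True
```

Starting from any stabilizing gain, the iterates converge monotonically to the Riccati solution, mirroring the convergence result cited from [16, 17].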

3.2.2. NN Implementation

We assume the cost function is continuously differentiable. Then, we can use the NN to reconstruct the optimal cost function $V^*(\hat{x})$ as
$$V^*(\hat{x}) = W_c^T\phi(\hat{x}) + \varepsilon_c, \tag{27}$$
where $W_c$ is the ideal NN weight vector, $\phi(\hat{x})$ is the activation function, and $\varepsilon_c$ denotes the NN error. Then,
$$V^*_{\hat{x}} = \nabla\phi^T(\hat{x})W_c + \nabla\varepsilon_c, \tag{28}$$
where $\nabla\phi$ and $\nabla\varepsilon_c$ are the gradients of the activation function and NN error, respectively. According to (28), we can obtain the derivative of the Lyapunov (cost) function along system (18) as
$$\dot{V}^* = \left(\nabla\phi^T(\hat{x})W_c + \nabla\varepsilon_c\right)^T\left(F(\hat{x}) + Gu\right). \tag{29}$$

Assumption 3. (see [12–14, 18]). If the NN weight $W_c$, the NN error $\varepsilon_c$, the gradient $\nabla\phi$, and the derivative $\nabla\varepsilon_c$ are bounded, then we can have $\|e_H\| \le e_{HM}$ and $\|\varepsilon_c\| \le \varepsilon_M$ for positive constants $e_{HM}$ and $\varepsilon_M$, where $e_H$ is the residual error defined in (34) below.
We define the estimation of (27) as
$$\hat{V}(\hat{x}) = \hat{W}_c^T\phi(\hat{x}). \tag{30}$$
Then, we have
$$\dot{\hat{V}} = \hat{W}_c^T\nabla\phi(\hat{x})\left(F(\hat{x}) + Gu\right) = \hat{W}_c^T\sigma, \tag{31}$$
with $\sigma = \nabla\phi(\hat{x})\left(F(\hat{x}) + Gu\right)$. Thus, the estimated Hamiltonian function can be given as
$$\hat{H}\left(\hat{x}, u, \hat{W}_c\right) = \hat{x}^T Q\hat{x} + u^T Ru + \hat{W}_c^T\sigma = e. \tag{32}$$
To minimize error (32), we construct the objective function $E = \frac{1}{2}e^T e$, and then the descent algorithm can be designed as
$$\dot{\hat{W}}_c = -\alpha\frac{\sigma}{\left(1 + \sigma^T\sigma\right)^2}e, \tag{33}$$
with $\alpha > 0$ being the learning gain of the NN.
Based on (29), the Hamiltonian function can be rewritten as
$$H\left(\hat{x}, u, W_c\right) = \hat{x}^T Q\hat{x} + u^T Ru + W_c^T\sigma = e_H, \tag{34}$$
where $e_H = -\nabla\varepsilon_c^T\left(F(\hat{x}) + Gu\right)$ is the residual error.
Define $\bar{\sigma} = \sigma/(1 + \sigma^T\sigma)$; if there is a positive constant $\sigma_M$ such that $\|\bar{\sigma}\| \le \sigma_M$, and denote the weight estimation error $\tilde{W}_c = W_c - \hat{W}_c$, then based on (32) and (34), we have $e = e_H - \tilde{W}_c^T\sigma$; thus, we have the dynamic of the weight estimation error as
$$\dot{\tilde{W}}_c = -\alpha\frac{\sigma\sigma^T}{\left(1 + \sigma^T\sigma\right)^2}\tilde{W}_c + \alpha\frac{\sigma}{\left(1 + \sigma^T\sigma\right)^2}e_H. \tag{35}$$
The persistent excitation (PE) condition is required to tune the NN, guaranteeing $\int_t^{t+T}\bar{\sigma}\bar{\sigma}^T\mathrm{d}\tau \ge \beta_0 I$ with $\beta_0$ being a positive constant. To this end, a probing noise is inserted into the system to meet the PE condition.
In this case, the optimal control action can be given as
$$u^* = -\frac{1}{2}R^{-1}G^T\left(\nabla\phi^T(\hat{x})W_c + \nabla\varepsilon_c\right), \tag{36}$$
and its estimation is
$$\hat{u} = -\frac{1}{2}R^{-1}G^T\nabla\phi^T(\hat{x})\hat{W}_c. \tag{37}$$
Equation (37) shows that, using the trained critic network, the control policy can be derived directly; thus, the actor NN is removed in this paper. The structural diagram of the algorithm is given in Figure 2.
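
A minimal sketch of critic update (33) and control extraction (37) follows, assuming the quadratic regressor introduced in Section 4; `phi_quad`, `grad_phi_quad`, and `critic_step` are hypothetical helper names, and the gains are placeholders.

```python
import numpy as np

def phi_quad(x):
    """Quadratic regressor phi(x) = [x1^2, x1x2, x1x3, x2^2, x2x3, x3^2]^T."""
    x1, x2, x3 = x
    return np.array([x1*x1, x1*x2, x1*x3, x2*x2, x2*x3, x3*x3])

def grad_phi_quad(x):
    """Jacobian of phi_quad with respect to x, shape (6, 3)."""
    x1, x2, x3 = x
    return np.array([[2*x1, 0.0, 0.0],
                     [x2,   x1,  0.0],
                     [x3,   0.0, x1],
                     [0.0,  2*x2, 0.0],
                     [0.0,  x3,   x2],
                     [0.0,  0.0,  2*x3]])

def critic_step(Wc, x, xdot, u, Q, R, alpha=0.5, dt=0.01):
    """One Euler step of the normalized gradient law (33)."""
    sigma = grad_phi_quad(x) @ xdot          # sigma = nabla phi * (F + G u)
    e = x @ Q @ x + u @ R @ u + Wc @ sigma   # Hamiltonian residual, cf. (32)
    ms = (1.0 + sigma @ sigma) ** 2          # normalization term
    return Wc - dt * alpha * e * sigma / ms

def control_from_critic(Wc, x, G, R):
    """Control (37) extracted directly from the trained critic."""
    Vx = grad_phi_quad(x).T @ Wc             # estimate of dV*/dx, cf. (31)
    return -0.5 * np.linalg.solve(R, G.T @ Vx)
```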

Lemma 4. For system (18), if the adaptive law for the critic NN is provided by (33), then the weight estimation error of the NN is UUB.

Proof. Choose the Lyapunov function as $L_c = \frac{1}{2\alpha}\tilde{W}_c^T\tilde{W}_c$. The time derivative of the Lyapunov function along the trajectory of error dynamics (35) is
$$\dot{L}_c = -\tilde{W}_c^T\frac{\sigma\sigma^T}{\left(1 + \sigma^T\sigma\right)^2}\tilde{W}_c + \tilde{W}_c^T\frac{\sigma}{\left(1 + \sigma^T\sigma\right)^2}e_H. \tag{38}$$
After doing some basic manipulations, we have
$$\dot{L}_c \le -\left\|\bar{\sigma}^T\tilde{W}_c\right\|^2 + \left\|\bar{\sigma}^T\tilde{W}_c\right\|\left\|e_H\right\|. \tag{39}$$
Considering the Cauchy–Schwarz inequality and noticing the assumption $\|e_H\| \le e_{HM}$, we can conclude that $\dot{L}_c < 0$ as long as
$$\left\|\bar{\sigma}^T\tilde{W}_c\right\| > e_{HM}. \tag{40}$$
According to the Lyapunov theory, we obtain that the dynamics of the weight estimation error is UUB. The norm of the weight estimation error is bounded as well.
It is noted that the estimated weight $\hat{W}_c$ is optimal with respect to the cost (30), and this indicates that the solution of HJB equation (24) can be extracted from the estimated vector given in (30). Thus, one can derive the actual control (37) for system (18) based on $\hat{W}_c$. As a consequence of Lemma 4, we can conclude that $\hat{u}$ will converge to a neighborhood of the optimal control $u^*$, such that the control system stability can be retained.

Remark 5. In this paper, an observer is designed using an NN to estimate the unknown state (SOC) online; then, based on the estimated state, we develop a policy learning algorithm to solve the optimal control problem of the battery online. The proposed methods are different from our previous work, such as [18], where the system states are assumed to be known, which limits the application of the optimal control algorithm in practice.

Remark 6. To realize the output-feedback control using the policy learning, the PE condition is required in this paper. As shown in [14, 17], to guarantee the PE condition, an alternative way is to insert an exploration noise into the system for the first two seconds [17].

4. Simulation Results

For the second-order RC equivalent model of the power battery, the effectiveness of the optimal control theory in this paper is verified by simulation based on MATLAB. The values of the resistances $R_0$, $R_1$, and $R_2$, the capacitances $C_1$ and $C_2$, and the battery capacity $Q_n$ in the second-order RC equivalent model (5) are set according to the tested battery.

Let $x = [U_1, U_2, \mathrm{SOC}]^T$ and $u = I$; then, we can obtain $A$ and $B$ as in model (3) with the above parameter values.

Given the design parameter $\alpha$ in learning law (33) and the initial values of the system state $x(0)$, the observer state $\hat{x}(0)$, and the critic weights $\hat{W}_c(0)$, we design the regressor of the critic NN as the quadratic basis $\phi(\hat{x}) = \left[\hat{x}_1^2, \hat{x}_1\hat{x}_2, \hat{x}_1\hat{x}_3, \hat{x}_2^2, \hat{x}_2\hat{x}_3, \hat{x}_3^2\right]^T$.

We aim at obtaining an optimal control policy that can stabilize system (18). For system (18), we need to find a feedback control policy that minimizes cost function (19) with the chosen weight matrices $Q$ and $R$. We adopt the online policy iteration algorithm to tackle the optimal control problem, where a critic network is constructed to approximate the cost function. During the implementation of the policy learning algorithm, we introduce noise to meet the PE condition: an exponentially decreasing probing noise composed of sinusoidal signals with different frequencies is injected into the control input and thus excites the system states.
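
A probing signal of the kind described above might be generated as follows; the amplitudes, frequencies, decay rate, and switch-off time are illustrative assumptions, since the paper's exact signal is not reproduced here.

```python
import numpy as np

def probing_noise(t, t_off=1000.0):
    """Exponentially decreasing sum of sinusoids injected into the
    control input to satisfy the PE condition; it is switched off
    once the critic weights have converged (around t_off here)."""
    if t >= t_off:
        return 0.0
    freqs = np.array([0.1, 0.7, 1.3, 2.9])   # rad/s, illustrative
    return np.exp(-0.001 * t) * float(np.sum(np.sin(freqs * t)))
```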

The evolution of the state trajectory is depicted in Figure 3, and this can be used to further design the optimal controller for the proposed system. Figure 4 gives the estimated weights, where we see that the convergence of the weights has occurred after 1000 s. Then, the probing signal is turned off. This good convergence of the NN weights can ensure the stability of the controlled system, which can be found in Figure 5, showing the system trajectory under the designed optimal controller. We see that the states converge to zero after the probing noise is turned off. Figure 6 shows the cost of the system, which is smooth, and this indicates that the designed controller is effective. The control action is given in Figure 7, which is bounded. This further shows that Lemma 4 is true.

To show the improved performance of the proposed single critic NN-based ADP for solving the derived optimal control problem, a critic-actor NN-based online learning method [19] is also used for comparison. Moreover, in this comparison, we add the robustness verification of the proposed method by introducing a nonzero nonlinear term $f(x)$. The profiles of the critic NN and actor NN weights can be found in Figure 8, and the corresponding control performances are given in Figure 9. Comparing Figures 9(a) and 9(b), it is clear that the proposed single critic NN-based method can achieve faster transient state convergence even in the presence of the nonlinear term.

Generally, the modeling accuracy and control structure influence the control performance of closed-loop control systems. In this paper, the main factors affecting the control performance are the modeling uncertainties of the system and the convergence performance of the critic NN weights. Moreover, better convergence of the critic NN weights, i.e., a faster convergence speed, can help to achieve better control performance. In this respect, different choices of critic NN parameters and structure will affect the convergence of the critic NN weights and the control performance. Hence, proper selection of NN parameters and structure, such as the initial value of the weights, the learning gain, and the regressor structure, is helpful to further improve the control response.

5. Conclusion

For the second-order RC equivalent nonlinear system of the power battery, the unknown uncertainty of the system is approximated by an NN, and a time-varying gain nonlinear state observer is designed to solve the problem that the resistance-capacitance voltages and the state of charge (SOC) of the battery cannot be measured. Then, to realize the optimal control, a policy learning-based online algorithm is designed, where only the critic NN is required, and the actor NN widely used in most optimal control designs is removed. Finally, the effectiveness of the optimal control theory is verified by simulation.

Data Availability

The data used to support the findings of this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Qinglin Zhu and Jun Zhao conceptualized the study; Huanli Sun and Ziliang Zhao were responsible for methodology; Ziliang Zhao and Yixin Liu performed formal analysis; Qinglin Zhu wrote the original draft; Qinglin Zhu and Yixin Liu reviewed and edited the manuscript; and Huanli Sun and Ziliang Zhao were responsible for funding acquisition. All authors have read and agreed to the published version of the manuscript.

Acknowledgments

This work was supported in part by Jilin Provincial Major Science and Technology Projects (Grant no.: 20210301020GX).