Abstract

The popular deep Q-learning algorithm is known to be unstable because of oscillation of the Q-values and overestimation of action values under certain conditions. These issues tend to adversely affect performance. In this paper, we develop an ensemble network architecture for deep reinforcement learning based on value function approximation. The temporal ensemble stabilizes the training process by reducing the variance of the target approximation error, and the ensemble of target values reduces overestimation and improves performance by estimating more accurate Q-values. Our results show that this architecture leads to statistically significantly better value estimates and to more stable and better performance on several classical control tasks in the OpenAI Gym environment.

1. Introduction

Reinforcement learning (RL) algorithms [1, 2] are well suited to learning to control an agent by letting it interact with an environment. In recent years, deep neural networks (DNN) have been introduced into reinforcement learning and have achieved great success in value function approximation. The first deep Q-network (DQN) algorithm, which successfully combined a powerful nonlinear function approximation technique, namely DNNs, with the Q-learning algorithm, was proposed by Mnih et al. [3]. In that work, the experience replay mechanism was also proposed. Following the DQN work, a variety of solutions have been proposed to stabilize the algorithms [3–9]. The deep Q-network class of algorithms has achieved unprecedented success in challenging domains such as Atari 2600 and other games.

Although DQN algorithms have been successful in solving many problems because of their powerful function approximation ability and strong generalization between similar state inputs, they still suffer from certain issues. Two reasons for this are as follows: (a) the randomness of the sampling is likely to lead to severe oscillations, and (b) these systematic errors might cause instability, poor performance, and sometimes divergence of learning. In order to address these issues, the averaged target DQN (ADQN) [10] algorithm constructs target values by continuously combining target Q-networks with a single learning network, and the Bootstrapped DQN [11] algorithm achieves more efficient exploration and better performance by learning several Q-networks in parallel. Although these algorithms do reduce overestimation, they do not evaluate the importance of the previously learned networks. Besides, high variance in target values combined with the max operator still remains.

Some ensemble algorithms [4, 12] address this issue in reinforcement learning, but these existing algorithms are not compatible with nonlinearly parameterized value functions.

In this paper, we propose the ensemble algorithm as a solution to this problem. In order to enhance learning speed and final performance, we combine multiple reinforcement learning algorithms in a single agent with several ensemble algorithms to determine the actions or action probabilities. In supervised learning, ensemble algorithms such as bagging, boosting, and mixtures of experts [13] are often used for learning and combining multiple classifiers. But in RL, ensemble algorithms are used for representing and learning the value function.

Based on an agent integrated with multiple reinforcement learning algorithms, multiple value functions are learned at the same time. The ensembles combine the policies derived from the value functions into a final policy for the agent. Majority voting (MV), rank voting (RV), Boltzmann multiplication (BM), and Boltzmann addition (BA) are used to combine RL algorithms. Because these methods are computationally costly in deep reinforcement learning (DRL), we instead combine different DRL algorithms that learn separate value functions and policies. Therefore, in our ensemble approach we combine the different policies derived from the update targets learned by deep Q-networks, deep Sarsa networks, double deep Q-networks, and other DRL algorithms. As a consequence, this leads to reduced overestimation, a more stable learning process, and improved performance.

2.1. Reinforcement Learning

Reinforcement learning is a machine learning method that allows a system to interact with and learn from the environment in order to maximize the cumulative return of rewards. Assume the standard reinforcement learning setting in which an agent interacts with an environment $E$. We can describe this process as a Markov Decision Process (MDP) [2, 9], specified by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$. At each step $t$, the agent receives a state $s_t$ and selects an action $a_t$ from the set of legal actions $\mathcal{A}$ according to the policy $\pi$, where $\pi$ is a mapping from states (or sequences of observations) to actions. The action is passed to the environment $E$; in return, the agent receives the next state $s_{t+1}$ and a reward signal $r_t$. This process continues until the agent reaches a terminal state.

The agent seeks to maximize the expected discounted return, where we define the future discounted return at time $t$ as $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ with discount factor $\gamma \in (0, 1]$. The goal of the RL agent is to learn a policy that maximizes the expected future discounted return. For an agent behaving according to a stochastic policy $\pi$, the value of the state-action pair $(s, a)$ can be defined as $Q^{\pi}(s, a) = \mathbb{E}\left[ R_t \mid s_t = s, a_t = a, \pi \right]$. The optimal action-value function $Q^{*}(s, a)$ satisfies the Bellman equation $Q^{*}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a \right]$.

Value-based reinforcement learning algorithms estimate the action-value function by iteratively applying the Bellman equation as an update, $Q_{i+1}(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q_i(s', a') \mid s, a \right]$. As $i \to \infty$, the algorithm makes the Q-value function converge to the optimal action-value function, $Q_i \to Q^{*}$ [1]. If the optimal Q-function is known, the agent can act optimally by selecting the action with the maximal value in each state: $a^{*} = \arg\max_{a} Q^{*}(s, a)$.
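As a concrete illustration (not part of the paper's method, which uses function approximation), a minimal sketch of the tabular Q-learning update implied by this value-iteration view could look as follows; the array layout and the hyperparameter values are assumptions for illustration only.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward the Bellman target.

    Q is assumed to be a NumPy array of shape [n_states, n_actions];
    alpha and gamma are illustrative values, not taken from the paper.
    """
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```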

2.2. Target Deep Q-Learning

RL agents update their model parameters while observing a stream of transitions $(s_t, a_t, r_t, s_{t+1})$, discarding the incoming data after a single update. There are two issues with this approach. First, there are strong correlations among the incoming data, which may break the assumptions of many popular stochastic gradient-based algorithms. Second, minor changes in the Q-function may result in large changes in the policy, which makes the algorithm difficult to converge [7, 9, 14, 15].

The deep Q-network algorithm proposed in (Mnih et al., 2013) improves on this in two respects. On the one hand, the action-value function is approximated by a DNN: DQN uses a network with parameters $\theta$ to approximate the value function, $Q(s, a; \theta) \approx Q^{*}(s, a)$. On the other hand, the experience replay mechanism is adopted: the algorithm learns from transitions sampled from an experience buffer rather than learning fully online. This mechanism breaks the temporal correlations by mixing more and less recent experience for updating and training. This model-free reinforcement learning algorithm avoids the "model disaster" and uses generalized approximation of the value function to address the "dimension disaster."
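As an illustrative sketch (the buffer capacity and sampling details are assumptions, not the paper's exact configuration), a minimal experience replay buffer could be implemented as follows.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay: store transitions, sample decorrelated minibatches."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation of the stream.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```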

The convergence issue was discussed in 2015 by Schaul et al. [14]. The above Q-learning update rule can be implemented directly with a neural network: DQN uses a DNN with parameters $\theta$ to approximate the value function. The parameter update from a transition $(s_t, a_t, r_t, s_{t+1})$ is given by the following [11]:
$$\theta_{t+1} = \theta_t + \alpha \left( Y_t^{DQN} - Q(s_t, a_t; \theta_t) \right) \nabla_{\theta_t} Q(s_t, a_t; \theta_t),$$
with the target
$$Y_t^{DQN} = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta^{-}).$$

The update target for Sarsa can be described as follows:
$$Y_t^{Sarsa} = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}; \theta^{-}),$$
where $\alpha$ above is the scalar learning rate and $\theta^{-}$ are the target network parameters, which are fixed to the online parameters every $\tau$ steps. The squared error is taken as the loss function, $L(\theta) = \left( Y_t - Q(s_t, a_t; \theta) \right)^2$.
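As a minimal sketch (the function names, the terminal-state handling, and the discount value are illustrative assumptions), the DQN and Sarsa targets described above could be computed as follows.

```python
import numpy as np

def dqn_target(reward, next_state_q, done, gamma=0.99):
    """DQN target: bootstrap from the max action value of the target network.

    next_state_q is assumed to be the target network's Q-value vector for s_{t+1}.
    """
    return reward if done else reward + gamma * np.max(next_state_q)

def sarsa_target(reward, next_state_q, next_action, done, gamma=0.99):
    """Sarsa target: bootstrap from the value of the action actually taken next."""
    return reward if done else reward + gamma * next_state_q[next_action]
```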

In general, experience replay can reduce the amount of experience required to learn and replace it with more computation and more memory, which are often cheaper resources than the RL agent’s interactions with its environment [14].

2.3. Double Deep Q-Learning

In Q-learning and DQN, the max operator uses the same values both to select and to evaluate an action, which can lead to overoptimistic value estimates (van Hasselt, 2010). To mitigate this problem, the update target of double Q-learning can be written as follows:
$$Y_t^{DoubleQ} = r_{t+1} + \gamma Q\left(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta_t); \theta^{-}\right).$$
DDQN is the same as DQN [8], but with the target $Y_t^{DQN}$ replaced by $Y_t^{DoubleQ}$.
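As an illustrative sketch (the argument names and the decoupling into two Q-value vectors are assumptions about how the networks would be queried), the double DQN target could be computed like this.

```python
import numpy as np

def double_dqn_target(reward, next_q_online, next_q_target, done, gamma=0.99):
    """Double DQN target: select the action with the online network,
    evaluate it with the target network, which reduces overestimation.

    next_q_online / next_q_target are assumed to be the two networks'
    Q-value vectors for s_{t+1}.
    """
    if done:
        return reward
    best_action = int(np.argmax(next_q_online))          # selection by online network
    return reward + gamma * next_q_target[best_action]   # evaluation by target network
```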

3. Ensemble Methods for Deep Reinforcement Learning

Because DQN-class algorithms use DNNs to approximate the value function, they generalize strongly between similar state inputs. This generalization can cause divergence in the case of repeated bootstrapped temporal-difference updates. We address this issue by integrating different versions of the target network.

Ensembles of classifiers have been shown to be more effective than a single classifier and can achieve higher accuracy. Bagging, boosting, and AdaBoost are methods for training multiple classifiers. In RL, by contrast, ensemble algorithms are used for representing and learning the value function, and their outputs are combined by majority voting, rank voting, Boltzmann multiplication, mixture models, and other ensemble methods. If the errors of the individual members are not strongly correlated, this can significantly improve accuracy.
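As an illustrative sketch (a generic majority vote over greedy actions, not the paper's exact combination scheme), combining the policies of several Q-functions could look as follows.

```python
import numpy as np
from collections import Counter

def majority_vote_action(q_values_per_algorithm):
    """Majority voting: each algorithm votes for its greedy action;
    the most common action wins (ties go to the vote counted first).

    q_values_per_algorithm is assumed to be a list of 1-D arrays,
    one Q-value vector per algorithm for the current state.
    """
    votes = [int(np.argmax(q)) for q in q_values_per_algorithm]
    return Counter(votes).most_common(1)[0][0]
```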

3.1. Temporal Ensemble

As described in Section 2.2, DQN-class deep reinforcement learning algorithms use a target network whose parameters $\theta^{-}$ are copied from $\theta$ every $\tau$ steps. The temporal ensemble method is therefore suitable for any algorithm that uses a target network for updating and training. The temporal ensemble uses the $K$ previously learned networks to produce the value estimate and maintains $K$ complete networks with distinct memory buffers. The most recent Q-value function is trained against its own target network $Q(s, a; \theta_k^{-})$, so each of the $K$ Q-value functions represents a temporally extended estimate of the Q-value function.

Note that the more recent target networks are likely to be more accurate at the beginning of training, and the accuracy of all the target networks increases as training goes on. We therefore introduce a weighting parameter $\lambda \in (0, 1]$ for the target networks: the weight of the $k$th most recent target network is $w_k = \lambda^{k-1} / \sum_{i=1}^{K} \lambda^{i-1}$.

So the Q-value estimate produced by the temporal ensemble can be described as follows:
$$Q^{TE}(s, a) = \sum_{k=1}^{K} w_k\, Q(s, a; \theta_k^{-}), \qquad w_k = \frac{\lambda^{k-1}}{\sum_{i=1}^{K} \lambda^{i-1}}.$$
Since $\sum_{k=1}^{K} w_k = 1$, the target networks all receive the same weight when $\lambda$ equals 1. The formula indicates that the more recent a target network is, the greater its weight; as the target networks become more accurate, their weights can be made equal by letting $\lambda$ approach 1. The loss function remains the same as in DQN, and so does the parameter update equation:
$$L(\theta) = \left( r + \gamma \max_{a'} Q^{TE}(s', a') - Q(s, a; \theta) \right)^2.$$
In every iteration, the parameters of the oldest target network are removed from the target network buffer and the newest ones are added. Note that the Q-value functions are inaccurate at the beginning of training, so the parameter $\lambda$ may be made a function of time or even of the state space.
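A minimal sketch of the temporal ensemble estimate, assuming the geometric weighting reconstructed above (the paper's exact weighting may differ), could look like this.

```python
import numpy as np

def temporal_ensemble_q(target_q_snapshots, lam=0.9):
    """Weighted average of Q-value estimates from K stored target networks.

    target_q_snapshots: list of Q-value arrays for the same states, ordered
    from most recent to oldest snapshot. lam is the decay parameter
    (lam = 1 gives equal weights); both are illustrative assumptions.
    """
    K = len(target_q_snapshots)
    weights = np.array([lam ** k for k in range(K)])
    weights /= weights.sum()  # normalize so the weights sum to one
    return sum(w * q for w, q in zip(weights, target_q_snapshots))
```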

3.2. Ensemble of Target Values

Traditional ensemble reinforcement learning algorithms maintain multiple tabular learners in memory [4, 16], and majority voting, rank voting, Boltzmann addition, and so forth are used to combine them. But deep reinforcement learning uses neural networks as function approximators, and maintaining multiple neural networks is computationally expensive and inefficient. In contrast to previous research, we combine different DRL algorithms that learn separate value functions and policies. Therefore, in our ensemble approach we combine the different update targets learned by deep Q-networks, deep Sarsa networks, double deep Q-networks, and other DRL algorithms as follows:
$$Y_t^{DQN} = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta^{-}),$$
$$Y_t^{Sarsa} = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}; \theta^{-}),$$
$$Y_t^{DoubleQ} = r_{t+1} + \gamma Q\left(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta_t); \theta^{-}\right).$$
Besides these update targets, other algorithms based on value function approximation can also be combined. The update target produced by the $n$th algorithm at time $t$ is denoted by $Y_t^{(n)}$.

The loss function remains the same as in DQN, and so does the parameter update equation:
$$L(\theta) = \left( \frac{1}{N} \sum_{n=1}^{N} Y_t^{(n)} - Q(s_t, a_t; \theta) \right)^2.$$
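As an illustrative sketch (the mean combination rule and the reuse of the helper functions from the earlier snippets are assumptions about the paper's exact scheme), the ensemble target could be computed like this.

```python
import numpy as np

def ensemble_target(reward, next_q_online, next_q_target, next_action, done, gamma=0.99):
    """Average the DQN, Sarsa, and double-DQN update targets into one value.

    Reuses dqn_target, sarsa_target, and double_dqn_target from the sketches
    above; averaging the three targets is an assumed combination rule.
    """
    targets = [
        dqn_target(reward, next_q_target, done, gamma),
        sarsa_target(reward, next_q_target, next_action, done, gamma),
        double_dqn_target(reward, next_q_online, next_q_target, done, gamma),
    ]
    return float(np.mean(targets))
```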

3.3. The Ensemble Network Architecture

The temporal and target values ensemble algorithm (TEDQN) is an integrated architecture of the value-based DRL algorithms. As shown in Sections 3.1 and 3.2, the ensemble network architecture has two parts to avoid divergence and improve performance.

The architecture of our ensemble algorithm is shown in Figure 1; the two parts are combined by the evaluation network.

The temporal ensemble stabilizes the training process by reducing the variance of the target approximation error [10]. Besides, the ensemble of target values reduces overestimation and improves performance by estimating more accurate Q-values. The temporal and target values ensemble algorithm is given in Algorithm 1.

Initialize action-value network $Q$ with random weights $\theta$
Initialize the target network buffer $\{\theta_1^{-}, \ldots, \theta_K^{-}\}$ and replay memory $D$
For episode $= 1, M$ do
For $t = 1, T$ do
With probability $\varepsilon$ select a random action $a_t$, otherwise select $a_t = \arg\max_a Q(s_t, a; \theta)$
Execute action $a_t$ in the environment and observe reward $r_t$
and next state $s_{t+1}$, and store transition $(s_t, a_t, r_t, s_{t+1})$ in $D$
Sample a random minibatch of transitions $(s_j, a_j, r_j, s_{j+1})$ from $D$
Set $Q^{TE}(s_{j+1}, a) = \sum_{k=1}^{K} w_k\, Q(s_{j+1}, a; \theta_k^{-})$
Ensemble Q-learner:
set $Y_j^{DQN} = r_j + \gamma \max_a Q^{TE}(s_{j+1}, a)$
set $Y_j^{Sarsa} = r_j + \gamma Q^{TE}(s_{j+1}, a_{j+1})$
set $Y_j^{DoubleQ} = r_j + \gamma Q^{TE}\left(s_{j+1}, \arg\max_a Q(s_{j+1}, a; \theta)\right)$
Set $Y_j^{E} = \frac{1}{N} \sum_{n} Y_j^{(n)}$ and perform a gradient descent step on $\left( Y_j^{E} - Q(s_j, a_j; \theta) \right)^2$ with respect to $\theta$
Every $\tau$ steps remove the oldest parameters from the target network buffer and add the current ones, $\theta_K^{-} \leftarrow \theta$
End for
End for

As the ensemble network architecture shares the same input-output interface with standard Q-networks and target networks, we can recycle all learning algorithms with Q-networks to train the ensemble architecture.
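To illustrate this interface compatibility (the class and method names are hypothetical, not from the paper), an ensemble wrapper could expose the same call signature as a single Q-network, so existing training loops need not change.

```python
import numpy as np

class EnsembleQNetwork:
    """Hypothetical wrapper exposing the same q_values/act interface as a
    single Q-network, so existing training code can be reused unchanged."""

    def __init__(self, target_snapshots, weights):
        self.target_snapshots = target_snapshots      # callables: state -> Q-value vector
        self.weights = np.asarray(weights, dtype=float)
        self.weights /= self.weights.sum()            # normalize the snapshot weights

    def q_values(self, state):
        # Weighted combination of the stored target networks' estimates.
        return sum(w * net(state) for w, net in zip(self.weights, self.target_snapshots))

    def act(self, state, epsilon=0.01):
        # Epsilon-greedy action selection on the ensemble estimate.
        q = self.q_values(state)
        if np.random.rand() < epsilon:
            return np.random.randint(len(q))
        return int(np.argmax(q))
```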

4. Experiments

4.1. Experimental Setup

So far, we have carried out our experiments on several classical control and Box2D environments in OpenAI Gym: CartPole-v0, MountainCar-v0, and LunarLander-v2 [15]. We use the same network architecture, learning algorithms, and hyperparameters for all these environments.

We trained the algorithms for 10,000 episodes and used the Adaptive Moment Estimation (Adam) algorithm to minimize the loss with learning rate μ = 0.00001 and a batch size of 32. A summary of the configuration is provided below. The target network was updated every 300 steps. The behavior policy during training was ε-greedy, with ε annealed linearly from 1 to 0.01 over the first five thousand steps and fixed at 0.01 thereafter. We used a replay memory of the ten thousand most recent transitions.
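For reference, the stated hyperparameters could be collected into a configuration sketch like the following (the key names are illustrative; only the values come from the text above).

```python
# Hyperparameters quoted in the text; dictionary key names are illustrative.
config = {
    "episodes": 10_000,
    "optimizer": "Adam",
    "learning_rate": 1e-5,
    "batch_size": 32,
    "target_update_steps": 300,
    "epsilon_start": 1.0,
    "epsilon_final": 0.01,
    "epsilon_anneal_steps": 5_000,
    "replay_memory_size": 10_000,
}
```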

We independently executed each method 10 times on every task. For each run, the learned policy was tested 100 times, without exploration noise or prior knowledge, every 100 training episodes to calculate the average scores. We report the mean and standard deviation of the convergence episodes and the scores of the best policy.

4.2. Results and Analysis

We consider three baseline algorithms that use a target network and value function approximation, namely, the version of the DQN algorithm from the Nature paper [8]; DSN, which reduces overestimation [17]; and DDQN, which substantially improved the state of the art by reducing the overestimation bias with double Q-learning [9].

Under this performance measure, it is clear that the ensemble network does substantially better than a single network. For comparison, we also show results for DQN, DSN, and DDQN. Figure 2 shows the improvement of the ensemble network over the baseline single networks of DQN, DSN, and DDQN. Again, we see that the improvements are often very dramatic.

The results in Table 1 show that the algorithms we present can successfully train neural network controllers on the classical control domains in OpenAI Gym. A detailed comparison shows that there are several tasks in which TEDQN greatly improves upon DQN, DSN, and DDQN. Noteworthy examples include CartPole-v0 (performance improved by 13.6%, 79.5%, and 7.8%, and variance reduced by 100%, 100%, and 100%), MountainCar-v0 (performance improved by 26.7%, 21.2%, and 24.8%, and variance reduced by 31.6%, 77.9%, and 8.4%), and LunarLander-v2 (performance improved by 28.3%, 32.8%, and 50.5%, and variance reduced by 19.2%, 46.4%, and 50.5%).

5. Conclusion

We introduced a new learning architecture, combining a temporal ensemble and an ensemble of target values for deep Q-learning algorithms, while sharing a common learning module. The new ensemble architecture, in combination with some algorithmic improvements, leads to dramatic improvements over existing approaches for deep RL on challenging classical control problems. In practice, this ensemble architecture makes it convenient to integrate RL methods based on approximate value functions.

Although the ensemble algorithms are superior to a single reinforcement learning algorithm, their computational complexity is higher. The experiments also show that the temporal ensemble makes the training process more stable, and the ensemble of a variety of algorithms makes the estimation of the Q-values more accurate. The combination of the two enables training to converge stably. This is because ensembles improve on independent algorithms most when the algorithms' predictions are less correlated, so that action selection based on the Q-network output can balance exploration and exploitation.

In fact, the independence of the ensemble's constituent algorithms and elements is very important for its performance. In future work, we want to analyze the role of each algorithm and each Q-network at different stages of training, so as to further enhance the performance of the ensemble algorithm.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.