1 Introduction

In the past decade, quantum information science has established itself as a fruitful research field that leverages quantum effects to enhance communication and information processing tasks (Nielsen and Chuang 2000; Bennett and DiVincenzo 1995). The results and insights gained inspired further investigations, which more recently contributed to the emergence of the field of quantum machine learning (Schuld et al. 2014; Biamonte et al. 2016; Dunjko and Briegel 2018). The aim of this new field is twofold. On the one hand, machine learning methods are developed to further our understanding and control of physical systems; on the other hand, quantum information processing is employed to enhance certain aspects of machine learning. A learning framework that features in both aspects of quantum machine learning is projective simulation (PS). In particular, PS can be seen as a platform for the design of autonomous (quantum) learning agents (Briegel and las Cuevas 2012).

The development of projective simulation is not motivated by the aim of designing ever-faster computer algorithms. Projective simulation is a tool for understanding various aspects of learning, where agents are viewed from the perspective of realizable entities such as robots or biological systems interacting with an unknown environment. In this embodied approach, the agent’s perception is influenced by its sensors, its actions are limited by its physical capabilities, and its memory is altered by its interaction with the environment. The deliberation process of the agent can be described as a random walk on the memory structure, and it is its quantum counterpart, the quantum random walk, that offers a direct route to the quantization of the deliberation and learning process. Thereby, PS not only allows us to study learning in the quantum domain, it also offers speedups in a variety of learning settings (Paparo et al. 2014; Sriarunothai et al. 2017).

Projective simulation can be used to solve reinforcement learning (RL) problems as well. Taken as a classical RL approach, PS has proven to be a successful tool for learning how to design quantum experiments (Melnikov et al. 2018). In Melnikov et al. (2018), PS was used to design experiments that generate high-dimensional multipartite entangled photonic states. The ability of PS to learn and adapt to an unknown environment was further used for optimizing and adapting quantum error correction codes (Nautrup et al. 2019). In a quite different context, PS has been used to model complex skill acquisition in robotics (Hangl et al. 2016, 2020).

Although PS has been shown to be suitable for a number of applications, it is fair to ask how well it performs compared with other models, or compared with theoretical optima. However, the empirical evaluation of a model through simulations and analytically proving the properties of the same model are fundamentally distinct matters. For example, in many applications, empirical convergence can be reached even if the conditions for theoretical convergence are not met. In any real-world application, such as learning to play the game of Go, convergence to optimal performance, even though it is theoretically feasible, is not reached due to the size of the state space, which for the game of Go consists of \(10^{170}\) states. This, however, is not a concern in practice, where the goal is to create a well-performing and fast algorithm without requiring full convergence or theoretical guarantees. In numerical investigations of various textbook problems, it was shown that PS demonstrates a competitive performance with respect to standard RL methods (Melnikov et al. 2017, 2018; Mautner et al. 2015; Makmal et al. 2016). In this work, we complement those results by comparing PS with other RL approaches from a theoretical perspective. Specifically, we analyze whether PS converges to an optimal solution. Other methods, like Q-learning and SARSA, have already been proven to converge in environments which are describable by Markov Decision Processes (MDPs) (Dayan and Sejnowski 1994; Singh et al. 2000; Jaakkola et al. 1994; Watkins and Dayan 1992). Note, however, that Q-learning and SARSA are methods equipped with update rules explicitly designed for such problems. PS, in contrast, was designed with a broader set of time-varying and partially observable learning environments in mind. For this reason, it is capable of solving tasks that a direct (naive) implementation of Q-learning and SARSA fails to learn, as the latter are designed to obtain a time-independent optimal policy (Watkins and Dayan 1992; Sutton and Barto 2018); examples can be found in Mautner et al. (2015). Thus, it would be unlikely for a PS agent to exactly realize the same optimality with respect to the discounted infinite-horizon reward figure of merit (for which Q-learning was designed) without any further adjustment to the model. Nonetheless, in this work, we analyze the properties of PS taken as a pure MDP-solving RL algorithm. We show that a slightly modified PS variant recovers the notion of state-action values as a function of its internal parameters, while preserving the main characteristics that make PS stand out from other RL algorithms, such as the locality of the update rules. As we show, this new variant is suitable for episodic MDPs, and we can prove convergence to the optimal strategy for a range of discount parameters. In the process, we connect the modified PS model with the basic PS model, which allows us to partially explain and understand the empirical performance and successes of PS reported in previous works.

This paper is organized as follows: In Section 2, we briefly recap the main concepts of RL theory concerning MDPs that will be used throughout the rest of this paper, before we present the PS model in Section 3. In Section 4, we begin by introducing the adaptation of PS needed for the convergence proof, followed by the convergence proof itself, which is based on a well-known theorem in stochastic approximation theory. In the Appendix of the paper, we provide a detailed exposition of RL methods which introduces the necessary concepts for the analysis, with a broader perspective on RL theory in mind. Additionally, after discussing multiple variants of the PS update rules and their implications, we present an extensive investigation of the similarities and differences between PS and standard RL methods.

2 Markov decision processes

2.1 Policy and discounted return

In the RL framework, an RL problem is a general concept that encompasses the learning of an agent through the interaction with an environment with the goal of maximizing some precisely defined figure of merit such as a reward function. In a discrete-time framework, the agent–environment interaction can be modeled as follows. At every time step t, the agent perceives the environmental state St. Then, the agent chooses an action At to execute upon the environment. The environment completes the cycle by signaling to the agent a new state St+1 and a reward Rt+1. The variables Rt, St, and At are, in general, random variables, where Rt can take values \(r_{t}\in \mathbb {R}\), while St and At take values sampled from the sets \(\mathcal {S}=\{s_{1},s_{2},\dots \}\) and \(\mathcal {A}=\{a_{1},a_{2},\dots \}\), respectively. For simplicity, we assume in the following that these two sets are finite and that rt is bounded for all time steps t.
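To make this interaction cycle concrete, the following minimal sketch shows the discrete-time loop in Python; the environment interface (reset/step) and the policy callable are hypothetical names chosen for illustration, not part of the PS literature.

def run_interaction(env, policy, num_steps=100):
    """Minimal agent-environment loop: at each cycle the agent observes S_t,
    emits A_t, and the environment returns S_{t+1} and R_{t+1}."""
    s = env.reset()                 # initial state S_0
    for t in range(num_steps):
        a = policy(s)               # agent samples A_t from pi(. | S_t)
        s_next, r = env.step(a)     # environment returns S_{t+1}, R_{t+1}
        # a learning agent would update its internal parameters here
        s = s_next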

A particularly important set of RL problems are those where the environment satisfies the Markovian property. These problems can be modeled by Markov Decision Processes (MDPs). In an MDP, the probabilities of transitions and rewards are given by the set of probabilities:

$$ p(s^{\prime},r \mid s,a)=: \Pr\{S_{t+1}=s^{\prime},R_{t+1}=r \mid S_{t}=s,A_{t}=a\}. $$
(1)

At every time step, the agent chooses an action as the result of some internal function that takes as input the current state of the environment. Thus, formally, an agent maps states into actions, which is captured by the so-called policy of the agent. Mathematically, the policy (at a certain time step t) can be defined as the set of probabilities:

$$ \pi(a\mid s)=: \Pr\{A_{t}=a\mid S_{t}=s\}. $$
(2)

The successive modification of these probabilities, π = πt, through the experience with the environment constitutes the learning that the agent undergoes in order to achieve a goal. In an MDP, the notion of goal can be formalized by introducing a new random variable:

$$ G_{t}(\gamma_{\text{dis}})=:\sum\limits_{k=0}^{\infty}\gamma_{\text{dis}}^{k}R_{t+k+1}, $$
(3)

called the discounted return, where γdis ∈ [0,1] is the discount parameter. The case with γdis = 1 is reserved for episodic tasks, where the agent–environment interaction naturally terminates at some finite time. The discounted return at some time step t consists of the sum of all rewards received after t, discounted by how far in the future they are received. The solution to the MDP is the policy that maximizes the expected return starting from any state s, called the optimal policy.
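As a small worked example of Eq. 3, the sketch below computes a truncated discounted return from a finite sequence of sampled rewards (exact for episodic tasks); the function name is ours.

def discounted_return(rewards, gamma_dis):
    """G_t = sum_k gamma_dis^k * R_{t+k+1}, truncated to the given reward sequence."""
    return sum(gamma_dis ** k * r for k, r in enumerate(rewards))

# Rewards R_{t+1}, R_{t+2}, R_{t+3} = 1, 0, 2 with gamma_dis = 0.5
# give G_t = 1 + 0 + 0.25 * 2 = 1.5
print(discounted_return([1, 0, 2], 0.5))  # 1.5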

A particular class of RL problems we will consider in this work is that of so-called episodic tasks. In these, the agent–environment interactions naturally break into episodes, e.g., an agent playing some card game, or trying to escape from a maze. Note that while in some episodic problems the objective could be to finish the episode with the fewest possible actions (e.g., escaping a maze), in general, the optimal solution is not necessarily related to ending the episode. A notion of episodic MDP can be easily incorporated into the theoretical formalism recalled above by including a set \(\mathcal {S}_{T}\subset \mathcal {S}\) of so-called terminal or absorbing states. These states are characterized by the fact that transitions from a terminal state lead back to the same state with unit probability and zero reward. In episodic MDPs, the goal of the agent is to maximize the expected discounted return per episode.

It should be noted that the concept of absorbing states is a theoretical construct introduced to include the concept of episodic and non-episodic MDPs into a single formalism. In a practical implementation, however, after reaching a terminal state, an agent would be reset to some initial state, which could be a predefined state or chosen at random for instance. While such a choice could have an impact on learning rates, it is irrelevant regarding the optimal policy. For this reason, in the following, we do not make any assumption about the choice of the initial states. We will assume, however, that the environment signals the finalization of the episode to the agent.

2.2 Value functions and optimal policy

The concept of an optimal policy is closely intertwined with that of value functions. The value vπ(s) of a state \(s\in \mathcal {S}\) under a certain policy π is defined as the expected return after state s is visited; i.e., it is the value:

$$ v_{\pi}(s)=:{\mathrm{E}}_{\pi} \left\{G_{t}\mid S_{t}=s\right\}. $$
(4)

It has been proven for finite MDPs that there exists at least one policy, called the optimal policy π∗, which maximizes vπ(s) over the space of policies for all s simultaneously, i.e.:

$$ v_{*}(s)=\max_{\pi}\left\{v_{\pi}(s)\right\}, \forall s\in\mathcal{S}, $$
(5)

where v∗ denotes the value function associated with the optimal policy.

Value functions can also be defined for state-action pairs. The so-called Q-value of a pair (s, a), for a certain policy π, is defined as the expected return received by the agent following the execution of action a while in state s, and sticking to the policy π afterwards. The Q-values of the optimal policy, or optimal Q-values, can be written in terms of the optimal state value functions as:

$$ q_{*}(s,a)=r(s,a)+\gamma_{\text{dis}} \mathrm{E}\left\{v_{*}(S_{t+1})|S_{t}=s, A_{t}=a\right\}, $$
(6)

where

$$ r(s,a)=\mathrm{E} \left\{R_{t+1}\mid S_{t}=s,A_{t}=a\right\}. $$
(7)

The relevance of Q-values is evidenced by noting that, given the set of all optimal q values, an optimal policy can be derived straightforwardly as:

$$ \pi^{*}(s)=\underset{a^{\prime}}{\text{arg\ max}} \left\{q_{*}(s,a^{\prime})\right\}. $$
(8)

(Note the notational difference in the arguments to distinguish between the stochastic policy (2), which returns a probability, and the deterministic policy (8), which returns an action.) For this reason, standard RL methods achieve an optimal policy in an indirect way, as a function of the internal parameters of the model, which are updated through learning and which in the limit converge to the optimal q values. We follow a similar procedure in Section 4, where we discuss the convergence of PS to the optimal policy of MDPs.
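As a small illustration of Eq. 8, a greedy policy can be read off from a table of (estimated) q values; the dictionary-based q-table below is purely illustrative.

def greedy_action(q_table, state, actions):
    """Return argmax_a q(state, a) for a q-table stored as a dict {(s, a): value}."""
    return max(actions, key=lambda a: q_table[(state, a)])

# Illustrative q-table for a single state 's1' with two actions
q = {('s1', 'left'): 0.4, ('s1', 'right'): 0.9}
assert greedy_action(q, 's1', ['left', 'right']) == 'right'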

2.3 Q-Learning and SARSA

Q-Learning and SARSA are two prominent algorithms that capture an essential idea of RL: online learning in an unknown environment. They are particularly designed to solve Markovian environments, and their prominence can in part be ascribed to the theoretical results that prove their convergence in MDPs. In both algorithms, learning is achieved by estimating the action value function qπ(s, a) for every state-action pair under a given policy π. This estimate is the Q-value assigned to each state-action pair. The update of the Q-value is given by:

$$ Q_{t+1}(s_{t},a_{t}) = (1-\alpha)Q_{t}(s_{t},a_{t})+\alpha \left( R_{t+1} +\gamma_{\text{dis}}\, f(Q_{t}(s_{t+1},a_{t+1})) \right). $$
(9)

The learning rate α describes how fast a new estimate of the Q-value overwrites the previous estimate. In SARSA, the function f is the identity, so that the Q-value is updated not only with the reward Rt+1 but also with the Q-value of the next state-action pair along the policy π. Thus, SARSA is an on-policy algorithm, as described in Appendix A. In Q-learning, on the other hand, the function \(f=\max \limits _{a_{t+1}}\) takes the maximal Q-value of the next state. Q-learning is an off-policy algorithm because the next action is sampled independently of the update of the Q-values.
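A minimal tabular sketch of the update in Eq. 9 for both choices of f; the function name, the dictionary-based q-table, and the default of 0 for unseen pairs are our own illustrative choices.

def td_update(Q, s, a, r, s_next, a_next, alpha, gamma_dis, actions, q_learning=True):
    """One application of Eq. 9 to a q-table stored as a dict {(s, a): value}.
    q_learning=True  -> f is the max over actions in s_next (off-policy).
    q_learning=False -> f is the identity, using the action actually taken (SARSA)."""
    if q_learning:
        target = max(Q.get((s_next, b), 0.0) for b in actions)
    else:
        target = Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma_dis * target)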

3 Projective simulation

Projective simulation (PS) is a physically inspired framework for artificial intelligence introduced in Briegel and las Cuevas (2012). The core of the model is a particular kind of memory called episodic and compositional memory (ECM) composed of a stochastic network of interconnected units, called clips (cf. Fig. 2 in Briegel and las Cuevas (2012)). Clips represent either percepts or actions experienced in the past, or in more general versions of the model, combinations of those. The architecture of ECM, representing deliberation as a random walk in a network of clips, together with the possibility of combining clips and thus creating structures within the network, allows for modeling incipient forms of creativity (Briegel 2012; Hangl et al. 2020). Additionally, the deliberation process leading from percepts to actions has a physical interpretation in PS. Visiting any environmental state activates a corresponding percept clip in the ECM. This activation can be interpreted as an excitation, which then propagates stochastically through the network in the form of a random walk. The underlying dynamics have the potential to be implementable by real physical processes, thus relating the model to embodied agents including systems which exploit quantum effects, as has been explored in Dunjko et al. (2016) and Clausen and Briegel (2018).

PS can be used as an RL approach, where the action, the percept, and the reward are used to update the ECM structure. In general, the PS framework makes it possible to leverage complex graph structures to enhance learning. For example, generalization can be implemented through manipulation of the ECM topology, so that the RL agent becomes capable of learning in scenarios it would otherwise fail to learn (Melnikov et al. 2017). However, this generalization mechanism is not necessary for solving MDP environments.

Before we discuss the ECM for solving MDPs in detail, we need to emphasize the difference between the state of the environment and the percept the agent receives. In an algorithm specifically designed to solve MDPs, the state contains sufficient information about the environment such that the corresponding transition function fulfills the Markov property. We will refer to this type of state as a Markov state. This assumption on the state space generally cannot be made in realistic learning scenarios, but the formalism can be generalized to partially observable MDPs, where the Markovian dynamics are hidden. In a partially observable environment, the input of the learning algorithm is an observation that is linked to a Markov state via a probability distribution that is unknown from the perspective of the algorithm.

A percept, as introduced in the PS model, further generalizes the concept of such an observation. In contrast to the observation in a partially observable MDP, a percept does not necessarily have to be connected to an underlying Markov state. This distinction might not seem necessary for learning in a classical environment, but it plays a significant role when one considers quantum systems that cannot be described with hidden variable models. In this work, since we focus on MDPs, we will equate the percept an agent receives with the state of the MDP. In the following, both are denoted by s. Furthermore, we will not emphasize the difference between a percept and its corresponding percept clip, assuming there is a one-to-one correspondence between them. The same holds for actions and their corresponding action clips.

The ECM structure used to solve MDPs consists of one layer of percept clips that is fully connected with a layer of action clips. Each edge represents a state-action pair (s, a) which is assigned a real-valued weight (or hopping value) h = h(s, a) and a non-negative glow value g = g(s, a). While the weight h determines the probability of a transition between a percept clip and an action clip, the glow variable g measures how ‘susceptible’ this weight h is to future rewards from the environment. In Eq. 10, heq is an (arbitrarily given) equilibrium value, and λt+1 is the reward received immediately after action at, in accordance with the time-indexing conventions in Sutton and Barto (2018), as shown in Fig. 1.

Fig. 1

Transition from time step t to t + 1 (t = 0,1,2,…) via the agent’s decision at, where s and λ denote the environment state and reward (λ0 = 0), respectively (adapted from Sutton and Barto (2018))

A variety of different update rules are discussed in Appendix D and compared with other RL methods in Appendix E. In the following, we will focus on the standard update used in Briegel and las Cuevas (2012), Mautner et al. (2015), and Melnikov et al. (2018). The update rules for the h-value and the glow value are given by:

$$ h_{t+1}(s,a)=h_{t}(s,a)-\gamma\left(h_{t}(s,a)-h^{\text{eq}}\right)+g_{t}(s,a)\,\lambda_{t+1} $$
(10)
$$ g_{t}(s,a)= \left(1-\delta_{(s,a),(s_{t},a_{t})}\right)(1-\eta)\,g_{t-1}(s,a)+\delta_{(s,a),(s_{t},a_{t})} $$
(11)

The update of the h-value consists, in the language used in the context of master equations, of a gain and a loss term. The parameter of the loss term is called the damping parameter and is denoted by γ ∈ [0,1]. The parameter of the gain term is called the glow parameter and is denoted by η ∈ [0,1]. In particular, η = 1 recovers the original PS as introduced in Briegel and las Cuevas (2012). Finally, \(\delta_{t}:=\delta _{s,s_{t}}\delta _{a,a_{t}}=\delta _{(s,a),(s_{t},a_{t})}\) denotes the Kronecker delta, which equals 1 if the respective (s, a)-pair is visited at cycle t and 0 otherwise. The agent’s policy is defined as the set of all conditional probabilities (i.e., transition probabilities in the ECM clip network):

$$ p_{ij}=p(a_{j}|s_{i})=\frac{{\varPi}({h}_{ij})}{\kappa_{i}},\quad \kappa_{i}=\sum\limits_{j}{\varPi}({h}_{ij}), $$
(12)

of selecting action aj when in state si, and is here described in terms of some given function Π. Examples of Π which have been used or discussed in the context of PS are an identity function (Briegel and las Cuevas 2012; Mautner et al. 2015):

$$ {\varPi}(x)=x, $$
(13)

if x is non-negative, and an exponential function leading to the well-known softmax policy (Melnikov et al. 2018) if a normalization factor is added:

$$ {\varPi}(x)=\mathrm{e}^{\beta{x}}, $$
(14)

where β ≥ 0 is a real-valued parameter.
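To make the update rules of Eqs. 10–11 and the policy of Eqs. 12 and 14 concrete, here is a minimal sketch of a basic PS agent with the softmax choice of Π. The class structure, parameter defaults, and dictionary-based storage are our own illustrative choices, not a prescribed implementation.

import math
import random
from collections import defaultdict

class BasicPS:
    """Two-layer ECM: h- and glow values on percept-action edges (Eqs. 10-11),
    with the softmax choice Pi(x) = exp(beta * x) for the policy (Eqs. 12, 14)."""

    def __init__(self, actions, gamma=0.0, eta=1.0, beta=1.0, h_eq=1.0):
        self.actions = actions
        self.gamma, self.eta, self.beta, self.h_eq = gamma, eta, beta, h_eq
        self.h = defaultdict(lambda: h_eq)   # hopping values h(s, a)
        self.g = defaultdict(float)          # glow values g(s, a)

    def policy(self, s):
        """Transition probabilities p(a | s) of Eq. 12 with Pi of Eq. 14."""
        weights = [math.exp(self.beta * self.h[(s, a)]) for a in self.actions]
        kappa = sum(weights)
        return [w / kappa for w in weights]

    def act(self, s):
        return random.choices(self.actions, weights=self.policy(s))[0]

    def update(self, s, a, reward):
        """Apply Eqs. 10-11 after action a was taken in state s and the
        subsequent reward lambda_{t+1} was received."""
        for edge in list(self.g):            # glow decays on all edges ...
            self.g[edge] *= (1.0 - self.eta)
        self.g[(s, a)] = 1.0                 # ... and is set to 1 on the visited edge
        for edge in list(self.h):            # damping toward h_eq
            self.h[edge] -= self.gamma * (self.h[edge] - self.h_eq)
        for edge, glow in self.g.items():    # glow-weighted, rewarded gain term
            self.h[edge] += glow * reward

Setting eta = 1 in this sketch rewards only the most recently visited edge at each cycle, matching the remark above that η = 1 recovers the original PS.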

4 Convergence of PS in episodic MDPs

In previous works, it has been shown numerically that the basic version of a PS agent is capable of learning the optimal strategy in a variety of textbook RL problems. The PS model with standard update rules, however, does not necessarily converge in all MDP settings. This version of the PS is thoroughly analyzed in Appendix D and Appendix E. As recalled in Section 2, in MDPs, optimality can be defined in terms of the optimal policy. In this section, we present a modified version of the PS that has been designed exclusively to tackle this kind of problem. We consider arbitrary episodic MDPs and derive an analytical proof of convergence. In this version, the policy function depends on the normalized \(\tilde {h}\) values, which, as we show later, behave similarly to state-action values and, in fact, in episodic MDPs, converge to the optimal q values for a range of discount parameters.

4.1 Projective simulation for solving MDPs

In the following, we introduce a new variant of PS aimed at solving episodic MDPs. In those problems, there is a well-defined notion of optimality, given by the optimal policy. As described above, the basic PS constitutes a direct policy method (see also Appendix A). Finding the optimal policy of an MDP by direct policy exploration seems a rather difficult task. However, as other methods demonstrate, finding the optimal q values can be done with relatively simple algorithms, and the optimal policy can be derived from the q values in a straightforward manner. Motivated by this, we add a new local variable to the ECM network in order to recover a notion of state-action values while maintaining the locality of the model.

For this version, we consider “first-visit” glow, defined as follows. The glow of any given edge is set to 1 whenever that edge is visited for the first time during an episode; in any other circumstance, it is damped by a factor (1 − η), even if the same edge is visited again during the same episode. In addition, the entire glow matrix is reset to zero at the end of an episode. We thus write the updates as:

$$ \begin{array}{@{}rcl@{}} {h_{t+1}(s,a)}&=&{h_{t}(s,a)}+\lambda_{t+1}{g_{t}(s,a)} \end{array} $$
(15)
$$ \begin{array}{@{}rcl@{}} {g_{t}(s,a)}&=&(1-\eta){g_{t-1}(s,a)}+\delta_{(s,a),(s,a)_{\text{first-visit}}} \end{array} $$
(16)
$$ \begin{array}{@{}rcl@{}} N_{t+1}(s,a)&=& N_{t}(s,a)+\delta_{(s,a),(s,a)_{\text{first-visit}}} \end{array} $$
(17)

Here, the update for h is the same as in Eq. 10, but given that the MDPs are time-independent environments, γ has been set to 0. We add a matrix N to the standard PS, which counts the number of episodes during which each entry of h has been updated. The idea behind these updates is that the ratios:

$$ \tilde{h}_{t}(s,a)=:\frac{h_{t}(s,a)}{N_{t}(s,a)+1} $$
(18)

resemble state-action values. To gain some intuition about this, note that h-values associated to visited edges will accumulate during a single episode a sum of rewards of the form:

$$ \lambda_{t}+(1-\eta) \lambda_{t+1}+(1-\eta)^{2} \lambda_{t+2} +\dots, $$
(19)

which gets truncated at the time step the episode ends. Hence, the normalized \(\tilde {h}\) values become averages of sampled discounted rewards (see Appendix E.1). Later, we show that, paired with the right policy and glow coefficient, the \(\tilde {h}\) values converge to the optimal q values.
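A minimal sketch of the episode-level bookkeeping behind Eqs. 15–18; the class structure and dictionary-based storage are our own illustrative choices, and the policy built on top of the normalized values is discussed below.

from collections import defaultdict

class FirstVisitGlowPS:
    """PS variant of Section 4.1: h-values, first-visit glow g, and episode
    counter N (Eqs. 15-17), with normalized values h_tilde = h / (N + 1) (Eq. 18)."""

    def __init__(self, eta):
        self.eta = eta
        self.h = defaultdict(float)   # h(s, a)
        self.g = defaultdict(float)   # glow g(s, a)
        self.N = defaultdict(int)     # number of episodes in which (s, a) was updated
        self.visited = set()          # edges already visited in the current episode

    def step(self, s, a, reward):
        """One cycle: action a taken in state s, reward received afterwards."""
        for edge in list(self.g):            # all glow values decay by (1 - eta) ...
            self.g[edge] *= (1.0 - self.eta)
        if (s, a) not in self.visited:       # ... except on a first visit, where g -> 1
            self.g[(s, a)] = 1.0
            self.visited.add((s, a))
            self.N[(s, a)] += 1              # Eq. 17
        for edge, glow in self.g.items():
            self.h[edge] += glow * reward    # Eq. 15 with gamma = 0

    def end_episode(self):
        self.g.clear()                       # glow matrix reset between episodes
        self.visited.clear()

    def h_tilde(self, s, a):
        return self.h[(s, a)] / (self.N[(s, a)] + 1)   # Eq. 18

In a full run, step would be called once per cycle with the reward that follows the action, and end_episode once the environment signals termination.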

Instead of considering a policy function of the h-values as in Eq. 12, here we will consider a policy function given by

$$ p_{i,j} = \frac{{\varPi} (\tilde{h}_{i,j})}{c_{i}},\quad c_{i}=\sum\limits_{j} {\varPi} (\tilde{h}_{i,j} ), $$
(20)

for a certain function Π(⋅). Given that the \(\tilde {h}\)-values do not, in general, diverge with time (in fact, they are bounded in the case of bounded rewards), a linear function as in Eq. 13 would fail to converge to a deterministic policy. A solution is to use a softmax function as in Eq. 14, where the free coefficient β is made time dependent. By letting β diverge with time, the policy can become deterministic in the limit.

Similarly to Monte Carlo methods, which may be equipped with a variety of update rules, giving rise to first-visit or many-visit Monte Carlo methods, the choice of the glow update rule is to some extent arbitrary but may depend on the physical implementation of PS and the ECM. For example, one could instead use the accumulating glow update, given in Eq. 53. In that case, one simply needs to change the update rule of N, given in Eq. 17, in such a way that every visit of the edge is counted, instead of only first visits. Intuitively, both pairs of update rules have similar effects, in the sense that in both cases \(\tilde {h}(s,a)\) equals an average of sampled discounted returns starting from the time a state-action pair (s, a) was visited. However, while we were able to prove convergence for first-visit glow, that is not the case for accumulating glow. Therefore, when referring to this version of PS in the following, we assume the update rules given by Eqs. 15–17.

4.2 Convergence to the optimal policy

The convergence of \(\tilde {h}\) values to q values can be proven by a standard approach used in the literature to prove, for example, the convergence of RL methods like Q-learning and SARSA, or of prediction methods like TD(λ). In the remainder of the paper, we will use the boldface notation e and the explicit notation (s, a) interchangeably to denote a state-action pair, whenever convenience dictates. Denoting by \(\tilde {h}_{m} (\boldsymbol {e})\) the \(\tilde {h}\)-value of edge e at the end of episode m, we define the family of random variables \({\varDelta }_{m} (\boldsymbol {e}) =: \tilde {h}_{m} (\boldsymbol {e}) - q_{*}(\boldsymbol {e})\). We want to show that, in the limit of large m, Δm(e) converges to zero for all e. Moreover, it is desirable that such convergence occurs in a strong sense, i.e., with probability 1. We show this by following the standard approach of constructing an update rule for Δm(e) that satisfies the conditions of the following theorem.

Theorem 1

A random iterative process \({\varDelta }_{m+1} (\boldsymbol {x}) =\left [1-\alpha _{m} (\boldsymbol {x})\right ]{\varDelta }_{m} (\boldsymbol {x}) +\alpha _{m} (\boldsymbol {x}) F_{m} (\boldsymbol {x})\), \(\boldsymbol{x}\in X\), converges to zero with probability one (w.p.1) if the following properties hold:

  1. the set of possible states X is finite.

  2. 0 ≤ αm(x) ≤ 1, \({\sum }_{m} \alpha _{m} (\boldsymbol {x}) = \infty \), \({\sum }_{m} {\alpha _{m}^{2}} (\boldsymbol {x}) < \infty \) w.p.1, where the probability is over the learning rates αm(x).

  3. \({\left \|\mathrm{E}\{F_{m}(\cdot )\mid P_{m}\}\right \|}_{\mathrm{W}}\leq \kappa {\left \|{\varDelta }_{m} (\cdot )\right \|}_{\mathrm{W}} + c_{m}\), where κ ∈ [0,1) and cm converges to zero w.p.1.

  4. \(\text {Var}\{F_{m}(\boldsymbol {x})\mid P_{m}\}\leq K\left (1 + {\left \|{\varDelta }_{m} (\cdot )\right \|}_{\mathrm{W}}\right )^{2}\), where K is some constant.

Here Pm is the past of the process at step m, and the notation ∥⋅∥W denotes some fixed weighted maximum norm.

In addition to Δm(e) meeting the conditions of the theorem, the policy function must also satisfy two specific requirements. First, it must be greedy with respect to the \(\tilde {h}\)-values (at least in the limit of m going to infinity). In that way, provided that the \(\tilde {h}\)-values converge to the optimal q values, the policy automatically becomes an optimal policy. Additionally, to guarantee that all Δm keep being periodically updated, the policy must guarantee infinite exploration. A policy that satisfies these two properties is called GLIE (Singh et al. 2000), standing for Greedy in the Limit with Infinite Exploration. Adapting the results from Singh et al. (2000) to PS and episodic environments, we can show (see Appendix G) that a softmax policy function defined by:

$$ \pi_{m}(a|s,\boldsymbol{\tilde{h}}_{m})=\frac{\exp\left[\beta_{m}\tilde{h}_{m}(s,a)\right]}{{\sum}_{a^{\prime}\in\mathcal{A}}\exp\left[\beta_{m}\tilde{h}_{m}(s,a^{\prime}) \right]} $$
(21)

is GLIE, provided that \(\beta _{m}\rightarrow _{m\rightarrow \infty }\infty \) and \(\beta _{m}\leq C \ln (m)\), where C is a constant depending on η and \(\left |\mathcal {S}\right |\). While the first condition on βm guarantees that the policy is greedy in the limit, the second one guarantees that the agent keeps exploring all state-action pairs infinitely often. In this particular example, we have considered βm to depend exclusively on the episode index. By doing so, the policy remains local, because βm can be updated using exclusively the signal from the environment indicating the finalization of the episode. Note, however, that the choice of the policy function, as long as it is GLIE, has no impact on the convergence proof. We are now in a position to state our main result on the convergence of PS agents in the form of the following theorem.

Theorem 2

For any finite episodic MDP with a discount factor of γdis, the policy resulting from the new updates converges with probability one to the optimal policy, provided that:

  1. Rewards are bounded,

  2. 0 ≤ γdis ≤ 1/3, where γdis = 1 − η,

  3. The policy is a GLIE function of the \(\tilde {h}\)-values.
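As a concrete illustration of a GLIE policy (Condition 3), the sketch below samples an action from the softmax of Eq. 21 with the logarithmic schedule βm = C ln(m + 1). The function name, the offset by 1 (which avoids an undefined value at m = 0), and the dictionary h_tilde_row (mapping each action to its \(\tilde {h}(s,a)\) value) are our own choices; whether a given constant C satisfies the bound βm ≤ C ln(m) of the GLIE argument depends on η and \(\left |\mathcal {S}\right |\).

import math
import random

def glie_softmax_action(h_tilde_row, actions, episode_index, C=1.0):
    """Sample an action from the softmax of Eq. 21 with beta_m = C * ln(m + 1):
    beta_m diverges, but only logarithmically in the episode index m."""
    beta_m = C * math.log(episode_index + 1)
    weights = [math.exp(beta_m * h_tilde_row[a]) for a in actions]
    total = sum(weights)
    return random.choices(actions, weights=[w / total for w in weights])[0]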

Note that we have restricted the range of values γdis can take. The reason is related to the way the h-values are updated in PS. In Q-learning and SARSA, where the γdis parameter of the MDP is directly included in the algorithm, every time an action is taken, its corresponding Q-value is updated by the sum of a single reward and a discounted bootstrapping term. Given that the PS updates do not use bootstrapping, that term is “replaced” by a discounted sum of rewards. Due to this difference, the contraction property (Condition 3 in Theorem 1) is not so straightforward to prove, forcing us to consider smaller values of γdis. However, this condition on the γdis parameter is not a fundamental restriction of the PS model, but merely a result of how convergence is proven in this work.

4.3 Environments without terminal states

In Theorem 2, we have considered exclusively episodic MDPs. However, it is still possible for these environments to have an optimal policy which does not drive the agent to any terminal states. This observation suggests that the scope of problems solvable by PS can be further extended to a subset of non-episodic MDPs.

Given any non-episodic MDP, one can construct an episodic MDP from it by adding a single terminal state sT and a single transition leading to it with non-zero probability, i.e., by defining \(p_{T}=\Pr (s_{T}|s,a)\neq 0\) for some arbitrary pair (s, a). Thus, while the original non-episodic MDP falls outside the scope of Theorem 2, PS can be applied to the constructed episodic MDP instead. In general, however, these two problems might have different solutions, i.e., different optimal policies. Nevertheless, given that both the pair (s, a) for which pT≠ 0 and the value of pT are arbitrary, by choosing them properly, the difference between the two optimal policies can become negligible or non-existent. This can be done easily given some useful knowledge about the solution of the original MDP. Consider for instance a grid world where multiple rewards are placed randomly around some central area of grid cells. Even without knowing the exact optimal policy, one can correctly guess that there will be an optimal cyclic path about the center of the world yielding the maximum expected discounted return. Hence, adding a terminal state in some remote corner of the world would very likely leave the optimal policy unchanged.
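As an illustration of this construction, the sketch below takes a transition kernel p(s′, r | s, a), stored as a dictionary of outcome triples, and redirects one chosen pair (s, a) to a terminal state with probability pT; all names and the data structure are our own.

def add_terminal_state(p, chosen_pair, p_T, terminal='s_T'):
    """Return a new kernel in which the chosen (s, a) pair leads to the terminal
    state with probability p_T, the remaining outcomes rescaled by (1 - p_T).
    p maps (s, a) to a list of (next_state, reward, probability) triples."""
    new_p = dict(p)
    rescaled = [(s2, r, (1 - p_T) * prob) for (s2, r, prob) in p[chosen_pair]]
    new_p[chosen_pair] = rescaled + [(terminal, 0.0, p_T)]
    return new_p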

4.4 Proof of Theorem 2

In this section, we discuss the core of the proof of Theorem 2, leaving the most involved calculations for Appendix F. Given that the policy is a greedy-in-the-limit function of the \(\tilde {h}_{m}\) values, the proof of Theorem 2 follows if we show that:

$$ {\varDelta}_{m} (\boldsymbol{e}) =: \tilde{h}_{m} (\boldsymbol{e}) -q_{*} (\boldsymbol{e}) $$
(22)

converges to 0 with probability 1. In order to do so, we show that Δm(e) obeys an update rule of the form given in Theorem 1 and that the four conditions of the theorem are satisfied.

We begin by deriving an update rule for the h-values between episodes. In the case where an edge e is not visited during the m-th episode, its corresponding h-value is left unchanged, i.e., hm(e) = hm−1(e). Otherwise, due to the decay of the glow during the episode, the h(e)-value accumulates during the m-th episode a discounted sum of rewards given by:

$$ D_{m} (\boldsymbol{e}) =\sum\limits_{t=t_{m}(\boldsymbol{e})}^{T_{m}} \bar{\eta}^{t-t_{m}(\boldsymbol{e})} \lambda_{t}, $$
(23)

where tm(e) and Tm are the times at which the first visit to e during episode m occurred and at which the episode finished, respectively, and \(\bar {\eta }=1-\eta \). Therefore, in general, hm(e) = hm−1(e) + χm(e)Dm(e), where χm(e) is given by

$$ \chi_{m}(\boldsymbol{e})= \begin{cases} 1\quad\text{ if } \boldsymbol{e} \text{ is visited during the } m \text{-th episode,}\\ 0\quad\text{otherwise.}\\ \end{cases} $$
(24)

We denote, respectively, by Nm(e) and \(\tilde {h}_{m} (\boldsymbol {e})\) the N-value and \(\tilde {h}\)-value associated to edge e at the end of episode m. Thus, we have that \( \tilde {h}_{m} (\boldsymbol {e})= h_{m}(\boldsymbol {e})/[N_{m}(\boldsymbol {e})+1]\) and it obeys an update rule of the form:

$$ \tilde{h}_{m} (\boldsymbol{e}) =\frac{\left[N_{m-1}(\boldsymbol{e})+1\right] \tilde{h}_{m-1} (\boldsymbol{e}) + \chi_{m}(\boldsymbol{e})\, D_{m} (\boldsymbol{e})}{N_{m}(\boldsymbol{e})+1}. $$
(25)

Noting that the variables Nm(e) can be written in terms of χm(e) as the sum \(N_{m}(\boldsymbol {e})={\sum }_{j=1}^{m}\chi _{j} (\boldsymbol {e})\), it follows from Eq. 25 that the variable Δm(e) given in Eq. 22 satisfies the recursive relation:

$$ {\varDelta}_{m} (\boldsymbol{e}) =\left[1-\alpha_{m} (\boldsymbol{e})\right]{\varDelta}_{m-1} (\boldsymbol{e}) +\alpha_{m} (\boldsymbol{e}) F_{m} (\boldsymbol{e}), $$
(26)

where the ratios

$$ \alpha_{m} (\boldsymbol{e})=:\frac{\chi_{m} (\boldsymbol{e})} {N_{m} (\boldsymbol{e})+1}, $$
(27)

play the role of learning rates, and Fm(e) is defined as:

$$ F_{m} (\boldsymbol{e})=: \chi_{m} (\boldsymbol{e}) \left( D_{m}(\boldsymbol{e}) -q_{*}(\boldsymbol{e}) \right). $$
(28)
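For completeness, here is the short piece of algebra connecting Eqs. 25–28 to Eq. 26; this intermediate step is our own spelling-out of the derivation referenced in the text. Using Nm(e) = Nm−1(e) + χm(e), Eq. 25 can be rewritten as:

$$ \begin{array}{@{}rcl@{}} \tilde{h}_{m} (\boldsymbol{e}) &=& \frac{N_{m-1}(\boldsymbol{e})+1}{N_{m}(\boldsymbol{e})+1}\, \tilde{h}_{m-1} (\boldsymbol{e}) + \frac{\chi_{m}(\boldsymbol{e})}{N_{m}(\boldsymbol{e})+1}\, D_{m} (\boldsymbol{e}) \\ &=& \left[1-\alpha_{m} (\boldsymbol{e})\right] \tilde{h}_{m-1} (\boldsymbol{e}) + \alpha_{m} (\boldsymbol{e})\, D_{m} (\boldsymbol{e}), \end{array} $$

since \(\frac{N_{m-1}(\boldsymbol{e})+1}{N_{m}(\boldsymbol{e})+1}=1-\alpha_{m}(\boldsymbol{e})\) and \(\frac{\chi_{m}(\boldsymbol{e})}{N_{m}(\boldsymbol{e})+1}=\alpha_{m}(\boldsymbol{e})\). Subtracting \(q_{*}(\boldsymbol{e})\) from both sides and using \(\alpha_{m}(\boldsymbol{e})\chi_{m}(\boldsymbol{e})=\alpha_{m}(\boldsymbol{e})\) then yields the recursion of Eq. 26; when χm(e) = 0, both sides simply reduce to Δm(e) = Δm−1(e).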

The update rule in Eq. 26 is exactly of the form given in Theorem 1. Therefore, we are left with showing that αm(e) satisfies Condition 2 in Theorem 1, and that Fm(e) satisfies Conditions 3 and 4. Below, we describe the general procedure to prove this, while most of the details can be found in Appendix F.

The fact that αm(e) satisfies Condition 2 in Theorem 1 follows from noting that \({\sum }_{m} \alpha _{m} (\boldsymbol {e})={\sum }_{n} 1/n\) and \({\sum }_{m} {\alpha ^{2}_{m}}(\boldsymbol {e})={\sum }_{n} 1/n^{2}\), which are, respectively, a divergent and a convergent series. Regarding Condition 3, note that by tweaking the free glow parameter in such a way that \(\bar {\eta } = \gamma _{\text {dis}}\), the variable Dm(e) becomes a truncated sample of the discounted return G(e, γdis) given in Eq. 3. Thus, the \(\tilde {h}\) values undergo an update similar to that of SARSA, with the difference that an actual sample of rewards is used instead of a bootstrapping term. Due to these similarities, we can use the same techniques as in the convergence proofs of other RL methods (Jaakkola et al. 1994; Singh et al. 2000) and show that:

$$ {\|\mathrm{E}\{F_{m}(\cdot)|P_{m}\}\|}_{\mathrm{W}}\leq f(\gamma_{\text{dis}}) {\|{\varDelta}_{m} (\cdot)\|}_{\mathrm{W}}+c_{m}, $$
(29)

where cm converges to 0 w.p.1 and \(f(\gamma _{\text {dis}})=\frac {2\gamma _{\text {dis}}}{1-\gamma _{\text {dis}}} \). This satisfies Condition 3 in Theorem 1 as long as f(γdis) < 1, which occurs for γdis < 1/3 (since 2γdis/(1 − γdis) < 1 is equivalent to 3γdis < 1).

Finally, Condition 4 in Theorem 1 follows from the fact that rewards are bounded. This implies that \(\tilde {h}\)-values and, in turn, the variance of Fm(e) are bounded as well. This concludes the proof of Theorem 2.

5 Conclusion

In this work, we studied the convergence of a variant of PS applied to episodic MDPs. Given that MDPs have a clear definition of a goal, characterized by the optimal policy, we took the approach of adapting the PS model to deal with this kind of problem specifically. The first-visit glow version of PS presented in this work internally recovers a certain notion of state-action values, while preserving the locality of the parameter updates, which is crucial for guaranteeing a physical implementation of the model by simple means. We have shown that with this model a PS agent achieves optimal behavior in episodic MDPs for a range of discount parameters. This proof and the theoretical analysis of the PS update rules shed light on how PS, or, more precisely, its policy, behaves in a general RL problem.

The PS updates, which alter the h-values asynchronously at every time step, pose a particular challenge for proving convergence. To deal with this, we analyzed the subsequence of internal parameters at the times when episodes end, thus recovering a synchronous update. We could then apply techniques from stochastic approximation theory to prove that the internal parameters of PS converge to the optimal q values, similarly to the convergence proofs of other RL methods.

We have also chosen a specific glow update rule, which we have called first-visit glow. While other glow updates, like accumulating or replacing glow, show the same behavior at an intuitive level, proving convergence with those updates turns out to be more cumbersome. Therefore, from a practical point of view, several glow mechanisms could potentially be utilized, but convergence in the limit is, at the moment, only guaranteed for first-visit glow.

Although only episodic MDPs fall within the scope of our theorem, no constraints are imposed on the nature of the optimal policy. Hence, episodic problems where the optimal policy completely avoids terminal states (i.e., where the probability that an agent following that policy reaches a terminal state is strictly zero) can also be considered. Furthermore, the agent can be equipped with any policy, as long as the GLIE condition is satisfied. In this paper, we provided a particular example of a GLIE policy function, in the form of a softmax function with a global parameter that depends exclusively on the episode index. In this particular case, the policy is compatible with local updates, in the sense that the probabilities of taking an action given a state can be computed locally.