Improved bound on the worst case complexity of Policy Iteration

https://doi.org/10.1016/j.orl.2016.01.010

Abstract

Solving Markov Decision Processes is a recurrent task in engineering which can be performed efficiently in practice using the Policy Iteration algorithm. Regarding its complexity, both lower and upper bounds are known to be exponential (but far apart) in the size of the problem. In this work, we provide the first improvement over the now standard upper bound from Mansour and Singh (1999). We also show that this bound is tight for a natural relaxation of the problem.

Introduction

Markov Decision Processes (MDPs) have been found to be a powerful modeling tool for the decision problems that arise daily in various domains of engineering such as control, finance, queuing systems, PageRank optimization, and many more (see  [23] for a more exhaustive list).

An MDP is described by a set of n states in which a system can be. When the system is in a state, its controller must choose one of the actions available in that state; each action induces a reward and moves the system to another state according to given transition probabilities. In this work, we assume that the number of actions per state is bounded by a constant k. A policy refers to the stationary choice of one action in every state. Choosing a policy fixes the dynamics, which then corresponds to a Markov chain. Given any policy (there are at most k^n of them), we can associate a value to each state of the MDP, namely the infinite-horizon expected reward of an agent starting in that state. By solving an MDP, we mean providing an optimal policy, one that maximizes the value of every state. Depending on the application, a total-, discounted- or average-reward criterion may be best suited to define the value function. In every case, an optimal policy always exists. See, e.g., [20] for an in-depth study of MDPs.
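
As a concrete illustration, the following minimal sketch encodes such an MDP with NumPy arrays and computes the value of a fixed policy under the discounted-reward criterion (one of the three criteria above); the names `P`, `R`, `policy_value` and the discount factor `gamma` are illustrative choices, not notation from the paper.

```python
import numpy as np

def policy_value(P, R, policy, gamma=0.9):
    """Value of a stationary policy under the discounted-reward criterion.

    P[a] is the n x n transition matrix of action a, R[a] the length-n reward
    vector of action a, and policy[s] the action chosen in state s.  Fixing the
    policy fixes a Markov chain (P_pi, r_pi), and the value vector solves the
    linear system  v = r_pi + gamma * P_pi v.
    """
    n = len(policy)
    P_pi = np.array([P[policy[s]][s] for s in range(n)])  # rows of the chosen actions
    r_pi = np.array([R[policy[s]][s] for s in range(n)])  # rewards of the chosen actions
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```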

A practically efficient way of finding the optimal policy of an MDP is to use Policy Iteration (PI). Starting from an initial policy π_0 (i = 0), this simple iterative scheme repeatedly computes the value of π_i at every state and greedily modifies the policy using this evaluation to obtain the next iterate π_{i+1}. The modification always ensures that the value of π_{i+1} improves on that of π_i at every state. The process is repeated until convergence to the optimal policy π* in a finite number of steps (at most k^n, the number of policies). We refer to the ordered set of explored policies as the PI-sequence. A more precise statement of the algorithm, as well as some important properties, is given in Section 2.
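
A minimal sketch of this scheme is given below, continuing the encoding above (and reusing `policy_value`). It uses the greedy rule that simultaneously switches every state where a one-step improvement exists; all names are again illustrative.

```python
def policy_iteration(P, R, gamma=0.9, tol=1e-12):
    """Greedy Policy Iteration: evaluate the current policy, then switch every
    improvable state to a greedy action; stop when no state can be improved."""
    n, k = P[0].shape[0], len(P)
    policy = [0] * n                                    # arbitrary initial policy pi_0
    while True:
        v = policy_value(P, R, policy, gamma)           # evaluation of pi_i
        # one-step lookahead: Q[a][s] = R[a][s] + gamma * sum_{s'} P[a][s, s'] * v[s']
        Q = np.array([R[a] + gamma * P[a] @ v for a in range(k)])
        best = Q.argmax(axis=0)
        if all(Q[best[s], s] <= v[s] + tol for s in range(n)):
            return policy                               # no improving switch: pi_i is optimal
        policy = [int(best[s]) if Q[best[s], s] > v[s] + tol else policy[s]
                  for s in range(n)]                    # greedy improvement: pi_{i+1}
```

Each iteration thus amounts to one linear solve and one greedy sweep, both polynomial in n and k; the question studied here is how many such iterations may be needed in the worst case.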

Every iteration of the algorithm can be performed in polynomial time, and Ye showed that the number of iterations is strongly polynomial in the important special case of discounted-reward MDPs with a fixed discount rate [24] (a bound later improved in [14], [21]). Building on this result, similar conclusions were obtained for other special cases of MDPs [19], [4], [1], [6]. Ye's result does not, however, extend to Value Iteration and Modified Policy Iteration, the two standard and closely related competitors of PI [5], [7].

In contrast to these positive results, the number of iterations of PI can be exponentially large in general. Based on the work of Friedmann on Parity Games [8], PI has been shown to require at least Ω(2^{n/7}) steps to converge in the worst case for the total- and average-reward criteria [3] and for the discounted-reward criterion [17]. Friedmann's result was also a major milestone for the study of the Simplex algorithm for Linear Programming, as it led to exponential lower bounds for some critical pivoting rules [9], [10]. On the other hand, the best upper bound known to date for PI was the 13 · k^n/n steps bound of Mansour and Singh [18]. In Theorem 1, Section 3, we provide the first improvement to this bound in fifteen years, namely k/(k−1) · k^n/n + o(k^n/n).
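
Written out (with N denoting the number of PI iterations, a shorthand introduced here only for the display), the two upper bounds read:

```latex
N \;\le\; 13\,\frac{k^n}{n} \quad\text{(Mansour and Singh, 1999)},
\qquad
N \;\le\; \frac{k}{k-1}\,\frac{k^n}{n} \;+\; o\!\left(\frac{k^n}{n}\right) \quad\text{(Theorem 1)}.
```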

To obtain our bound, we use a number of properties of PI-sequences. It is of natural interest to explore which of these properties could be further exploited to improve the bound and which ones cannot. It turns out that the properties we actually use to obtain our upper bound cannot lead to further improvements, that is, they are “fully exploited”. To formally prove this fact, we introduce in Section  2 the notion of pseudo-PI-sequence to describe any sequence of policies satisfying only the properties that we use to obtain our bound from Theorem 1. We then show in Theorem 2, Section  3, that there always exists a pseudo-PI-sequence whose size matches the upper bound of Theorem 1. This confirms that the bound is sharp for pseudo-PI-sequences. Therefore, obtaining new bounds on PI-sequences would require exploiting stronger properties.

An attempt in that direction, based on so-called Order-Regular matrices, has been proposed in [13] and developed in [12]. Based on numerical evidence, Hansen and Zwick conjectured that the number of iterations of PI for k = 2 should be bounded by F_{n+2} (= O(1.618^n)), the (n+2)nd Fibonacci number. If true, this bound would significantly improve ours.
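
For reference, the asymptotics of the conjectured bound follow from Binet's formula for the Fibonacci numbers:

```latex
F_{n+2} \;=\; \frac{\varphi^{\,n+2} - (1-\varphi)^{\,n+2}}{\sqrt{5}}
\;=\; \Theta\!\left(\varphi^{\,n}\right),
\qquad
\varphi \;=\; \frac{1+\sqrt{5}}{2} \;\approx\; 1.618,
```

so that for k = 2 the conjectured bound grows much more slowly than the k^n/n-type bounds above.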

As a final remark, note that our analysis also fits in the frameworks of the Strategy Iteration algorithm for solving 2-Player Turn-Based Stochastic Games (2TBSGs) [15], a 2-player generalization of MDPs, and of the Bottom Antipodal algorithm for finding the sink of an Acyclic Unique Sink Orientation of a grid [22], [11]. Our bound can be adapted to these algorithms as well. Note that no polynomial-time algorithm is known for either problem, which is an additional incentive to improve the exponential bounds (although the strongly polynomial bound of Ye [24] extends to 2TBSGs when a fixed discount factor is chosen [14]).


Problem statement and preliminary results

Definition 1 Markov Decision Process

Let S = {1, …, n} be a set of n states and A_s be a set of k actions available in state s ∈ S. To each choice of an action corresponds a transition probability distribution for the next state to visit, as well as a reward. For simplicity, we use a common numbering for the actions, that is, A_s = A = {1, …, k} for all s ∈ S. With this notation, for every pair (s, a) ∈ S × A, the transition probability and reward functions are uniquely defined. Let a policy π ∈ {1, …, k}^n be the stationary choice of one action for every state.
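
For the discounted-reward criterion, for instance, the value vector v^π evaluated by Policy Iteration at each step is the unique solution of the linear system below; the notation P_π, r_π and γ (the discount factor) is introduced here only as an illustration:

```latex
v^{\pi} \;=\; r_{\pi} + \gamma\, P_{\pi}\, v^{\pi}
\qquad\Longleftrightarrow\qquad
v^{\pi} \;=\; \left(I - \gamma P_{\pi}\right)^{-1} r_{\pi},
\qquad 0 \le \gamma < 1,
```

where P_π and r_π collect the transition probabilities and rewards of the actions selected by π.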

Main result: a better upper bound on PI that is tight for Problem 2

In order to solve Problem 2 precisely, we need to provide both a lower and an upper bound on the length of pseudo-PI-sequences. We start by showing the upper bound, which also holds for Problem 1 and therefore provides a new upper bound on the complexity of Policy Iteration in general.

Theorem 1

The number of iterations of Policy Iteration is bounded above by k/(k−1) · k^n/n + o(k^n/n).
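
For instance, instantiating the bound at k = 2 (where the Mansour–Singh constant is 13) gives a leading constant of k/(k−1) = 2:

```latex
k = 2:\qquad
\frac{k}{k-1}\,\frac{k^n}{n} + o\!\left(\frac{k^n}{n}\right)
\;=\; 2\,\frac{2^n}{n} + o\!\left(\frac{2^n}{n}\right).
```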

Before we proceed to the proof of Theorem 1, we need to formulate two additional properties. First, we derive the following lemma

Acknowledgments

This work was supported by an ARC grant from the French Community of Belgium (reference 13/18-054) and by the IAP network “Dysco” funded by the office of the Prime Minister of Belgium (reference IAP VII/19). The scientific responsibility rests with the authors.

References (24)

  • B. Gärtner, et al., Unique sink orientations of grids, Algorithmica (2008).
  • B. Gerencsér, R. Hollanders, J.-C. Delvenne, R.M. Jungers, A complexity analysis of policy iteration through...