Improved bound on the worst case complexity of Policy Iteration

https://doi.org/10.1016/j.orl.2016.01.010

Abstract

Solving Markov Decision Processes is a recurrent task in engineering which can be performed efficiently in practice using the Policy Iteration algorithm. Regarding its complexity, both lower and upper bounds are known to be exponential (but far apart) in the size of the problem. In this work, we provide the first improvement over the now standard upper bound from Mansour and Singh (1999). We also show that this bound is tight for a natural relaxation of the problem.

Introduction

Markov Decision Processes (MDPs) have been found to be a powerful modeling tool for the decision problems that arise daily in various domains of engineering such as control, finance, queuing systems, PageRank optimization, and many more (see  [23] for a more exhaustive list).

An MDP is described by a set of n states in which a system can be. When the system is in a state, its controller must choose one of the actions available in that state; each action induces a reward and moves the system to another state according to given transition probabilities. In this work, we assume that the number of actions per state is bounded by a constant k. A policy refers to the stationary choice of one action in every state. Choosing a policy fixes the dynamics, which then corresponds to a Markov chain. Given any policy (there are at most k^n of them), we can associate a value to each state of the MDP, namely the infinite-horizon expected reward of an agent starting in that state. By solving an MDP, we mean providing an optimal policy, one that maximizes the value of every state. Depending on the application, a total-, discounted- or average-reward criterion may be best suited to define the value function. In every case, an optimal policy always exists. See, e.g., [20] for an in-depth study of MDPs.
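
As a concrete illustration, the following minimal sketch encodes such an MDP with NumPy arrays and computes the value of a fixed policy under the discounted-reward criterion (one of the three criteria above); the names `P`, `R`, `policy_value` and the discount factor `gamma` are illustrative choices, not notation from the paper.

```python
import numpy as np

def policy_value(P, R, policy, gamma=0.9):
    """Value of a stationary policy under the discounted-reward criterion.

    P[a] is the n x n transition matrix of action a, R[a] the length-n reward
    vector of action a, and policy[s] the action chosen in state s.  Fixing the
    policy fixes a Markov chain (P_pi, r_pi), and the value vector solves the
    linear system  v = r_pi + gamma * P_pi v.
    """
    n = len(policy)
    P_pi = np.array([P[policy[s]][s] for s in range(n)])  # rows of the chosen actions
    r_pi = np.array([R[policy[s]][s] for s in range(n)])  # rewards of the chosen actions
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```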

A practically efficient way of finding the optimal policy of an MDP is to use Policy Iteration (PI). Starting from an initial policy π_0 (i = 0), this simple iterative scheme repeatedly computes the value of π_i at every state and greedily modifies the policy using this evaluation to obtain the next iterate π_{i+1}. The modification always ensures that the value of π_{i+1} improves on that of π_i at every state. The process is repeated until convergence to the optimal policy π* in a finite number of steps (at most k^n, the number of policies). We refer to the ordered set of explored policies as the PI-sequence. A more precise statement of the algorithm, as well as some important properties, is given in Section 2.
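
A minimal sketch of this scheme is given below, continuing the encoding above (and reusing `policy_value`). It uses the greedy rule that simultaneously switches every state where a one-step improvement exists; all names are again illustrative.

```python
def policy_iteration(P, R, gamma=0.9, tol=1e-12):
    """Greedy Policy Iteration: evaluate the current policy, then switch every
    improvable state to a greedy action; stop when no state can be improved."""
    n, k = P[0].shape[0], len(P)
    policy = [0] * n                                    # arbitrary initial policy pi_0
    while True:
        v = policy_value(P, R, policy, gamma)           # evaluation of pi_i
        # one-step lookahead: Q[a][s] = R[a][s] + gamma * sum_{s'} P[a][s, s'] * v[s']
        Q = np.array([R[a] + gamma * P[a] @ v for a in range(k)])
        best = Q.argmax(axis=0)
        if all(Q[best[s], s] <= v[s] + tol for s in range(n)):
            return policy                               # no improving switch: pi_i is optimal
        policy = [int(best[s]) if Q[best[s], s] > v[s] + tol else policy[s]
                  for s in range(n)]                    # greedy improvement: pi_{i+1}
```

Each iteration thus amounts to one linear solve and one greedy sweep, both polynomial in n and k; the question studied here is how many such iterations may be needed in the worst case.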

Every iteration of the algorithm can be performed in polynomial time, and Ye showed that the number of iterations is strongly polynomial in the important special case of discounted-reward MDPs with a fixed discount rate [24] (a bound later improved in [14], [21]). Building on this result, similar conclusions were obtained for other special cases of MDPs [19], [4], [1], [6]. Ye's result does not, however, extend to Value Iteration and Modified Policy Iteration, the two standard and closely related competitors of PI [5], [7].

In contrast to these positive results, the number of iterations of PI can be exponentially large in general. Based on the work of Friedmann on Parity Games [8], PI has been shown to require at least Ω(2^{n/7}) steps to converge in the worst case for the total- and average-reward criteria [3] and for the discounted-reward criterion [17]. Friedmann's result was also a major milestone for the study of the Simplex algorithm for Linear Programming, as it led to exponential lower bounds for some critical pivoting rules [9], [10]. On the other hand, the best upper bound known to date for PI was the 13 · k^n/n steps bound of Mansour and Singh [18]. In Theorem 1, Section 3, we provide the first improvement to this bound in fifteen years, namely k/(k−1) · k^n/n + o(k^n/n).
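
Written out (with N denoting the number of PI iterations, a shorthand introduced here only for the display), the two upper bounds read:

```latex
N \;\le\; 13\,\frac{k^n}{n} \quad\text{(Mansour and Singh, 1999)},
\qquad
N \;\le\; \frac{k}{k-1}\,\frac{k^n}{n} \;+\; o\!\left(\frac{k^n}{n}\right) \quad\text{(Theorem 1)}.
```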

To obtain our bound, we use a number of properties of PI-sequences. It is of natural interest to explore which of these properties could be further exploited to improve the bound and which ones cannot. It turns out that the properties we actually use to obtain our upper bound cannot lead to further improvements, that is, they are “fully exploited”. To formally prove this fact, we introduce in Section  2 the notion of pseudo-PI-sequence to describe any sequence of policies satisfying only the properties that we use to obtain our bound from Theorem 1. We then show in Theorem 2, Section  3, that there always exists a pseudo-PI-sequence whose size matches the upper bound of Theorem 1. This confirms that the bound is sharp for pseudo-PI-sequences. Therefore, obtaining new bounds on PI-sequences would require exploiting stronger properties.

An attempt in that direction, based on so-called Order-Regular matrices, has been proposed in [13] and developed in [12]. Based on numerical evidence, Hansen and Zwick conjectured that the number of iterations of PI for k = 2 should be bounded by F_{n+2} (= O(1.618^n)), the (n+2)nd Fibonacci number. If true, this bound would significantly improve ours.
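
For reference, the asymptotics of the conjectured bound follow from Binet's formula for the Fibonacci numbers:

```latex
F_{n+2} \;=\; \frac{\varphi^{\,n+2} - (1-\varphi)^{\,n+2}}{\sqrt{5}}
\;=\; \Theta\!\left(\varphi^{\,n}\right),
\qquad
\varphi \;=\; \frac{1+\sqrt{5}}{2} \;\approx\; 1.618,
```

so that for k = 2 the conjectured bound grows much more slowly than the k^n/n-type bounds above.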

As a final remark, note that our analysis also fits in the frameworks of the Strategy Iteration algorithm for solving 2-Player Turn-Based Stochastic Games (2TBSGs) [15], a 2-player generalization of MDPs, and of the Bottom Antipodal algorithm for finding the sink of an Acyclic Unique Sink Orientation of a grid [22], [11]. Our bound can be adapted to these algorithms as well. Note that no polynomial-time algorithm is known for either problem, which is an additional incentive to improve the exponential bounds (although the strongly polynomial bound of Ye [24] extends to 2TBSGs when a fixed discount factor is chosen [14]).


Problem statement and preliminary results

Definition 1 Markov Decision Process

Let S = {1, …, n} be a set of n states and A_s be a set of k actions available in state s ∈ S. To each choice of an action corresponds a transition probability distribution for the next state to visit, as well as a reward. For simplicity, we use a common numbering for the actions, that is, A_s = A = {1, …, k} for all s ∈ S. With this notation, for every pair (s, a) ∈ S × A, the transition probability and reward functions are uniquely defined. Let a policy π ∈ {1, …, k}^n be the stationary choice of one action for every state.
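
For the discounted-reward criterion, for instance, the value vector v^π evaluated by Policy Iteration at each step is the unique solution of the linear system below; the notation P_π, r_π and γ (the discount factor) is introduced here only as an illustration:

```latex
v^{\pi} \;=\; r_{\pi} + \gamma\, P_{\pi}\, v^{\pi}
\qquad\Longleftrightarrow\qquad
v^{\pi} \;=\; \left(I - \gamma P_{\pi}\right)^{-1} r_{\pi},
\qquad 0 \le \gamma < 1,
```

where P_π and r_π collect the transition probabilities and rewards of the actions selected by π.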

Main result: a better upper bound on PI that is tight for Problem 2

In order to solve Problem 2 precisely, we need to provide both a lower and an upper bound on the length of pseudo-PI-sequences. We start by showing the upper bound, which also holds for Problem 1 and therefore provides a new upper bound on the complexity of Policy Iteration in general.

Theorem 1

The number of iterations of Policy Iteration is bounded above by k/(k−1) · k^n/n + o(k^n/n).
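
For instance, instantiating the bound at k = 2 (where the Mansour–Singh constant is 13) gives a leading constant of k/(k−1) = 2:

```latex
k = 2:\qquad
\frac{k}{k-1}\,\frac{k^n}{n} + o\!\left(\frac{k^n}{n}\right)
\;=\; 2\,\frac{2^n}{n} + o\!\left(\frac{2^n}{n}\right).
```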

Before we proceed to the proof of Theorem 1, we need to formulate two additional properties. First, we derive the following lemma

Acknowledgments

This work was supported by an ARC grant from the French Community of Belgium (reference 13/18-054) and by the IAP network “Dysco” funded by the office of the Prime Minister of Belgium (reference IAP VII/19). The scientific responsibility rests with the authors.

References (24)

  • B. Gärtner, et al., Unique sink orientations of grids, Algorithmica (2008).
  • B. Gerencsér, R. Hollanders, J.-C. Delvenne, R.M. Jungers, A complexity analysis of policy iteration through...