
Multiscale Q-learning with linear function approximation


Abstract

We present in this article a two-timescale variant of Q-learning with linear function approximation. Both the Q-values and the policies are assumed to be parameterized, with the policy parameter updated on a faster timescale than the Q-value parameter. This timescale separation is seen to result in significantly improved numerical performance of the proposed algorithm over Q-learning. We show that the proposed algorithm converges almost surely to a closed connected internally chain transitive invariant set of an associated differential inclusion.
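
To make the update structure concrete, the following Python sketch pairs a slower-timescale update of a linear Q-value parameter with a faster-timescale update of a policy parameter. It is a minimal illustration only: the random features, softmax (Gibbs) policy over tabular preferences, placeholder environment, and polynomial step sizes are assumptions made for the sketch, not the algorithm analysed in the paper.

```python
import numpy as np

# Minimal two-timescale sketch (illustrative assumptions throughout):
# Q-values are linear in features, Q_theta(s, a) = theta^T phi(s, a);
# the policy parameter w is updated on a faster timescale than theta.

rng = np.random.default_rng(0)
n_states, n_actions, d = 5, 2, 8                   # assumed toy problem sizes
phi = rng.normal(size=(n_states, n_actions, d))    # assumed feature vectors phi(s, a)
gamma = 0.9

def step_sizes(n):
    a_n = 1.0 / (n + 1) ** 0.6    # faster timescale: policy parameter w
    b_n = 1.0 / (n + 1)           # slower timescale: Q-value parameter theta
    return a_n, b_n

def policy(w, s):
    # Softmax policy over tabular preferences w[s, a]; a stand-in for the
    # paper's parameterized stationary randomized policies.
    prefs = w[s] - w[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def simulate(s, a):
    # Placeholder environment: random next state and reward.
    return int(rng.integers(n_states)), rng.normal()

theta = np.zeros(d)
w = np.zeros((n_states, n_actions))
s = 0

for n in range(10_000):
    a_n, b_n = step_sizes(n)
    pi_s = policy(w, s)
    a = rng.choice(n_actions, p=pi_s)
    s_next, r = simulate(s, a)

    # Slower timescale: Q-learning style update of the linear Q-value parameter.
    q_next = max(theta @ phi[s_next, b] for b in range(n_actions))
    delta = r + gamma * q_next - theta @ phi[s, a]
    theta += b_n * delta * phi[s, a]

    # Faster timescale: move the policy parameter toward actions with higher
    # estimated Q-value (softmax policy-gradient direction, standing in for
    # the perturbation-based policy update of the paper).
    q_s = phi[s] @ theta
    w[s] += a_n * pi_s * (q_s - pi_s @ q_s)

    s = s_next
```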


References

  • Abdulla MS, Bhatnagar S (2007) Reinforcement learning based algorithms for average cost Markov decision processes. Discrete Event Dyn Syst Theory Appl 17(1):23–52

  • Abounadi J, Bertsekas D, Borkar VS (2001) Learning algorithms for Markov decision processes. SIAM J Control Optim 40:681–698

  • Aubin J, Cellina A (1984) Differential inclusions: set-valued maps and viability theory. Springer, New York

  • Azar MG, Gomez V, Kappen HJ (2011) Dynamic policy programming with function approximation. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics (AISTATS), Fort Lauderdale

  • Baird LC (1995) Residual algorithms: reinforcement learning with function approximation. In: Proceedings of ICML. Morgan Kaufmann, pp 30–37

  • Benaim M, Hofbauer J, Sorin S (2005) Stochastic approximations and differential inclusions. SIAM J Control Optim 44(1):328–348

  • Benaim M, Hofbauer J, Sorin S (2006) Stochastic approximations and differential inclusions, Part II: applications. Math Oper Res 31(4):673–695

  • Bertsekas DP (2005) Dynamic programming and optimal control, 3rd ed. Athena Scientific, Belmont

  • Bertsekas DP (2007) Dynamic programming and optimal control, vol II, 3rd ed. Athena Scientific, Belmont

  • Bertsekas DP, Tsitsiklis JN (1996) Neuro-dynamic programming. Athena Scientific, Belmont

  • Bhatnagar S, Babu KM (2008) New algorithms of the Q-learning type. Automatica 44(4):1111–1119

  • Bhatnagar S, Borkar VS (1997) Multiscale stochastic approximation for parametric optimization of hidden Markov models. Probab Eng Inf Sci 11:509–522

  • Bhatnagar S, Fu MC, Marcus SI, Wang I-J (2003) Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Trans Model Comput Simul 13(2):180–209

  • Bhatnagar S, Kumar S (2004) A simultaneous perturbation stochastic approximation based actor–critic algorithm for Markov decision processes. IEEE Trans Autom Control 49(4):592–598

  • Bhatnagar S (2005) Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization. ACM Trans Model Comput Simul 15(1):74–107

  • Bhatnagar S (2007) Adaptive Newton-based multivariate smoothed functional algorithms for simulation optimization. ACM Trans Model Comput Simul 18(1):2:1–2:35

  • Bhatnagar S, Prasad HL, Prashanth LA (2013) Stochastic recursive algorithms for optimization: simultaneous perturbation methods. Lecture notes in control and information sciences, Springer, London

  • Bhatnagar S, Sutton RS, Ghavamzadeh M, Lee M (2009) Natural actor–critic algorithms. Automatica 45:2471–2482

  • Bhatnagar S, Lakshmanan K (2012) An online actor–critic algorithm with function approximation for constrained Markov decision processes. J Optim Theory Appl 153(3):688–708

  • Borkar VS (1995) Probability theory: an advanced course. Springer, New York

  • Borkar VS (1997) Stochastic approximation with two timescales. Syst Control Lett 29:291–294

  • Borkar VS (2008) Stochastic approximation: a dynamical systems viewpoint. Cambridge University Press and Hindustan Book Agency

  • Borkar VS, Meyn SP (2000) The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J Control Optim 38(2):447–469

  • Brandiere O (1998) Some pathological traps for stochastic approximation. SIAM J Control Optim 36:1293–1314

  • Ephremides A, Varaiya P, Walrand J (1980) A simple dynamic routing problem. IEEE Trans Autom Control 25(4):690–693

  • Gelfand SB, Mitter SK (1991) Recursive stochastic algorithms for global optimization in \({\mathcal R}^{d}\). SIAM J Control Optim 29(5):999–1018

  • Konda VR, Borkar VS (1999) Actor–critic like learning algorithms for Markov decision processes. SIAM J Control Optim 38(1):94–123

  • Konda VR, Tsitsiklis JN (2003) On actor–critic algorithms. SIAM J Control Optim 42(4):1143–1166

  • Kushner HJ, Clark DS (1978) Stochastic approximation methods for constrained and unconstrained systems. Springer, New York

  • Kushner HJ, Yin GG (1997) Stochastic approximation algorithms and applications. Springer, New York

  • Maei HR, Szepesvari Cs, Bhatnagar S, Precup D, Silver D, Sutton RS (2009) Convergent temporal-difference learning with arbitrary smooth function approximation. In: Proceedings of NIPS

  • Maei HR, Szepesvari Cs, Bhatnagar S, Sutton RS (2010) Toward off-policy learning control with function approximation. In: Proceedings of ICML, Haifa

  • Melo F, Ribeiro M (2007) Q-learning with linear function approximation. In: Learning theory. Springer, pp 308–322

  • Pemantle R (1990) Nonconvergence to unstable points in urn models and stochastic approximations. Ann Probab 18:698–712

  • Prashanth LA, Chatterjee A, Bhatnagar S (2014) Two timescale convergent Q-learning for sleep scheduling in wireless sensor networks. Wirel Netw 20:2589–2604

  • Puterman ML (1994) Markov decision processes: discrete stochastic dynamic programming. Wiley, New York

  • Schweitzer PJ (1968) Perturbation theory and finite Markov chains. J Appl Probab 5:401–413

  • Sutton RS (1988) Learning to predict by the method of temporal differences. Mach Learn 3:9–44

  • Sutton RS, Barto A (1998) Reinforcement learning: an introduction. MIT Press, Cambridge

  • Sutton RS, Szepesvari Cs, Maei HR (2009) A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In: Proceedings of NIPS. MIT Press, pp 1609–1616

  • Sutton RS, Maei HR, Precup D, Bhatnagar S, Silver D, Szepesvari Cs, Wiewiora E (2009) Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: Proceedings of ICML. ACM, pp 993–1000

  • Spall JC (1992) Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans Autom Control 37(3):332–341

  • Spall JC (1997) A one-measurement form of simultaneous perturbation stochastic approximation. Automatica 33:109–112

  • Szepesvari Cs, Smart WD (2004) Interpolation-based Q-learning. In: Proceedings of ICML, Banff, Canada

  • Tsitsiklis JN (1994) Asynchronous stochastic approximation and Q-learning. Mach Learn 16:185–202

  • Tsitsiklis JN, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans Autom Control 42(5):674–690

  • Tsitsiklis JN, Van Roy B (1999) Average cost temporal-difference learning. Automatica 35:1799–1808

  • Walrand J (1988) An introduction to queueing networks. Prentice Hall, New Jersey

  • Watkins C, Dayan P (1992) Q-learning. Mach Learn 8:279–292

  • Weber RW (1978) On the optimal assignment of customers to parallel servers. J Appl Probab 15:406–413

Acknowledgments

The authors thank the Editor Prof. C. G. Cassandras, the Associate Editor, and all the anonymous reviewers for their detailed comments and criticisms on the various drafts of this paper, which led to several corrections in the proof and presentation. In particular, the authors gratefully thank the reviewer who suggested that they follow a differential inclusions based approach for the slower scale dynamics. The authors thank Prof. V. S. Borkar for helpful discussions. This work was partially supported through projects from the Department of Science and Technology (Government of India), Xerox Corporation (USA), and the Robert Bosch Centre (Indian Institute of Science).

Author information

Corresponding author

Correspondence to Shalabh Bhatnagar.

Appendix

In this section, we present detailed proofs of some of the results given in Section 3.

Proof of Proposition 1

Note that the n-step (n > 1) transition probability of going from state (i, a) to (j, b) is

$${p_{w}^{n}}(i,a;j,b) = P(X_{n}=j,Z_{n}=b\mid X_{0}=i, Z_{0}=a,\pi_{w}) = {q_{w}^{n}}(i,a,j)\,\pi_{w}(j,b),$$

where \({q_{w}^{n}}(i,a,j)\) is the n-step probability of reaching state j when the initial state is i and action a is chosen (in state i), while the actions at all other stages (from 1 to n−1) are chosen according to the SRP \(\pi_{w}\). It is easy to see that \(\sum\limits_{j\in S} {q_{w}^{n}}(i,a,j) = 1\), \(\forall i\in S\), \(a\in A(i)\).

Let \(l\in S\) be such that \(p(i,a,l)>0\). From Assumption 1, \(X_{n}\), \(n \geq 0\), is irreducible under any SRP \(\pi_{w}\). Thus, given the SRP \(\pi_{w}\) and states l, j, there exists an integer \(n_{1} > 0\) such that

$$p^{n_{1}}(l,j,\pi_{w}) \overset{\triangle}{=} P(X_{n_{1}}=j\mid X_{0}=l,\pi_{w}) >0.$$

Note that in estimating \(p^{n}(l,j,\pi_{w})\), it is assumed that the actions at each of the n stages are picked according to the policy \(\pi_{w}\). This is unlike estimating \({q_{w}^{n}}(l,a,j)\), where the first action picked in state l is a, while the actions in the remaining n−1 stages are picked according to \(\pi_{w}\). Now observe that

$$p^{n}_{w}(i,a; j,b) \geq p(i,a,l)\, p^{n-1}(l,j,\pi_{w})\,\pi_{w}(j,b). $$

Thus, \(p_{w}^{n_{1}+1}(i,a;j,b) >0\). Similarly, it can be shown that there exists an integer \(n_{2} > 0\) such that \(p_{w}^{n_{2}+1}(j,b; i,a)>0\). Thus, \(\{(X_{n}, Z_{n})\}\) is an irreducible Markov chain when \(Z_{n}\), \(n \geq 0\), are obtained according to \(\pi_{w}\).

Next, we show that \(\{(X_{n}, Z_{n})\}\) is aperiodic. Again let \(l\in S\) be such that \(p(i,a,l)>0\). Since the process \(\{X_{n}\}\) is aperiodic under \(\pi_{w}\) by Assumption 1, there exists an integer M > 0 such that \(p^{n}(l,l,\pi_{w}) > 0\) for all \(n \geq M\); see for instance Lemma 5.3.2, p. 99, of Borkar (1995). By irreducibility of \(\{X_{n}\}\) under \(\pi_{w}\), there exists an integer \(n_{3} > 0\) such that \(p^{n_{3}}(l,i,\pi_{w})>0\). Now note that

$$p^{1+n+n_{3}}_{w}(i,a; i,a) \geq p(i,a,l)\, p^{n}(l,l,\pi_{w})\, p^{n_{3}}(l,i,\pi_{w})\,\pi_{w}(i,a) > 0, \quad \forall n \geq M.$$

Thus, \({p_{w}^{n}}(i,a;i,a)>0\) for all \(n\geq 1+M+n_{3}\). Hence, \(\{(X_{n}, Z_{n})\}\) is aperiodic under \(\pi_{w}\) as well. Finally, since \(S \times A(S)\) is a finite set, \(\{(X_{n}, Z_{n})\}\) is also positive recurrent. The claim follows. □
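
As a numerical sanity check of this construction (not part of the proof), one can form the joint transition matrix \(p_{w}(i,a;j,b)=p(i,a,j)\pi_{w}(j,b)\) for a small example and verify that some power of it is strictly positive. The toy kernel and policy below are assumptions; both are chosen with all entries positive, so the check succeeds already at small powers.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA = 3, 2

# Assumed toy kernel p(i, a, j) and stationary randomized policy pi_w(j, b),
# both with strictly positive entries.
p = rng.random((nS, nA, nS)); p /= p.sum(axis=2, keepdims=True)
pi_w = rng.random((nS, nA)); pi_w /= pi_w.sum(axis=1, keepdims=True)

# Joint chain on state-action pairs: p_w(i, a; j, b) = p(i, a, j) * pi_w(j, b).
P_w = np.einsum('iaj,jb->iajb', p, pi_w).reshape(nS * nA, nS * nA)

# Irreducibility and aperiodicity of {(X_n, Z_n)}: some power of P_w is
# strictly positive (trivially so here, since p and pi_w are positive).
print(np.all(np.linalg.matrix_power(P_w, 3) > 0))   # expected: True
```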

Proof of Lemma 1

We shall use a key result from Schweitzer (1968) for the proof. Let \({P_{w}^{\infty} = \lim_{m\rightarrow\infty} \frac{1}{m}\sum\limits_{n=1}^{m} {P_{w}^{n}}}\) and \(Z_{w} \overset{\triangle}{=} (I- P_{w} - P_{w}^{\infty})^{-1}\), respectively, where I denotes the \((|S \times A(S)| \times |S \times A(S)|)\)-identity matrix and \({P_{w}^{m}}\) is the matrix of m-step transition probabilities \({p^{m}_{w}}(i,a;j,b)\), \(i, j\in S\), \(a\in A(i)\), \(b\in A(j)\). From Theorem 2, pp. 402–403 of Schweitzer (1968), one can write

$$\mathbb{f}_{w+\xi e_{i}} = {\mathbb{f}}_{w}(I + (P_{w+\xi e_{i}}-P_{w})Z_{w} + o(\xi)), $$

where \(\xi > 0\) is a small quantity and \(e_{i}\), \(i \in \{1,\ldots,N\}\), is the unit vector with 1 as its ith entry and 0s elsewhere. Hence, we get

$$\nabla_{w,i} {\mathbb{f}}_{w} = {\mathbb{f}}_{w} \nabla_{w,i} P_{w} Z_{w}, \quad i=1,\ldots, N. $$

Thus, \(\nabla_{w} {\mathbb{f}}_{w} = {\mathbb{f}}_{w} \nabla_{w} P_{w} Z_{w}\). Now since \(p_{w}(i,a;j,b) = p(i,a,j)\pi_{w}(j,b)\), it follows from Assumption 2 that \(p_{w}(i,a;j,b)\) is continuously differentiable in w, i.e., \(\nabla_{w} P_{w}\) exists and is continuous. Hence, \(\nabla_{w} {\mathbb{f}}_{w}\) exists.

Next we verify that \(\nabla_{w} {\mathbb{f}}_{w}\) is continuous as well. Note that \({\mathbb{f}}_{w}\) is continuous since it is differentiable. Further, \(\nabla_{w} P_{w}\) is continuous as noted above. Also, from Cramer's rule, it follows that \(Z_{w}\) is continuously differentiable and hence also continuous over \(w\in C\). Since the set C is a compact subset of \({\mathcal R}^{N}\), it is easy to see that \(\nabla_{w} {\mathbb{f}}_{w}\) is continuous as well. The claim follows. □
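
The sensitivity formula \(\nabla_{w}\mathbb{f}_{w} = \mathbb{f}_{w}\nabla_{w}P_{w}Z_{w}\) can be illustrated numerically. The snippet below uses an assumed one-parameter family of chains (a sigmoid mixture of two fixed kernels, purely for illustration), forms \(Z_{w}\) exactly as defined above, and compares the formula with a central finite-difference estimate of the derivative of the stationary distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.random((n, n)); A /= A.sum(axis=1, keepdims=True)
B = rng.random((n, n)); B /= B.sum(axis=1, keepdims=True)

def P(w):
    lam = 1.0 / (1.0 + np.exp(-w))     # assumed parameterization: mixture weight
    return lam * A + (1.0 - lam) * B

def stationary(P_):
    # Left eigenvector of P_ for eigenvalue 1, normalized to a distribution.
    vals, vecs = np.linalg.eig(P_.T)
    f = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return f / f.sum()

w0, eps = 0.3, 1e-6
P0, f0 = P(w0), stationary(P(w0))
dP = (P(w0 + eps) - P(w0 - eps)) / (2 * eps)        # numerical d P_w / d w

Pinf = np.outer(np.ones(n), f0)                     # P_w^infinity: every row equals f_w
Z = np.linalg.inv(np.eye(n) - P0 - Pinf)            # Z_w as defined in the proof above

df_formula = f0 @ dP @ Z                            # f_w (d P_w / d w) Z_w
df_numeric = (stationary(P(w0 + eps)) - stationary(P(w0 - eps))) / (2 * eps)
print(np.allclose(df_formula, df_numeric, atol=1e-4))   # expected: True
```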

Proof of Lemma 2

It is easy to see from the definition of \(R(\theta,w)\) and Lemma 1 that the partial derivatives of \(R(\theta,w)\) with respect to any \(\theta \in {\mathcal R}^{d}\) and \(w\in C\) exist. Note that from the definition, for a given \(w\in C\),

$$\nabla_{\theta} R(\theta,w) = \sum\limits_{(i,a)\in S\times A(S)} f_{w}(i,a)\phi_{i,a},$$

which is a constant function of \(\theta\), hence continuous. Now consider

$$\nabla_{w} R(\theta,w) = (\nabla_{w,1}R(\theta,w),\ldots,\nabla_{w,N}R(\theta,w))^{T}, $$

where \(\nabla_{w,i} R(\theta,w)\) is the partial derivative of \(R(\theta,w)\) with respect to \(w_{i}\), given \(\theta\in D\). Note that \(\sup_{\theta\in D}\parallel\theta\parallel < \infty\), since D is bounded. Now, given \(\theta\in D\),

$$\nabla_{w} R(\theta,w) = \sum\limits_{(i, a)\in S\times A(S)} \nabla_{w} f_{w}(i,a)\, \theta^{T} \phi_{i,a},$$

since \(S \times A(S)\) is a finite set. Let \(w^{1}\) and \(w^{2}\) be two points in C. Then,

$$\parallel \nabla_{w} R(\theta,w^{1}) - \nabla_{w} R(\theta,w^{2}) \parallel \leq \sum\limits_{(i,a)\in S\times A(S)} \parallel \nabla_{w} f_{w^{1}}(i,a) {\theta}^{T} \phi_{i,a} - \nabla_{w} f_{w^{2}}(i,a) {\theta}^{T} \phi_{i,a} \parallel \leq \sum\limits_{(i,a)\in S\times A(S)} \parallel \nabla_{w} f_{w^{1}}(i,a)-\nabla_{w} f_{w^{2}}(i,a)\parallel\, |{\theta}^{T} \phi_{i,a}|. $$

Now since D is a compact set, note that

$$L_{2} \overset{\triangle}{=} \max_{(i,a)\in S\times A(S)}\max_{\theta\in D}|\theta^{T}\phi_{i,a}|<\infty. $$

The claim now follows since \(\nabla_{w} f_{w}(i,a)\) is a continuous function of w by Lemma 1 (in fact, uniformly continuous, since \(w\in C\), a compact set). □
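
As a concrete (assumed) instance of the quantities in this lemma, one can take \(R(\theta,w)=\sum_{(i,a)\in S\times A(S)} f_{w}(i,a)\,\theta^{T}\phi_{i,a}\), which is consistent with the two gradient expressions displayed above, build \(f_{w}\) from a toy joint chain under a softmax policy, and confirm by finite differences that \(\nabla_{\theta}R(\theta,w)\) does not depend on \(\theta\). The kernel, features and policy parameterization below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, d = 3, 2, 4
p = rng.random((nS, nA, nS)); p /= p.sum(axis=2, keepdims=True)   # assumed kernel
phi = rng.normal(size=(nS, nA, d))                                # assumed features

def f_w(w):
    # Stationary distribution of the joint state-action chain under softmax pi_w.
    pi = np.exp(w); pi /= pi.sum(axis=1, keepdims=True)
    P = np.einsum('iaj,jb->iajb', p, pi).reshape(nS * nA, nS * nA)
    vals, vecs = np.linalg.eig(P.T)
    f = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return (f / f.sum()).reshape(nS, nA)

def R(theta, w):
    return np.sum(f_w(w) * (phi @ theta))

w, eps = rng.normal(size=(nS, nA)), 1e-6
grads = []
for theta in (rng.normal(size=d), rng.normal(size=d)):
    g = [(R(theta + eps * np.eye(d)[i], w) - R(theta - eps * np.eye(d)[i], w)) / (2 * eps)
         for i in range(d)]
    grads.append(np.array(g))

# grad_theta R(theta, w) = sum_{(i,a)} f_w(i,a) phi_{i,a}, independent of theta.
expected = np.einsum('ia,iad->d', f_w(w), phi)
print(np.allclose(grads[0], grads[1], atol=1e-4), np.allclose(grads[0], expected, atol=1e-4))
```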

Proof of Lemma 6

We first show the claim in (3.5). Recall from Lemma 5 that

$$\parallel w_{n+s}-w_{n} \parallel \rightarrow 0 ~\text{as}~ n\rightarrow\infty, $$

almost surely, for all \(s \in \{1,\ldots,P\}\). From Lemma 2 and the above, it follows that

$$\parallel \nabla_{w,k} R(\theta, w_{n+s}) - \nabla_{w,k} R(\theta, w_{n}) \parallel \rightarrow 0 \,\,\text{as}\,\, n\rightarrow\infty, $$

for all \(s \in \{1,\ldots,P\}\), \(k \in \{1,\ldots,N\}\). By letting M = P in Assumption 5, it follows that \(a(j)/a(m)\rightarrow 1\) as \(m\rightarrow\infty\) for any \(j \in \{m,\ldots,m+P-1\}\). Note also that P is an even integer. As a consequence of Lemma 4, one can split any set of the type \(A_{m} \overset{\triangle}{=}\{m,m+1,\ldots,m+P-1\}\) into two disjoint subsets \(A_{m,k,l}^{+}\) and \(A_{m,k,l}^{-}\), each having the same number of elements, with \(A_{m,k,l}^{+} \cup A_{m,k,l}^{-} = A_{m}\), and such that \({\frac{{\triangle_{n}^{k}}}{{\triangle_{n}^{l}}}}\) takes the value \(+1\) for all \(n\in A_{m,k,l}^{+}\) and \(-1\) for all \(n \in A_{m,k,l}^{-}\), respectively. Thus,

$$\parallel \sum\limits_{n=m}^{m+P-1} \frac{a(n)}{a(m)} \frac{{\triangle_{n}^{k}}}{{\triangle_{n}^{l}}} \nabla_{w,k} R(\theta, w_{n}) \parallel = \parallel \sum\limits_{n \in A_{m,k,l}^{+}} \frac{a(n)}{a(m)} \nabla_{w,k}R(\theta, w_{n}) - \sum\limits_{n \in A_{m,k,l}^{-}} \frac{a(n)}{a(m)} \nabla_{w,k}R(\theta, w_{n}) \parallel. $$

It now follows as a consequence of the above that

$$\parallel \sum\limits_{n=m}^{m+P-1} \frac{a(n)}{a(m)} \frac{{\triangle_{n}^{k}}}{{\triangle_{n}^{l}}} \nabla_{w,k} R(\theta, w_{n}) \parallel \rightarrow 0, $$

almost surely as \(m\rightarrow\infty\). Finally, the claim in (3.6) follows from Lemma 5, Lemma 2 and Assumption 5, in a similar manner as (3.5). □
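
The cancellation in the last step reflects the structure of the deterministic perturbation sequences used in Lemma 4. The snippet below illustrates it with a normalized Hadamard-matrix construction in the spirit of Bhatnagar et al. (2003); the dimension N and this particular construction are assumptions for the illustration. Over any one cycle of P consecutive indices, the ratio \(\triangle_{n}^{k}/\triangle_{n}^{l}\), \(k\neq l\), equals +1 on exactly half the indices and −1 on the other half.

```python
import numpy as np
from scipy.linalg import hadamard

N = 5                                    # assumed parameter dimension
P = 1 << int(np.ceil(np.log2(N + 1)))    # cycle length (a power of 2, hence even)

H = hadamard(P)                          # P x P matrix with +/-1 entries
Delta = H[:, 1:N + 1]                    # drop the all-ones column; row n gives Delta_n

for k in range(N):
    for l in range(N):
        if k == l:
            continue
        ratios = Delta[:, k] / Delta[:, l]   # Delta_n^k / Delta_n^l over one cycle
        # Distinct +/-1 columns of a Hadamard matrix are orthogonal, so the
        # ratio is +1 on exactly P/2 indices and -1 on the other P/2.
        assert np.sum(ratios == 1) == P // 2 and np.sum(ratios == -1) == P // 2
print("cancellation property holds for all k != l over one cycle of length", P)
```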

About this article

Cite this article

Bhatnagar, S., Lakshmanan, K. Multiscale Q-learning with linear function approximation. Discrete Event Dyn Syst 26, 477–509 (2016). https://doi.org/10.1007/s10626-015-0216-z
