Abstract
We present a two-timescale variant of Q-learning with linear function approximation. Both the Q-values and the policies are parameterized, with the policy parameter updated on a faster timescale than the Q-value parameter. This timescale separation is seen to significantly improve the numerical performance of the proposed algorithm over Q-learning. We show that the proposed algorithm converges almost surely to a closed, connected, internally chain transitive invariant set of an associated differential inclusion.
References
Abdulla MS, Bhatnagar S (2007) Reinforcement learning based algorithms for average cost Markov decision processes. Discrete Event Dyn Syst Theory Appl 17(1):23–52
Abounadi J, Bertsekas D, Borkar VS (2001) Learning algorithms for Markov decision processes. SIAM J Control Optim 40:681–698
Aubin J, Cellina A (1984) Differential inclusions: set-valued maps and viability theory. Springer, New York
Azar MG, Gomez V, Kappen HJ (2011) Dynamic policy programming with function approximation. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics (AISTATS), Fort Lauderdale
Baird LC (1995) Residual algorithms: reinforcement learning with function approximation. In: Proceedings of ICML. Morgan Kaufmann, pp 30–37
Benaim M, Hofbauer J, Sorin S (2005) Stochastic approximations and differential inclusions. SIAM J Control Optim 44(1):328–348
Benaim M, Hofbauer J, Sorin S (2006) Stochastic approximations and differential inclusions, Part II: applications. Math Oper Res 31(4):673–695
Bertsekas DP (2005) Dynamic programming and optimal control, 3rd ed. Athena Scientific, Belmont
Bertsekas DP (2007) Dynamic programming and optimal control, vol II, 3rd ed. Athena Scientific, Belmont
Bertsekas DP, Tsitsiklis JN (1996) Neuro-dynamic programming. Athena Scientific, Belmont
Bhatnagar S, Babu KM (2008) New algorithms of the Q-learning type. Automatica 44(4):1111–1119
Bhatnagar S, Borkar VS (1997) Multiscale stochastic approximation for parametric optimization of hidden Markov models. Probab Eng Inf Sci 11:509–522
Bhatnagar S, Fu MC, Marcus SI, Wang I-J (2003) Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Transactions on Modeling and Computer Simulation 13(2):180–209
Bhatnagar S, Kumar S (2004) A simultaneous perturbation stochastic approximation based actor–critic algorithm for Markov decision processes. IEEE Trans Autom Control 49(4):592–598
Bhatnagar S (2005) Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization. ACM Transactions on Modeling and Computer Simulation 15(1):74–107
Bhatnagar S (2007) Adaptive Newton-based multivariate smoothed functional algorithms for simulation optimization. ACM Transactions on Modeling and Computer Simulation 18(1):2:1–2:35
Bhatnagar S, Prasad HL, Prashanth LA (2013) Stochastic recursive algorithms for optimization: simultaneous perturbation methods, lecture notes in control and information sciences. Springer, London
Bhatnagar S, Sutton RS, Ghavamzadeh M, Lee M (2009) Natural actor-critic algorithms. Automatica 45:2471–2482
Bhatnagar S, Lakshmanan K (2012) An online actor-critic algorithm with function approximation for constrained Markov decision processes. J Optim Theory Appl 153(3):688–708
Borkar VS (1995) Probability theory: an advanced course. Springer, New York
Borkar VS (1997) Stochastic approximation with two timescales. Syst Control Lett 29:291–294
Borkar VS (2008) Stochastic approximation: a dynamical systems viewpoint. Cambridge University Press and Hindustan Book Agency
Borkar VS, Meyn SP (2000) The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J Control Optim 38(2):447–469
Brandiere O (1998) Some pathological traps for stochastic approximation. SIAM J Contr Optim 36:1293–1314
Ephremides A, Varaiya P, Walrand J (1980) A simple dynamic routing problem. IEEE Trans Autom Control 25(4):690–693
Gelfand SB, Mitter SK (1991) Recursive stochastic algorithms for global optimization in \({\mathcal R}^{d_{*}}\). SIAM J Control Optim 29(5):999–1018
Konda VR, Borkar VS (1999) Actor–critic like learning algorithms for Markov decision processes. SIAM J Control Optim 38(1):94–123
Konda VR, Tsitsiklis JN (2003) On actor–critic algorithms. SIAM J Control Optim 42(4):1143–1166
Kushner HJ, Clark DS (1978) Stochastic approximation methods for constrained and unconstrained systems. Springer, New York
Kushner HJ, Yin GG (1997) Stochastic approximation algorithms and applications. Springer, New York
Maei HR, Szepesvari C, Bhatnagar S, Precup D, Silver D, Sutton RS (2009) Convergent temporal-difference learning with arbitrary smooth function approximation. Proceedings of NIPS
Maei HR, Szepesvari Cs, Bhatnagar S, Sutton RS (2010) Toward off-policy learning control with function approximation. Proceedings of ICML, Haifa
Melo F, Ribeiro M (2007) Q-learning with linear function approximation. Learning Theory, Springer, pp 308–322
Pemantle R (1990) Nonconvergence to unstable points in urn models and stochastic approximations. Annals Prob 18:698–712
Prashanth LA, Chatterjee A, Bhatnagar S (2014) Two timescale convergent Q-learning for sleep scheduling in wireless sensor networks. Wirel Netw 20:2589–2604
Puterman ML (1994) Markov decision processes: discrete stochastic dynamic programming. Wiley, New York
Schweitzer PJ (1968) Perturbation theory and finite Markov chains. J Appl Probab 5:401–413
Sutton RS (1988) Learning to predict by the method of temporal differences. Mach Learn 3:9–44
Sutton RS, Barto A (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
Sutton RS, Szepesvari Cs, Maei HR (2009) A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In: Proceedings of NIPS. MIT Press, pp 1609–1616
Sutton RS, Maei HR, Precup D, Bhatnagar S, Silver D, Szepesvari Cs, Wiewiora E (2009) Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: Proceedings of ICML. ACM, pp 993–1000
Spall JC (1992) Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans Autom Control 37(3):332–341
Spall JC (1997) A one-measurement form of simultaneous perturbation stochastic approximation. Automatica 33:109–112
Szepesvari C, Smart WD (2004) Interpolation-based Q-learning. In: Proceedings of ICML. Banff, Canada
Tsitsiklis JN (1994) Asynchronous stochastic approximation and Q-learning. Mach Learn 16:185–202
Tsitsiklis JN, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans Autom Control 42(5):674–690
Tsitsiklis JN, Van Roy B (1999) Average cost temporal-difference learning. Automatica 35:1799–1808
Walrand J (1988) An introduction to queueing networks. Prentice Hall, New Jersey
Watkins C, Dayan P (1992) Q-learning. Mach Learn 8:279–292
Weber RW (1978) On the optimal assignment of customers to parallel servers. J Appl Probab 15:406–413
Acknowledgments
The authors thank the Editor Prof. C. G. Cassandras, the Associate Editor, and all the anonymous reviewers for their detailed comments and criticisms on the various drafts of this paper, which led to several corrections in the proofs and presentation. In particular, the authors gratefully thank the reviewer who suggested that they follow a differential-inclusions-based approach for the slower scale dynamics. The authors thank Prof. V. S. Borkar for helpful discussions. This work was partially supported through projects from the Department of Science and Technology (Government of India), Xerox Corporation (USA), and the Robert Bosch Centre (Indian Institute of Science).
Appendix
In this section, we present detailed proofs of some of the results given in Section 3.
Proof of Proposition 1
Note that the n-step (n > 1) transition probability of going from state (i, a) to (j, b) is
\[ p_{w}^{n}(i,a;j,b) = q_{w}^{n}(i,a,j)\,\pi_{w}(j,b), \]
where \({q_{w}^{n}}(i,a,j)\) is the n-step probability of going to state j when the initial state is i and action a is chosen (in state i), while actions in the other stages (from 1 to n−1) are chosen according to the SRP \(\pi_{w}\). It is easy to see that \(\sum_{j\in S} {q_{w}^{n}}(i,a,j) = 1\), ∀i ∈ S, a ∈ A(i).
Let l ∈ S be such that p(i, a, l) > 0. Now from Assumption 1, \(\{X_{n}, n \geq 0\}\) under any SRP \(\pi_{w}\) is irreducible. Thus, given SRP \(\pi_{w}\) and states l, j, there exists an integer \(n_{1} > 0\) such that
\[ p^{n_{1}}(l,j,\pi_{w}) > 0. \]
Note that in estimating \(p^{n}(l,j,\pi_{w})\), it is assumed that the actions at each of the n stages are picked according to the policy \(\pi_{w}\). This is unlike estimating \({q_{w}^{n}}(l,a,j)\), where the first action to be picked is a in state l while the actions in the remaining n−1 stages are picked according to \(\pi_{w}\). Now observe that
\[ p_{w}^{n_{1}+1}(i,a;j,b) = q_{w}^{n_{1}+1}(i,a,j)\,\pi_{w}(j,b) \geq p(i,a,l)\,p^{n_{1}}(l,j,\pi_{w})\,\pi_{w}(j,b) > 0. \]
Thus, \(p_{w}^{n_{1}+1}(i,a;j,b) >0\). Similarly, it can be shown that there exists an integer \(n_{2} > 0\) such that \(p_{w}^{n_{2}+1}(j,b; i,a)>0\). Thus, \(\{(X_{n}, Z_{n})\}\) is an irreducible Markov chain when \(Z_{n}\), n ≥ 0, are obtained according to \(\pi_{w}\).
Next, we show that \(\{(X_{n}, Z_{n})\}\) is aperiodic. Again let l ∈ S be such that p(i, a, l) > 0. Since the process \(\{X_{n}\}\) is aperiodic under \(\pi_{w}\) by Assumption 1, there exists an integer M > 0 such that \(p^{n}(l,l,\pi_{w}) > 0\) ∀n ≥ M; see, for instance, Lemma 5.3.2, p. 99 of Borkar (1995). By irreducibility of \(\{X_{n}\}\) under \(\pi_{w}\), there exists an integer \(n_{3} > 0\) such that \(p^{n_{3}}(l,i,\pi _{w})>0\). Now note that, for \(n \geq 1+M+n_{3}\),
\[ p_{w}^{n}(i,a;i,a) \geq p(i,a,l)\,p^{n-1-n_{3}}(l,l,\pi_{w})\,p^{n_{3}}(l,i,\pi_{w})\,\pi_{w}(i,a) > 0. \]
Thus, \({p_{w}^{n}}(i,a;i,a)>0\) \(\forall n\geq (1+M+n_{3})\). Hence, \(\{(X_{n}, Z_{n})\}\) is aperiodic under \(\pi_{w}\) as well. Finally, since S × A(S) is a finite set, \(\{(X_{n}, Z_{n})\}\) is also positive recurrent. The claim follows. □
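The argument above can be checked numerically on a small example. The following is a minimal sketch, not the authors' code: the two-state kernel `p`, the randomized policy `pi`, and the helper names are all hypothetical. It builds the state-action chain with one-step probabilities p(i, a, j)π(j, b) and tests whether some power of its transition matrix is strictly positive, which for a finite chain is equivalent to irreducibility together with aperiodicity.

```python
from itertools import product

# Hypothetical 2-state, 2-action MDP: p[i][a][j] = P(next state j | state i, action a).
p = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
]
# Hypothetical randomized policy with full support: pi[j][b] = P(action b | state j).
pi = [[0.7, 0.3], [0.4, 0.6]]

pairs = list(product(range(2), range(2)))  # all state-action pairs (i, a)


def state_action_matrix(p, pi, pairs):
    """One-step matrix of the pair chain: entry ((i,a),(j,b)) = p(i,a,j) * pi(j,b)."""
    return [[p[i][a][j] * pi[j][b] for (j, b) in pairs] for (i, a) in pairs]


def is_primitive(P):
    """Return True if some power of P has all entries > 0."""
    n = len(P)
    step = [[x > 0 for x in row] for row in P]  # support pattern of P
    reach = step                                # support pattern of the current power
    # Wielandt's bound: a primitive n x n matrix satisfies P^k > 0 for k = (n-1)^2 + 1,
    # so checking powers up to that exponent is enough.
    for _ in range((n - 1) ** 2 + 1):
        if all(all(row) for row in reach):
            return True
        reach = [[any(reach[r][k] and step[k][c] for k in range(n))
                  for c in range(n)] for r in range(n)]
    return all(all(row) for row in reach)


P_w = state_action_matrix(p, pi, pairs)
print(is_primitive(P_w))
```

Working with the support pattern rather than the numerical powers avoids floating-point underflow; only the positions of the positive entries matter for irreducibility and aperiodicity.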
Proof of Lemma 1
We shall use a key result from Schweitzer (1968) for the proof. Let \({ P_{w}^{\infty } = \lim _{m\rightarrow \infty } \frac {1}{m}\sum \limits _{n=1}^{m} {P_{w}^{n}}}\) and \(Z_{w} \overset {\triangle }{=} (I- P_{w} - P_{w}^{\infty })^{-1}\), where I denotes the (|S × A(S)| × |S × A(S)|)-identity matrix and \({P_{w}^{m}}\) is the matrix of m-step transition probabilities \({p^{m}_{w}}(i,a;j,b)\), i, j ∈ S, a ∈ A(i), b ∈ A(j). From Theorem 2, pp. 402–403 of Schweitzer (1968), one can write
\[ {\mathbb {f}}_{w+\xi e_{i}} = {\mathbb {f}}_{w}\left(I + (P_{w+\xi e_{i}} - P_{w}) Z_{w} + o(\xi)\right), \]
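where ξ > 0 is a small quantity and \(e_{i}\), i ∈ {1,…,N}, is the unit vector with 1 as its ith entry and 0s elsewhere. Hence, we get
\[ \nabla_{w,i} {\mathbb {f}}_{w} = \lim_{\xi \downarrow 0} \frac{{\mathbb {f}}_{w+\xi e_{i}} - {\mathbb {f}}_{w}}{\xi} = {\mathbb {f}}_{w} \nabla_{w,i} P_{w} Z_{w}. \]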
Thus, \(\nabla _{w} {\mathbb {f}}_{w} = {\mathbb {f}}_{w} \nabla _{w} P_{w} Z_{w}\). Now since \(p_{w}(i,a;j,b) = p(i,a,j)\pi_{w}(j,b)\), it follows from Assumption 2 that the \(p_{w}(i,a;j,b)\) are continuously differentiable, i.e., \(\nabla_{w} P_{w}\) exists and is continuous. Hence, \(\nabla _{w} {\mathbb {f}}_{w}\) exists.
Next we verify that \(\nabla _{w} {\mathbb {f}}_{w}\) is continuous as well. Note that \({\mathbb {f}}_{w}\) is continuous since it is differentiable. Further, \(\nabla_{w} P_{w}\) is continuous as noted above. Also, from Cramer’s rule, it follows that \(Z_{w}\) is continuously differentiable and hence also continuous over w ∈ C. Being a product of continuous functions, \(\nabla _{w} {\mathbb {f}}_{w} = {\mathbb {f}}_{w} \nabla _{w} P_{w} Z_{w}\) is therefore continuous on C (in fact uniformly continuous, since C is a compact subset of \({\mathcal R}^{N}\)). The claim follows. □
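The identity \(\nabla _{w} {\mathbb {f}}_{w} = {\mathbb {f}}_{w} \nabla _{w} P_{w} Z_{w}\) can be sanity-checked on a toy chain. The sketch below is illustrative only and makes several assumptions that are not in the paper: a hypothetical two-state ergodic chain with a scalar parameter w, a closed-form stationary distribution, and helper names chosen for this example. It compares the formula against a central finite difference of the stationary distribution.

```python
import math


def P(w):
    """Hypothetical 2-state ergodic chain whose first row depends smoothly on w."""
    a = 1.0 / (1.0 + math.exp(-w))  # P(0 -> 1), always in (0, 1)
    b = 0.3                         # P(1 -> 0), held fixed
    return [[1 - a, a], [b, 1 - b]]


def dP(w):
    """Entrywise derivative of P(w) with respect to w."""
    a = 1.0 / (1.0 + math.exp(-w))
    da = a * (1 - a)
    return [[-da, da], [0.0, 0.0]]


def stationary(Pm):
    """Stationary distribution f of a 2-state chain (f P = f, entries sum to 1)."""
    a, b = Pm[0][1], Pm[1][0]
    return [b / (a + b), a / (a + b)]


def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]


def vecmat(v, M):
    """Row vector times matrix."""
    return [sum(v[k] * M[k][j] for k in range(2)) for j in range(2)]


def grad_stationary(w, sign=-1.0):
    """f_w * dP_w * Z_w with Z_w = (I - P_w + sign * P_inf)^{-1}.

    sign=-1.0 matches Z_w as defined in the text; sign=+1.0 is the other
    common convention for the fundamental matrix.
    """
    Pm = P(w)
    f = stationary(Pm)
    # For an ergodic chain every row of P_inf equals the stationary distribution f.
    M = [[(1.0 if i == j else 0.0) - Pm[i][j] + sign * f[j] for j in range(2)]
         for i in range(2)]
    return vecmat(vecmat(f, dP(w)), inv2(M))


h = 1e-6
finite_diff = [(x - y) / (2 * h)
               for x, y in zip(stationary(P(0.5 + h)), stationary(P(0.5 - h)))]
print(grad_stationary(0.5), finite_diff)
```

In this toy example both sign conventions give the same gradient. This is expected: the entries of \({\mathbb {f}}_{w}\) sum to one, so its gradient is annihilated by \(P_{w}^{\infty }\), and the sign in front of \(P_{w}^{\infty }\) does not affect the product.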
Proof of Lemma 2
It is easy to see from the definition of R(𝜃, w) and Lemma 1 that the partial derivatives of R(𝜃, w) with respect to any \(\theta \in {\mathcal R}^{d}\) and w ∈ C exist. Note that from definition, for a given w ∈ C,
which is a constant function of 𝜃, hence continuous. Now consider
where \(\nabla_{w,i} R(\theta, w)\) is the partial derivative of R(𝜃, w) with respect to \(w_{i}\), given 𝜃 ∈ D. Note that \(\sup_{\theta \in D}\|\theta\| < \infty\), since D is bounded. Now, given 𝜃 ∈ D,
since S × A(S) is a finite set. Let w 1 and w 2 be two points in C. Then,
Now since D is a compact set, note that
The claim now follows since \(\nabla_{w} f_{w}(i,a)\) is a continuous function from Lemma 1 (in fact also uniformly continuous since w ∈ C, a compact set). □
Proof of Lemma 6
We first show the claim in (3.5). Recall from Lemma 5 that
almost surely, for all s ∈ {1,…,P}. From Lemma 2 and the above, it follows that
∀s ∈ {1,…,P}, k ∈ {1,…,N}. By letting M = P in Assumption 5, it follows that a(j)/a(m)→1 as m → ∞ for any j ∈ {m,…,m + P−1}. Note also that P is an even integer. As a consequence of Lemma 4, one can split any set of the type \(A_{m} \overset {\triangle }{=}\{m,m+1,\ldots ,m+P-1\}\) into two disjoint subsets \(A_{m,k,l}^{+}\) and \(A_{m,k,l}^{-}\) each having the same number of elements, with \(A_{m,k,l}^{+} \cup A_{m,k,l}^{-} = A_{m}\) and such that \({\frac {{\triangle _{n}^{k}}}{{\triangle _{n}^{l}}}}\) takes value \(+1 \forall n\in A_{m,k,l}^{+}\) and \(-1 \forall n \in A_{m,k,l}^{-}\), respectively. Thus,
It now follows as a consequence of the above that
almost surely as m → ∞. Finally, the claim in (3.6) follows from Lemma 5, Lemma 2 and Assumption 5, in a similar manner as (3.5). □
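The balanced split of \(A_{m}\) used above can be illustrated with a small script. This is a sketch under a stated assumption: the perturbations are taken to be rows of a Sylvester-type Hadamard matrix of order P, cycled with period P, with the all-ones first column dropped. Lemma 4 is not reproduced in this appendix, so this construction and the function names are assumptions made for illustration only.

```python
def sylvester_hadamard(order):
    """Hadamard matrix of a power-of-two order, built by Sylvester's doubling."""
    H = [[1]]
    while len(H) < order:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H


def perturbation(n, N, H):
    """n-th perturbation vector: row (n mod P) of H without its first column."""
    P = len(H)
    return H[n % P][1:N + 1]


def split_window(m, k, l, N, H):
    """Split {m, ..., m+P-1} by the sign of the ratio of components k and l."""
    P = len(H)
    plus, minus = [], []
    for n in range(m, m + P):
        d = perturbation(n, N, H)
        # Entries are +1 or -1, so the ratio d[k] / d[l] is +1 exactly when they agree.
        (plus if d[k] == d[l] else minus).append(n)
    return plus, minus


H = sylvester_hadamard(4)          # P = 4, enough for N = 3 components
print(split_window(5, 0, 2, 3, H))
```

Because distinct columns of a Hadamard matrix are orthogonal, the ratio of two different components equals +1 on exactly half of any window of P consecutive indices and −1 on the other half, so the ratios sum to zero over the window.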
Cite this article
Bhatnagar, S., Lakshmanan, K. Multiscale Q-learning with linear function approximation. Discrete Event Dyn Syst 26, 477–509 (2016). https://doi.org/10.1007/s10626-015-0216-z
DOI: https://doi.org/10.1007/s10626-015-0216-z