Abstract
The methods of temporal differences (Samuel, 1959; Sutton, 1984, 1988) allow an agent to learn accurate predictions of stationary stochastic future outcomes. The learning is effectively stochastic approximation based on samples extracted from the process generating the agent's future.
Sutton (1988) proved that for a special case of temporal differences, the expected values of the predictions converge to their correct values as larger samples are taken, and Dayan (1992) extended his proof to the general case. This article proves the stronger result that the predictions of a slightly modified form of temporal difference learning converge with probability one, and shows how to quantify the rate of convergence.
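For concreteness, below is a minimal sketch of tabular TD(λ) prediction on an absorbing Markov chain, using per-state step sizes 1/n that satisfy the usual Robbins-Monro conditions (the step sizes sum to infinity while their squares sum to a finite value), the kind of condition under which stochastic-approximation arguments give convergence with probability one. The random-walk example, function names, and parameters are illustrative assumptions, not the paper's exact construction.

import random
from collections import defaultdict

def td_lambda(episodes, gamma=1.0, lam=0.9):
    # Tabular TD(lambda) prediction. `episodes` is an iterable of trials,
    # each a list of (state, reward) pairs where the reward is received on
    # entering the state and the final state is absorbing.
    V = defaultdict(float)   # predictions of the expected terminal outcome
    n = defaultdict(int)     # per-state visit counts for the step sizes
    for episode in episodes:
        e = defaultdict(float)   # eligibility traces, cleared every trial
        for (s, _), (s_next, r) in zip(episode, episode[1:]):
            n[s] += 1
            delta = r + gamma * V[s_next] - V[s]   # temporal-difference error
            e[s] += 1.0                            # accumulating trace
            for x in list(e):
                V[x] += delta * e[x] / n[x]        # step size alpha(x) = 1/n(x)
                e[x] *= gamma * lam                # traces decay by gamma*lambda
    return V

def random_walk(n_states=5):
    # The bounded random walk of Sutton (1988): start in the middle, step
    # left or right uniformly, outcome 1 on absorbing at the right end,
    # outcome 0 at the left end.
    s, trial = n_states // 2, [(n_states // 2, 0.0)]
    while 0 <= s < n_states:
        s += random.choice((-1, 1))
        trial.append((s, 1.0 if s == n_states else 0.0))
    return trial

predictions = td_lambda(random_walk() for _ in range(5000))
# True values for the interior states are (i + 1) / 6, i.e. 1/6, ..., 5/6.
print({s: round(v, 3) for s, v in sorted(predictions.items()) if 0 <= s < 5})

With the 1/n step sizes the estimates settle toward the true absorption probabilities rather than oscillating, which is the practical face of the probability-one convergence the article establishes.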
References
Benveniste, A., Métivier, M., & Priouret, P. (1990). Adaptive algorithms and stochastic approximations. Berlin: Springer-Verlag.
Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8, 341–362.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58.
Kuan, C.M., & White, H. (1990). Recursive m-estimation, non-linear regression and neural network learning with dependent observations (discussion paper). Department of Economics, University of California at San Diego.
Kuan, C.M., & White, H. (1991). Strong convergence of recursive m-estimators for models with dynamic latent variables (discussion paper 91-05). Department of Economics, University of California at San Diego.
Kushner, H.J. (1984). Approximation and weak convergence methods for random processes, with applications to stochastic systems theory. Cambridge, MA: MIT Press.
Kushner, H.J., & Clark, D.S. (1978). Stochastic approximation methods for constrained and unconstrained systems. Berlin: Springer-Verlag.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407.
Ross, S. (1983). Introduction to stochastic dynamic programming. New York: Academic Press.
Samuel, A.L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, 210–229.
Sutton, R.S. (1984). Temporal credit assignment in reinforcement learning. Ph.D. thesis, Department of Computer Science, University of Massachusetts, Amherst, MA.
Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Sutton, R.S., & Barto, A.G. (1987). A temporal-difference model of classical conditioning (GTE Laboratories Report TR87-509-2). Waltham, MA: GTE Laboratories.
Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8, 257–278.
Watkins, C.J.C.H. (1989). Learning from delayed rewards. Ph.D. thesis, King's College, University of Cambridge, England.
Watkins, C.J.C.H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.
Cite this article
Dayan, P., & Sejnowski, T.J. (1994). TD(λ) converges with probability 1. Machine Learning, 14, 295–301. https://doi.org/10.1007/BF00993978