Abstract
A new algorithm is proposed that sacrifices the optimality of control policies in order to obtain robust solutions. Robustness becomes an important property of a learning system when there is a mismatch between the theoretical model and the actual physical system, when the physical system is non-stationary, or when the availability of control actions varies over time. The main contribution is a set of approximation algorithms together with their convergence results. A generalized average operator, used in place of the usual optimal operator max (or min), is applied to an important class of learning algorithms, dynamic programming algorithms, and their convergence is analyzed from a theoretical point of view. The purpose of this research is to improve the robustness of reinforcement learning algorithms theoretically.
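The central idea of the abstract, replacing the max over actions in the dynamic programming backup with a generalized average, can be sketched as follows. This is an illustrative sketch, not the paper's exact algorithm: it assumes one common choice of generalized average, the exponential (log-sum-exp) average with risk parameter `beta`, which recovers the max operator as `beta` tends to infinity and is non-expansive, so the usual contraction argument for convergence still applies.

```python
import numpy as np

def generalized_average(values, beta):
    """Exponential (log-mean-exp) generalized average over action values.

    As beta -> +inf this approaches max(values); a finite beta trades
    optimality for smoother, more robust value estimates. A shift by the
    maximum keeps the exponentials numerically stable.
    """
    values = np.asarray(values, dtype=float)
    m = values.max()
    return m + np.log(np.mean(np.exp(beta * (values - m)))) / beta

def soft_value_iteration(P, R, gamma=0.9, beta=5.0, tol=1e-8, max_iter=10_000):
    """Value iteration with max replaced by a generalized average.

    P: transition probabilities, shape (A, S, S); R: rewards, shape (A, S).
    Because the generalized average is non-expansive, the backup remains a
    gamma-contraction and the iteration converges to a unique fixed point.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[a, s] = one-step return of action a in state s plus discounted value
        Q = R + gamma * (P @ V)
        V_new = np.array(
            [generalized_average(Q[:, s], beta) for s in range(n_states)]
        )
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V
```

Since the generalized average never exceeds the max, the resulting values are bounded by the optimal ones; lowering `beta` spreads weight across near-optimal actions, which is the optimality-for-robustness trade the abstract describes.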
Additional information
Communicated by GUO Xing-ming
Project supported by the National Natural Science Foundation of China (Nos. 10471088 and 60572126)
Cite this article
Yin, Cm., Han-xing, W. & Fei, Z. Risk-sensitive reinforcement learning algorithms with generalized average criterion. Appl Math Mech 28, 405–416 (2007). https://doi.org/10.1007/s10483-007-0313-x