
Greedy feature replacement for online value function approximation

Journal of Zhejiang University SCIENCE C

Abstract

Reinforcement learning (RL) in real-world problems requires function approximation, and the quality of the approximation depends on selecting appropriate feature representations. Representational expansion techniques can make linear approximators represent value functions more effectively; however, most of these techniques work well only for low-dimensional problems. In this paper, we present greedy feature replacement (GFR), a novel online expansion technique for value-based RL algorithms that use binary features. Starting from a simple initial representation, GFR expands the feature representation incrementally. New feature dependencies are added automatically to the current representation, and conjunctive features greedily replace current features. A virtual temporal difference (TD) error is recorded for each conjunctive feature to judge whether the replacement would improve the approximation. Correctness guarantees and a computational complexity analysis are provided for GFR. Experimental results in two domains show that GFR achieves much faster learning and is capable of handling large-scale problems.
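To make the setting concrete, the following is a minimal sketch of linear value estimation over binary features with a plain TD(0) update, plus pairwise conjunctions as candidate features. It is not the GFR algorithm: the abstract does not define the virtual TD error or the replacement rule, so `record_error` simply accumulates the ordinary TD error per candidate pair as a stand-in. All function names, the step size `alpha`, and the discount `gamma` are assumptions chosen for illustration.

```python
from itertools import combinations


def active_features(bits):
    """Indices of the binary features that are on (equal to 1)."""
    return sorted(i for i, b in enumerate(bits) if b)


def value(weights, active):
    """Linear estimate with binary features: sum the weights of active features."""
    return sum(weights.get(f, 0.0) for f in active)


def td_update(weights, active, reward, next_active, gamma=0.9, alpha=0.1):
    """One TD(0) step; returns the TD error used for the update."""
    delta = reward + gamma * value(weights, next_active) - value(weights, active)
    for f in active:
        weights[f] = weights.get(f, 0.0) + alpha * delta
    return delta


def candidate_pairs(active):
    """Pairwise conjunctions of active features; a pair is on only if both are."""
    return list(combinations(active, 2))


def record_error(stats, active, delta):
    """Stand-in bookkeeping (assumption): accumulate TD error per candidate pair."""
    for pair in candidate_pairs(active):
        stats[pair] = stats.get(pair, 0.0) + delta


# Hypothetical single transition: three binary features, reward 1.0.
weights, stats = {}, {}
s = active_features((1, 0, 1))       # features 0 and 2 are on
s_next = active_features((0, 1, 0))  # feature 1 is on
delta = td_update(weights, s, reward=1.0, next_active=s_next)
record_error(stats, s, delta)
```

In a sketch like this, a candidate pair with large accumulated error is one where the individual features explain the observed returns poorly, which is the kind of signal an expansion method could use when deciding whether a conjunction is worth adding.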



Author information

Corresponding author

Correspondence to Feng-fei Zhao.

Additional information

Project supported by the 12th Five-Year Defense Exploration Project of China (No. 041202005) and the Ph.D. Program Foundation of the Ministry of Education of China (No. 20120002130007).

About this article

Cite this article

Zhao, Ff., Qin, Z., Shao, Z. et al. Greedy feature replacement for online value function approximation. J. Zhejiang Univ. - Sci. C 15, 223–231 (2014). https://doi.org/10.1631/jzus.C1300246

  • DOI: https://doi.org/10.1631/jzus.C1300246
