ABSTRACT
Inference in Markov Decision Processes has recently received interest as a means to infer the goals of an observed actor, for policy recognition, and as a tool to compute policies. A particularly interesting aspect of the approach is that any existing inference technique for DBNs becomes available for answering behavioral questions, including those involving continuous, factorial, or hierarchical state representations. Here we present an Expectation-Maximization algorithm for computing optimal policies. Unlike previous approaches, we show that this actually optimizes the discounted expected future return for arbitrary reward functions, without assuming an ad hoc finite total time. The algorithm is generic in that any inference technique can be used in the E-step. We demonstrate this for exact inference on a discrete maze and for Gaussian belief state propagation in continuous stochastic optimal control problems.
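The flavor of the approach can be illustrated with a minimal sketch (not the paper's implementation; the 4-state chain MDP, horizon truncation, and all variable names below are illustrative assumptions). Reward is treated as the probability of a "success" event at the final step of a finite-time DBN, mixed over horizons with a geometric prior given by the discount factor. The E-step propagates backward messages under the current policy, the discounted mixture of which recovers a Q-like quantity; a greedy M-step then improves the policy:

```python
import numpy as np

# Illustrative toy MDP: 4-state chain, action 0 moves left, action 1 moves
# right (each with 0.9 success probability); reward event in rightmost state.
S, A = 4, 2
P = np.zeros((A, S, S))                  # P[a, s, s'] transition probabilities
for s in range(S):
    P[0, s, max(s - 1, 0)] += 0.9
    P[0, s, s] += 0.1
    P[1, s, min(s + 1, S - 1)] += 0.9
    P[1, s, s] += 0.1
R = np.zeros((S, A))
R[S - 1, :] = 1.0                        # reward event probability in [0, 1]
gamma, H = 0.9, 60                       # discount; horizon truncation (assumed)

pi = np.full((S, A), 1.0 / A)            # start from the uniform policy
for _ in range(20):                      # EM iterations
    # E-step: backward messages b[tau, s, a] = P(reward event | s, a,
    # tau steps remaining), computed under the current policy pi.
    b = np.zeros((H, S, A))
    b[0] = R
    for tau in range(1, H):
        v = (pi * b[tau - 1]).sum(axis=1)    # marginalize next action
        b[tau] = (P @ v).T                   # sum over successor states
    # Geometric mixture over horizons yields a Q-like quantity
    # (proportional to the discounted expected future return).
    Q = ((gamma ** np.arange(H))[:, None, None] * b).sum(axis=0)
    # M-step: greedy deterministic policy improvement per state.
    pi = np.eye(A)[Q.argmax(axis=1)]

print(pi.argmax(axis=1))                 # learned action per state
```

For this chain, the procedure settles on "move right" in every state. The forward messages (state occupancies under the current policy) would also be computed in a full E-step; they weight states in the M-step but do not change the per-state argmax, so this sketch omits them.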
Index Terms
- Probabilistic inference for solving discrete and continuous state Markov Decision Processes