Abstract
A central task in control theory, artificial intelligence, and formal methods is to synthesize reward-maximizing strategies for agents that operate in partially unknown environments. In environments modeled by gray-box Markov decision processes (MDPs), the impact of the agents’ actions is known in terms of successor states but not in terms of the stochastics involved. In this paper, we devise a strategy synthesis algorithm for gray-box MDPs via reinforcement learning that utilizes interval MDPs as an internal model. To cope with limited sampling access in reinforcement learning, we incorporate two novel concepts into our algorithm, focusing on rapid and successful learning rather than on stochastic guarantees and optimality: lower confidence bound exploration reinforces variants of already learned practical strategies, and action scoping reduces the learning action space to promising actions. We illustrate the benefits of our algorithms by means of a prototypical implementation applied to examples from the AI and formal methods communities.
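The ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that transition intervals are derived from sample counts with Hoeffding's inequality (the abstract does not say which bound is used), that lower confidence bound exploration picks the action with the largest lower value bound, and that action scoping discards actions whose upper bound falls below the best lower bound. The names `hoeffding_intervals`, `lcb_action`, `scope_actions`, and the parameter `delta` are hypothetical.

```python
import math


def hoeffding_intervals(successor_counts, delta=0.05):
    """Turn successor counts for one state-action pair into probability intervals.

    Assumption: a two-sided Hoeffding bound with error probability `delta`.
    In a gray-box MDP the successors are known, so unseen ones appear with count 0.
    """
    n = sum(successor_counts.values())
    if n == 0:
        raise ValueError("at least one sample is required")
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return {
        succ: (max(0.0, count / n - eps), min(1.0, count / n + eps))
        for succ, count in successor_counts.items()
    }


def lcb_action(value_bounds):
    """Pick the action with the largest lower value bound (assumed LCB rule)."""
    return max(value_bounds, key=lambda action: value_bounds[action][0])


def scope_actions(value_bounds):
    """Keep only actions whose upper bound reaches the best lower bound (assumed scoping rule)."""
    best_lower = max(lower for lower, _ in value_bounds.values())
    return {action for action, (_, upper) in value_bounds.items() if upper >= best_lower}


# Hypothetical data: 100 samples of one state-action pair with three known successors.
intervals = hoeffding_intervals({"s1": 30, "s2": 70, "s3": 0})
# Hypothetical value intervals for three actions in some state.
bounds = {"a": (0.2, 0.9), "b": (0.5, 0.6), "c": (0.1, 0.4)}
chosen = lcb_action(bounds)
kept = scope_actions(bounds)
```

The intervals shrink at rate 1/sqrt(n) as more samples are collected. Under the assumed scoping rule, action `c` is dropped because even its optimistic value cannot match the pessimistic value of `b`.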
The authors are supported by the DFG through the Cluster of Excellence EXC 2050/1 (CeTI, project ID 390696704, as part of Germany’s Excellence Strategy) and the TRR 248 (see https://perspicuous-computing.science, project ID 389792660).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Baier, C., Dubslaff, C., Wienhöft, P., Kiebel, S.J. (2023). Strategy Synthesis in Markov Decision Processes Under Limited Sampling Access. In: Rozier, K.Y., Chaudhuri, S. (eds) NASA Formal Methods. NFM 2023. Lecture Notes in Computer Science, vol 13903. Springer, Cham. https://doi.org/10.1007/978-3-031-33170-1_6
DOI: https://doi.org/10.1007/978-3-031-33170-1_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-33169-5
Online ISBN: 978-3-031-33170-1
eBook Packages: Computer Science, Computer Science (R0)