Strategy Synthesis in Markov Decision Processes Under Limited Sampling Access

Baier, Christel; Dubslaff, Clemens; Wienhöft, Patrick; Kiebel, Stefan J.

doi:10.1007/978-3-031-33170-1_6


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13903)

Included in the following conference series:

NASA Formal Methods Symposium


Abstract

A central task in control theory, artificial intelligence, and formal methods is to synthesize reward-maximizing strategies for agents that operate in partially unknown environments. In environments modeled by gray-box Markov decision processes (MDPs), the impact of the agents’ actions is known in terms of successor states, but the transition probabilities are not. In this paper, we devise a strategy synthesis algorithm for gray-box MDPs via reinforcement learning that uses interval MDPs as its internal model. To cope with limited sampling access in reinforcement learning, we incorporate two novel concepts into our algorithm, focusing on rapid and successful learning rather than on stochastic guarantees and optimality: lower confidence bound exploration reinforces variants of already learned practical strategies, and action scoping reduces the learning action space to promising actions. We illustrate the benefits of our algorithms by means of a prototypical implementation applied to examples from the AI and formal methods communities.
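To make the idea of an interval MDP learned from samples more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the successor states of a state-action pair are known (the gray-box setting) and that a transition probability interval is obtained from sample counts with Hoeffding's inequality. The function names, the parameter `delta`, and the per-transition use of the bound without any correction across transitions are assumptions made for illustration only.

```python
import math
from collections import Counter


def hoeffding_radius(n: int, delta: float) -> float:
    """Half-width eps such that P(|p_hat - p| >= eps) <= delta after n samples.

    Follows from Hoeffding's inequality: 2 * exp(-2 * n * eps^2) = delta.
    With no samples the estimate is uninformative, so the radius is 1.
    """
    if n <= 0:
        return 1.0
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def interval_estimates(successors, samples, delta=0.1):
    """Estimate a probability interval for each known successor state.

    successors: known successor states of one state-action pair (assumed given).
    samples:    observed successor states from sampling that pair.
    delta:      hypothetical per-transition error tolerance.
    """
    counts = Counter(samples)
    n = len(samples)
    eps = hoeffding_radius(n, delta)
    intervals = {}
    for s in successors:
        p_hat = counts[s] / n if n else 0.0
        # Clamp to [0, 1] so the bounds remain valid probabilities.
        intervals[s] = (max(0.0, p_hat - eps), min(1.0, p_hat + eps))
    return intervals


if __name__ == "__main__":
    observed = ["s1"] * 75 + ["s2"] * 25
    print(interval_estimates(["s1", "s2"], observed, delta=0.1))
```

The intervals shrink as more samples are collected. A learner in the spirit of the abstract could then choose actions based on the lower or upper ends of the resulting value bounds, although how the paper does this in detail is not stated here.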

The authors are supported by the DFG through the Cluster of Excellence EXC 2050/1 (CeTI, project ID 390696704, as part of Germany’s Excellence Strategy) and the TRR 248 (see https://perspicuous-computing.science, project ID 389792660).

Author information

Authors and Affiliations

Centre for Tactile Internet with Human-in-the-Loop (CeTI), Dresden, Germany
Christel Baier, Clemens Dubslaff, Patrick Wienhöft & Stefan J. Kiebel
Department of Computer Science, Technische Universität Dresden, Dresden, Germany
Christel Baier & Patrick Wienhöft
Department of Psychology, Technische Universität Dresden, Dresden, Germany
Stefan J. Kiebel
Eindhoven University of Technology, Eindhoven, The Netherlands
Clemens Dubslaff

Corresponding author

Correspondence to Patrick Wienhöft.

Editor information

Editors and Affiliations

Iowa State University, Ames, IA, USA
Kristin Yvonne Rozier
University of Texas at Austin, Austin, TX, USA
Swarat Chaudhuri

About this paper

Cite this paper

Baier, C., Dubslaff, C., Wienhöft, P., Kiebel, S.J. (2023). Strategy Synthesis in Markov Decision Processes Under Limited Sampling Access. In: Rozier, K.Y., Chaudhuri, S. (eds) NASA Formal Methods. NFM 2023. Lecture Notes in Computer Science, vol 13903. Springer, Cham. https://doi.org/10.1007/978-3-031-33170-1_6

DOI: https://doi.org/10.1007/978-3-031-33170-1_6
Published: 03 June 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-33169-5
Online ISBN: 978-3-031-33170-1
eBook Packages: Computer Science, Computer Science (R0)
