
Teaching Stratego to Play Ball: Optimal Synthesis for Continuous Space MDPs

  • Conference paper
  • Published in: Automated Technology for Verification and Analysis (ATVA 2019)

Abstract

Formal models of cyber-physical systems, such as priced timed Markov decision processes, require a state space with continuous and discrete components. The problem of controller synthesis for such systems can then be cast as finding optimal strategies for Markov decision processes over a Euclidean state space. We develop two different reinforcement learning strategies that tackle the problem of continuous state spaces via online partition refinement techniques. We provide theoretical insights into the convergence of partition refinement schemes. Our techniques are implemented in Uppaal Stratego. Experimental results show the advantages of our new techniques over the previous optimization algorithms of Uppaal Stratego.
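To make the setting concrete, below is a minimal, hypothetical Python sketch of tabular Q-learning over a fixed discretization of a one-dimensional continuous state space. It is not the paper's partition-refinement algorithm; the environment function step, the cell layout, and all parameter values are assumptions made purely for illustration of how continuous states can be mapped to cells before standard reinforcement learning is applied.

    import random

    def make_partition(low, high, cells):
        # Split [low, high) into equally sized cells and return the boundaries.
        width = (high - low) / cells
        return [low + i * width for i in range(cells + 1)]

    def cell_index(bounds, s):
        # Index of the partition cell containing the continuous state s.
        for i in range(len(bounds) - 1):
            if bounds[i] <= s < bounds[i + 1]:
                return i
        return len(bounds) - 2  # clamp states at or beyond the upper bound

    def q_learn(step, bounds, actions, episodes=1000, horizon=200,
                alpha=0.1, gamma=0.95, eps=0.1):
        # Tabular Q-learning whose abstract states are the partition cells.
        # `step(s, a)` is an assumed environment returning (next_state, cost, done);
        # costs are minimised, mirroring the cost-optimal synthesis setting.
        Q = {(c, a): 0.0 for c in range(len(bounds) - 1) for a in actions}
        for _ in range(episodes):
            s = random.uniform(bounds[0], bounds[-1])
            for _ in range(horizon):
                c = cell_index(bounds, s)
                if random.random() < eps:
                    a = random.choice(actions)                 # explore
                else:
                    a = min(actions, key=lambda b: Q[(c, b)])  # exploit (min cost)
                s, cost, done = step(s, a)
                c2 = cell_index(bounds, s)
                target = cost if done else cost + gamma * min(Q[(c2, b)] for b in actions)
                Q[(c, a)] += alpha * (target - Q[(c, a)])
                if done:
                    break
        return Q

A refinement-based scheme of the kind studied in the paper would instead start from a coarse partition and split cells online where the learned values disagree; the fixed grid above is only the simplest baseline.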


Notes

  1. We shall assume that the equation system has a solution for the considered MDP and goal set under any strategy.

  2. \(\mathcal{A}\) is consistent with \(\mathcal{G}\) if for any \(\nu \in \mathcal{A}\) either \(\nu \subseteq \mathcal{G}\) or \(\nu \cap \mathcal{G} = \emptyset\).

  3. Here \(\mathcal{C}_\mathcal{A}^{\inf}(\nu, a, \nu') = \inf_{s \in \nu,\, s' \in \nu'} \mathcal{C}(s, a, s')\); a small illustrative sketch of notes 2 and 3 follows this list.

  4. Notice that Definition 1 is only applicable to integrable transition functions, which voids most statistical assumptions, including normality. We thus merely provide heuristics.

  5. http://doi.org/10.5281/zenodo.3268381.

  6. http://doi.org/10.5281/zenodo.3252096.

  7. http://doi.org/10.5281/zenodo.3252098.
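For intuition only, the following Python sketch spells out notes 2 and 3 under the simplifying assumption that cells and the goal set are half-open intervals [lo, hi) of the real line; the function names and the sampled approximation of the infimum are illustrative choices, not the paper's implementation.

    def consistent(partition, goal):
        # Note 2: every cell must be contained in the goal interval or disjoint from it.
        # `partition` is a list of (lo, hi) cells, `goal` a single (lo, hi) interval.
        g_lo, g_hi = goal
        for lo, hi in partition:
            inside = g_lo <= lo and hi <= g_hi
            disjoint = hi <= g_lo or lo >= g_hi
            if not (inside or disjoint):
                return False
        return True

    def cell_cost_inf(cost, cell, action, cell2, samples=100):
        # Note 3: C_A^inf(v, a, v') = inf over s in v, s' in v' of C(s, a, s').
        # Approximated here by minimising over a uniform grid of sample points.
        (lo, hi), (lo2, hi2) = cell, cell2
        pts = [lo + i * (hi - lo) / (samples - 1) for i in range(samples)]
        pts2 = [lo2 + i * (hi2 - lo2) / (samples - 1) for i in range(samples)]
        return min(cost(s, action, s2) for s in pts for s2 in pts2)

For example, consistent([(0, 1), (1, 2)], (1, 2)) holds, whereas adding a cell (0.5, 1.5) straddling the goal boundary would violate consistency.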


Acknowledgements

This work is partly supported by the Innovation Fund Denmark center DiCyPS, the ERC Advanced Grant LASSO, and the JST ERATO project: HASUO Metamathematics for Systems Design (JPMJER1603).

Author information


Corresponding author

Correspondence to Peter Gjøl Jensen.



Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Jaeger, M., Jensen, P.G., Guldstrand Larsen, K., Legay, A., Sedwards, S., Taankvist, J.H. (2019). Teaching Stratego to Play Ball: Optimal Synthesis for Continuous Space MDPs. In: Chen, Y.-F., Cheng, C.-H., Esparza, J. (eds.) Automated Technology for Verification and Analysis. ATVA 2019. Lecture Notes in Computer Science, vol. 11781. Springer, Cham. https://doi.org/10.1007/978-3-030-31784-3_5

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-31784-3_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-31783-6

  • Online ISBN: 978-3-030-31784-3

  • eBook Packages: Computer Science, Computer Science (R0)
