
Teaching Stratego to Play Ball: Optimal Synthesis for Continuous Space MDPs

  • Conference paper
  • Published in: Automated Technology for Verification and Analysis (ATVA 2019)

Abstract

Formal models of cyber-physical systems, such as priced timed Markov decision processes, require a state space with continuous and discrete components. The problem of controller synthesis for such systems can then be cast as finding optimal strategies for Markov decision processes over a Euclidean state space. We develop two different reinforcement learning strategies that tackle the problem of continuous state spaces via online partition refinement techniques. We provide theoretical insights into the convergence of partition refinement schemes. Our techniques are implemented in Uppaal Stratego. Experimental results show the advantages of our new techniques over the previous optimization algorithms of Uppaal Stratego.
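To make the setting concrete, below is a minimal, hypothetical Python sketch of tabular Q-learning over a fixed discretization of a one-dimensional continuous state space. It is not the paper's partition-refinement algorithm; the environment function step, the cell layout, and all parameter values are assumptions made purely for illustration of how continuous states can be mapped to cells before standard reinforcement learning is applied.

    import random

    def make_partition(low, high, cells):
        # Split [low, high) into equally sized cells and return the boundaries.
        width = (high - low) / cells
        return [low + i * width for i in range(cells + 1)]

    def cell_index(bounds, s):
        # Index of the partition cell containing the continuous state s.
        for i in range(len(bounds) - 1):
            if bounds[i] <= s < bounds[i + 1]:
                return i
        return len(bounds) - 2  # clamp states at or beyond the upper bound

    def q_learn(step, bounds, actions, episodes=1000, horizon=200,
                alpha=0.1, gamma=0.95, eps=0.1):
        # Tabular Q-learning whose abstract states are the partition cells.
        # `step(s, a)` is an assumed environment returning (next_state, cost, done);
        # costs are minimised, mirroring the cost-optimal synthesis setting.
        Q = {(c, a): 0.0 for c in range(len(bounds) - 1) for a in actions}
        for _ in range(episodes):
            s = random.uniform(bounds[0], bounds[-1])
            for _ in range(horizon):
                c = cell_index(bounds, s)
                if random.random() < eps:
                    a = random.choice(actions)                 # explore
                else:
                    a = min(actions, key=lambda b: Q[(c, b)])  # exploit (min cost)
                s, cost, done = step(s, a)
                c2 = cell_index(bounds, s)
                target = cost if done else cost + gamma * min(Q[(c2, b)] for b in actions)
                Q[(c, a)] += alpha * (target - Q[(c, a)])
                if done:
                    break
        return Q

A refinement-based scheme of the kind studied in the paper would instead start from a coarse partition and split cells online where the learned values disagree; the fixed grid above is only the simplest baseline.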


Notes

  1. We shall assume that the equation system has a solution for the considered MDP and goal set under any strategy.

  2. \(\mathcal{A}\) is consistent with \(\mathcal{G}\) if for any \(\nu \in \mathcal{A}\) either \(\nu \subseteq \mathcal{G}\) or \(\nu \cap \mathcal{G} = \emptyset\).

  3. Here \(\mathcal{C}_\mathcal{A}^{\inf}(\nu, a, \nu') = \inf_{s \in \nu,\, s' \in \nu'} \mathcal{C}(s, a, s')\); a small illustrative sketch of notes 2 and 3 follows this list.

  4. Notice that Definition 1 is only applicable to integrable transition functions, which voids most statistical assumptions, including normality. We thus merely provide heuristics.

  5. http://doi.org/10.5281/zenodo.3268381.

  6. http://doi.org/10.5281/zenodo.3252096.

  7. http://doi.org/10.5281/zenodo.3252098.
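For intuition only, the following Python sketch spells out notes 2 and 3 under the simplifying assumption that cells and the goal set are half-open intervals [lo, hi) of the real line; the function names and the sampled approximation of the infimum are illustrative choices, not the paper's implementation.

    def consistent(partition, goal):
        # Note 2: every cell must be contained in the goal interval or disjoint from it.
        # `partition` is a list of (lo, hi) cells, `goal` a single (lo, hi) interval.
        g_lo, g_hi = goal
        for lo, hi in partition:
            inside = g_lo <= lo and hi <= g_hi
            disjoint = hi <= g_lo or lo >= g_hi
            if not (inside or disjoint):
                return False
        return True

    def cell_cost_inf(cost, cell, action, cell2, samples=100):
        # Note 3: C_A^inf(v, a, v') = inf over s in v, s' in v' of C(s, a, s').
        # Approximated here by minimising over a uniform grid of sample points.
        (lo, hi), (lo2, hi2) = cell, cell2
        pts = [lo + i * (hi - lo) / (samples - 1) for i in range(samples)]
        pts2 = [lo2 + i * (hi2 - lo2) / (samples - 1) for i in range(samples)]
        return min(cost(s, action, s2) for s in pts for s2 in pts2)

For example, consistent([(0, 1), (1, 2)], (1, 2)) holds, whereas adding a cell (0.5, 1.5) straddling the goal boundary would violate consistency.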


Acknowledgements

This work is partly supported by the Innovation Fund Denmark center DiCyPS, the ERC Advanced Grant LASSO, and the JST ERATO project: HASUO Metamathematics for Systems Design (JPMJER1603).

Author information


Corresponding author

Correspondence to Peter Gjøl Jensen.



Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Jaeger, M., Jensen, P.G., Guldstrand Larsen, K., Legay, A., Sedwards, S., Taankvist, J.H. (2019). Teaching Stratego to Play Ball: Optimal Synthesis for Continuous Space MDPs. In: Chen, Y.-F., Cheng, C.-H., Esparza, J. (eds.) Automated Technology for Verification and Analysis. ATVA 2019. Lecture Notes in Computer Science, vol. 11781. Springer, Cham. https://doi.org/10.1007/978-3-030-31784-3_5

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-31784-3_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-31783-6

  • Online ISBN: 978-3-030-31784-3

  • eBook Packages: Computer Science, Computer Science (R0)
