
Safe Policy Improvement in Constrained Markov Decision Processes

  • Conference paper
Leveraging Applications of Formal Methods, Verification and Validation. Verification Principles (ISoLA 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13701)

Abstract

The automatic synthesis of a policy through reinforcement learning (RL) from a given set of formal requirements depends on the construction of a reward signal and consists of the iterative application of many policy-improvement steps. The synthesis algorithm has to balance target, safety, and comfort requirements in a single objective and to guarantee that the policy improvement does not increase the number of safety-requirement violations, which is especially important for safety-critical applications. In this work, we present a solution to the synthesis problem by addressing its two main challenges: reward shaping from a set of formal requirements and safe policy update. For the first, we propose an automatic reward-shaping procedure that defines a scalar reward signal compliant with the task specification. For the second, we introduce an algorithm that ensures the policy is improved in a safe fashion, with high-confidence guarantees. We also discuss the adoption of a model-based RL algorithm to use the collected data efficiently and train a model-free agent on the predicted trajectories, where safety violations do not have the same impact as in the real world. Finally, we demonstrate on standard control benchmarks that the resulting learning procedure is effective and robust even under heavy perturbations of the hyperparameters.
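A note on the second challenge: safe policy update with high-confidence guarantees is typically realized by evaluating each candidate policy off-policy, on data collected under the currently deployed policy, and accepting the update only if a high-confidence lower bound on the candidate's return exceeds the current return. The Python sketch below illustrates this pattern under simple assumptions (per-episode importance sampling and a one-sided Student-t bound); it is illustrative only and not the authors' exact procedure.

```python
import numpy as np
from scipy import stats


def importance_weighted_returns(episodes, candidate_logp, behavior_logp):
    """Per-episode importance-sampling estimates of the candidate policy's return,
    computed from trajectories gathered under the behavior (current) policy.

    episodes: iterable of (states, actions, rewards) trajectories.
    candidate_logp, behavior_logp: functions (state, action) -> log pi(a | s).
    """
    estimates = []
    for states, actions, rewards in episodes:
        # Product of per-step likelihood ratios, accumulated in log space.
        log_ratio = sum(candidate_logp(s, a) - behavior_logp(s, a)
                        for s, a in zip(states, actions))
        estimates.append(np.exp(log_ratio) * np.sum(rewards))
    return np.asarray(estimates)


def accept_candidate(episodes, candidate_logp, behavior_logp,
                     baseline_return, delta=0.05):
    """Accept the candidate policy only if a (1 - delta) one-sided lower
    confidence bound on its estimated return beats the baseline return of
    the currently deployed policy."""
    g = importance_weighted_returns(episodes, candidate_logp, behavior_logp)
    n = len(g)
    lower_bound = g.mean() - stats.t.ppf(1.0 - delta, n - 1) * g.std(ddof=1) / np.sqrt(n)
    return lower_bound >= baseline_return
```

In the model-based setting mentioned in the abstract, the same check can be applied to trajectories predicted by the learned dynamics model before a candidate is deployed, so that rejected candidates never act in the real environment, where safety violations are costly.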



Acknowledgement

Luigi Berducci is supported by the Doctoral College Resilient Embedded Systems. This work has received funding from the Austrian FFG-ICT project ADEX.

Author information

Corresponding author

Correspondence to Luigi Berducci.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Berducci, L., Grosu, R. (2022). Safe Policy Improvement in Constrained Markov Decision Processes. In: Margaria, T., Steffen, B. (eds) Leveraging Applications of Formal Methods, Verification and Validation. Verification Principles. ISoLA 2022. Lecture Notes in Computer Science, vol 13701. Springer, Cham. https://doi.org/10.1007/978-3-031-19849-6_21

  • DOI: https://doi.org/10.1007/978-3-031-19849-6_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19848-9

  • Online ISBN: 978-3-031-19849-6

  • eBook Packages: Computer Science, Computer Science (R0)
