
Planning using hierarchical constrained Markov decision processes

Abstract

Constrained Markov decision processes offer a principled method to determine policies for sequential stochastic decision problems where multiple costs are concurrently considered. Although they could be very valuable in numerous robotic applications, to date their use has been quite limited. Among the reasons for their limited adoption is their computational complexity, since policy computation requires the solution of constrained linear programs with an extremely large number of variables. To overcome this limitation, we propose a hierarchical method to solve large problem instances. States are clustered into macro states and the parameters defining the dynamic behavior and the costs of the clustered model are determined using a Monte Carlo approach. We show that the algorithm we propose to create clustered states maintains valuable properties of the original model, like the existence of a solution for the problem. Our algorithm is validated in various planning problems in simulation and on a mobile robot platform, and we experimentally show that the clustered approach significantly outperforms the non-hierarchical solution while experiencing only moderate losses in terms of objective functions.
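The policies mentioned above are obtained by solving a constrained linear program over occupation measures (Eq. (5) in the paper). For illustration only, the following is a minimal sketch of such a program for a small finite CMDP with a goal set, writing \(\rho(x,u)\) for the occupation measure of the state/action pair (x, u); it uses a generic occupation-measure formulation, and the function name, argument layout, and exact constraints are illustrative assumptions that need not match Eq. (5).

```python
# Minimal sketch of an occupation-measure LP for a small constrained MDP
# (generic total-cost formulation; illustrative only, not the paper's Eq. (5)).
import numpy as np
from scipy.optimize import linprog

def solve_cmdp(P, c, d, D, beta, goal):
    """P[u][x, y]: transition probability from x to y under action u,
    c[x, u]: primary cost, d[i][x, u]: i-th secondary cost, D[i]: bound,
    beta[x]: initial distribution, goal: set of absorbing goal states."""
    n_x, n_u = c.shape
    trans = [x for x in range(n_x) if x not in goal]      # transient (non-goal) states
    pairs = [(x, u) for x in trans for u in range(n_u)]   # decision variables rho(x, u)

    # Objective: expected cumulative primary cost.
    obj = np.array([c[x, u] for (x, u) in pairs])

    # Flow conservation: out-flow of y minus expected in-flow equals beta[y].
    A_eq = np.zeros((len(trans), len(pairs)))
    b_eq = np.array([beta[y] for y in trans])
    for r, y in enumerate(trans):
        for k, (x, u) in enumerate(pairs):
            A_eq[r, k] = (1.0 if x == y else 0.0) - P[u][x, y]

    # Secondary-cost constraints: sum_{x,u} d_i(x,u) * rho(x,u) <= D_i.
    A_ub = np.array([[d[i][x, u] for (x, u) in pairs] for i in range(len(D))])
    b_ub = np.array(D)

    return linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                   bounds=[(0, None)] * len(pairs))
```

From a feasible solution, a (possibly randomized) policy can be recovered as \(\pi(u\mid x) = \rho(x,u)/\sum_{u'}\rho(x,u')\) wherever the denominator is positive.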


Notes

  1. Note that even if the policy is deterministic, the action at time t is a random variable, since it is a function of the random variable \(X_t\).

  2. To the best of our knowledge no method has been proposed to analytically estimate costs and probabilities.

  3. The set of paths defines a policy because for each vertex it identifies an edge to traverse along the shortest path, and by construction this edge is associated with an action.

  4. For states close to the boundary or to an obstacle, the action set is adjusted by removing actions that would violate these constraints.

  5. This means that if \(S_i = S_{i+1}\) we remove the latter from the sequence and we reiterate this step until \(S_i\ne S_{i+1}\) for all symbols left in the sequence.
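For concreteness, the collapse step described in footnote 5 amounts to removing repeated consecutive symbols; a minimal sketch (illustrative helper name, not from the paper):

```python
# Drop repeated consecutive macrostates so no two adjacent symbols are equal.
from itertools import groupby

def collapse_repeats(macro_sequence):
    return [s for s, _ in groupby(macro_sequence)]

# e.g. collapse_repeats(['S1', 'S1', 'S2', 'S2', 'S3']) -> ['S1', 'S2', 'S3']
```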

References

  • Altman, E. (1999). Constrained Markov decision processes. Boca Raton: CRC Press.

  • Bai, A., Wu, F., & Chen, X. (2012). Online planning for large MDPs with MAXQ decomposition. In Proceedings of the 11th international conference on autonomous agents and multiagent systems (Vol. 3, pp. 1215–1216).

  • Barry, J., Kaelbling, L. P., & Lozano-Pérez, T. (2010). Hierarchical solution of large Markov decision processes. Technical report, MIT.

  • Barry, J. L., Kaelbling, L. P., & Lozano-Pérez, T. (2011). DetH*: Approximate hierarchical solution of large Markov decision processes. In International joint conference on artificial intelligence (IJCAI).

  • Bertsekas, D. P. (2005). Dynamic programming and optimal control (Vol. 1, 2). Belmont, MA: Athena Scientific.

  • Bouvrie, J., & Maggioni, M. (2012). Efficient solution of Markov decision problems with multiscale representations. In 2012 50th annual Allerton conference on communication, control, and computing (Allerton) (pp. 474–481). IEEE.

  • Carpin, S., Pavone, M., & Sadler, B. M. (2014). Rapid multirobot deployment with time constraints. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (pp. 1147–1154).

  • Chow, Y.-L., Pavone, M., Sadler, B. M., & Carpin, S. (2015). Trading safety versus performance: rapid deployment of robotic swarms with robust performance constraints. ASME Journal of Dynamic Systems, Measurement and Control, 137(3), 031005-1–031005-11.

  • Dai, P., & Goldsmith, J. (2007). Topological value iteration algorithm for Markov decision processes. In Proceedings of the international joint conference on artificial intelligence (pp. 1860–1865).

  • Dai, P., Mausam, M., & Weld, D. S. (2009). Focused topological value iteration. In International conference on automated planning and scheduling.

  • Dai, P., Mausam, M., Weld, D. S., & Goldsmith, J. (2011). Topological value iteration algorithms. Journal of Artificial Intelligence Research, 42(1), 181–209.

  • Ding, X. C., Englot, B., Pinto, A., Speranzon, A., & Surana, A. (2014). Hierarchical multi-objective planning: From mission specifications to contingency management. In 2014 IEEE international conference on robotics and automation (ICRA) (pp. 3735–3742). IEEE.

  • Ding, X. C., Pinto, A., & Surana, A. (2013). Strategic planning under uncertainties via constrained Markov decision processes. In Proceedings of the IEEE international conference on robotics and automation (pp. 4568–4575).

  • El Chamie, M., & Açikmeşe, B. (2016). Convex synthesis of optimal policies for Markov decision processes with sequentially-observed transitions. In Proceedings of the American control conference (pp. 3862–3867).

  • Feyzabadi, S., & Carpin, S. (2014). Risk aware path planning using hierarchical constrained Markov decision processes. In Proceedings of the IEEE international conference on automation science and engineering (pp. 297–303).

  • Feyzabadi, S., & Carpin, S. (2015). HCMDP: A hierarchical solution to constrained Markov decision processes. In Proceedings of the IEEE international conference on robotics and automation (pp. 3791–3798).

  • Grisetti, G., Stachniss, C., & Burgard, W. (2007). Improved techniques for grid mapping with Rao-Blackwellized particle filters. IEEE Transactions on Robotics, 23(1), 36–46.

  • Hauskrecht, M., Meuleau, N., Kaelbling, L. P., Dean, T., & Boutilier, C. (1998). Hierarchical solution of Markov decision processes using macro-actions. In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence (pp. 220–229). Morgan Kaufmann Publishers.

  • Hoey, J., St-Aubin, R., Hu, A. J., & Boutilier, C. (1999). SPUDD: Stochastic planning using decision diagrams. In Proceedings of uncertainty in artificial intelligence (pp. 279–288).

  • Karaman, S., & Frazzoli, E. (2011). Sampling-based algorithms for optimal motion planning. International Journal of Robotics Research, 30(7), 846–894.

  • Kavraki, L. E., Švestka, P., Latombe, J. C., & Overmars, M. H. (1996). Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 12(4), 566–580.

  • Kochenderfer, M. J. (2015). Decision making under uncertainty: Theory and application. Cambridge: MIT Press.

  • LaValle, S. M. (2006). Planning algorithms. Cambridge: Cambridge University Press.

  • LaValle, S. M., & Kuffner, J. J. (2001). Randomized kinodynamic planning. International Journal of Robotics Research, 20(5), 378–400.

  • Moldovan, T. M., & Abbeel, P. (2012). Risk aversion in Markov decision processes via near optimal Chernoff bounds. In NIPS (pp. 3140–3148).

  • Pineau, J., Roy, N., & Thrun, S. (2001). A hierarchical approach to POMDP planning and execution. In Workshop on hierarchy and memory in reinforcement learning (ICML) (Vol. 65, p. 51).

  • Puterman, M. L. (2005). Markov decision processes: Discrete stochastic dynamic programming. Hoboken: Wiley-Interscience.

  • Thrun, S., Burgard, W., & Fox, D. (2006). Probabilistic robotics. Cambridge: MIT Press.

  • Vien, N. A., & Toussaint, M. (2015). Hierarchical Monte-Carlo planning. In AAAI (pp. 3613–3619).

Acknowledgements

This paper extends preliminary results presented in Feyzabadi and Carpin (2015). This work is supported by the National Institute of Standards and Technology under cooperative agreement 70NANB12H143. Any opinions, findings, and conclusions or recommendations expressed in these materials are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the funding agencies of the U.S. Government.

Author information

Corresponding author

Correspondence to Stefano Carpin.

Appendix

Proof of Theorem 1

Definition 2 establishes two conditions for saying that an HCMDP preserves connectivity. The first requires that \(\mathcal {X}_H\) is a partition of \(\mathcal {X}\). Algorithm 1 never considers a state twice, i.e., once a state has been assigned to a cluster it will not be considered again for assignment (line 4). Moreover, the main loop ensures that all states in \(\mathcal {X}\) are assigned to a cluster. Therefore, \(\mathcal {X}_H\) is a partition of \(\mathcal {X}\).
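As an illustration of this first condition, a minimal check that a candidate clustering is a partition of \(\mathcal {X}\) could look as follows (names are illustrative and not taken from Algorithm 1):

```python
# Check that the clusters are pairwise disjoint and jointly cover all of X.
def is_partition(X, X_H):
    """X: iterable of states; X_H: iterable of clusters (collections of states)."""
    seen = set()
    for cluster in X_H:
        cluster = set(cluster)
        if seen & cluster:          # clusters must be pairwise disjoint
            return False
        seen |= cluster
    return seen == set(X)           # clusters must cover every state in X
```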

We next turn to the second condition. Let \(z \in \mathcal {X}'\) and \(y\in M\) be two states such that \(z\leadsto y\). By definition this means that there exists a sequence of states \(\mathcal {S}=s_1,s_2,\dots ,s_n\) such that \(s_1=z\), \(s_n = y\), and for each \(1\le i\le n-1\) we have \(P_{s_i,s_{i+1}}^{u_i}>0\) for some \(u_i\in U(s_i)\). Since \(\mathcal {X}_H\) is a partition of \(\mathcal {X}\), this sequence of states is associated with a sequence of macrostates \(Z_H=S_1\dots S_n=Y_H\) such that \(s_i \in S_i\) for each i. Note that in general there could be some repeated elements in the sequence of macrostates. Let \(S_1,\dots S_k\) \((k\le n)\) be the sequence obtained by removing subsequences of repeated macrostates.Footnote 5 First note that this sequence includes at least two elements. This is true because we started assuming \(z\notin M\) while \(y\in M\). According to Algorithm 1 all and only the states in M are mapped to an individual macrostate (line 1), so y cannot be in the same macrostate as z. Next, consider two successive elements in the sequence of macrostates, say \(S_i\) and \(S_{i+1}\). By construction, there exist two successive states in \(\mathcal {S}\), say \(s_j\) and \(s_{j+1}\), such that \(s_j \in S_i\) and \(s_{j+1} \in S_{i+1}\). Since these two states are part of \(\mathcal {S}\), there exists one input \(u_j\in U(s_j)\) such that \(P_{s_j,s_{j+1}}^{u_j}>0\). As per Eq. (7), this implies that an action \(S_{i+1}\) is added to the set of actions \(U(S_i)\). Next, consider the method described in Sect. 4.3, and in particular the definition of the boundary B between two macro states. It follows that \(s_{j+1} \in B_{S_i,S_{i+1}}\). The algorithm then computes the shortest path between each state in \(S_i\) and B, where the shortest path is computed over the induced graph G. For \(s_j\) the path trivially consists of a single edge to \(s_{j+1}\) (or to some other vertex in B that is also one hop away from \(s_j\)). Next, the algorithm randomly selects one vertex from \(S_i\) using a uniform distribution and executes the policy to reach B. Let m be the total number of Monte Carlo samples generated. Then the probability that the estimate of \(P_{S_i,S_{i+1}}^{S_{i+1}}\) is 0 is bounded from above by

$$\begin{aligned} (1-\gamma )^{k_1}\left( 1-P_{s_j,s_{j+1}}^{u_j}\right) ^{k_2} \end{aligned}$$

where \(\gamma = \frac{1}{|S_i|}\), \(k_1\) is the number of times \(s_j\) was not sampled and \(k_2\) is the number of times \(s_j\) was sampled (\(k_1+k_2 =m\), \(k_{1,2}\ge 0\)). This proves that as the total number of samples m grows, the estimate for \(P_{S_i,S_{i+1}}^{S_{i+1}}\) will eventually be positive. This reasoning can be repeated for each pair of successive macro states, thus showing that \(Z_H\leadsto Y_H\), and this concludes the proof. \(\square \)
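The Monte Carlo estimate reasoned about in this proof can be sketched as follows; `policy`, `sample_next_state`, and `macro_of` stand in for the shortest-path policy over the induced graph G, a one-step simulator of the original CMDP, and the cluster lookup, respectively, and are illustrative assumptions rather than the paper's code.

```python
# Estimate macro-level transition probabilities out of macrostate S_i by
# rolling out the shortest-path policy from uniformly sampled start states.
import random
from collections import Counter

def estimate_macro_transitions(S_i, policy, sample_next_state, macro_of,
                               m=1000, max_steps=10000):
    counts = Counter()
    for _ in range(m):
        s = random.choice(list(S_i))        # uniform start state in S_i
        for _ in range(max_steps):
            u = policy(s)                   # action along the shortest path to the boundary
            s = sample_next_state(s, u)     # one stochastic transition of the original CMDP
            if s not in S_i:                # boundary crossed: record the macrostate entered
                counts[macro_of(s)] += 1
                break
    total = sum(counts.values())
    # Empirical estimate of P_{S_i, S'} for each neighbouring macrostate S'.
    return {S_next: k / total for S_next, k in counts.items()} if total else {}
```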

Proof of Theorem 2

We start by observing that Algorithm 3 builds and solves a sequence of HCMDPs. Each is a CMDP with a suitable set of parameters, and at every iteration the constrained linear program given in Eq. (5) is solved. Theorem 1 guarantees that state \(M_H\) is accessible from every macrostate, and therefore there exists at least one policy \(\pi '\) for which \(c(\pi ')\) is finite. Let us next consider the inequality constraints in Eq. (5). If the linear program is not feasible, then each bound \(D_{i,H}\) is increased by \(\varDelta D_{i,H}\) (line 6). By construction, all costs \(d_{i,H}(x,u)\ge 0\) for each state/action pair (x, u). Let \(n_s = |\mathcal {K}_H'|\) be the number of state/action pairs in the HCMDP, \(d_{max} = \max _{(x,u)\in \mathcal {K}_H'} \{d_{i,H}(x,u)\}\) the largest among the additional costs, and \(D_{min} = \min \{\varDelta D_{i,H}\}\) the smallest among the increments in line 6. Therefore, after at most \(\lceil \frac{n_sd_{max}}{D_{min}}\rceil \) iterations all inequality constraints become feasible. \(\square \)
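The relaxation argument corresponds to a simple loop around the LP solver. The sketch below is an illustration of that argument, not the paper's Algorithm 3; it reuses the hypothetical `solve_cmdp` sketched after the abstract.

```python
# Solve the constrained LP; while infeasible, enlarge every secondary-cost
# bound D_i by its increment Delta D_i and retry (cf. line 6 of Algorithm 3).
def solve_with_relaxation(P, c, d, D, delta_D, beta, goal, max_iters=1000):
    D = list(D)
    for _ in range(max_iters):
        res = solve_cmdp(P, c, d, D, beta, goal)        # hypothetical LP solver from the earlier sketch
        if res.success:                                 # feasible: occupation measures found
            return res, D
        D = [Di + dDi for Di, dDi in zip(D, delta_D)]   # relax every bound
    raise RuntimeError("constraints still infeasible after max_iters relaxations")
```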

About this article

Cite this article

Feyzabadi, S., Carpin, S. Planning using hierarchical constrained Markov decision processes. Auton Robot 41, 1589–1607 (2017). https://doi.org/10.1007/s10514-017-9630-4
