DynoPlan: Combining Motion Planning and Deep Neural Network based Controllers for Safe HRL

Daniel Angelov; Subramanian Ramamoorthy; Yordan Hristov

arxiv: 1906.10099 · v1 · pith:5Y63SHD6new · submitted 2019-06-24 · 💻 cs.RO

Pith reviewed 2026-05-25 17:16 UTC · model grok-4.3

classification 💻 cs.RO

keywords hierarchical reinforcement learning · options framework · motion planning · deep neural network controllers · model predictive control · hybrid controllers · robotics · safety assessment

The pith

DynoPlan turns hierarchical RL option selection into model predictive control by giving each option its own dynamics model and a nearness-to-goal heuristic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DynoPlan as a way to combine hand-designed motion planners with learned deep neural network controllers inside a hierarchical reinforcement learning architecture for robotics. It extends the standard options framework by attaching an independent dynamics model and a heuristic drawn from demonstrations to every option. This change recasts the problem of learning or optimizing the high-level switching policy as short-horizon planning: each option is unrolled forward, future states are scored, and the best option is chosen, much like a simple hill-climbing search. The approach matters because it lets reliable but rigid planners and flexible but opaque neural policies operate together while still exposing initiation sets and planner properties that can be used to reason about safety.

Core claim

DynoPlan equips each option with its own dynamics model and a nearness-to-goal heuristic derived from demonstrations. This reformulation converts hierarchical policy optimization into model predictive control, so that a switching controller can unroll the dynamics of every option, evaluate the expected value of resulting states, and select the best policy inside a fixed time horizon in the manner of hill-climbing search. Because each option carries its own dynamics model, it can be activated independently of whether its underlying implementation is a motion planner or a neural network, thereby permitting a mixture of the two while still allowing safety assessment through initiation sets and,

What carries the argument

Per-option dynamics models that enable independent unrolling and expected-value-based switching over a short horizon.
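The switching idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the 1-D state, the `Option` class, the `unroll` and `select_option` helpers, and the choice to score a rollout by its best predicted state are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

State = float  # assumption: a 1-D toy state, purely for illustration


@dataclass
class Option:
    name: str
    can_start: Callable[[State], bool]  # initiation-set membership test
    dynamics: Callable[[State], State]  # one-step model of the option's effect


def unroll(option: Option, state: State, horizon: int) -> List[State]:
    """Predict the states visited if `option` runs for `horizon` steps."""
    states = []
    for _ in range(horizon):
        state = option.dynamics(state)
        states.append(state)
    return states


def select_option(options: Sequence[Option], state: State,
                  heuristic: Callable[[State], float], horizon: int) -> Option:
    """Pick the admissible option whose predicted rollout scores best."""
    admissible = [o for o in options if o.can_start(state)]
    if not admissible:
        raise ValueError("no option can start in this state")
    # Assumption: a rollout is scored by its best predicted state.
    return max(admissible,
               key=lambda o: max(heuristic(s) for s in unroll(o, state, horizon)))


goal = 10.0
nearness = lambda s: -abs(goal - s)  # higher means closer to the goal
right = Option("planner_right", lambda s: True, lambda s: s + 1.0)
left = Option("dnn_left", lambda s: s > 0, lambda s: s - 1.0)
chosen = select_option([right, left], 3.0, nearness, horizon=3)
print(chosen.name)
```

The selector never inspects how an option is implemented; it only calls the option's dynamics model, which is what lets a planner-backed and a network-backed option sit in the same list.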

If this is right

Motion planners and neural-network controllers can be used interchangeably as options inside the same high-level policy.
Safety regions of the hybrid controller can be delimited by inspecting the initiation sets of the constituent options.
Performance and completeness guarantees of the underlying motion planners carry over to the regions where those options are active.
The high-level switching logic remains simple and inspectable even when the low-level primitives differ in type.
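As a rough illustration of the initiation-set argument, the sketch below partitions a small discrete state space by which kind of option can start there. The function `coverage_report`, the option names, and the eight-state space are hypothetical and are not taken from the paper.

```python
from typing import Dict, Set


def coverage_report(initiation_sets: Dict[str, Set[int]],
                    planner_options: Set[str],
                    all_states: Set[int]) -> Dict[str, Set[int]]:
    """Split states by whether a planner-backed option can start there."""
    covered = set().union(*initiation_sets.values())
    planner_covered: Set[int] = set()
    for name, states in initiation_sets.items():
        if name in planner_options:
            planner_covered |= states
    return {
        "uncovered": all_states - covered,                # no option may start
        "planner_backed": planner_covered & all_states,   # planner guarantees may apply
        "learned_only": (covered - planner_covered) & all_states,
    }


sets = {"planner_reach": {0, 1, 2, 3}, "dnn_insert": {3, 4, 5}}
report = coverage_report(sets, {"planner_reach"}, set(range(8)))
print(report)
```

States that land in `learned_only` or `uncovered` are the ones where any safety argument would need evidence beyond the planners' own guarantees.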

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The explicit dynamics models could make it easier to verify or debug the high-level policy compared with fully end-to-end learned controllers.
Existing libraries of motion planners could be dropped in as options without retraining the entire hierarchy.
Increasing the planning horizon would raise computational cost but might improve decisions on tasks whose critical choice points lie farther in the future.

Load-bearing premise

Each option possesses an accurate dynamics model that can be unrolled independently to predict the states that will result if that option is chosen.

What would settle it

A concrete counter-example in which the option selected by the unrolled dynamics model produces an unsafe state or task failure that would have been avoided by a different choice, traceable to mismatch between the model and the true option behavior.
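One way to look for such a mismatch is to run the model and the real option side by side and report the first step at which they disagree by more than a tolerance. The sketch below assumes a 1-D state and hypothetical step functions; `first_divergence` is an illustrative helper, not something the paper defines.

```python
from typing import Callable, Optional


def first_divergence(model_step: Callable[[float], float],
                     true_step: Callable[[float], float],
                     state: float, horizon: int, tol: float) -> Optional[int]:
    """Return the first step where prediction and reality differ by more than tol."""
    predicted = actual = state
    for t in range(1, horizon + 1):
        predicted = model_step(predicted)
        actual = true_step(actual)
        if abs(predicted - actual) > tol:
            return t
    return None  # the model stayed within tolerance over the whole horizon


# Hypothetical case: the model overestimates progress by 0.2 per step.
model = lambda s: s + 1.0
real = lambda s: s + 0.8
print(first_divergence(model, real, 0.0, horizon=5, tol=0.5))  # 3
print(first_divergence(model, real, 0.0, horizon=2, tol=0.5))  # None
```

Because the per-step error accumulates, a model that looks adequate over two steps can already be misleading by the third, which is the kind of failure the counter-example would need to exhibit.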

Figures

Figures reproduced from arXiv: 1906.10099 by Daniel Angelov, Subramanian Ramamoorthy, Yordan Hristov.

**Figure 2.** (a) The 19-state MDP problem. The action space of th

**Figure 3.** (a) The learned heuristics about how close the curr

Original abstract

Many realistic robotics tasks are best solved compositionally, through control architectures that sequentially invoke primitives and achieve error correction through the use of loops and conditionals taking the system back to alternative earlier states. Recent end-to-end approaches to task learning attempt to directly learn a single controller that solves an entire task, but this has been difficult for complex control tasks that would have otherwise required a diversity of local primitive moves, and the resulting solutions are also not easy to inspect for plan monitoring purposes. In this work, we aim to bridge the gap between hand designed and learned controllers, by representing each as an option in a hybrid hierarchical Reinforcement Learning framework - DynoPlan. We extend the options framework by adding a dynamics model and the use of a nearness-to-goal heuristic, derived from demonstrations. This translates the optimization of a hierarchical policy controller to a problem of planning with a model predictive controller. By unrolling the dynamics of each option and assessing the expected value of each future state, we can create a simple switching controller for choosing the optimal policy within a constrained time horizon similarly to hill climbing heuristic search. The individual dynamics model allows each option to iterate and be activated independently of the specific underlying instantiation, thus allowing for a mix of motion planning and deep neural network based primitives. We can assess the safety regions of the resulting hybrid controller by investigating the initiation sets of the different options, and also by reasoning about the completeness and performance guarantees of the underpinning motion planners.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DynoPlan mixes motion planners and DNN controllers via options with per-option dynamics and a nearness heuristic, but the dynamics model for black-box DNN options remains underspecified.

The paper's main move is to treat both classical motion planners and learned DNN policies as options inside an HRL setup, equip each with its own dynamics model, and then use short-horizon unrolling plus a demonstration-derived nearness heuristic to pick which option to run next. This turns the high-level policy into something that looks like a simple model-predictive switch, similar to hill-climbing search over a constrained horizon. The claim is that the same machinery works whether the option underneath is a planner or a network because the dynamics model sits on top of the option rather than inside it. That is the concrete extension they are offering over the basic options framework, and it is a reasonable practical pattern for robotics tasks that need both inspectable primitives and learned behaviors in the same loop. The safety discussion via initiation sets and planner completeness is also a useful pointer even if it stays at the level of suggestion rather than proof.

The soft spot is exactly the one the stress-test note flags. The abstract asserts that the dynamics model lets every option be unrolled independently of its internal implementation, yet it gives no mechanism for obtaining or validating that model when the option is a black-box DNN policy. Without an explicit forward model or a clear approximation step, the unrolling reduces to policy rollouts whose accuracy is not guaranteed, which undercuts the safety and completeness arguments that rest on reliable prediction. If the full paper shows a concrete way to learn or fit those models for the DNN case, the concern shrinks; from what is visible it still looks load-bearing.

This is aimed at people already working on hybrid hierarchical control in robotics who want a straightforward way to combine the two controller types. A reader looking for implementable ideas on option switching would get value from the switching controller description. It is concrete enough and touches a real engineering problem, so it deserves a serious referee rather than a desk reject, even if the dynamics-model part needs tightening.

Referee Report

1 major / 0 minor

Summary. The paper proposes DynoPlan, a hybrid hierarchical RL framework extending the options framework to represent both motion-planning primitives and DNN-based controllers as options. It adds per-option dynamics models and a nearness-to-goal heuristic derived from demonstrations, recasting hierarchical policy optimization as model-predictive planning. By unrolling each option's dynamics and evaluating expected future-state values, a switching controller selects the optimal option over a finite horizon; the construction is claimed to support mixed option types and to permit safety analysis via initiation sets together with completeness/performance guarantees inherited from the underlying motion planners.

Significance. If the dynamics-model construction is made rigorous, the work would offer a concrete route to compositional, inspectable controllers that safely interleave hand-designed and learned primitives—an incremental but useful advance over pure end-to-end or pure options approaches in robotics.

major comments (1)

[Abstract] Abstract (paragraph beginning 'The individual dynamics model allows...'): the claim that each option possesses a dynamics model usable for unrolling 'independently of the specific underlying instantiation' is load-bearing for both the switching controller and the safety/completeness arguments, yet the manuscript supplies no mechanism for obtaining or validating such a model when the option is realized by a black-box DNN; without an explicit forward model, unrolling reduces to policy rollouts whose accuracy is not guaranteed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for highlighting this important point about the dynamics models. We respond to the major comment below.

Referee: [Abstract] Abstract (paragraph beginning 'The individual dynamics model allows...'): the claim that each option possesses a dynamics model usable for unrolling 'independently of the specific underlying instantiation' is load-bearing for both the switching controller and the safety/completeness arguments, yet the manuscript supplies no mechanism for obtaining or validating such a model when the option is realized by a black-box DNN; without an explicit forward model, unrolling reduces to policy rollouts whose accuracy is not guaranteed.

Authors: We agree that the manuscript does not supply an explicit mechanism for constructing or validating the per-option dynamics model when the option is a black-box DNN controller. The abstract claim is therefore not fully supported in the current text. In the revision we will add a dedicated subsection describing how the dynamics model for each DNN option is obtained via supervised learning on state-transition data collected from option executions (or from the same demonstration trajectories used to derive the nearness-to-goal heuristic), together with a validation procedure based on held-out prediction error. This will make the unrolling procedure and its independence from option instantiation explicit and will also clarify the modeling assumptions underlying the safety and completeness arguments. revision: yes
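The procedure the rebuttal describes (supervised fitting on state-transition data, then validation on held-out prediction error) can be sketched as follows. Everything here is an assumption for illustration: the 1-D linear model, the synthetic `true_option_step` standing in for a black-box option, and the 150/50 split are not from the paper.

```python
import random
from typing import List, Tuple

Transition = Tuple[float, float]  # (state, next_state) observed while the option runs


def fit_linear_dynamics(data: List[Transition]) -> Tuple[float, float]:
    """Least-squares fit of next_state ~ a * state + b."""
    n = len(data)
    mean_s = sum(s for s, _ in data) / n
    mean_n = sum(nxt for _, nxt in data) / n
    var = sum((s - mean_s) ** 2 for s, _ in data)
    cov = sum((s - mean_s) * (nxt - mean_n) for s, nxt in data)
    a = cov / var
    return a, mean_n - a * mean_s


def mse(model: Tuple[float, float], data: List[Transition]) -> float:
    """Mean squared one-step prediction error of the fitted model."""
    a, b = model
    return sum((a * s + b - nxt) ** 2 for s, nxt in data) / len(data)


rng = random.Random(0)


def true_option_step(s: float) -> float:
    # Hypothetical stand-in for executing a black-box option for one step.
    return 0.9 * s + 1.0 + rng.gauss(0.0, 0.01)


states = [rng.uniform(-5.0, 5.0) for _ in range(200)]
transitions = [(s, true_option_step(s)) for s in states]
train, held_out = transitions[:150], transitions[150:]

model = fit_linear_dynamics(train)
held_out_error = mse(model, held_out)
print(model, held_out_error)
```

A low held-out error only bounds one-step accuracy on the sampled states; it says nothing by itself about multi-step rollouts or states outside the data, which is where the referee's concern would still apply.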

Circularity Check

0 steps flagged

No circularity; derivation extends options framework independently

The paper's central construction extends the standard options framework by adding an explicit dynamics model per option and a nearness-to-goal heuristic derived from demonstrations, then applies unrolling to obtain a switching controller analogous to MPC. This is presented as a direct translation of hierarchical policy optimization into planning, with the independence of the dynamics model from the underlying controller (motion planning or DNN) asserted as an enabling property rather than derived from the result itself. No equations, self-citations, or steps in the abstract reduce any claimed prediction or guarantee to a fitted input or self-definition by construction. The approach remains self-contained against external benchmarks such as the options framework and standard MPC techniques.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

axioms (1)

domain assumption: Dynamics models exist and are sufficiently accurate for each option to support short-horizon planning.
Invoked when the abstract states that unrolling dynamics enables the switching controller.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "By unrolling the dynamics of each option and assessing the expected value of each future state, we can create a simple switching controller... The individual dynamics model allows each option to iterate and be activated independently of the specific underlying instantiation"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We extend the options framework by adding a dynamics model and the use of a nearness-to-goal heuristic"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 4 internal anchors

[1] Garrett Thomas, Melissa Chien, Aviv Tamar, Juan Aparicio Ojea, and Pieter Abbeel. Learning robotic assembly from CAD. 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018.

[2] Martin Klissarov, Pierre-Luc Bacon, Jean Harb, and Doina Precup. Learnings options end-to-end for continuous action tasks. arXiv preprint arXiv:1712.00004, 2017.

[3] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.

[4] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.

[5] Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1-2):41–77, 2003.

[6] Saurabh Arora and Prashant Doshi. A survey of inverse reinforcement learning: Challenges, methods and progress. arXiv preprint arXiv:1806.06877, 2018.

[7] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

[8] Doina Precup. Eligibility traces for off-policy policy evaluation. CS Department Faculty Publication Series, 2000.

[9] Glenn A. Iba. A heuristic approach to the discovery of macro-operators. Machine Learning, 3(4):285–317, 1989.

[10] P. Andonov, A. Savchenko, P. Rumschinski, S. Streif, and R. Findeisen. Controller verification and parametrization subject to quantitative and qualitative requirements. IFAC-PapersOnLine, 48(8):1174–1179, 2015.

[11] Jean Harb, Pierre-Luc Bacon, Martin Klissarov, and Doina Precup. When waiting is not an option: Learning options with a deliberation cost. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[12] Jessie Huang, Fa Wu, Doina Precup, and Yang Cai. Learning safe policies with expert guidance. arXiv preprint arXiv:1805.08313, 2018.

[13] Chris L. Baker, Rebecca Saxe, and Joshua B. Tenenbaum. Action understanding as inverse planning. Cognition, 113(3):329–349, 2009.

[14] Daniel S. Brown, Yuchen Cui, and Scott Niekum. Risk-aware active inverse reinforcement learning. arXiv preprint arXiv:1901.02161, 2019.

[15] Anna Harutyunyan, Peter Vrancx, Pierre-Luc Bacon, Doina Precup, and Ann Nowé. Learning with options that terminate off-policy. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
