arxiv: 1907.04799 · v2 · pith:CXJPQ53Onew · submitted 2019-07-10 · 💻 cs.RO · cs.AI · cs.LG

RL-RRT: Kinodynamic Motion Planning via Learning Reachability Estimators from RL Policies

Pith reviewed 2026-05-24 23:46 UTC · model grok-4.3

classification 💻 cs.RO cs.AI cs.LG
keywords kinodynamic planning · reinforcement learning · RRT · reachability estimation · motion planning · sampling-based algorithms · robot navigation

The pith

A reachability estimator learned from an RL policy lets RRT plan kinodynamic motions more efficiently by biasing growth toward reachable states.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that training a reinforcement learning policy to avoid obstacles and then supervising a reachability estimator on its time-to-reach behavior allows the RRT algorithm to solve long-range kinodynamic planning problems faster. The estimator serves as a distance function that guides the tree toward promising regions without requiring expensive steering between states. Because the policy and estimator are neural networks, planning reduces to fast inference rather than solving differential equations at each step. The resulting RL-RRT method shows better efficiency than prior kinodynamic planners across three robot systems and transfers from simulation to real hardware.

Core claim

RL-RRT uses an RL policy as a local planner and a reachability estimator trained to predict the policy's time to reach a state amid obstacles as the distance function in an RRT. This combination produces shorter planning times and shorter path finish times than state-of-the-art methods on three tested systems, including physical robots, while the learned components transfer to unseen environments.
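
The combination described above can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: `policy_step` is a hand-written stand-in for the trained RL policy, `estimate_ttr` is a stand-in for the learned reachability estimator (here it ignores obstacles, unlike the estimator in the paper), and the 2D point robot, the circular obstacle, and all constants are assumptions chosen only to make the example run.

```python
import math
import random

SPEED = 0.5                    # assumed distance covered per policy step (m)
HORIZON = 40.0                 # assumed cap on the time-to-reach estimate
OBSTACLE = ((5.0, 5.0), 1.5)   # hypothetical circular obstacle: centre, radius


def in_collision(p):
    (cx, cy), r = OBSTACLE
    return math.hypot(p[0] - cx, p[1] - cy) <= r


def policy_step(state, target):
    """Stand-in for the RL policy: take one step straight toward the target."""
    dx, dy = target[0] - state[0], target[1] - state[1]
    dist = math.hypot(dx, dy)
    if dist <= SPEED:
        return target
    return (state[0] + SPEED * dx / dist, state[1] + SPEED * dy / dist)


def estimate_ttr(state, target):
    """Stand-in for the learned estimator: policy steps needed, capped at HORIZON."""
    dist = math.hypot(target[0] - state[0], target[1] - state[1])
    return min(dist / SPEED, HORIZON)


def rl_rrt(start, goal, iters=2000, rollout_steps=6, goal_tol=0.5, seed=0):
    rng = random.Random(seed)
    parent = {start: None}  # tree stored as child -> parent
    for _ in range(iters):
        # Sample a target state, with a small bias toward the goal.
        if rng.random() < 0.1:
            sample = goal
        else:
            sample = (rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0))
        # "Distance function": the tree node with the lowest estimated time to reach.
        state = min(parent, key=lambda n: estimate_ttr(n, sample))
        # "Local planner": roll the policy out toward the sample for a few steps.
        for _ in range(rollout_steps):
            nxt = policy_step(state, sample)
            if in_collision(nxt):
                break
            if nxt not in parent:
                parent[nxt] = state
            state = nxt
            if math.hypot(state[0] - goal[0], state[1] - goal[1]) <= goal_tol:
                path = []
                while state is not None:
                    path.append(state)
                    state = parent[state]
                return path[::-1]
            if state == sample:
                break
    return None


path = rl_rrt((1.0, 1.0), (9.0, 9.0))
print("path found" if path else "no path", len(path or []))
```

In this sketch the estimator is queried once per tree node per sample and the policy is only stepped forward, so no two-point steering problem is solved at any point.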

What carries the argument

The reachability estimator, a neural network that predicts the time for the RL policy to reach a candidate state while avoiding obstacles.
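
One way to picture the supervision signal for such an estimator is to roll the policy out and record how long it takes to arrive. The sketch below assumes a 1D point robot, a constant-speed stand-in policy, a 0.1 s time step, and a 40 s horizon past which a state is labelled unreachable (the horizon value echoes the figure captions; everything else, including the function name `rollout_time_to_reach`, is hypothetical). The labels it produces would then be the regression targets for a network.

```python
def rollout_time_to_reach(start, goal, policy, is_collision,
                          dt=0.1, horizon=40.0, tol=0.05):
    """Seconds the policy needs to get within tol of goal, or horizon on failure."""
    state = start
    steps = int(round(horizon / dt))
    for k in range(steps):
        if abs(state - goal) <= tol:
            return k * dt
        state = state + policy(state, goal) * dt  # policy returns a velocity
        if is_collision(state):
            return horizon  # collisions are labelled as unreachable
    return horizon


def policy(state, goal):
    """Stand-in policy: drive toward the goal at 1 m/s."""
    return 1.0 if goal > state else -1.0


def free(state):
    return False


def blocked(state):
    """Hypothetical obstacle occupying the interval [0.95, 1.05]."""
    return 0.95 <= state <= 1.05


# Each training example pairs a (start, goal) query with its time-to-reach label.
queries = [(0.0, 0.5), (0.0, 2.0)]
dataset = [((s, g), rollout_time_to_reach(s, g, policy, blocked)) for s, g in queries]
print(dataset)
```

The second query is labelled with the full horizon because the rollout runs into the obstacle, which is the sense in which the label is obstacle-aware rather than a free-space distance.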

If this is right

  • The planner completes searches faster than existing kinodynamic RRT variants.
  • Paths are executed in less time than those from steering-free methods.
  • The same policy and estimator work in new environments without retraining.
  • Planning cost shifts from repeated steering solves to single neural network evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the estimator generalizes well, similar learned components could accelerate other sampling-based planners that currently rely on analytic distance functions.
  • The approach suggests that policies trained for short-horizon control can supply long-horizon guidance when paired with a learned reachability model.
  • Physical robot results imply that simulation-trained estimators remain useful when the policy transfers, opening a route to data-efficient real-world deployment.

Load-bearing premise

The reachability estimator, trained only in simulation, continues to give accurate time-to-reach predictions when the physical robot operates among obstacles.

What would settle it

Running the RL policy on a physical robot in a new cluttered environment and checking whether the estimator's predicted times match the actual times measured during execution.
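
The check described above amounts to comparing two lists of times. A minimal sketch, assuming the predictions and the measured execution times have already been collected for the same start and goal pairs; the numbers are placeholders for illustration and are not measurements from the paper:

```python
def mean_absolute_error(predicted, measured):
    """Average absolute gap between predicted and measured times, in seconds."""
    if not predicted or len(predicted) != len(measured):
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(predicted)


def mean_signed_error(predicted, measured):
    """Positive when the estimator is pessimistic, negative when optimistic."""
    if not predicted or len(predicted) != len(measured):
        raise ValueError("need two non-empty lists of equal length")
    return sum(p - m for p, m in zip(predicted, measured)) / len(predicted)


# Placeholder values only: estimator output vs. stopwatch time for three runs.
predicted_s = [10.0, 20.0, 30.0]
measured_s = [12.0, 19.0, 33.0]

print(mean_absolute_error(predicted_s, measured_s))
print(mean_signed_error(predicted_s, measured_s))
```

Reporting the signed error alongside the absolute error separates a consistently optimistic estimator from one that is merely noisy, which matters here because the estimator ranks candidate tree nodes.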

Figures

Figures reproduced from arXiv: 1907.04799 by Aleksandra Faust, Hao-Tien Lewis Chiang, Jasmine Hsu, Lydia Tapia, Marek Fiser.

Figure 1: (a) Example trees constructed with RL-RRT (yellow) and SST [14] (blue) for a kinodynamic car navigating from start (S) to goal (G). (b) The Fetch robot. (c) RL-RRT (green) and the real-world trajectory executed (cyan) from the start (green dot) towards the goal (blue dot) in Map 2. Map 2 is a SLAM map of an actual office building.
Figure 2: AutoRL P2P navigation success rate as a function of start and goal distance for (a) Fetch, (b) Car and (c) Asteroid robot.
Figure 3: Predicted cumulative future time to reach cost vs. true value for various robots. (a) Training environment (22.7 x 18.0 m) (b) Predicted (c) Ground truth
Figure 4: (a) The training environment. Contour plot of (b) predicted future cumulative time to reach cost vs. (c) the true value for Car to reach the goal near the center marked by the blue dot. The white regions have a time to reach value over the 40 s horizon, i.e., unreachable. All start states and the goal have 0 as linear speed and orientation.
Figure 5: Success rate (top) and finish time (bottom) of RL-RRT (black) compared to SST (blue), RRT-DW (red, RRT with DWA obstacle-avoiding steering…
Figure 7: Contour plot of (a) predicted future cumulative time to reach cost vs. (b) the true value for Car to reach the goal near the center marked by the blue dot. The white regions have a time to reach value over the 40 s horizon, i.e., unreachable. All start states and the goal have 0 as linear speed and orientation. The environment size is 50 m by 40 m.
Figure 8: Predicted time to reach vs. true value for various robots. The estimators are trained and evaluated with only states that can reach the goal.
Abstract

This paper addresses two challenges facing sampling-based kinodynamic motion planning: a way to identify good candidate states for local transitions and the subsequent computationally intractable steering between these candidate states. Through the combination of sampling-based planning, a Rapidly Exploring Randomized Tree (RRT) and an efficient kinodynamic motion planner through machine learning, we propose an efficient solution to long-range planning for kinodynamic motion planning. First, we use deep reinforcement learning to learn an obstacle-avoiding policy that maps a robot's sensor observations to actions, which is used as a local planner during planning and as a controller during execution. Second, we train a reachability estimator in a supervised manner, which predicts the RL policy's time to reach a state in the presence of obstacles. Lastly, we introduce RL-RRT that uses the RL policy as a local planner, and the reachability estimator as the distance function to bias tree-growth towards promising regions. We evaluate our method on three kinodynamic systems, including physical robot experiments. Results across all three robots tested indicate that RL-RRT outperforms state of the art kinodynamic planners in efficiency, and also provides a shorter path finish time than a steering function free method. The learned local planner policy and accompanying reachability estimator demonstrate transferability to the previously unseen experimental environments, making RL-RRT fast because the expensive computations are replaced with simple neural network inference. Video: https://youtu.be/dDMVMTOI8KY

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes RL-RRT, which learns an obstacle-avoiding RL policy to serve as a local planner and controller, trains a supervised reachability estimator to predict the policy's time-to-reach, and integrates both into RRT by using the estimator as a distance heuristic to bias tree growth. It reports that this yields more efficient kinodynamic planning than SOTA methods across three systems (including hardware) with shorter finish times than steering-free baselines and zero-shot transfer to unseen environments.

Significance. If the empirical claims hold, the work demonstrates a practical way to replace expensive steering computations in kinodynamic RRT with fast neural-network inference while retaining sampling-based completeness properties, which could improve planning speed for underactuated or high-dimensional robotic systems.

major comments (2)
  1. [Abstract / Results] The central efficiency and transfer claims rest on the reachability estimator remaining accurate when obstacles are present and under sim-to-real domain shift, yet the provided abstract and description supply no quantitative error metrics, calibration plots, or failure-mode analysis for obstacle-present rollouts (the weakest assumption identified in the stress test).
  2. [Abstract] Outperformance is asserted over SOTA kinodynamic planners and a steering-function-free baseline, but the abstract reports no numerical values, error bars, data-split details, or statistical tests, rendering the quantitative superiority unverifiable from the summary alone.
minor comments (2)
  1. Clarify the precise training distribution for the reachability estimator (e.g., whether trajectories include obstacles or only free space) and how nearest-neighbor selection in RRT interacts with estimator error.
  2. Add explicit comparison tables with baseline runtimes, path lengths, and success rates including standard deviations across repeated trials.
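
The comparison table requested in the last minor comment could be assembled from per-trial logs along the following lines. This is a sketch under assumptions: the planner names and all numbers are placeholders for illustration, not results from the paper, and `summarize_trials` is a hypothetical helper.

```python
import statistics


def summarize_trials(planning_times, successes):
    """Mean, sample standard deviation, and success rate for one planner."""
    if len(planning_times) < 2:
        raise ValueError("need at least two trials for a standard deviation")
    return {
        "mean_s": statistics.mean(planning_times),
        "std_s": statistics.stdev(planning_times),
        "success_rate": sum(successes) / len(successes),
    }


# Placeholder trial logs: (planning times in seconds, per-trial success flags).
trials = {
    "planner_a": ([1.0, 2.0, 3.0], [True, True, False]),
    "planner_b": ([2.0, 4.0, 6.0], [True, False, False]),
}

rows = {name: summarize_trials(times, ok) for name, (times, ok) in trials.items()}
for name, row in rows.items():
    print(f"{name:10s} {row['mean_s']:.2f} ± {row['std_s']:.2f} s  "
          f"success {row['success_rate']:.0%}")
```

`statistics.stdev` computes the sample standard deviation (dividing by n - 1), which is the usual choice when the trials are a sample of possible runs rather than the full population.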

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to the manuscript.

  1. Referee: [Abstract / Results] The central efficiency and transfer claims rest on the reachability estimator remaining accurate when obstacles are present and under sim-to-real domain shift, yet the provided abstract and description supply no quantitative error metrics, calibration plots, or failure-mode analysis for obstacle-present rollouts (the weakest assumption identified in the stress test).

    Authors: The reachability estimator is trained on rollouts that include obstacle interactions because the underlying RL policy is obstacle-avoiding. Its utility is validated end-to-end via planning success rates, tree expansion efficiency, and hardware transfer on three systems. We agree that explicit quantitative metrics for the estimator itself would strengthen the abstract. In revision we will add a concise statement reporting mean absolute error on held-out obstacle-present trajectories and will include calibration plots in the supplementary material. revision: yes

  2. Referee: [Abstract] Outperformance is asserted over SOTA kinodynamic planners and a steering-function-free baseline, but the abstract reports no numerical values, error bars, data-split details, or statistical tests, rendering the quantitative superiority unverifiable from the summary alone.

    Authors: The abstract is intentionally high-level to respect length constraints. The full manuscript supplies the requested numerical comparisons, error bars, and data details in the experimental evaluation. We will revise the abstract to incorporate the most salient quantitative improvements (e.g., planning-time reductions) while remaining within the word limit. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method validated by experiments

full rationale

The paper presents an empirical pipeline: an RL policy is trained as a local planner, a supervised reachability estimator is fit to predict that policy's time-to-reach, and both are inserted into RRT. All performance claims (efficiency, path finish time, transfer) rest on experimental comparisons across simulated and physical robots rather than any closed-form derivation or equation that reduces to its own fitted inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing mathematical facts. The derivation chain is therefore self-contained and externally falsifiable via the reported robot trials.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard supervised and reinforcement learning assumptions plus the untested transfer of the learned components; no new physical entities are postulated.

free parameters (2)
  • RL policy network weights
    Trained via deep RL on sensor-to-action mapping; these are fitted parameters whose values determine local planner behavior.
  • Reachability estimator network weights
    Trained in supervised manner on time-to-reach labels generated by the RL policy; these are fitted parameters used as the distance function.
axioms (2)
  • domain assumption: The RL policy produces collision-free trajectories when executed from any state reached during planning.
    Invoked when the policy is used as local planner inside RRT; stated in the description of the first contribution.
  • domain assumption: Standard neural network training converges to a policy and estimator that generalize to unseen environments.
    Required for the transferability claim; implicit in the evaluation on previously unseen experimental environments.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

  [1] R. Allen and M. Pavone. A real-time framework for kinodynamic planning with application to quadrotor obstacle avoidance. In AIAA Guidance, Navigation, and Control Conference, page 1374, 2016

  [2] H.-T. L. Chiang, A. Faust, S. Satomi, and L. Tapia. Fast swept volume estimation with deep learning. In Proc. Int. Workshop on Algorithmic Foundations of Robotics (WAFR), page To appear, 2018

  [3] H.-T. L. Chiang and L. Tapia. Colreg-rrt: An rrt-based colregs-compliant motion planner for surface vehicle navigation. Robotics and Automat. Lett., 3(3):2024–2031, 2018

  [4] H.-T. L. Chiang, A. Faust, M. Fiser, and A. Francis. Learning navigation behaviors end-to-end with autorl. IEEE Robotics and Automation Letters (RA-L), 2019

  [5] T. Fan, X. Cheng, J. Pan, P. Long, W. Liu, R. Yang, and D. Manocha. Getting robots unfrozen and unlost in dense pedestrian crowds. Robotics and Automat. Lett., 2019

  [6] A. Faust, O. Ramirez, M. Fiser, K. Oslund, A. Francis, J. Davidson, and L. Tapia. PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 5113–5120, Brisbane, Australia, 2018

  [7] D. Fox, W. Burgard, and S. Thrun. The dynamic window approach to collision avoidance. IEEE Robot. & Automation Mag., 4(1):23–33, 1997

  [8] A. Francis, A. Faust, H.-T. L. Chiang, J. Hsu, J. C. Kew, M. Fiser, and T.-W. E. Lee. Long-range indoor navigation with prm-rl. arXiv preprint arXiv:1902.09458, 2019

  [9] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google vizier: A service for black-box optimization. In Proc. of ACM Intl. Conference on Knowledge Discovery and Data Mining, pages 1487–1495. ACM, 2017

  [10] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016

  [11] Y. Kato, K. Kamiyama, and K. Morioka. Autonomous robot navigation system with learning based on deep q-network and topological maps. In 2017 IEEE/SICE International Symposium on System Integration (SII), pages 1040–1046, Dec 2017

  [12] A. Layek, N. A. Vien, T. Chung, et al. Deep reinforcement learning algorithms for steering an underactuated ship. In IEEE Int. Conf. on Multisensor Fusion and Integration for Intell. Sys. (MFI), pages 602–

  [13] Y. Li and K. E. Bekris. Learning approximate cost-to-go metrics to improve sampling-based motion planning. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 4196–4201. IEEE, 2011

  [14] Y. Li, Z. Littlefield, and K. E. Bekris. Asymptotically optimal sampling-based kinodynamic planning. Int. J. Robot. Res., 35(5):528–564, 2016

  [15] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015

  [16] S. Liu, M. Watterson, K. Mohta, K. Sun, S. Bhattacharya, C. J. Taylor, and V. Kumar. Planning dynamically feasible trajectories for quadrotors using safe flight corridors in 3-d complex environments. Robotics and Automat. Lett., 2(3):1688–1695, 2017

  [17] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015

  [18] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli. A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Trans. on intel. vehicles, 1(1):33–55, 2016

  [19] L. Palmieri and K. O. Arras. Distance metric learning for rrt-based motion planning with constant-time inference. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 637–643. IEEE, 2015

  [20] M. Pfeiffer, M. Schaeuble, J. I. Nieto, R. Siegwart, and C. Cadena. From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots. Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 1527–1533, 2017

  [21] J. M. Phillips, N. Bedrossian, and L. E. Kavraki. Guided expansive spaces trees: A search strategy for motion-and cost-constrained state spaces. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), volume 4, pages 3968–3973. IEEE, 2004

  [22] A. Richards, T. Schouwenaars, J. P. How, and E. Feron. Spacecraft trajectory planning with avoidance constraints using mixed-integer linear programming. J. of Guidance, Control, and Dynamics, 25(4):755–764, 2002

  [23] E. Schmerling, L. Janson, and M. Pavone. Optimal sampling-based motion planning under differential constraints: the drift case with linear affine dynamics. In IEEE Conf. on Decision and Control (CDC), pages 2574–2581. IEEE, 2015

  [24] Sergio Guadarrama, Anoop Korattikara, Oscar Ramirez, Pablo Castro, Ethan Holly, Sam Fishman, Ke Wang, Ekaterina Gonina, Chris Harris, Vincent Vanhoucke, Eugene Brevdo. TF-Agents: A library for reinforcement learning in tensorflow. https://github.com/tensorflow/agents, 2018

  [25] L. Tai, G. Paolo, and M. Liu. Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 31–36, Sept 2017

  [26] D. J. Webb and J. Van Den Berg. Kinodynamic rrt*: Asymptotically optimal motion planning for robots with linear dynamics. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 5054–5061. IEEE, 2013

  [27] W. J. Wolfslag, M. Bharatheesha, T. M. Moerland, and M. Wisse. RRT-colearn: towards kinodynamic planning without numerical trajectory optimization. Robotics and Automat. Lett., 3(3):1655–1662, 2018

  [28] J. Zhang, J. T. Springenberg, J. Boedecker, and W. Burgard. Deep reinforcement learning with successor features for navigation across similar environments. In Proc. IEEE Int. Conf. Intel. Rob. Syst. (IROS), pages 2371–2378. IEEE, 2017

SUPPLEMENTAL MATERIAL: REWARDS FOR THE P2P

A. P2P for differential drive robots

The P2P agent was developed in [4]. The...