RL-RRT: Kinodynamic Motion Planning via Learning Reachability Estimators from RL Policies
Pith reviewed 2026-05-24 23:46 UTC · model grok-4.3
The pith
A reachability estimator learned from an RL policy lets RRT plan kinodynamic motions more efficiently by biasing growth toward reachable states.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RL-RRT uses an RL policy as the local planner and, as the RRT distance function, a reachability estimator trained to predict the policy's time to reach a state amid obstacles. On the three systems tested, including physical robots, the combination is reported to plan more efficiently than state-of-the-art kinodynamic planners and to yield a shorter path finish time than a steering-function-free method, while the learned components transfer to unseen environments.
What carries the argument
The reachability estimator, a neural network that predicts the time for the RL policy to reach a candidate state while avoiding obstacles.
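As a rough illustration of the kind of supervised target such an estimator could be fit to, the sketch below rolls out a stand-in policy on a toy 2D point robot and records the number of steps needed to reach the goal, capped at a horizon. The point robot, the `greedy_policy` stand-in, the disc obstacles, and every parameter name are assumptions made for illustration; this summary does not specify the paper's policy, robots, or labeling scheme.

```python
import math

def greedy_policy(state, goal, max_step=0.5):
    """Hypothetical stand-in for a learned policy: step straight toward the goal."""
    dx, dy = goal[0] - state[0], goal[1] - state[1]
    dist = math.hypot(dx, dy)
    if dist <= max_step:
        return goal
    scale = max_step / dist
    return (state[0] + dx * scale, state[1] + dy * scale)

def in_collision(state, obstacles):
    """Obstacles are (cx, cy, radius) discs; an assumption of this toy world."""
    return any(math.hypot(state[0] - cx, state[1] - cy) <= r
               for cx, cy, r in obstacles)

def time_to_reach_label(start, goal, obstacles, horizon=20, tol=1e-9):
    """Roll the policy out and return the steps needed, or `horizon` on failure."""
    state = start
    for step in range(1, horizon + 1):
        state = greedy_policy(state, goal)
        if in_collision(state, obstacles):
            return horizon  # treat a collision as "not reached within the horizon"
        if math.hypot(goal[0] - state[0], goal[1] - state[1]) <= tol:
            return step
    return horizon

free_label = time_to_reach_label((0.0, 0.0), (3.0, 0.0), obstacles=[])
blocked_label = time_to_reach_label((0.0, 0.0), (3.0, 0.0),
                                    obstacles=[(1.5, 0.0, 0.4)])
```

Pairs of (start, goal, obstacle observation) and such labels would then serve as regression data for a network; capping failed rollouts at the horizon is one possible design choice, not necessarily the one the authors made.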
If this is right
- The planner completes searches more efficiently than state-of-the-art kinodynamic planners.
- Paths finish in less time than those from a steering-function-free method.
- The same policy and estimator work in new environments without retraining.
- Planning cost shifts from repeated steering solves to single neural network evaluations.
Where Pith is reading between the lines
- If the estimator generalizes well, similar learned components could accelerate other sampling-based planners that currently rely on analytic distance functions.
- The approach suggests that policies trained for short-horizon control can supply long-horizon guidance when paired with a learned reachability model.
- Physical robot results imply that simulation-trained estimators remain useful when the policy transfers, opening a route to data-efficient real-world deployment.
Load-bearing premise
The reachability estimator, trained only in simulation, continues to give accurate time predictions when the physical robot operates among obstacles.
What would settle it
Running the RL policy on a physical robot in a new cluttered environment and checking whether the estimator's predicted times match the actual times measured during execution.
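A check of this kind could be summarized with two simple statistics, sketched below under stated assumptions: the function name `prediction_errors` is hypothetical, times are taken to be in seconds, and the numbers are made up for illustration rather than drawn from the paper.

```python
def prediction_errors(predicted_s, measured_s):
    """Return (mean absolute error, mean signed error) in seconds."""
    if len(predicted_s) != len(measured_s) or not predicted_s:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [p - m for p, m in zip(predicted_s, measured_s)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    bias = sum(diffs) / len(diffs)  # negative means the estimator is optimistic
    return mae, bias

# Made-up numbers for illustration only; not data from the paper.
predicted = [4.0, 6.0, 10.0, 3.0]
measured = [5.0, 6.0, 12.0, 2.0]
mae, bias = prediction_errors(predicted, measured)
```

A small mean absolute error with a bias near zero would support the premise; a consistently negative bias would suggest the estimator underestimates execution time in clutter.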
Original abstract
This paper addresses two challenges facing sampling-based kinodynamic motion planning: a way to identify good candidate states for local transitions and the subsequent computationally intractable steering between these candidate states. Through the combination of sampling-based planning, a Rapidly Exploring Randomized Tree (RRT) and an efficient kinodynamic motion planner through machine learning, we propose an efficient solution to long-range planning for kinodynamic motion planning. First, we use deep reinforcement learning to learn an obstacle-avoiding policy that maps a robot's sensor observations to actions, which is used as a local planner during planning and as a controller during execution. Second, we train a reachability estimator in a supervised manner, which predicts the RL policy's time to reach a state in the presence of obstacles. Lastly, we introduce RL-RRT that uses the RL policy as a local planner, and the reachability estimator as the distance function to bias tree-growth towards promising regions. We evaluate our method on three kinodynamic systems, including physical robot experiments. Results across all three robots tested indicate that RL-RRT outperforms state of the art kinodynamic planners in efficiency, and also provides a shorter path finish time than a steering function free method. The learned local planner policy and accompanying reachability estimator demonstrate transferability to the previously unseen experimental environments, making RL-RRT fast because the expensive computations are replaced with simple neural network inference. Video: https://youtu.be/dDMVMTOI8KY
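The three steps in the abstract can be sketched as a toy RRT loop in which the nearest node is chosen by predicted time-to-reach and the extension is a policy rollout. Everything below is an assumption for illustration: `estimate_time_to_reach` and `policy_rollout` are hand-written stand-ins for the learned estimator and RL policy, the world is a 2D square with disc obstacles, and the parameters are arbitrary. It is not the authors' implementation.

```python
import math
import random

def estimate_time_to_reach(src, dst, speed=0.5):
    """Hypothetical stand-in for the learned estimator: straight-line time."""
    return math.dist(src, dst) / speed

def policy_rollout(src, target, obstacles, speed=0.5, max_steps=4):
    """Hypothetical stand-in for the RL local planner; stops before a collision."""
    state = src
    for _ in range(max_steps):
        gap = math.dist(state, target)
        if gap < 1e-9:
            break
        step = min(speed, gap)
        nxt = (state[0] + (target[0] - state[0]) * step / gap,
               state[1] + (target[1] - state[1]) * step / gap)
        if any(math.dist(nxt, (cx, cy)) <= r for cx, cy, r in obstacles):
            break
        state = nxt
    return state

def plan(start, goal, obstacles, bounds=(0.0, 10.0), goal_tol=0.5,
         goal_bias=0.1, max_iters=2000, seed=0):
    rng = random.Random(seed)
    parent = {start: None}
    for _ in range(max_iters):
        sample = goal if rng.random() < goal_bias else (
            rng.uniform(*bounds), rng.uniform(*bounds))
        # Nearest node under predicted time-to-reach, not Euclidean distance.
        nearest = min(parent, key=lambda n: estimate_time_to_reach(n, sample))
        new = policy_rollout(nearest, sample, obstacles)
        if new in parent:
            continue  # the rollout made no progress
        parent[new] = nearest
        if math.dist(new, goal) <= goal_tol:
            path = [new]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
    return None

path = plan((1.0, 1.0), (9.0, 9.0), obstacles=[(5.0, 5.0, 1.5)])
```

In this toy version the stand-in estimator ignores obstacles, so it reduces to a scaled Euclidean metric; the point of the learned estimator described in the abstract is precisely that its predictions account for obstacles and the policy's dynamics.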
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes RL-RRT, which learns an obstacle-avoiding RL policy to serve as a local planner and controller, trains a supervised reachability estimator to predict the policy's time-to-reach, and integrates both into RRT by using the estimator as a distance heuristic to bias tree growth. It reports that this yields more efficient kinodynamic planning than SOTA methods across three systems (including hardware) with shorter finish times than steering-free baselines and zero-shot transfer to unseen environments.
Significance. If the empirical claims hold, the work demonstrates a practical way to replace expensive steering computations in kinodynamic RRT with fast neural-network inference while retaining sampling-based completeness properties, which could improve planning speed for underactuated or high-dimensional robotic systems.
major comments (2)
- [Abstract / Results] The central efficiency and transfer claims rest on the reachability estimator remaining accurate when obstacles are present and under sim-to-real domain shift, yet the provided abstract and description supply no quantitative error metrics, calibration plots, or failure-mode analysis for obstacle-present rollouts (the weakest assumption identified in the stress test).
- [Abstract] Outperformance is asserted over SOTA kinodynamic planners and a steering-function-free baseline, but the abstract reports no numerical values, error bars, data-split details, or statistical tests, rendering the quantitative superiority unverifiable from the summary alone.
minor comments (2)
- Clarify the precise training distribution for the reachability estimator (e.g., whether trajectories include obstacles or only free space) and how nearest-neighbor selection in RRT interacts with estimator error.
- Add explicit comparison tables with baseline runtimes, path lengths, and success rates including standard deviations across repeated trials.
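The first minor comment, on how nearest-neighbor selection interacts with estimator error, could be probed with a simple perturbation experiment. The sketch below is hypothetical: it applies uniform multiplicative noise to made-up time-to-reach values and counts how often the noisy estimates still select the truly nearest node. The noise model and the function name `selection_agreement` are assumptions, not anything described in the paper.

```python
import random

def selection_agreement(true_times, noise_frac, trials=1000, seed=0):
    """Fraction of trials in which noisy estimates pick the truly nearest node."""
    rng = random.Random(seed)
    best = min(range(len(true_times)), key=true_times.__getitem__)
    hits = 0
    for _ in range(trials):
        noisy = [t * (1.0 + rng.uniform(-noise_frac, noise_frac))
                 for t in true_times]
        if min(range(len(noisy)), key=noisy.__getitem__) == best:
            hits += 1
    return hits / trials

# Made-up candidate times (seconds) from four tree nodes to one sampled state.
times = [2.0, 2.2, 5.0, 9.0]
exact = selection_agreement(times, noise_frac=0.0)
noisy = selection_agreement(times, noise_frac=0.2)
```

When two candidates have similar true times, even modest error can flip the selection; whether that matters for planning performance is an empirical question the requested tables would help answer.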
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to the manuscript.
- Referee: [Abstract / Results] The central efficiency and transfer claims rest on the reachability estimator remaining accurate when obstacles are present and under sim-to-real domain shift, yet the provided abstract and description supply no quantitative error metrics, calibration plots, or failure-mode analysis for obstacle-present rollouts (the weakest assumption identified in the stress test).
  Authors: The reachability estimator is trained on rollouts that include obstacle interactions because the underlying RL policy is obstacle-avoiding. Its utility is validated end-to-end via planning success rates, tree expansion efficiency, and hardware transfer on three systems. We agree that explicit quantitative metrics for the estimator itself would strengthen the abstract. In revision we will add a concise statement reporting mean absolute error on held-out obstacle-present trajectories and will include calibration plots in the supplementary material.
  Revision: yes
- Referee: [Abstract] Outperformance is asserted over SOTA kinodynamic planners and a steering-function-free baseline, but the abstract reports no numerical values, error bars, data-split details, or statistical tests, rendering the quantitative superiority unverifiable from the summary alone.
  Authors: The abstract is intentionally high-level to respect length constraints. The full manuscript supplies the requested numerical comparisons, error bars, and data details in the experimental evaluation. We will revise the abstract to incorporate the most salient quantitative improvements (e.g., planning-time reductions) while remaining within the word limit.
  Revision: yes
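The calibration plots promised in the first response could be built from a binned table like the one sketched here. This is an assumed procedure, not the authors': rollouts are grouped by predicted time into fixed-width bins, and the mean prediction in each bin is compared with the mean measured time. The function name, bin width, and data are hypothetical.

```python
def calibration_table(predicted_s, measured_s, bin_width_s=5.0):
    """Group rollouts by predicted time; compare mean predicted vs mean measured."""
    bins = {}
    for p, m in zip(predicted_s, measured_s):
        bins.setdefault(int(p // bin_width_s), []).append((p, m))
    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_pred = sum(p for p, _ in pairs) / len(pairs)
        mean_meas = sum(m for _, m in pairs) / len(pairs)
        # Row layout: (bin start in seconds, count, mean predicted, mean measured)
        rows.append((idx * bin_width_s, len(pairs), mean_pred, mean_meas))
    return rows

# Made-up held-out rollouts for illustration only; not data from the paper.
predicted = [1.0, 3.0, 6.0, 8.0, 12.0]
measured = [2.0, 4.0, 6.0, 10.0, 15.0]
table = calibration_table(predicted, measured)
```

Plotting mean measured against mean predicted per bin, with the diagonal as reference, would show whether errors grow with predicted time.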
Circularity Check
No significant circularity; empirical method validated by experiments
The paper presents an empirical pipeline: an RL policy is trained as a local planner, a supervised reachability estimator is fit to predict that policy's time-to-reach, and both are inserted into RRT. All performance claims (efficiency, path finish time, transfer) rest on experimental comparisons across simulated and physical robots rather than any closed-form derivation or equation that reduces to its own fitted inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing mathematical facts. The derivation chain is therefore self-contained and externally falsifiable via the reported robot trials.
Axiom & Free-Parameter Ledger
free parameters (2)
- RL policy network weights
- Reachability estimator network weights
axioms (2)
- domain assumption: The RL policy produces collision-free trajectories when executed from any state reached during planning.
- domain assumption: Standard neural network training converges to a policy and estimator that generalize to unseen environments.