Unifying Hamilton-Jacobi Reachability and Reinforcement Learning
Pith reviewed 2026-05-16 14:33 UTC · model grok-4.3
The pith
With a suitably chosen running cost, the RL value function is the unique bounded viscosity solution of a time-dependent HJB PDE whose negative sublevel set is the strict backward-reachable tube.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The resulting travel-cost value function is the unique bounded viscosity solution of a time-dependent Hamilton-Jacobi-Bellman (HJB) partial differential equation (PDE) with zero terminal data, whose negative sublevel set equals the strict backward-reachable tube. Under a forward reparameterization and a contraction-inducing Bellman update, fixed points of small-step RL value iteration converge to the viscosity solution of the forward discounted HJB equation.
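As a rough guide to the objects named in this claim, the block below writes out a generic finite-horizon form. The notation (dynamics f, control set U, running cost c, horizon T) is assumed here for illustration and is not taken from the paper, whose exact running cost is not reproduced on this page.

```latex
% Assumed notation (not from the source): dynamics \dot{x}(s) = f(x(s), u(s)),
% control set U, running cost c, horizon T.
\[
V(x,t) = \inf_{u(\cdot)} \int_t^T c\bigl(x(s), u(s)\bigr)\,ds, \qquad V(x,T) = 0,
\]
\[
\partial_t V(x,t) + \min_{u \in U}\bigl[\nabla_x V(x,t)\cdot f(x,u) + c(x,u)\bigr] = 0,
\]
\[
\{\,x : V(x,t) < 0\,\} = \text{strict backward-reachable tube over } [t,T].
\]
```

For instance, a cost with c ≤ 0 everywhere and c < 0 exactly on an open target set would give V(x,t) < 0 precisely when some admissible trajectory from x enters the target before T. Whether this matches the paper's construction is an assumption.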
What carries the argument
The proposed running cost formulation that makes the RL travel-cost value function exactly equal the reachability indicator function.
If this is right
- The negative sublevel set of the value function equals the strict backward-reachable tube.
- Fixed points of small-step RL value iteration converge to the viscosity solution of the forward discounted HJB.
- The framework preserves reachability-based safety semantics while remaining compatible with deep RL implementations.
- Learned value functions converge toward semi-Lagrangian HJB solutions with quantifiable approximation error across the state space.
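The first consequence above can be illustrated on a toy problem. The sketch below is a minimal example, not the paper's benchmark: it assumes one-dimensional dynamics dx/dt = u with u in {-1, 0, 1}, an open target |x| < r, and a hypothetical Lipschitz running cost c(x) = min(0, |x| - r) that is negative exactly on the target. A semi-Lagrangian backup V <- h*c + min_u V(x + h*u), started from zero terminal data, then gives a value function whose negative sublevel set approximates the backward-reachable tube |x| < r + T.

```python
import numpy as np

def running_cost(x, r):
    # Hypothetical cost: negative exactly inside the open target |x| < r, zero elsewhere.
    return np.minimum(0.0, np.abs(x) - r)

def travel_cost_value(x, r, horizon, h, controls=(-1.0, 0.0, 1.0)):
    """Backward semi-Lagrangian sweep for dx/dt = u with zero terminal data."""
    v = np.zeros_like(x)  # V(x, T) = 0
    for _ in range(int(round(horizon / h))):
        # np.interp evaluates V at the successor state x + h*u; states that
        # leave the grid take the boundary value.
        successors = [np.interp(x + h * u, x, v) for u in controls]
        v = h * running_cost(x, r) + np.min(successors, axis=0)
    return v

x = np.linspace(-2.0, 2.0, 81)   # grid spacing 0.05
r, horizon, h = 0.25, 1.0, 0.05  # time step chosen equal to the grid spacing
v = travel_cost_value(x, r, horizon, h)
tube = x[v < -1e-9]              # negative sublevel set, up to round-off
print(tube.min(), tube.max())
```

For this system the exact tube is |x| < r + T = 1.25, and the discrete tube should stop within a couple of grid cells of that boundary. The threshold `-1e-9` is an illustrative guard against floating-point round-off, not part of the theory.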
Where Pith is reading between the lines
- This link could let model-free RL algorithms compute reachable sets in high-dimensional systems where grid-based HJB solvers become intractable.
- Safety-critical RL policies might be trained by directly optimizing the proposed travel cost, inheriting reachability guarantees without separate verification.
- The same cost construction might extend to stochastic reachability or differential games by modifying the underlying HJB equation accordingly.
Load-bearing premise
The running cost is chosen so the RL value function exactly matches the reachability indicator function, together with standard Lipschitz regularity on the dynamics and costs.
What would settle it
Numerical experiments in which the negative sublevel set of the learned value function deviates from the true strict backward-reachable tube computed by an independent semi-Lagrangian HJB solver, or in which the value function fails to satisfy the HJB PDE in the viscosity sense.
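One way to run such a check is to compare the two negative sublevel sets directly on a shared grid. The sketch below is an assumption about how this could be done, not the paper's evaluation code: `v_learned` and `v_reference` are hypothetical arrays holding the learned value function and an independent solver's output, and `tol` is an illustrative threshold for what counts as negative.

```python
import numpy as np

def tube_mismatch(v_learned, v_reference, tol=0.0):
    """Compare negative sublevel sets of two value functions sampled on one grid."""
    learned = v_learned < -tol
    reference = v_reference < -tol
    union = np.logical_or(learned, reference).sum()
    return {
        # States the learned tube claims reachable but the reference does not.
        "false_reachable": int(np.sum(learned & ~reference)),
        # States in the reference tube that the learned tube misses.
        "missed_reachable": int(np.sum(~learned & reference)),
        # Intersection over union; defined as 1.0 when both tubes are empty.
        "iou": float((learned & reference).sum() / union) if union else 1.0,
    }

# Hypothetical values on a 5-point grid.
v_reference = np.array([0.0, -0.2, -0.5, -0.2, 0.0])
v_learned = np.array([-0.1, -0.3, -0.4, 0.0, 0.0])
report = tube_mismatch(v_learned, v_reference)
print(report)
```

Reporting the two error directions separately is a design choice: for safety semantics, a state wrongly excluded from the tube and a state wrongly included in it usually carry different consequences.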
Original abstract
We unify Hamilton-Jacobi (HJ) reachability and Reinforcement Learning (RL) through a proposed running cost formulation. We prove that the resultant travel-cost value function is the unique bounded viscosity solution of a time-dependent Hamilton-Jacobi Bellman (HJB) Partial Differential Equation (PDE) with zero terminal data, whose negative sublevel set equals the strict backward-reachable tube. Using a forward reparameterization and a contraction inducing Bellman update, we show that fixed points of small-step RL value iteration converge to the viscosity solution of the forward discounted HJB. Experiments on a classical benchmark validate this connection by demonstrating convergence of learned value functions toward semi-Lagrangian HJB solutions and by quantifying approximation error across the state space. These results empirically support the theoretical analysis, showing that the proposed framework preserves reachability-based safety semantics while remaining compatible with deep RL implementations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a running cost formulation to unify Hamilton-Jacobi reachability and reinforcement learning. It proves that the resulting travel-cost value function is the unique bounded viscosity solution of a time-dependent HJB PDE with zero terminal data, whose negative sublevel set equals the strict backward-reachable tube. Using forward reparameterization and a contraction-inducing Bellman update, it shows that fixed points of small-step RL value iteration converge to the viscosity solution of the forward discounted HJB. Experiments on a classical benchmark demonstrate convergence of learned value functions toward semi-Lagrangian HJB solutions and quantify approximation error across the state space.
Significance. If the central claims hold, the work is significant for providing a rigorous bridge between reachability analysis and RL, enabling RL methods to preserve reachability-based safety semantics. The direct mathematical proof establishing equivalence (without circularity or free parameters) and the contraction argument for convergence are strengths, as is the empirical validation with quantified errors. This could support safer deep RL implementations in control systems.
major comments (2)
- The uniqueness proof for the bounded viscosity solution of the time-dependent HJB PDE with zero terminal data (central to the equivalence claim) invokes standard Lipschitz conditions on dynamics and costs. The manuscript should cite the specific theorem (e.g., from the Crandall-Lions theory or a direct reference) that guarantees uniqueness in this setting to fully substantiate that the negative sublevel set matches the strict backward-reachable tube.
- Experiments section: the reported approximation errors and convergence to semi-Lagrangian solutions are used to support preservation of safety semantics. The error metric and its relation to the reachability tube should be defined more explicitly (e.g., via a specific equation or table) to confirm the link to the theoretical equivalence.
minor comments (2)
- Abstract and experiments: the term 'semi-Lagrangian HJB solutions' appears without definition or reference; add a brief explanation or citation in the main text for accessibility.
- Notation throughout: ensure consistent terminology between 'travel-cost value function' and 'reachability indicator function' to prevent minor reader confusion.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and constructive comments. We address each major comment point by point below and have incorporated revisions to strengthen the manuscript.
- Referee: The uniqueness proof for the bounded viscosity solution of the time-dependent HJB PDE with zero terminal data (central to the equivalence claim) invokes standard Lipschitz conditions on dynamics and costs. The manuscript should cite the specific theorem (e.g., from the Crandall-Lions theory or a direct reference) that guarantees uniqueness in this setting to fully substantiate that the negative sublevel set matches the strict backward-reachable tube.
  Authors: We agree that an explicit citation will strengthen the substantiation. In the revised manuscript, we now cite the relevant uniqueness result for bounded viscosity solutions of time-dependent HJB equations under standard Lipschitz assumptions on the dynamics and running cost (specifically, we reference Theorem 2.1 from Crandall, Lions, and Souganidis (1992) on viscosity solutions for Hamilton-Jacobi equations, adapted to the zero-terminal-data case). This citation is added directly to the uniqueness proof in Section 3, confirming that the negative sublevel set equals the strict backward-reachable tube without circularity.
  Revision: yes
- Referee: Experiments section: the reported approximation errors and convergence to semi-Lagrangian solutions are used to support preservation of safety semantics. The error metric and its relation to the reachability tube should be defined more explicitly (e.g., via a specific equation or table) to confirm the link to the theoretical equivalence.
  Authors: We appreciate the suggestion for greater explicitness. In the revised Experiments section, we now define the error metric explicitly in a new Equation (12) as the supremum norm of the pointwise difference between the learned RL value function and the semi-Lagrangian HJB solution over a discretized state grid. We have also added Table 1, which reports both the global average error and the maximum error restricted to the negative sublevel set (i.e., inside the reachability tube). This directly ties the quantified approximation errors to the preservation of safety semantics as established by the theoretical equivalence.
  Revision: yes
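The metrics described in this response can be sketched as follows. This is a minimal illustration that assumes both value functions are sampled on the same grid and that the tube is taken from the reference solution; the function and argument names are hypothetical, and the paper's Equation (12) and Table 1 are not reproduced here.

```python
import numpy as np

def value_errors(v_learned, v_reference):
    """Pointwise error summaries between a learned and a reference value function."""
    err = np.abs(v_learned - v_reference)
    inside = v_reference < 0.0  # negative sublevel set of the reference solution
    return {
        "sup_norm": float(err.max()),
        "mean_abs": float(err.mean()),
        # NaN signals an empty tube rather than a zero error.
        "max_inside_tube": float(err[inside].max()) if inside.any() else float("nan"),
    }

# Hypothetical values on a 5-point grid.
v_reference = np.array([0.0, -0.2, -0.5, -0.2, 0.0])
v_learned = np.array([0.25, -0.25, -0.5, -0.125, 0.0])
errors = value_errors(v_learned, v_reference)
print(errors)
```

In this made-up example the largest error sits outside the tube, so the supremum norm and the tube-restricted maximum differ, which is the distinction the two reported quantities are meant to expose.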
Circularity Check
No significant circularity identified
The paper's central derivation is a direct mathematical proof that the travel-cost value function, obtained from an explicitly chosen running cost, is the unique bounded viscosity solution to the time-dependent HJB PDE with zero terminal data under standard Lipschitz assumptions on the dynamics and costs. This equivalence is established by construction of the cost and application of known viscosity solution theory, without reducing to any fitted parameter, self-referential definition, or load-bearing self-citation. The forward reparameterization and Bellman contraction are standard RL results invoked independently of the paper's own data or prior claims. Experiments provide empirical validation but do not form part of the theoretical chain. The derivation is therefore self-contained against external mathematical benchmarks.
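The contraction property invoked here can be checked numerically for a discretized update. The sketch below assumes a hypothetical discounted backup (T V)(x) = h*c(x) + gamma * min_u V(x + h*u) with gamma = exp(-lam*h), on a one-dimensional grid with linear interpolation. The dynamics, cost, and discount rate `lam` are illustrative and not taken from the paper.

```python
import numpy as np

def bellman_update(v, x, cost, h, gamma, controls=(-1.0, 0.0, 1.0)):
    """One discounted small-step backup for dx/dt = u on grid x."""
    successors = [np.interp(x + h * u, x, v) for u in controls]
    return h * cost + gamma * np.min(successors, axis=0)

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 101)
cost = np.minimum(0.0, np.abs(x) - 0.25)  # hypothetical running cost
h, lam = 0.03, 1.0
gamma = np.exp(-lam * h)                   # per-step discount factor

# Two arbitrary value-function candidates on the grid.
v, w = rng.normal(size=x.size), rng.normal(size=x.size)
before = np.max(np.abs(v - w))
after = np.max(np.abs(bellman_update(v, x, cost, h, gamma)
                      - bellman_update(w, x, cost, h, gamma)))
print(after, gamma * before)
```

The sup-norm distance after one update should not exceed gamma times the distance before it, because linear interpolation is a convex combination and the minimum over controls is non-expansive. That bound is what makes the fixed point of the update unique.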
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: System dynamics and cost functions are Lipschitz continuous and satisfy standard regularity conditions for existence and uniqueness of viscosity solutions to HJB PDEs.
- ad hoc to paper: The running cost is formulated so that the resulting value function coincides with the reachability indicator.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: We prove that the resultant travel-cost value function is the unique bounded viscosity solution of a time-dependent Hamilton-Jacobi Bellman (HJB) Partial Differential Equation (PDE) with zero terminal data, whose negative sublevel set equals the strict backward-reachable tube.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Using a forward reparameterization and a contraction inducing Bellman update, we show that fixed points of small-step RL value iteration converge to the viscosity solution of the forward discounted HJB.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Anayo K Akametalu, Jaime F Fisac, Jeremy H Gillula, Shahab Kaynama, Melanie N Zeilinger, and Claire J Tomlin. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, pages 1424–
- [2] Anayo K Akametalu, Shromona Ghosh, Jaime F Fisac, Vicenç Rubies-Royo, and Claire J Tomlin. A minimum discounted reward Hamilton-Jacobi formulation for computing reachable sets. IEEE Transactions on Automatic Control, 69(2):1097–1103, 2023.
- [3] Aaron D Ames, Samuel Coogan, Magnus Egerstedt, Gennaro Notomista, Koushil Sreenath, and Paulo Tabuada. Control barrier functions: Theory and applications. In 2019 18th European Control Conference (ECC), pages 3420–3431. IEEE, 2019.
- [4]
- [5] Somil Bansal and Claire J Tomlin. DeepReach: A deep learning approach to high-dimensional reachability. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 1817–1824. IEEE, 2021.
- [6] Martino Bardi, Italo Capuzzo Dolcetta, et al. Optimal control and viscosity solutions of Hamilton-Jacobi-Bellman equations, volume 12. Springer, 1997.
- [7] Guy Barles and Panagiotis E Souganidis. Convergence of approximation schemes for fully nonlinear second order equations. Asymptotic Analysis, 4(3):271–283, 1991.
- [8] Mo Chen, Sylvia L Herbert, Mahesh S Vashishtha, Somil Bansal, and Claire J Tomlin. Decomposition of reachable sets and tubes for a class of nonlinear systems. IEEE Transactions on Automatic Control, 63(11):3675–3688, 2018.
- [9] Xuchan Chen, Ugo Rosolia, and Claire Tomlin. Hamilton-Jacobi reachability in reinforcement learning: A survey. arXiv preprint arXiv:2310.06764, 2023.
- [10] Jason J Choi, Donggun Lee, Koushil Sreenath, Claire J Tomlin, and Sylvia L Herbert. Robust control barrier-value functions for safety-critical control. In 2021 60th IEEE Conference on Decision and Control (CDC), pages 6814–
- [11] Michael G Crandall, Hitoshi Ishii, and Pierre-Louis Lions. User's guide to viscosity solutions of second order partial differential equations. Bulletin of the American Mathematical Society, 27(1):1–67, 1992.
- [12] Jérôme Darbon and Stanley Osher. Algorithms for overcoming the curse of dimensionality for certain Hamilton-Jacobi equations arising in control theory and elsewhere. Research in the Mathematical Sciences, 3(1):19, 2016.
- [13] Lawrence C Evans and Panagiotis E Souganidis. Differential games and representation formulas for solutions of Hamilton-Jacobi-Isaacs equations. Indiana University Mathematics Journal, 33(5):773–797, 1984.
- [14] Maurizio Falcone and Roberto Ferretti. Semi-Lagrangian approximation schemes for linear and Hamilton-Jacobi equations. SIAM, 2013.
- [15] Jaime F Fisac, Neil F Lugovoy, Vicenç Rubies-Royo, Shromona Ghosh, and Claire J Tomlin. Bridging Hamilton-Jacobi safety analysis and reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 8550–8556. IEEE, 2019.
- [16] Milan Ganai, Zheng Gong, Chenning Yu, Sylvia Herbert, and Sicun Gao. Iterative reachability estimation for safe reinforcement learning. Advances in Neural Information Processing Systems, 36:69764–69797, 2023.
- [17] Gene H Golub and John H Welsch. Calculation of Gauss quadrature rules. Mathematics of Computation, 23(106):221–230, 1969.
- [18] John Lygeros. On reachability and minimum cost optimal control. Automatica, 40(6):917–927, 2004.
- [19] I. M. Mitchell, A. M. Bayen, and C. J. Tomlin. A time-dependent Hamilton-Jacobi formulation of reachable sets for continuous dynamic games. IEEE Transactions on Automatic Control, 50(7):947–957, 2005.
- [20] Ian M Mitchell. The flexible, extensible and efficient toolbox of level set methods. Journal of Scientific Computing, 35(2):300–329, 2008.
- [21] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- [22] Keiko Nagami and Mac Schwager. HJB-RL: Initializing reinforcement learning with optimal control policies applied to autonomous drone racing. In Robotics: Science and Systems, pages 1–9, 2021.
- [23] Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems, 33:7462–7473, 2020.
- [24] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.
- [25] Bhagyashree Umathe, Duvan Tellez-Castro, and Umesh Vaidya. Reachability analysis using spectrum of Koopman operator. IEEE Control Systems Letters, 7:595–600, 2022.
- [26] Harley E Wiltzer, David Meger, and Marc G Bellemare. Distributional Hamilton-Jacobi-Bellman equations for continuous-time reinforcement learning. In International Conference on Machine Learning, pages 23832–23856. PMLR, 2022.
- [27] He Yin, Murat Arcak, Andrew Packard, and Peter Seiler. Backward reachability for polynomial systems on a finite horizon. IEEE Transactions on Automatic Control, 66(12):6025–6032, 2021.