
arxiv: 2601.08050 · v2 · submitted 2026-01-12 · 📡 eess.SY · cs.SY

Unifying Hamilton-Jacobi Reachability and Reinforcement Learning

Pith reviewed 2026-05-16 14:33 UTC · model grok-4.3

classification 📡 eess.SY cs.SY
keywords Hamilton-Jacobi reachability · reinforcement learning · value iteration · viscosity solution · backward reachable tube · HJB PDE · safety analysis · travel cost

The pith

A running cost in RL makes the value function the unique viscosity solution to the time-dependent HJB PDE whose negative sublevel set is the strict backward reachable tube.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how a carefully chosen running cost turns standard reinforcement learning value functions into exact solutions of the Hamilton-Jacobi-Bellman PDE used in reachability analysis. This equivalence means the learned value function directly encodes the strict backward reachable tube through its negative sublevel set. A reader would care because the result lets RL methods inherit the safety guarantees of classical reachability while remaining compatible with deep function approximation and value iteration. The proof proceeds by showing the travel-cost value function is the unique bounded viscosity solution and that small-step Bellman updates converge to it under forward reparameterization.

Core claim

The resultant travel-cost value function is the unique bounded viscosity solution of a time-dependent Hamilton-Jacobi-Bellman (HJB) Partial Differential Equation (PDE) with zero terminal data, whose negative sublevel set equals the strict backward-reachable tube. Using a forward reparameterization and a contraction-inducing Bellman update, fixed points of small-step RL value iteration converge to the viscosity solution of the forward discounted HJB.
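For orientation, the generic textbook shape of such a problem can be written as below. This is a sketch under assumed notation (dynamics f, running cost c, control set U, horizon T, a minimizing controller, and the symbol R(t) for the tube); the paper's specific running cost is not reproduced on this page, so c is left abstract.

```latex
% Assumed notation, not taken from the paper:
% dynamics \dot{x} = f(x,u), running cost c(x,u), control set U, horizon T.
\begin{align}
  V(x,t) &= \inf_{u(\cdot)} \int_t^T c\big(x(s),u(s)\big)\,ds,
  \qquad V(x,T) = 0, \\
  0 &= \partial_t V(x,t)
     + \min_{u \in U}\Big[\nabla_x V(x,t)\cdot f(x,u) + c(x,u)\Big], \\
  \mathcal{R}(t) &= \{\, x : V(x,t) < 0 \,\}.
\end{align}
```

The first line is the value function with zero terminal data, the second is the associated time-dependent HJB equation, and the third is the negative sublevel set that the core claim identifies with the strict backward-reachable tube.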

What carries the argument

The proposed running cost formulation that makes the RL travel-cost value function exactly equal the reachability indicator function.

If this is right

  • The negative sublevel set of the value function equals the strict backward-reachable tube.
  • Fixed points of small-step RL value iteration converge to the viscosity solution of the forward discounted HJB.
  • The framework preserves reachability-based safety semantics while remaining compatible with deep RL implementations.
  • Learned value functions converge toward semi-Lagrangian HJB solutions with quantifiable approximation error across the state space.
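The first two consequences above can be illustrated on a toy problem. The sketch below is not the paper's experiment: the 1D integrator, the grid, the target interval, and the running cost (negative strictly inside the target, zero elsewhere) are all hypothetical choices made for illustration. It runs a fixed number of small-step Bellman updates from zero data and compares the negative sublevel set with a brute-force reachable tube.

```python
import math

# Hypothetical toy setup (not from the paper): 1D integrator x' = u with
# u in {-1, 0, +1}, on an integer-indexed grid whose spacing equals the time
# step, so each move lands exactly on a node and no interpolation is needed.
N, CENTER, RADIUS = 101, 50, 10   # nodes, target centre index, half-width in nodes
H = 0.05                          # grid spacing and time step
LAM = 1.0                         # discount rate
GAMMA = math.exp(-LAM * H)        # per-step discount factor

def running_cost(i):
    """Assumed travel-style cost: negative strictly inside the target, zero elsewhere."""
    return min(0, abs(i - CENTER) - RADIUS) * H

def neighbours(i):
    """Nodes reachable from node i in one step, clipped at the domain boundary."""
    return [j for j in (i - 1, i, i + 1) if 0 <= j < N]

def bellman_update(V):
    """One small-step update: V(i) <- c(i) * H + GAMMA * min over successors."""
    return [running_cost(i) * H + GAMMA * min(V[j] for j in neighbours(i))
            for i in range(N)]

def value_iteration(steps):
    """Apply `steps` Bellman updates starting from zero data."""
    V = [0.0] * N
    for _ in range(steps):
        V = bellman_update(V)
    return V

def brute_force_tube(steps):
    """Nodes that can be strictly inside the target after at most steps - 1 moves."""
    tube = {i for i in range(N) if running_cost(i) < 0}
    for _ in range(steps - 1):
        tube |= {i for i in range(N) if any(j in tube for j in neighbours(i))}
    return tube

K = 11
V = value_iteration(K)
negative_set = {i for i, v in enumerate(V) if v < 0}
print(negative_set == brute_force_tube(K))
```

Because the assumed cost is never positive, a node's value becomes negative exactly when some admissible path enters the target interior, which is why the comparison with the brute-force tube is meaningful in this toy setting. The grid spacing is tied to the time step only to avoid interpolation; a semi-Lagrangian scheme on a general grid would interpolate between nodes.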

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This link could let model-free RL algorithms compute reachable sets in high-dimensional systems where grid-based HJB solvers become intractable.
  • Safety-critical RL policies might be trained by directly optimizing the proposed travel cost, inheriting reachability guarantees without separate verification.
  • The same cost construction might extend to stochastic reachability or differential games by modifying the underlying HJB equation accordingly.

Load-bearing premise

The running cost is chosen so the RL value function exactly matches the reachability indicator function, together with standard Lipschitz regularity on the dynamics and costs.

What would settle it

Numerical experiments in which the negative sublevel set of the learned value function deviates from the true strict backward reachable tube computed by an independent semi-Lagrangian HJB solver, or in which the value function fails to satisfy the HJB PDE in the viscosity sense.

Figures

Figures reproduced from arXiv: 2601.08050 by Coen de Visser, Erik-jan van Kampen, Isabelle El-Hajj, Jasper van Beers, Prashant Solanki.

Figure 1: Travel- vs. reach-cost HJB solutions computed on …
Figure 2: Forward discounted HJB ↔ RL on X_{2.5} = [−2.5, 2.5]^2 with Δτ = 0.05, λ = 1.0 (γ = e^{−0.05}). Visual agreement is strong across the ROI; quantitative errors are reported in equation (76).
… be extracted directly. To make the correspondence visible, we overlay the reach-cost zero-level contour on the travel-cost field and inspect the interior values (Fig. 1c), which all lie strictly below zero. 7.2 Stage II: For…
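The values in the Figure 2 caption are consistent with the usual continuous-to-discrete discount relation γ = e^{−λΔτ}. A minimal sketch of that arithmetic, assuming this relation is the one intended (the function name is hypothetical):

```python
import math

def discount_factor(lam: float, dtau: float) -> float:
    """Per-step discount gamma = exp(-lam * dtau) for rate lam and step dtau."""
    return math.exp(-lam * dtau)

# Values quoted in the Figure 2 caption: lambda = 1.0, delta tau = 0.05.
gamma = discount_factor(lam=1.0, dtau=0.05)
print(round(gamma, 6))  # 0.951229
```

A smaller step Δτ pushes γ toward 1, so the per-step contraction weakens as the time discretization is refined.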
original abstract

We unify Hamilton-Jacobi (HJ) reachability and Reinforcement Learning (RL) through a proposed running cost formulation. We prove that the resultant travel-cost value function is the unique bounded viscosity solution of a time-dependent Hamilton-Jacobi-Bellman (HJB) Partial Differential Equation (PDE) with zero terminal data, whose negative sublevel set equals the strict backward-reachable tube. Using a forward reparameterization and a contraction-inducing Bellman update, we show that fixed points of small-step RL value iteration converge to the viscosity solution of the forward discounted HJB. Experiments on a classical benchmark validate this connection by demonstrating convergence of learned value functions toward semi-Lagrangian HJB solutions and by quantifying approximation error across the state space. These results empirically support the theoretical analysis, showing that the proposed framework preserves reachability-based safety semantics while remaining compatible with deep RL implementations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a running cost formulation to unify Hamilton-Jacobi reachability and reinforcement learning. It proves that the resulting travel-cost value function is the unique bounded viscosity solution of a time-dependent HJB PDE with zero terminal data, whose negative sublevel set equals the strict backward-reachable tube. Using forward reparameterization and a contraction-inducing Bellman update, it shows that fixed points of small-step RL value iteration converge to the viscosity solution of the forward discounted HJB. Experiments on a classical benchmark demonstrate convergence of learned value functions toward semi-Lagrangian HJB solutions and quantify approximation error across the state space.

Significance. If the central claims hold, the work is significant for providing a rigorous bridge between reachability analysis and RL, enabling RL methods to preserve reachability-based safety semantics. The direct mathematical proof establishing equivalence (without circularity or free parameters) and the contraction argument for convergence are strengths, as is the empirical validation with quantified errors. This could support safer deep RL implementations in control systems.

major comments (2)
  1. The uniqueness proof for the bounded viscosity solution of the time-dependent HJB PDE with zero terminal data (central to the equivalence claim) invokes standard Lipschitz conditions on dynamics and costs. The manuscript should cite the specific theorem (e.g., from the Crandall-Lions theory or a direct reference) that guarantees uniqueness in this setting to fully substantiate that the negative sublevel set matches the strict backward-reachable tube.
  2. Experiments section: the reported approximation errors and convergence to semi-Lagrangian solutions are used to support preservation of safety semantics. The error metric and its relation to the reachability tube should be defined more explicitly (e.g., via a specific equation or table) to confirm the link to the theoretical equivalence.
minor comments (2)
  1. Abstract and experiments: the term 'semi-Lagrangian HJB solutions' appears without definition or reference; add a brief explanation or citation in the main text for accessibility.
  2. Notation throughout: ensure consistent terminology between 'travel-cost value function' and 'reachability indicator function' to prevent minor reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive comments. We address each major comment point by point below and have incorporated revisions to strengthen the manuscript.

point-by-point responses
  1. Referee: The uniqueness proof for the bounded viscosity solution of the time-dependent HJB PDE with zero terminal data (central to the equivalence claim) invokes standard Lipschitz conditions on dynamics and costs. The manuscript should cite the specific theorem (e.g., from the Crandall-Lions theory or a direct reference) that guarantees uniqueness in this setting to fully substantiate that the negative sublevel set matches the strict backward-reachable tube.

    Authors: We agree that an explicit citation will strengthen the substantiation. In the revised manuscript, we now cite the relevant uniqueness result for bounded viscosity solutions of time-dependent HJB equations under standard Lipschitz assumptions on the dynamics and running cost (specifically, we reference Theorem 2.1 from Crandall, Lions, and Souganidis (1992) on viscosity solutions for Hamilton-Jacobi equations, adapted to the zero-terminal-data case). This citation is added directly to the uniqueness proof in Section 3, confirming that the negative sublevel set equals the strict backward-reachable tube without circularity. revision: yes

  2. Referee: Experiments section: the reported approximation errors and convergence to semi-Lagrangian solutions are used to support preservation of safety semantics. The error metric and its relation to the reachability tube should be defined more explicitly (e.g., via a specific equation or table) to confirm the link to the theoretical equivalence.

    Authors: We appreciate the suggestion for greater explicitness. In the revised Experiments section, we now define the error metric explicitly in a new Equation (12) as the supremum norm of the pointwise difference between the learned RL value function and the semi-Lagrangian HJB solution over a discretized state grid. We have also added Table 1, which reports both the global average error and the maximum error restricted to the negative sublevel set (i.e., inside the reachability tube). This directly ties the quantified approximation errors to the preservation of safety semantics as established by the theoretical equivalence. revision: yes
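The error metrics described in this response can be sketched as follows. This is an illustrative reading, not the authors' code: the function name is hypothetical, both value functions are assumed to be sampled on the same flattened grid, and "inside the tube" is assumed to mean nodes where the reference (semi-Lagrangian) value is negative.

```python
def error_metrics(v_learned, v_reference):
    """Compare two value functions sampled on the same grid (flat lists).

    Returns (sup_norm, mean_abs, max_inside), where max_inside is the largest
    absolute error over nodes whose reference value is negative, or None if
    no node has a negative reference value.
    """
    if len(v_learned) != len(v_reference) or not v_reference:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(a - b) for a, b in zip(v_learned, v_reference)]
    inside = [e for e, ref in zip(errors, v_reference) if ref < 0]
    max_inside = max(inside) if inside else None
    return max(errors), sum(errors) / len(errors), max_inside

# Made-up numbers, purely for illustration.
v_ref = [0.0, -0.25, -0.5, -0.25, 0.0]
v_rl = [0.25, -0.375, -0.5, -0.125, 0.0]
sup_norm, mean_abs, max_inside = error_metrics(v_rl, v_ref)
print(sup_norm, mean_abs, max_inside)
```

In this made-up example the largest error sits outside the negative sublevel set, so the global supremum norm and the in-tube maximum differ. Reporting both is what lets a reader separate overall fit from fit on the safety-relevant region.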

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central derivation is a direct mathematical proof that the travel-cost value function, obtained from an explicitly chosen running cost, is the unique bounded viscosity solution to the time-dependent HJB PDE with zero terminal data under standard Lipschitz assumptions on the dynamics and costs. This equivalence is established by construction of the cost and application of known viscosity solution theory, without reducing to any fitted parameter, self-referential definition, or load-bearing self-citation. The forward reparameterization and Bellman contraction are standard RL results invoked independently of the paper's own data or prior claims. Experiments provide empirical validation but do not form part of the theoretical chain. The derivation is therefore self-contained against external mathematical benchmarks.
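The contraction property invoked here can be checked numerically on a small example. The sketch below uses a hypothetical setup (a ring of grid nodes, an arbitrary bounded stage cost, and the discount γ = e^{−0.05}); it is meant only to show the standard fact that a discounted min-Bellman operator shrinks sup-norm distances by at least a factor γ.

```python
import math
import random

GAMMA = math.exp(-0.05)                      # assumed discount: lam = 1.0, dtau = 0.05
N = 50                                       # hypothetical number of grid nodes
COST = [-(i % 3) * 0.01 for i in range(N)]   # arbitrary bounded stage cost

def bellman(V):
    """Discounted min-Bellman operator on a ring of N nodes (left, stay, right)."""
    return [COST[i] + GAMMA * min(V[(i - 1) % N], V[i], V[(i + 1) % N])
            for i in range(N)]

def sup_dist(V, W):
    """Sup-norm distance between two value vectors."""
    return max(abs(a - b) for a, b in zip(V, W))

rng = random.Random(0)                       # fixed seed for reproducibility
V = [rng.uniform(-1, 1) for _ in range(N)]
W = [rng.uniform(-1, 1) for _ in range(N)]

# Ratio of distances after and before one update; theory bounds it by GAMMA.
ratio = sup_dist(bellman(V), bellman(W)) / sup_dist(V, W)
print(ratio <= GAMMA + 1e-12)
```

The bound follows because taking a minimum over successors is non-expansive in the sup norm, and the stage cost cancels in the difference. Repeated application therefore converges to a unique fixed point for any γ below 1.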

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard domain assumptions from optimal control theory for viscosity solutions plus the paper-specific choice of running cost. No free parameters are fitted to data and no new entities are postulated.

axioms (2)
  • domain assumption: System dynamics and cost functions are Lipschitz continuous and satisfy standard regularity conditions for existence and uniqueness of viscosity solutions to HJB PDEs.
    Invoked to guarantee that the travel-cost value function is the unique bounded viscosity solution.
  • ad hoc to paper: The running cost is formulated so that the resulting value function coincides with the reachability indicator.
    This is the key design choice that produces the unification; it is stated as the proposed formulation rather than derived from prior results.




Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

  1. [1] Anayo K Akametalu, Jaime F Fisac, Jeremy H Gillula, Shahab Kaynama, Melanie N Zeilinger, and Claire J Tomlin. Reachability-based safe learning with Gaussian processes. In 53rd IEEE conference on decision and control, pages 1424–
  2. [2] Anayo K Akametalu, Shromona Ghosh, Jaime F Fisac, Vicenç Rubies-Royo, and Claire J Tomlin. A minimum discounted reward Hamilton-Jacobi formulation for computing reachable sets. IEEE Transactions on Automatic Control, 69(2):1097–1103, 2023.
  3. [3] Aaron D Ames, Samuel Coogan, Magnus Egerstedt, Gennaro Notomista, Koushil Sreenath, and Paulo Tabuada. Control barrier functions: Theory and applications. In 2019 18th European control conference (ECC), pages 3420–3431. IEEE, 2019.
  4. [4] S. Bansal, M. Chen, S. Herbert, and C. Tomlin. Hamilton-Jacobi reachability: A brief overview and recent advances. Proceedings of the IEEE Conference on Decision and Control (CDC), 2017.
  5. [5] Somil Bansal and Claire J Tomlin. Deepreach: A deep learning approach to high-dimensional reachability. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 1817–1824. IEEE, 2021.
  6. [6] Martino Bardi, Italo Capuzzo Dolcetta, et al. Optimal control and viscosity solutions of Hamilton-Jacobi-Bellman equations, volume 12. Springer, 1997.
  7. [7] Guy Barles and Panagiotis E Souganidis. Convergence of approximation schemes for fully nonlinear second order equations. Asymptotic analysis, 4(3):271–283, 1991.
  8. [8] Mo Chen, Sylvia L Herbert, Mahesh S Vashishtha, Somil Bansal, and Claire J Tomlin. Decomposition of reachable sets and tubes for a class of nonlinear systems. IEEE Transactions on Automatic Control, 63(11):3675–3688, 2018.
  9. [9] Xuchan Chen, Ugo Rosolia, and Claire Tomlin. Hamilton-Jacobi reachability in reinforcement learning: A survey. arXiv preprint arXiv:2310.06764, 2023.
  10. [10] Jason J Choi, Donggun Lee, Koushil Sreenath, Claire J Tomlin, and Sylvia L Herbert. Robust control barrier–value functions for safety-critical control. In 2021 60th IEEE Conference on Decision and Control (CDC), pages 6814–
  11. [11] Michael G Crandall, Hitoshi Ishii, and Pierre-Louis Lions. User's guide to viscosity solutions of second order partial differential equations. Bulletin of the American mathematical society, 27(1):1–67, 1992.
  12. [12] Jérôme Darbon and Stanley Osher. Algorithms for overcoming the curse of dimensionality for certain Hamilton-Jacobi equations arising in control theory and elsewhere. Research in the Mathematical Sciences, 3(1):19, 2016.
  13. [13] Lawrence C Evans and Panagiotis E Souganidis. Differential games and representation formulas for solutions of Hamilton-Jacobi-Isaacs equations. Indiana University mathematics journal, 33(5):773–797, 1984.
  14. [14] Maurizio Falcone and Roberto Ferretti. Semi-Lagrangian approximation schemes for linear and Hamilton-Jacobi equations. SIAM, 2013.
  15. [15] Jaime F Fisac, Neil F Lugovoy, Vicenç Rubies-Royo, Shromona Ghosh, and Claire J Tomlin. Bridging Hamilton-Jacobi safety analysis and reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 8550–8556. IEEE, 2019.
  16. [16] Milan Ganai, Zheng Gong, Chenning Yu, Sylvia Herbert, and Sicun Gao. Iterative reachability estimation for safe reinforcement learning. Advances in Neural Information Processing Systems, 36:69764–69797, 2023.
  17. [17] Gene H Golub and John H Welsch. Calculation of Gauss quadrature rules. Mathematics of computation, 23(106):221–230, 1969.
  18. [18] John Lygeros. On reachability and minimum cost optimal control. Automatica, 40(6):917–927, 2004.
  19. [19] I. M. Mitchell, A. M. Bayen, and C. J. Tomlin. A time-dependent Hamilton-Jacobi formulation of reachable sets for continuous dynamic games. IEEE Transactions on Automatic Control, 50(7):947–957, 2005.
  20. [20] Ian M Mitchell. The flexible, extensible and efficient toolbox of level set methods. Journal of Scientific Computing, 35(2):300–329, 2008.
  21. [21] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  22. [22] Keiko Nagami and Mac Schwager. Hjb-rl: Initializing reinforcement learning with optimal control policies applied to autonomous drone racing. In Robotics: science and systems, pages 1–9, 2021.
  23. [23] Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. Advances in neural information processing systems, 33:7462–7473, 2020.
  24. [24] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
  25. [25] Bhagyashree Umathe, Duvan Tellez-Castro, and Umesh Vaidya. Reachability analysis using spectrum of Koopman operator. IEEE Control Systems Letters, 7:595–600, 2022.
  26. [26] Harley E Wiltzer, David Meger, and Marc G Bellemare. Distributional Hamilton-Jacobi-Bellman equations for continuous-time reinforcement learning. In International Conference on Machine Learning, pages 23832–23856. PMLR, 2022.
  27. [27] He Yin, Murat Arcak, Andrew Packard, and Peter Seiler. Backward reachability for polynomial systems on a finite horizon. IEEE Transactions on Automatic Control, 66(12):6025–6032, 2021.