Deep QP Safety Filter: Model-free Learning for Reachability-based Safety Filter

Byeongjun Kim; H. Jin Kim

arXiv: 2601.21297 · v2 · submitted 2026-01-29 · 💻 cs.RO · cs.SY · eess.SY


Pith reviewed 2026-05-16 10:09 UTC · model grok-4.3

classification 💻 cs.RO · cs.SY · eess.SY

keywords safety filter · model-free learning · Hamilton-Jacobi reachability · quadratic programming · reinforcement learning · viscosity solution · neural networks · black-box systems

The pith

A model-free quadratic program safety filter can be learned for black-box systems by training neural networks on contraction losses derived from Hamilton-Jacobi reachability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Deep QP Safety Filter as a fully data-driven safety layer that adds reachability-based constraints to control of unknown dynamical systems. It trains two neural networks, one for the safety value function and one for its derivatives, using specially constructed contraction losses that require no model of the dynamics. In the exact-data limit this produces convergence to the viscosity solution even when the value function is non-smooth, and empirical tests across continuous and hybrid systems show fewer early failures and faster progress to higher returns in reinforcement-learning tasks.

Core claim

Deep QP Safety Filter constructs contraction-based losses for both the safety value function and its derivatives; two neural networks trained on these losses converge, in the exact setting, to the viscosity solution of the associated Hamilton-Jacobi equation and to its derivative, thereby yielding a quadratic-program safety filter that operates without any knowledge of the system dynamics.
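The contraction property that this claim leans on can be illustrated in a small tabular setting. The sketch below uses the discounted safety Bellman backup familiar from the reachability and reinforcement-learning literature, V(x) = (1 - γ) l(x) + γ min{ l(x), max_u V(x') }, purely as a stand-in: the paper's actual losses are not reproduced on this page and may differ. The chain system, the margin array `margin`, and the function names `safety_backup` and `solve_fixed_point` are all hypothetical.

```python
import numpy as np

def safety_backup(V, margin, next_state, gamma):
    """One discounted safety Bellman backup on a finite state space.

    V          : (S,) current value estimate
    margin     : (S,) safety margin l(x), negative inside the failure set
    next_state : (S, A) integer table of successor states (assumed toy model)
    gamma      : discount factor in (0, 1)
    """
    best_next = V[next_state].max(axis=1)  # best successor value over actions
    return (1.0 - gamma) * margin + gamma * np.minimum(margin, best_next)

def solve_fixed_point(margin, next_state, gamma, iters=500):
    """Iterate the backup from zero; the contraction makes this converge."""
    V = np.zeros_like(margin)
    for _ in range(iters):
        V = safety_backup(V, margin, next_state, gamma)
    return V

# Hypothetical 11-state chain with actions: move left, stay, move right.
num_states = 11
states = np.arange(num_states)
next_state = np.stack(
    [np.clip(states - 1, 0, num_states - 1),
     states,
     np.clip(states + 1, 0, num_states - 1)],
    axis=1,
)
margin = states - 2.5  # states 0..2 form the failure set
gamma = 0.9

# Contraction check: the sup-norm gap between two estimates shrinks by gamma.
rng = np.random.default_rng(0)
V1, V2 = rng.normal(size=num_states), rng.normal(size=num_states)
gap_before = np.abs(V1 - V2).max()
gap_after = np.abs(
    safety_backup(V1, margin, next_state, gamma)
    - safety_backup(V2, margin, next_state, gamma)
).max()

V_star = solve_fixed_point(margin, next_state, gamma)
residual = np.abs(safety_backup(V_star, margin, next_state, gamma) - V_star).max()
print(f"gap before: {gap_before:.4f}, gap after: {gap_after:.4f}")
print(f"fixed-point residual: {residual:.2e}")
```

Because the backup contracts in the sup norm, a regression loss toward its output has a unique fixed point, which is the kind of structure a convergence argument needs. The tabular case only illustrates the mechanism and says nothing about neural function approximation.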

What carries the argument

Contraction-based losses on the Hamilton-Jacobi reachability value function and its spatial derivatives, used to train a pair of neural networks that parameterize a quadratic-program safety filter.
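For readers unfamiliar with quadratic-program safety filters, the following is a minimal sketch of the filtering step alone. It assumes a single affine constraint a·u + b ≥ 0 on the control, where `a` and `b` are hypothetical placeholders for whatever the learned value and derivative networks supply; the paper's exact constraint is not given on this page. With one affine constraint the QP has a closed-form solution, so no solver is required.

```python
import numpy as np

def qp_safety_filter(u_ref, a, b, eps=1e-12):
    """Solve  min_u 0.5 * ||u - u_ref||^2  s.t.  a @ u + b >= 0  in closed form.

    u_ref : reference control proposed by the task policy
    a, b  : coefficients of the affine safety constraint (hypothetical;
            in a learned filter they would come from the trained networks)
    """
    u_ref = np.asarray(u_ref, dtype=float)
    a = np.asarray(a, dtype=float)
    slack = a @ u_ref + b
    if slack >= 0.0:
        return u_ref.copy()  # reference control already satisfies the constraint
    norm_sq = a @ a
    if norm_sq < eps:
        return u_ref.copy()  # constraint does not depend on u; nothing to project onto
    # Project u_ref onto the boundary of the half-space a @ u + b >= 0.
    return u_ref - (slack / norm_sq) * a

# The first reference control violates the constraint and is projected;
# the second already satisfies it and passes through unchanged.
a, b = np.array([-1.0, 0.0]), 0.5
u_safe = qp_safety_filter([1.0, 0.0], a, b)
u_same = qp_safety_filter([0.2, 0.3], a, b)
print(u_safe, u_same)
```

The filter changes the reference control only when the constraint is violated, and then by the smallest Euclidean correction. Input bounds or multiple constraints would need a general QP solver instead of this projection.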

If this is right

Pre-convergence failures in reinforcement learning are substantially reduced across multiple tasks.
Learning converges to higher returns than strong model-free baselines.
The same procedure applies unchanged to hybrid dynamical systems.
Safety is enforced at every step without requiring an explicit dynamics model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same contraction-loss structure could be reused to learn safety filters for other value-function problems whose viscosity solutions are known to exist.
Because the method never uses the dynamics, it may remain effective under unmodeled disturbances that would invalidate model-based reachability filters.
Extending the data-collection policy to include controlled unsafe excursions could tighten the convergence rate without sacrificing the safety guarantee.

Load-bearing premise

Sufficient data coverage exists to train the networks such that the contraction losses drive convergence to the true reachability solution without model knowledge.

What would settle it

Collect data only from safe trajectories on a known system whose true viscosity solution is computable, then check whether the learned filter still prevents entry into unsafe states during closed-loop execution.

Original abstract

We introduce Deep QP Safety Filter, a fully data-driven safety layer for black-box dynamical systems. Our method learns a Quadratic-Program (QP) safety filter without model knowledge by combining Hamilton-Jacobi (HJ) reachability with model-free learning. We construct contraction-based losses for both the safety value and its derivatives, and train two neural networks accordingly. In the exact setting, the learned critic converges to the viscosity solution (and its derivative), even for non-smooth values. Across diverse dynamical systems -- even including a hybrid system -- and multiple RL tasks, Deep QP Safety Filter substantially reduces pre-convergence failures while accelerating learning toward higher returns than strong baselines, offering a principled and practical route to safe, model-free control.
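For orientation, the viscosity solution mentioned in the abstract is usually defined through a Hamilton-Jacobi variational inequality. The block below states the standard finite-horizon avoid formulation from the reachability literature, without disturbances. The symbols are assumptions made here for illustration: l(x) denotes a safety margin that is negative inside the failure set, f(x,u) the unknown dynamics, and T the horizon. The paper may use a different (for example discounted or infinite-horizon) form.

```latex
\begin{aligned}
0 &= \min\Bigl\{\, l(x) - V(x,t),\;\;
      \frac{\partial V}{\partial t}(x,t)
      + \max_{u \in \mathcal{U}} \nabla_x V(x,t)^{\top} f(x,u) \Bigr\},
      \qquad t \in [0, T], \\
V(x,T) &= l(x).
\end{aligned}
```

In this formulation the safe set is the region where V is non-negative, and V is generally only Lipschitz continuous, which is why the equation is interpreted in the viscosity sense rather than classically.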

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a workable model-free route to a reachability QP safety filter with decent empirical gains, but the derivative convergence claim rests on an unverified contraction construction.

The main thing here is a data-driven QP safety layer that learns both the reachability value function and its derivative through contraction losses, then plugs the result into a standard safety QP. That combination is new enough to notice: prior HJ safety filters either need a model or stop at the value function. The experiments show clear drops in early failures on several RL tasks and even a hybrid system, which is the part that would actually matter to someone trying to bolt safety onto black-box control. They also report faster convergence to higher returns than the baselines they chose. Those results look reproducible from the description and give the work its practical weight.

The soft spot is the central theoretical claim. The abstract asserts convergence to the viscosity solution and its derivative in the exact setting, yet the derivative loss has to approximate the Hamiltonian without any dynamics model or Jacobian. Standard neural approximations do not automatically satisfy the viscosity definition for non-smooth functions, and the paper does not appear to supply an independent check that the learned derivative actually satisfies the PDE away from the data. That gap makes the strongest claim hard to trust without the full derivation and error analysis.

The empirical side stands on its own, so the paper is worth a serious referee who can press on the theory while keeping the practical contribution. I would bring it to a reading group focused on safe RL to see how the losses are actually implemented.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Deep QP Safety Filter, a fully data-driven safety layer for black-box dynamical systems. It combines Hamilton-Jacobi reachability with model-free learning by training two neural networks (for the safety value and its derivative) via contraction-based losses, without requiring a dynamics model. The central theoretical claim is that, in the exact (infinite-data) setting, the learned networks converge to the viscosity solution of the HJ equation and its derivative, even for non-smooth value functions. Empirical results across dynamical systems (including a hybrid system) and RL tasks show fewer pre-convergence failures and higher returns than the baselines.

Significance. If the model-free convergence claim holds, particularly the derivative network's ability to satisfy the viscosity solution without dynamics knowledge, the work would provide a principled bridge between reachability analysis and data-driven safe control, enabling safety filters for unknown systems. The extension to hybrid systems and RL integration adds practical value, but the significance hinges on resolving the construction of the derivative loss.

major comments (1)

[Abstract] The claim that both the critic and its derivative converge to the viscosity solution in the exact setting requires the contraction losses to enforce the HJB equation (including the viscosity definition) using only trajectory samples. The construction of the derivative loss is not shown to be compatible with a strictly model-free premise; any penalty on deviation from the Hamiltonian typically requires either explicit differentiation through f(x,u) or an approximation that implicitly uses the dynamics or its Jacobian, which contradicts the black-box assumption. This is load-bearing for the central theoretical contribution and needs an explicit derivation or counterexample showing how the loss is formed without model knowledge.

minor comments (1)

[Empirical Evaluation] Performance claims on RL tasks (reduced failures, higher returns) are presented without visible baseline details, statistical tests, or variance reporting, making it difficult to assess the magnitude and reliability of the gains.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their insightful review and for identifying the need for greater clarity on the theoretical construction of the derivative loss. We address this point directly below and will strengthen the manuscript accordingly.

Referee: [Abstract] The claim that both the critic and its derivative converge to the viscosity solution in the exact setting requires the contraction losses to enforce the HJB equation (including the viscosity definition) using only trajectory samples. The construction of the derivative loss is not shown to be compatible with a strictly model-free premise; any penalty on deviation from the Hamiltonian typically requires either explicit differentiation through f(x,u) or an approximation that implicitly uses the dynamics or its Jacobian, which contradicts the black-box assumption. This is load-bearing for the central theoretical contribution and needs an explicit derivation or counterexample showing how the loss is formed without model knowledge.

Authors: We appreciate the referee highlighting this foundational aspect. The derivative loss is constructed strictly from trajectory samples without access to f or its Jacobian. We define a contraction operator on pairs of networks (V, D) where D approximates the spatial gradient of V. The loss penalizes violations of the HJB PDE by replacing the directional derivative term with a finite-difference estimate computed directly from consecutive state samples along observed trajectories; the time derivative is likewise obtained from the sampled time steps. Because these differences are formed solely from the data points (x_t, x_{t+1}), no explicit dynamics model enters the computation. In the infinite-data limit the contraction property of the loss ensures convergence to the viscosity solution of the HJ equation and its derivative, consistent with the stated theorem. We will insert a self-contained derivation of this loss (including the precise finite-difference stencil and the proof that it remains model-free) into the revised manuscript.

revision: yes
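The finite-difference idea in this simulated response can be checked numerically on a toy problem. The sketch below is an illustration under assumptions, not the paper's stencil, which is not specified here: `black_box_step` is a hypothetical system whose dynamics are hidden from the estimator, and `value` is a hypothetical smooth function chosen so that the exact directional derivative is available for comparison.

```python
import numpy as np

def black_box_step(x, u, dt):
    """Stand-in for the unknown system; the estimator only sees its output."""
    f = np.array([x[1], -x[0] + u])  # hidden dynamics (hypothetical)
    return x + dt * f

def value(x):
    """Hypothetical smooth value function used only for this check."""
    return x[0] ** 2 + 0.5 * x[1] ** 2

def fd_directional_derivative(value_fn, x_t, x_next, dt):
    """Estimate grad V(x_t) . f(x_t, u_t) from two consecutive samples only."""
    return (value_fn(x_next) - value_fn(x_t)) / dt

x_t = np.array([0.5, -0.3])
u_t = 0.2
dt = 1e-3

# Data the learner would actually observe: (x_t, x_next).
x_next = black_box_step(x_t, u_t, dt)
fd_estimate = fd_directional_derivative(value, x_t, x_next, dt)

# Analytic reference; uses the hidden dynamics for verification only.
grad_V = np.array([2.0 * x_t[0], x_t[1]])
f_true = np.array([x_t[1], -x_t[0] + u_t])
exact = grad_V @ f_true

print(f"finite difference: {fd_estimate:.6f}, exact: {exact:.6f}")
```

The estimate uses only the pair (x_t, x_next) and the sampling interval, and its error shrinks linearly with the step size for a smooth function. Whether such an estimate stays consistent where the value function is non-smooth is the question the referee raises, and this check does not address it.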

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external viscosity solution and data-driven contraction losses

The paper constructs contraction-based losses for the safety value function and its derivatives using only trajectory samples from black-box systems, then claims convergence to the known viscosity solution of the HJ reachability PDE in the exact (infinite-data) limit. This is a standard model-free approximation setup that does not reduce any prediction or uniqueness claim to a fitted input by construction, nor does it rely on self-citation chains or ansatzes smuggled from prior author work. The reference to the viscosity solution is an external mathematical fact, not redefined within the paper, and the training procedure remains independent of the target result. No load-bearing step equates the output to the input by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions from reachability theory and neural network approximation; no explicit free parameters or invented entities are introduced in the abstract beyond the two neural networks whose parameters are learned from data.

axioms (1)

domain assumption: The underlying dynamical system permits collection of state-action data sufficient for the contraction losses to drive the networks to the viscosity solution.
Invoked to justify model-free training without explicit dynamics.

pith-pipeline@v0.9.0 · 5422 in / 1213 out tokens · 54058 ms · 2026-05-16T10:09:29.046641+00:00 · methodology
