arxiv: 2604.13192 · v1 · submitted 2026-04-14 · 📡 eess.SY · cs.RO · cs.SY

Synthesis and Deployment of Maximal Robust Control Barrier Functions through Adversarial Reinforcement Learning

Pith reviewed 2026-05-10 14:23 UTC · model grok-4.3

classification 📡 eess.SY · cs.RO · cs.SY
keywords robust control barrier functions · adversarial reinforcement learning · safety value function · Isaacs equation · Q-function · nonlinear systems · black-box dynamics · safety filtering

The pith

The safety value function from the Isaacs equation serves as a robust discrete-time control barrier function that certifies the maximal safe set under bounded uncertainty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that solving the dynamic programming Isaacs equation for safety yields a function that works directly as a robust control barrier function. It enforces safety on exactly the largest set of states from which some controller can keep the system safe against every disturbance within the bound. Representing the same function as a Q-function lifts the certificate into state-action space, so safety can be checked and enforced without an explicit model of the system dynamics. Adversarial reinforcement learning then trains these functions end-to-end for black-box nonlinear systems. The result matters for robotics and other complex plants where closed-form dynamics are unavailable and earlier robust barrier methods produced overly small safe regions.

Core claim

The safety value function solving the dynamic programming Isaacs equation is a valid robust discrete-time CBF that enforces safety on the maximal robust safe set. By adopting the Q-function from reinforcement learning, the barrier is lifted into state-action space to produce a robust Q-CBF constraint; adversarial RL then synthesizes and deploys these Q-CBFs on general nonlinear systems whose dynamics are treated as black boxes.
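The claim can be made concrete with a short derivation. What follows is a sketch in one common sign convention, assumed here rather than taken from the paper: g(x) ≥ 0 marks safe states, x⁺ = f(x,u,d) is the discrete-time dynamics with control u and disturbance d, the controller maximizes and the disturbance minimizes. The symbols g, f, 𝒰, 𝒟, the rate α, and the two-argument form of Q are illustrative; the paper's own definitions may differ.

```latex
\begin{align}
V(x) &= \min\Bigl\{\, g(x),\ \max_{u \in \mathcal{U}} \min_{d \in \mathcal{D}} V\bigl(f(x,u,d)\bigr) \Bigr\}
  && \text{(Isaacs equation, assumed form)} \\
\max_{u \in \mathcal{U}} \min_{d \in \mathcal{D}} V\bigl(f(x,u,d)\bigr)
  &\ge V(x) \ge (1-\alpha)\,V(x)
  && \text{whenever } V(x) \ge 0,\ \alpha \in [0,1] \\
Q(x,u) &:= \min\Bigl\{\, g(x),\ \min_{d \in \mathcal{D}} V\bigl(f(x,u,d)\bigr) \Bigr\},
  \qquad V(x) = \max_{u \in \mathcal{U}} Q(x,u)
\end{align}
```

The second line holds because V(x) is the minimum of two terms, so it can never exceed the second one. That is a barrier-type condition on the whole set {V ≥ 0}. The third line shows why a Q-function removes the need for f at deployment: the constraint can be checked by evaluating Q at candidate controls.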

What carries the argument

The robust Q-CBF obtained by lifting the safety value function into state-action space so that safety filtering no longer requires explicit dynamics.

If this is right

  • Robust CBFs become available for general nonlinear systems without control-affine structure or explicit uncertainty models.
  • The synthesized Q-CBFs certify substantially larger safe sets than existing barrier-based methods on the inverted pendulum benchmark.
  • Safety enforcement remains reliable on high-dimensional systems such as 36-dimensional quadruped simulators even when disturbances are chosen adversarially.
  • The learned Q-function can be used directly for online safety filtering without recovering the underlying dynamics.
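The last bullet can be illustrated with a minimal safety-filter sketch. It assumes a finite set of candidate controls and sampled disturbances, a hypothetical signature `Q(x, u, d)`, the convention that nonnegative values mean safe with the controller maximizing, and a barrier-style threshold `(1 - alpha) * V(x)`. None of these details are confirmed by the source, and `toy_q` is a hand-written stand-in rather than a learned function.

```python
from typing import Callable, Sequence

QFn = Callable[[float, float, float], float]  # hypothetical signature Q(x, u, d)


def q_cbf_filter(q_fn: QFn, x: float, u_task: float,
                 controls: Sequence[float], disturbances: Sequence[float],
                 alpha: float = 0.9) -> float:
    """Return the admissible control closest to the task control u_task."""
    # Worst-case (over sampled disturbances) Q-value of each candidate control.
    worst = {u: min(q_fn(x, u, d) for d in disturbances) for u in controls}
    u_safest = max(worst, key=worst.get)
    v_x = worst[u_safest]  # V(x) = max_u min_d Q(x, u, d) under the assumed convention
    # Assumed barrier-style constraint: worst-case Q must stay above (1 - alpha) * V(x).
    feasible = [u for u in controls if worst[u] >= (1.0 - alpha) * v_x]
    if not feasible:  # no candidate meets the threshold: fall back to the safest control
        return u_safest
    return min(feasible, key=lambda u: abs(u - u_task))


def toy_q(x: float, u: float, d: float) -> float:
    """Stand-in Q: one-step margin of x+ = x + u + d against |x| <= 1 (not learned)."""
    return 1.0 - abs(x + u + d)


CONTROLS = [-1.0, -0.5, 0.0, 0.5, 1.0]
DISTURBANCES = [-0.2, 0.0, 0.2]

print(q_cbf_filter(toy_q, 0.0, 0.5, CONTROLS, DISTURBANCES))  # task control passes through
print(q_cbf_filter(toy_q, 0.9, 1.0, CONTROLS, DISTURBANCES))  # task control is overridden
```

Note that the filter only ever calls `q_fn`; the dynamics appear solely inside the stand-in, which is the point of lifting the certificate into state-action space.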

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same lifting and adversarial training idea could be applied to other certificates such as Lyapunov functions for stability.
  • Hardware experiments on physical robots would test whether the black-box approximation holds under real sensor noise and actuator limits.
  • If uncertainty bounds are themselves uncertain, the method could be extended to learn a joint bound and value function.
  • Discretization choices in the Isaacs equation may affect conservatism; comparing different time steps on the same plant would quantify the effect.

Load-bearing premise

Adversarial reinforcement learning can accurately approximate the Q-function of the safety value function when dynamics are unknown and disturbances are bounded.

What would settle it

On a system whose true maximal robust safe set is known by other means, deploy the learned Q-CBF controller and check whether it prevents all safety violations under worst-case disturbances while still admitting every state in the known maximal set.
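A toy instance of such a reference system can be sketched as follows. It is an illustrative assumption, not one of the paper's benchmarks: the scalar plant x⁺ = 2x + u + d with |u| ≤ 1, |d| ≤ 0.5 and constraint |x| ≤ 2 has maximal robust safe set |x| ≤ (1 − 0.5)/(2 − 1) = 0.5, because any larger deviation at least doubles its excess at every step. Grid-based value iteration on an assumed undiscounted Isaacs backup should recover that radius, giving a ground truth against which a learned certificate could be compared.

```python
import numpy as np

# Hypothetical toy plant: x+ = A*x + u + d, |u| <= U_MAX, |d| <= D_MAX, safe iff |x| <= LIMIT.
A, U_MAX, D_MAX, LIMIT = 2.0, 1.0, 0.5, 2.0


def safety_value(n_iter: int = 80):
    """Iterate V <- min(g, max_u min_d V(f(x, u, d))) on a grid, starting from V = g."""
    xs = np.linspace(-2.5, 2.5, 101)       # state grid (extends past the constraint)
    us = np.linspace(-U_MAX, U_MAX, 41)    # candidate controls
    ds = np.linspace(-D_MAX, D_MAX, 5)     # sampled disturbances, extremes included
    g = LIMIT - np.abs(xs)                 # safety margin, nonnegative inside the constraint
    # Successor states for every (x, u, d) triple, shape (len(xs), len(us), len(ds)).
    nxt = A * xs[:, None, None] + us[None, :, None] + ds[None, None, :]
    v = g.copy()
    for _ in range(n_iter):
        # Linear interpolation; successors outside the grid take the (negative) edge value.
        v_next = np.interp(nxt.ravel(), xs, v).reshape(nxt.shape)
        v = np.minimum(g, v_next.min(axis=2).max(axis=1))
    return xs, v


xs, v = safety_value()
radius_numeric = float(np.abs(xs[v >= 0.0]).max())
radius_analytic = (U_MAX - D_MAX) / (A - 1.0)
print(radius_numeric, radius_analytic)
```

The grid resolution, iteration count, and disturbance sampling are arbitrary choices for this sketch; a coarser grid or an unlucky alignment between grid points and the true boundary would shift the numeric radius by up to one cell.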

Figures

Figures reproduced from arXiv: 2604.13192 by Donggeon David Oh, Duy P. Nguyen, Haimin Hu, Jaime Fernández Fisac.

Figure 1: Our proposed neural Q-CBF yields reliable safety enforcement on a simulated 36-D quadruped with black-box dynamics under adversarial uncertainty realizations. The robot uses a pure-pursuit task policy to move right from the purple starting point. Over 50 trials, the safe rates are 100% for Q-CBF, 38% for the least-restrictive safety filter (LRSF), and 16% for the unfiltered policy. Left: Trajectory compari…
Figure 2: Safe sets and rollout trajectories for the disturbed …

Abstract

Robust control barrier functions (CBFs) provide a principled mechanism for smooth safety enforcement under worst-case disturbances. However, existing approaches typically rely on explicit, closed-form structure in the dynamics (e.g., control-affine) and uncertainty models. This has led to limited scalability and generality, with most robust CBFs certifying only conservative subsets of the maximal robust safe set. In this paper, we introduce a new robust CBF framework for general nonlinear systems under bounded uncertainty. We first show that the safety value function solving the dynamic programming Isaacs equation is a valid robust discrete-time CBF that enforces safety on the maximal robust safe set. We then adopt the key reinforcement learning (RL) notion of quality function (or Q-function), which removes the need for explicit dynamics by lifting the barrier certificate into state-action space and yields a novel robust Q-CBF constraint for safety filtering. Combined with adversarial RL, this enables the synthesis and deployment of robust Q-CBFs on general nonlinear systems with black-box dynamics and unknown uncertainty structure. We validate the framework on a canonical inverted pendulum benchmark and a 36-D quadruped simulator, achieving substantially less conservative safe sets than barrier-based baselines on the pendulum and reliable safety enforcement even under adversarial uncertainty realizations on the quadruped.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that the safety value function solving the discrete-time Isaacs equation is a valid robust discrete-time CBF enforcing safety on the maximal robust safe set for general nonlinear systems under bounded uncertainty. It lifts this to a Q-function in state-action space to obtain a novel robust Q-CBF constraint, enabling model-free synthesis and deployment via adversarial RL on black-box dynamics with unknown uncertainty structure. Validation is provided on an inverted pendulum benchmark (showing less conservative safe sets than baselines) and a 36-D quadruped simulator (showing reliable safety under adversarial disturbances).

Significance. If the approximation and invariance guarantees hold, the framework would meaningfully advance scalable robust safety filtering beyond control-affine or explicitly modeled systems, allowing maximal safe sets to be synthesized for high-dimensional black-box robots where traditional robust CBF methods are intractable.

major comments (3)
  1. [Main theoretical results on value function and Q-CBF] The central step that the safety value function V solving the Isaacs equation is a valid robust CBF, and its lift to a Q-CBF whose robust decrease condition (min_u max_w Q(x,u,w) >= 0) preserves invariance, is load-bearing for the model-free claim; the manuscript must explicitly derive the Q-CBF constraint and show that any RL approximation error is bounded in a manner that does not violate the worst-case decrease (see the derivation following the Isaacs equation and the robust Q-CBF definition).
  2. [Adversarial RL procedure for Q-CBF synthesis] Adversarial RL is asserted to approximate the Q-function corresponding to the Isaacs solution without explicit dynamics, yet no convergence or error-bound analysis is given for general nonlinear systems; this directly affects whether the learned Q satisfies the robust CBF condition up to a margin that certifies safety (see the adversarial RL training section and the quadruped experiment).
  3. [36-D quadruped simulator experiments] In the quadruped validation, the claim of reliable safety enforcement under adversarial uncertainty realizations assumes the training adversary fully realizes the bounded but unstructured disturbances used in the theory; without reporting the realized disturbance bounds or a post-training verification that the learned Q meets the decrease condition on held-out worst-case samples, the invariance guarantee remains unverified.
minor comments (2)
  1. [Abstract and results] The abstract states 'substantially less conservative safe sets' on the pendulum; include quantitative metrics (e.g., volume ratios or boundary distances) and explicit baseline comparisons in the main results section.
  2. [Preliminaries and notation] Clarify early the precise difference between the proposed robust Q-CBF constraint and standard safety-filter Q-functions to avoid notation ambiguity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments identify key areas where the theoretical derivations and experimental validation can be strengthened for clarity and rigor. We address each major comment below, indicating where revisions will be made to the manuscript.

point-by-point responses
  1. Referee: [Main theoretical results on value function and Q-CBF] The central step that the safety value function V solving the Isaacs equation is a valid robust CBF, and its lift to a Q-CBF whose robust decrease condition (min_u max_w Q(x,u,w) >= 0) preserves invariance, is load-bearing for the model-free claim; the manuscript must explicitly derive the Q-CBF constraint and show that any RL approximation error is bounded in a manner that does not violate the worst-case decrease (see the derivation following the Isaacs equation and the robust Q-CBF definition).

    Authors: We agree that greater explicitness is warranted. In the revised manuscript, we will expand the derivation immediately following the Isaacs equation to include a complete step-by-step proof that V is a valid robust discrete-time CBF on the maximal safe set, followed by the lifting argument showing that the Q-function satisfies the robust decrease condition min_u max_w Q(x,u,w) >= 0 and thereby preserves forward invariance. On approximation error, the adversarial RL objective directly penalizes violations of the worst-case Bellman inequality, so the learned Q is trained to satisfy the condition up to the optimization residual; we will add a remark acknowledging that a general a-priori error bound for arbitrary nonlinear systems is not derived here and remains an open question, while noting that the empirical safety margins observed in both benchmarks remain positive. revision: partial

  2. Referee: [Adversarial RL procedure for Q-CBF synthesis] Adversarial RL is asserted to approximate the Q-function corresponding to the Isaacs solution without explicit dynamics, yet no convergence or error-bound analysis is given for general nonlinear systems; this directly affects whether the learned Q satisfies the robust CBF condition up to a margin that certifies safety (see the adversarial RL training section and the quadruped experiment).

    Authors: We acknowledge that a formal convergence guarantee for adversarial RL with neural-network approximators on general nonlinear dynamics is not provided and is indeed difficult to obtain without restrictive assumptions (e.g., linear-quadratic structure or finite state-action spaces). The manuscript instead relies on the fact that the min-max robust Q-CBF constraint is embedded directly in the training loss, so that any converged Q approximately solves the Isaacs equation by construction. We will augment the training-section description with the precise loss formulation, network architectures, and hyper-parameters used, and we will add a short discussion of this theoretical limitation together with the empirical evidence that the learned Q-CBF maintains positive safety margins under adversarial disturbances. revision: no

  3. Referee: [36-D quadruped simulator experiments] In the quadruped validation, the claim of reliable safety enforcement under adversarial uncertainty realizations assumes the training adversary fully realizes the bounded but unstructured disturbances used in the theory; without reporting the realized disturbance bounds or a post-training verification that the learned Q meets the decrease condition on held-out worst-case samples, the invariance guarantee remains unverified.

    Authors: We will revise the quadruped experimental section to explicitly state the disturbance bounds employed during both training and testing, together with the realized disturbance magnitudes observed in the reported trials. In addition, we will include a post-training verification procedure that evaluates the robust decrease condition min_u max_w Q(x,u,w) >= 0 on a held-out set of worst-case disturbance samples drawn from the same bounded set; the results of this verification will be reported to confirm that the learned Q-CBF satisfies the invariance condition within a quantifiable margin. revision: yes
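A post-training check of the kind promised in the third response might look like the sketch below. It reduces the condition to a one-step forward-invariance test on held-out states: for every state the certificate claims is safe, apply the certified control, try each sampled disturbance, and count the states whose successor leaves the claimed set. The function names, the scalar toy plant x⁺ = 2x + u + d, and the stand-in value functions are assumptions for illustration; in practice `v_fn` would be derived from the learned Q and `step_fn` would be the black-box simulator.

```python
from typing import Callable, Sequence


def invariance_violation_rate(v_fn: Callable[[float], float],
                              policy_fn: Callable[[float], float],
                              step_fn: Callable[[float, float, float], float],
                              states: Sequence[float],
                              disturbances: Sequence[float],
                              tol: float = 1e-9) -> float:
    """Fraction of claimed-safe states with a sampled disturbance that exits {v >= 0}."""
    claimed = [x for x in states if v_fn(x) >= 0.0]
    if not claimed:
        return 0.0
    violations = sum(
        any(v_fn(step_fn(x, policy_fn(x), d)) < -tol for d in disturbances)
        for x in claimed
    )
    return violations / len(claimed)


def step(x: float, u: float, d: float) -> float:
    """Hypothetical black-box plant used only as a stand-in simulator."""
    return 2.0 * x + u + d


def policy(x: float) -> float:
    """Stand-in safe policy: cancel the drift, saturated at |u| <= 1."""
    return max(-1.0, min(1.0, -2.0 * x))


def make_value(radius: float) -> Callable[[float], float]:
    """Stand-in certificate claiming that |x| <= radius is safe."""
    return lambda x: radius - abs(x)


STATES = [i / 100 for i in range(-100, 101)]   # held-out evaluation states
DISTURBANCES = [-0.5, 0.0, 0.5]                # sampled extremes of the assumed bound

print(invariance_violation_rate(make_value(0.5), policy, step, STATES, DISTURBANCES))
print(invariance_violation_rate(make_value(0.8), policy, step, STATES, DISTURBANCES))
```

A zero rate on samples is evidence, not a proof: the test covers only the sampled states and disturbances, and a one-step check can pass at states that fail several steps later, as some states between 0.5 and 0.65 do for the over-optimistic certificate here.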

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from dynamic programming.

full rationale

The core claim that the safety value function solving the discrete-time Isaacs equation is a valid robust CBF follows directly from standard dynamic programming arguments for zero-sum games and invariance, without reducing to fitted parameters or self-referential definitions. The subsequent lifting to a Q-CBF and use of adversarial RL for model-free synthesis on black-box dynamics is presented as an approximation technique for deployment, not as a proof that relies on the approximation itself being exact by construction. No load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results appear in the provided derivation chain. The paper validates on external benchmarks (inverted pendulum, quadruped simulator), keeping the central mathematical step independent of the learning procedure.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework relies on standard assumptions from optimal control and RL, with the main addition being the Q-CBF lifting.

free parameters (1)
  • RL hyperparameters and adversarial training parameters
    Not specified but implied in the RL approach for learning the Q-function.
axioms (1)
  • domain assumption: Existence and uniqueness of the safety value function solving the Isaacs equation for the given system and uncertainty bounds.
    Invoked when stating that the safety value function is a valid robust CBF.
invented entities (1)
  • robust Q-CBF (no independent evidence)
    purpose: To enforce safety in state-action space without explicit dynamics.
    New concept introduced in the paper to combine Q-function with CBF.

pith-pipeline@v0.9.0 · 5538 in / 1429 out tokens · 37533 ms · 2026-05-10T14:23:31.187833+00:00 · methodology

