arxiv: 2604.23576 · v1 · submitted 2026-04-26 · 💻 cs.LG · cs.AI

CAPSULE: Control-Theoretic Action Perturbations for Safe Uncertainty-Aware Reinforcement Learning

Rahul Narava, Siddharth Verma, Ojas Jain, Shashi Shekhar Jha, Mayank Shekhar Jha

Pith reviewed 2026-05-08 06:28 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords safe reinforcement learning · control barrier functions · probabilistic dynamics model · action correction · offline learning · continuous control · uncertainty-aware safety

The pith

Offline learning of a probabilistic control-affine model allows construction of uncertainty-aware control barrier functions that enforce hard safety constraints during reinforcement learning through online action correction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to solve safe exploration in high-dimensional continuous control systems whose dynamics are unknown. It learns a probabilistic control-affine dynamics model from offline data, then builds control barrier functions that treat the model's uncertainty as a source of conservatism in the safety constraints. These constraints are turned into an online correction step that adjusts actions to stay inside the safe set while still pursuing the task objective. If the approach holds, it supplies deterministic safety guarantees instead of guarantees only in expectation and avoids the need for an exact dynamics model at design time.

Core claim

The paper claims that a probabilistic control-affine dynamics model learned offline can be used to construct control barrier functions that incorporate model uncertainty, yielding conservative safety constraints that are enforced by an online constraint-based action correction mechanism, thereby enabling safe reinforcement learning with hard guarantees and without large losses in task performance.

What carries the argument

Control barrier functions constructed from the learned probabilistic control-affine model, which embed uncertainty bounds to produce conservative safety constraints enforced by online action correction.
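
To make the mechanism concrete, here is a minimal sketch of a conservative barrier check. Everything in it is an assumption for illustration and not taken from the paper: the one-dimensional speed-limit barrier `h`, the stand-in model outputs `f_mean` and `g_mean`, the values of `sigma`, `p_delta`, and `alpha`, and the choice to subtract `p_delta * sigma` as a scalar margin.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def predict_mean(s: Vector, a: Vector,
                 f_mean: Callable[[Vector], Vector],
                 g_mean: Callable[[Vector], Sequence[Vector]]) -> List[float]:
    """Mean next state of a control-affine model: s + f(s) + g(s) a."""
    f = f_mean(s)
    g = g_mean(s)  # one row per state dimension, one column per action dimension
    return [s[i] + f[i] + sum(g[i][j] * a[j] for j in range(len(a)))
            for i in range(len(s))]


def cbf_satisfied(h: Callable[[Vector], float], s: Vector, s_next_mean: Vector,
                  margin: float, alpha: float) -> bool:
    """Conservative discrete-time check: h(s') - margin >= (1 - alpha) * h(s)."""
    return h(s_next_mean) - margin >= (1.0 - alpha) * h(s)


# Hypothetical 1-D example: the state is a velocity, safe while v <= 1.0.
h = lambda s: 1.0 - s[0]
f_mean = lambda s: [0.0]      # stand-in for the learned drift term
g_mean = lambda s: [[0.1]]    # stand-in for the learned input gain
sigma, p_delta, alpha = 0.01, 2.0, 0.5  # assumed std, multiplier, barrier rate

s = [0.9]
for a in ([1.0], [0.0]):
    s_next = predict_mean(s, a, f_mean, g_mean)
    ok = cbf_satisfied(h, s, s_next, p_delta * sigma, alpha)
    print(a, "passes" if ok else "needs correction")
```

In this toy setting the accelerating action fails the check and the zero action passes, which is the situation an online correction step is meant to resolve.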

If this is right

The method supplies hard, constraint-based safety guarantees rather than safety only in expectation.
Safe exploration becomes feasible in high-dimensional continuous-control tasks with unknown dynamics.
Task returns remain comparable to those of existing safe reinforcement learning baselines.
Empirical safety violations drop substantially on nonlinear benchmarks while performance is preserved.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same offline-to-CBF pipeline could be applied to model-based planning methods outside reinforcement learning.
If the learned model is periodically updated, the framework might support gradual relaxation of conservatism as uncertainty decreases.
The online correction step could be combined with other constraint solvers to handle additional state or input limits.

Load-bearing premise

That the probabilistic model learned from offline data is accurate enough to produce control barrier functions whose safety constraints remain valid for the true unknown dynamics.
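
The premise can be spelled out as a short derivation. This is a sketch under assumptions that are not taken from the paper: a scalar margin $m_t$ that bounds the barrier prediction error, a mean prediction $\hat{s}_{t+1}$, no slack, and $\alpha \in (0, 1]$.

```latex
% Assumption (hypothetical): the margin m_t bounds the barrier prediction error,
%   |h(s_{t+1}) - h(\hat{s}_{t+1})| \le m_t .
\begin{align*}
h(\hat{s}_{t+1}) - m_t &\ge (1-\alpha)\, h(s_t)
  && \text{(constraint enforced on the model)} \\
h(s_{t+1}) &\ge h(\hat{s}_{t+1}) - m_t \ge (1-\alpha)\, h(s_t)
  && \text{(true system, if the error bound holds)} \\
h(s_t) &\ge (1-\alpha)^{t}\, h(s_0) \ge 0
  && \text{(induction, for } h(s_0) \ge 0 \text{)}
\end{align*}
```

If the error bound fails at any step, the second line no longer follows, and the chain to the final inequality breaks at that step.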

What would settle it

A recorded safety violation on a test trajectory where the offline data covered the operating region and the model's uncertainty bounds were respected yet the barrier condition was breached.

Figures

Figures reproduced from arXiv: 2604.23576 by Mayank Shekhar Jha, Ojas Jain, Rahul Narava, Shashi Shekhar Jha, Siddharth Verma.

**Figure 1.** Illustration of the safe, $\epsilon$-safe, and unsafe sets induced by a control barrier function. $h(f(s_t) + g(s_t)\,a) \ge (1-\alpha)\,h(s_t)$ (6) This inequality defines a state-dependent constraint on the action $a$, ensuring that the safe set $\mathcal{C}$ is forward invariant. If the system starts within $\mathcal{C}$, any action satisfying the above condition guarantees that the state remains safe for all future time. In practice, the actio…

**Figure 2.** The proposed CAPSULE algorithmic flow. The resulting optimization equation to solve the CBF then becomes $a^{\mathrm{CBF}}_t = \arg\min_{a,\epsilon}\; a^{\top} a + k\epsilon^2$ s.t. $h\big(s_t + \bar{f}(s_t) + \bar{g}(s_t)\,(a^{\mathrm{RL}}_t + a^{\mathrm{bar}}_t + a)\big) - p_\delta\,|\bar{\sigma}(s_t)| - h(s_t) \ge -\alpha\,h(s_t) - \epsilon$, $\epsilon \ge 0$ (15) All the executed transitions are stored in a replay buffer $B$ to update the RL policy, whereas the tuples $(s_t,\, a^{\mathrm{CBF}}_t + a^{\mathrm{bar}}_t)$ are stored separately in $\hat{B}$ to …

**Figure 3.** Visualizations of different MuJoCo control environments used in our experiments: Hopper, Walker, HalfCheetah.

**Figure 4.** Pre-training results on MuJoCo continuous control environments.

**Figure 5.** Policy evaluation on SafeVelocity in MuJoCo continuous control environments. … dynamics a priori. We utilize grounded concepts from CBF along with uncertainty-based stochastic models to overcome these issues and evaluate in challenging environments. 6 CONCLUSION In this work, we introduced CAPSULE, a safe RL framework that employs control-theoretic formulations to provide safety guarantees and leverages unce…
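
The action-correction problem quoted under Figure 2 (equation 15) has a closed-form solution in one special case, sketched below. The sketch assumes an affine barrier $h(s) = w^{\top} s + q$ and treats the uncertainty term as a scalar `margin` subtracted from the barrier value; both are assumptions made for illustration, as are the function names and all numbers. Under them the constraint reduces to a single linear inequality `c . a + eps >= b` in the correction `a` and slack `eps`.

```python
from typing import List, Sequence, Tuple


def affine_cbf_terms(w: Sequence[float], q: float, s: Sequence[float],
                     f: Sequence[float], g: Sequence[Sequence[float]],
                     u0: Sequence[float], margin: float,
                     alpha: float) -> Tuple[List[float], float]:
    """Reduce the constraint to c . a + eps >= b for h(s) = w . s + q."""
    h = lambda x: sum(wi * xi for wi, xi in zip(w, x)) + q
    n_act = len(u0)
    # Mean prediction under the uncorrected action u0 = a_RL + a_bar.
    s_hat = [s[i] + f[i] + sum(g[i][j] * u0[j] for j in range(n_act))
             for i in range(len(s))]
    c = [sum(w[i] * g[i][j] for i in range(len(s))) for j in range(n_act)]
    b = (1.0 - alpha) * h(s) - h(s_hat) + margin
    return c, b


def correct_action(c: Sequence[float], b: float,
                   k: float) -> Tuple[List[float], float]:
    """Closed-form minimizer of a.a + k*eps^2 s.t. c.a + eps >= b, eps >= 0 (k > 0)."""
    if b <= 0.0:  # constraint already holds at a = 0, eps = 0
        return [0.0] * len(c), 0.0
    denom = sum(ci * ci for ci in c) + 1.0 / k
    a = [b * ci / denom for ci in c]  # correction along the constraint gradient
    eps = b / (k * denom)             # slack shrinks as the penalty k grows
    return a, eps


# Hypothetical 1-D speed limit h(s) = 1 - v; the proposed action pushes v to the limit.
c, b = affine_cbf_terms(w=[-1.0], q=1.0, s=[0.9], f=[0.0], g=[[0.1]],
                        u0=[1.0], margin=0.02, alpha=0.5)
a, eps = correct_action(c, b, k=1000.0)
print("correction:", a, "slack:", eps)
```

With several constraints or action bounds the closed form no longer applies and a general QP solver would be needed; the single-constraint case is shown only because it can be checked by hand from the KKT conditions.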

Original abstract

Ensuring safe exploration in high-dimensional systems with unknown dynamics remains a significant challenge. Existing safe reinforcement learning methods often provide safety guarantees only in expectation, which can still lead to safety violations. Control-theoretic approaches, in contrast, offer hard constraint-based safety guarantees but typically assume access to known system dynamics or require accurate estimation of control-affine models. In this paper, we propose a safe reinforcement learning framework that learns a probabilistic control-affine dynamics model in an offline setting. The learned model is leveraged to explicitly construct control barrier functions (CBFs) that incorporate model uncertainty to provide conservative safety constraints. These CBF constraints are enforced through an online constraint-based action correction mechanism, enabling safe exploration without overly restricting task performance. Empirical evaluations on nonlinear, complex continuous-control benchmarks demonstrate that our approach achieves returns comparable to those of existing baselines while significantly reducing safety violations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a concrete pipeline for safe RL: learn a probabilistic control-affine model offline, build uncertainty-aware CBFs from it, and correct actions online with a QP. The hard safety claims, however, rest on unproven uncertainty bounds.

The main thing to know is that the authors learn a probabilistic control-affine dynamics model from offline data, use the posterior to make CBFs conservative, and then run an online constraint-based action correction step inside the RL loop. This produces a method that tries to deliver hard safety constraints rather than just expectation-based ones for nonlinear continuous control tasks. On the benchmarks they report, returns stay close to unsafe baselines while safety violations drop noticeably. That combination of offline model learning, uncertainty-aware CBF construction, and online correction is the actual new piece; it extends standard CBF-RL ideas without claiming to invent any of the components from scratch. The empirical results give a practical signal that the correction does not destroy task performance. The soft spot is exactly the one the stress test flags. Offline data in high-dimensional spaces leaves large uncovered regions, so the learned uncertainty set can easily fail to contain the true dynamics along a trajectory. Without a proof that the chosen quantile or worst-case bound produces forward invariance under the real system, or evidence that the posterior is calibrated for the Lie derivative condition, the hard guarantee does not follow. The paper appears to rest on the empirical violation counts instead. This work is aimed at safe RL researchers who already know CBFs and want a way to apply them with learned models on continuous control problems. It has enough of a method and experiments to be worth a referee's time, even though the theoretical safety argument would need tightening.

Referee Report

2 major / 2 minor

Summary. The paper proposes CAPSULE, a safe RL framework that learns a probabilistic control-affine dynamics model offline from data. This model is used to construct control barrier functions (CBFs) incorporating model uncertainty for conservative safety constraints. These constraints are enforced online via a constraint-based action correction mechanism (likely a QP) to enable safe exploration in high-dimensional continuous control without severely limiting task performance. Empirical results on nonlinear benchmarks are reported to show returns comparable to baselines with significantly fewer safety violations.

Significance. If the uncertainty-aware CBF construction can be shown to yield reliable conservative bounds that preserve forward invariance under the true dynamics, the approach would meaningfully advance the integration of offline model learning with control-theoretic safety in RL. It targets the gap between expectation-based safety methods and hard guarantees, potentially enabling safer deployment in systems with unknown dynamics while avoiding excessive conservatism.

major comments (2)

[Abstract and §3] Abstract and §3 (method overview): The central claim of 'hard constraint-based safety guarantees' via conservative CBFs is not supported by a formal theorem establishing forward invariance of the safe set under the true (unknown) dynamics. The description relies on posterior bounds or quantiles over the learned p(f,g|D_offline), but without a proof that the true dynamics lie in the uncertainty set along all visited trajectories with high probability, the online QP correction provides only heuristic safety rather than the stated hard guarantees.
[§4] §4 (CBF construction and uncertainty propagation): The Lie derivative condition for the CBF is made conservative using the probabilistic model, but the specific mechanism (e.g., worst-case over support, quantile bound, or Gaussian process variance) is not shown to be calibrated for the true model error. In high-dimensional continuous control, sparse offline coverage away from the training distribution can cause posterior variance to underestimate error, violating the invariance condition; the paper reports only empirical violation counts, not a calibration test or robustness result for this step.
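
A calibration test of the kind requested in the second major comment could look like the following minimal sketch. It assumes scalar held-out prediction errors, per-sample predicted standard deviations, and a Gaussian reference for the nominal coverage; the function names and the data are made up for illustration and do not come from the paper.

```python
from statistics import NormalDist
from typing import Sequence


def empirical_coverage(errors: Sequence[float], sigmas: Sequence[float],
                       p_delta: float) -> float:
    """Fraction of held-out samples with |error| <= p_delta * sigma."""
    if len(errors) != len(sigmas) or len(errors) == 0:
        raise ValueError("errors and sigmas must be non-empty and equally long")
    inside = sum(1 for e, s in zip(errors, sigmas) if abs(e) <= p_delta * s)
    return inside / len(errors)


def nominal_coverage(p_delta: float) -> float:
    """Two-sided coverage a Gaussian gives for a +/- p_delta * sigma band."""
    return 2.0 * NormalDist().cdf(p_delta) - 1.0


# Made-up held-out residuals and predicted standard deviations.
errors = [0.05, -0.30, 0.10, 0.50]
sigmas = [0.10, 0.10, 0.10, 0.10]
print(empirical_coverage(errors, sigmas, 2.0),
      "vs nominal", round(nominal_coverage(2.0), 3))
```

Empirical coverage well below the nominal value, especially on states far from the offline data, would indicate that the predicted variance underestimates the true error.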

minor comments (2)

[§3] Notation for the probabilistic model p(f,g|D) and the resulting CBF h(x) should be introduced with explicit definitions of the uncertainty set used in the Lie derivative bound to improve readability.
[§5] The experimental section would benefit from reporting the exact form of the action correction QP (including how the CBF constraint is linearized) and ablation on the uncertainty quantile level to clarify sensitivity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our paper. We address each of the major comments in detail below and have updated the manuscript accordingly to improve clarity and precision regarding the safety guarantees.

Referee: [Abstract and §3] Abstract and §3 (method overview): The central claim of 'hard constraint-based safety guarantees' via conservative CBFs is not supported by a formal theorem establishing forward invariance of the safe set under the true (unknown) dynamics. The description relies on posterior bounds or quantiles over the learned p(f,g|D_offline), but without a proof that the true dynamics lie in the uncertainty set along all visited trajectories with high probability, the online QP correction provides only heuristic safety rather than the stated hard guarantees.

Authors: We agree with the referee that a formal theorem establishing forward invariance under the true dynamics is absent from the manuscript. The conservative CBF construction uses uncertainty bounds from the learned probabilistic model to enforce constraints via the QP, but this yields safety with respect to the model rather than a rigorous guarantee for the true system without additional assumptions on model coverage. We have revised the abstract and Section 3 to replace 'hard constraint-based safety guarantees' with 'conservative safety constraints that incorporate model uncertainty' and added a paragraph discussing the conditions under which stronger guarantees could hold. These changes clarify the nature of the provided safety without overstating the theoretical results. revision: yes
Referee: [§4] §4 (CBF construction and uncertainty propagation): The Lie derivative condition for the CBF is made conservative using the probabilistic model, but the specific mechanism (e.g., worst-case over support, quantile bound, or Gaussian process variance) is not shown to be calibrated for the true model error. In high-dimensional continuous control, sparse offline coverage away from the training distribution can cause posterior variance to underestimate error, violating the invariance condition; the paper reports only empirical violation counts, not a calibration test or robustness result for this step.

Authors: The specific mechanism in our implementation is the use of upper quantile bounds on the posterior distribution of the Lie derivatives to ensure a conservative estimate of the CBF condition. We acknowledge that this may not be perfectly calibrated in all regions, particularly with sparse data in high dimensions, and that posterior variance can underestimate true error. The manuscript relies on empirical evidence of fewer safety violations to support the approach. In the revision, we have expanded Section 4 to explicitly describe the quantile selection process and its rationale for conservatism, along with a brief discussion of potential limitations due to data coverage. A full calibration analysis or robustness test would necessitate new experiments and is noted as future work, but we believe the current empirical results and clarifications address the core concern. revision: partial
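
One way a quantile bound of this kind could be computed is sketched below. It assumes the quantity of interest (for example the predicted change in the barrier value) has a Gaussian posterior with known mean and standard deviation; that assumption, the function name, and the numbers are illustrative and not taken from the manuscript.

```python
from statistics import NormalDist


def conservative_lower_bound(mean: float, std: float, delta: float) -> float:
    """Value that a N(mean, std^2) quantity exceeds with probability 1 - delta."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if std < 0.0:
        raise ValueError("std must be non-negative")
    z = NormalDist().inv_cdf(1.0 - delta)  # one-sided Gaussian quantile
    return mean - z * std


# Smaller delta (higher confidence) gives a lower, more conservative bound.
for delta in (0.5, 0.05, 0.01):
    print(delta, round(conservative_lower_bound(0.10, 0.04, delta), 4))
```

The bound is only as trustworthy as the Gaussian posterior behind it, which is the calibration concern the referee raises.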

Circularity Check

0 steps flagged

No circularity detected in derivation chain

The framework learns a probabilistic control-affine model offline via standard supervised learning, then applies established CBF theory to construct uncertainty-aware barriers and uses a QP for online correction. This chain depends on external results from dynamics learning and control barrier function literature rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation that reduces the safety claim to the inputs by construction. The derivation remains self-contained against independent benchmarks such as CBF forward-invariance theorems and offline RL methods.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the domain assumption that system dynamics admit a probabilistic control-affine representation learnable from offline data and that the resulting uncertainty model yields conservative yet non-vacuous CBFs.

axioms (1)

Domain assumption: system dynamics can be represented as a probabilistic control-affine model.
Invoked when stating that a probabilistic control-affine dynamics model is learned offline to construct CBFs.
