SB-TRPO: Towards Safe Reinforcement Learning with Hard Constraints

Ankit Kanwar; Dominik Wagner; Luke Ong

arxiv: 2512.23770 · v3 · submitted 2025-12-29 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-16 19:47 UTC · model grok-4.3

keywords: safe reinforcement learning · trust region policy optimization · hard constraints · natural policy gradient · convex combination · Safety Gymnasium

The pith

SB-TRPO updates policies through a dynamic convex combination of reward and cost natural gradients to guarantee a fixed fraction of optimal safety improvement while using spare capacity for reward gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SB-TRPO as a method for reinforcement learning under hard safety constraints, where agents must drive costs exactly to zero while maximizing rewards. It achieves this by blending the natural policy gradients for reward and cost at each update step in a way that reserves a predetermined share of the trust-region step for cost reduction. Formal analysis shows this produces local progress toward safety without forcing the policy to become overly conservative. When the two gradients point in compatible directions, the method still advances the reward objective with the remaining step size. Experiments on Safety Gymnasium benchmarks confirm the approach delivers stronger safety-task trade-offs than prior model-free constrained RL algorithms.
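To make "natural policy gradient inside a trust region" concrete, here is a minimal sketch of the standard TRPO-style step under a quadratic approximation of the KL divergence. It is not taken from the paper's code: the dense Fisher matrix `fisher`, the function name `natural_step`, and the toy numbers are illustrative assumptions, and practical implementations usually replace the dense solve with conjugate gradient.

```python
import numpy as np

def natural_step(grad, fisher, delta):
    """Step s maximising grad @ s subject to 0.5 * s @ fisher @ s <= delta."""
    nat = np.linalg.solve(fisher, grad)           # natural gradient F^{-1} g
    scale = np.sqrt(2.0 * delta / (grad @ nat))   # largest scale allowed by the KL radius
    return scale * nat

# Hypothetical two-parameter example.
fisher = np.array([[2.0, 0.0], [0.0, 0.5]])
g_r = np.array([1.0, 1.0])
step = natural_step(g_r, fisher, delta=0.01)
kl_quad = 0.5 * step @ fisher @ step              # quadratic KL estimate of the step
```

The step sits exactly on the boundary of the quadratic trust region. For the cost objective the same routine would be applied to the negated cost gradient, since cost is to be decreased.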

Core claim

SB-TRPO updates via a dynamic convex combination of the reward and cost natural policy gradients, ensuring a fixed fraction of optimal cost reduction while using remaining update capacity for reward improvement, with formal guarantees of local progress on safety whenever gradients align suitably.

What carries the argument

Dynamic convex combination of reward and cost natural policy gradients that reserves a fixed fraction of the trust-region update for cost reduction.
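One way to read this mechanism is sketched below. The sketch assumes the blend takes the smallest weight on the cost direction that still achieves a fraction β of the cost direction's first-order decrease, i.e. ⟨g_c, Δ⟩ ≤ β · ⟨g_c, Δ_c⟩. The paper's Equation (4) is not reproduced on this page, so the closed form for `mu` and the function name are assumptions, not the authors' implementation. The toy values mirror the Figure 1 caption (β = 0.7, Δ_r = g_r, Δ_c = −g_c, ϵ = 1.4).

```python
import numpy as np

def safety_biased_direction(g_c, delta_r, delta_c, beta):
    """Blend delta_r and delta_c so that <g_c, delta> <= beta * <g_c, delta_c>.

    Assumes delta_c is a cost-descent direction, i.e. <g_c, delta_c> <= 0.
    Returns the blended direction and the weight mu placed on delta_c.
    """
    a = float(g_c @ delta_r)   # first-order cost change along the reward direction
    b = float(g_c @ delta_c)   # first-order cost change along the cost direction
    target = beta * b          # required first-order cost change (non-positive)
    if a <= target:
        mu = 0.0               # the reward direction already reduces cost enough
    else:
        mu = (a - target) / (a - b)   # here b <= target < a, so the denominator is positive
    return (1.0 - mu) * delta_r + mu * delta_c, mu

g_c = np.array([1.0, 1.0])     # squared norm 2, so beta * 2 = 1.4
g_r = np.array([1.0, 0.0])     # hypothetical reward gradient
delta, mu = safety_biased_direction(g_c, delta_r=g_r, delta_c=-g_c, beta=0.7)
```

With these values the weight on the cost direction comes out at about 0.8, and the blended direction meets the first-order target ⟨g_c, Δ⟩ ≈ −1.4. When the reward direction alone already meets the target, the weight is zero and the whole step goes to reward.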

If this is right

The method supplies local safety progress guarantees without requiring manual penalty tuning.
Reward improvement remains possible whenever the reward and cost gradients are not directly opposed.
The algorithm remains compatible with standard trust-region policy optimization machinery.
Empirical results on Safety Gymnasium tasks show superior safety-task balance in the hard-constrained regime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same convex-combination idea could be tested on continuous control tasks outside the Safety Gymnasium suite to check robustness to different dynamics.
If the fraction parameter proves stable across environments, it may reduce the sample complexity of safety tuning compared with Lagrangian methods.
Extending the approach to settings with multiple simultaneous cost constraints would require only a weighted sum inside the combination step.

Load-bearing premise

A fixed fraction of optimal cost reduction can be secured by the convex combination while preserving trust-region validity and allowing reward gains when gradients align.

What would settle it

A controlled run on a simple linear-quadratic safety task where the observed cost reduction after an SB-TRPO step falls short of the claimed fixed fraction of the optimal cost gradient step.
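A toy version of such a check is sketched below. Everything in it is an assumption made for illustration: the quadratic cost, the Euclidean metric in place of the Fisher metric, the hypothetical reward gradient, the step size `eta`, and the closed form used for the blending weight. It compares the observed cost reduction of the blended step with that of the pure cost step at the same step size.

```python
import numpy as np

def cost(theta):
    # Toy quadratic cost; its gradient at theta is theta itself.
    return 0.5 * float(theta @ theta)

theta = np.array([1.0, 1.0])
g_c = theta.copy()
g_r = np.array([1.0, 0.0])      # hypothetical reward gradient
beta, eta = 0.7, 0.1

delta_c = -g_c
a, b = float(g_c @ g_r), float(g_c @ delta_c)
# Smallest weight on delta_c with <g_c, delta> <= beta * <g_c, delta_c> (assumed rule).
mu = 0.0 if a <= beta * b else (a - beta * b) / (a - b)
delta = (1.0 - mu) * g_r + mu * delta_c

observed = cost(theta) - cost(theta + eta * delta)     # reduction from the blended step
best = cost(theta) - cost(theta + eta * delta_c)       # reduction from the pure cost step
fraction = observed / best
```

In this instance the observed fraction is slightly above β. Because the guarantee is local and first-order, a falsification attempt would need to repeat the comparison over many states and step sizes, and to keep `eta` small enough that curvature does not dominate.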

Figures

Figures reproduced from arXiv: 2512.23770 by Ankit Kanwar, Dominik Wagner, Luke Ong.

**Figure 1.** Visualisation of the adaptive convex combination ∆ of ∆r and ∆c given by Equation (4) for ϵ := 1.4 = −β · ⟨gc, ∆c⟩, where β := 0.7, and the special case that ∆r = gr and ∆c = −gc.

Algorithm 1: Safety-Biased Trust Region Policy Optimisation (SB-TRPO)
Require: KL divergence limit δ > 0, safety bias β ∈ [0, 1], training epochs N ∈ ℕ
1: initialise θ
2: κ ← 10⁻⁸ {small constant to avoid division by 0}
3: for N …

**Figure 2.** Car Circle: staying safe and improving reward task performance. When another method attains a higher safe reward, it typically comes at the expense of a substantial reduction in safety; conversely, methods with higher safety probabilities generally achieve significantly lower safe reward. These trends also manifest for raw rewards and costs. On the other hand, PPO-Lagrangian often collapses to poor reward…

**Figure 3.** Ablation study of safety bias β. [Plot residue from the adjacent figure: density of the angle (degrees) between policy updates and the reward gradient, titled "Policy Updates in Direction of Reward Gradient", for SB-TRPO, C-TRPO and CPO; panels (a) Car Circle and (b) Point Button.]

**Figure 4.** Angles between policy updates and reward gradients
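The quantity plotted here can be computed as sketched below. This is a minimal sketch assuming the angle is measured with the plain Euclidean inner product on parameter vectors; the page does not say which metric the paper uses, and the vectors are hypothetical.

```python
import numpy as np

def angle_degrees(u, v):
    """Angle between vectors u and v in degrees (Euclidean inner product)."""
    cos = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

update = np.array([-0.6, -0.8])   # hypothetical policy update
g_r = np.array([1.0, 0.0])        # hypothetical reward gradient
angle = angle_degrees(update, g_r)
```

An angle below 90 degrees means the update has a positive component along the reward gradient; the hypothetical update above is at roughly 127 degrees, so to first order it trades reward for safety.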

**Figure 5.** Safe navigation tasks of Safety Gymnasium (Ji et al., 2023) (images taken from https://safety-gymnasium.readthedocs.io/en/latest/environments/safe_navigation.html)

**Figure 6.** Training curves

**Figure 7.** Point Button: training longer, doing better. [Plot residue: cost against reward for SafetyCarGoal2-v0 and SafetyPointButton2-v0; legend: Ours (β = 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90), PPO-Lagrangian, C-TRPO, CPO, CUP; reference line at Reward = 0.]

**Figure 8.** Ablation study of safety bias β

C.2.2. Deviations from theory in training curves. While our theoretical results (Theorem 4.2) guarantee consistent improvement in both reward and cost for sufficiently small steps, practical training exhibits deviations due to finite sample estimates and the sparsity of the cost signal in some tasks. We summarise task-specific observations:
• In Swimmer Velocity, some update …

**Figure 9.** Angles between policy updates and cost gradients

**Figure 10.** Effect of critics on training curves of SB-TRPO

D. Novelty and Comparison to Existing Methods. We proceed by highlighting our core contribution and conceptual novelty, and explaining why these lead to superior performance in practice. Core contribution and conceptual novelty. Whilst prior work usually addresses general CMDPs, our method targets the setting with hard constraints specifically, allowing it to…

Original abstract

In safety-critical domains, reinforcement learning (RL) agents must often satisfy strict, zero-cost safety constraints while accomplishing tasks. Existing model-free methods frequently either fail to achieve near-zero safety violations or become overly conservative. We introduce Safety-Biased Trust Region Policy Optimisation (SB-TRPO), a principled algorithm for hard-constrained RL that dynamically balances cost reduction with reward improvement. At each step, SB-TRPO updates via a dynamic convex combination of the reward and cost natural policy gradients, ensuring a fixed fraction of optimal cost reduction while using remaining update capacity for reward improvement. Our method comes with formal guarantees of local progress on safety, while still improving reward whenever gradients are suitably aligned. Experiments on standard and challenging Safety Gymnasium tasks demonstrate that SB-TRPO consistently achieves the best balance of safety and task performance in the hard-constrained regime.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SB-TRPO blends natural gradients for reward and cost to hit a fixed safety reduction target inside TRPO, but the composite step risks breaking the cost-side trust region bound.

The paper's main move is SB-TRPO, which picks a dynamic weight to form a convex combination of the reward natural gradient and the cost natural gradient. The weight is chosen so that the update delivers a preset fraction of the best possible cost reduction, then uses whatever step size is left for reward improvement. This is presented as a way to enforce hard zero-cost constraints without the usual conservatism of prior constrained RL methods.

Referee Report

2 major / 2 minor

Summary. The paper introduces Safety-Biased Trust Region Policy Optimisation (SB-TRPO), which performs policy updates via a dynamic convex combination of the reward and cost natural policy gradients. The combination is chosen to guarantee a fixed fraction of the optimal cost reduction at each step while allocating remaining capacity to reward improvement when the gradients are aligned. The method claims formal local progress guarantees on safety constraints and reports superior empirical balance of safety and task performance on standard and challenging Safety Gymnasium benchmarks in the hard-constrained regime.

Significance. If the local safety-progress guarantee can be established without violating the underlying TRPO trust-region analysis, the approach would provide a principled, non-conservative alternative to existing model-free constrained RL methods. The explicit use of natural gradients and the convex-combination construction could serve as a template for other hard-constraint settings, particularly where zero-cost violations are required.

major comments (2)

[§3.2] §3.2, Eq. (7) and the subsequent derivation of the dynamic weight λ: the claim that the composite step preserves a fixed fraction α of the optimal cost reduction inside the trust region is not obviously supported once the reward and cost natural gradients have different magnitudes. The standard TRPO analysis bounds the KL divergence of the chosen step; a convex combination sized to meet the cost target can produce a step whose projection onto the cost gradient alone exceeds the allowed radius, breaking the local-progress guarantee.
[Theorem 1] Theorem 1 (local safety progress): the proof sketch relies on the composite update remaining feasible for the cost objective. No explicit bound is given showing that the chosen λ always satisfies the original KL constraint with respect to the cost natural gradient; a counter-example direction where the gradients are nearly orthogonal would falsify the claim.

minor comments (2)

[§3.1] Notation for the cost natural gradient is introduced without an explicit definition of the cost advantage function used in its estimation; this should be stated in §3.1 for reproducibility.
[Figure 3] Figure 3 caption does not indicate whether the shaded regions represent standard error or min/max over seeds; clarify the number of independent runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments on the trust-region analysis in Section 3.2 and the proof of Theorem 1 are well-taken. We address each point below and will revise the manuscript accordingly to close the identified gaps.

Referee: [§3.2] §3.2, Eq. (7) and the subsequent derivation of the dynamic weight λ: the claim that the composite step preserves a fixed fraction α of the optimal cost reduction inside the trust region is not obviously supported once the reward and cost natural gradients have different magnitudes. The standard TRPO analysis bounds the KL divergence of the chosen step; a convex combination sized to meet the cost target can produce a step whose projection onto the cost gradient alone exceeds the allowed radius, breaking the local-progress guarantee.

Authors: We agree that the current derivation does not explicitly handle cases of differing gradient magnitudes. In the revision we will replace the direct convex combination with a scaled version: first compute the candidate direction as the λ-weighted sum, then scale the entire step by the minimum of the individual TRPO step sizes that would be admissible for the reward and cost natural gradients separately. This guarantees that the resulting update satisfies the original KL constraint with respect to the cost gradient while still delivering at least fraction α of the optimal cost reduction. A new supporting lemma will be added to bound the KL divergence of the scaled composite step. revision: yes
Referee: [Theorem 1] Theorem 1 (local safety progress): the proof sketch relies on the composite update remaining feasible for the cost objective. No explicit bound is given showing that the chosen λ always satisfies the original KL constraint with respect to the cost natural gradient; a counter-example direction where the gradients are nearly orthogonal would falsify the claim.

Authors: The referee correctly identifies that the existing proof sketch lacks an explicit bound on λ under arbitrary angles between the gradients. We will supply a complete proof of Theorem 1 that derives a closed-form upper bound on admissible λ as a function of the cosine of the angle between the two natural gradients and their relative magnitudes. The bound ensures the composite step remains inside the trust region for the cost objective. We will also include a short analysis of the near-orthogonal case demonstrating that the dynamic weighting still yields the required local safety progress without KL violation. revision: yes
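The scaling proposed in the first response can be sketched as follows. This is an illustration of the rebuttal's description only, under a quadratic KL approximation with an explicit Fisher matrix; the function name, the matrix, and the gradients are hypothetical, and the sketch does not verify the claimed fraction of cost reduction.

```python
import numpy as np

def scaled_composite_step(g_r, g_c, fisher, lam, delta):
    """Lambda-weighted blend of natural directions, scaled by the smaller admissible step size."""
    n_r = np.linalg.solve(fisher, g_r)             # reward ascent direction
    n_c = -np.linalg.solve(fisher, g_c)            # cost descent direction
    s_r = np.sqrt(2.0 * delta / (g_r @ n_r))       # TRPO step size for n_r alone
    s_c = np.sqrt(2.0 * delta / (-(g_c @ n_c)))    # TRPO step size for n_c alone
    direction = (1.0 - lam) * n_r + lam * n_c
    return min(s_r, s_c) * direction

fisher = np.array([[2.0, 0.0], [0.0, 0.5]])
g_r = np.array([1.0, 0.0])
g_c = np.array([0.0, 1.0])                         # orthogonal to g_r, the case the referee raises
delta = 0.01
steps = [scaled_composite_step(g_r, g_c, fisher, lam, delta) for lam in np.linspace(0.0, 1.0, 11)]
kl_values = [0.5 * s @ fisher @ s for s in steps]  # quadratic KL estimate per weight
```

Under the quadratic model every blended step stays within the radius δ, because the Fisher norm of a convex combination is at most the larger of the two individual norms. The price is a shorter step whenever the two natural gradients differ in magnitude.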

Circularity Check

0 steps flagged

SB-TRPO derivation builds on standard TRPO without self-referential reduction

The paper's core update rule is a dynamic convex combination of reward and cost natural policy gradients that preserves a fixed fraction of the optimal cost reduction inside the trust region. This construction directly extends the existing TRPO analysis and natural-gradient machinery; the local safety-progress guarantee follows from the standard KL-constrained step rather than from any parameter fitted to the target result or from a self-citation chain. No equation or claim reduces the claimed guarantee to its own inputs by definition, and the provided text contains no load-bearing self-citations that would force the result.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The method relies on standard assumptions in policy gradient methods and trust regions; a parameter controls the fixed fraction of cost reduction.

free parameters (1)

fixed fraction of optimal cost reduction
A parameter controlling how much of the update is dedicated to cost reduction.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: SB-TRPO updates via a dynamic convex combination of the reward and cost natural policy gradients, ensuring a fixed fraction of optimal cost reduction while using remaining update capacity for reward improvement.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: Lemma 4.1 and Theorem 4.2 on monotonic cost decrease and conditional reward improvement inside trust region

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 1 internal anchor

[1]

Dickerson

doi: 10.48550/arXiv.2506.00700. URL https://doi.org/10.48550/arXiv.2506.00700.
Perkins, T. J. and Barto, A. G. Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research, 3(Dec):803–832, 2002.
Ray, A., Achiam, J., and Amodei, D. Benchmarking safe exploration in deep reinforcement learning. 2019.
Schulman, J., Levine, S., Abbeel...
[2]

The cost is monotonically decreasing: $J_c(\pi_{k+1}) \le J_c(\pi_k)$ for all $k$
[3]

Whenever no cost decrease is possible, reward improves: $J_c(\pi_{k+1}) = J_c(\pi_k)$ implies $J_r(\pi_{k+1}) \ge J_r(\pi_k)$
[4]

The theorem follows directly by definition of (Update 1)

If for some $K$ neither cost nor reward improves, $J_c(\pi_{K+1}) = J_c(\pi_K)$ and $J_r(\pi_{K+1}) = J_r(\pi_K)$, then $\pi_K$ is a trust-region local optimum of both cost and the modified constrained problem:

$$J_c(\pi_K) = \min_{\pi \,:\, D^{\max}_{\mathrm{KL}}(\pi_K \,\|\, \pi) \le \delta} J_c(\pi) = c^*_{\pi_K}, \qquad J_r(\pi_K) = \max_{\pi \,:\, J_c(\pi) \le c^*_{\pi_K},\ D^{\max}_{\mathrm{KL}}(\pi_K \,\|\, \pi) \le \delta} J_r(\pi)$$

Proof sketch. Note that $J_c(\pi_{k+1}) = J_c(\pi_k)$ implies $\epsilon = 0$. The theorem follows dire...
[5]

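if $g_c = 0$ then $J_r(\theta + \eta\cdot\Delta) \ge J_r(\theta) + \eta\cdot\langle g_r, \Delta_r\rangle - \frac{L\cdot\eta^2}{2}\cdot\|\Delta\|^2$. Proof. First, note that $g_r = \nabla J_r(\theta)$ and $g_c = \nabla J_c(\theta)$. Therefore, by assumption and Taylor's theorem, for every $\eta \in [0, 1]$,

$$J_c(\theta + \eta\cdot\Delta) \le J_c(\theta) + \eta\cdot\langle g_c, \Delta\rangle + \frac{L\cdot\eta^2}{2}\cdot\|\Delta\|^2$$
$$J_r(\theta + \eta\cdot\Delta) \ge J_r(\theta) + \eta\cdot\langle g_r, \Delta\rangle - \frac{L\cdot\eta^2}{2}\cdot\|\Delta\|^2$$

since $\|\eta\cdot\Delta\| \le M$. By the choice of $\epsilon$ and Lemma B.3, $\langle g_c, \Delta\rangle \le \beta\cdot\langle g_c, \Delta_c\rangle$ and the first...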
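The two facts quoted in this fragment, the Taylor bound on $J_c$ and the inequality $\langle g_c, \Delta\rangle \le \beta\cdot\langle g_c, \Delta_c\rangle$, can be chained as follows. This is a reader's reconstruction of the step the truncated proof appears to take, not text from the paper; the explicit step-size threshold additionally assumes $\beta > 0$, $\langle g_c, \Delta_c\rangle < 0$ and $\Delta \neq 0$.

```latex
\begin{align*}
J_c(\theta + \eta\Delta)
  &\le J_c(\theta) + \eta\,\langle g_c, \Delta\rangle + \frac{L\eta^2}{2}\,\|\Delta\|^2 \\
  &\le J_c(\theta) + \eta\,\beta\,\langle g_c, \Delta_c\rangle + \frac{L\eta^2}{2}\,\|\Delta\|^2 ,
\end{align*}
% The right-hand side is strictly below J_c(\theta) whenever
\[
0 < \eta < \frac{2\,\beta\,\lvert\langle g_c, \Delta_c\rangle\rvert}{L\,\|\Delta\|^2}.
\]
```

The linear term is negative and scales with $\eta$, while the curvature penalty scales with $\eta^2$, so a sufficiently small step always yields a strict cost decrease under these assumptions.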
[6]

if $g_c \neq 0$ then $J_c(\theta + \eta\cdot\Delta) < J_c(\theta)$
[7]

if $g_r \neq 0$ and $\langle g_r, \Delta_c\rangle \ge 0$ (in particular, $g_c = 0$) then $J_r(\theta + \eta\cdot\Delta) \ge J_r(\theta)$. Proof. Since $J_r$ and $J_c$ are smooth, they are $L$-smooth on the bounded set $B_M(\theta)$ for $M := \max\{\|\Delta_r\|, \|\Delta_c\|\}$ and sufficiently large $L > 0$. If $g_c \neq 0$ then $\langle g_c, \Delta_c\rangle < 0$ and the claim follows from Lemma B.5 for sufficiently small $\eta \in (0, 1]$. Next, suppose that $g_r \neq 0$ and $\langle g_r, \Delta_c\rangle \ge 0$. By definition of $\Delta$...
[8]

distance

on a given environment (e.g., SafetyPointGoal2-v0), navigate to the single agent directory and run:
python ppo_lag.py --task SafetyPointGoal2-v0 --seed 2000 --cost-limit 0
Similarly, for our method, SB-TRPO, at a specific safety bias (e.g., β = 0.65), use:
python sb-trpo.py --task SafetyPoi...
[9]

cost strictly decreases at every iteration (unless cost gradient vanishes),

[10]

In particular, there is no recovery phase: even if cost is high, updates still improve reward (as long as gradients are not misaligned)

reward strictly improves whenever reward and cost gradients are sufficiently aligned. In particular, there is no recovery phase: even if cost is high, updates still improve reward (as long as gradients are not misaligned). Besides, in the idealised setting of (Update 1), if neither cost nor reward improves further, the method arrives at a trust-region local op...
