Safe Reinforcement Learning using Action Projection: Safeguard the Policy or the Environment?

Hannah Markgraf; Hanna Krasowski; Lukas Schäfer; Matthias Althoff; Sebastien Gros; Shambhuraj Sawant

arxiv: 2509.12833 · v2 · submitted 2025-09-16 · 💻 cs.LG


Pith reviewed 2026-05-18 15:53 UTC · model grok-4.3

classification 💻 cs.LG

keywords safe reinforcement learning · action projection · safety filters · policy gradients · action aliasing · actor-critic algorithms


The pith

In safe RL, applying action projection inside the policy network produces worse gradient estimates than applying it as part of the environment due to action aliasing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares two common ways to add projection-based safety to reinforcement learning agents. In one, the projection happens after the policy and is treated like part of the world the agent interacts with. In the other, the projection is built directly into the policy using a differentiable layer. The authors show that both approaches suffer from action aliasing, where several different unsafe actions get mapped to the exact same safe action and therefore lose distinct information for learning. However, the way this loss affects the training process differs: the environment version lets the critic network absorb the problem indirectly, while the policy version makes the loss appear immediately as singular matrices when computing gradients. They also test ways to reduce the damage and find that a penalty term added to the policy version can bring its performance up to or above the other approach.

Core claim

The central claim is that action aliasing in projection-based safety filters leads to information loss in policy gradients, with this effect being implicitly handled by the critic in safe environment RL but appearing explicitly as rank-deficient Jacobians in safe policy RL during backpropagation through the safeguard.

What carries the argument

Action aliasing caused by the projection operator mapping multiple unsafe actions to one safe action, and how it impacts the Jacobian of the policy gradient in actor-critic algorithms.
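
To make the many-to-one mapping concrete, here is a minimal sketch that assumes a box-shaped safe set. The box, the function names, and the numbers are illustrative assumptions and are not taken from the paper, whose safeguard is more general.

```python
# Illustrative sketch only: a box-shaped safe set stands in for the
# paper's safeguard; all names here are hypothetical.

def project_box(action, low, high):
    """Euclidean projection onto the box [low, high], coordinate-wise."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, low, high)]


def box_jacobian_diagonal(action, low, high):
    """Diagonal of the projection Jacobian away from the boundary.

    1.0 where a coordinate passes through unchanged, 0.0 where it is
    clipped. Exactly on the boundary the derivative is not unique; this
    sketch arbitrarily reports 0.0 there.
    """
    return [1.0 if lo < a < hi else 0.0 for a, lo, hi in zip(action, low, high)]


LOW, HIGH = [-1.0, -1.0], [1.0, 1.0]

unsafe_a = [1.5, 0.2]
unsafe_b = [3.0, 0.2]
safe_a = project_box(unsafe_a, LOW, HIGH)
safe_b = project_box(unsafe_b, LOW, HIGH)

# Two different unsafe actions collapse onto one safe action: aliasing.
aliased = safe_a == safe_b

# The clipped coordinate contributes a zero on the diagonal, so the
# 2x2 Jacobian at unsafe_a has rank 1 instead of 2.
diag = box_jacobian_diagonal(unsafe_a, LOW, HIGH)
rank = int(sum(diag))
```

In this toy setting, any change to the first coordinate of the unsafe action beyond the bound leaves the safe action untouched, which is the information loss the argument refers to.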

If this is right

SP-RL experiences more direct harm from action aliasing than SE-RL.
A novel penalty-based mitigation for SP-RL aligns it with practices in SE-RL and improves performance.
With suitable improvements, SP-RL can match or outperform SE-RL in various environments.
Empirical results confirm that action aliasing hurts SP-RL more severely.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practitioners facing frequent unsafe actions during training may prefer the environment-based approach unless they apply the penalty fix to the policy version.
Testing the methods in domains with very high-dimensional action spaces could reveal whether the rank deficiency becomes even more limiting.
Extending the analysis to other safety filter types beyond projections might show similar gradient issues in policy-embedded versions.

Load-bearing premise

The safety filter must be expressible as a differentiable projection operator that can be placed inside the policy network without breaking the usual actor-critic gradient flow.
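
For intuition about what this premise implies, consider the special case of a polytopic safe set. The derivation below is a sketch under assumptions that are not taken from the paper: the safe set is $\{u_s : G u_s \le h\}$, the active rows $G_{\mathcal{A}}$ have full row rank, and strict complementarity holds, so the active set $\mathcal{A}$ does not change under small perturbations of the unsafe action $u \in \mathbb{R}^n$.

```latex
\begin{aligned}
\Phi(u) &= \arg\min_{u_s} \tfrac{1}{2}\lVert u_s - u \rVert_2^2
          \quad \text{s.t.} \quad G u_s \le h, \\
\text{KKT on } \mathcal{A}: \quad
  & u_s - u + G_{\mathcal{A}}^\top \lambda_{\mathcal{A}} = 0, \qquad
    G_{\mathcal{A}} u_s = h_{\mathcal{A}}, \\
\Rightarrow \quad
  \lambda_{\mathcal{A}} &= (G_{\mathcal{A}} G_{\mathcal{A}}^\top)^{-1}
                           (G_{\mathcal{A}} u - h_{\mathcal{A}}), \\
  \Phi(u) &= u - G_{\mathcal{A}}^\top (G_{\mathcal{A}} G_{\mathcal{A}}^\top)^{-1}
             (G_{\mathcal{A}} u - h_{\mathcal{A}}), \\
\frac{\partial \Phi}{\partial u} &= I - G_{\mathcal{A}}^\top
   (G_{\mathcal{A}} G_{\mathcal{A}}^\top)^{-1} G_{\mathcal{A}},
\qquad
\operatorname{rank} \frac{\partial \Phi}{\partial u} = n - \lvert \mathcal{A} \rvert .
\end{aligned}
```

Under these assumptions the Jacobian is the orthogonal projector onto the null space of $G_{\mathcal{A}}$: perturbing $u$ along the normals of the active constraints leaves the safe action unchanged. When no constraint is active, the Jacobian is the identity and no rank is lost.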

What would settle it

Measuring the rank of the Jacobian matrices when backpropagating through the safeguard in SP-RL and checking whether they are lower rank than in the SE-RL case, or observing larger performance gaps in tasks where projections occur often.
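
A rank measurement of this kind could be sketched as follows. The snippet assumes a unit-ball safe set and uses central finite differences in place of the safeguard's actual backward pass; both choices, the tolerance, and all function names are assumptions made for illustration.

```python
import math

# Illustrative sketch only: a Euclidean ball stands in for the safe set,
# and finite differences stand in for the safeguard's backward pass.


def project_ball(action, radius=1.0):
    """Euclidean projection onto the ball of the given radius."""
    norm = math.sqrt(sum(a * a for a in action))
    if norm <= radius:
        return list(action)
    return [radius * a / norm for a in action]


def jacobian_fd(fn, action, eps=1e-6):
    """Central finite-difference Jacobian (rows index outputs)."""
    cols = []
    for j in range(len(action)):
        plus, minus = list(action), list(action)
        plus[j] += eps
        minus[j] -= eps
        cols.append([(p - m) / (2 * eps) for p, m in zip(fn(plus), fn(minus))])
    return [list(row) for row in zip(*cols)]


def matrix_rank(matrix, tol=1e-4):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    rows = [list(r) for r in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    rank = 0
    for col in range(n_cols):
        pivot = max(range(rank, n_rows),
                    key=lambda r: abs(rows[r][col]), default=None)
        if pivot is None or abs(rows[pivot][col]) < tol:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank


inside = [0.3, 0.1, -0.2]   # already safe: projection is the identity
outside = [2.0, 1.0, -0.5]  # unsafe: projected onto the sphere

rank_inside = matrix_rank(jacobian_fd(project_ball, inside))
rank_outside = matrix_rank(jacobian_fd(project_ball, outside))
```

For the ball, the radial direction is lost whenever the safeguard intervenes, so the rank drops by one. Logging this rank per sample during training, alongside the intervention rate, is one way the proposed check could be run.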

Figures

Figures reproduced from arXiv: 2509.12833 by Hannah Markgraf, Hanna Krasowski, Lukas Schäfer, Matthias Althoff, Sebastien Gros, Shambhuraj Sawant.

**Figure 2.** Effect of action aliasing on SE-RL and SP-RL algorithms using deterministic policies. We illustrate …

**Figure 3.** Effect of improvement strategies when using a differentiable safeguard during policy updates for …

**Figure 4.** Interquartile mean and 95% bootstrap confidence interval of the return and safeguard interventions …

**Figure 5.** Comparison of vanilla and modified SE-RL/SP-RL approaches (PSL: per-sample loss, PenC: …

**Figure 6.** Effect of action aliasing on SE-RL and SP-RL algorithms using deterministic policies. We illustrate …

**Figure 7.** Effect of improvement strategies when using a differentiable safeguard during policy updates for …

Original abstract

Projection-based safety filters, which modify unsafe actions by mapping them to the closest safe alternative, are widely used to enforce safety constraints in reinforcement learning (RL). Two integration strategies are commonly considered: Safe environment RL (SE-RL), where the safeguard is treated as part of the environment, and safe policy RL (SP-RL), where it is embedded within the policy through differentiable optimization layers. Despite their practical relevance in safety-critical settings, a formal understanding of their differences is lacking. In this work, we present a theoretical comparison of SE-RL and SP-RL. We identify a key distinction in how each approach is affected by action aliasing, a phenomenon in which multiple unsafe actions are projected to the same safe action, causing information loss in the policy gradients. In SE-RL, this effect is implicitly approximated by the critic, while in SP-RL, it manifests directly as rank-deficient Jacobians during backpropagation through the safeguard. Our contributions are threefold: (i) a unified formalization of SE-RL and SP-RL in the context of actor-critic algorithms, (ii) a theoretical analysis of their respective policy gradient estimates, highlighting the role of action aliasing, and (iii) a comparative study of mitigation strategies, including a novel penalty-based improvement for SP-RL that aligns with established SE-RL practices. Empirical results support our theoretical predictions, showing that action aliasing is more detrimental for SP-RL than for SE-RL. However, with appropriate improvement strategies, SP-RL can match or outperform improved SE-RL across a range of environments. These findings provide actionable insights for choosing and refining projection-based safe RL methods based on task characteristics.
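
The contrast between the two gradient estimates can be written out for a deterministic actor. The expressions below are a sketch in the standard deterministic policy gradient form, not the paper's own equations; the symbols $\pi_\theta$ (actor), $\Phi$ (projection safeguard), and $\hat{q}^{\mathrm{SE}}$, $\hat{q}^{\mathrm{SP}}$ (learned critics) are notation assumed here.

```latex
\begin{aligned}
\text{SE-RL:}\quad
\nabla_\theta J^{\mathrm{SE}}(\theta) &\approx
  \mathbb{E}_{x}\!\left[
    \left(\frac{\partial \pi_\theta(x)}{\partial \theta}\right)^{\!\top}
    \nabla_{u}\, \hat{q}^{\mathrm{SE}}(x, u)\Big|_{u = \pi_\theta(x)}
  \right], \\
\text{SP-RL:}\quad
\nabla_\theta J^{\mathrm{SP}}(\theta) &\approx
  \mathbb{E}_{x}\!\left[
    \left(\frac{\partial \pi_\theta(x)}{\partial \theta}\right)^{\!\top}
    \left(\frac{\partial \Phi(x, u)}{\partial u}\right)^{\!\top}\Big|_{u = \pi_\theta(x)}
    \nabla_{u_s}\, \hat{q}^{\mathrm{SP}}(x, u_s)\Big|_{u_s = \Phi(x, \pi_\theta(x))}
  \right].
\end{aligned}
```

In this form, the SE-RL critic is evaluated at the unsafe action and has to learn for itself that the value is flat along aliased directions, whereas in SP-RL the factor $\partial \Phi / \partial u$ appears explicitly and any direction in its null space contributes nothing to the update.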

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that SP-RL gradients degrade more under action aliasing than SE-RL gradients do, but that a penalty term closes the gap.


The main point is that embedding the projection inside the policy (SP-RL) produces rank-deficient Jacobians under action aliasing while treating it as part of the environment (SE-RL) lets the critic absorb the effect, and the authors supply a penalty fix that brings SP-RL performance back up. They formalize both styles inside actor-critic updates, derive the difference in gradient quality, and test mitigation strategies including their new penalty for SP-RL. That comparison and the penalty are the concrete additions. The experiments across environments line up with the theory and show the penalty can make SP-RL competitive or better. The central argument holds up on the terms they set. The soft spot is the differentiability of the projection operator. If the layer only supplies subgradients at the boundary points where aliasing is worst, or if the implementation adds smoothing, the exact rank-deficiency story needs extra justification to match the SE-RL side cleanly. The abstract claims support but leaves the precise protocol and ablation depth for the full text. This is useful for anyone choosing how to wire a projection safeguard into a learned controller in robotics or autonomous systems. A reader who cares about practical gradient behavior in safe RL will get clear trade-off guidance. Send it for peer review; the formal distinction and the mitigation are worth referee scrutiny on the gradient details.

Referee Report

2 major / 2 minor

Summary. The paper compares two strategies for incorporating projection-based safety filters into RL: SE-RL (treating the safeguard as part of the environment) and SP-RL (embedding a differentiable projection inside the policy network). It unifies both within actor-critic methods, analyzes how action aliasing produces information loss in policy gradients (implicitly approximated by the critic in SE-RL versus explicit rank-deficient Jacobians in SP-RL), introduces a penalty-based mitigation for SP-RL, and reports empirical results indicating that improved SP-RL can match or outperform improved SE-RL.

Significance. If the gradient analysis is sound, the work supplies a useful theoretical distinction for a common practical choice in safe RL and a mitigation strategy that aligns SP-RL with existing SE-RL techniques. The cross-environment empirical support and explicit focus on action aliasing add value for practitioners selecting or refining projection-based safeguards.

major comments (2)

§3 (Policy Gradient Analysis): The central distinction, that SP-RL produces rank-deficient Jacobians under action aliasing while SE-RL lets the critic absorb the effect, rests on the projection operator being realized as a differentiable optimization layer that preserves the standard actor-critic gradient structure. For Euclidean projection onto a convex set, the map is only subdifferentiable at boundary points where aliasing is most pronounced; without an explicit statement of the gradient (e.g., Clarke subdifferential, smoothing parameter, or solver-specific approximation) used in the back-propagation, the claimed rank deficiency does not necessarily follow from the same formalization applied to SE-RL.
§4 (Mitigation Strategies): The novel penalty-based improvement for SP-RL is presented as aligning with SE-RL practice, yet the derivation does not show how the added penalty term interacts with the (potentially rank-deficient) Jacobian of the projection layer. If the penalty is applied after the projection, the gradient flow may still be affected by the same aliasing-induced deficiency unless an additional correction is derived.
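
The boundary issue raised in the §3 comment can be seen in one dimension. The sketch below assumes an interval safe set and compares one-sided finite-difference slopes of the projection; the interval, step size, and function names are illustrative assumptions.

```python
# Illustrative sketch only: 1-D projection onto the interval [-1, 1].


def clip(a, low=-1.0, high=1.0):
    """Euclidean projection of a scalar onto [low, high]."""
    return min(max(a, low), high)


def one_sided_slopes(fn, a, eps=1e-6):
    """Backward and forward finite-difference slopes of fn at a."""
    left = (fn(a) - fn(a - eps)) / eps
    right = (fn(a + eps) - fn(a)) / eps
    return left, right


# Strictly inside the safe set both slopes are about 1.
inside_left, inside_right = one_sided_slopes(clip, 0.3)

# Strictly outside both slopes are 0: the gradient is lost.
outside_left, outside_right = one_sided_slopes(clip, 1.5)

# Exactly on the boundary the slopes disagree (about 1 vs. 0), so the
# derivative is not unique and an implementation has to pick a value.
boundary_left, boundary_right = one_sided_slopes(clip, 1.0)
```

Which value a given optimization layer returns at such points depends on its solver and any smoothing it applies, which is why the comment asks for that choice to be stated.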

minor comments (2)

[Experimental Setup] The experimental section would benefit from an explicit statement of the projection solver (e.g., CVXPY layer, custom QP solver) and any gradient approximation or smoothing used, so that the reported performance differences can be reproduced and the rank-deficiency effect verified.
[Preliminaries] Notation for the projected action and its Jacobian is introduced without a consolidated table; a short notation summary would improve readability when comparing the two RL variants.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help us strengthen the technical presentation of the gradient analysis and mitigation strategy. We address each major comment below, indicating planned revisions to the manuscript.


Referee: §3 (Policy Gradient Analysis): The central distinction, that SP-RL produces rank-deficient Jacobians under action aliasing while SE-RL lets the critic absorb the effect, rests on the projection operator being realized as a differentiable optimization layer that preserves the standard actor-critic gradient structure. For Euclidean projection onto a convex set, the map is only subdifferentiable at boundary points where aliasing is most pronounced; without an explicit statement of the gradient (e.g., Clarke subdifferential, smoothing parameter, or solver-specific approximation) used in the back-propagation, the claimed rank deficiency does not necessarily follow from the same formalization applied to SE-RL.

Authors: We agree that an explicit statement of the gradient computation is necessary for rigor. In the current manuscript, the SP-RL projection is realized via a differentiable optimization layer (Section 3) that back-propagates through the KKT conditions of the Euclidean projection problem, using the standard solver approximation implemented in the optimization layer library. At boundary points the Jacobian is indeed rank-deficient by construction when aliasing occurs, which is the source of the information loss we analyze. To address the referee's concern, we will revise §3 to explicitly document the gradient method (including any smoothing parameter or Clarke subdifferential handling) and to show that the rank deficiency holds under this formalization, thereby preserving the distinction with SE-RL where the critic implicitly averages over the aliased actions.
revision: yes

Referee: §4 (Mitigation Strategies): The novel penalty-based improvement for SP-RL is presented as aligning with SE-RL practice, yet the derivation does not show how the added penalty term interacts with the (potentially rank-deficient) Jacobian of the projection layer. If the penalty is applied after the projection, the gradient flow may still be affected by the same aliasing-induced deficiency unless an additional correction is derived.

Authors: The referee correctly notes that the interaction between the penalty and the projection Jacobian merits further derivation. In our formulation the penalty is added to the policy objective before the projection step, which reduces the probability mass on actions that would trigger aliasing and thereby lowers the frequency of rank-deficient Jacobians during back-propagation. We will expand the derivation in §4 to include the chain-rule expansion through the (sub)differentiable projection layer, demonstrating that the penalty term directly attenuates the contribution of aliased directions without requiring an extra correction. This aligns the improved SP-RL more closely with the SE-RL penalty practice while preserving the theoretical distinction we established.
revision: partial
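
How a penalty on the pre-projection action can restore a gradient signal is easiest to see in a scalar toy problem. The sketch below assumes an interval safe set, a hand-written quadratic critic, and a squared-distance penalty; these are illustrative stand-ins, and the paper's actual penalty may take a different form.

```python
# Illustrative sketch only: scalar action, interval safe set [-1, 1],
# toy critic q(u) = -(u - 0.5)**2, squared-distance penalty. All names
# and forms are assumptions, not the paper's formulation.


def clip(a, low=-1.0, high=1.0):
    """Projection safeguard for a scalar action."""
    return min(max(a, low), high)


def dclip(a, low=-1.0, high=1.0):
    """Derivative of clip away from the boundary (0.0 chosen on it)."""
    return 1.0 if low < a < high else 0.0


def dq_dsafe(u_safe):
    """Gradient of the toy critic q(u) = -(u - 0.5)**2."""
    return -2.0 * (u_safe - 0.5)


def actor_grad(a, penalty_weight):
    """d/da of  q(clip(a)) - penalty_weight * (a - clip(a))**2."""
    u_safe = clip(a)
    # Chain rule through the safeguard: vanishes when a is clipped.
    through_safeguard = dq_dsafe(u_safe) * dclip(a)
    # d/da (a - clip(a))**2 = 2 * (a - clip(a)) * (1 - dclip(a)).
    penalty = -penalty_weight * 2.0 * (a - u_safe) * (1.0 - dclip(a))
    return through_safeguard + penalty


grad_unsafe_plain = actor_grad(2.0, penalty_weight=0.0)
grad_unsafe_penalized = actor_grad(2.0, penalty_weight=0.5)
grad_safe_penalized = actor_grad(0.2, penalty_weight=0.5)
```

Without the penalty the unsafe action receives a zero gradient and has no reason to move. With it, gradient ascent pushes the action back toward the safe set, while actions that are already safe see only the critic term.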

Circularity Check

0 steps flagged

Standard actor-critic formalization with independent empirical support; no load-bearing reductions


The paper supplies a unified formalization of SE-RL and SP-RL inside actor-critic methods, derives the effect of action aliasing on policy gradients from the respective placements of the projection operator, and validates the distinction with experiments across environments. No equation or central claim reduces the reported gradient behavior or performance gap to a quantity that was fitted or defined inside the same study. The analysis rests on standard differentiable optimization layers and established RL theory rather than self-referential definitions or self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work relies on standard Markov decision process and actor-critic assumptions together with the existence of a differentiable projection operator; no new free parameters or invented entities are introduced.

axioms (1)

domain assumption: The environment is a Markov decision process and both SE-RL and SP-RL are implemented inside an actor-critic framework.
The theoretical analysis and gradient derivations are performed under the standard actor-critic setting described in the abstract.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Dyna-Style Safety Augmented Reinforcement Learning: Staying Safe in the Face of Uncertainty
cs.LG 2026-04 unverdicted novelty 5.0

Dyna-SAuR learns scalable safety filters and policies from an uncertainty-aware model, cutting failures by two orders of magnitude on CartPole and MuJoCo Walker tasks.

