Augmented Lagrangian Multiplier Network for State-wise Safety in Reinforcement Learning

2026 · cs.LG · arXiv 2605.00667

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Safety is a primary challenge in real-world reinforcement learning (RL). Formulating safety requirements as state-wise constraints has become a prominent paradigm. Handling state-wise constraints with the Lagrangian method requires a distinct multiplier for every state, so in practice the multipliers are approximated by a neural network, referred to as a multiplier network. However, applying standard dual gradient ascent to multiplier networks induces severe training oscillations. This is because the inherent instability of dual ascent is exacerbated by network generalization: local overshoots and delayed updates propagate to adjacent states, further amplifying policy fluctuations. Existing stabilization techniques are designed for scalar multipliers and are inadequate for state-dependent multiplier networks. To address this challenge, we propose an augmented Lagrangian multiplier network (ALaM) framework for stable learning of state-wise multipliers. ALaM consists of two key components. First, a quadratic penalty is introduced into the augmented Lagrangian to compensate for delayed multiplier updates and establish local convexity near the optimum, thereby mitigating policy oscillations. Second, the multiplier network is trained via supervised regression toward a dual target, which stabilizes training and promotes convergence. Theoretically, we show that ALaM guarantees multiplier convergence and thus recovers the optimal policy of the constrained problem. Building on this framework, we integrate soft actor-critic (SAC) with ALaM to develop the SAC-ALaM algorithm. Experiments demonstrate that SAC-ALaM outperforms state-of-the-art safe RL baselines in both safety and return, while also stabilizing training dynamics and learning well-calibrated multipliers for risk identification.
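
The two components named in the abstract (a quadratic penalty in the augmented Lagrangian, and regression of the multiplier toward a dual target) can be illustrated on a toy problem. The sketch below is not the paper's SAC-ALaM algorithm. It assumes the classical augmented Lagrangian for inequality constraints, uses a lookup table in place of the multiplier network and a scalar action per state in place of a policy, and all names (`solve_statewise`, `rho`, `lr_action`, `lr_dual`) are hypothetical. Each state s solves: minimize (a - t_s)^2 subject to a - c_s <= 0.

```python
# Toy illustration only (assumptions: classical augmented Lagrangian, a lookup
# table standing in for the multiplier network, one scalar action per state).

def solve_statewise(targets, limits, rho=1.0, lr_action=0.1, lr_dual=0.5, iters=2000):
    """Per state s: minimize (a - targets[s])**2 subject to a - limits[s] <= 0."""
    actions = [0.0] * len(targets)
    multipliers = [0.0] * len(targets)  # one multiplier per state
    for _ in range(iters):
        for s, (t, c) in enumerate(zip(targets, limits)):
            a, lam = actions[s], multipliers[s]
            # Primal step: gradient of the augmented Lagrangian
            #   (a - t)^2 + (max(0, lam + rho*(a - c))^2 - lam^2) / (2*rho)
            # with respect to a. The max(...) term carries the quadratic penalty.
            shifted = max(0.0, lam + rho * (a - c))
            a -= lr_action * (2.0 * (a - t) + shifted)
            # Dual target: the classical augmented Lagrangian multiplier update.
            dual_target = max(0.0, lam + rho * (a - c))
            # Regression step: one gradient step on 0.5 * (lam - dual_target)^2,
            # treating dual_target as a fixed label.
            lam += lr_dual * (dual_target - lam)
            actions[s], multipliers[s] = a, lam
    return actions, multipliers


# State 0 has an active constraint (unconstrained optimum 2.0 exceeds limit 1.0);
# state 1 has an inactive constraint (unconstrained optimum 0.5 is feasible).
actions, multipliers = solve_statewise(targets=[2.0, 0.5], limits=[1.0, 1.0])
```

Under these assumptions, the KKT conditions give a = 1, multiplier 2 for the first state and a = 0.5, multiplier 0 for the second, so the learned multiplier is nonzero only where the constraint binds. With a real multiplier network, the regression step would instead be a supervised update of the network's parameters toward `dual_target`.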

representative citing papers

MoSSP: A Momentum-Based Single-Loop Stochastic Penalty Method for Nonconvex Constrained DC-Regularized Optimization

math.OC · 2026-05-28 · unverdicted · novelty 6.0

MoSSP is a new single-loop stochastic penalty method with Polyak or recursive momentum that achieves O(ε^{-4}) or O(ε^{-3}) oracle complexity for stochastic ε-KKT points in nonconvex constrained DC-regularized problems.

