
Reward Constrained Policy Optimization

Chen Tessler, Daniel J Mankowitz, Shie Mannor · 2018 · cs.LG · arXiv 1805.11074

15 Pith papers cite this work. Polarity classification is still indexing.


abstract

Solving tasks in Reinforcement Learning is no easy feat. As the goal of the agent is to maximize the accumulated reward, it often learns to exploit loopholes and misspecifications in the reward signal, resulting in unwanted behavior. While constraints may solve this issue, there is no closed-form solution for general constraints. In this work we present a novel multi-timescale approach for constrained policy optimization, called 'Reward Constrained Policy Optimization' (RCPO), which uses an alternative penalty signal to guide the policy towards a constraint-satisfying one. We prove the convergence of our approach and provide empirical evidence of its ability to train constraint-satisfying policies.
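The abstract does not spell out the update rules, so the following is only a rough sketch of the general idea it describes: a penalty signal folded into the reward, with the penalty weight adapted on a slower timescale than the policy. It assumes a standard Lagrangian-style scheme and is not taken from the paper. The function names, the multiplier `lam`, the threshold, the learning rate, and the cost values are all hypothetical.

```python
# Illustrative sketch only: a generic Lagrangian-style penalty scheme.
# All names and numbers are assumptions, not the paper's algorithm.


def penalized_reward(reward: float, cost: float, lam: float) -> float:
    """Signal the policy would optimize: task reward minus a weighted constraint cost."""
    return reward - lam * cost


def update_multiplier(lam: float, avg_cost: float, threshold: float, lr: float) -> float:
    """Slow-timescale step: raise lam while the constraint is violated,
    lower it when there is slack, and never let it drop below zero."""
    return max(0.0, lam + lr * (avg_cost - threshold))


# Hypothetical average constraint costs measured after successive policy updates.
observed_costs = [0.6, 0.5, 0.3, 0.1]
threshold = 0.2  # assumed constraint: average cost should stay at or below 0.2

lam = 0.0
history = []
for avg_cost in observed_costs:
    lam = update_multiplier(lam, avg_cost, threshold, lr=0.5)
    history.append(lam)
```

In this toy run the multiplier grows while the measured cost exceeds the threshold and starts to shrink once the cost falls below it. In a real training loop the policy would be updated many times against `penalized_reward` between multiplier steps, which is what a multi-timescale setup refers to.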


citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.

Covert Multi-bit LLM Watermarking: An Information Theory and Coding Approach

cs.IT · 2026-05-15 · unverdicted · novelty 6.0

Characterizes the exact capacity of multi-bit covert LLM watermarking via Gelfand-Pinsker and channel synthesis, then gives a polar-code algorithm achieving 0.375 bits/token at under 10% BER with negligible perplexity impact.

Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

Action-conditioned near-term risk prediction gates optimistic and conservative value estimates in RL to approximate risk-sensitive POMDP control, yielding better safety-performance tradeoffs with lower runtime than belief planning baselines.

Safety-Constrained Reinforcement Learning with Post-Training Reachability Verification for Robot Navigation

cs.RO · 2026-05-13 · unverdicted · novelty 6.0

CVaR-constrained TD3 policies for robot navigation show larger safety margins and higher post-training reachability verification rates than average-cost baselines across simulated scenarios and real-robot tests.

Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Introduces RAPCs and a contraction Bellman operator for cost-optimal policies that satisfy probabilistic reach-avoid specifications in stochastic MDPs, with almost-sure convergence to local optima.

BarrierSteer: LLM Safety via Learning Barrier Steering

cs.LG · 2026-02-23 · unverdicted · novelty 6.0

BarrierSteer applies control barrier functions to LLM latent states for constraint-guided steering that reduces unsafe generations while preserving utility.

How Does the Lagrangian Guide Safe Reinforcement Learning through Diffusion Models?

cs.LG · 2026-02-02 · unverdicted · novelty 6.0

ALGD augments the Lagrangian to locally convexify the energy landscape in diffusion models, stabilizing safe RL training and generation without changing optimal policies.

AdaFair-MARL: Enforcing Adaptive Fairness Constraints in Multi-Agent Reinforcement Learning

cs.LG · 2025-11-18 · unverdicted · novelty 6.0

AdaFair-MARL enforces workload fairness as an explicit second-order cone constraint in cooperative MARL via adaptive primal-dual optimization, achieving near-perfect constraint satisfaction while preserving team performance.

Constraint-Aware Reinforcement Learning via Adaptive Action Scaling

cs.RO · 2025-10-13 · unverdicted · novelty 6.0

A separate regulator module adaptively scales actions in RL to reduce constraint violations while preserving exploration, yielding up to 126x fewer violations and over 10x higher returns on Safety Gym tasks.

Shaping Zero-Shot Coordination via State Blocking

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.

Why Does Agentic Safety Fail to Generalize Across Tasks?

cs.LG · 2026-05-07 · conditional · novelty 6.0

Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.

CMP: Robust Whole-Body Tracking for Loco-Manipulation via Competence Manifold Projection

cs.RO · 2026-04-08 · unverdicted · novelty 6.0

CMP projects actions onto a learned competence manifold using a frame-wise safety scheme and isomorphic latent space to achieve up to 10x better survival in out-of-distribution scenarios with under 10% tracking loss.

Constrained Policy Optimization for Provably Fair Order Matching

cs.GT · 2026-04-07 · unverdicted · novelty 6.0

CPO-FOAM recovers over 95% of unconstrained order-matching throughput at under 3% fairness violation rates by combining analytic trust-region updates with PID-driven safety margins in a CMDP.

Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems

cs.MA · 2026-05-11 · unverdicted · novelty 5.0

Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.

Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production

cs.AI · 2026-04-14 · unverdicted · novelty 5.0

PF-CD3Q uses online particle filtering to estimate fatigue parameters and constrains a deep Q-learning agent to solve fatigue-aware human-robot task planning as a CMDP.

citing papers explorer

Showing 15 of 15 citing papers.

Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
cs.LG · 2026-04-21 · unverdicted · none · ref 47

Covert Multi-bit LLM Watermarking: An Information Theory and Coding Approach
cs.IT · 2026-05-15 · unverdicted · none · ref 37 · internal anchor

Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability
cs.LG · 2026-05-14 · unverdicted · none · ref 31 · internal anchor

Safety-Constrained Reinforcement Learning with Post-Training Reachability Verification for Robot Navigation
cs.RO · 2026-05-13 · unverdicted · none · ref 3 · internal anchor

Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning
cs.LG · 2026-05-12 · unverdicted · none · ref 11 · 2 links · internal anchor

BarrierSteer: LLM Safety via Learning Barrier Steering
cs.LG · 2026-02-23 · unverdicted · none · ref 20 · internal anchor

How Does the Lagrangian Guide Safe Reinforcement Learning through Diffusion Models?
cs.LG · 2026-02-02 · unverdicted · none · ref 18 · internal anchor

AdaFair-MARL: Enforcing Adaptive Fairness Constraints in Multi-Agent Reinforcement Learning
cs.LG · 2025-11-18 · unverdicted · none · ref 44 · internal anchor

Constraint-Aware Reinforcement Learning via Adaptive Action Scaling
cs.RO · 2025-10-13 · unverdicted · none · ref 15 · internal anchor

Shaping Zero-Shot Coordination via State Blocking
cs.LG · 2026-05-12 · unverdicted · none · ref 43

Why Does Agentic Safety Fail to Generalize Across Tasks?
cs.LG · 2026-05-07 · conditional · none · ref 105

CMP: Robust Whole-Body Tracking for Loco-Manipulation via Competence Manifold Projection
cs.RO · 2026-04-08 · unverdicted · none · ref 47

Constrained Policy Optimization for Provably Fair Order Matching
cs.GT · 2026-04-07 · unverdicted · none · ref 1

Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems
cs.MA · 2026-05-11 · unverdicted · none · ref 37

Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production
cs.AI · 2026-04-14 · unverdicted · none · ref 73
