Reward Constrained Policy Optimization

Chen Tessler, Daniel J. Mankowitz, Shie Mannor

classification cs.LG cs.AI stat.ML

keywords policy reward constrained optimization approach constraint constraints satisfying

Solving tasks in Reinforcement Learning is no easy feat. As the goal of the agent is to maximize the accumulated reward, it often learns to exploit loopholes and misspecifications in the reward signal, resulting in unwanted behavior. While constraints may solve this issue, there is no closed-form solution for general constraints. In this work we present a novel multi-timescale approach for constrained policy optimization, called "Reward Constrained Policy Optimization" (RCPO), which uses an alternative penalty signal to guide the policy towards a constraint-satisfying one. We prove the convergence of our approach and provide empirical evidence of its ability to train constraint-satisfying policies.
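
The abstract does not spell out the update rules, so the following is only a minimal sketch of what a multi-timescale penalty scheme could look like, assuming a standard Lagrangian-style penalty (reward minus a multiplier times the constraint cost). Everything in it is hypothetical and not taken from the paper: the toy `reward` and `cost` functions, the one-parameter deterministic "policy" `a`, the threshold `alpha`, and the step sizes `lr_policy` and `lr_lambda`. The policy parameter is updated with the larger step size (fast timescale) and the multiplier with the smaller one (slow timescale).

```python
def reward(a):
    # Hypothetical toy reward: maximal at a = 2.
    return -(a - 2.0) ** 2


def cost(a):
    # Hypothetical toy constraint cost; the constraint is cost(a) <= alpha.
    return a


def penalized_reward(a, lam):
    # Assumed penalty form: reward minus multiplier-weighted cost.
    return reward(a) - lam * cost(a)


def train(alpha=1.0, lr_policy=0.1, lr_lambda=0.01, steps=5000):
    a = 0.0    # toy "policy": a single deterministic action parameter
    lam = 0.0  # penalty multiplier, kept non-negative
    for _ in range(steps):
        # Fast timescale: gradient ascent on the penalized reward.
        # d/da [-(a - 2)^2 - lam * a] = -2 * (a - 2) - lam
        grad_a = -2.0 * (a - 2.0) - lam
        a += lr_policy * grad_a
        # Slow timescale: increase lam while the constraint is violated,
        # decrease it (down to zero) while there is slack.
        lam = max(0.0, lam + lr_lambda * (cost(a) - alpha))
    return a, lam


if __name__ == "__main__":
    a, lam = train()
    print(f"a = {a:.4f}, lam = {lam:.4f}, cost = {cost(a):.4f}")
```

In this toy problem the unconstrained optimum (`a = 2`) violates the constraint, so the multiplier grows until the penalized optimum sits on the constraint boundary. With these settings the iterates should settle near `a = 1` and `lam = 2`. The toy reward is strictly concave on purpose: with a penalized objective that is linear in the policy parameter, this kind of simultaneous update can oscillate rather than settle.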

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
cs.LG 2026-04 unverdicted novelty 7.0

Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.

Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability
cs.LG 2026-05 unverdicted novelty 6.0

Action-conditioned near-term risk prediction gates optimistic and conservative value estimates in RL to approximate risk-sensitive POMDP control, yielding better safety-performance tradeoffs with lower runtime than be...

Safety-Constrained Reinforcement Learning with Post-Training Reachability Verification for Robot Navigation
cs.RO 2026-05 unverdicted novelty 6.0

CVaR-constrained TD3 policies for robot navigation show larger safety margins and higher post-training reachability verification rates than average-cost baselines across simulated scenarios and real-robot tests.

Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Introduces RAPCs and a contraction Bellman operator that jointly enforce probabilistic reach-avoid constraints while minimizing expected costs in stochastic RL, with almost-sure convergence to local optima.

Shaping Zero-Shot Coordination via State Blocking
cs.LG 2026-05 unverdicted novelty 6.0

SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.

Why Does Agentic Safety Fail to Generalize Across Tasks?
cs.LG 2026-05 conditional novelty 6.0

Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...

CMP: Robust Whole-Body Tracking for Loco-Manipulation via Competence Manifold Projection
cs.RO 2026-04 unverdicted novelty 6.0

CMP projects actions onto a learned competence manifold using a frame-wise safety scheme and isomorphic latent space to achieve up to 10x better survival in out-of-distribution scenarios with under 10% tracking loss.

Constrained Policy Optimization for Provably Fair Order Matching
cs.GT 2026-04 unverdicted novelty 6.0

CPO-FOAM recovers over 95% of unconstrained order-matching throughput at under 3% fairness violation rates by combining analytic trust-region updates with PID-driven safety margins in a CMDP.

Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems
cs.MA 2026-05 unverdicted novelty 5.0

Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.

Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production
cs.AI 2026-04 unverdicted novelty 5.0

PF-CD3Q uses online particle filtering to estimate fatigue parameters and constrains a deep Q-learning agent to solve fatigue-aware human-robot task planning as a CMDP.