Recognition: unknown
Reward Constrained Policy Optimization
read the original abstract
Solving tasks in Reinforcement Learning is no easy feat. As the goal of the agent is to maximize the accumulated reward, it often learns to exploit loopholes and misspecifications in the reward signal resulting in unwanted behavior. While constraints may solve this issue, there is no closed form solution for general constraints. In this work we present a novel multi-timescale approach for constrained policy optimization, called `Reward Constrained Policy Optimization' (RCPO), which uses an alternative penalty signal to guide the policy towards a constraint satisfying one. We prove the convergence of our approach and provide empirical evidence of its ability to train constraint satisfying policies.
This paper has not been read by Pith yet.
Forward citations
Cited by 10 Pith papers
-
Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
-
Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability
Action-conditioned near-term risk prediction gates optimistic and conservative value estimates in RL to approximate risk-sensitive POMDP control, yielding better safety-performance tradeoffs with lower runtime than be...
-
Safety-Constrained Reinforcement Learning with Post-Training Reachability Verification for Robot Navigation
CVaR-constrained TD3 policies for robot navigation show larger safety margins and higher post-training reachability verification rates than average-cost baselines across simulated scenarios and real-robot tests.
-
Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning
Introduces RAPCs and a contraction Bellman operator that jointly enforce probabilistic reach-avoid constraints while minimizing expected costs in stochastic RL, with almost-sure convergence to local optima.
-
Shaping Zero-Shot Coordination via State Blocking
SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.
-
Why Does Agentic Safety Fail to Generalize Across Tasks?
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...
-
CMP: Robust Whole-Body Tracking for Loco-Manipulation via Competence Manifold Projection
CMP projects actions onto a learned competence manifold using a frame-wise safety scheme and isomorphic latent space to achieve up to 10x better survival in out-of-distribution scenarios with under 10% tracking loss.
-
Constrained Policy Optimization for Provably Fair Order Matching
CPO-FOAM recovers over 95% of unconstrained order-matching throughput at under 3% fairness violation rates by combining analytic trust-region updates with PID-driven safety margins in a CMDP.
-
Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems
Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.
-
Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production
PF-CD3Q uses online particle filtering to estimate fatigue parameters and constrains a deep Q-learning agent to solve fatigue-aware human-robot task planning as a CMDP.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.