Constrained policy optimization · 2017

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Learning Reachability of Energy Storage Arbitrage

eess.SY · 2025-12-06 · unverdicted · novelty 7.0

A stopping-time reward and chance-constrained SoC penalty embedded in an end-to-end learning framework improves battery reachability of target ranges, raises arbitrage profit, and lowers profit variance under volatile prices.

Constraint-Aware Reinforcement Learning via Adaptive Action Scaling

cs.RO · 2025-10-13 · unverdicted · novelty 6.0

A separate regulator module adaptively scales actions in RL to reduce constraint violations while preserving exploration, yielding up to 126x fewer violations and over 10x higher returns on Safety Gym tasks.

COOPO: Cyclic Offline-Online Policy Optimization Algorithm

cs.LG · 2026-05-18 · unverdicted · novelty 5.0

COOPO is a cyclic offline-online RL algorithm that repeatedly anchors the policy to a dataset via KL-regularized updates and then fine-tunes it online, claiming better sample efficiency and monotonic improvement under coverage assumptions.

citing papers explorer

Learning Reachability of Energy Storage Arbitrage · eess.SY · 2025-12-06 · unverdicted · none · ref 29
Constraint-Aware Reinforcement Learning via Adaptive Action Scaling · cs.RO · 2025-10-13 · unverdicted · none · ref 14
COOPO: Cyclic Offline-Online Policy Optimization Algorithm · cs.LG · 2026-05-18 · unverdicted · none · ref 43