Constrained policy optimization
3 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: 3 unverdicted
citing papers explorer
- Learning Reachability of Energy Storage Arbitrage
  A stopping-time reward and chance-constrained SoC penalty embedded in an end-to-end learning framework improves battery reachability of target ranges, raises arbitrage profit, and lowers profit variance under volatile prices.
- Constraint-Aware Reinforcement Learning via Adaptive Action Scaling
  A separate regulator module adaptively scales actions in RL to reduce constraint violations while preserving exploration, yielding up to 126x fewer violations and over 10x higher returns on Safety Gym tasks.
- COOPO: Cyclic Offline-Online Policy Optimization Algorithm
  COOPO is a cyclic offline-online RL algorithm that repeatedly anchors the policy to a dataset via KL-regularized updates then fine-tunes online, claiming better sample efficiency and monotonic improvement under coverage assumptions.