Cited by 1 Pith paper (polarity classification is still indexing):
A survey of deep reinforcement learning in non-stationary environments. arXiv preprint arXiv:2301.02804. Field: cs.LG; year: 2026; verdict: UNVERDICTED.
Regret-Aware Policy Optimization: Environment-Level Memory for Replay Suppression under Delayed Harm
RAPO adds environment-level harm-trace and scar fields with bounded transition reweighting to reduce replay of delayed harm in RL, cutting re-amplification gain from 0.98 to 0.33 on graph tasks while retaining 82% task return.
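The abstract's "bounded transition reweighting" idea can be illustrated with a minimal sketch: a replay buffer that tags each transition with a harm-trace value and down-weights its sampling probability, with the weight bounded below so no transition is suppressed entirely. All class and parameter names here (`HarmAwareReplayBuffer`, `min_weight`, `harm_scale`) are illustrative assumptions, not RAPO's actual API.

```python
import random


class HarmAwareReplayBuffer:
    """Toy sketch of bounded transition reweighting under a harm trace.

    Hypothetical illustration only: transitions carrying a larger
    accumulated harm value are sampled less often, but the sampling
    weight never falls below min_weight, so reweighting stays bounded.
    """

    def __init__(self, min_weight=0.1, harm_scale=1.0):
        self.buffer = []              # list of (transition, harm) pairs
        self.min_weight = min_weight  # lower bound on any sampling weight
        self.harm_scale = harm_scale  # how strongly harm suppresses replay

    def add(self, transition, harm=0.0):
        """Store a transition with its associated harm-trace value."""
        self.buffer.append((transition, harm))

    def weight(self, harm):
        """Bounded reweighting: decays with harm, floored at min_weight."""
        return max(self.min_weight, 1.0 / (1.0 + self.harm_scale * harm))

    def sample(self, k):
        """Draw k transitions with harm-discounted probabilities."""
        weights = [self.weight(h) for _, h in self.buffer]
        picks = random.choices(self.buffer, weights=weights, k=k)
        return [t for t, _ in picks]


# Example: a harmful transition is replayed less often, but never never.
buf = HarmAwareReplayBuffer()
buf.add("benign_step", harm=0.0)    # weight 1.0
buf.add("harmful_step", harm=9.0)   # weight max(0.1, 1/10) = 0.1
batch = buf.sample(4)
```

The lower bound is the key design choice this sketch demonstrates: without it, heavily harm-tagged transitions would vanish from replay entirely, which could hurt task return rather than merely suppress re-amplification.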