SB-TRPO uses dynamic convex combinations of reward and cost natural policy gradients to guarantee local safety progress while improving rewards in hard-constrained RL.
In particular, there is no recovery phase: even if cost is high, updates still improve reward (as long as the gradients are not misaligned).
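To make the idea of a dynamic convex combination concrete, here is a minimal sketch. It rests on several assumptions that are not taken from the paper: plain gradient vectors stand in for natural policy gradients, the function name `combine` and the parameter `beta` are hypothetical, and the rule for choosing the weight `mu` is one simple possibility rather than the rule SB-TRPO actually uses. The sketch picks the smallest `mu` in [0, 1] such that the combined direction decreases cost, to first order, by at least a fraction `beta` of what a pure cost-descent step would achieve.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(u, v))


def combine(g_reward, g_cost, beta=0.5):
    """Return (mu, d) with d = (1 - mu) * g_reward - mu * g_cost.

    Assumption (not from the source): mu is the smallest weight in [0, 1]
    for which <d, g_cost> <= -beta * ||g_cost||^2, with 0 <= beta <= 1.
    """
    a = dot(g_reward, g_cost)   # first-order cost change from a pure reward step
    b = dot(g_cost, g_cost)     # squared norm of the cost gradient
    target = a + beta * b
    if target <= 0.0:
        # The pure reward step already meets the cost-decrease requirement.
        mu = 0.0
    else:
        # Here a + b >= target > 0, so the division is safe.
        mu = min(1.0, target / (a + b))
    d = [(1.0 - mu) * r - mu * c for r, c in zip(g_reward, g_cost)]
    return mu, d


# Orthogonal gradients: the update mixes both directions.
mu, d = combine([1.0, 0.0], [0.0, 1.0], beta=0.5)
print(mu, d)
```

In the orthogonal case above, the combined direction has a positive inner product with the reward gradient and a negative one with the cost gradient, so under these assumptions it improves reward and reduces cost at the same time, without switching to a separate recovery phase.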
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2025 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
SB-TRPO: Towards Safe Reinforcement Learning with Hard Constraints