SB-TRPO uses dynamic convex combinations of reward and cost natural policy gradients to guarantee local safety progress while improving rewards in hard-constrained RL.
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.LG (1)
years: 2025 (1)
verdicts: UNVERDICTED (1)
representative citing papers
SB-TRPO: Towards Safe Reinforcement Learning with Hard Constraints