In particular, there is no recovery phase: even if cost is high, updates still improve reward (as long as gradients are not misaligned).

Reward strictly improves whenever the reward and cost gradients are sufficiently aligned.

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

SB-TRPO: Towards Safe Reinforcement Learning with Hard Constraints

cs.LG · 2025-12-29 · unverdicted · novelty 6.0

SB-TRPO uses dynamic convex combinations of reward and cost natural policy gradients to guarantee local safety progress while improving rewards in hard-constrained RL.

citing papers explorer

Showing 1 of 1 citing paper.

SB-TRPO: Towards Safe Reinforcement Learning with Hard Constraints

cs.LG · 2025-12-29 · unverdicted · none · ref 10