
During policy iteration, DUIPI updates the baseline policy by iteratively increasing the probability of the action with the highest penalized action-value U
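The update described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not DUIPI's exact rule: the function name `improve_policy`, the fixed `step` size, and the proportional rescaling of the remaining actions are all hypothetical choices made for this example. The penalized action-values `U` are taken as given input.

```python
import numpy as np

def improve_policy(pi, U, step):
    """One improvement step for a stochastic policy (illustrative sketch).

    pi   : (S, A) array, each row a probability distribution over actions.
    U    : (S, A) array of penalized action-values (assumed precomputed).
    step : hypothetical step size in (0, 1]; an assumption of this sketch.
    """
    pi = np.asarray(pi, dtype=float)
    U = np.asarray(U, dtype=float)
    rows = np.arange(pi.shape[0])

    # Action with the highest penalized action-value in each state.
    best = np.argmax(U, axis=1)

    # Raise its probability by `step`, capped at 1.
    old_best = pi[rows, best]
    new_best = np.minimum(old_best + step, 1.0)

    # Shrink the other actions proportionally so each row still sums to 1.
    remaining_old = 1.0 - old_best
    scale = np.divide(
        1.0 - new_best,
        remaining_old,
        out=np.zeros_like(remaining_old),
        where=remaining_old > 0,
    )
    new_pi = pi * scale[:, None]
    new_pi[rows, best] = new_best
    return new_pi


pi0 = np.array([[0.5, 0.5], [0.2, 0.8]])
U0 = np.array([[1.0, 0.0], [0.0, 2.0]])
pi1 = improve_policy(pi0, U0, step=0.25)
```

Repeating the step moves the policy gradually toward the greedy choice under U rather than jumping to it at once, which is the "iteratively increasing" behavior the sentence describes.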

1 Pith paper cites this work. Polarity classification is still indexing.

1 Pith paper citing it

fields

cs.LG 1

years

2026 1

verdicts

UNVERDICTED 1

representative citing papers

Robust Probabilistic Shielding for Safe Offline Reinforcement Learning

cs.LG · 2026-05-11 · unverdicted · novelty 5.0

Shielding the policy improvement process in offline RL yields policies that are safe with high probability while outperforming unshielded baselines in both average and worst-case performance, especially under limited data.

citing papers explorer

Showing 1 of 1 citing paper.

  • Robust Probabilistic Shielding for Safe Offline Reinforcement Learning cs.LG · 2026-05-11 · unverdicted · none · ref 49

    Shielding the policy improvement process in offline RL yields policies that are safe with high probability while outperforming unshielded baselines in both average and worst-case performance, especially under limited data.