During policy iteration, DUIPI updates the baseline policy by iteratively increasing the probability of the action with the highest penalized action-value 𝑈

for a more extensive discussion
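The excerpt above is truncated, so the exact form of the penalized action-value and of the update step is not recoverable from this page. As a rough illustration only, the sketch below assumes a penalty of the form U = Q - xi * sigma (value estimate minus a multiple of its uncertainty) and a fixed step size by which the best action's probability grows. The names `penalized_values`, `update_policy`, `xi`, and `step` are hypothetical and are not taken from the cited work.

```python
import numpy as np


def penalized_values(q, sigma_q, xi):
    """Assumed penalty form: value estimate minus xi times its uncertainty."""
    return q - xi * sigma_q


def update_policy(pi, u, step):
    """Shift probability mass toward the action with the highest penalized value.

    pi   : (n_states, n_actions) array, each row sums to 1
    u    : (n_states, n_actions) array of penalized action-values
    step : amount added to the best action's probability (assumed, in (0, 1])
    """
    pi = pi.copy()
    rows = np.arange(pi.shape[0])
    best = np.argmax(u, axis=1)

    old_best = pi[rows, best]
    new_best = np.minimum(old_best + step, 1.0)

    # Rescale the remaining actions so that every row still sums to 1.
    remaining_old = 1.0 - old_best
    remaining_new = 1.0 - new_best
    safe_old = np.where(remaining_old > 0, remaining_old, 1.0)
    scale = np.where(remaining_old > 0, remaining_new / safe_old, 0.0)

    pi = pi * scale[:, None]
    pi[rows, best] = new_best
    return pi


# Toy example with one state and two actions.
q = np.array([[1.0, 1.2]])
sigma_q = np.array([[0.1, 0.5]])
u = penalized_values(q, sigma_q, xi=1.0)  # the uncertain action is penalized more
pi = update_policy(np.array([[0.5, 0.5]]), u, step=0.25)
```

In this toy case the second action has the higher raw estimate, but its larger uncertainty makes the first action the better choice after penalization, so the first action's probability is the one that increases.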

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Robust Probabilistic Shielding for Safe Offline Reinforcement Learning

cs.LG · 2026-05-11 · unverdicted · novelty 5.0

Shielding the policy improvement process in offline RL yields policies that are safe with high probability while outperforming unshielded baselines in both average and worst-case performance, especially under limited data.

