Extends SPIBB with soft uncertainty-constrained policy search for less conservative safe policy improvement in batch RL, with optimal and approximate solvers shown empirically on finite and neural MDPs.
Dataset generation The MDP is modified to include another goal: terminal state with a reward of 1 when accessing it
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2019 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Safe Policy Improvement with Soft Baseline Bootstrapping
Extends SPIBB with soft uncertainty-constrained policy search for less conservative safe policy improvement in batch RL, with optimal and approximate solvers shown empirically on finite and neural MDPs.