arxiv: 1907.05079 · v1 · pith:RKALPD4Qnew · submitted 2019-07-11 · 💻 cs.LG · cs.AI · stat.ML

Safe Policy Improvement with Soft Baseline Bootstrapping

Pith reviewed 2026-05-24 23:00 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords safe policy improvement · batch reinforcement learning · SPIBB · model uncertainty · constrained optimization · provable safety · policy bootstrapping

The pith

Softening the binary safe-uncertain split lets batch RL search wider policies while keeping high-probability safety guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to make safe policy improvement less conservative by replacing the hard division of state-action pairs into uncertain versus safe-to-train-on sets with a continuous constraint that limits policy deviation in proportion to local model uncertainty. This change keeps the original high-probability guarantee that the learned policy will not underperform the behavioral baseline, yet permits larger updates where uncertainty is lower. A sympathetic reader would care because existing binary methods often produce policies that improve little or not at all over the data-collecting policy, restricting the usefulness of batch reinforcement learning in practice.

Core claim

By constraining the policy change locally according to model uncertainty rather than using a binary classification, the SPIBB algorithm is extended to optimize over a larger set of policies while preserving the high-probability safety guarantee, and two algorithms are provided to solve the resulting optimization problem.
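
To make the contrast concrete, here is a minimal single-state sketch in Python. It is an illustration under stated assumptions, not the paper's algorithm: the error term is assumed to shrink as 1/sqrt(count), the constraint is taken in the weighted form sum_a e(x,a)·|π(a|x) − πb(a|x)| ≤ ε, and the names `hard_partition`, `max_transferable_mass`, `n_wedge`, and all numbers are hypothetical.

```python
import math

def hard_partition(counts, n_wedge):
    """Binary split: actions seen fewer than n_wedge times are 'uncertain'."""
    return [n < n_wedge for n in counts]

def max_transferable_mass(errors, src, dst, epsilon, pi_b):
    """Largest probability mass that can move from action src to action dst
    while keeping sum_a e(a) * |pi(a) - pi_b(a)| <= epsilon in one state."""
    cost_per_unit = errors[src] + errors[dst]  # both actions deviate by the same mass
    return min(pi_b[src], epsilon / cost_per_unit)

counts = [100, 4]                              # illustrative visit counts
pi_b = [0.5, 0.5]                              # illustrative baseline policy
errors = [1.0 / math.sqrt(n) for n in counts]  # assumed error shape, not from the paper

uncertain = hard_partition(counts, n_wedge=10)
mass = max_transferable_mass(errors, src=0, dst=1, epsilon=0.12, pi_b=pi_b)
print(uncertain, round(mass, 3))
```

Under the binary split the rarely seen action is flagged uncertain, so the policy would have to copy the baseline there. Under the soft budget, a limited amount of mass (about 0.2 in this toy setting) can still be moved onto it.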

What carries the argument

The uncertainty-dependent constraint on allowed policy deviation inside the optimization objective.

If this is right

  • The method yields higher mean performance than prior SPI algorithms on finite MDPs.
  • It extends to infinite MDPs when paired with neural-network function approximation.
  • The high-probability safety guarantee relative to the baseline is retained.
  • The approach is less conservative than existing safe policy improvement techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The soft constraint could be paired with other batch RL safety mechanisms to further reduce conservatism.
  • Domains with heterogeneous uncertainty levels would be natural places to measure where the graduated-risk approach gains the most.
  • The same local-uncertainty weighting idea might transfer to continuous-action settings without major reformulation.

Load-bearing premise

That locally scaling the permitted policy change by model uncertainty is enough to bound value-estimate errors and retain the original safety guarantee.

What would settle it

An MDP in which the new policy violates the high-probability safety bound in repeated trials despite accurate uncertainty estimates, or a head-to-head run on standard benchmarks where the soft method shows no performance gain over binary SPIBB.

Figures

Figures reproduced from arXiv: 1907.05079 by Kimia Nadjahi, Rémi Tachet des Combes, Romain Laroche.

Figure 1: Average time to convergence.
Figure 2: Benchmark on Random MDPs domain: mean and 1%-CVaR performances for a hard scenario (η = 0.9) and Soft-SPIBB with ϵ = 2.
Figure 3: Influence of η on Random MDPs domain: 0.1%-CVaR heatmaps as a function of η.
  (Body text captured with this caption: "4.2 Helicopter domain. To assess our algorithms on tasks with more complex state spaces, making the use of function approximation inevitable, we apply them to a helicopter navigation task (…")
Figure 4: Sensitivity to hyperparameter on Random MDPs: 1%-CVaR heatmaps for η = 0.9.
Figure 5: Helicopter: mean and 10%-CVaR as a function of the hyper-parameter value.
  (Body text captured with this caption: "… to its position and velocity. We refer the reader to the appendix, Section C.1 for the detailed specifications. We generated a baseline by training online a DQN (Mnih et al., 2015) and applying a softmax on the learnt Q-network. During training, a discount factor of 0.9 is used, but the reported results show the undiscounted return ob…")
Figure 6: Random MDPs (no additional goal): SPIBB hyper-parameter search results: (a-d) mean and 1%-CVaR performance heatmaps as a function of N∧; (e-f) 1%-CVaR performance heatmaps as a function of η with the best hyper-parameter (N∧ = 10).
Figure 8: Random MDPs: 0.1%-CVaR performance heatmaps.
Figure 9: Random MDPs: hyper-parameter mean performance heatmaps for Soft-SPIBB methods under a weak (η = 0.1) and a strong (η = 0.9) baseline.
Figure 10: Random MDPs: hyper-parameter 1%-CVaR performance heatmaps for Soft-SPIBB methods under a weak (η = 0.1) and a strong (η = 0.9) baseline.
Figure 11: Random MDPs: hyper-parameter mean and 1%-CVaR performance heatmaps for RaMDP methods under a weak (η = 0.1), a medium weak (η = 0.4), a medium strong (η = 0.6), and a strong (η = 0.9) baseline.
Figure 12: Helicopter: mean and 10%-CVaR as a function of the hyper-parameter.
Original abstract

Batch Reinforcement Learning (Batch RL) consists in training a policy using trajectories collected with another policy, called the behavioural policy. Safe policy improvement (SPI) provides guarantees with high probability that the trained policy performs better than the behavioural policy, also called baseline in this setting. Previous work shows that the SPI objective improves mean performance as compared to using the basic RL objective, which boils down to solving the MDP with maximum likelihood. Here, we build on that work and improve more precisely the SPI with Baseline Bootstrapping algorithm (SPIBB) by allowing the policy search over a wider set of policies. Instead of binarily classifying the state-action pairs into two sets (the uncertain and the safe-to-train-on ones), we adopt a softer strategy that controls the error in the value estimates by constraining the policy change according to the local model uncertainty. The method can take more risks on uncertain actions all the while remaining provably-safe, and is therefore less conservative than the state-of-the-art methods. We propose two algorithms (one optimal and one approximate) to solve this constrained optimization problem and empirically show a significant improvement over existing SPI algorithms both on finite MDPs and on infinite MDPs with a neural network function approximation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Soft SPIBB, extending the SPIBB algorithm for safe policy improvement in batch RL. It replaces the binary uncertain/safe-to-train-on partition with a continuous constraint that scales allowed policy deviation by local model uncertainty, claiming this permits greater risk-taking on uncertain actions while preserving high-probability performance lower bounds relative to the behavioral policy. Two solvers (exact and approximate) are given for the resulting constrained optimization, with empirical results showing gains over prior SPI methods on finite MDPs and on infinite MDPs using neural-network approximation.

Significance. If the safety bound is shown to transfer under the soft relaxation, the approach would meaningfully reduce conservatism in provably safe batch RL without sacrificing guarantees, enabling better empirical performance. The provision of both optimal and approximate algorithms, together with experiments on both tabular and function-approximation settings, strengthens the practical contribution.

major comments (3)
  1. [§3.2, Eq. (7)–(9)] The central claim that the soft constraint preserves the original SPIBB high-probability safety bound requires an explicit derivation showing that the continuous deviation term remains controlled by the same concentration radius used in the hard-partition case. The current sketch does not bound the additional error introduced when the policy is allowed graded deviation on uncertain actions; without this step the guarantee does not automatically transfer.
  2. [§4, Theorem 2] The approximate solver is shown to be computationally lighter, yet no analysis is given of the approximation error relative to the exact optimum or of how this error propagates into the safety bound. This is load-bearing because the empirical results rely on the approximate version for the neural-network experiments.
  3. [Table 3, infinite-MDP rows] Performance deltas are reported without the number of independent runs, standard deviations, or statistical tests. Given that the safety claim is probabilistic, the absence of these quantities makes it impossible to assess whether the observed gains are consistent with the claimed high-probability improvement.
minor comments (2)
  1. [§2.1] The definition of the local uncertainty measure U(s,a) is introduced without an explicit statement of how it is estimated from the batch; a short paragraph or reference would improve clarity.
  2. [Figure 2, caption] The legend does not distinguish the exact from the approximate Soft SPIBB curves; this should be added for readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and commit to revisions that strengthen the presentation of the safety guarantees and experimental reporting.

  1. Referee: [§3.2, Eq. (7)–(9)] The central claim that the soft constraint preserves the original SPIBB high-probability safety bound requires an explicit derivation showing that the continuous deviation term remains controlled by the same concentration radius used in the hard-partition case. The current sketch does not bound the additional error introduced when the policy is allowed graded deviation on uncertain actions; without this step the guarantee does not automatically transfer.

    Authors: We agree that an explicit derivation is required to rigorously establish transfer of the high-probability bound. The manuscript sketch indicates that the soft deviation is scaled by the local uncertainty (itself controlled by the concentration radius), but does not fully bound the resulting error term. In the revision we will insert a new lemma that explicitly shows the additional error remains dominated by the same radius, thereby preserving the original SPIBB guarantee with only a modified (but still high-probability) constant. revision: yes

  2. Referee: [§4, Theorem 2] The approximate solver is shown to be computationally lighter, yet no analysis is given of the approximation error relative to the exact optimum or of how this error propagates into the safety bound. This is load-bearing because the empirical results rely on the approximate version for the neural-network experiments.

    Authors: We acknowledge the absence of an approximation-error analysis for the solver in Theorem 2 and its effect on the safety bound. The revision will add a short subsection deriving a bound on the sub-optimality gap of the approximate solver and showing that this gap does not invalidate the high-probability performance lower bound (under standard assumptions on the neural-network approximation). revision: yes

  3. Referee: [Table 3] Performance deltas are reported without the number of independent runs, standard deviations, or statistical tests. Given that the safety claim is probabilistic, the absence of these quantities makes it impossible to assess whether the observed gains are consistent with the claimed high-probability improvement.

    Authors: We agree that the infinite-MDP rows of Table 3 require additional statistical detail. The experiments were run with 10 independent seeds; we will augment the table with means, standard deviations, and paired t-test p-values to allow readers to evaluate consistency with the claimed probabilistic improvement. revision: yes
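
The reporting promised in the third response (means, standard deviations, and a paired test across seeds) can be sketched with the Python standard library alone. This is a hedged illustration: the function name `paired_t_statistic` is hypothetical, the per-seed numbers are placeholders and NOT results from the paper, and turning the statistic into a p-value would additionally need a Student-t distribution (for example `scipy.stats.ttest_rel`).

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for two paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (sd_d / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Placeholder per-seed returns, NOT results from the paper.
soft = [1.0, 2.0, 3.0, 4.0]
hard = [0.0, 2.0, 1.0, 3.0]

t_stat, dof = paired_t_statistic(soft, hard)
print(statistics.mean(soft), statistics.stdev(soft), t_stat, dof)
```

Pairing by seed matters here: both methods are evaluated on the same datasets, so the test is run on per-seed differences rather than on two independent samples.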

Circularity Check

0 steps flagged

Minor self-citation to SPIBB baseline; new soft-constraint formulation remains independent

The paper extends prior SPIBB work by replacing binary uncertain/safe partitions with a continuous policy-deviation constraint scaled by local model uncertainty. The abstract presents new algorithms for the resulting constrained optimization and claims the high-probability safety bound is preserved. No quoted equation or derivation reduces the new objective or safety statement to a fitted quantity or self-citation by construction; the central technical contribution (soft local control of value-estimate error) supplies independent content beyond the cited baseline.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the central safety claim rests on an unstated assumption that the uncertainty-based constraint suffices to bound value-estimate error.

pith-pipeline@v0.9.0 · 5760 in / 1098 out tokens · 20519 ms · 2026-05-24T23:00:45.427093+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    cs.LG 2020-05 unverdicted novelty 2.0

    Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. (2016). Unifying count-based exploration and intrinsic motivation. In Proceedings of the 29th Advances in Neural Information Processing Systems (NIPS). Borgwardt, K. H. (1987). The Simplex Method: A Probabilistic Analysis. Springer-Verlag Berlin Heidelberg. Burda, Y., E...

  2. [2]

    A policy $\pi$ is said to be $(\pi_b, e, \epsilon)$-constrained for a baseline policy $\pi_b$, an error function $e$, and a hyper-parameter $\epsilon$ if, for all states $x \in X$, the following inequality holds: $\sum_{a \in A} e(x,a)\,\left|\pi(a|x) - \pi_b(a|x)\right| \le \epsilon$. A.2 Error Bounds. The difference between an estimated parameter and the true one can be bounded using concentration bounds (or equivalently, Hoeffding’s ...
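
The definition above translates directly into a membership check. The following Python sketch is illustrative only: the dictionary layout (state mapped to a list of per-action values), the function name `is_constrained`, and the numeric values are assumptions made for the example, not the authors' code.

```python
def is_constrained(pi, pi_b, error, epsilon):
    """Check sum_a e(x,a) * |pi(a|x) - pi_b(a|x)| <= epsilon for every state x.

    pi, pi_b, error: dicts mapping a state to a list of per-action values.
    """
    for x in pi_b:
        cost = sum(e * abs(p - pb)
                   for e, p, pb in zip(error[x], pi[x], pi_b[x]))
        if cost > epsilon:
            return False
    return True

# Illustrative two-state, two-action example (values are made up).
pi_b = {"s0": [0.5, 0.5], "s1": [0.9, 0.1]}
error = {"s0": [0.1, 0.5], "s1": [0.2, 1.0]}
pi_small_change = {"s0": [0.4, 0.6], "s1": [0.9, 0.1]}
pi_large_change = {"s0": [0.0, 1.0], "s1": [0.9, 0.1]}

print(is_constrained(pi_small_change, pi_b, error, epsilon=0.1))
print(is_constrained(pi_large_change, pi_b, error, epsilon=0.1))
```

The baseline itself always passes the check, since its deviation cost is zero in every state.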

  3. [3]

    Let $M = \langle X, A, P, R, \gamma \rangle$ be an MDP, $\pi_b$ the baseline policy with which the trajectories $D$ have been collected, $\hat{M} = \langle X, A, \hat{P}, \hat{R}, \gamma \rangle$ the MLE MDP, and $N_D(x)$ the count of transitions starting from state $x \in X$ in $D$. Then, the following concentration bound holds with high probability $1-\delta$: $\left\| \left( d^{\pi_b}_{M} - d^{\pi_b}_{\hat{M}} \right)(\cdot|x) \right\|_1 \le \frac{1}{1-\gamma} \sqrt{\frac{2}{N_D(x)} \log \frac{2|X|}{\delta}}$. (79)
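
To get a feel for how the right-hand side of this bound scales, the Python sketch below evaluates it numerically. It assumes the garbled expression reads 1/(1−γ) · sqrt((2/N_D(x)) · log(2|X|/δ)); that reading, the function name `visitation_bound`, and the sample inputs are assumptions made for illustration.

```python
import math

def visitation_bound(n_x, num_states, gamma, delta):
    """Evaluate 1/(1-gamma) * sqrt(2/N_D(x) * log(2|X|/delta)).

    n_x: count of transitions starting from state x in the dataset.
    Returns infinity when the state was never visited.
    """
    if n_x <= 0:
        return math.inf
    radius = math.sqrt(2.0 / n_x * math.log(2 * num_states / delta))
    return radius / (1.0 - gamma)

# Illustrative inputs: the bound tightens as the count grows.
for n in (10, 100, 1000):
    print(n, visitation_bound(n, num_states=50, gamma=0.95, delta=0.05))
```

Because the count enters under a square root, multiplying the number of visits by 100 only shrinks the bound by a factor of 10.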

  4. [4]

    Pseudo-code 3: Random MDPs benchmark
      Input: List of hyper-parameter values for the baseline
      Input: List of dataset sizes
      Input: List of algorithms in the benchmark
      Input: List of hyper-parameter values for each algorithm
      repeat 10^5 times
        Generate an MDP. (see Section B.1)
        for each hyper-parameter value for the baseline do
          Generate a baseline. (see Section ...

  5. [5]

    Dataset generation The MDP is modified to include another goal: terminal state with a reward of 1 when accessing it

    Baseline generation: See (Laroche et al., 2019, Appendix B.1.4). Dataset generation: The MDP is modified to include another goal: a terminal state with a reward of 1 when accessing it. The resulting environment is M∗. We do so to demonstrate the fact that Soft-SPIBB is less conservative than SPIBB, but still safe. The dataset generation depends on a single param...

  6. [6]

    [Figure residue: heatmap panels. Panel (a): mean, Πb-SPIBB with η = 0.9. x-axis: number of trajectories (10 to 2000); y-axis: N∧ (5 to 100); color scale: normalized performance of the target policy π, $\bar{\rho} = \frac{\rho(\pi, M^*) - \rho_b}{\rho^* - \rho_b}$, from −1 to 1. ...]

  7. [7]

    [Figure residue: heatmap panel (f): 1%-CVaR, Π≤b-SPIBB (N∧ = ... x-axis: number of trajectories (10 to 2000); y-axis: normalized performance of the baseline, $\eta = \frac{\rho_b - \rho_{rand}}{\rho^* - \rho_{rand}}$ (0.1 to 0.9); color scale: normalized performance of the target policy π, $\bar{\rho} = \frac{\rho(\pi, M^*) - \rho_b}{\rho^* - \rho_b}$, from −1 to 1.]
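
The two normalizations on the heatmap axes, and the CVaR statistic the figures report, are simple to compute. The Python sketch below is a hedged illustration: it assumes the usual reading of x%-CVaR as the mean over the worst x% of runs, and the function names and sample values are hypothetical.

```python
import math

def normalized_baseline(rho_b, rho_rand, rho_star):
    """eta = (rho_b - rho_rand) / (rho* - rho_rand): 0 is random, 1 is optimal."""
    return (rho_b - rho_rand) / (rho_star - rho_rand)

def normalized_performance(rho_pi, rho_b, rho_star):
    """(rho(pi, M*) - rho_b) / (rho* - rho_b): 0 matches the baseline, 1 the optimum."""
    return (rho_pi - rho_b) / (rho_star - rho_b)

def cvar(values, percent):
    """Mean of the worst `percent`% of values (always at least one value)."""
    k = max(1, math.ceil(len(values) * percent / 100.0))
    worst = sorted(values)[:k]
    return sum(worst) / k

# Made-up normalized performances from ten runs, for illustration only.
runs = [4.0, 1.0, 3.0, 2.0, 5.0, 9.0, 7.0, 8.0, 6.0, 10.0]
print(normalized_performance(0.75, 0.5, 1.0), cvar(runs, 20))
```

A negative normalized performance means the trained policy is worse than the baseline, which is why the worst-case tail (CVaR) rather than the mean is the natural safety readout.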

  8. [8]

    Random MDPs (no additional goal): SPIBB hyper-parameter search results: (a-d) mean and 1%-CVaR performance heatmaps as a function of N∧; (e-f) 1%-CVaR performance heatmaps as a function of η with the best hyper-parameter (N∧ = 10). [Figure residue: heatmap with x-axis number of trajectories (10 to 2000) ...]

  9. [9]

    Random MDPs: mean performance heatmaps. [Figure residue: heatmap panel (a): 0.1%-CVaR, Basic RL. x-axis: number of trajectories (10 to 2000); y-axis: η (0.1 to 0.9); color scale: normalized performance ρ̄, from −1 to 1. ...]

  10. [10]

    Random MDPs: 0.1%-CVaR performance heatmaps. [Figure residue: heatmap panel (a): mean, Exact-Soft-SPIBB 1-step (η = 0.1). x-axis: number of trajectories (10 to 2000); y-axis: ϵ (0.1 to 5); color scale: normalized performance ρ̄, from −1 to 1. ...]

  11. [11]

    Random MDPs: hyper-parameter mean performance heatmaps for Soft-SPIBB methods under a weak (η = 0.1) and a strong (η = 0.9) baseline. [Figure residue: heatmap panel (a): 1%-CVaR: ... x-axis: number of trajectories (10 to 2000); y-axis: ϵ (0.1 to 5); color scale: normalized performance ρ̄, from −1 to 1.]

  12. [12]

    Random MDPs: hyper-parameter 1%-CVaR performance heatmaps for Soft-SPIBB methods under a weak (η = 0.1) and a strong (η = 0.9) baseline. [Figure residue: heatmap with x-axis number of trajectories (10 to 2000); y-axis: κ_adj (0.0005 to 0.01); color scale from −1 to 1. ...]

  13. [13]

    C Helicopter experiment details. C.1 Details about the helicopter environment: See (Laroche et al., 2019, Appendix D.1)

    Random MDPs: hyper-parameter mean and 1%-CVaR performance heatmaps for RaMDP methods under a weak (η = 0.1), a medium weak (η = 0.4), a medium strong (η = 0.6), and a strong (η = 0.9) baseline. C Helicopter experiment details. C.1 Details about the helicopter environment: See (Laroche et al., 2019, Ap...

  14. [14]

    Compute the pseudo-counts

    Pseudo-code 4: Helicopter experimental process
      Input: List of algorithms
      Input: List of hyper-parameter values
      Input: List of dataset sizes
      repeat 20 times
        for each dataset size do
          Generate a dataset.
          Compute the pseudo-counts.
          repeat 15 times
            for each algorithm do
              for each hyper-parameter value do
                Train a policy.
                Evaluate the trained policy.
                Record the p...
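
The experimental process above can be sketched as a plain nested loop in Python. This is a skeleton under assumptions: the loop nesting is inferred from the flattened pseudo-code, and every callable passed in (`make_dataset`, `pseudo_counts`, `train`, `evaluate`) is a hypothetical placeholder for a step the excerpt only names.

```python
def run_helicopter_benchmark(dataset_sizes, algorithms, hyperparams,
                             make_dataset, pseudo_counts, train, evaluate,
                             n_datasets=20, n_repeats=15):
    """Run the nested benchmark loops and return one record per trained policy.

    All callables are hypothetical stand-ins supplied by the caller.
    """
    records = []
    for _ in range(n_datasets):                 # "repeat 20 times"
        for size in dataset_sizes:
            data = make_dataset(size)
            counts = pseudo_counts(data)
            for _ in range(n_repeats):          # "repeat 15 times"
                for algo in algorithms:
                    for hp in hyperparams:
                        policy = train(algo, hp, data, counts)
                        records.append((size, algo, hp, evaluate(policy)))
    return records

# Tiny dry run with stub callables, only to show the call shape.
demo = run_helicopter_benchmark(
    dataset_sizes=[10, 20], algorithms=["A", "B"], hyperparams=[0.5],
    make_dataset=lambda size: list(range(size)),
    pseudo_counts=lambda data: len(data),
    train=lambda algo, hp, data, counts: (algo, hp, counts),
    evaluate=lambda policy: 0.0,
    n_datasets=2, n_repeats=3)
print(len(demo))
```

With the defaults, each (dataset size, algorithm, hyper-parameter) combination is trained 20 × 15 = 300 times, which is the sample the mean and CVaR statistics would be computed over.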

  15. [15]

    The networks are trained for 2k passes on the dataset, and are fully converged by that time

    The learning rate is initialized at 0.01 and is annealed every 20k transitions or every pass on the dataset, whichever is larger. The networks are trained for 2k passes on the dataset, and are fully converged by that time. The models are trained with PyTorch (Paszke et al., 2017). The policy is tested for 1k steps at the end of training, with the initial ...

  16. [16]

    Inspired from Joelle Pineau’s talk at NeurIPS 2018 about reproducible, reusable, and robust Reinforcement Learning, we intend to also make our work reusable and reproducible

    Helicopter: mean and 10%-CVaR as a function of the hyper-parameter. D Reproducible, reusable, and robust Reinforcement Learning. This paper’s objective is to improve the robustness and the reliability of Reinforcement Learning algorithms. Inspired from Joelle Pineau’s talk at NeurIPS 2018 about rep...

  17. [17]

    – A complete proof of the claim

    See Section 3 for discussion.
    – A complete proof of the claim. ⇒ See Section A.
    For all figures and tables that present empirical results, check if you include:
    – A complete description of the data collection process, including sample size. ⇒ See Sections 4, B.1, and C.1.
    – A link to downloadable version of the dataset or simulation environment. ⇒ See Se...