pith. sign in

arxiv: 2606.00270 · v1 · pith:QXJSVAQQnew · submitted 2026-05-29 · 💻 cs.AI · cs.LG· cs.LO

Robust Shielding for Safe Reinforcement Learning

Pith reviewed 2026-06-28 22:19 UTC · model grok-4.3

classification 💻 cs.AI cs.LGcs.LO
keywords shieldingrobust MDPssafe reinforcement learninglinear temporal logicPAC guaranteesMarkov decision processessafety verification
0
0 comments X

The pith

Shielding framework for robust MDPs is sound and optimal, admitting exactly the policies that meet LTL safety thresholds under worst-case transitions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a shielding method for reinforcement learning agents that works without exact knowledge of transition dynamics. It models uncertainty using robust MDPs whose transitions are sets of probabilities and defines safety as an LTL property that must hold above a given probability even in the worst case inside those sets. The central result is a proof that the shield is sound, so every policy it allows is safe, and optimal, so it allows every safe policy. By pairing the shield with sampling techniques that give probably approximately correct bounds on the probability sets, the method produces shields that are safe with high confidence in real unknown environments while staying minimally restrictive. This matters because it enables safe RL in settings where models must be learned from data rather than supplied in advance.

Core claim

We introduce a novel shielding framework for robust MDPs (RMDPs), i.e., MDPs with sets of transition probabilities. We define safety as the satisfaction of a linear temporal logic (LTL) formula with a certain threshold probability under the worst-case transition probabilities of the RMDP. We prove that our shielding framework is both sound and optimal for the RMDP: every policy admissible by the shield is safe, and conversely, every safe RMDP policy is admissible by the shield. We combine our approach with existing sampling methods for learning transition probabilities of MDPs with probably approximately correct (PAC) guarantees. This combination enables the construction of shields for MDPs

What carries the argument

The shielding framework on RMDPs that enforces LTL safety specifications with a probability threshold under worst-case transition sets, shown to be sound and optimal.

If this is right

  • Every policy the shield admits satisfies the LTL safety condition with the required probability under worst-case transitions.
  • Every safe policy in the RMDP is admitted by the shield.
  • When transition sets are obtained via PAC sampling, the shield supplies safety guarantees for the original MDP with high probability.
  • As the number of samples grows, the shield becomes less restrictive and expected returns approach those of the optimal safe policy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same soundness and optimality arguments could be reused for other temporal logics or uncertainty representations beyond probability intervals.
  • An online version that updates the RMDP sets and shield during learning would reduce initial restrictiveness compared with the offline construction shown.
  • The framework suggests that robust shielding could be combined with model-free RL algorithms to produce safe policies directly from interaction data.

Load-bearing premise

The safety-relevant transition dynamics of the underlying MDP can be faithfully represented as sets of probabilities in an RMDP, and sampling methods exist that deliver PAC guarantees on those sets without requiring knowledge of the true transitions.

What would settle it

A policy that the shield admits yet fails the LTL formula with probability below the threshold for some transition probability inside the RMDP set, or a policy that satisfies the LTL condition for every probability in the set yet is rejected by the shield.

Figures

Figures reproduced from arXiv: 2606.00270 by Alessandro Abate, Edwin Hamel-De le Court, Francesco Belardinelli, Francesco Fabiano, Thom Badings.

Figure 1
Figure 1. Figure 1: Our shielding framework. Offline (shown by dashed arrows), we use sampled transitions [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The shield S = (D, Γ) deployed on an RMDP MR with an abstraction map α. the usual product between an MDP M and a DFA A, used in model checking [36] and shielding [6], with auxiliary actions. Indeed, this usual product can be recovered with M ⊗L A, where L is the labeling function of the MDP, and the DFA is identified with its associated monitor as in Example 1. We can now define our notion of a shield. In … view at source ↗
Figure 3
Figure 3. Figure 3: Avg. rew and sat as a function of the number of samples for each state-action pair. The dashed line shows the specification threshold 1 − p. Shaded regions indicate one standard deviation. experiments confirmed that, while conservative for small sample sizes, our robust shields guarantee probabilistic safety, whereas shields constructed for MDPs with maximum likelihood estimates do not. In the future, we p… view at source ↗
Figure 4
Figure 4. Figure 4: Learning curves for the Colour bomb gridworld environment. Left: rew. Right: sat. Rows correspond to different sample sizes. MDP MLE Kno Unk satisfaction prob (1 − p) 101 102 103 104 2 4 6 8 Sample size rew 101 102 103 104 0.2 0.4 0.6 0.8 1 Sample size sat [PITH_FULL_IMAGE:figures/full_fig_p041_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Average performance on Colour bomb gridworld as a function of sample size. Left: rew. Right: sat. Colour bomb gridworld – LTL goal MDP MLE Kno Unk satisfaction prob (1 − p) Samples 0 1 2 3 4 5 ·105 0 10 20 Training steps rew 0 1 2 3 4 5 ·105 0 0.5 1 Training steps sat 10 0 1 2 3 4 5 ·105 0 10 20 Training steps rew 0 1 2 3 4 5 ·105 0.94 0.96 0.98 1 1.02 Training steps sat 100 41 [PITH_FULL_IMAGE:figures/fu… view at source ↗
Figure 6
Figure 6. Figure 6: Learning curves for the Colour bomb gridworld – LTL environment. Left: rew. Right: sat. Rows correspond to different sample sizes. MDP MLE Kno Unk satisfaction prob (1 − p) 101 102 103 104 12 14 16 18 20 Sample size rew 101 102 103 104 0.2 0.4 0.6 0.8 1 Sample size sat 42 [PITH_FULL_IMAGE:figures/full_fig_p042_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Learning curves for the Pacman environment. Left: rew. Right: sat. Rows correspond to different sample sizes. MDP MLE Kno Unk satisfaction prob (1 − p) 101 102 103 104 19 20 Sample size rew 101 102 103 104 0.9 0.95 1 Sample size sat [PITH_FULL_IMAGE:figures/full_fig_p044_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Average performance on Pacman as a function of sample size. Left: rew. Right: sat. Pacman Slippery MDP MLE Kno Unk satisfaction prob (1 − p) Samples 0 1 2 3 4 5 ·105 15 20 Training steps rew 0 1 2 3 4 5 ·105 0.4 0.6 0.8 1 Training steps sat 10 0 1 2 3 4 5 ·105 15 20 Training steps rew 0 1 2 3 4 5 ·105 0.4 0.6 0.8 Training steps sat 100 0 1 2 3 4 5 ·105 15 20 Training steps rew 0 1 2 3 4 5 ·105 0.4 0.6 0.8 … view at source ↗
Figure 10
Figure 10. Figure 10: Learning curves for the Pacman Slippery environment. Left: rew. Right: sat. Rows correspond to different sample sizes. MDP MLE Kno Unk satisfaction prob (1 − p) 101 102 103 104 14 16 18 20 Sample size rew 101 102 103 104 0.6 0.8 1 Sample size sat [PITH_FULL_IMAGE:figures/full_fig_p045_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Average performance on Pacman Slippery as a function of sample size. Left: rew. Right: sat. Media Streaming 45 [PITH_FULL_IMAGE:figures/full_fig_p045_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Learning curves for the Media Streaming environment. Left: rew. Right: sat. Rows correspond to different sample sizes. MDP MLE Kno Unk satisfaction prob (1 − p) 101 102 103 104 −15 −14 −13 Sample size rew 101 102 103 104 0.5 0.52 0.54 0.56 0.58 Sample size sat [PITH_FULL_IMAGE:figures/full_fig_p047_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Average performance on Media Streaming as a function of sample size. Left: rew. Right: sat. Media Streaming – Alternative MDP MLE Kno Unk satisfaction prob (1 − p) Samples 0 1 2 3 4 5 ·105 −50 −40 −30 −20 Training steps rew 0 1 2 3 4 5 ·105 0.4 0.6 Training steps sat 10 0 1 2 3 4 5 ·105 −50 −40 −30 −20 Training steps rew 0 1 2 3 4 5 ·105 0.4 0.6 Training steps sat 100 0 1 2 3 4 5 ·105 −60 −40 −20 Training… view at source ↗
Figure 14
Figure 14. Figure 14: Learning curves for the Media Streaming – Alternative environment. Left: rew. Right: sat. Rows correspond to different sample sizes. MDP MLE Kno Unk satisfaction prob (1 − p) 101 102 103 104 −50 −40 −30 Sample size rew 101 102 103 104 0.4 0.6 0.8 Sample size sat [PITH_FULL_IMAGE:figures/full_fig_p048_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Average performance on Media Streaming – Alternative as a function of sample size. Left: rew. Right: sat. 48 [PITH_FULL_IMAGE:figures/full_fig_p048_15.png] view at source ↗
read the original abstract

Shielding is an effective approach to formally guarantee the safety of reinforcement learning agents in Markov decision processes (MDPs). However, existing shielding techniques typically assume knowledge of the safety-relevant transition dynamics - a requirement that is seldom met in practice. To address this limitation, we introduce a novel shielding framework for robust MDPs (RMDPs), i.e., MDPs with sets of transition probabilities. We define safety as the satisfaction of a linear temporal logic (LTL) formula with a certain threshold probability under the worst-case transition probabilities of the RMDP. We prove that our shielding framework is both sound and optimal for the RMDP: every policy admissible by the shield is safe, and conversely, every safe RMDP policy is admissible by the shield. We combine our approach with existing sampling methods for learning transition probabilities of MDPs with probably approximately correct (PAC) guarantees. This combination enables the construction of shields for MDPs that, with high confidence, guarantee safety while remaining minimally restrictive. Our experiments show that our shields for learned RMDPs guarantee safety in unknown MDPs while recovering strong expected return as the number of samples increases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a shielding framework for robust MDPs (RMDPs) with interval transition probabilities. Safety is defined via satisfaction of an LTL formula above a given probability threshold under worst-case transitions inside the sets. The authors prove that a shield constructed from a given RMDP is both sound (every admissible policy is safe) and optimal (every safe policy is admissible). They combine the framework with PAC sampling methods to learn the sets from an unknown MDP and report experiments showing that the resulting shields enforce safety while recovering expected return as sample count grows.

Significance. If the claims hold, the work is significant for extending formal shielding to the common practical case of unknown dynamics, using PAC guarantees to obtain high-confidence safety shields that remain minimally restrictive. The explicit soundness and optimality proofs for the exact-RMDP case, together with the sampling integration, constitute a clear technical contribution over prior shielding methods that assume known transitions.

major comments (2)
  1. [§4] §4 (combination with PAC sampling): the soundness and optimality theorems are stated and proved only for an exact given RMDP. No subsequent theorem or lemma supplies an explicit high-confidence bound showing that the worst-case LTL probability under the learned sets remains above the required threshold; path-dependent quantification in LTL can compound per-transition PAC error in ways not controlled by standard single-step PAC results.
  2. [Definition 3] Definition 3 (RMDP safety): the threshold probability is taken with respect to the worst-case transitions inside the sets, yet the text does not state how this threshold is adjusted (or not) to absorb the PAC approximation error when the sets themselves are learned; this choice is load-bearing for the “with high confidence” claim in the abstract.
minor comments (2)
  1. [§3] Notation for the robust value function and the shield construction could be introduced earlier and used consistently in the proofs.
  2. [Experiments] Experimental plots would benefit from reporting the empirical violation rate of the LTL specification across repeated trials rather than only average return.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and for identifying the need for an explicit connection between the PAC guarantees and the LTL safety threshold under learned interval sets. We address the major comments point by point below and will revise the manuscript to strengthen these aspects.

read point-by-point responses
  1. Referee: [§4] §4 (combination with PAC sampling): the soundness and optimality theorems are stated and proved only for an exact given RMDP. No subsequent theorem or lemma supplies an explicit high-confidence bound showing that the worst-case LTL probability under the learned sets remains above the required threshold; path-dependent quantification in LTL can compound per-transition PAC error in ways not controlled by standard single-step PAC results.

    Authors: We agree that the soundness and optimality results apply to a fixed RMDP and that the manuscript lacks an explicit theorem deriving a high-confidence bound on the LTL probability for the PAC-learned intervals. The combination section relies on the fact that the true MDP lies inside the learned RMDP with high probability, but does not address error compounding for path-dependent LTL properties. We will add a new lemma in §4 that adjusts the safety threshold using the PAC confidence parameter and a suitable concentration inequality (e.g., via union bound over relevant finite paths) to control the compounded error. revision: yes

  2. Referee: [Definition 3] Definition 3 (RMDP safety): the threshold probability is taken with respect to the worst-case transitions inside the sets, yet the text does not state how this threshold is adjusted (or not) to absorb the PAC approximation error when the sets themselves are learned; this choice is load-bearing for the “with high confidence” claim in the abstract.

    Authors: The referee is correct that Definition 3 defines safety for a given RMDP without specifying any adjustment to the probability threshold when the intervals are obtained from PAC sampling. This omission leaves the high-confidence claim in the abstract without full formal support. We will revise the text immediately after Definition 3 and in §4 to state explicitly how the threshold is raised (or the PAC parameter chosen) to absorb the approximation error, thereby making the overall guarantee rigorous. revision: yes

Circularity Check

0 steps flagged

No circularity; proofs derive directly from RMDP safety definitions

full rationale

The paper defines safety for RMDPs as LTL satisfaction above a threshold under worst-case transition probabilities within the sets, then proves soundness and optimality (every admissible policy is safe and vice versa) from those definitions. The PAC sampling combination is presented as a practical extension without reducing the central theorems to fitted inputs, self-citations, or ansatzes. The derivation chain is self-contained against the stated assumptions and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard MDP and LTL semantics plus the modeling choice that transition uncertainty is captured by probability sets; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Transition dynamics of an MDP can be represented as sets of probability distributions forming an RMDP
    Invoked when defining safety under worst-case transitions
  • domain assumption LTL formulas can be evaluated with threshold probabilities over RMDP paths
    Used to define the safety specification

pith-pipeline@v0.9.1-grok · 5742 in / 1246 out tokens · 19684 ms · 2026-06-28T22:19:12.688883+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

67 extracted references · 5 canonical work pages · 4 internal anchors

  1. [1]

    Sutton and Andrew G

    Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, 2018

  2. [2]

    Bagnell, and Jan Peters

    Jens Kober, J. Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey.The International Journal of Robotics Research, 32:1238–1274, 09 2013

  3. [3]

    Playing Atari with Deep Reinforcement Learning

    V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing Atari with deep reinforcement learning.CoRR, abs/1312.5602, 2013

  4. [4]

    Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A

    B. Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A. Al Sallab, Senthil Ku- mar Yogamani, and Patrick Pérez. Deep reinforcement learning for autonomous driving: A survey.IEEE Trans. Intell. Transp. Syst., 23(6):4909–4926, 2022

  5. [5]

    A comprehensive survey on safe reinforcement learning

    Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res., 16:1437–1480, 2015

  6. [6]

    Safe Reinforcement Learning via Shielding

    Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. Safe reinforcement learning via shielding. volume abs/1708.08611, 2017. 10

  7. [7]

    Edwin Hamel-De le Court, Francesco Belardinelli, and Alexander W. Goodall. Probabilistic shielding for safe reinforcement learning. InAAAI, pages 16091–16099. AAAI Press, 2025

  8. [8]

    Safe reinforcement learning using probabilistic shields (invited paper)

    Nils Jansen, Bettina Könighofer, Sebastian Junges, Alex Serban, and Roderick Bloem. Safe reinforcement learning using probabilistic shields (invited paper). InCONCUR, LIPIcs, pages 3:1–3:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020

  9. [9]

    Safe reinforcement learning via shielding under partial observability

    Steven Carr, Nils Jansen, Sebastian Junges, and Ufuk Topcu. Safe reinforcement learning via shielding under partial observability. InAAAI, pages 14748–14756. AAAI Press, 2023

  10. [10]

    Tsitsiklis

    Shie Mannor, Duncan Simester, Peng Sun, and John N. Tsitsiklis. Bias and variance approxima- tion in value function estimates.Manag. Sci., 53(2):308–322, 2007

  11. [11]

    Robust control of Markov decision processes with uncertain transition matrices.Oper

    Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices.Oper. Res., 53(5):780–798, 2005

  12. [12]

    Bovy, David Parker, and Nils Jansen

    Marnix Suilen, Thom Badings, Eline M. Bovy, David Parker, and Nils Jansen. Robust markov decision processes: A place where AI and formal methods meet. InPrinciples of Verification (3), Lecture Notes in Computer Science, pages 126–154. Springer, 2024

  13. [13]

    Robust Markov decision processes

    Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes. Math. Oper. Res., 38(1):153–183, 2013

  14. [14]

    Orna Kupferman and Moshe Y . Vardi. Model checking of safety properties.Formal Methods Syst. Des., 19(3):291–314, 2001

  15. [15]

    Hall, Zhaocong Yuan, Siqi Zhou, Jacopo Panerati, and Angela P

    Lukas Brunke, Melissa Greeff, Adam W. Hall, Zhaocong Yuan, Siqi Zhou, Jacopo Panerati, and Angela P. Schoellig. Safe learning in robotics: From learning-based control to safe reinforcement learning.Annu. Rev. Control. Robotics Auton. Syst., 5:411–444, 2022

  16. [16]

    A review of safe reinforcement learning: Methods, theories, and applications.IEEE Trans

    Shangding Gu, Long Yang, Yali Du, Guang Chen, Florian Walter, Jun Wang, and Alois Knoll. A review of safe reinforcement learning: Methods, theories, and applications.IEEE Trans. Pattern Anal. Mach. Intell., 46(12):11216–11235, 2024

  17. [17]

    Shielded rein- forcement learning: A review of reactive methods for safe learning

    Haritz Odriozola-Olalde, Maider Zamalloa, and Nestor Arana-Arexolaleiba. Shielded rein- forcement learning: A review of reactive methods for safe learning. InSII, pages 1–8. IEEE, 2023

  18. [18]

    Safe reinforcement learning via probabilistic logic shields

    Wen-Chi Yang, Giuseppe Marra, Gavin Rens, and Luc De Raedt. Safe reinforcement learning via probabilistic logic shields. InIJCAI, pages 5739–5749. ijcai.org, 2023

  19. [19]

    Safe multi-agent reinforcement learning via shielding

    Ingy Elsayed-Aly, Suda Bharadwaj, Christopher Amato, Rüdiger Ehlers, Ufuk Topcu, and Lu Feng. Safe multi-agent reinforcement learning via shielding. InAAMAS, pages 483–491. ACM, 2021

  20. [20]

    Online shielding for reinforcement learning.Innov

    Bettina Könighofer, Julian Rudolf, Alexander Palmisano, Martin Tappler, and Roderick Bloem. Online shielding for reinforcement learning.Innov. Syst. Softw. Eng., 19(4):379–394, 2023

  21. [21]

    Shields for safe reinforcement learning.Commun

    Bettina Könighofer, Roderick Bloem, Nils Jansen, Sebastian Junges, and Stefan Pranger. Shields for safe reinforcement learning.Commun. ACM, 68(11):80–90, 2025

  22. [22]

    Goodall and Francesco Belardinelli

    Alexander W. Goodall and Francesco Belardinelli. Leveraging approximate model-based shielding for probabilistic safety guarantees in continuous environments. InAAMAS, pages 2291–2293. International Foundation for Autonomous Agents and Multiagent Systems / ACM, 2024

  23. [23]

    Safe reinforcement learning in black-box environments via adaptive shielding

    Daniel Bethell, Simos Gerasimou, Radu Calinescu, and Calum Imrie. Safe reinforcement learning in black-box environments via adaptive shielding. InECAI, Frontiers in Artificial Intelligence and Applications, pages 2450–2457. IOS Press, 2025

  24. [24]

    Learning-based shielding for safe autonomy under unknown dynamics

    Robert Reed and Morteza Lahijanian. Learning-based shielding for safe autonomy under unknown dynamics. InACC, pages 4940–4946. IEEE, 2025

  25. [25]

    Optimization-based robust permissive synthesis for interval MDPs, 2026

    Khang V o Huynh, David Parker, and Lu Feng. Optimization-based robust permissive synthesis for interval MDPs, 2026. 11

  26. [26]

    Risk-constrained reinforcement learning with percentile risk criteria.J

    Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria.J. Mach. Learn. Res., 18:167:1–167:51, 2017

  27. [27]

    Responsive safety in reinforcement learning by PID lagrangian methods

    Adam Stooke, Joshua Achiam, and Pieter Abbeel. Responsive safety in reinforcement learning by PID lagrangian methods. InICML, Proceedings of Machine Learning Research, pages 9133–9143. PMLR, 2020

  28. [28]

    Efficient policy optimization in robust constrained MDPs with iteration complexity guarantees

    Sourav Ganguly, Arnob Ghosh, Kishan Panaganti, and Adam Wierman. Efficient policy optimization in robust constrained MDPs with iteration complexity guarantees. InNeurIPS, 2025

  29. [29]

    Robust constrained-MDPs: Soft-constrained robust policy optimization under model uncertainty.CoRR, abs/2010.04870, 2020

    Reazul Hasan Russel, Mouhacine Benosman, and Jeroen van Baar. Robust constrained-MDPs: Soft-constrained robust policy optimization under model uncertainty.CoRR, abs/2010.04870, 2020

  30. [30]

    Duéñez-Guzmán, and Mohammad Ghavamzadeh

    Yinlam Chow, Ofir Nachum, Edgar A. Duéñez-Guzmán, and Mohammad Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning. InNeurIPS, pages 8103–8112, 2018

  31. [31]

    Lyapunov-based Safe Policy Optimization for Continuous Control

    Yinlam Chow, Ofir Nachum, Aleksandra Faust, Mohammad Ghavamzadeh, and Edgar A. Duéñez-Guzmán. Lyapunov-based safe policy optimization for continuous control.CoRR, abs/1901.10031, 2019

  32. [32]

    Schoellig, and Andreas Krause

    Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. Safe model- based reinforcement learning with stability guarantees. InNIPS, pages 908–918, 2017

  33. [33]

    Richard S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting.SIGART Bull., 2(4):160–163, 1991

  34. [34]

    When to trust your model: Model-based policy optimization

    Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. InNeurIPS, pages 12498–12509, 2019

  35. [35]

    Trust the model where it trusts itself - model-based actor-critic with uncertainty-aware rollout adaption

    Bernd Frauenknecht, Artur Eisele, Devdutt Subhasish, Friedrich Solowjow, and Sebastian Trimpe. Trust the model where it trusts itself - model-based actor-critic with uncertainty-aware rollout adaption. InICML, Proceedings of Machine Learning Research, pages 13973–14005. PMLR / OpenReview.net, 2024

  36. [36]

    MIT press, 2008

    Christel Baier and Joost-Pieter Katoen.Principles of model checking. MIT press, 2008

  37. [37]

    Bertsekas and Steven E

    Dimitri P. Bertsekas and Steven E. Shreve.Stochastic Optimal Control: The Discrete-Time Case. Athena Scientific, 2007

  38. [38]

    The temporal logic of programs

    Amir Pnueli. The temporal logic of programs. InFOCS, pages 46–57. IEEE Computer Society, 1977

  39. [39]

    Runtime verification - 17 years later

    Klaus Havelund and Grigore Rosu. Runtime verification - 17 years later. InRV, Lecture Notes in Computer Science, pages 3–17. Springer, 2018

  40. [40]

    Deshmukh, Alexandre Donzé, Georgios Fainekos, Oded Maler, Dejan Nickovic, and Sriram Sankaranarayanan

    Ezio Bartocci, Jyotirmoy V . Deshmukh, Alexandre Donzé, Georgios Fainekos, Oded Maler, Dejan Nickovic, and Sriram Sankaranarayanan. Specification-based monitoring of cyber- physical systems: A survey on theory, tools and applications. InLectures on Runtime Verification, Lecture Notes in Computer Science, pages 135–175. Springer, 2018

  41. [41]

    What are the odds? improving statistical model checking of Markov decision processes

    Tobias Meggendorfer, Maximilian Weininger, and Patrick Wienhöft. What are the odds? improving statistical model checking of Markov decision processes. InQEST+FORMATS, Lecture Notes in Computer Science, pages 195–218. Springer, 2025

  42. [42]

    Data-driven abstraction and synthesis for stochastic systems with unknown dynamics

    Mahdi Nazeri, Thom Badings, Anne-Kathrin Schmuck, Sadegh Soudjani, and Alessandro Abate. Data-driven abstraction and synthesis for stochastic systems with unknown dynamics. InCDC, pages 6754–6759. IEEE, 2025

  43. [43]

    Poonawala, Mariëlle Stoelinga, and Nils Jansen

    Thom Badings, Licio Romao, Alessandro Abate, David Parker, Hasan A. Poonawala, Mariëlle Stoelinga, and Nils Jansen. Robust control for dynamical systems with non-Gaussian noise via formal abstractions.J. Artif. Intell. Res., 76:341–391, 2023. 12

  44. [44]

    Simão, David Parker, and Nils Jansen

    Marnix Suilen, Thiago D. Simão, David Parker, and Nils Jansen. Robust anytime learning of Markov decision processes. InNeurIPS, 2022

  45. [45]

    Certifiably robust policies for uncertain parametric environments

    Yannik Schnitzer, Alessandro Abate, and David Parker. Certifiably robust policies for uncertain parametric environments. InTACAS (3), Lecture Notes in Computer Science, pages 63–83. Springer, 2025

  46. [46]

    The use of confidence or fiducial limits illustrated in the case of the binomial.Biometrika, 26(4):404–413, 1934

    Charles J Clopper and Egon S Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial.Biometrika, 26(4):404–413, 1934

  47. [47]

    Henzinger, Jan Kretínský, and Tatjana Petrov

    Przemyslaw Daca, Thomas A. Henzinger, Jan Kretínský, and Tatjana Petrov. Faster statistical model checking for unbounded temporal properties.ACM Trans. Comput. Log., 18(2):12:1– 12:25, 2017

  48. [48]

    Garud N. Iyengar. Robust dynamic programming.Math. Oper. Res., 30(2):257–280, 2005

  49. [49]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.CoRR, abs/1707.06347, 2017

  50. [50]

    Goodall, Omar Adalat, Edwin Hamel De-le Court, and Francesco Belardinelli

    Alexander W. Goodall, Omar Adalat, Edwin Hamel De-le Court, and Francesco Belardinelli. MASA-Safe-RL: Multi and single agent safe reinforcement learning. https://github.com/ sacktock/MASA-Safe-RL/, 2025. GitHub repository

  51. [51]

    Simão, Marnix Suilen, and Nils Jansen

    Thom Badings, Thiago D. Simão, Marnix Suilen, and Nils Jansen. Decision-making under uncertainty: beyond probabilities.Int. J. Softw. Tools Technol. Transf., 25(3):375–391, 2023

  52. [52]

    Bertsimas and D

    D. Bertsimas and D. den Hertog.Robust and Adaptive Optimization. Dynamic Ideas LLC, 2022

  53. [53]

    DOPE: doubly optimistic and pessimistic exploration for safe reinforcement learning

    Archana Bura, Aria HasanzadeZonuzy, Dileep Kalathil, Srinivas Shakkottai, and Jean-François Chamberland. DOPE: doubly optimistic and pessimistic exploration for safe reinforcement learning. InNeurIPS, 2022. 13 A Extended Preliminaries Remark A.1.Throughout the appendices, we provide formalizations of all definitions and results from the paper, including t...

  54. [54]

    ifα(s) =α(s ′), thenL(s) =L(s ′)andA(s) =A(s ′)

  55. [55]

    for every actiona∈A(s), we have µ◦α −1 |µ∈ P(s, a) = µ◦α −1 |µ∈ P(s ′, a)

  56. [56]

    the induced correspondence Pα defined by Pα(ξ, a) = µ◦α −1 :µ∈ P(s, a) for any s∈α −1(ξ)has measurable graph. Intuitively, condition 1 requires that all states s, s′ ∈S mapping to the same abstract state have the same label and enabled actions, and condition 2 that these states induce the same sets of transition probabilities under the abstraction map. Si...

  57. [57]

    ∞X t=0 γtR(st, at) # −E s0a0···∼ν

    for all states (s, q, y) of MR ⊗ D, the set Γ(s, q, y) contains exactly those distributions ρ∈∆(A× V β)such that ρ(A(s)× V β) = 1, and such that E(a,v)∼ρ h sup µ∈P(s,a) Es′∼µ v s′, δ(q, L(s′)) i ≤y. Lemma B.2.The pairS(M R,A, β) = (D,Γ)is a shield over(S, A). Proof.SinceS,Q, andAare finite, we equip them with the discrete topology. The space Vβ = Y (s,q)∈...

  58. [58]

    ThenSis C- 2Zϵ 1−γ , γ -optimal over(M R, α)forΦ

    Suppose that sup(s,a)∈S×A |R(s, a)| ≤Z , that γ <1 , and that S is C-ϵ-W1,γ-complete over(M R, α)forΦ. ThenSis C- 2Zϵ 1−γ , γ -optimal over(M R, α)forΦ

  59. [59]

    ThenSis C-(2Bϵ,1)-optimal over(M R, α)forΦ

    Suppose that the absolute value of the total undiscounted return is uniformly bounded by B, and thatSisC-ϵ-TV-complete over(M R, α)forΦ. ThenSis C-(2Bϵ,1)-optimal over(M R, α)forΦ. Proof. We prove the first item; the second is analogous. Let π∈ C be such that MR, π|= Φ . By C-ϵ-W1,γ-completeness, there exists a compliant policy πsuch that, W1,γ probMR[θ],...

  60. [60]

    If the absolute value of the total undiscounted return is uniformly bounded by B, then S(MR/α,A, β)is HR-(2Bϵ,1)-optimal over(M R, α)forΦ

  61. [61]

    In particular, the shield S(MR/α,A, β ∞) is HR-(0, γ)-optimal for every γ∈(0,1] for which the corresponding return is well-defined

    Ifγ∈(0,1)andsup (s,a)∈S×A |R(s, a)| ≤Z, thenS(M R/α,A, β)is HR- 2Zϵ 1−γ , γ -optimal over(M R, α)forΦ. In particular, the shield S(MR/α,A, β ∞) is HR-(0, γ)-optimal for every γ∈(0,1] for which the corresponding return is well-defined. C Probabilistic Shielding for Unknown MDPs In this appendix, we present all extended definitions, results, and proofs for ...

  62. [62]

    for every HR policy π, the total rewardP∞ t=0 R(st, at) is almost surely convergent and has finite expectation

  63. [63]

    ∞X t=0 c((st, qt), at) # is the probability to reachGfollowingπ, so the constraint M, π|=P ≥1−p(φ) is equivalent to Eπ

    an optimal feasible HR policy exists. Then there exists an optimal feasible policy π⋆ ∈ C A(M). Moreover,π⋆ can be chosen as an initial mixture of at most two deterministicA-memory policies. Proof. Consider the product MDP M ⊗L A with state space S×Q , and let G=S×F be the set of bad product states. We make G absorbing and set the reward to zero after G i...

  64. [64]

    the choice of two actions, of a trade-off λ∈[0; 1] between those two action, inducing a distributionµover actions, and

  65. [65]

    the choice of an extremal point ofV λ(ξ, q, y). In practice, we optimize with PPO [49] over M S , we leave the choice of the trade-off λ to the actor, and implement the following heuristic for the choice of the extremal point ofV λ(ξ, q, y). Auxiliary action heuristic.We describe the heuristic used to choose the auxiliary actions after the actor has selec...

  66. [66]

    learning curves over training steps, shown for each sample size (the specific number that changes across the various environments represents the minimum number of samples needed forUnk)

  67. [67]

    averaged performance as a function of sample size. Colour bomb gridworld MDP MLE Kno Unk satisfaction prob(1−p) Samples 0 1 2 3 4 5 ·105 0 5 10 Training steps rew 0 1 2 3 4 5 ·105 0 0.5 1 Training steps sat 10 0 1 2 3 4 5 ·105 0 5 10 Training steps rew 0 1 2 3 4 5 ·105 0.5 1 Training steps sat 100 0 1 2 3 4 5 ·105 0 2 4 6 8 10 Training steps rew 0 1 2 3 4...