Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability
Pith reviewed 2026-05-15 02:48 UTC · model grok-4.3
The pith
A learned action-conditioned predictor of near-term safety violations gates value estimates to approximate risk-sensitive control under partial observability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an action-conditioned predictor of near-term safety violation, built on a compact finite-history proxy state, enables effective approximate risk-sensitive POMDP control. The predicted risk is used both as a penalty added during value learning and as a decision-time gate that blends optimistic and conservative ensemble value estimates, so low-risk actions are evaluated closer to reward-seeking estimates while high-risk actions are evaluated more conservatively.
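The review does not reproduce the gate in closed form; a minimal sketch of one consistent reading — linear interpolation between the optimistic and conservative ensemble extremes, with the names `gated_value` and `rho` and all numbers ours — would be:

```python
import numpy as np

def gated_value(q_optimistic, q_conservative, risk):
    """Blend ensemble value estimates by predicted risk.

    risk ~ rho_hat(s_proxy, a) in [0, 1]: predicted probability of a
    near-term safety violation if action a is taken. Low risk keeps
    the value close to the optimistic (reward-seeking) estimate;
    high risk pulls it toward the conservative estimate.
    """
    risk = np.clip(risk, 0.0, 1.0)
    return (1.0 - risk) * q_optimistic + risk * q_conservative

# Toy example: three candidate actions with rising predicted risk.
q_opt = np.array([10.0, 12.0, 15.0])   # e.g. max over ensemble heads
q_con = np.array([8.0, 6.0, -5.0])     # e.g. min over ensemble heads
rho = np.array([0.05, 0.3, 0.9])       # hypothetical risk predictions

q_gate = gated_value(q_opt, q_con, rho)
best = int(np.argmax(q_gate))  # selects action 1, not the risky action 2
```

Note how the nominally best action (index 2, optimistic value 15.0) is rejected once its high predicted risk drags its gated value toward the conservative estimate.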
What carries the argument
Action-conditioned near-term safety-violation predictor applied as both a value-learning penalty and a decision-time interpolator between optimistic and conservative ensemble estimates.
If this is right
- Improves overall glycemic tradeoffs across adult and adolescent glucose-control cohorts.
- Substantially reduces runtime relative to a belief-space planning baseline.
- Achieves a more favorable reward-cost balance than unconstrained RL and several standard safe-RL baselines on Safety-Gym navigation.
- Low-risk actions receive value estimates closer to reward-seeking values; high-risk actions receive more conservative estimates.
Where Pith is reading between the lines
- The short-history risk signal may suffice in many real-time domains where maintaining full beliefs is computationally prohibitive.
- The same gating mechanism could be applied to other ensemble-based safe-RL methods to add partial-observability handling without redesigning the underlying planner.
- In settings where near-term risk correlates strongly with long-term safety, the approach offers a practical substitute for explicit belief propagation.
Load-bearing premise
A compact finite-history proxy state plus a learned action-conditioned predictor of near-term safety violation is sufficient to produce effective risk-sensitive decisions under partial observability.
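The review does not specify how the proxy state is built; a minimal sketch under the assumption that it is a fixed-length window of recent (observation, action) pairs — the class name, window length `k`, and flattening scheme are ours — would be:

```python
from collections import deque
import numpy as np

class HistoryProxyState:
    """Compact finite-history proxy state: the last k
    (observation, action) pairs, flattened into one vector.
    Zero-padded at episode start so the vector has fixed size."""

    def __init__(self, k, obs_dim, act_dim):
        self.k = k
        pad = np.zeros(obs_dim + act_dim)
        self.buffer = deque([pad.copy() for _ in range(k)], maxlen=k)

    def update(self, obs, action):
        # Oldest entry falls off the left end automatically (maxlen=k).
        self.buffer.append(np.concatenate([obs, action]))

    def vector(self):
        return np.concatenate(list(self.buffer))

proxy = HistoryProxyState(k=4, obs_dim=3, act_dim=1)
proxy.update(np.array([0.1, 0.2, 0.3]), np.array([1.0]))
assert proxy.vector().shape == (16,)  # k * (obs_dim + act_dim)
```

This fixed-size vector can be fed to both the risk predictor and the value ensemble in place of a belief state.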
What would settle it
An experiment in which safety violations depend on long unobserved history and the short-history risk predictor fails to prevent them while a full belief-space planner succeeds.
Original abstract
Many safety-critical control problems are modeled as risk-sensitive partially observable Markov decision processes, where the controller must make decisions from incomplete observations while balancing task performance against safety risk. Although belief-space planning provides a principled solution, maintaining and planning over beliefs can be computationally costly and sensitive to model specification in practical domains. We propose a lightweight risk-gated reinforcement learning approximation for risk-sensitive control under partial observability. The method constructs a compact finite-history proxy state and learns an action-conditioned predictor of near-term safety violation. This predicted candidate-action risk is used in two complementary ways: as a risk penalty during value learning, and as a decision-time gate that interpolates between optimistic and conservative ensemble value estimates. As a result, low-risk actions are evaluated closer to reward-seeking estimates, while high-risk actions are evaluated more conservatively. We evaluate the approach in two safety-critical partially observable domains: automated glucose regulation and safety-constrained navigation. Across adult and adolescent glucose-control cohorts, the method improves overall glycemic tradeoffs and substantially reduces runtime relative to a belief-space planning baseline. On Safety-Gym navigation benchmarks, it achieves a more favorable reward-cost balance than unconstrained RL and several standard safe-RL baselines. These results suggest that action-conditioned near-term risk can provide an effective local signal for approximate risk-sensitive POMDP control when full belief-space planning is impractical.
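The abstract's first use of the risk signal — "a risk penalty during value learning" — is not given as an equation; one plausible reading is a one-step TD target with the predicted risk subtracted, where the penalty weight `lam` is our hypothetical knob:

```python
def risk_penalized_target(reward, risk, next_q, gamma=0.99, lam=1.0):
    """One-step TD target with a predicted-risk penalty.

    reward: observed reward r_t
    risk:   rho_hat(s_proxy, a_t), predicted near-term violation prob.
    next_q: bootstrap value, e.g. max_a' Q(s'_proxy, a')
    lam:    penalty weight (our assumption; not specified in the abstract)
    """
    return reward - lam * risk + gamma * next_q

# A high-risk transition is worth less to the learner than a safe one
# with identical reward and bootstrap value.
safe_target = risk_penalized_target(reward=1.0, risk=0.02, next_q=5.0)
risky_target = risk_penalized_target(reward=1.0, risk=0.9, next_q=5.0)
assert safe_target > risky_target
```

The second use — the decision-time gate — then interpolates the ensemble's optimistic and conservative estimates by the same predicted risk.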
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a lightweight risk-gated RL method for approximate risk-sensitive control in POMDPs. It constructs a finite-history proxy state and trains an action-conditioned predictor of near-term safety violations; this predictor supplies both a risk penalty in value learning and a decision-time gate that interpolates between optimistic and conservative ensemble value estimates. Empirical evaluation on automated glucose regulation (adult and adolescent cohorts) and Safety-Gym navigation benchmarks reports improved glycemic tradeoffs, substantially lower runtime than belief-space planning, and a more favorable reward-cost balance than unconstrained RL and standard safe-RL baselines.
Significance. If the reported empirical gains hold under rigorous statistical scrutiny, the work supplies a practical, low-overhead alternative to full belief-space planning for safety-critical control under partial observability. The combination of a compact history proxy with a learned local risk signal is a concrete contribution that could be useful in medical and robotic domains where maintaining accurate beliefs is expensive or model mismatch is common.
major comments (2)
- [Experiments] Experimental section: the abstract and results summary state positive outcomes on glycemic tradeoffs and reward-cost balance, yet supply no error bars, number of random seeds, ablation details on the risk predictor, or statistical significance tests. Because the central claim is an empirical improvement over belief-space planning and safe-RL baselines, the absence of these elements makes the robustness of the reported gains impossible to assess from the given text.
- [Method] Method description: the risk predictor is trained from data and then used to modulate value estimates, but no equation or derivation shows that the claimed improvement is independent of the particular fitted parameters of the predictor. This leaves open whether the performance edge is due to the gating mechanism itself or to incidental regularization introduced by the learned risk term.
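The significance testing the first major comment asks for could take the form of a paired test over per-seed scores; a minimal stdlib sketch with illustrative numbers (the seed scores and the 10-seed setup are ours, not from the paper) might look like:

```python
import math
import statistics

def paired_t_statistic(method_scores, baseline_scores):
    """Paired t-statistic over matched per-seed scores."""
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)
    return mean_d / (sd_d / math.sqrt(n))

# Ten hypothetical per-seed returns for the method vs. a baseline.
method = [103.1, 98.4, 101.7, 99.9, 104.2, 100.8, 102.5, 97.6, 103.8, 101.0]
baseline = [99.0, 97.1, 98.3, 98.8, 100.2, 99.1, 99.7, 96.4, 100.9, 98.5]
t = paired_t_statistic(method, baseline)
# |t| > 2.262 (two-sided critical value at df = 9) indicates p < 0.05.
significant = abs(t) > 2.262
```

A nonparametric alternative such as the Wilcoxon signed-rank test would relax the normality assumption the t-test makes on the per-seed differences.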
minor comments (2)
- [Preliminaries] Notation for the finite-history proxy state and the action-conditioned risk predictor should be introduced with explicit definitions and dimensions in the first section where they appear.
- [Results] Figure captions for the glucose and Safety-Gym results should include the exact number of evaluation episodes and the precise definition of the cost metric used in the reward-cost plots.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation for minor revision. We appreciate the recognition of the method's potential utility in safety-critical domains. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.
Point-by-point responses
- Referee: [Experiments] Experimental section: the abstract and results summary state positive outcomes on glycemic tradeoffs and reward-cost balance, yet supply no error bars, number of random seeds, ablation details on the risk predictor, or statistical significance tests. Because the central claim is an empirical improvement over belief-space planning and safe-RL baselines, the absence of these elements makes the robustness of the reported gains impossible to assess from the given text.
Authors: We agree that the experimental reporting requires strengthening to allow assessment of robustness. Although the underlying experiments used multiple random seeds, the manuscript did not explicitly report error bars, seed counts, ablations on the risk predictor, or significance tests. In the revision we will add: mean and standard deviation over 10 independent seeds with error bars on all plots, an ablation study isolating the risk predictor, and statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank) against the belief-space and safe-RL baselines. Revision: yes.
- Referee: [Method] Method description: the risk predictor is trained from data and then used to modulate value estimates, but no equation or derivation shows that the claimed improvement is independent of the particular fitted parameters of the predictor. This leaves open whether the performance edge is due to the gating mechanism itself or to incidental regularization introduced by the learned risk term.
Authors: We do not claim that the performance improvement is independent of the predictor parameters; the learned action-conditioned risk predictor is an integral part of the approach that supplies the local risk signal. The claimed benefit arises from the combination of the risk penalty during value learning and the decision-time gating that interpolates optimistic and conservative estimates. In the revision we will insert an explicit derivation of the gated value estimate and add an ablation that replaces the learned predictor with a fixed risk threshold, thereby isolating the contribution of the learned risk signal versus incidental regularization. Revision: yes.
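The ablation the authors promise — swapping the continuous learned risk signal for a hard threshold — can be sketched as follows; the threshold `tau`, function names, and numbers are our assumptions, not the paper's:

```python
import numpy as np

def learned_gate(q_opt, q_con, rho):
    """Gate driven by a learned continuous risk prediction rho in [0, 1]."""
    return (1.0 - rho) * q_opt + rho * q_con

def fixed_threshold_gate(q_opt, q_con, rho, tau=0.5):
    """Ablation variant: a hard switch to the conservative estimate
    whenever predicted risk exceeds a fixed threshold tau."""
    conservative = rho > tau
    return np.where(conservative, q_con, q_opt)

q_opt = np.array([10.0, 12.0])
q_con = np.array([2.0, -4.0])
rho = np.array([0.4, 0.6])
soft = learned_gate(q_opt, q_con, rho)          # graded blending
hard = fixed_threshold_gate(q_opt, q_con, rho)  # all-or-nothing switch
```

Comparing the two variants on the same tasks would separate the value of graded risk information from the regularization any conservative switch provides.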
Circularity Check
No significant circularity
full rationale
The paper presents an empirical RL method that trains an action-conditioned risk predictor on data and applies it as a penalty and gate within value estimation for POMDP control. All performance claims (glycemic tradeoffs, runtime reduction, reward-cost balance) are supported by direct experimental comparisons against belief-space planning and safe-RL baselines in two domains, with no equations that reduce the reported gains to quantities defined by the same fitted parameters. No self-citations serve as load-bearing uniqueness theorems, and no ansatz or renaming is smuggled in to create circularity. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22–31. PMLR, 2017.
- [3] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449–458. PMLR, 2017.
- [4] Fraser Cameron, B. Wayne Bequette, Darrell M. Wilson, Bruce A. Buckingham, Hyunjin Lee, and Günter Niemeyer. A closed-loop artificial pancreas based on risk management. Journal of Diabetes Science and Technology, 5(2):368–379, 2011.
- [5] Steven Carr, Nils Jansen, and Ufuk Topcu. Safe reinforcement learning via shielding under partial observability. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 14748–14756, 2023.
- [6]
- [7] Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Finale Doshi-Velez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare. Nature Medicine, 25(1):16–18, 2019.
- [8] Pau Herrero, Ahmad Haidar, Madhuri Reddy, Mohamed El Sharkawy, Peter Pesl, Marina Xenou, Christofer Toumazou, Juan Hanusch, Ewa Pankowska, Patricia Herrero, Nick Oliver, Pantelis Georgiou, and Josep Vehi. Enhancing automatic closed-loop glucose control in type 1 diabetes under announced meals using an adaptive meal bolus calculator. Artificial Intelligence…, 2017.
- [9] Sinan Ibrahim et al. Comprehensive overview of reward engineering and shaping in advancing reinforcement learning applications. IEEE Access, 12:175473–175500, 2024.
- [10] Maximilian Igl et al. Deep variational reinforcement learning for POMDPs. In International Conference on Machine Learning. PMLR, 2018.
- [11] Jiaming Ji, Borong Zhang, Jiayi Zhou, Xuehai Pan, Weidong Huang, Ruiyang Sun, Yiran Geng, Yifan Zhong, Juntao Dai, and Yaodong Yang. Safety-Gymnasium: A unified safe reinforcement learning benchmark. arXiv preprint arXiv:2310.12567, 2023.
- [12] Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline RL? In International Conference on Machine Learning, pages 5084–5096. PMLR, 2021.
- [13] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1–2):99–134, 1998.
- [14] Bettina Könighofer et al. Shields for safe reinforcement learning. Communications of the ACM, 68(11):80–90, 2025.
- [15] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
- [16] Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, 2008.
- [17] Mikko Lauri, David Hsu, and Joni Pajarinen. Partially observable Markov decision processes in robotics: A survey. IEEE Transactions on Robotics, 39(1):21–40, 2023.
- [18] Chiara Dalla Man, Francesco Micheletto, Dayu Lv, Marc Breton, Boris Kovatchev, and Claudio Cobelli. The UVA/Padova type 1 diabetes simulator: New features. Journal of Diabetes Science and Technology, 8(1):26–34, 2014.
- [19] Joelle Pineau, Geoff Gordon, and Sebastian Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In IJCAI, volume 3, 2003.
- [21] Alex Ray, Joshua Achiam, and Dario Amodei. Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708, 7(1):2, 2019.
- [22] Marc Rigter, Bruno Lacerda, and Nick Hawes. One risk to rule them all: A risk-sensitive perspective on model-based offline reinforcement learning. In Advances in Neural Information Processing Systems, volume 36, pages 77520–77545, 2023.
- [23] Nicholas Roy, Geoffrey Gordon, and Sebastian Thrun. Finding approximate POMDP solutions through belief compression. Journal of Artificial Intelligence Research, 23:1–40, 2005.
- [24] John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015.
- [25] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [26] Guy Shani, Joelle Pineau, and Robert Kaplow. A survey of point-based POMDP solvers. Autonomous Agents and Multi-Agent Systems, 27(1):1–51, 2013.
- [27] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, volume 23, 2010.
- [28] Adam Stooke, Joshua Achiam, and Pieter Abbeel. Responsive safety in reinforcement learning by PID Lagrangian methods. In International Conference on Machine Learning, pages 9133–9143. PMLR, 2020.
- [29] Aviv Tamar, Yinlam Chow, Mohammad Ghavamzadeh, and Shie Mannor. Sequential decision making with coherent risk. IEEE Transactions on Automatic Control, 62(7):3323–3338, 2016.
- [30] Miguel Tejedor, Sigurd Nordtveit Hjerde, Jonas Nordhaug Myhre, and Fred Godtliebsen. Evaluating deep Q-learning algorithms for controlling blood glucose in in silico type 1 diabetes. Diagnostics, 13(19):3150, 2023.
- [31] Chen Tessler, Daniel J. Mankowitz, and Shie Mannor. Reward constrained policy optimization. arXiv preprint arXiv:1805.11074, 2018.
- [32] Jinyu Xie. Simglucose v0.2.1. [Online]. Available: https://github.com/jxx123/simglucose, 2018. Accessed: 2026-04-18.
- [33] Long Yang, Jiaming Ji, Juntao Dai, Yu Zhang, Pengfei Li, and Gang Pan. CUP: A conservative update policy algorithm for safe reinforcement learning. arXiv preprint arXiv:2202.07565, 2022.
- [34]
- [35] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pages 94–116, 1994.
- [36] Huan Yu. Approximate Solution Methods for Partially Observable Markov and Semi-Markov Decision Processes. PhD thesis, Massachusetts Institute of Technology, 2006.
- [37] Yiming Zhang, Quan Vuong, and Keith Ross. First order constrained optimization in policy space. In Advances in Neural Information Processing Systems, volume 33, pages 15338–15349, 2020.
- [38] Yan Feng Zhao, Jun Kit Chaw, Mei Choo Ang, Yiqi Tew, Xiao Yang Shi, Lin Liu, and Xiang Cheng. A safe-enhanced fully closed-loop artificial pancreas controller based on deep reinforcement learning. PLOS ONE, 20(1):e0317662, 2025.
- [39] Xugui Zhou, Maxfield Kouzel, Haotian Ren, and Homa Alemzadeh. Design and validation of an open-source closed-loop testbed for artificial pancreas systems. In 2022 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE), pages 1–12. IEEE, 2022.
Appendix C excerpts (proposition statements captured by the reference extractor, reconstructed):
- Assumptions for Proposition 1: (1) Continuity and compactness: the maps $a \mapsto \tilde{Q}^*(\tilde{s}_t, a)$ and $a \mapsto \hat{\rho}_n(\tilde{s}_t, a)$ are continuous on the compact action space $\mathcal{A}$. (2) Uniqueness: the unconstrained maximizer $\tilde{\pi}_{\mathrm{POMDP}}(\tilde{s}_t)$ is unique. (3) Strict feasibility: there exists an action $a \in \mathcal{A}$ such that $\hat{\rho}_n(\tilde{s}_t, a) < \tau$. Under Assumptions 1, 2, 3, and 4, Proposition 1 takes $Q^*$ to be the oracle Q-function and…
- Proposition 2 (ensemble envelope): (1) Uniform exponential enveloping: the probability that the fixed point $\tilde{Q}^*$ escapes the ensemble envelope is bounded by $P\big(\tilde{Q}^*(\tilde{s}, a) \notin [Q^-_M, Q^+_M]\big) \le (p^+)^M + (p^-)^M$, where $p^+ = 1 - \inf_{(\tilde{s},a)} F_\epsilon(0 \mid \tilde{s}, a)$ and $p^- = \sup_{(\tilde{s},a)} F_\epsilon(0 \mid \tilde{s}, a)$. (2) Extremal consistency: as the ensemble size $M \to \infty$, the envelope boundaries converge in probability to the physical error supports, $Q^+_M(\tilde{s}, a) \xrightarrow{p} \tilde{Q}^*(\tilde{s}, a) + \epsilon_{\max}(\tilde{s}, a)$ and $Q^-_M(\tilde{s}, a) \xrightarrow{p} \tilde{Q}^*(\tilde{s}, a) + \epsilon_{\min}(\tilde{s}, a)$. (3) Asymptotic mixed value: the risk-gated Q-value converges in probability to a safety-augmented landscape, $Q_{\mathrm{gate}}(\tilde{s}, a) \xrightarrow{p} \tilde{Q}^*(\tilde{s}, a) + (1 - \hat{\rho}_n)\,\epsilon_{\max} + \hat{\rho}_n\,\epsilon_{\min}$, where the last two terms are the safety-aware shift. The sketched proof of part 1 follows from the independence of the $M$ base learners (Assumption 8) and the uniform bounds $p^+$, $p^-$…
- Proposition 3 (guaranteed intervention): let $B_t = \{\rho(H_t, \tilde{\pi}_{\mathrm{POMDP}}(\tilde{s}_t)) > \tau + \epsilon\}$ be the event that the POMDP policy is truly unsafe. Then $P\big(\tilde{\pi}^* \ne \tilde{\pi}_{\mathrm{POMDP}} \text{ and } \rho(H_t, \tilde{\pi}^*) \le \tau + \epsilon \mid B_t\big) \ge 1 - \delta$. The sketched proof conditions on the event $E_n = \{\omega \in \Omega : \sup_{\tilde{s},a} |\hat{\rho}_n(\tilde{s}, a) - \rho(H_t, a)| \le \epsilon(n, \delta)\}$, which by Assumption 10 satisfies $P(E_n) \ge 1 - \delta$, so that all subsequent arguments hold deterministically given $E_n$.
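The enveloping bound above decays geometrically in the ensemble size; a quick numerical check with illustrative tail probabilities (`p_plus`, `p_minus` are our example values, not taken from the paper):

```python
# Numerical check of the envelope-escape bound
# P(Q* outside [Q^-_M, Q^+_M]) <= (p+)^M + (p-)^M.
p_plus, p_minus = 0.3, 0.25
bounds = {M: p_plus ** M + p_minus ** M for M in (1, 5, 10)}
# The escape probability shrinks geometrically as the ensemble grows,
# so even moderate M makes the envelope reliable under these values.
```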