
arxiv: 2603.18396 · v4 · pith:GVK7FQMTnew · submitted 2026-03-19 · 💻 cs.LG · cs.RO

RE-SAC: Disentangling aleatoric and epistemic risks in bus fleet control: A stable and robust ensemble DRL approach

Pith reviewed 2026-05-21 09:50 UTC · model grok-4.3

classification 💻 cs.LG cs.RO
keywords deep reinforcement learning · bus holding control · aleatoric uncertainty · epistemic uncertainty · ensemble methods · robust optimization · stochastic control · Q-value stability

The pith

RE-SAC disentangles aleatoric and epistemic uncertainties to stabilize Q-values in bus holding control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that conflating aleatoric noise with epistemic data gaps in standard actor-critic methods causes Q-value underestimation and policy collapse in stochastic bus environments. RE-SAC counters this by applying IPM-based weight regularization to the critic to hedge aleatoric risk and by using a diversified Q-ensemble to address epistemic risk. Keeping the two apart stops the ensemble from misreading noise as missing data, which in the reported simulations leads to higher rewards and better robustness under high traffic variability. The practical appeal is a way to make DRL more reliable for real transportation systems without expensive inner-loop perturbations.

Core claim

The central claim is that explicitly separating aleatoric and epistemic uncertainties via IPM-based regularization for the critic and a diversified ensemble allows the robust Bellman operator to be bounded smoothly while penalizing overconfidence in sparse regions, avoiding the ablation-identified failure mode and yielding superior performance in simulations.
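For readers who want the bound in symbols, the following is a sketch under stated assumptions rather than the paper's own derivation. It assumes the IPM is the Wasserstein-1 distance (as the simulated rebuttal further down suggests). The symbols $\hat{P}$ (nominal next-state distribution), $\epsilon$ (radius of the ambiguity set), and $\mathrm{Lip}(V)$ (Lipschitz constant of the value function) are illustrative and not taken from the paper.

```latex
% Sketch only: \hat{P}, \epsilon and \mathrm{Lip}(V) are illustrative symbols,
% not the paper's notation.
% An IPM over a function class \mathcal{F}:
d_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}}
    \left| \mathbb{E}_{P}[f] - \mathbb{E}_{Q}[f] \right|

% Wasserstein-1 is the IPM with \mathcal{F} = \{ f : \mathrm{Lip}(f) \le 1 \}
% (Kantorovich-Rubinstein duality).
% Since V / \mathrm{Lip}(V) is 1-Lipschitz, every Q in the \epsilon-ball
% around \hat{P} satisfies
\mathbb{E}_{Q}[V] \ge \mathbb{E}_{\hat{P}}[V] - \epsilon \, \mathrm{Lip}(V)
\quad \text{whenever } W_1(\hat{P}, Q) \le \epsilon,

% so the worst case over the ball is bounded without an inner minimization:
\inf_{Q :\, W_1(\hat{P}, Q) \le \epsilon} \mathbb{E}_{Q}[V]
\;\ge\; \mathbb{E}_{\hat{P}}[V] - \epsilon \, \mathrm{Lip}(V).
```

For a feedforward critic with 1-Lipschitz activations, $\mathrm{Lip}(V)$ is at most the product of the layers' spectral norms, which is one reason a weight-norm penalty can act as a differentiable surrogate for this bound. Whether RE-SAC follows exactly this route is not confirmed by the text on this page.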

What carries the argument

RE-SAC's dual mechanism combining Integral Probability Metric regularization against aleatoric risk with diversified Q-ensemble penalization for epistemic risk.
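A minimal sketch of how two separate penalties could be combined into one pessimistic value estimate is given below. It is not the authors' implementation: the function names, the coefficients `beta` and `lam`, and the use of a product of Frobenius norms as a Lipschitz proxy are all assumptions made for illustration. Whether the paper subtracts the aleatoric term from the target or adds it to the critic loss is not stated on this page.

```python
import math
import statistics


def lipschitz_proxy(layers):
    """Product of per-layer Frobenius norms.

    For a feedforward net with 1-Lipschitz activations this upper-bounds the
    Lipschitz constant, because the Frobenius norm is >= the spectral norm.
    (Hypothetical stand-in for the paper's IPM-based weight regularizer.)
    """
    prod = 1.0
    for weights in layers:
        prod *= math.sqrt(sum(x * x for row in weights for x in row))
    return prod


def conservative_q(q_members, member_layers, beta=1.0, lam=0.1):
    """Pessimistic Q estimate with two separate penalties.

    q_members     : Q(s, a) from each ensemble member
    member_layers : weight matrices of each member (list of lists of rows)
    beta, lam     : hypothetical penalty coefficients
    """
    mean_q = statistics.fmean(q_members)
    # Epistemic term: disagreement between ensemble members.
    epistemic = statistics.pstdev(q_members)
    # Aleatoric term: depends only on the weights, not on member disagreement.
    aleatoric = statistics.fmean([lipschitz_proxy(l) for l in member_layers])
    return mean_q - beta * epistemic - lam * aleatoric


# Toy critic: two layers with Frobenius norms 5 and 1, so the proxy is 5.
toy_layers = [[[3.0, 4.0]], [[1.0]]]

# Members agree: only the aleatoric penalty is applied.
agree = conservative_q([-10.0, -10.0, -10.0], [toy_layers] * 3)
# Members disagree: the epistemic penalty is added on top.
disagree = conservative_q([-8.0, -10.0, -12.0], [toy_layers] * 3)
print(agree, disagree)
```

The design point the sketch tries to make visible is that the weight-based term is unchanged when the members agree, so noise alone does not have to inflate the disagreement-based term.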

If this is right

  • RE-SAC attains the highest cumulative reward in the bus corridor simulations, approximately -0.4 million versus -0.55 million for vanilla SAC.
  • Oracle Q-value estimation error drops by up to 62 percent in rare out-of-distribution states compared to baselines (MAE of 1647 vs. 4343).
  • The approach avoids catastrophic policy collapse from value underestimation in noisy states.
  • Ensemble variance no longer misidentifies irreducible noise as data insufficiency.
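The percentages above can be checked against the figures quoted in the abstract. The snippet below is only an arithmetic check: the helper name is made up for illustration, and the reward values are the approximate ones the abstract reports.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Oracle Q-value MAE in rare states, as quoted in the abstract.
mae_reduction = relative_reduction(4343.0, 1647.0)

# Cumulative reward is negative, so compare cost magnitudes:
# about 0.55e6 for vanilla SAC versus about 0.40e6 for RE-SAC.
cost_reduction = relative_reduction(0.55e6, 0.40e6)

print(f"MAE reduction:  {mae_reduction:.1%}")   # 62.1%
print(f"Cost reduction: {cost_reduction:.1%}")  # 27.3%
```

The 62 percent figure is therefore consistent with the two MAE values, and the reward gap corresponds to roughly a 27 percent reduction in cost magnitude, subject to the rounding in the abstract.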

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method may extend to other reinforcement learning domains involving high stochasticity, such as autonomous driving or energy management.
  • Future work could explore combining this with other uncertainty quantification techniques for even greater robustness.
  • Real-world testing would need to verify whether the simulated traffic patterns match actual bus operations closely enough for the gains to transfer.

Load-bearing premise

The bus corridor simulation accurately models the stochastic traffic and passenger demand that cause the value underestimation when uncertainties are not separated.

What would settle it

Running the RE-SAC and baseline agents on a physical bus corridor and measuring cumulative reward and Q-error under actual variable traffic conditions to check if improvements persist.

Figures

Figures reproduced from arXiv: 2603.18396 by Liang Zheng, Yifan Zhang.

Figure 1: Learning curves of cumulative reward for RE-SAC variants, SAC, DSAC-v1, BAC, and …
Figure 2: Comparative analysis of estimated Q-Values (Mean …
Figure 3: Oracle Q-Error (MAE) banded by Mahalanobis Rareness. RE-SAC maintains accurate …
Original abstract

Bus holding control is challenging due to stochastic traffic and passenger demand. While deep reinforcement learning (DRL) shows promise, standard actor-critic algorithms suffer from Q-value instability in volatile environments. A key source of this instability is the conflation of two distinct uncertainties: aleatoric uncertainty (irreducible noise) and epistemic uncertainty (data insufficiency). Treating these as a single risk leads to value underestimation in noisy states, causing catastrophic policy collapse. We propose a robust ensemble soft actor-critic (RE-SAC) framework to explicitly disentangle these uncertainties. RE-SAC applies Integral Probability Metric (IPM)-based weight regularization to the critic network to hedge against aleatoric risk, providing a smooth analytical lower bound for the robust Bellman operator without expensive inner-loop perturbations. To address epistemic risk, a diversified Q-ensemble penalizes overconfident value estimates in sparsely covered regions. This dual mechanism prevents the ensemble variance from misidentifying noise as a data gap, a failure mode identified in our ablation study. Experiments in a realistic bidirectional bus corridor simulation demonstrate that RE-SAC achieves the highest cumulative reward (approx. -0.4e6) compared to vanilla SAC (-0.55e6). Mahalanobis rareness analysis confirms that RE-SAC reduces Oracle Q-value estimation error by up to 62% in rare out-of-distribution states (MAE of 1647 vs. 4343), demonstrating superior robustness under high traffic variability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes RE-SAC, an ensemble soft actor-critic variant for bus holding control that explicitly separates aleatoric risk (via IPM-based weight regularization on the critic, claimed to yield a smooth analytical lower bound on the robust Bellman operator) from epistemic risk (via a diversified Q-ensemble that penalizes overconfident estimates in data-sparse regions). The central claim is that this dual mechanism prevents ensemble variance from misidentifying irreducible noise as epistemic gaps, yielding higher cumulative rewards (approximately -0.4e6 versus -0.55e6 for vanilla SAC) and up to 62% lower Oracle Q-value MAE in rare out-of-distribution states within a bidirectional bus corridor simulation.

Significance. If the separation of uncertainty types and the reported robustness gains hold under broader conditions, the work would supply a concrete algorithmic template for stable DRL in stochastic transportation control, where conflating aleatoric and epistemic risks is a known source of policy collapse. The explicit identification of an ablation failure mode (ensemble variance treating noise as data insufficiency) is a useful diagnostic contribution.

major comments (3)
  1. [§3.2] §3.2 (IPM regularization derivation): The claim that IPM supplies a closed-form analytical lower bound on the robust Bellman operator without inner-loop perturbations is load-bearing for the aleatoric-hedging mechanism. Standard IPM definitions involve a supremum over a function class; when the critic is a neural network this supremum is not closed-form unless the dual is restricted to a parametric family or a surrogate metric is substituted. The manuscript does not state which restriction is used or whether additional gradient steps remain inside the bound, leaving the claimed separation of aleatoric hedging from epistemic ensemble variance dependent on an unstated assumption.
  2. [§5] §5 (Experimental results): The headline performance numbers (cumulative reward -0.4e6 vs. -0.55e6; MAE reduction from 4343 to 1647) are presented without error bars, statistical tests, or reporting of the number of independent random seeds. Because the central claim is superior robustness under high traffic variability, the absence of these quantifiers makes it impossible to determine whether the observed gains are statistically reliable or sensitive to particular simulation parameter choices.
  3. [§4.3] §4.3 (Ablation study): The post-hoc diagnosis that ensemble variance misidentifies noise as a data gap is presented as motivation for the diversified-ensemble component. However, the manuscript provides no quantitative test of whether this failure mode persists when the underlying traffic and passenger-demand stochasticity are altered (e.g., different variance schedules or corridor topologies), which is required to establish that the dual mechanism generalizes beyond the specific simulation used for demonstration.
minor comments (2)
  1. [Abstract and §5] The definition and exact computation of the Mahalanobis rareness metric used to identify out-of-distribution states should be stated explicitly (including the covariance estimator) rather than referenced only by name.
  2. [§3] Notation for the IPM regularization coefficient and the ensemble diversity parameter should be introduced once in §3 and used consistently thereafter; currently the symbols appear to be introduced ad hoc in different subsections.
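Minor comment 1 asks for the rareness metric to be spelled out. One plausible reading, sketched below, is the Mahalanobis distance of a state from the empirical state distribution. The choice of the sample covariance with a small ridge term is an assumption, since the estimator the paper uses is not stated here, and all function names are hypothetical.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length state vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sample_covariance(rows, ridge=1e-6):
    """Unbiased sample covariance plus a ridge on the diagonal (assumed estimator)."""
    mu = mean_vector(rows)
    n, d = len(rows), len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        c = [x - m for x, m in zip(row, mu)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += c[i] * c[j] / (n - 1)
    for i in range(d):
        cov[i][i] += ridge
    return cov


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def mahalanobis(x, mu, cov):
    """sqrt((x - mu)^T cov^{-1} (x - mu)); larger values mean rarer states."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    return math.sqrt(sum(d * s for d, s in zip(diff, solve(cov, diff))))


# Toy 2-D "states" (placeholder numbers, not data from the paper).
states = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mu = mean_vector(states)
cov = sample_covariance(states, ridge=0.0)
print(mahalanobis([1.0, 1.0], mu, cov))  # the mean itself: distance 0
print(mahalanobis([3.0, 1.0], mu, cov))  # further out: larger distance
```

States could then be grouped into rareness bands by quantiles of this distance, which is one way to produce an error-versus-rareness breakdown like the one in Figure 3.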

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the revisions we plan to incorporate to strengthen the presentation and claims.

Point-by-point responses
  1. Referee: [§3.2] (IPM regularization derivation) Major comment 1 of the Referee Report above: the closed-form lower bound is claimed without stating which restriction on the IPM function class makes the supremum tractable.

    Authors: We appreciate this observation on the derivation. The IPM regularization in §3.2 employs the Wasserstein IPM, which admits a closed-form dual representation via the Kantorovich-Rubinstein theorem when the critic is constrained to 1-Lipschitz functions. This constraint is enforced during training via a gradient penalty term, yielding the analytical lower bound on the robust Bellman operator without requiring inner-loop optimization or additional gradient steps at evaluation time. We will revise §3.2 to explicitly state the use of the 1-Lipschitz restriction and include the complete duality-based derivation steps. revision: yes

  2. Referee: [§5] (Experimental results) Major comment 2 of the Referee Report above: the headline numbers are reported without error bars, statistical tests, or the number of random seeds.

    Authors: We agree that the current reporting lacks necessary statistical detail. In the revised manuscript we will present all headline metrics as means over 10 independent random seeds, accompanied by standard error bars. We will also add paired statistical tests (Wilcoxon signed-rank) to quantify the significance of the reward and MAE improvements relative to vanilla SAC and other baselines. revision: yes

  3. Referee: [§4.3] (Ablation study) Major comment 3 of the Referee Report above: the failure mode is not tested under altered stochasticity or different corridor topologies.

    Authors: The ablation in §4.3 was performed on the primary bidirectional corridor to isolate the identified failure mode. To address generalizability, the revised manuscript will include an extended ablation with two additional settings: (i) doubled traffic variance and (ii) an alternative three-stop corridor topology. We will report quantitative metrics showing whether ensemble variance continues to misidentify noise as epistemic uncertainty and whether the diversified ensemble mitigates this under the altered stochastic conditions. revision: yes
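The reporting promised in response 2 (means over seeds, standard errors, and a paired Wilcoxon signed-rank test) can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation script. The function names are hypothetical and the per-seed numbers are placeholders, not results. The exact test below assumes no zero differences and no tied absolute differences.

```python
import itertools
import math
import statistics


def mean_and_sem(values):
    """Mean and standard error of the mean across seeds."""
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return statistics.fmean(values), sem


def wilcoxon_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Assumes no zero differences and no ties among |differences|.
    Returns (W_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    total = n * (n + 1) // 2
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under H0 every rank carries a + or - sign with probability 1/2,
    # so enumerate all 2^n sign patterns (cheap for about 10 seeds).
    extreme = 0
    for signs in itertools.product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Placeholder per-seed scores for two methods (NOT results from the paper).
method_a = [5.0, 6.5, 7.0, 9.0, 11.5]
method_b = [4.0, 4.5, 4.0, 5.0, 6.5]

mean_a, sem_a = mean_and_sem(method_a)
w_plus, p_value = wilcoxon_exact(method_a, method_b)
print(f"A: {mean_a:.2f} +/- {sem_a:.2f}, W+ = {w_plus}, p = {p_value}")
```

With 10 paired seeds the smallest two-sided p-value this exact test can produce is 2/1024, about 0.002, which is worth keeping in mind when several baselines are compared and corrected for multiple testing.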

Circularity Check

0 steps flagged

No significant circularity; mechanisms introduced independently and validated via simulation

full rationale

The paper's core derivation introduces IPM-based critic regularization as an explicit hedge for aleatoric risk and a diversified Q-ensemble for epistemic risk. These are motivated by identifying conflation of uncertainties in standard SAC, then proposed as distinct algorithmic additions rather than being defined in terms of each other or the final reward numbers. Claims of providing a smooth analytical lower bound and preventing ensemble misidentification are presented as consequences of the new components, with empirical support from bus corridor simulations and Mahalanobis analysis. No load-bearing step reduces by construction to a fit, self-citation chain, or renaming of inputs; the central claims retain independent content outside the reported metrics.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the fidelity of the bidirectional bus corridor simulation and on the premise that conflating the two uncertainty types produces the observed value underestimation; no new physical entities are postulated, but several algorithmic hyperparameters are implicitly present.

free parameters (2)
  • IPM regularization coefficient
    Controls the strength of the weight regularization applied to the critic to hedge aleatoric risk; value not stated in abstract.
  • Ensemble diversity parameter
    Determines how the Q-ensemble is diversified to penalize overconfident estimates; value not stated in abstract.
axioms (1)
  • domain assumption: The realistic bidirectional bus corridor simulation accurately reproduces the stochastic traffic and passenger demand that drive policy collapse in standard SAC.
    Invoked when claiming superior robustness and the 62% error reduction in out-of-distribution states.




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BAPR: Bayesian amnesic piecewise-robust reinforcement learning for non-stationary continuous control

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    BAPR combines Bayesian change detection with robust RL, proves the core operator is a contraction via Lean 4, and adapts conservatism after detected regime shifts in continuous control.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    A headway-based approach to eliminate bus bunching: Systematic analysis and comparisons,

    Daganzo Carlos F., “A headway-based approach to eliminate bus bunching: Systematic analysis and comparisons,”Transportation Research Part B: Methodological, vol. 43, no. 10, pp. 913–921, 2009

  2. [2]

    Dynamic bus holding strategies for schedule reliability: Optimal linear control and performance analysis,

    Xuan Yiguang, Argote Juan, and Daganzo Carlos F., “Dynamic bus holding strategies for schedule reliability: Optimal linear control and performance analysis,”Transportation Research Part B: Methodological, vol. 45, no. 10, pp. 1831–1845, 2011

  3. [3]

    Dynamic holding control to avoid bus bunching: A multi-agent deep reinforcement learning framework,

    Wang Jiawei and Sun Lijun, “Dynamic holding control to avoid bus bunching: A multi-agent deep reinforcement learning framework,”Transportation Research Part C: Emerging Technolo- gies, vol. 116, p. 102661, 2020

  4. [4]

    Seizing serendipity: Exploiting the value of past success in off-policy actor-critic,

    Ji Tianying, Luo Yu, Sun Fuchun, Zhan Xianyuan, Zhang Jianwei, and Xu Huazhe, “Seizing serendipity: Exploiting the value of past success in off-policy actor-critic,” inInternational Conference on Machine Learning (ICML), 2024, pp. 21672–21718. 15

  5. [5]

    De- composition of uncertainty in bayesian deep learning for efficient and risk-sensitive learning,

    Depeweg Stefan, Hernández-Lobato José Miguel, Doshi-Velez Finale, and Udluft Steffen, “De- composition of uncertainty in bayesian deep learning for efficient and risk-sensitive learning,” inInternational Conference on Machine Learning (ICML), 2018, pp. 1184–1193

  6. [6]

    Estimating risk and uncertainty in deep reinforcement learning,

    Clements William R., Robaglia Benoît-Marie, van Delft Bastien, Slaoui Reda Bahi, and Toth Sébastien, “Estimating risk and uncertainty in deep reinforcement learning,”arXiv preprint arXiv:1905.09638, 2019

  7. [7]

    Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods,

    Hüllermeier Eyke and Waegeman Willem, “Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods,”Machine Learning, vol. 110, no. 3, pp. 457–506, 2021

  8. [8]

    Uncertainty-based offline reinforcement learning with diversified q-ensemble,

    An Gaon, Moon Seungyong, Kim Jang-Hyun, and Song Hyun Oh, “Uncertainty-based offline reinforcement learning with diversified q-ensemble,” inAdvances in Neural Information Pro- cessing Systems (NeurIPS), vol. 34, 2021, pp. 7436–7447

  9. [9]

    What uncertainties do we need in bayesian deep learning for computer vision?

    Kendall Alex and Gal Yarin, “What uncertainties do we need in bayesian deep learning for computer vision?” inAdvances in Neural Information Processing Systems (NIPS), vol. 30, 2017, pp. 5580–5590

  10. [10]

    Addressing function approximation error in actor-critic methods,

    Fujimoto Scott, van Hoof Herke, and Meger David, “Addressing function approximation error in actor-critic methods,” inInternational Conference on Machine Learning (ICML), 2018, pp. 1587–1596

  11. [11]

    A bayesian approach to robust rein- forcement learning,

    Derman Esther, Mankowitz Daniel J., and Mannor Shie, “A bayesian approach to robust rein- forcement learning,” inUncertainty in Artificial Intelligence (UAI), 2019, pp. 648–658

  12. [12]

    Robust markov decision processes,

    Wiesemann Wolfram, Kuhn Daniel, and Rustem Berç, “Robust markov decision processes,” Mathematics of Operations Research, vol. 38, no. 1, pp. 153–183, 2013

  13. [13]

    Robustness and risk-sensitivity in markov decision processes,

    Osogami Takayuki, “Robustness and risk-sensitivity in markov decision processes,” inAdvances in Neural Information Processing Systems (NIPS), vol. 25, 2012, pp. 233–241

  14. [14]

    Robust regression and lasso,

    Xu Huan, Caramanis Constantine, and Mannor Shie, “Robust regression and lasso,” inAdvances in Neural Information Processing Systems (NIPS), vol. 21, 2008, pp. 1801–1808

  15. [15]

    Robustness and regularization of support vector machines,

    Xu Huan, Caramanis Constantine, “Robustness and regularization of support vector machines,” Journal of Machine Learning Research, vol. 10, no. 7, pp. 1485–1510, 2009

  16. [16]

    Natural actor-critic for robust reinforcement learning with function approximation,

    Zhou Ruida, Liu Tao, Cheng Min, Kalathil Dileep, Kumar P. R., and Tian Chao, “Natural actor-critic for robust reinforcement learning with function approximation,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023, pp. 30424–30438

  17. [17]

    Model-free robustϕ-divergence rein- forcement learning using both offline and online data,

    Panaganti Kishan, Wierman Adam, and Mazumdar Eric, “Model-free robustϕ-divergence rein- forcement learning using both offline and online data,” inInternational Conference on Machine Learning (ICML), 2024, pp. 39389–39459

  18. [18]

    Lipschitz continuity in model-based reinforcement learning,

    Asadi Kavosh, Misra Dipendra, and Littman Michael L., “Lipschitz continuity in model-based reinforcement learning,” inInternational Conference on Machine Learning (ICML), 2018, pp. 264–273

  19. [19]

    Deep exploration via bootstrapped dqn,

    Osband Ian, Blundell Charles, Pritzel Alexander, and Van Roy Benjamin, “Deep exploration via bootstrapped dqn,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 29, 2016, pp. 4026–4034

  20. [20]

    Single agent robust deep reinforcement learning for bus fleet control,

    Zhang Yifan and Zheng Liang, “Single agent robust deep reinforcement learning for bus fleet control,”arXiv preprint, 2025. 16

  21. [21]

    Rorl: Robust offline reinforcement learning via conservative smoothing,

    Yang Rui, Bai Chenjia, Ma Xiaoteng, Wang Zhen, Zhang Chongjie, and Han Li, “Rorl: Robust offline reinforcement learning via conservative smoothing,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022, pp. 26613–26624

  22. [22]

    Dynamic bus holding strategies for schedule reliability: Optimal linear control and performance analysis,

    Xuan Yiguang and Daganzo Carlos F., “Dynamic bus holding strategies for schedule reliability: Optimal linear control and performance analysis,”Transportation Research Part B: Method- ological, vol. 45, pp. 1831–1845, 2011

  23. [23]

    Multi-agent deep reinforcement learning: a survey,

    Gronauer Sven and Diepold Klaus, “Multi-agent deep reinforcement learning: a survey,”Arti- ficial Intelligence Review, vol. 55, no. 2, pp. 895–943, 2022

  24. [24]

    Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning

    Papoudakis Georgios, Christianos Filippos, Rahman Arrasy, and Albrecht Stefano V., “Dealing with non-stationarity in multi-agent deep reinforcement learning,”arXiv preprint arXiv:1906.04737, 2019

  25. [25]

    Grand- master level in starcraft ii using multi-agent reinforcement learning,

    Vinyals Oriol, Babuschkin Igor, Czarnecki Wojciech M., Mathieu Michaël, Dudzik Andrew, Chung Junyoung, Choi David H., Powell Richard, Ewalds Timo, Georgiev Petkoet al., “Grand- master level in starcraft ii using multi-agent reinforcement learning,”Nature, vol. 575, no. 7782, pp. 350–354, 2019

  26. [26]

    Algorithms for cvar optimization in mdps,

    Chow Yinlam and Ghavamzadeh Mohammad, “Algorithms for cvar optimization in mdps,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 27, 2014, pp. 3509–3517

  27. [27]

    Risk-sensitive and robust decision-making: A cvar optimization approach,

    Chow Yinlam, Tamar Aviv, Mannor Shie, and Pavone Marco, “Risk-sensitive and robust decision-making: A cvar optimization approach,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 28, 2015, pp. 1522–1530

  28. [28]

    A distributional perspective on reinforce- ment learning,

    Bellemare Marc G., Dabney Will, and Munos Rémi, “A distributional perspective on reinforce- ment learning,” inInternational Conference on Machine Learning (ICML), 2017, pp. 449–458

  29. [29]

    Distributional reinforce- ment learning with quantile regression,

    Dabney Will, Rowland Mark, Bellemare Marc G., and Munos Rémi, “Distributional reinforce- ment learning with quantile regression,” inAAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018, pp. 2892–2901

  30. [30]

    Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret,

    Fei Yingjie, Yang Zhuoran, Chen Yudong, Wang Zhaoran, and Xie Qiaomin, “Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 22384–22395

  31. [31]

    Robust dynamic programming,

    Iyengar Garud N., “Robust dynamic programming,”Mathematics of Operations Research, vol. 30, no. 2, pp. 257–280, 2005

  32. [32]

    A theory of regularized markov decision processes,

    Geist Matthieu, Scherrer Bruno, and Pietquin Olivier, “A theory of regularized markov decision processes,” inInternational Conference on Machine Learning (ICML), 2019, pp. 2160–2169

  33. [33]

    Double q-learning,

    Van Hasselt Hado, “Double q-learning,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 23, 2010, pp. 2613–2621

  34. [34]

    Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,

    Haarnoja Tuomas, Zhou Aurick, Abbeel Pieter, and Levine Sergey, “Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,” inInternational Conference on Machine Learning (ICML), 2018, pp. 1861–1870

  35. [35]

    Conservative q-learning for of- fline reinforcement learning,

    Kumar Aviral, Zhou Aurick, Tucker George, and Levine Sergey, “Conservative q-learning for of- fline reinforcement learning,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 1179–1191

  36. [36]

    Dsac: Distributional soft actor-critic for risk-sensitive reinforcement learning,

    Ma Xiaoteng, Chen Junyao, Xia Li, Yang Jun, Zhao Qianchuan, and Zhou Zhengyuan, “Dsac: Distributional soft actor-critic for risk-sensitive reinforcement learning,”Journal of Artificial Intelligence Research, vol. 83, pp. 1–28, 2025, article 4

  37. [37]

    Villani,Optimal transport: old and new

    C. Villani,Optimal transport: old and new. Springer, 2009, vol. 338. 17

  38. [38]

    Maximum entropy rl (provably) solves some robust rl problems,

    Eysenbach Benjamin and Levine Sergey, “Maximum entropy rl (provably) solves some robust rl problems,” inInternational Conference on Learning Representations (ICLR), 2022

  39. [39]

    Discounted dynamic programming,

    D. Blackwell, “Discounted dynamic programming,”The Annals of Mathematical Statistics, vol. 36, no. 1, pp. 226–235, 1965

  40. [40]

    Human-level control through deep reinforcement learning,

    Mnih Volodymyr, Kavukcuoglu Koray, Silver David, Rusu Andrei A., Veness Joel, Bellemare Marc G., Graves Alex, Riedmiller Martin, Fidjeland Andreas K., Ostrovski Georget al., “Human-level control through deep reinforcement learning,”Nature, vol. 518, no. 7540, pp. 529–533, 2015

  41. [41]

    Deep reinforcement learning and the deadly triad,

    van Hasselt Hado, Doron Yotam, Strub Florian, Hessel Matteo, Sonnerat Nicolas, and Modayil Joseph, “Deep reinforcement learning and the deadly triad,” inNeurIPS 2018 Workshop on Deep Reinforcement Learning, 2018. 18 Appendix: Theoretical foundations and derivations A Sample complexity analysis We compare the theoretical sample complexity (data efficiency)...

  42. [42]

    A rigorous regret analysis would need to account for this non-stationarity (cf

    The penaltiesκandΓ epi change across training iterations as the network weights and target ensemble evolve. A rigorous regret analysis would need to account for this non-stationarity (cf. the per-step discussion in §E.1)

  43. [43]

    Golden Thread

    The shifted reward˜Rmay have a different range thanR, affecting the constants in the bound. The conceptual takeaway is that by replacing tail-distribution estimation (which requires exponential samples) with a deterministic structural penalty (which requires no additional samples), RE-SAC sidesteps the mechanismthat causes the exponential barrier. B The c...

  44. [44]

    •Epistemic Risk:Corresponds to the spread of the posteriorq(w)

    Decomposition (Depeweg et al.):We distinguish between reducible and irreducible uncertainty: H[p(s′|s, a)]| {z } Total Uncertainty =E q(w)[H(p(s′|s, a,w))]| {z } Aleatoric (Expected Data Noise) +I(s ′;w)| {z } Epistemic (Mutual Information) ,(26) whereI(s ′;w)is the Mutual Information between the weights and the prediction. •Epistemic Risk:Corresponds to ...

  45. [45]

    Credible Set

    Robustness as Bayesian proxy (Derman et al.):Derman et al. show that optimizing for the worst-case model within a Bayesian "Credible Set" (Ambiguity SetU) is a lower bound on the true Bayesian optimal value: VBayes(s)≥max π min P∈U α E[R].(27) This justifies our approach: we use the Ensemble to define the Credible Set (where the model might be), and we op...