RE-SAC: Disentangling aleatoric and epistemic risks in bus fleet control: A stable and robust ensemble DRL approach
Pith reviewed 2026-05-21 09:50 UTC · model grok-4.3
The pith
RE-SAC disentangles aleatoric and epistemic uncertainties to stabilize Q-values in bus holding control.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that explicitly separating aleatoric and epistemic uncertainty lets each risk be handled by its own tool: IPM-based regularization of the critic bounds the robust Bellman operator smoothly, while a diversified ensemble penalizes overconfidence in sparsely covered regions. This avoids the failure mode identified in the ablation study, where ensemble variance mistakes irreducible noise for a data gap, and yields superior performance in simulation.
What carries the argument
RE-SAC's dual mechanism combining Integral Probability Metric regularization against aleatoric risk with diversified Q-ensemble penalization for epistemic risk.
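To make the dual mechanism concrete, here is a minimal sketch of a TD target that subtracts two separate penalties. It is not the paper's formula: the function name `robust_target`, the coefficients `kappa` and `beta`, the use of a Euclidean weight norm for the aleatoric term, and the use of the ensemble standard deviation for the epistemic term are all assumptions chosen for illustration.

```python
import math
import statistics


def robust_target(reward, gamma, q_next, critic_weights, kappa, beta):
    """Conservative TD target under two separate penalties (illustrative only).

    q_next         -- next-state Q estimates, one per ensemble member
    critic_weights -- flat list of critic weights (hypothetical layout)
    kappa          -- assumed coefficient of the weight-norm (aleatoric) penalty
    beta           -- assumed coefficient of the ensemble-spread (epistemic) penalty
    """
    q_mean = statistics.fmean(q_next)
    # Epistemic term: grows only when ensemble members disagree.
    epistemic_penalty = beta * statistics.pstdev(q_next)
    # Aleatoric term: depends on the weights, not on ensemble disagreement.
    aleatoric_penalty = kappa * math.sqrt(sum(w * w for w in critic_weights))
    return reward + gamma * (q_mean - epistemic_penalty - aleatoric_penalty)


# Same mean estimate, with and without ensemble disagreement.
agree = robust_target(-1.0, 0.9, [10.0, 10.0, 10.0], [3.0, 4.0], kappa=0.1, beta=1.0)
disagree = robust_target(-1.0, 0.9, [8.0, 12.0], [3.0, 4.0], kappa=0.1, beta=1.0)
print(agree, disagree)
```

Because the two penalties come from different sources, ensemble disagreement lowers the target without touching the weight-norm term, which is the separation the review describes.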
If this is right
- RE-SAC attains the highest cumulative reward in the bus corridor simulations, approximately -0.4 million versus -0.55 million for vanilla SAC.
- Q-value estimation error drops by up to 62 percent in rare out-of-distribution states (MAE of 1647 versus 4343).
- The approach avoids catastrophic policy collapse from value underestimation in noisy states.
- Ensemble variance no longer misidentifies irreducible noise as data insufficiency.
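As a quick check on the headline numbers, the relative changes can be recomputed from the values quoted in the abstract (MAE of 1647 versus 4343, cumulative reward of roughly -0.4e6 versus -0.55e6). The reward figure is only as precise as the approximate totals reported.

```latex
\[
\text{MAE reduction} = \frac{4343 - 1647}{4343} = \frac{2696}{4343} \approx 0.621
\]
\[
\text{reward improvement} \approx
\frac{-0.40\times 10^{6} - (-0.55\times 10^{6})}{\lvert -0.55\times 10^{6} \rvert}
= \frac{0.15}{0.55} \approx 0.27
\]
```

The first ratio matches the "up to 62 percent" claim; the second shows the reward gap is about 27 percent of the vanilla SAC magnitude.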
Where Pith is reading between the lines
- This method may extend to other reinforcement learning domains involving high stochasticity, such as autonomous driving or energy management.
- Future work could explore combining this with other uncertainty quantification techniques for even greater robustness.
- Real-world testing would need to verify whether the simulation's traffic patterns match actual bus operations closely enough.
Load-bearing premise
The bus corridor simulation accurately models the stochastic traffic and passenger demand that cause the value underestimation when uncertainties are not separated.
What would settle it
Running the RE-SAC and baseline agents on a physical bus corridor and measuring cumulative reward and Q-error under actual variable traffic conditions would show whether the improvements persist.
Original abstract
Bus holding control is challenging due to stochastic traffic and passenger demand. While deep reinforcement learning (DRL) shows promise, standard actor-critic algorithms suffer from Q-value instability in volatile environments. A key source of this instability is the conflation of two distinct uncertainties: aleatoric uncertainty (irreducible noise) and epistemic uncertainty (data insufficiency). Treating these as a single risk leads to value underestimation in noisy states, causing catastrophic policy collapse. We propose a robust ensemble soft actor-critic (RE-SAC) framework to explicitly disentangle these uncertainties. RE-SAC applies Integral Probability Metric (IPM)-based weight regularization to the critic network to hedge against aleatoric risk, providing a smooth analytical lower bound for the robust Bellman operator without expensive inner-loop perturbations. To address epistemic risk, a diversified Q-ensemble penalizes overconfident value estimates in sparsely covered regions. This dual mechanism prevents the ensemble variance from misidentifying noise as a data gap, a failure mode identified in our ablation study. Experiments in a realistic bidirectional bus corridor simulation demonstrate that RE-SAC achieves the highest cumulative reward (approx. -0.4e6) compared to vanilla SAC (-0.55e6). Mahalanobis rareness analysis confirms that RE-SAC reduces Oracle Q-value estimation error by up to 62% in rare out-of-distribution states (MAE of 1647 vs. 4343), demonstrating superior robustness under high traffic variability.
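One standard way an IPM can give an analytical lower bound without an inner loop is the Wasserstein-1 case, which the simulated rebuttal on this page names. The sketch below is a textbook argument, not the paper's derivation: the nominal transition kernel $\hat{P}$, the radius $\epsilon$, and the Lipschitz constant $L$ of the value function $V$ are assumed symbols.

```latex
% Kantorovich-Rubinstein duality for the Wasserstein-1 distance
\[
W_1(P, Q) = \sup_{\lVert f \rVert_{\mathrm{Lip}} \le 1}
\Big( \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)] \Big)
\]
% If V is L-Lipschitz, then V / L is admissible in the supremum, so for any P
% with W_1(P, \hat{P}) \le \epsilon:
\[
\mathbb{E}_{s' \sim P}[V(s')] \;\ge\; \mathbb{E}_{s' \sim \hat{P}}[V(s')] - L\,\epsilon
\]
% Taking the infimum over the ball gives a bound on the robust backup:
\[
r(s,a) + \gamma \inf_{P :\, W_1(P, \hat{P}) \le \epsilon} \mathbb{E}_{s' \sim P}[V(s')]
\;\ge\; r(s,a) + \gamma \Big( \mathbb{E}_{s' \sim \hat{P}}[V(s')] - L\,\epsilon \Big)
\]
```

Under these assumptions the worst case is replaced by a penalty proportional to $L$, which is why controlling the critic's Lipschitz constant (for example through its weights) can stand in for explicit perturbations.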
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes RE-SAC, an ensemble soft actor-critic variant for bus holding control that explicitly separates aleatoric risk (via IPM-based weight regularization on the critic, claimed to yield a smooth analytical lower bound on the robust Bellman operator) from epistemic risk (via a diversified Q-ensemble that penalizes overconfident estimates in data-sparse regions). The central claim is that this dual mechanism prevents ensemble variance from misidentifying irreducible noise as epistemic gaps, yielding higher cumulative rewards (approximately -0.4e6 versus -0.55e6 for vanilla SAC) and up to 62% lower Oracle Q-value MAE in rare out-of-distribution states within a bidirectional bus corridor simulation.
Significance. If the separation of uncertainty types and the reported robustness gains hold under broader conditions, the work would supply a concrete algorithmic template for stable DRL in stochastic transportation control, where conflating aleatoric and epistemic risks is a known source of policy collapse. The explicit identification of an ablation failure mode (ensemble variance treating noise as data insufficiency) is a useful diagnostic contribution.
major comments (3)
- [§3.2] §3.2 (IPM regularization derivation): The claim that IPM supplies a closed-form analytical lower bound on the robust Bellman operator without inner-loop perturbations is load-bearing for the aleatoric-hedging mechanism. Standard IPM definitions involve a supremum over a function class; when the critic is a neural network this supremum is not closed-form unless the dual is restricted to a parametric family or a surrogate metric is substituted. The manuscript does not state which restriction is used or whether additional gradient steps remain inside the bound, leaving the claimed separation of aleatoric hedging from epistemic ensemble variance dependent on an unstated assumption.
- [§5] §5 (Experimental results): The headline performance numbers (cumulative reward -0.4e6 vs. -0.55e6; MAE reduction from 4343 to 1647) are presented without error bars, statistical tests, or reporting of the number of independent random seeds. Because the central claim is superior robustness under high traffic variability, the absence of these quantifiers makes it impossible to determine whether the observed gains are statistically reliable or sensitive to particular simulation parameter choices.
- [§4.3] §4.3 (Ablation study): The post-hoc diagnosis that ensemble variance misidentifies noise as a data gap is presented as motivation for the diversified-ensemble component. However, the manuscript provides no quantitative test of whether this failure mode persists when the underlying traffic and passenger-demand stochasticity are altered (e.g., different variance schedules or corridor topologies), which is required to establish that the dual mechanism generalizes beyond the specific simulation used for demonstration.
minor comments (2)
- [Abstract and §5] The definition and exact computation of the Mahalanobis rareness metric used to identify out-of-distribution states should be stated explicitly (including the covariance estimator) rather than referenced only by name.
- [§3] Notation for the IPM regularization coefficient and the ensemble diversity parameter should be introduced once in §3 and used consistently thereafter; currently the symbols appear to be introduced ad hoc in different subsections.
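The first minor comment asks for the Mahalanobis rareness computation to be spelled out. A minimal sketch of one common version follows, assuming the sample mean and the unbiased sample covariance as estimators; the paper may use a different estimator or feature set, and the helper names and toy states are hypothetical.

```python
import math


def mean_and_cov(samples):
    """Sample mean and unbiased sample covariance (assumed estimators)."""
    n, d = len(samples), len(samples[0])
    mu = [sum(x[j] for x in samples) / n for j in range(d)]
    cov = [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in samples) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mu, cov


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis(x, mu, cov):
    """sqrt((x - mu)^T cov^{-1} (x - mu)); larger means rarer."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    z = solve(cov, diff)  # z = cov^{-1} diff without forming the inverse
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))


# Toy 2-D "visited states" (made-up values, not from the paper).
visited = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mu, cov = mean_and_cov(visited)
common = mahalanobis([1.0, 1.0], mu, cov)
rare = mahalanobis([3.0, 1.0], mu, cov)
print(common, rare)
```

States with a large distance from the visited-state distribution would be the "rare out-of-distribution" ones on which Q-error is then reported.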
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the revisions we plan to incorporate to strengthen the presentation and claims.
-
Referee: [§3.2] §3.2 (IPM regularization derivation): The claim that IPM supplies a closed-form analytical lower bound on the robust Bellman operator without inner-loop perturbations is load-bearing for the aleatoric-hedging mechanism. Standard IPM definitions involve a supremum over a function class; when the critic is a neural network this supremum is not closed-form unless the dual is restricted to a parametric family or a surrogate metric is substituted. The manuscript does not state which restriction is used or whether additional gradient steps remain inside the bound, leaving the claimed separation of aleatoric hedging from epistemic ensemble variance dependent on an unstated assumption.
Authors: We appreciate this observation on the derivation. The IPM regularization in §3.2 employs the Wasserstein IPM, which admits a closed-form dual representation via the Kantorovich-Rubinstein theorem when the critic is constrained to 1-Lipschitz functions. This constraint is enforced during training via a gradient penalty term, yielding the analytical lower bound on the robust Bellman operator without requiring inner-loop optimization or additional gradient steps at evaluation time. We will revise §3.2 to explicitly state the use of the 1-Lipschitz restriction and include the complete duality-based derivation steps. revision: yes
-
Referee: [§5] §5 (Experimental results): The headline performance numbers (cumulative reward -0.4e6 vs. -0.55e6; MAE reduction from 4343 to 1647) are presented without error bars, statistical tests, or reporting of the number of independent random seeds. Because the central claim is superior robustness under high traffic variability, the absence of these quantifiers makes it impossible to determine whether the observed gains are statistically reliable or sensitive to particular simulation parameter choices.
Authors: We agree that the current reporting lacks necessary statistical detail. In the revised manuscript we will present all headline metrics as means over 10 independent random seeds, accompanied by standard error bars. We will also add paired statistical tests (Wilcoxon signed-rank) to quantify the significance of the reward and MAE improvements relative to vanilla SAC and other baselines. revision: yes
-
Referee: [§4.3] §4.3 (Ablation study): The post-hoc diagnosis that ensemble variance misidentifies noise as a data gap is presented as motivation for the diversified-ensemble component. However, the manuscript provides no quantitative test of whether this failure mode persists when the underlying traffic and passenger-demand stochasticity are altered (e.g., different variance schedules or corridor topologies), which is required to establish that the dual mechanism generalizes beyond the specific simulation used for demonstration.
Authors: The ablation in §4.3 was performed on the primary bidirectional corridor to isolate the identified failure mode. To address generalizability, the revised manuscript will include an extended ablation with two additional settings: (i) doubled traffic variance and (ii) an alternative three-stop corridor topology. We will report quantitative metrics showing whether ensemble variance continues to misidentify noise as epistemic uncertainty and whether the diversified ensemble mitigates this under the altered stochastic conditions. revision: yes
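The second response promises means over 10 seeds, standard errors, and a paired Wilcoxon signed-rank test. The sketch below shows how such reporting could be computed with only the standard library. The per-seed values are made-up placeholders, not results from the paper, and the exact-enumeration p-value is one of several ways to run the test.

```python
import math
from itertools import product
from statistics import fmean, stdev


def mean_and_sem(xs):
    """Mean and standard error of the mean across seeds."""
    return fmean(xs), stdev(xs) / math.sqrt(len(xs))


def wilcoxon_exact(a, b):
    """Paired Wilcoxon signed-rank test with an exact two-sided p-value.

    Enumerates all sign assignments, so it is only practical for small n.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average rank shared by tied magnitudes
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2
    observed = abs(w_plus - center)
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed - 1e-12:
            hits += 1
    return w_plus, hits / 2 ** len(ranks)


# Placeholder per-seed scores for two methods (hypothetical, 10 seeds each).
method_a = [float(k) for k in range(1, 11)]
method_b = [0.0] * 10
mean_a, sem_a = mean_and_sem(method_a)
w_plus, p_value = wilcoxon_exact(method_a, method_b)
print(mean_a, sem_a, w_plus, p_value)
```

With 10 seeds and one method better on every seed, the smallest attainable two-sided p-value is 2/1024, which bounds how strong a claim this sample size can support.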
Circularity Check
No significant circularity; mechanisms introduced independently and validated via simulation
The paper's core derivation introduces IPM-based critic regularization as an explicit hedge for aleatoric risk and a diversified Q-ensemble for epistemic risk. These are motivated by identifying conflation of uncertainties in standard SAC, then proposed as distinct algorithmic additions rather than being defined in terms of each other or the final reward numbers. Claims of providing a smooth analytical lower bound and preventing ensemble misidentification are presented as consequences of the new components, with empirical support from bus corridor simulations and Mahalanobis analysis. No load-bearing step reduces by construction to a fit, self-citation chain, or renaming of inputs; the central claims retain independent content outside the reported metrics.
Axiom & Free-Parameter Ledger
free parameters (2)
- IPM regularization coefficient
- Ensemble diversity parameter
axioms (1)
- domain assumption: The realistic bidirectional bus corridor simulation accurately reproduces the stochastic traffic and passenger demand that drive policy collapse in standard SAC.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem:
IPM-based weight regularization ... smooth analytical lower bound for the robust Bellman operator ... frozen-parameter design ... γ-contraction ... machine-verified in Lean 4 with no sorry (Proof.lean, Counterproof.lean)
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem:
T_REV operator ... fixed penalties κ and Γ_epi ... Blackwell conditions ... continuous-space extension via Banach Fixed-Point Theorem
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
BAPR: Bayesian amnesic piecewise-robust reinforcement learning for non-stationary continuous control
BAPR combines Bayesian change detection with robust RL, proves the core operator is a contraction via Lean 4, and adapts conservatism after detected regime shifts in continuous control.
Reference graph
Works this paper leans on
- [1] Daganzo Carlos F., “A headway-based approach to eliminate bus bunching: Systematic analysis and comparisons,” Transportation Research Part B: Methodological, vol. 43, no. 10, pp. 913–921, 2009.
- [2] Xuan Yiguang, Argote Juan, and Daganzo Carlos F., “Dynamic bus holding strategies for schedule reliability: Optimal linear control and performance analysis,” Transportation Research Part B: Methodological, vol. 45, no. 10, pp. 1831–1845, 2011.
- [3] Wang Jiawei and Sun Lijun, “Dynamic holding control to avoid bus bunching: A multi-agent deep reinforcement learning framework,” Transportation Research Part C: Emerging Technologies, vol. 116, p. 102661, 2020.
- [4] Ji Tianying, Luo Yu, Sun Fuchun, Zhan Xianyuan, Zhang Jianwei, and Xu Huazhe, “Seizing serendipity: Exploiting the value of past success in off-policy actor-critic,” in International Conference on Machine Learning (ICML), 2024, pp. 21672–21718.
- [5] Depeweg Stefan, Hernández-Lobato José Miguel, Doshi-Velez Finale, and Udluft Steffen, “Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning,” in International Conference on Machine Learning (ICML), 2018, pp. 1184–1193.
- [6] Clements William R., Robaglia Benoît-Marie, van Delft Bastien, Slaoui Reda Bahi, and Toth Sébastien, “Estimating risk and uncertainty in deep reinforcement learning,” arXiv preprint arXiv:1905.09638, 2019.
- [7] Hüllermeier Eyke and Waegeman Willem, “Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods,” Machine Learning, vol. 110, no. 3, pp. 457–506, 2021.
- [8] An Gaon, Moon Seungyong, Kim Jang-Hyun, and Song Hyun Oh, “Uncertainty-based offline reinforcement learning with diversified Q-ensemble,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 34, 2021, pp. 7436–7447.
- [9] Kendall Alex and Gal Yarin, “What uncertainties do we need in Bayesian deep learning for computer vision?” in Advances in Neural Information Processing Systems (NIPS), vol. 30, 2017, pp. 5580–5590.
- [10] Fujimoto Scott, van Hoof Herke, and Meger David, “Addressing function approximation error in actor-critic methods,” in International Conference on Machine Learning (ICML), 2018, pp. 1587–1596.
- [11] Derman Esther, Mankowitz Daniel J., and Mannor Shie, “A Bayesian approach to robust reinforcement learning,” in Uncertainty in Artificial Intelligence (UAI), 2019, pp. 648–658.
- [12] Wiesemann Wolfram, Kuhn Daniel, and Rustem Berç, “Robust Markov decision processes,” Mathematics of Operations Research, vol. 38, no. 1, pp. 153–183, 2013.
- [13] Osogami Takayuki, “Robustness and risk-sensitivity in Markov decision processes,” in Advances in Neural Information Processing Systems (NIPS), vol. 25, 2012, pp. 233–241.
- [14] Xu Huan, Caramanis Constantine, and Mannor Shie, “Robust regression and lasso,” in Advances in Neural Information Processing Systems (NIPS), vol. 21, 2008, pp. 1801–1808.
- [15] Xu Huan and Caramanis Constantine, “Robustness and regularization of support vector machines,” Journal of Machine Learning Research, vol. 10, no. 7, pp. 1485–1510, 2009.
- [16] Zhou Ruida, Liu Tao, Cheng Min, Kalathil Dileep, Kumar P. R., and Tian Chao, “Natural actor-critic for robust reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023, pp. 30424–30438.
- [17] Panaganti Kishan, Wierman Adam, and Mazumdar Eric, “Model-free robust ϕ-divergence reinforcement learning using both offline and online data,” in International Conference on Machine Learning (ICML), 2024, pp. 39389–39459.
- [18] Asadi Kavosh, Misra Dipendra, and Littman Michael L., “Lipschitz continuity in model-based reinforcement learning,” in International Conference on Machine Learning (ICML), 2018, pp. 264–273.
- [19] Osband Ian, Blundell Charles, Pritzel Alexander, and Van Roy Benjamin, “Deep exploration via bootstrapped DQN,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 29, 2016, pp. 4026–4034.
- [20] Zhang Yifan and Zheng Liang, “Single agent robust deep reinforcement learning for bus fleet control,” arXiv preprint, 2025.
- [21] Yang Rui, Bai Chenjia, Ma Xiaoteng, Wang Zhen, Zhang Chongjie, and Han Li, “RORL: Robust offline reinforcement learning via conservative smoothing,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022, pp. 26613–26624.
- [22] Xuan Yiguang and Daganzo Carlos F., “Dynamic bus holding strategies for schedule reliability: Optimal linear control and performance analysis,” Transportation Research Part B: Methodological, vol. 45, pp. 1831–1845, 2011.
- [23] Gronauer Sven and Diepold Klaus, “Multi-agent deep reinforcement learning: a survey,” Artificial Intelligence Review, vol. 55, no. 2, pp. 895–943, 2022.
- [24] Papoudakis Georgios, Christianos Filippos, Rahman Arrasy, and Albrecht Stefano V., “Dealing with non-stationarity in multi-agent deep reinforcement learning,” arXiv preprint arXiv:1906.04737, 2019.
- [25] Vinyals Oriol, Babuschkin Igor, Czarnecki Wojciech M., Mathieu Michaël, Dudzik Andrew, Chung Junyoung, Choi David H., Powell Richard, Ewalds Timo, Georgiev Petko et al., “Grandmaster level in StarCraft II using multi-agent reinforcement learning,” Nature, vol. 575, no. 7782, pp. 350–354, 2019.
- [26] Chow Yinlam and Ghavamzadeh Mohammad, “Algorithms for CVaR optimization in MDPs,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 27, 2014, pp. 3509–3517.
- [27] Chow Yinlam, Tamar Aviv, Mannor Shie, and Pavone Marco, “Risk-sensitive and robust decision-making: A CVaR optimization approach,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 28, 2015, pp. 1522–1530.
- [28] Bellemare Marc G., Dabney Will, and Munos Rémi, “A distributional perspective on reinforcement learning,” in International Conference on Machine Learning (ICML), 2017, pp. 449–458.
- [29] Dabney Will, Rowland Mark, Bellemare Marc G., and Munos Rémi, “Distributional reinforcement learning with quantile regression,” in AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018, pp. 2892–2901.
- [30] Fei Yingjie, Yang Zhuoran, Chen Yudong, Wang Zhaoran, and Xie Qiaomin, “Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 22384–22395.
- [31] Iyengar Garud N., “Robust dynamic programming,” Mathematics of Operations Research, vol. 30, no. 2, pp. 257–280, 2005.
- [32] Geist Matthieu, Scherrer Bruno, and Pietquin Olivier, “A theory of regularized Markov decision processes,” in International Conference on Machine Learning (ICML), 2019, pp. 2160–2169.
- [33] Van Hasselt Hado, “Double Q-learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 23, 2010, pp. 2613–2621.
- [34] Haarnoja Tuomas, Zhou Aurick, Abbeel Pieter, and Levine Sergey, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning (ICML), 2018, pp. 1861–1870.
- [35] Kumar Aviral, Zhou Aurick, Tucker George, and Levine Sergey, “Conservative Q-learning for offline reinforcement learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 1179–1191.
- [36] Ma Xiaoteng, Chen Junyao, Xia Li, Yang Jun, Zhao Qianchuan, and Zhou Zhengyuan, “DSAC: Distributional soft actor-critic for risk-sensitive reinforcement learning,” Journal of Artificial Intelligence Research, vol. 83, pp. 1–28, 2025, article 4.
- [37] Villani C., Optimal transport: old and new. Springer, 2009, vol. 338.
- [38] Eysenbach Benjamin and Levine Sergey, “Maximum entropy RL (provably) solves some robust RL problems,” in International Conference on Learning Representations (ICLR), 2022.
- [39] Blackwell D., “Discounted dynamic programming,” The Annals of Mathematical Statistics, vol. 36, no. 1, pp. 226–235, 1965.
- [40] Mnih Volodymyr, Kavukcuoglu Koray, Silver David, Rusu Andrei A., Veness Joel, Bellemare Marc G., Graves Alex, Riedmiller Martin, Fidjeland Andreas K., Ostrovski Georg et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
- [41] van Hasselt Hado, Doron Yotam, Strub Florian, Hessel Matteo, Sonnerat Nicolas, and Modayil Joseph, “Deep reinforcement learning and the deadly triad,” in NeurIPS 2018 Workshop on Deep Reinforcement Learning, 2018.
Appendix: Theoretical foundations and derivations. A. Sample complexity analysis. We compare the theoretical sample complexity (data efficiency)...
- [42] The penalties κ and Γ_epi change across training iterations as the network weights and target ensemble evolve. A rigorous regret analysis would need to account for this non-stationarity (cf. the per-step discussion in §E.1).
- [43] The shifted reward $\tilde{R}$ may have a different range than $R$, affecting the constants in the bound. The conceptual takeaway is that by replacing tail-distribution estimation (which requires exponential samples) with a deterministic structural penalty (which requires no additional samples), RE-SAC sidesteps the mechanism that causes the exponential barrier. B The c...
- [44] Decomposition (Depeweg et al.): We distinguish between reducible and irreducible uncertainty:
$$\underbrace{H[p(s' \mid s, a)]}_{\text{Total Uncertainty}} = \underbrace{\mathbb{E}_{q(w)}\big[H(p(s' \mid s, a, w))\big]}_{\text{Aleatoric (Expected Data Noise)}} + \underbrace{I(s'; w)}_{\text{Epistemic (Mutual Information)}}, \tag{26}$$
where $I(s'; w)$ is the mutual information between the weights and the prediction.
• Epistemic Risk: Corresponds to the spread of the posterior $q(w)$ ...
- [45] Robustness as Bayesian proxy (Derman et al.): Derman et al. show that optimizing for the worst-case model within a Bayesian "Credible Set" (Ambiguity Set $\mathcal{U}$) is a lower bound on the true Bayesian optimal value:
$$V_{\text{Bayes}}(s) \ge \max_{\pi} \min_{P \in \mathcal{U}_{\alpha}} \mathbb{E}[R]. \tag{27}$$
This justifies our approach: we use the Ensemble to define the Credible Set (where the model might be), and we op...
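The entropy decomposition attributed to Depeweg et al. in the appendix excerpt above can be checked numerically on a toy case. This is a minimal sketch, assuming the posterior $q(w)$ is approximated by a finite ensemble with equal weight on each member and that next-state predictions are discrete distributions; the function names and the toy distributions are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def decompose(member_dists):
    """Split total predictive entropy into aleatoric and epistemic parts.

    member_dists -- one predictive distribution per ensemble member,
                    each weighted equally (assumed stand-in for q(w)).
    """
    k = len(member_dists)
    n = len(member_dists[0])
    mixture = [sum(d[i] for d in member_dists) / k for i in range(n)]
    total = entropy(mixture)                               # H[p(s'|s,a)]
    aleatoric = sum(entropy(d) for d in member_dists) / k  # E_q[H(p(s'|s,a,w))]
    epistemic = total - aleatoric                          # I(s'; w)
    return total, aleatoric, epistemic


# Members agree on a noisy outcome: all uncertainty is aleatoric.
noisy = decompose([[0.5, 0.5], [0.5, 0.5]])
# Members are individually certain but disagree: all uncertainty is epistemic.
split = decompose([[1.0, 0.0], [0.0, 1.0]])
print(noisy, split)
```

Both toy cases have the same total entropy, which is the situation where a single combined risk measure cannot tell noise from a data gap.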