RE-SAC: Disentangling aleatoric and epistemic risks in bus fleet control: A stable and robust ensemble DRL approach

Liang Zheng; Yifan Zhang

arxiv: 2603.18396 · v4 · pith:GVK7FQMTnew · submitted 2026-03-19 · 💻 cs.LG · cs.RO


Pith reviewed 2026-05-21 09:50 UTC · model grok-4.3


keywords deep reinforcement learning · bus holding control · aleatoric uncertainty · epistemic uncertainty · ensemble methods · robust optimization · stochastic control · Q-value stability


The pith

RE-SAC disentangles aleatoric and epistemic uncertainties to stabilize Q-values in bus holding control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard actor-critic methods conflate aleatoric noise with epistemic data gaps, and that this conflation causes Q-value underestimation and policy collapse in stochastic bus environments. RE-SAC counters this by applying IPM-based weight regularization to the critic to hedge aleatoric risk and by using a diversified Q-ensemble to address epistemic risk. Keeping the two apart prevents noise from being misread as missing data, which the paper reports leads to higher rewards and better robustness under high traffic variability. The appeal for a sympathetic reader is a practical route to reliable DRL for real transportation systems without expensive inner-loop perturbations.

Core claim

The central claim is that handling the two uncertainties separately, with IPM-based regularization of the critic for aleatoric risk and a diversified Q-ensemble for epistemic risk, yields a smooth lower bound on the robust Bellman operator while penalizing overconfidence in sparsely covered regions. This avoids the failure mode identified in the ablation study and yields superior performance in simulation.

What carries the argument

RE-SAC's dual mechanism combining Integral Probability Metric regularization against aleatoric risk with diversified Q-ensemble penalization for epistemic risk.

If this is right

RE-SAC attains the highest cumulative reward in the bus corridor simulation, approximately -0.4 million versus -0.55 million for vanilla SAC.
Q-value estimation error drops by up to 62 percent in rare out-of-distribution states (MAE of 1647 vs. 4343).
The approach avoids catastrophic policy collapse from value underestimation in noisy states.
Ensemble variance no longer misidentifies irreducible noise as data insufficiency.
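
The headline percentages can be re-derived from the figures quoted in the abstract. The short check below uses only those reported numbers; the variable names are ours, and the reward gain is expressed relative to the magnitude of the vanilla SAC reward, which is one reasonable convention among several.

```python
# Figures quoted in the paper's abstract.
mae_resac, mae_comparison = 1647.0, 4343.0      # Oracle Q-value MAE in rare OOD states
reward_resac, reward_sac = -0.40e6, -0.55e6     # cumulative reward (approximate)

# Relative MAE reduction: 1 - 1647 / 4343
mae_reduction = 1.0 - mae_resac / mae_comparison

# Reward improvement over vanilla SAC, relative to |SAC reward|
reward_gain = (reward_resac - reward_sac) / abs(reward_sac)

print(f"MAE reduction: {mae_reduction:.1%}")
print(f"Reward gain:   {reward_gain:.1%}")
```

The MAE ratio reproduces the "up to 62 percent" claim; the reward gap corresponds to roughly a 27 percent improvement under the convention above.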

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method may extend to other reinforcement learning domains involving high stochasticity, such as autonomous driving or energy management.
Future work could explore combining this with other uncertainty quantification techniques for even greater robustness.
Real-world testing would need to verify if the simulation's traffic patterns match actual bus operations sufficiently.

Load-bearing premise

The bus corridor simulation accurately models the stochastic traffic and passenger demand that cause the value underestimation when uncertainties are not separated.

What would settle it

Run the RE-SAC and baseline agents on a physical bus corridor and measure cumulative reward and Q-error under actual variable traffic conditions to check whether the improvements persist.

Figures

Figures reproduced from arXiv: 2603.18396 by Liang Zheng, Yifan Zhang.

**Figure 2.** Comparative analysis of estimated Q-Values (Mean ...

**Figure 3.** Oracle Q-Error (MAE) banded by Mahalanobis Rareness. RE-SAC maintains accurate ...

Original abstract

Bus holding control is challenging due to stochastic traffic and passenger demand. While deep reinforcement learning (DRL) shows promise, standard actor-critic algorithms suffer from Q-value instability in volatile environments. A key source of this instability is the conflation of two distinct uncertainties: aleatoric uncertainty (irreducible noise) and epistemic uncertainty (data insufficiency). Treating these as a single risk leads to value underestimation in noisy states, causing catastrophic policy collapse. We propose a robust ensemble soft actor-critic (RE-SAC) framework to explicitly disentangle these uncertainties. RE-SAC applies Integral Probability Metric (IPM)-based weight regularization to the critic network to hedge against aleatoric risk, providing a smooth analytical lower bound for the robust Bellman operator without expensive inner-loop perturbations. To address epistemic risk, a diversified Q-ensemble penalizes overconfident value estimates in sparsely covered regions. This dual mechanism prevents the ensemble variance from misidentifying noise as a data gap, a failure mode identified in our ablation study. Experiments in a realistic bidirectional bus corridor simulation demonstrate that RE-SAC achieves the highest cumulative reward (approx. -0.4e6) compared to vanilla SAC (-0.55e6). Mahalanobis rareness analysis confirms that RE-SAC reduces Oracle Q-value estimation error by up to 62% in rare out-of-distribution states (MAE of 1647 vs. 4343), demonstrating superior robustness under high traffic variability.
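
To make the two mechanisms in the abstract concrete, here is a minimal NumPy sketch built on assumptions of our own. The aleatoric hedge is modelled as a penalty proportional to the product of the critic's layer spectral norms (a standard upper bound on the Lipschitz constant of an MLP with 1-Lipschitz activations), and the epistemic penalty as the ensemble standard deviation. The function names, the coefficients `eps` and `beta`, and the exact form of both penalties are hypothetical; the abstract does not specify them, so this is not the authors' implementation.

```python
import numpy as np

def lipschitz_upper_bound(weights):
    """Product of layer spectral norms: an upper bound on the Lipschitz
    constant of an MLP with 1-Lipschitz activations (e.g. ReLU)."""
    return float(np.prod([np.linalg.norm(W, ord=2) for W in weights]))

def conservative_target(reward, next_q_ensemble, critic_weights,
                        gamma=0.99, eps=0.01, beta=1.0):
    """Hypothetical pessimistic TD target (illustration only).

    next_q_ensemble has shape (n_members, batch): each member's estimate
    of Q(s', a') for the same batch of transitions.
    """
    q_mean = next_q_ensemble.mean(axis=0)
    # Epistemic term: disagreement between ensemble members.
    epistemic_penalty = beta * next_q_ensemble.std(axis=0)
    # Aleatoric term: depends only on the weights, so no inner-loop search.
    aleatoric_penalty = eps * lipschitz_upper_bound(critic_weights)
    return reward + gamma * (q_mean - epistemic_penalty - aleatoric_penalty)

rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 4)) * 0.1, rng.normal(size=(1, 8)) * 0.1]
next_q = rng.normal(loc=-50.0, scale=2.0, size=(5, 3))   # 5 members, batch of 3
reward = np.array([-1.0, -2.0, -0.5])
target = conservative_target(reward, next_q, weights)
print(target.shape)
```

Because the two penalties are computed from different quantities (weights versus ensemble spread), a noisy but well-covered state raises neither term by construction in this sketch; whether the paper achieves the same separation depends on details not given in the abstract.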

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RE-SAC adds a targeted split of aleatoric and epistemic handling to SAC for bus control, with some simulation gains, but the empirical claims rest on thin validation and the IPM bound needs closer scrutiny.


The main point is that this paper shows how to keep aleatoric noise and epistemic gaps separate inside an ensemble SAC for bus holding control. They use IPM weight regularization on the critic for the first and a diversified Q-ensemble for the second, plus an ablation that flags when variance starts treating noise as missing data. That combination is new for this setting and gives a practical way to avoid the value underestimation they describe in volatile traffic.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes RE-SAC, an ensemble soft actor-critic variant for bus holding control that explicitly separates aleatoric risk (via IPM-based weight regularization on the critic, claimed to yield a smooth analytical lower bound on the robust Bellman operator) from epistemic risk (via a diversified Q-ensemble that penalizes overconfident estimates in data-sparse regions). The central claim is that this dual mechanism prevents ensemble variance from misidentifying irreducible noise as epistemic gaps, yielding higher cumulative rewards (approximately -0.4e6 versus -0.55e6 for vanilla SAC) and up to 62% lower Oracle Q-value MAE in rare out-of-distribution states within a bidirectional bus corridor simulation.

Significance. If the separation of uncertainty types and the reported robustness gains hold under broader conditions, the work would supply a concrete algorithmic template for stable DRL in stochastic transportation control, where conflating aleatoric and epistemic risks is a known source of policy collapse. The explicit identification of an ablation failure mode (ensemble variance treating noise as data insufficiency) is a useful diagnostic contribution.

major comments (3)

[§3.2, IPM regularization derivation] The claim that IPM supplies a closed-form analytical lower bound on the robust Bellman operator without inner-loop perturbations is load-bearing for the aleatoric-hedging mechanism. Standard IPM definitions involve a supremum over a function class; when the critic is a neural network this supremum is not closed-form unless the dual is restricted to a parametric family or a surrogate metric is substituted. The manuscript does not state which restriction is used or whether additional gradient steps remain inside the bound, leaving the claimed separation of aleatoric hedging from epistemic ensemble variance dependent on an unstated assumption.
[§5, Experimental results] The headline performance numbers (cumulative reward -0.4e6 vs. -0.55e6; MAE reduction from 4343 to 1647) are presented without error bars, statistical tests, or reporting of the number of independent random seeds. Because the central claim is superior robustness under high traffic variability, the absence of these quantifiers makes it impossible to determine whether the observed gains are statistically reliable or sensitive to particular simulation parameter choices.
[§4.3, Ablation study] The post-hoc diagnosis that ensemble variance misidentifies noise as a data gap is presented as motivation for the diversified-ensemble component. However, the manuscript provides no quantitative test of whether this failure mode persists when the underlying traffic and passenger-demand stochasticity are altered (e.g., different variance schedules or corridor topologies), which is required to establish that the dual mechanism generalizes beyond the specific simulation used for demonstration.

minor comments (2)

[Abstract and §5] The definition and exact computation of the Mahalanobis rareness metric used to identify out-of-distribution states should be stated explicitly (including the covariance estimator) rather than referenced only by name.
[§3] Notation for the IPM regularization coefficient and the ensemble diversity parameter should be introduced once in §3 and used consistently thereafter; currently the symbols appear to be introduced ad hoc in different subsections.
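
As an illustration of what an explicit definition of the rareness metric could look like, the sketch below computes a Mahalanobis distance from the empirical distribution of visited states. The sample covariance with a small ridge term is our assumption, as are the function name and the synthetic data; the manuscript's actual estimator is exactly what the first minor comment asks the authors to state.

```python
import numpy as np

def mahalanobis_rareness(states, reference, ridge=1e-6):
    """Mahalanobis distance of each row of `states` from the empirical
    distribution of `reference` (e.g. states stored in a replay buffer)."""
    mu = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)
    cov = cov + ridge * np.eye(cov.shape[0])   # small ridge keeps the inverse stable
    cov_inv = np.linalg.inv(cov)
    diff = states - mu
    # Row-wise quadratic form: diff_i^T  cov_inv  diff_i
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(d2)

rng = np.random.default_rng(0)
buffer_states = rng.normal(size=(1000, 3))               # synthetic stand-in for visited states
queries = np.array([[0.0, 0.0, 0.0], [4.0, 4.0, 4.0]])   # a typical state and a rare one
print(mahalanobis_rareness(queries, buffer_states))
```

States can then be binned by this distance to produce bands like those in Figure 3; the band edges would also need to be reported.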

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the revisions we plan to incorporate to strengthen the presentation and claims.


Referee: [§3.2, IPM regularization derivation] The claim that IPM supplies a closed-form analytical lower bound on the robust Bellman operator without inner-loop perturbations is load-bearing for the aleatoric-hedging mechanism. Standard IPM definitions involve a supremum over a function class; when the critic is a neural network this supremum is not closed-form unless the dual is restricted to a parametric family or a surrogate metric is substituted. The manuscript does not state which restriction is used or whether additional gradient steps remain inside the bound, leaving the claimed separation of aleatoric hedging from epistemic ensemble variance dependent on an unstated assumption.

Authors: We appreciate this observation on the derivation. The IPM regularization in §3.2 employs the Wasserstein IPM, which admits a closed-form dual representation via the Kantorovich-Rubinstein theorem when the critic is constrained to 1-Lipschitz functions. This constraint is enforced during training via a gradient penalty term, yielding the analytical lower bound on the robust Bellman operator without requiring inner-loop optimization or additional gradient steps at evaluation time. We will revise §3.2 to explicitly state the use of the 1-Lipschitz restriction and include the complete duality-based derivation steps.
Revision: yes

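
For readers unfamiliar with the duality invoked in this response, the standard argument runs as follows. This is a generic sketch rather than the paper's derivation: the nominal transition distribution $P_0$, the radius $\epsilon$, and the Lipschitz constant $L_V$ are notation introduced here for illustration.

```latex
% Kantorovich-Rubinstein duality for the Wasserstein-1 distance
W_1(P, P_0) = \sup_{\|f\|_{\mathrm{Lip}} \le 1}
  \left( \mathbb{E}_{P}[f] - \mathbb{E}_{P_0}[f] \right)

% Hence, for any V with Lipschitz constant L_V and any P with W_1(P, P_0) \le \epsilon:
\left| \mathbb{E}_{P}[V] - \mathbb{E}_{P_0}[V] \right|
  \le L_V \, W_1(P, P_0) \le \epsilon \, L_V

% which lower-bounds the worst-case expectation without an inner minimization:
\inf_{P \,:\, W_1(P, P_0) \le \epsilon} \mathbb{E}_{P}[V]
  \;\ge\; \mathbb{E}_{P_0}[V] - \epsilon \, L_V
```

The bound is only as tight as the available estimate of $L_V$. For a neural critic, $L_V$ is usually replaced by an upper bound or controlled through a penalty, which is where the restriction the referee asks about enters.
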
Referee: [§5, Experimental results] The headline performance numbers (cumulative reward -0.4e6 vs. -0.55e6; MAE reduction from 4343 to 1647) are presented without error bars, statistical tests, or reporting of the number of independent random seeds. Because the central claim is superior robustness under high traffic variability, the absence of these quantifiers makes it impossible to determine whether the observed gains are statistically reliable or sensitive to particular simulation parameter choices.

Authors: We agree that the current reporting lacks necessary statistical detail. In the revised manuscript we will present all headline metrics as means over 10 independent random seeds, accompanied by standard error bars. We will also add paired statistical tests (Wilcoxon signed-rank) to quantify the significance of the reward and MAE improvements relative to vanilla SAC and other baselines.
Revision: yes

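
The reporting promised here is straightforward to sketch. The example below computes a mean with its standard error and an exact two-sided Wilcoxon signed-rank p-value for paired per-seed scores, using only the Python standard library. The per-seed numbers are synthetic placeholders on an arbitrary scale, not results from the paper, and the helper names are ours.

```python
import itertools
import random
import statistics

def mean_and_se(xs):
    """Sample mean and standard error of the mean."""
    return statistics.mean(xs), statistics.stdev(xs) / len(xs) ** 0.5

def wilcoxon_signed_rank_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Enumerates every sign pattern, so it is only practical for small n.
    Zero differences are dropped; tied |differences| get average ranks.
    Returns (W+, p-value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1          # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    total = sum(ranks)
    observed = abs(w_plus - total / 2)
    # Under the null, each difference is positive or negative with probability 1/2.
    extreme = 0
    for signs in itertools.product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - total / 2) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n

# Synthetic placeholder scores for 10 seeds (NOT results from the paper).
rng = random.Random(0)
baseline = [rng.gauss(0.0, 1.0) for _ in range(10)]
proposed = [b + rng.gauss(1.0, 0.5) for b in baseline]

print("baseline mean, SE:", mean_and_se(baseline))
print("proposed mean, SE:", mean_and_se(proposed))
print("W+, exact p:", wilcoxon_signed_rank_exact(proposed, baseline))
```

With 10 seeds the smallest attainable two-sided p-value is 2/1024, so the test can reach conventional significance thresholds only if nearly all seeds favor the same method.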
Referee: [§4.3, Ablation study] The post-hoc diagnosis that ensemble variance misidentifies noise as a data gap is presented as motivation for the diversified-ensemble component. However, the manuscript provides no quantitative test of whether this failure mode persists when the underlying traffic and passenger-demand stochasticity are altered (e.g., different variance schedules or corridor topologies), which is required to establish that the dual mechanism generalizes beyond the specific simulation used for demonstration.

Authors: The ablation in §4.3 was performed on the primary bidirectional corridor to isolate the identified failure mode. To address generalizability, the revised manuscript will include an extended ablation with two additional settings: (i) doubled traffic variance and (ii) an alternative three-stop corridor topology. We will report quantitative metrics showing whether ensemble variance continues to misidentify noise as epistemic uncertainty and whether the diversified ensemble mitigates this under the altered stochastic conditions.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; mechanisms introduced independently and validated via simulation


The paper's core derivation introduces IPM-based critic regularization as an explicit hedge for aleatoric risk and a diversified Q-ensemble for epistemic risk. These are motivated by identifying conflation of uncertainties in standard SAC, then proposed as distinct algorithmic additions rather than being defined in terms of each other or the final reward numbers. Claims of providing a smooth analytical lower bound and preventing ensemble misidentification are presented as consequences of the new components, with empirical support from bus corridor simulations and Mahalanobis analysis. No load-bearing step reduces by construction to a fit, self-citation chain, or renaming of inputs; the central claims retain independent content outside the reported metrics.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the fidelity of the bidirectional bus corridor simulation and on the premise that conflating the two uncertainty types produces the observed value underestimation; no new physical entities are postulated, but several algorithmic hyperparameters are implicitly present.

free parameters (2)

IPM regularization coefficient: Controls the strength of the weight regularization applied to the critic to hedge aleatoric risk; value not stated in the abstract.
Ensemble diversity parameter: Determines how the Q-ensemble is diversified to penalize overconfident estimates; value not stated in the abstract.

axioms (1)

Domain assumption: The realistic bidirectional bus corridor simulation accurately reproduces the stochastic traffic and passenger demand that drive policy collapse in standard SAC. Invoked when claiming superior robustness and the 62% error reduction in out-of-distribution states.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: IPM-based weight regularization ... smooth analytical lower bound for the robust Bellman operator ... frozen-parameter design ... γ-contraction ... machine-verified in Lean 4 with no sorry (Proof.lean, Counterproof.lean)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: T_REV operator ... fixed penalties κ and Γ_epi ... Blackwell conditions ... continuous-space extension via Banach Fixed-Point Theorem
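
The quoted passages refer to a γ-contraction with fixed penalties and to the Banach fixed-point theorem. The numerical check below illustrates that property for a generic penalized Bellman operator on a small random MDP. It is a stand-in under our own assumptions: the paper's T_REV operator is not defined on this page, the operator here uses a hard max rather than a soft value, and `kappa` is an arbitrary penalty table that does not depend on Q.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 3, 0.9

# Random finite MDP: P[s, a] is a probability distribution over next states.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.normal(size=(n_states, n_actions))
kappa = rng.random((n_states, n_actions))   # fixed (frozen) penalty, independent of Q

def penalized_bellman(Q):
    """(TQ)(s, a) = R(s, a) - kappa(s, a) + gamma * E_{s'}[max_a' Q(s', a')]."""
    return R - kappa + gamma * P @ Q.max(axis=1)

def sup_norm(X):
    return np.abs(X).max()

# Contraction check on random pairs of Q-tables.
worst_ratio = 0.0
for _ in range(1000):
    Q1 = rng.normal(scale=10.0, size=(n_states, n_actions))
    Q2 = rng.normal(scale=10.0, size=(n_states, n_actions))
    gap_after = sup_norm(penalized_bellman(Q1) - penalized_bellman(Q2))
    worst_ratio = max(worst_ratio, gap_after / sup_norm(Q1 - Q2))
print("worst ratio:", worst_ratio, "gamma:", gamma)

# Banach fixed-point theorem: iterating T converges to its unique fixed point.
Q = np.zeros((n_states, n_actions))
for _ in range(500):
    Q = penalized_bellman(Q)
residual = sup_norm(penalized_bellman(Q) - Q)
print("fixed-point residual:", residual)
```

The penalty cancels in the difference of two operator applications, which is why a frozen penalty leaves the contraction modulus at γ. A penalty that depends on Q or on evolving network weights would need a separate argument.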

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

BAPR: Bayesian amnesic piecewise-robust reinforcement learning for non-stationary continuous control
cs.LG · 2026-05 · unverdicted · novelty 7.0 · full

BAPR combines Bayesian change detection with robust RL, proves the core operator is a contraction via Lean 4, and adapts conservatism after detected regime shifts in continuous control.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] Daganzo Carlos F., “A headway-based approach to eliminate bus bunching: Systematic analysis and comparisons,” Transportation Research Part B: Methodological, vol. 43, no. 10, pp. 913–921, 2009.

[2] Xuan Yiguang, Argote Juan, and Daganzo Carlos F., “Dynamic bus holding strategies for schedule reliability: Optimal linear control and performance analysis,” Transportation Research Part B: Methodological, vol. 45, no. 10, pp. 1831–1845, 2011.

[3] Wang Jiawei and Sun Lijun, “Dynamic holding control to avoid bus bunching: A multi-agent deep reinforcement learning framework,” Transportation Research Part C: Emerging Technologies, vol. 116, p. 102661, 2020.

[4] Ji Tianying, Luo Yu, Sun Fuchun, Zhan Xianyuan, Zhang Jianwei, and Xu Huazhe, “Seizing serendipity: Exploiting the value of past success in off-policy actor-critic,” in International Conference on Machine Learning (ICML), 2024, pp. 21672–21718.

[5] Depeweg Stefan, Hernández-Lobato José Miguel, Doshi-Velez Finale, and Udluft Steffen, “Decomposition of uncertainty in bayesian deep learning for efficient and risk-sensitive learning,” in International Conference on Machine Learning (ICML), 2018, pp. 1184–1193.

[6] Clements William R., Robaglia Benoît-Marie, van Delft Bastien, Slaoui Reda Bahi, and Toth Sébastien, “Estimating risk and uncertainty in deep reinforcement learning,” arXiv preprint arXiv:1905.09638, 2019.

[7] Hüllermeier Eyke and Waegeman Willem, “Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods,” Machine Learning, vol. 110, no. 3, pp. 457–506, 2021.

[8] An Gaon, Moon Seungyong, Kim Jang-Hyun, and Song Hyun Oh, “Uncertainty-based offline reinforcement learning with diversified q-ensemble,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 34, 2021, pp. 7436–7447.

[9] Kendall Alex and Gal Yarin, “What uncertainties do we need in bayesian deep learning for computer vision?” in Advances in Neural Information Processing Systems (NIPS), vol. 30, 2017, pp. 5580–5590.

[10] Fujimoto Scott, van Hoof Herke, and Meger David, “Addressing function approximation error in actor-critic methods,” in International Conference on Machine Learning (ICML), 2018, pp. 1587–1596.

[11] Derman Esther, Mankowitz Daniel J., and Mannor Shie, “A bayesian approach to robust reinforcement learning,” in Uncertainty in Artificial Intelligence (UAI), 2019, pp. 648–658.

[12] Wiesemann Wolfram, Kuhn Daniel, and Rustem Berç, “Robust markov decision processes,” Mathematics of Operations Research, vol. 38, no. 1, pp. 153–183, 2013.

[13] Osogami Takayuki, “Robustness and risk-sensitivity in markov decision processes,” in Advances in Neural Information Processing Systems (NIPS), vol. 25, 2012, pp. 233–241.

[14] Xu Huan, Caramanis Constantine, and Mannor Shie, “Robust regression and lasso,” in Advances in Neural Information Processing Systems (NIPS), vol. 21, 2008, pp. 1801–1808.

[15] Xu Huan and Caramanis Constantine, “Robustness and regularization of support vector machines,” Journal of Machine Learning Research, vol. 10, no. 7, pp. 1485–1510, 2009.

[16] Zhou Ruida, Liu Tao, Cheng Min, Kalathil Dileep, Kumar P. R., and Tian Chao, “Natural actor-critic for robust reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023, pp. 30424–30438.

[17] Panaganti Kishan, Wierman Adam, and Mazumdar Eric, “Model-free robust ϕ-divergence reinforcement learning using both offline and online data,” in International Conference on Machine Learning (ICML), 2024, pp. 39389–39459.

[18] Asadi Kavosh, Misra Dipendra, and Littman Michael L., “Lipschitz continuity in model-based reinforcement learning,” in International Conference on Machine Learning (ICML), 2018, pp. 264–273.

[19] Osband Ian, Blundell Charles, Pritzel Alexander, and Van Roy Benjamin, “Deep exploration via bootstrapped dqn,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 29, 2016, pp. 4026–4034.

[20] Zhang Yifan and Zheng Liang, “Single agent robust deep reinforcement learning for bus fleet control,” arXiv preprint, 2025.

[21] Yang Rui, Bai Chenjia, Ma Xiaoteng, Wang Zhen, Zhang Chongjie, and Han Li, “Rorl: Robust offline reinforcement learning via conservative smoothing,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022, pp. 26613–26624.

[22] Xuan Yiguang and Daganzo Carlos F., “Dynamic bus holding strategies for schedule reliability: Optimal linear control and performance analysis,” Transportation Research Part B: Methodological, vol. 45, pp. 1831–1845, 2011.

[23] Gronauer Sven and Diepold Klaus, “Multi-agent deep reinforcement learning: a survey,” Artificial Intelligence Review, vol. 55, no. 2, pp. 895–943, 2022.

[24] Papoudakis Georgios, Christianos Filippos, Rahman Arrasy, and Albrecht Stefano V., “Dealing with non-stationarity in multi-agent deep reinforcement learning,” arXiv preprint arXiv:1906.04737, 2019.

[25] Vinyals Oriol, Babuschkin Igor, Czarnecki Wojciech M., Mathieu Michaël, Dudzik Andrew, Chung Junyoung, Choi David H., Powell Richard, Ewalds Timo, Georgiev Petko et al., “Grandmaster level in starcraft ii using multi-agent reinforcement learning,” Nature, vol. 575, no. 7782, pp. 350–354, 2019.

[26] Chow Yinlam and Ghavamzadeh Mohammad, “Algorithms for cvar optimization in mdps,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 27, 2014, pp. 3509–3517.

[27] Chow Yinlam, Tamar Aviv, Mannor Shie, and Pavone Marco, “Risk-sensitive and robust decision-making: A cvar optimization approach,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 28, 2015, pp. 1522–1530.

[28] Bellemare Marc G., Dabney Will, and Munos Rémi, “A distributional perspective on reinforcement learning,” in International Conference on Machine Learning (ICML), 2017, pp. 449–458.

[29] Dabney Will, Rowland Mark, Bellemare Marc G., and Munos Rémi, “Distributional reinforcement learning with quantile regression,” in AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018, pp. 2892–2901.

[30] Fei Yingjie, Yang Zhuoran, Chen Yudong, Wang Zhaoran, and Xie Qiaomin, “Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 22384–22395.

[31] Iyengar Garud N., “Robust dynamic programming,” Mathematics of Operations Research, vol. 30, no. 2, pp. 257–280, 2005.

[32] Geist Matthieu, Scherrer Bruno, and Pietquin Olivier, “A theory of regularized markov decision processes,” in International Conference on Machine Learning (ICML), 2019, pp. 2160–2169.

[33] Van Hasselt Hado, “Double q-learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 23, 2010, pp. 2613–2621.

[34] Haarnoja Tuomas, Zhou Aurick, Abbeel Pieter, and Levine Sergey, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning (ICML), 2018, pp. 1861–1870.

[35] Kumar Aviral, Zhou Aurick, Tucker George, and Levine Sergey, “Conservative q-learning for offline reinforcement learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 1179–1191.

[36] Ma Xiaoteng, Chen Junyao, Xia Li, Yang Jun, Zhao Qianchuan, and Zhou Zhengyuan, “Dsac: Distributional soft actor-critic for risk-sensitive reinforcement learning,” Journal of Artificial Intelligence Research, vol. 83, pp. 1–28, 2025, article 4.

[37] C. Villani, Optimal transport: old and new. Springer, 2009, vol. 338.

[38] Eysenbach Benjamin and Levine Sergey, “Maximum entropy rl (provably) solves some robust rl problems,” in International Conference on Learning Representations (ICLR), 2022.

[39] D. Blackwell, “Discounted dynamic programming,” The Annals of Mathematical Statistics, vol. 36, no. 1, pp. 226–235, 1965.

[40] Mnih Volodymyr, Kavukcuoglu Koray, Silver David, Rusu Andrei A., Veness Joel, Bellemare Marc G., Graves Alex, Riedmiller Martin, Fidjeland Andreas K., Ostrovski Georg et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.

[41] van Hasselt Hado, Doron Yotam, Strub Florian, Hessel Matteo, Sonnerat Nicolas, and Modayil Joseph, “Deep reinforcement learning and the deadly triad,” in NeurIPS 2018 Workshop on Deep Reinforcement Learning, 2018.

Appendix: Theoretical foundations and derivations
A Sample complexity analysis
We compare the theoretical sample complexity (data efficiency)...

[42] The penalties $\kappa$ and $\Gamma_{\mathrm{epi}}$ change across training iterations as the network weights and target ensemble evolve. A rigorous regret analysis would need to account for this non-stationarity (cf. the per-step discussion in §E.1).

[43] The shifted reward $\tilde{R}$ may have a different range than $R$, affecting the constants in the bound. The conceptual takeaway is that by replacing tail-distribution estimation (which requires exponential samples) with a deterministic structural penalty (which requires no additional samples), RE-SAC sidesteps the mechanism that causes the exponential barrier. B The c...

[44] Decomposition (Depeweg et al.): We distinguish between reducible and irreducible uncertainty:

$$\underbrace{H[p(s' \mid s, a)]}_{\text{Total Uncertainty}} = \underbrace{\mathbb{E}_{q(w)}\left[H(p(s' \mid s, a, w))\right]}_{\text{Aleatoric (Expected Data Noise)}} + \underbrace{I(s'; w)}_{\text{Epistemic (Mutual Information)}}, \tag{26}$$

where $I(s'; w)$ is the Mutual Information between the weights and the prediction.
• Epistemic Risk: Corresponds to the spread of the posterior $q(w)$ ...

[45] Robustness as Bayesian proxy (Derman et al.): Derman et al. show that optimizing for the worst-case model within a Bayesian "Credible Set" (Ambiguity Set $\mathcal{U}$) is a lower bound on the true Bayesian optimal value:

$$V_{\mathrm{Bayes}}(s) \ge \max_{\pi} \min_{P \in \mathcal{U}_\alpha} \mathbb{E}[R]. \tag{27}$$

This justifies our approach: we use the Ensemble to define the Credible Set (where the model might be), and we op...
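
The decomposition attributed to Depeweg et al. in Eq. (26) can be checked numerically for a small ensemble of discrete predictive distributions. The sketch below treats the ensemble members as equally weighted samples from $q(w)$, which is our simplifying assumption, and uses hand-picked toy distributions rather than anything from the paper.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def decompose(member_probs):
    """Split total predictive entropy into aleatoric and epistemic parts.

    member_probs: list of per-member distributions p(s' | s, a, w_k),
    treated as equally weighted samples from q(w).
    """
    k = len(member_probs)
    n = len(member_probs[0])
    mixture = [sum(p[i] for p in member_probs) / k for i in range(n)]
    total = entropy(mixture)                                # H[p(s' | s, a)]
    aleatoric = sum(entropy(p) for p in member_probs) / k   # E_q(w)[H(p(s' | s, a, w))]
    epistemic = total - aleatoric                           # I(s'; w)
    return total, aleatoric, epistemic

# Members agree on a noisy outcome: the uncertainty is entirely aleatoric.
agree = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
# Members are individually confident but disagree: the uncertainty is mostly epistemic.
disagree = [[0.99, 0.01], [0.01, 0.99]]

print(decompose(agree))
print(decompose(disagree))
```

Both cases have the same total entropy (log 2), yet the split differs sharply. A single scalar "risk" cannot tell them apart, which is the conflation the paper sets out to avoid.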

[1] [1]

A headway-based approach to eliminate bus bunching: Systematic analysis and comparisons,

Daganzo Carlos F., “A headway-based approach to eliminate bus bunching: Systematic analysis and comparisons,”Transportation Research Part B: Methodological, vol. 43, no. 10, pp. 913–921, 2009

work page 2009

[2] [2]

Dynamic bus holding strategies for schedule reliability: Optimal linear control and performance analysis,

Xuan Yiguang, Argote Juan, and Daganzo Carlos F., “Dynamic bus holding strategies for schedule reliability: Optimal linear control and performance analysis,”Transportation Research Part B: Methodological, vol. 45, no. 10, pp. 1831–1845, 2011

work page 2011

[3] Wang Jiawei and Sun Lijun, “Dynamic holding control to avoid bus bunching: A multi-agent deep reinforcement learning framework,” Transportation Research Part C: Emerging Technologies, vol. 116, p. 102661, 2020.

[4] Ji Tianying, Luo Yu, Sun Fuchun, Zhan Xianyuan, Zhang Jianwei, and Xu Huazhe, “Seizing serendipity: Exploiting the value of past success in off-policy actor-critic,” in International Conference on Machine Learning (ICML), 2024, pp. 21672–21718.

[5] Depeweg Stefan, Hernández-Lobato José Miguel, Doshi-Velez Finale, and Udluft Steffen, “Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning,” in International Conference on Machine Learning (ICML), 2018, pp. 1184–1193.

[6] Clements William R., Robaglia Benoît-Marie, van Delft Bastien, Slaoui Reda Bahi, and Toth Sébastien, “Estimating risk and uncertainty in deep reinforcement learning,” arXiv preprint arXiv:1905.09638, 2019.

[7] Hüllermeier Eyke and Waegeman Willem, “Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods,” Machine Learning, vol. 110, no. 3, pp. 457–506, 2021.

[8] An Gaon, Moon Seungyong, Kim Jang-Hyun, and Song Hyun Oh, “Uncertainty-based offline reinforcement learning with diversified Q-ensemble,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 34, 2021, pp. 7436–7447.

[9] Kendall Alex and Gal Yarin, “What uncertainties do we need in Bayesian deep learning for computer vision?” in Advances in Neural Information Processing Systems (NIPS), vol. 30, 2017, pp. 5580–5590.

[10] Fujimoto Scott, van Hoof Herke, and Meger David, “Addressing function approximation error in actor-critic methods,” in International Conference on Machine Learning (ICML), 2018, pp. 1587–1596.

[11] Derman Esther, Mankowitz Daniel J., and Mannor Shie, “A Bayesian approach to robust reinforcement learning,” in Uncertainty in Artificial Intelligence (UAI), 2019, pp. 648–658.

[12] Wiesemann Wolfram, Kuhn Daniel, and Rustem Berç, “Robust Markov decision processes,” Mathematics of Operations Research, vol. 38, no. 1, pp. 153–183, 2013.

[13] Osogami Takayuki, “Robustness and risk-sensitivity in Markov decision processes,” in Advances in Neural Information Processing Systems (NIPS), vol. 25, 2012, pp. 233–241.

[14] Xu Huan, Caramanis Constantine, and Mannor Shie, “Robust regression and Lasso,” in Advances in Neural Information Processing Systems (NIPS), vol. 21, 2008, pp. 1801–1808.

[15] Xu Huan, Caramanis Constantine, and Mannor Shie, “Robustness and regularization of support vector machines,” Journal of Machine Learning Research, vol. 10, no. 7, pp. 1485–1510, 2009.

[16] Zhou Ruida, Liu Tao, Cheng Min, Kalathil Dileep, Kumar P. R., and Tian Chao, “Natural actor-critic for robust reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023, pp. 30424–30438.

[17] Panaganti Kishan, Wierman Adam, and Mazumdar Eric, “Model-free robust ϕ-divergence reinforcement learning using both offline and online data,” in International Conference on Machine Learning (ICML), 2024, pp. 39389–39459.

[18] Asadi Kavosh, Misra Dipendra, and Littman Michael L., “Lipschitz continuity in model-based reinforcement learning,” in International Conference on Machine Learning (ICML), 2018, pp. 264–273.

[19] Osband Ian, Blundell Charles, Pritzel Alexander, and Van Roy Benjamin, “Deep exploration via bootstrapped DQN,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 29, 2016, pp. 4026–4034.

[20] Zhang Yifan and Zheng Liang, “Single agent robust deep reinforcement learning for bus fleet control,” arXiv preprint, 2025.

[21] Yang Rui, Bai Chenjia, Ma Xiaoteng, Wang Zhen, Zhang Chongjie, and Han Li, “RORL: Robust offline reinforcement learning via conservative smoothing,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, 2022, pp. 26613–26624.

[22] Xuan Yiguang and Daganzo Carlos F., “Dynamic bus holding strategies for schedule reliability: Optimal linear control and performance analysis,” Transportation Research Part B: Methodological, vol. 45, pp. 1831–1845, 2011.

[23] Gronauer Sven and Diepold Klaus, “Multi-agent deep reinforcement learning: a survey,” Artificial Intelligence Review, vol. 55, no. 2, pp. 895–943, 2022.

[24] Papoudakis Georgios, Christianos Filippos, Rahman Arrasy, and Albrecht Stefano V., “Dealing with non-stationarity in multi-agent deep reinforcement learning,” arXiv preprint arXiv:1906.04737, 2019.

[25] Vinyals Oriol, Babuschkin Igor, Czarnecki Wojciech M., Mathieu Michaël, Dudzik Andrew, Chung Junyoung, Choi David H., Powell Richard, Ewalds Timo, Georgiev Petko et al., “Grandmaster level in StarCraft II using multi-agent reinforcement learning,” Nature, vol. 575, no. 7782, pp. 350–354, 2019.

[26] Chow Yinlam and Ghavamzadeh Mohammad, “Algorithms for CVaR optimization in MDPs,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 27, 2014, pp. 3509–3517.

[27] Chow Yinlam, Tamar Aviv, Mannor Shie, and Pavone Marco, “Risk-sensitive and robust decision-making: A CVaR optimization approach,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 28, 2015, pp. 1522–1530.

[28] Bellemare Marc G., Dabney Will, and Munos Rémi, “A distributional perspective on reinforcement learning,” in International Conference on Machine Learning (ICML), 2017, pp. 449–458.

[29] Dabney Will, Rowland Mark, Bellemare Marc G., and Munos Rémi, “Distributional reinforcement learning with quantile regression,” in AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018, pp. 2892–2901.

[30] Fei Yingjie, Yang Zhuoran, Chen Yudong, Wang Zhaoran, and Xie Qiaomin, “Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 22384–22395.

[31] Iyengar Garud N., “Robust dynamic programming,” Mathematics of Operations Research, vol. 30, no. 2, pp. 257–280, 2005.

[32] Geist Matthieu, Scherrer Bruno, and Pietquin Olivier, “A theory of regularized Markov decision processes,” in International Conference on Machine Learning (ICML), 2019, pp. 2160–2169.

[33] Van Hasselt Hado, “Double Q-learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 23, 2010, pp. 2613–2621.

[34] Haarnoja Tuomas, Zhou Aurick, Abbeel Pieter, and Levine Sergey, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning (ICML), 2018, pp. 1861–1870.

[35] Kumar Aviral, Zhou Aurick, Tucker George, and Levine Sergey, “Conservative Q-learning for offline reinforcement learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 1179–1191.

[36] Ma Xiaoteng, Chen Junyao, Xia Li, Yang Jun, Zhao Qianchuan, and Zhou Zhengyuan, “DSAC: Distributional soft actor-critic for risk-sensitive reinforcement learning,” Journal of Artificial Intelligence Research, vol. 83, pp. 1–28, 2025, article 4.

[37] C. Villani, Optimal transport: old and new. Springer, 2009, vol. 338.

[38] Eysenbach Benjamin and Levine Sergey, “Maximum entropy RL (provably) solves some robust RL problems,” in International Conference on Learning Representations (ICLR), 2022.

[39] D. Blackwell, “Discounted dynamic programming,” The Annals of Mathematical Statistics, vol. 36, no. 1, pp. 226–235, 1965.

[40] Mnih Volodymyr, Kavukcuoglu Koray, Silver David, Rusu Andrei A., Veness Joel, Bellemare Marc G., Graves Alex, Riedmiller Martin, Fidjeland Andreas K., Ostrovski Georg et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.

[41] van Hasselt Hado, Doron Yotam, Strub Florian, Hessel Matteo, Sonnerat Nicolas, and Modayil Joseph, “Deep reinforcement learning and the deadly triad,” in NeurIPS 2018 Workshop on Deep Reinforcement Learning, 2018.

Appendix: Theoretical foundations and derivations

A Sample complexity analysis

We compare the theoretical sample complexity (data efficiency)...

The penalties $\kappa$ and $\Gamma_{\text{epi}}$ change across training iterations as the network weights and target ensemble evolve. A rigorous regret analysis would need to account for this non-stationarity (cf. the per-step discussion in §E.1).
