BAPR: Bayesian amnesic piecewise-robust reinforcement learning for non-stationary continuous control

Liang Zheng; Yifan Zhang

arxiv: 2605.16170 · v2 · pith:LIH2QFNUnew · submitted 2026-05-15 · 💻 cs.LG

Pith reviewed 2026-05-20 20:51 UTC · model grok-4.3

classification 💻 cs.LG

keywords Bayesian online change detection · piecewise stationary · robust reinforcement learning · non-stationary control · formal verification · continuous control · amnesic learning

The pith

BAPR shows that freezing beliefs in a Bayesian mixture of Bellman operators yields a gamma-contraction for piecewise-stationary control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces BAPR to address non-stationary continuous control where dynamics switch between stable regimes. It combines Bayesian online change detection with robust reinforcement learning to balance conservatism and performance. The key result is that the BAPR operator, using frozen beliefs, acts as a gamma-contraction, with formal verification in Lean 4 proving the contraction property and bounding recovery after switches. If true, this allows policies to adaptively become conservative only after detecting changes without catastrophic failure.

Core claim

The BAPR operator is defined as a convex combination of mode-conditional Bellman operators weighted by a frozen belief distribution, and this operator is a γ-contraction. A counterexample shows that if beliefs depend on the Q-function, the contraction factor becomes γ + λΔ, where Δ is the mode reward gap, so contraction fails when γ + λΔ ≥ 1. A component-wise formal error budget is derived and machine-verified for the abstract operator to bound post-switch recovery.

What carries the argument

The BAPR operator, a convex combination of mode-conditional Bellman operators weighted by a frozen belief distribution.
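A small numerical check can make the contraction claim concrete. The sketch below is not the paper's implementation: it assumes a tabular setting, a max-based Bellman optimality backup (the paper builds on SAC), and randomly generated hypothetical modes. The names `make_mode`, `bellman`, `mixture_backup`, and `worst_ratio` are illustrative. It samples a belief once, keeps it fixed, and measures how much the mixture backup shrinks the sup-norm distance between pairs of Q-tables.

```python
import numpy as np

def make_mode(rng, n_states, n_actions):
    """Random reward table and transition kernel for one hypothetical mode."""
    rewards = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))
    trans = rng.uniform(size=(n_states, n_actions, n_states))
    trans /= trans.sum(axis=-1, keepdims=True)  # each row is a distribution
    return rewards, trans

def bellman(q, mode, gamma):
    """Mode-conditional backup: r + gamma * E[max_a' Q(s', a')] (assumed form)."""
    rewards, trans = mode
    return rewards + gamma * trans @ q.max(axis=1)

def mixture_backup(q, modes, belief, gamma):
    """Convex combination of mode backups; the belief never reads q."""
    return sum(w * bellman(q, m, gamma) for w, m in zip(belief, modes))

def worst_ratio(n_trials=200, gamma=0.9, seed=0):
    """Largest observed ratio ||TQ1 - TQ2|| / ||Q1 - Q2|| in sup norm."""
    rng = np.random.default_rng(seed)
    modes = [make_mode(rng, 5, 3) for _ in range(3)]
    belief = rng.dirichlet(np.ones(3))  # frozen: sampled once, then fixed
    worst = 0.0
    for _ in range(n_trials):
        q1 = rng.normal(scale=10.0, size=(5, 3))
        q2 = rng.normal(scale=10.0, size=(5, 3))
        num = np.abs(mixture_backup(q1, modes, belief, gamma)
                     - mixture_backup(q2, modes, belief, gamma)).max()
        worst = max(worst, num / np.abs(q1 - q2).max())
    return worst

if __name__ == "__main__":
    print("worst observed ratio:", worst_ratio())
```

Under these assumptions the observed ratio should never exceed `gamma`, because the belief weights are non-negative, sum to one, and do not change with the Q-table being backed up.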

If this is right

The policy becomes maximally conservative after detected change-points and smoothly relaxes as confidence grows.
Detection delay is O(log(1/δ)).
The formal error budget applies to the abstract mode-mixture operator and bounds post-switch recovery.
Context-conditioning provides mode-aware representations without mode labels at deployment.
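For readers unfamiliar with BOCD, the run-length recursion of Adams and MacKay can be sketched in a few lines. This is a minimal illustration, not the detector used in BAPR: it assumes scalar Gaussian observations with known variance, a Normal prior on the segment mean, and a constant hazard rate. The function name `bocd_gaussian` and all default values are hypothetical.

```python
import numpy as np

def bocd_gaussian(xs, hazard=0.01, prior_mean=0.0, prior_var=10.0, obs_var=1.0):
    """Return the most probable run length after each observation.

    Assumed model: Gaussian data, known variance, unknown per-segment mean.
    """
    run_probs = np.array([1.0])        # run-length posterior before any data
    means = np.array([prior_mean])     # posterior mean of the segment mean
    variances = np.array([prior_var])  # posterior variance of the segment mean
    map_run_lengths = []
    for x in xs:
        # Predictive density of x under every run-length hypothesis.
        pred_var = variances + obs_var
        pred = np.exp(-0.5 * (x - means) ** 2 / pred_var) / np.sqrt(2 * np.pi * pred_var)
        growth = run_probs * pred * (1.0 - hazard)  # current run continues
        change = np.sum(run_probs * pred * hazard)  # a new run starts
        run_probs = np.concatenate(([change], growth))
        run_probs /= run_probs.sum()
        # Conjugate Normal update for continuing runs; the new run uses the prior.
        post_var = 1.0 / (1.0 / variances + 1.0 / obs_var)
        post_mean = post_var * (means / variances + x / obs_var)
        means = np.concatenate(([prior_mean], post_mean))
        variances = np.concatenate(([prior_var], post_var))
        map_run_lengths.append(int(np.argmax(run_probs)))
    return map_run_lengths

if __name__ == "__main__":
    data = [0.0] * 50 + [5.0] * 50  # hypothetical signal with one abrupt shift
    print(bocd_gaussian(data)[45:55])
```

On the hypothetical signal above, the most probable run length grows during the first segment and collapses to a small value right after the shift. A mechanism like the one described in the list could read that collapse as the trigger for maximal conservatism; how BAPR maps the posterior to its conservatism level is not specified in this summary.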

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This design intuition of freezing parameters may enable similar contraction proofs in other mixture-based RL methods.
Machine verification of all theorems suggests a path toward certified RL algorithms for safety-critical applications.
The approach could be tested in real-world systems with abrupt changes like robotics or autonomous driving.

Load-bearing premise

The beliefs must remain frozen and independent of the Q-function to preserve the gamma-contraction property.
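Why the premise matters can be seen in a short derivation. The following is a sketch under stated assumptions, not a transcription of the Lean proof: each mode operator $T_m$ is a $\gamma$-contraction in the sup norm, there are two modes whose backups differ only by a reward gap of size at most $\Delta$, and a Q-dependent belief $\rho_1(Q)$ is $\lambda$-Lipschitz in $Q$.

```latex
% Frozen belief: \rho_m \ge 0 and \sum_m \rho_m = 1 do not depend on Q.
\| T_\rho Q - T_\rho Q' \|_\infty
  = \Big\| \sum_m \rho_m \,(T_m Q - T_m Q') \Big\|_\infty
  \le \sum_m \rho_m \,\gamma\, \| Q - Q' \|_\infty
  = \gamma\, \| Q - Q' \|_\infty .

% Q-dependent belief, two modes, \rho_2 = 1 - \rho_1:
T_{\rho(Q)} Q - T_{\rho(Q')} Q'
  = \sum_m \rho_m(Q)\,(T_m Q - T_m Q')
    + \big(\rho_1(Q) - \rho_1(Q')\big)\,(T_1 Q' - T_2 Q') .

% With |\rho_1(Q) - \rho_1(Q')| \le \lambda \|Q - Q'\|_\infty
% and \|T_1 Q' - T_2 Q'\|_\infty \le \Delta:
\| T_{\rho(Q)} Q - T_{\rho(Q')} Q' \|_\infty
  \le (\gamma + \lambda \Delta)\, \| Q - Q' \|_\infty .
```

The extra term appears only because the weights move with $Q$; with frozen weights it vanishes and the factor stays at $\gamma$. The bound shown here is an upper bound, and the paper's counterexample is what establishes that it can be attained.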

What would settle it

Construct a scenario where the belief distribution depends on the Q-function and check whether contraction fails exactly when γ + λΔ ≥ 1, as in the Lean-verified counterexample.
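A toy version of such a test fits in a few lines. The construction below is hypothetical and chosen only to make the arithmetic transparent; it is not claimed to be the construction in the Lean file. It assumes a single state, two modes that differ only in reward, and a belief in the high-reward mode that grows linearly with Q at rate `lam` (clipped to [0, 1]).

```python
def backup(q, gamma, lam, r_low, r_high, rho0=0.5, frozen=False):
    """One-state, two-mode mixture backup.

    The belief in the high-reward mode is rho0 when frozen, and
    rho0 + lam * q (clipped to [0, 1]) when it may depend on q.
    """
    rho = rho0 if frozen else min(1.0, max(0.0, rho0 + lam * q))
    return rho * (r_high + gamma * q) + (1.0 - rho) * (r_low + gamma * q)

def lipschitz_ratio(q1, q2, **kw):
    """|T(q1) - T(q2)| / |q1 - q2| for the backup above."""
    return abs(backup(q1, **kw) - backup(q2, **kw)) / abs(q1 - q2)

if __name__ == "__main__":
    # Hypothetical numbers: reward gap 20 and sensitivity 0.01.
    kw = dict(gamma=0.9, lam=0.01, r_low=0.0, r_high=20.0)
    print("Q-dependent belief:", lipschitz_ratio(1.0, 2.0, **kw))
    print("frozen belief:     ", lipschitz_ratio(1.0, 2.0, frozen=True, **kw))
```

In the unclipped region the backup is `r_low + rho0 * gap + (gamma + lam * gap) * q`, so the slope equals gamma plus lam times the reward gap. With the numbers above the slope exceeds one for the Q-dependent belief and stays at `gamma` for the frozen one.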

Figures

Figures reproduced from arXiv: 2605.16170 by Liang Zheng, Yifan Zhang.

**Figure 2.** Training curves on the discrete-regime benchmark (…

**Figure 3.** BOCD detection + adaptive-conservatism dynamics on Ant-v2 discrete-mode (seed …

Original abstract

Real-world control systems frequently operate under \emph{piecewise stationary} conditions, where dynamics remain stable for extended periods before undergoing abrupt regime changes. Standard robust RL methods face a fundamental dilemma: a globally conservative policy wastes performance during stable periods, while a locally adaptive policy risks catastrophic failure when the regime changes undetected. We propose \textbf{BAPR} (Bayesian Amnesic Piecewise-Robust SAC), which unifies Bayesian Online Change Detection (BOCD) with robust ensemble RL. The BAPR operator -- a convex combination of mode-conditional Bellman operators weighted by a frozen belief distribution -- is a $\gamma$-contraction. A complementary counterexample, machine-verified in Lean~4, establishes a \emph{sharp boundary}: when beliefs depend on the Q-function, the contraction factor becomes $\gamma + \lambda\Delta$ (where $\Delta$ is the mode reward gap), and contraction fails exactly when $\gamma + \lambda\Delta \geq 1$. We derive a \emph{component-wise} formal error budget for the abstract operator -- every component machine-verified -- bounding post-switch recovery; the budget applies to the abstract mode-mixture operator and inherits to the implemented shared-critic algorithm only through the frozen-parameter design intuition. All results are formally verified with no \texttt{sorry} (1,145 lines across 3 Lean~4 files, 22 machine-verified theorems). BOCD drives an adaptive conservatism mechanism: the policy becomes maximally conservative after detected change-points and smoothly relaxes as confidence grows, with detection delay $O(\log(1/\delta))$. A context-conditioning module trained via RMDM loss provides mode-aware representations from simulator-provided mode IDs at training time and requires no mode labels at deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BAPR verifies a gamma-contraction for its abstract mode-mixture operator in Lean but the transfer to the shared-critic implementation rests on unverified intuition about frozen parameters.

The main thing here is a machine-checked proof that the BAPR operator, a convex combination of mode-conditional Bellman operators under a frozen belief, is a gamma-contraction, plus a verified counterexample showing that any Q-dependence breaks it when gamma plus lambda Delta reaches 1. That formal piece is the clearest contribution and it is done properly with 22 theorems and no sorries in 1145 lines of Lean 4. The paper also ties this to BOCD for adaptive conservatism and adds a context-conditioning module that uses simulator mode IDs only at training time. Those elements are new relative to the cited prior work on robust ensemble RL and change detection. The component-wise error budget for post-switch recovery is another solid formal result that applies directly to the abstract operator. The soft spot is exactly the one the stress-test flags: the manuscript states that the verified budget inherits to the implemented shared-critic algorithm only through the frozen-parameter design intuition. If the BOCD updates, the context module, or the critic training introduce even indirect Q-dependence, the guarantee does not carry over. That gap is real and worth checking in the full code and training loop. The paper is aimed at researchers who care about formal guarantees in non-stationary continuous control and who are willing to read both the Lean files and the empirical sections. It deserves a serious referee because the formal work is reproducible and externally grounded rather than hand-wavy. I would send it out for review with a request that the authors either verify the implementation path or clearly bound the remaining assumptions.

Referee Report

1 major / 2 minor

Summary. The paper proposes BAPR, a Bayesian amnesic piecewise-robust SAC algorithm for non-stationary continuous control under piecewise-stationary dynamics. It defines the BAPR operator as a convex combination of mode-conditional Bellman operators weighted by a frozen belief distribution and claims this operator is a γ-contraction. The work provides a machine-verified counterexample in Lean 4 showing that Q-dependence changes the contraction factor to γ + λΔ with failure when γ + λΔ ≥ 1, derives a component-wise formal error budget for post-switch recovery (also machine-verified), and describes an implementation using BOCD for adaptive conservatism, a context-conditioning module via RMDM loss, and a shared-critic architecture. All formal results are supported by 22 theorems with no sorries across 1,145 lines of Lean 4.

Significance. If the verified contraction and error budget transfer to the implemented algorithm, the result would be significant for robust RL in real-world non-stationary settings by enabling adaptive conservatism with explicit recovery bounds after regime switches. The machine-checked proofs (22 theorems, 1,145 lines, no sorries) and the sharp counterexample delineating the dependence boundary are notable strengths that provide external grounding beyond self-referential arguments.

major comments (1)

[Abstract] Abstract: the statement that the component-wise formal error budget 'applies to the abstract mode-mixture operator and inherits to the implemented shared-critic algorithm only through the frozen-parameter design intuition' identifies a load-bearing gap. The counterexample (machine-verified in Lean 4) shows that any dependence on the Q-function changes the contraction factor to γ + λΔ and causes failure when γ + λΔ ≥ 1; without a formal argument or additional verification that BOCD updates, context-conditioning, and shared-critic training preserve the required independence, the guarantee does not transfer to the practical algorithm.

minor comments (2)

Clarify the precise definition of the frozen belief distribution and its separation from the Q-function in the implementation description to make the design intuition more explicit.
The notation for λ and Δ in the contraction factor should be introduced with an equation reference when first used in the abstract and main text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The comment correctly identifies a distinction between the scope of the machine-verified results and their transfer to the implemented algorithm. We address this point directly below and have revised the abstract accordingly.

Referee: [Abstract] Abstract: the statement that the component-wise formal error budget 'applies to the abstract mode-mixture operator and inherits to the implemented shared-critic algorithm only through the frozen-parameter design intuition' identifies a load-bearing gap. The counterexample (machine-verified in Lean 4) shows that any dependence on the Q-function changes the contraction factor to γ + λΔ and causes failure when γ + λΔ ≥ 1; without a formal argument or additional verification that BOCD updates, context-conditioning, and shared-critic training preserve the required independence, the guarantee does not transfer to the practical algorithm.

Authors: We agree that the contraction property and component-wise error budget are formally established only for the abstract mode-mixture operator under the assumption of a belief distribution that remains independent of the Q-function. The machine-verified counterexample precisely delineates the failure boundary when this independence is lost. In the implemented BAPR algorithm, independence is maintained by construction: the BOCD change detector operates exclusively on the observed reward and state sequence, the RMDM context-conditioning loss is optimized separately using simulator mode labels available only at training time, and the shared critic applies the belief-weighted operator with parameters frozen during each Bellman backup. These choices prevent direct Q-to-belief feedback loops. Nevertheless, we acknowledge that a complete machine-checked argument covering the full stochastic optimization dynamics of BOCD and gradient updates is absent and would require substantial additional formalization effort. We have revised the abstract to state explicitly that the formal results apply to the abstract operator, while the practical algorithm is engineered to respect the independence condition through the frozen-parameter design, with supporting empirical evidence in non-stationary continuous control tasks.

Revision: yes

Circularity Check

0 steps flagged

No circularity detected; central contraction result grounded by external Lean verification

The paper's derivation of the BAPR operator as a γ-contraction rests on explicit machine-checked theorems in Lean 4 (1,145 lines, 22 theorems, no 'sorry'). The abstract and text distinguish the abstract mode-mixture operator (formally proven) from the implemented shared-critic algorithm (inheritance via 'frozen-parameter design intuition' only). This separation avoids any reduction of the claimed result to its own fitted parameters or self-referential definitions. No self-citations, ansatzes smuggled via prior work, or renaming of known results appear as load-bearing steps. The counterexample for Q-dependent beliefs is also externally verified, providing independent grounding rather than circular support. The paper is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract provides limited detail on parameters or entities; the piecewise-stationary regime and frozen-belief independence are treated as domain assumptions rather than derived.

axioms (2)

domain assumption: Environment dynamics are piecewise stationary with abrupt regime changes.
Required for BOCD to detect changes and for the adaptive conservatism schedule to apply.
domain assumption: Belief distribution over modes is frozen and independent of the Q-function.
Explicitly required to keep the contraction factor at γ; dependence produces the counterexample where contraction fails.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "The BAPR operator -- a convex combination of mode-conditional Bellman operators weighted by a frozen belief distribution -- is a γ-contraction. ... machine-verified in Lean 4 with no sorry (BAPR.lean: 560 lines; BAPR-Counterproof.lean: 265 lines)."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "A complementary counterexample, machine-verified in Lean 4, establishes a sharp boundary: when beliefs depend on the Q-function, the contraction factor becomes γ + λΔ"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 2 internal anchors

[1] Adams Ryan P. and MacKay David J.C., “Bayesian online changepoint detection,” arXiv preprint arXiv:0710.3742, 2007.
[2] Benjamins Carolin, Eimer Theresa, Schubert Frederik, Biedenkapp André, Rosenhahn Bodo, Hutter Frank, and Lindauer Marius, “CARL: A benchmark for contextual and adaptive reinforcement learning,” NeurIPS Workshop on Ecological Theory of RL, 2021.
[3] Chen Baiming, Liu Zuxin, Zhu Jiacheng, Xu Mengdi, Ding Wenhao, Li Liang, and Zhao Ding, “Context-aware safe reinforcement learning for non-stationary environments,” in Proc. IEEE Int. Conf. Robotics and Automation (ICRA), 2021, pp. 10689–10695.
[4] Doshi-Velez Finale and Konidaris George, “Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations,” in Proc. Int. Joint Conf. Artificial Intelligence (IJCAI), 2016, pp. 1432–1440.
[5] Finn Chelsea, Abbeel Pieter, and Levine Sergey, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. Int. Conf. Machine Learning (ICML), 2017, pp. 1126–1135.
[6] Garivier Aurélien and Moulines Eric, “On upper-confidence bound policies for switching bandit problems,” in Proc. Algorithmic Learning Theory (ALT), 2011, pp. 174–188.
[7] Haarnoja Tuomas, Zhou Aurick, Abbeel Pieter, and Levine Sergey, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proc. Int. Conf. Machine Learning (ICML), 2018, pp. 1861–1870.
[8] Hadoux Emmanuel, Beynier Aurélie, and Weng Paul, “Sequential decision-making under non-stationary environments via sequential change-point detection,” in ECML/PKDD Workshop on Learning over Multiple Contexts, 2014.
[9] Iyengar Garud N., “Robust dynamic programming,” Mathematics of Operations Research, vol. 30, no. 2, pp. 257–280, 2005.
[10] Khetarpal Khimya, Riemer Matthew, Rish Irina, and Precup Doina, “Towards continual reinforcement learning: A review and perspectives,” Journal of Artificial Intelligence Research, vol. 75, pp. 1401–1476, 2022.
[11] Killick Rebecca, Fearnhead Paul, and Eckley Idris A., “Optimal detection of changepoints with a linear computational cost,” Journal of the American Statistical Association, vol. 107, no. 500, pp. 1590–1598, 2012.
[12] Lecarpentier Erwan and Rachelson Emmanuel, “Non-stationary Markov decision processes: A worst-case approach using model-based reinforcement learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 32, 2019, pp. 7214–7223.
[13] Luo Fan-Ming, Jiang Shengyi, Yu Yang, Zhang Zongzhang, and Zhang Yi-Feng, “Adapt to environment sudden changes by learning a context sensitive policy,” in Proc. AAAI Conf. Artificial Intelligence, vol. 36, no. 7, 2022, pp. 7637–7646, doi: https://doi.org/10.1609/aaai.v36i7.20730.
[14] Mellor Joseph and Shapiro Jonathan, “Thompson sampling in switching environments with Bayesian online change detection,” in Proc. Int. Conf. Artificial Intelligence and Statistics (AISTATS), 2013, pp. 442–450.
[15] Nagabandi Anusha, Clavera Ignasi, Liu Simin, Fearing Ronald S., Abbeel Pieter, Levine Sergey, and Finn Chelsea, “Learning to adapt in dynamic, real-world environments through meta-reinforcement learning,” in Proc. Int. Conf. Learning Representations (ICLR), 2019.
[16] Osogami Takayuki, “Robustness and risk-sensitivity in Markov decision processes,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 25, 2012, pp. 233–241.
[17] Padakandla Sindhu, “A survey of reinforcement learning algorithms for dynamically varying environments,” ACM Computing Surveys, vol. 54, no. 6, pp. 1–25, 2021.
[18] Panaganti Kishan, Wierman Adam, and Mazumdar Eric, “Model-free robust ϕ-divergence reinforcement learning using both offline and online data,” in Proc. Int. Conf. Machine Learning (ICML), PMLR, vol. 235, 2024, pp. 39324–39363.
[19] Rakelly Kate, Zhou Aurick, Finn Chelsea, Levine Sergey, and Quillen Deirdre, “Efficient off-policy meta-reinforcement learning via probabilistic context variables,” in Proc. Int. Conf. Machine Learning (ICML), 2019, pp. 5331–5340.
[20] Sodhani Shagun, Zhang Amy, and Pineau Joelle, “Multi-task reinforcement learning with context-based representations,” in Proc. Int. Conf. Machine Learning (ICML), 2021, pp. 9767–9779.
[21] Wiesemann Wolfram, Kuhn Daniel, and Rustem Berç, “Robust Markov decision processes,” Mathematics of Operations Research, vol. 38, no. 1, pp. 153–183, 2013.
[22] Xu Huan, Caramanis Constantine, and Mannor Shie, “Robust regression and Lasso,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 21, 2008, pp. 1801–1808.
[23] Xu Huan, Caramanis Constantine, and Mannor Shie, “Robustness and regularization of support vector machines,” Journal of Machine Learning Research, vol. 10, pp. 1485–1510, 2009.
[24] Zhang Yifan and Zheng Liang, “RE-SAC: Disentangling aleatoric and epistemic risks in bus fleet control: A stable and robust ensemble DRL approach,” arXiv preprint arXiv:2603.18396, 2026.
[25] Zhang Yifan, “Single Agent Robust Deep Reinforcement Learning for Bus Fleet Control,” Transportation Safety and Environment, article tdag005, 2026, doi: https://doi.org/10.1093/tse/tdag005.
[26] Zhou Ruida, Liu Tao, Cheng Min, Kalathil Dileep, Kumar P. R., and Tian Chao, “Natural actor-critic for robust reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023, pp. 97–133, https://proceedings.neurips.cc/paper_files/paper/2023/file/007f4927e60699392425f267d43f0940-Paper-Co...
[27] Zintgraf Luisa, Shiarlis Kyriacos, Igl Maximilian, Schulze Sebastian, Gal Yarin, Hofmann Katja, and Whiteson Shimon, “VariBAD: A very good method for Bayes-adaptive deep RL via meta-learning,” in Proc. Int. Conf. Learning Representations (ICLR), 2020.
[28] Fei Yingjie, Yang Zhuoran, Chen Yudong, Wang Zhaoran, and Xie Qiaomin, “Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 22384–22395.
[29] Blackwell David, “Discounted dynamic programming,” The Annals of Mathematical Statistics, vol. 36, no. 1, pp. 226–235, 1965.

Appendix: Theoretical foundations and derivations

A Piecewise Q-value convergence rate

We analyze how BAPR’s piecewise-robust design achieves polynomial convergence rates, in contrast to the exponential barrier faced by direct ris...
[30] Perturbation recovery phase ($n_\delta$ steps): The BOCD posterior converges to the correct mode, and $\beta_{\text{eff}}$ reaches its maximal conservatism, providing protection during the transient.
[31] Contraction phase ($T_k - n_\delta$ steps): The frozen-belief BAPR operator contracts toward the segment-specific fixed point $Q^*_k$ at rate $\gamma$:
$$\|Q_t - Q^*_k\|_\infty \le \gamma^{\,t - t_k - n_\delta} \cdot \|Q_{t_k + n_\delta} - Q^*_k\|_\infty + \frac{\varepsilon_{\text{proj}} + \sigma}{1 - \gamma}. \quad (32)$$
By Assumption B.3, the contraction phase has sufficient duration for the error to decay to the irreducible floor $(\varepsilon_{\text{proj}} + \sigma)/(1 - \gamma)$ before the next switch. Proof sketch. The proof proceeds by structural induction over the sequence of segments. Base case: The first segment $[t_0, t_1)$ has no prior contamination...
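The contraction-phase bound (32) has two parts: a geometrically decaying term and an irreducible floor. The sketch below evaluates both for hypothetical inputs. The function names and every numeric value are illustrative assumptions; none are taken from the paper's experiments.

```python
import math

def recovery_bound(steps, gamma, initial_error, eps_proj, sigma):
    """Right-hand side of the contraction-phase bound after `steps` backups.

    `steps` plays the role of t - t_k - n_delta in equation (32).
    """
    floor = (eps_proj + sigma) / (1.0 - gamma)
    return gamma ** steps * initial_error + floor

def steps_to_tolerance(gamma, initial_error, tol):
    """Smallest k such that gamma**k * initial_error <= tol."""
    if initial_error <= tol:
        return 0
    return math.ceil(math.log(tol / initial_error) / math.log(gamma))

if __name__ == "__main__":
    gamma, e0 = 0.99, 100.0  # hypothetical discount and post-switch error
    k = steps_to_tolerance(gamma, e0, tol=1.0)
    print("backups until the transient term drops below 1:", k)
    print("bound at that point:", recovery_bound(k, gamma, e0, 0.01, 0.01))
```

With a discount close to one, the floor can dominate the bound long before the transient term vanishes, which is why the segment-length requirement in Assumption B.3 matters.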
[32] Establishing both Blackwell conditions (monotonicity and discounting) for each mode-conditional operator.
[33] Proving that the convex combination preserves both conditions (non-trivial for discounting: requires $\sum \rho = 1$).
[34] Verifying that the Bayesian belief update preserves the probability distribution properties. Notation. The Lean 4 proof uses $h \in H$ (run-length) as the mode index for implementation convenience, since BAPR’s original design indexes modes by run-length. In the paper, we use $m \in M$ to emphasize the conceptual distinction between modes (physical configurations) and...
[35] The belief $\rho$ must be non-negative (for monotonicity preservation via triangle inequality).
[36] The belief must sum to one (for discounting preservation, the load-bearing step).
[37] The belief must be frozen during the Bellman backup (for the penalties to cancel; without this, the counterexample in §D applies).
[38] The Bayesian update must preserve properties (1)–(2) at every step. The Lean proof makes each of these requirements explicit and machine-verified. Violating any one of them would break the proof (and indeed the contraction, as our counterexample demonstrates for condition (3)). C.3 Preservation of contraction under Bayesian belief updates. Definition C.7 (Bayes...
[39] The agent overestimates $Q$ (due to sampling noise or stale data).
[40] The Q-dependent belief shifts toward the higher-reward mode ($\rho_1$ increases).
[41] This inflates the Q-target further (mode 1 has $R_1 > R_2$).
[42] The cycle amplifies, preventing convergence. The critical threshold. The contraction factor is $\gamma + \lambda\Delta$. For typical RL parameters ($\gamma = 0.99$), any $\lambda\Delta \ge 0.01$ breaks contraction. Since $\Delta$ (the mode reward gap) can be large in practice (e.g., $\Delta = 50$ between “normal traffic” and “severe congestion” in bus control), even a tiny sensitivity $\lambda = 0.001$ suffices to breach...
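The threshold arithmetic in the passage above is easy to reproduce. The helper names below are hypothetical; the numbers are the ones quoted in the text (discount 0.99, reward gap 50, sensitivity 0.001).

```python
def contraction_factor(gamma, lam, delta):
    """Contraction factor gamma + lam * delta under Q-dependent beliefs."""
    return gamma + lam * delta

def max_sensitivity(gamma, delta):
    """Value of lam at which gamma + lam * delta reaches exactly 1."""
    return (1.0 - gamma) / delta

if __name__ == "__main__":
    print("factor:", contraction_factor(0.99, 0.001, 50.0))
    print("sensitivity at the boundary:", max_sensitivity(0.99, 50.0))
```

For these inputs the factor is roughly 1.04, above the boundary, and the sensitivity that would sit exactly on the boundary is roughly 0.0002, five times smaller than the quoted 0.001.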
[43] If action $a_1$ has lower ensemble disagreement than $a_2$ at state $s$, the adaptive penalty amplifies the preference for $a_1$ monotonically with $\lambda_w$.
[44] The adaptive penalty cannot reverse the ranking among actions with equal ensemble disagreement. Quantitative Q-value depression. Following the same analysis as RE-SAC [24]:
$$Q^*_{\text{BAPR}}(s, a) = Q^*(s, a) - \frac{\gamma \cdot \Delta_{\text{penalty}}}{1 - \gamma}, \quad (47)$$
where $\Delta_{\text{penalty}}$ includes the belief-weighted contribution. With the experimental configuration ($\gamma = 0.99$, $\lambda_w \le 0.4$ empirically, $c_{\text{penalty}}$ = ...

[1] [1]

Bayesian Online Changepoint Detection

Adams Ryan P. and MacKay David J.C., “Bayesian online changepoint detection,”arXiv preprint arXiv:0710.3742, 2007

work page internal anchor Pith review Pith/arXiv arXiv 2007

[2] [2]

CARL: A benchmark for contextual and adaptive reinforcement learning,

Benjamins Carolin, Eimer Theresa, Schubert Frederik, Biedenkapp André, Rosenhahn Bodo, Hutter Frank, and Lindauer Marius, “CARL: A benchmark for contextual and adaptive reinforcement learning,”NeurIPS Workshop on Ecological Theory of RL, 2021

work page 2021

[3] [3]

Context-aware safe reinforcement learning for non-stationary environments,

Chen Baiming, Liu Zuxin, Zhu Jiacheng, Xu Mengdi, Ding Wenhao, Li Liang, and Zhao Ding, “Context-aware safe reinforcement learning for non-stationary environments,” inProc. IEEE Int. Conf. Robotics and Automation (ICRA), 2021, pp. 10689–10695

work page 2021

[4] [4]

Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations,

Doshi-Velez Finale and Konidaris George, “Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations,” inProc. Int. Joint Conf. Artificial Intelligence (IJCAI), 2016, pp. 1432–1440

work page 2016

[5] [5]

Model-agnostic meta-learning for fast adaptation of deep networks,

Finn Chelsea, Abbeel Pieter, and Levine Sergey, “Model-agnostic meta-learning for fast adaptation of deep networks,” inProc. Int. Conf. Machine Learning (ICML), 2017, pp. 1126–1135

work page 2017

[6] [6]

On upper-confidence bound policies for switching bandit problems,

Garivier Aurélien and Moulines Eric, “On upper-confidence bound policies for switching bandit problems,” inProc. Algorithmic Learning Theory (ALT), 2011, pp. 174–188

work page 2011

[7] [7]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,

Haarnoja Tuomas, Zhou Aurick, Abbeel Pieter, and Levine Sergey, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” inProc. Int. Conf. Machine Learning (ICML), 2018, pp. 1861–1870

work page 2018

[8] [8]

Sequential decision-making under non- stationary environments via sequential change-point detection,

Hadoux Emmanuel, Beynier Aurélie, and Weng Paul, “Sequential decision-making under non- stationary environments via sequential change-point detection,” inECML/PKDD Workshop on Learning over Multiple Contexts, 2014

work page 2014

[9] [9]

Robust dynamic programming,

Iyengar Garud N., “Robust dynamic programming,”Mathematics of Operations Research, vol. 30, no. 2, pp. 257–280, 2005. 20

work page 2005

[10] [10]

Towards continual reinforcement learning: A review and perspectives,

Khetarpal Khimya, Riemer Matthew, Rish Irina, and Precup Doina, “Towards continual reinforcement learning: A review and perspectives,”Journal of Artificial Intelligence Research, vol. 75, pp. 1401–1476, 2022

work page 2022

[11] [11]

Optimal detection of changepoints with a linear computational cost,

Killick Rebecca, Fearnhead Paul, and Eckley Idris A., “Optimal detection of changepoints with a linear computational cost,”Journal of the American Statistical Association, vol. 107, no. 500, pp. 1590–1598, 2012

work page 2012

[12] Lecarpentier Erwan and Rachelson Emmanuel, “Non-stationary Markov decision processes: A worst-case approach using model-based reinforcement learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 32, 2019, pp. 7214–7223.

[13] Luo Fan-Ming, Jiang Shengyi, Yu Yang, Zhang Zongzhang, and Zhang Yi-Feng, “Adapt to environment sudden changes by learning a context sensitive policy,” in Proc. AAAI Conf. Artificial Intelligence, vol. 36, no. 7, 2022, pp. 7637–7646, doi: https://doi.org/10.1609/aaai.v36i7.20730

[14] Mellor Joseph and Shapiro Jonathan, “Thompson sampling in switching environments with Bayesian online change detection,” in Proc. Int. Conf. Artificial Intelligence and Statistics (AISTATS), 2013, pp. 442–450.

[15] Nagabandi Anusha, Clavera Ignasi, Liu Simin, Fearing Ronald S., Abbeel Pieter, Levine Sergey, and Finn Chelsea, “Learning to adapt in dynamic, real-world environments through meta-reinforcement learning,” in Proc. Int. Conf. Learning Representations (ICLR), 2019.

[16] Osogami Takayuki, “Robustness and risk-sensitivity in Markov decision processes,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 25, 2012, pp. 233–241.

[17] Padakandla Sindhu, “A survey of reinforcement learning algorithms for dynamically varying environments,” ACM Computing Surveys, vol. 54, no. 6, pp. 1–25, 2021.

[18] Panaganti Kishan, Wierman Adam, and Mazumdar Eric, “Model-free robust ϕ-divergence reinforcement learning using both offline and online data,” in Proc. Int. Conf. Machine Learning (ICML), PMLR, vol. 235, 2024, pp. 39324–39363.

[19] Rakelly Kate, Zhou Aurick, Finn Chelsea, Levine Sergey, and Quillen Deirdre, “Efficient off-policy meta-reinforcement learning via probabilistic context variables,” in Proc. Int. Conf. Machine Learning (ICML), 2019, pp. 5331–5340.

[20] Sodhani Shagun, Zhang Amy, and Pineau Joelle, “Multi-task reinforcement learning with context-based representations,” in Proc. Int. Conf. Machine Learning (ICML), 2021, pp. 9767–9779.

[21] Wiesemann Wolfram, Kuhn Daniel, and Rustem Berç, “Robust Markov decision processes,” Mathematics of Operations Research, vol. 38, no. 1, pp. 153–183, 2013.

[22] Xu Huan, Caramanis Constantine, and Mannor Shie, “Robust regression and Lasso,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 21, 2008, pp. 1801–1808.

[23] Xu Huan, Caramanis Constantine, and Mannor Shie, “Robustness and regularization of support vector machines,” Journal of Machine Learning Research, vol. 10, pp. 1485–1510, 2009.

[24] Zhang Yifan and Zheng Liang, “RE-SAC: Disentangling aleatoric and epistemic risks in bus fleet control: A stable and robust ensemble DRL approach,” arXiv preprint arXiv:2603.18396, 2026.

[25] Zhang Yifan, “Single Agent Robust Deep Reinforcement Learning for Bus Fleet Control,” Transportation Safety and Environment, article tdag005, 2026, doi: https://doi.org/10.1093/tse/tdag005

[26] Zhou Ruida, Liu Tao, Cheng Min, Kalathil Dileep, Kumar P. R., and Tian Chao, “Natural actor-critic for robust reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023, pp. 97–133, https://proceedings.neurips.cc/paper_files/paper/2023/file/007f4927e60699392425f267d43f0940-Paper-Co...

[27] Zintgraf Luisa, Shiarlis Kyriacos, Igl Maximilian, Schulze Sebastian, Gal Yarin, Hofmann Katja, and Whiteson Shimon, “VariBAD: A very good method for Bayes-adaptive deep RL via meta-learning,” in Proc. Int. Conf. Learning Representations (ICLR), 2020.

[28] Fei Yingjie, Yang Zhuoran, Chen Yudong, Wang Zhaoran, and Xie Qiaomin, “Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 22384–22395.

[29] Blackwell David, “Discounted dynamic programming,” The Annals of Mathematical Statistics, vol. 36, no. 1, pp. 226–235, 1965.

Appendix: Theoretical foundations and derivations

A Piecewise Q-value convergence rate

We analyze how BAPR’s piecewise-robust design achieves polynomial convergence rates, in contrast to the exponential barrier faced by direct ris...

Perturbation recovery phase ($n_\delta$ steps): The BOCD posterior converges to the correct mode, and $\beta_{\mathrm{eff}}$ reaches its maximal conservatism, providing protection during the transient.

Proof sketch. The proof proceeds by structural induction over the sequence of segments: Base case: The first segment $[t_0, t_1)$ has no prior contamination

Contraction phase ($T_k - n_\delta$ steps): The frozen-belief BAPR operator contracts toward the segment-specific fixed point $Q^*_k$ at rate $\gamma$:

$$\|Q_t - Q^*_k\|_\infty \le \gamma^{\,t - t_k - n_\delta}\,\|Q_{t_k + n_\delta} - Q^*_k\|_\infty + \frac{\varepsilon_{\mathrm{proj}} + \sigma}{1 - \gamma}. \tag{32}$$

By Assumption B.3, the contraction phase has sufficient duration for the error to decay to the irreducible floor $(\varepsilon_{\mathrm{proj}} + \sigma)/(1 - \gamma)$ before the next switch. Proof sket...
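As a rough illustration of bound (32), the following Python sketch evaluates its right-hand side and estimates how many contraction steps are needed before the transient term falls below a chosen tolerance. The function names and the numerical values are hypothetical and are not taken from the paper's experiments.

```python
import math


def contraction_phase_bound(gamma, steps, initial_gap, eps_proj, sigma):
    """Right-hand side of bound (32).

    steps       : t - t_k - n_delta, number of contraction-phase steps elapsed
    initial_gap : sup-norm error at the start of the contraction phase
    """
    floor = (eps_proj + sigma) / (1.0 - gamma)  # irreducible error floor
    return gamma ** steps * initial_gap + floor


def steps_until_transient_below(gamma, initial_gap, tol):
    """Smallest integer k with gamma**k * initial_gap <= tol (0 < gamma < 1)."""
    if initial_gap <= tol:
        return 0
    k = math.ceil(math.log(tol / initial_gap) / math.log(gamma))
    # Guard against floating-point rounding at the boundary.
    while gamma ** k * initial_gap > tol:
        k += 1
    return k


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    k = steps_until_transient_below(gamma=0.99, initial_gap=100.0, tol=1.0)
    print(k, contraction_phase_bound(0.99, k, 100.0, 0.01, 0.01))
```

Under this reading, Assumption B.3 amounts to requiring that each segment's contraction phase lasts at least as long as the step count returned by the second function for the tolerance of interest.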

Establishing both Blackwell conditions (monotonicity and discounting) for each mode-conditional operator

Proving that the convex combination preserves both conditions (non-trivial for discounting: requires $\sum \rho = 1$)

Verifying that the Bayesian belief update preserves the probability distribution properties.

Notation. The Lean 4 proof uses h ∈ H (run-length) as the mode index for implementation convenience, since BAPR’s original design indexes modes by run-length. In the paper, we use m ∈ M to emphasize the conceptual distinction between modes (physical configurations) and...

(1) The belief ρ must be non-negative (for monotonicity preservation via triangle inequality)

(2) The belief must sum to one (for discounting preservation, the load-bearing step)

(3) The belief must be frozen during the Bellman backup (for the penalties to cancel; without this, the counterexample in §D applies)

(4) The Bayesian update must preserve properties (1)–(2) at every step.

The Lean proof makes each of these requirements explicit and machine-verified. Violating any one of them would break the proof (and indeed the contraction, as our counterexample demonstrates for condition (3)).

C.3 Preservation of contraction under Bayesian belief updates

Definition C.7(Bayes...
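The effect of a non-negative, normalized, frozen belief can also be checked numerically on a small tabular example. The sketch below is not the paper's Lean development: it assumes plain Bellman optimality operators as the mode-conditional operators (the paper's operators may carry additional robustness penalties), uses randomly generated two-mode MDPs, and all names are hypothetical.

```python
import random


def bellman(Q, R, P, gamma):
    """Mode-conditional Bellman optimality backup on a tabular Q[s][a]."""
    n_s, n_a = len(R), len(R[0])
    v = [max(row) for row in Q]  # greedy state values
    return [[R[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(n_s))
             for a in range(n_a)] for s in range(n_s)]


def frozen_mixture(Q, modes, rho, gamma):
    """Convex combination of mode backups with a belief rho that ignores Q."""
    assert all(w >= 0 for w in rho) and abs(sum(rho) - 1.0) < 1e-9
    backups = [bellman(Q, R, P, gamma) for R, P in modes]
    return [[sum(w * B[s][a] for w, B in zip(rho, backups))
             for a in range(len(Q[0]))] for s in range(len(Q))]


def sup_dist(Q1, Q2):
    return max(abs(x - y) for r1, r2 in zip(Q1, Q2) for x, y in zip(r1, r2))


def random_mode(rng, n_s, n_a):
    """Random rewards and row-stochastic transitions for one mode."""
    R = [[rng.uniform(-1, 1) for _ in range(n_a)] for _ in range(n_s)]
    P = []
    for _ in range(n_s):
        rows = []
        for _ in range(n_a):
            w = [rng.random() + 1e-3 for _ in range(n_s)]
            z = sum(w)
            rows.append([x / z for x in w])
        P.append(rows)
    return R, P


def worst_ratio(seed=0, trials=200, n_s=4, n_a=3, gamma=0.9):
    """Largest observed ||T Q1 - T Q2|| / ||Q1 - Q2|| over random pairs."""
    rng = random.Random(seed)
    modes = [random_mode(rng, n_s, n_a) for _ in range(2)]
    rho = [0.3, 0.7]  # frozen belief, fixed for every backup
    worst = 0.0
    for _ in range(trials):
        Q1 = [[rng.uniform(-10, 10) for _ in range(n_a)] for _ in range(n_s)]
        Q2 = [[rng.uniform(-10, 10) for _ in range(n_a)] for _ in range(n_s)]
        d = sup_dist(Q1, Q2)
        if d == 0:
            continue
        out = sup_dist(frozen_mixture(Q1, modes, rho, gamma),
                       frozen_mixture(Q2, modes, rho, gamma))
        worst = max(worst, out / d)
    return worst


if __name__ == "__main__":
    print(worst_ratio())  # should not exceed gamma, up to rounding
```

A random search of this kind only fails to find a violation; it is the machine-checked proof, not the simulation, that establishes the contraction property.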

The agent overestimates Q (due to sampling noise or stale data)

The Q-dependent belief shifts toward the higher-reward mode (ρ1 increases)

This inflates the Q-target further (mode 1 has R1 > R2)

The cycle amplifies, preventing convergence.

The critical threshold. The contraction factor is γ + λ∆. For typical RL parameters (γ = 0.99), any λ∆ ≥ 0.01 breaks contraction. Since ∆ (the mode reward gap) can be large in practice (e.g., ∆ = 50 between “normal traffic” and “severe congestion” in bus control), even a tiny sensitivity λ = 0.001 suffices to breach...
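The threshold can be made concrete with a scalar toy model. The sketch below is a hypothetical construction, not necessarily the paper's counterexample: a single state with two modes, constant rewards r1 > r2, and a belief that depends linearly on Q with sensitivity λ. In the unclipped region the backup has slope γ + λ∆, which matches the factor quoted above; setting λ = 0 recovers the frozen-belief slope γ.

```python
def q_dependent_backup(q, gamma, lam, r1, r2, rho0=0.5):
    """Scalar backup whose mode-1 belief depends on q (hypothetical model).

    rho1(q) = clip(rho0 + lam * q, 0, 1); lam = 0 gives a frozen belief.
    """
    rho1 = min(1.0, max(0.0, rho0 + lam * q))
    return rho1 * (r1 + gamma * q) + (1.0 - rho1) * (r2 + gamma * q)


def local_factor(gamma, lam, delta):
    """Slope of the backup in the unclipped region: gamma + lam * delta."""
    return gamma + lam * delta


if __name__ == "__main__":
    gamma, lam, r1, r2 = 0.99, 0.001, 50.0, 0.0  # delta = r1 - r2 = 50
    moving = (q_dependent_backup(1.0, gamma, lam, r1, r2)
              - q_dependent_backup(0.0, gamma, lam, r1, r2))
    frozen = (q_dependent_backup(1.0, gamma, 0.0, r1, r2)
              - q_dependent_backup(0.0, gamma, 0.0, r1, r2))
    print(moving, frozen, local_factor(gamma, lam, r1 - r2))
```

With these values a unit gap between two Q estimates grows to about 1.04 after one Q-dependent backup, while the frozen-belief backup shrinks it to 0.99.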

If action a1 has lower ensemble disagreement than a2 at state s, the adaptive penalty amplifies the preference for a1 monotonically with $\lambda_w$

The adaptive penalty cannot reverse the ranking among actions with equal ensemble disagreement.

Quantitative Q-value depression. Following the same analysis as RE-SAC [24]:

$$Q^*_{\mathrm{BAPR}}(s, a) = Q^*(s, a) - \frac{\gamma \cdot \Delta_{\mathrm{penalty}}}{1 - \gamma}, \tag{47}$$

where $\Delta_{\mathrm{penalty}}$ includes the belief-weighted contribution. With the experimental configuration (γ = 0.99, $\lambda_w$ ≤ 0.4 empirically, cpenalty = ...