Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning
Pith reviewed 2026-05-20 20:18 UTC · model grok-4.3
The pith
The MVP algorithm extends to reinforcement learning with episode-dependent action sets and delivers a minimax regret bound of order sqrt(S A H^3 K log L) for adversarial contexts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the setting of episodic reinforcement learning with fixed reward and transition functions but episode-dependent admissible action sets observed at the start of each episode, the MVP algorithm achieves a minimax regret bound of order sqrt(S A H^3 K log L) for adversarial contexts. This implies a bound of order sqrt(S A H^3 K) for stochastic contexts. The paper further derives a gap-dependent regret bound of order inf over p in [0,1) of (1 over Delta_min^p plus p K Delta_min^p) times log K times poly(S, A, H), where Delta_min^p is the global p-trimmed positive-gap floor over suboptimal state-action pairs at each step. These results follow from direct extension of the MVP analysis to the new
What carries the argument
The MVP algorithm adapted to episode-dependent admissible action sets observed before each episode begins, with separate regret analyses for adversarial versus stochastic context selection.
If this is right
- Regret stays sublinear in the number of episodes even when contexts are chosen adversarially, up to a logarithmic factor in the number of possible contexts.
- For a fixed distribution over contexts the algorithm reaches epsilon-optimal performance after roughly S A H^3 over epsilon squared episodes.
- When suboptimality gaps are large the gap-dependent bound improves substantially over the minimax rate.
- The same algorithm works without redesign for both the stochastic and adversarial context regimes.
Where Pith is reading between the lines
- The same style of analysis could extend to other reinforcement learning problems where action availability depends on external context signals observed before each decision period.
- Practitioners facing time-varying constraints, such as in scheduling or resource allocation, may obtain reliable performance predictions by applying this regret form directly.
- One could test whether the bounds remain valid when the number of possible contexts grows with the state space or when function approximation replaces tabular methods.
- Links to online learning with time-varying feasible sets may allow transfer of these techniques to combinatorial bandit problems.
Load-bearing premise
Reward and transition functions remain unchanged across episodes while only the admissible action sets vary and are fully revealed at the start of each episode.
What would settle it
Run the extended MVP algorithm on a small known MDP with two states and horizon length two, vary the action sets adversarially over thousands of episodes, and check whether cumulative regret stays below the claimed order or exceeds it by a polynomial factor.
Figures
read the original abstract
We study episodic reinforcement learning with fixed reward and transition functions, but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value, $\sum_{k=1}^K [V^{*,M^k} - V^{\pi^k,M^k}]$, where $M^k$ represents the action context in the $k$-th episode. We show that the MVP algorithm naturally extends to this framework and enjoys strong theoretical guarantees. In particular, we establish a minimax regret bound of $\widetilde{O}(\sqrt{SAH^3K\log L})$ for adversarial contexts, where $L$ denotes the number of possible contexts. This result implies a regret bound of $\widetilde{O}(\sqrt{SAH^3K})$ for stochastic contexts. We further translate the stochastic regret guarantee into a sample complexity bound of $\widetilde{O}(SAH^3/\epsilon^2)$ for a fixed context distribution. In addition, we derive a gap-dependent regret bound of \[ \widetilde O\left( \inf_{p\in [0,1)} \left( \frac{1}{\Delta_{\min}^{p}} + pK\Delta_{\min}^{p} \right)\log K \cdot \mathrm{poly}(S,A,H) \right), \] where $\Delta_{\min}^{p}$ is the global $p$-trimmed positive-gap floor over suboptimal $(h,s,a)$ triples. This bound can substantially improve upon the minimax rate when the relevant suboptimality gaps are large.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies episodic reinforcement learning where reward and transition functions are fixed across episodes, but admissible action sets are episode-dependent and observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value. The authors show that the MVP algorithm extends naturally to this setting and establish a minimax regret bound of O~(sqrt(S A H^3 K log L)) for adversarial contexts (implying O~(sqrt(S A H^3 K)) for stochastic contexts), a corresponding sample-complexity bound of O~(S A H^3 / epsilon^2), and a gap-dependent regret bound of O~(inf_p (1/Delta_min^p + p K Delta_min^p) log K * poly(S,A,H)) using a p-trimmed positive-gap floor.
Significance. If the central claims hold, the results provide tighter regret guarantees for contextual action-set RL by incorporating the log L factor for covering contexts and a flexible gap-dependent form that improves on minimax rates when suboptimality gaps are large. The work builds on standard concentration and value-function arguments while extending MVP in a natural way; the absence of free parameters or invented entities in the bounds is a strength.
minor comments (3)
- The definition of the global p-trimmed positive-gap floor Delta_min^p in the gap-dependent bound (abstract and corresponding theorem) would benefit from an explicit equation or paragraph clarifying how it is computed over suboptimal (h,s,a) triples.
- Notation for M^k as the action context could be introduced more clearly in the problem formulation section to distinguish it from standard MDP components.
- The sample-complexity translation from the stochastic regret bound assumes a fixed context distribution; a brief remark on dependence on the distribution's support size would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. Since no specific major comments or criticisms were raised in the report, we have no points to address point-by-point at this time. We will incorporate any minor editorial or presentation improvements in the revised manuscript.
Circularity Check
No significant circularity detected in the derivation
full rationale
The paper extends the MVP algorithm to the setting of episode-dependent admissible action sets observed at the start of each episode, with fixed reward and transition functions, and derives minimax regret bounds of the form O~(sqrt(S A H^3 K log L)) for adversarial contexts along with the implied stochastic and gap-dependent forms. These bounds follow from standard concentration arguments and value-function analysis in episodic RL; the log L term arises directly from covering the finite context set, the stochastic implication drops the log L under distributional assumptions, and the inf_p trimming construction is a conventional device for balancing terms in gap-dependent analysis. No load-bearing step reduces a claimed prediction or first-principles result to a fitted parameter, self-definition, or self-citation chain by construction, and the central claims remain independently verifiable from the stated assumptions and standard RL tools.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Reward and transition functions are fixed across episodes while action sets vary and are observed at episode start.
- standard math Standard concentration inequalities and value-function bounds from episodic RL apply after extending MVP.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We establish a minimax regret bound of O~(sqrt(S A H^3 K log L)) for adversarial contexts... gap-dependent bound of O~(inf_p (1/Δ_min^p + p K Δ_min^p) log K · poly(S,A,H))
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
-
[2]
Posterior sampling for reinforcement learning: worst-case regret bounds, 2020
Shipra Agrawal and Randy Jia. Posterior sampling for reinforcement learning: worst-case regret bounds, 2020
work page 2020
-
[3]
Minimax regret bounds for reinforcement learning
Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. InProceedings of the 34th International Conference on Machine Learning, pages 263–272, 2017
work page 2017
-
[4]
Federated linear contextual bandits with heterogeneous clients, 2024
Ethan Blaser, Chuanhao Li, and Hongning Wang. Federated linear contextual bandits with heterogeneous clients, 2024
work page 2024
-
[5]
Strategic linear contextual bandits, 2024
Thomas Kleine Buening, Aadirupa Saha, Christos Dimitrakakis, and Haifeng Xu. Strategic linear contextual bandits, 2024
work page 2024
-
[6]
Murphy.Statistical Reinforcement Learning for Dynamic Treatment Regimes
Bibhas Chakraborty and Susan A. Murphy.Statistical Reinforcement Learning for Dynamic Treatment Regimes. Springer, 2013
work page 2013
-
[7]
Sharp gap-dependent variance-aware regret bounds for tabular MDPs
Shulun Chen, Runlong Zhou, Zihan Zhang, Maryam Fazel, and Simon Shaolei Du. Sharp gap-dependent variance-aware regret bounds for tabular MDPs. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026
work page 2026
-
[8]
Contextual bandits with linear payoff functions
Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors,Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 ofProceedings of Machine Learning Research, pages 208–214, Fort Lauderdale, FL, USA, ...
work page 2011
-
[9]
Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning, 2018
Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning, 2018
work page 2018
-
[10]
Sample complexity characterization for linear contextual mdps, 2024
Junze Deng, Yuan Cheng, Shaofeng Zou, and Yingbin Liang. Sample complexity characterization for linear contextual mdps, 2024
work page 2024
-
[11]
Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited, 2020
Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, and Michal Valko. Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited, 2020
work page 2020
-
[12]
Q-learning with ucb exploration is sample efficient for infinite-horizon mdp, 2019
Kefan Dong, Yuanhao Wang, Xiaoyu Chen, and Liwei Wang. Q-learning with ucb exploration is sample efficient for infinite-horizon mdp, 2019
work page 2019
-
[13]
Sleeping reinforcement learning
Simone Drago, Marco Mussi, and Alberto Maria Metelli. Sleeping reinforcement learning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Proceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 1443...
work page 2025
-
[14]
Tight regret bounds for model-based reinforcement learning with greedy policies, 2019
Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, and Shie Mannor. Tight regret bounds for model-based reinforcement learning with greedy policies, 2019. 11
work page 2019
-
[15]
Efficient bias-span-constrained exploration-exploitation in reinforcement learning, 2018
Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning, 2018
work page 2018
-
[16]
Contextual markov decision processes
Assaf Hallak, Dotan Di Castro, and Shie Mannor. Contextual markov decision processes. InInternational Conference on Machine Learning (ICML), pages 2074–2082, 2015
work page 2074
-
[17]
Near-optimal regret bounds for reinforcement learning
Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(51):1563–1600, 2010
work page 2010
- [18]
-
[19]
Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The complexity of MDP policy evaluation under parameter uncertainty.Journal of Machine Learning Research, 18(1):7120–7155, 2017
work page 2017
-
[20]
Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is q-learning provably efficient?, 2018
work page 2018
-
[21]
Best-of-both-worlds algorithms for linear contextual bandits, 2024
Yuko Kuroki, Alberto Rumi, Taira Tsuchiya, Fabio Vitale, and Nicolò Cesa-Bianchi. Best-of-both-worlds algorithms for linear contextual bandits, 2024
work page 2024
-
[22]
The epoch-greedy algorithm for contextual multi-armed bandits
John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 21st International Conference on Neural Information Processing Systems, NIPS’07, page 817–824, Red Hook, NY, USA, 2007. Curran Associates Inc
work page 2007
-
[23]
Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies.Journal of Machine Learning Research, 17(39):1–40, 2016
work page 2016
-
[24]
Eluder-based regret for stochastic contextual MDPs
Orin Levy, Asaf Cassel, Alon Cohen, and Yishay Mansour. Eluder-based regret for stochastic contextual MDPs. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning ...
work page 2024
-
[25]
Orin Levy, Aviv Rosenberg, Alon Cohen, and Yishay Mansour. Near-optimal regret for policy optimization in contextual mdps with general offline function approximation, 2026
work page 2026
-
[26]
Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning, 2022
Gen Li, Laixi Shi, Yuxin Chen, and Yuejie Chi. Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning, 2022
work page 2022
-
[27]
Minimax-optimal reward-agnostic exploration in reinforcement learning, 2024
Gen Li, Yuling Yan, Yuxin Chen, and Jianqing Fan. Minimax-optimal reward-agnostic exploration in reinforcement learning, 2024
work page 2024
-
[28]
Lee, Yuejie Chi, and Yuxin Chen
Gen Li, Wenhao Zhan, Jason D. Lee, Yuejie Chi, and Yuxin Chen. Reward-agnostic fine-tuning: Provable statistical benefits of hybrid reinforcement learning, 2023
work page 2023
- [29]
-
[30]
Yuanzhi Li, Ruosong Wang, and Lin F. Yang. Settling the horizon-dependence of sample complexity in reinforcement learning, 2021
work page 2021
-
[31]
Empirical Bernstein Bounds and Sample Variance Penalization
Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. InProceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009. Also available on arXiv:0907.3740. 12
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[32]
Ucb momentum q-learning: Correcting the bias without forgetting, 2022
Pierre Menard, Omar Darwiche Domingues, Xuedong Shang, and Michal Valko. Ucb momentum q-learning: Correcting the bias without forgetting, 2022
work page 2022
-
[33]
Markov decision processes with continuous side information
Aditya Modi, Nan Jiang, Satinder Singh, and Ambuj Tewari. Markov decision processes with continuous side information. In Firdaus Janoos, Mehryar Mohri, and Karthik Sridharan, editors,Proceedings of Algorithmic Learning Theory, volume 83 ofProceedings of Machine Learning Research, pages 597–618. PMLR, 07–09 Apr 2018
work page 2018
-
[34]
No-regret exploration in contextual reinforcement learning
Aditya Modi and Ambuj Tewari. No-regret exploration in contextual reinforcement learning. In Jonas Peters and David Sontag, editors,Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124 ofProceedings of Machine Learning Research, pages 829–838. PMLR, 03–06 Aug 2020
work page 2020
-
[35]
A unifying view of optimism in episodic reinforcement learning, 2020
Gergely Neu and Ciara Pike-Burke. A unifying view of optimism in episodic reinforcement learning, 2020
work page 2020
-
[36]
Contextual MDP for online advertising
Eric Schwartz, Jesse Bradshaw, and Brantley Matthews. Contextual MDP for online advertising. In ACM Conference on Economics and Computation, 2017
work page 2017
-
[37]
Non-asymptotic gap-dependent regret bounds for tabular mdps, 2019
Max Simchowitz and Kevin Jamieson. Non-asymptotic gap-dependent regret bounds for tabular mdps, 2019
work page 2019
-
[38]
Model-based reinforcement learning with nearly tight exploration complexity bounds
István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. InProceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1031–1038, 2010
work page 2010
- [39]
-
[40]
Haike Xu, Tengyu Ma, and Simon S. Du. Fine-grained gap-dependent bounds for tabular mdps via adaptive multi-step bootstrap, 2021
work page 2021
-
[41]
Linear contextual bandits with interference, 2024
Yang Xu, Wenbin Lu, and Rui Song. Linear contextual bandits with interference, 2024
work page 2024
-
[42]
Q-learning with logarithmic regret
Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. In Arindam Banerjee and Kenji Fukumizu, editors,Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 ofProceedings of Machine Learning Research, pages 1576–1584. PMLR, 13–15 Apr 2021
work page 2021
-
[43]
Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. InInternational Conference on Machine Learning, pages 7304–7312, 2019
work page 2019
-
[44]
Contextual linear bandits with delay as payoff, 2025
Mengxiao Zhang, Yingfei Wang, and Haipeng Luo. Contextual linear bandits with delay as payoff, 2025
work page 2025
-
[45]
Settling the sample complexity of online reinforcement learning
Zihan Zhang, Yuxin Chen, Jason D Lee, and Simon S Du. Settling the sample complexity of online reinforcement learning. InThe Thirty Seventh Annual Conference on Learning Theory, pages 5213–5219. PMLR, 2024
work page 2024
-
[46]
Zihan Zhang, Xiangyang Ji, and Simon Du. Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon. InConference on Learning Theory, pages 4528–4531, 2021
work page 2021
-
[47]
∃n, nX k=1 Xk ≥3 nX k=1 Yk +llog 1 δ # ≤δ P
Runlong Zhou, Zihan Zhang, and Simon S. Du. Sharp variance-dependent bounds in reinforcement learning: Best of both worlds in stochastic and deterministic environments, 2023. 13 A Notations and Technical Lemmas A.1 Notations 14 S, S=|S| State space and its size A, A=|A| Action space and its size H Horizon length K Number of learning episodes D Distributio...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.