pith. sign in

arxiv: 2605.15692 · v1 · pith:WODYYPXKnew · submitted 2026-05-15 · 💻 cs.LG · stat.ML

Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning

Pith reviewed 2026-05-20 20:18 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords reinforcement learningregret boundscontextual reinforcement learningepisodic RLMVP algorithmminimax regretgap-dependent regretaction sets
0
0 comments X

The pith

The MVP algorithm extends to reinforcement learning with episode-dependent action sets and delivers a minimax regret bound of order sqrt(S A H^3 K log L) for adversarial contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies episodic reinforcement learning where reward and transition functions stay fixed but the set of allowed actions can change with each episode and is revealed at the start. It shows that the MVP algorithm carries over to this setting without major changes and still yields strong performance guarantees measured against the best policy for that episode. The guarantees include a worst-case regret that grows like the square root of the number of episodes, scaled by factors for states, actions, and horizon length, plus a logarithmic term when contexts are chosen adversarially. The same analysis gives a simpler bound without the log term when contexts arrive from a fixed distribution, plus a refined bound that shrinks when large performance gaps separate good and bad actions. A sympathetic reader would care because many real decision problems restrict which actions are feasible depending on the current situation, and these bounds give clearer predictions of how much learning is needed before the policy becomes reliable.

Core claim

In the setting of episodic reinforcement learning with fixed reward and transition functions but episode-dependent admissible action sets observed at the start of each episode, the MVP algorithm achieves a minimax regret bound of order sqrt(S A H^3 K log L) for adversarial contexts. This implies a bound of order sqrt(S A H^3 K) for stochastic contexts. The paper further derives a gap-dependent regret bound of order inf over p in [0,1) of (1 over Delta_min^p plus p K Delta_min^p) times log K times poly(S, A, H), where Delta_min^p is the global p-trimmed positive-gap floor over suboptimal state-action pairs at each step. These results follow from direct extension of the MVP analysis to the new

What carries the argument

The MVP algorithm adapted to episode-dependent admissible action sets observed before each episode begins, with separate regret analyses for adversarial versus stochastic context selection.

If this is right

  • Regret stays sublinear in the number of episodes even when contexts are chosen adversarially, up to a logarithmic factor in the number of possible contexts.
  • For a fixed distribution over contexts the algorithm reaches epsilon-optimal performance after roughly S A H^3 over epsilon squared episodes.
  • When suboptimality gaps are large the gap-dependent bound improves substantially over the minimax rate.
  • The same algorithm works without redesign for both the stochastic and adversarial context regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same style of analysis could extend to other reinforcement learning problems where action availability depends on external context signals observed before each decision period.
  • Practitioners facing time-varying constraints, such as in scheduling or resource allocation, may obtain reliable performance predictions by applying this regret form directly.
  • One could test whether the bounds remain valid when the number of possible contexts grows with the state space or when function approximation replaces tabular methods.
  • Links to online learning with time-varying feasible sets may allow transfer of these techniques to combinatorial bandit problems.

Load-bearing premise

Reward and transition functions remain unchanged across episodes while only the admissible action sets vary and are fully revealed at the start of each episode.

What would settle it

Run the extended MVP algorithm on a small known MDP with two states and horizon length two, vary the action sets adversarially over thousands of episodes, and check whether cumulative regret stays below the claimed order or exceeds it by a polynomial factor.

Figures

Figures reproduced from arXiv: 2605.15692 by Zihan Zhang, Zijun Chen.

Figure 1
Figure 1. Figure 1: Regret performance in the Contextual action-set MDP instance with horizon [PITH_FULL_IMAGE:figures/full_fig_p010_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Gap-Dependent Decomposition with p-trimmed positive gap 28 [PITH_FULL_IMAGE:figures/full_fig_p028_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Performance for Algorithm 1/Contextual MVP under different parameter settings with confidence range. The reward function is also shared across contexts and is given by r(s2, a4) = r(s4, a4) = r(s5, a4) = 1, r(s6, a4) = r(s7, a4) = r(s8, a4) = 1 and r(s, a) = 0 for all other state-action pairs. The two contexts differ only through the admissible action sets, suppose the transition are same across the horizo… view at source ↗
read the original abstract

We study episodic reinforcement learning with fixed reward and transition functions, but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value, $\sum_{k=1}^K [V^{*,M^k} - V^{\pi^k,M^k}]$, where $M^k$ represents the action context in the $k$-th episode. We show that the MVP algorithm naturally extends to this framework and enjoys strong theoretical guarantees. In particular, we establish a minimax regret bound of $\widetilde{O}(\sqrt{SAH^3K\log L})$ for adversarial contexts, where $L$ denotes the number of possible contexts. This result implies a regret bound of $\widetilde{O}(\sqrt{SAH^3K})$ for stochastic contexts. We further translate the stochastic regret guarantee into a sample complexity bound of $\widetilde{O}(SAH^3/\epsilon^2)$ for a fixed context distribution. In addition, we derive a gap-dependent regret bound of \[ \widetilde O\left( \inf_{p\in [0,1)} \left( \frac{1}{\Delta_{\min}^{p}} + pK\Delta_{\min}^{p} \right)\log K \cdot \mathrm{poly}(S,A,H) \right), \] where $\Delta_{\min}^{p}$ is the global $p$-trimmed positive-gap floor over suboptimal $(h,s,a)$ triples. This bound can substantially improve upon the minimax rate when the relevant suboptimality gaps are large.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript studies episodic reinforcement learning where reward and transition functions are fixed across episodes, but admissible action sets are episode-dependent and observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value. The authors show that the MVP algorithm extends naturally to this setting and establish a minimax regret bound of O~(sqrt(S A H^3 K log L)) for adversarial contexts (implying O~(sqrt(S A H^3 K)) for stochastic contexts), a corresponding sample-complexity bound of O~(S A H^3 / epsilon^2), and a gap-dependent regret bound of O~(inf_p (1/Delta_min^p + p K Delta_min^p) log K * poly(S,A,H)) using a p-trimmed positive-gap floor.

Significance. If the central claims hold, the results provide tighter regret guarantees for contextual action-set RL by incorporating the log L factor for covering contexts and a flexible gap-dependent form that improves on minimax rates when suboptimality gaps are large. The work builds on standard concentration and value-function arguments while extending MVP in a natural way; the absence of free parameters or invented entities in the bounds is a strength.

minor comments (3)
  1. The definition of the global p-trimmed positive-gap floor Delta_min^p in the gap-dependent bound (abstract and corresponding theorem) would benefit from an explicit equation or paragraph clarifying how it is computed over suboptimal (h,s,a) triples.
  2. Notation for M^k as the action context could be introduced more clearly in the problem formulation section to distinguish it from standard MDP components.
  3. The sample-complexity translation from the stochastic regret bound assumes a fixed context distribution; a brief remark on dependence on the distribution's support size would improve clarity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. Since no specific major comments or criticisms were raised in the report, we have no points to address point-by-point at this time. We will incorporate any minor editorial or presentation improvements in the revised manuscript.

Circularity Check

0 steps flagged

No significant circularity detected in the derivation

full rationale

The paper extends the MVP algorithm to the setting of episode-dependent admissible action sets observed at the start of each episode, with fixed reward and transition functions, and derives minimax regret bounds of the form O~(sqrt(S A H^3 K log L)) for adversarial contexts along with the implied stochastic and gap-dependent forms. These bounds follow from standard concentration arguments and value-function analysis in episodic RL; the log L term arises directly from covering the finite context set, the stochastic implication drops the log L under distributional assumptions, and the inf_p trimming construction is a conventional device for balancing terms in gap-dependent analysis. No load-bearing step reduces a claimed prediction or first-principles result to a fitted parameter, self-definition, or self-citation chain by construction, and the central claims remain independently verifiable from the stated assumptions and standard RL tools.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claims rest on standard episodic MDP assumptions (fixed transitions and rewards) and the natural extension of MVP; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption Reward and transition functions are fixed across episodes while action sets vary and are observed at episode start.
    Stated directly in the abstract as the problem setup.
  • standard math Standard concentration inequalities and value-function bounds from episodic RL apply after extending MVP.
    Implicit in the derivation of the stated regret rates.

pith-pipeline@v0.9.0 · 5815 in / 1235 out tokens · 77830 ms · 2026-05-20T20:18:00.822424+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 1 internal anchor

  1. [1]

    Schapire

    Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits, 2014

  2. [2]

    Posterior sampling for reinforcement learning: worst-case regret bounds, 2020

    Shipra Agrawal and Randy Jia. Posterior sampling for reinforcement learning: worst-case regret bounds, 2020

  3. [3]

    Minimax regret bounds for reinforcement learning

    Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. InProceedings of the 34th International Conference on Machine Learning, pages 263–272, 2017

  4. [4]

    Federated linear contextual bandits with heterogeneous clients, 2024

    Ethan Blaser, Chuanhao Li, and Hongning Wang. Federated linear contextual bandits with heterogeneous clients, 2024

  5. [5]

    Strategic linear contextual bandits, 2024

    Thomas Kleine Buening, Aadirupa Saha, Christos Dimitrakakis, and Haifeng Xu. Strategic linear contextual bandits, 2024

  6. [6]

    Murphy.Statistical Reinforcement Learning for Dynamic Treatment Regimes

    Bibhas Chakraborty and Susan A. Murphy.Statistical Reinforcement Learning for Dynamic Treatment Regimes. Springer, 2013

  7. [7]

    Sharp gap-dependent variance-aware regret bounds for tabular MDPs

    Shulun Chen, Runlong Zhou, Zihan Zhang, Maryam Fazel, and Simon Shaolei Du. Sharp gap-dependent variance-aware regret bounds for tabular MDPs. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  8. [8]

    Contextual bandits with linear payoff functions

    Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors,Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 ofProceedings of Machine Learning Research, pages 208–214, Fort Lauderdale, FL, USA, ...

  9. [9]

    Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning, 2018

    Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning, 2018

  10. [10]

    Sample complexity characterization for linear contextual mdps, 2024

    Junze Deng, Yuan Cheng, Shaofeng Zou, and Yingbin Liang. Sample complexity characterization for linear contextual mdps, 2024

  11. [11]

    Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited, 2020

    Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, and Michal Valko. Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited, 2020

  12. [12]

    Q-learning with ucb exploration is sample efficient for infinite-horizon mdp, 2019

    Kefan Dong, Yuanhao Wang, Xiaoyu Chen, and Liwei Wang. Q-learning with ucb exploration is sample efficient for infinite-horizon mdp, 2019

  13. [13]

    Sleeping reinforcement learning

    Simone Drago, Marco Mussi, and Alberto Maria Metelli. Sleeping reinforcement learning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Proceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 1443...

  14. [14]

    Tight regret bounds for model-based reinforcement learning with greedy policies, 2019

    Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, and Shie Mannor. Tight regret bounds for model-based reinforcement learning with greedy policies, 2019. 11

  15. [15]

    Efficient bias-span-constrained exploration-exploitation in reinforcement learning, 2018

    Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning, 2018

  16. [16]

    Contextual markov decision processes

    Assaf Hallak, Dotan Di Castro, and Shie Mannor. Contextual markov decision processes. InInternational Conference on Machine Learning (ICML), pages 2074–2082, 2015

  17. [17]

    Near-optimal regret bounds for reinforcement learning

    Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(51):1563–1600, 2010

  18. [18]

    Schapire

    Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low bellman rank are pac-learnable, 2016

  19. [19]

    The complexity of MDP policy evaluation under parameter uncertainty.Journal of Machine Learning Research, 18(1):7120–7155, 2017

    Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The complexity of MDP policy evaluation under parameter uncertainty.Journal of Machine Learning Research, 18(1):7120–7155, 2017

  20. [20]

    Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is q-learning provably efficient?, 2018

  21. [21]

    Best-of-both-worlds algorithms for linear contextual bandits, 2024

    Yuko Kuroki, Alberto Rumi, Taira Tsuchiya, Fabio Vitale, and Nicolò Cesa-Bianchi. Best-of-both-worlds algorithms for linear contextual bandits, 2024

  22. [22]

    The epoch-greedy algorithm for contextual multi-armed bandits

    John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 21st International Conference on Neural Information Processing Systems, NIPS’07, page 817–824, Red Hook, NY, USA, 2007. Curran Associates Inc

  23. [23]

    End-to-end training of deep visuomotor policies.Journal of Machine Learning Research, 17(39):1–40, 2016

    Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies.Journal of Machine Learning Research, 17(39):1–40, 2016

  24. [24]

    Eluder-based regret for stochastic contextual MDPs

    Orin Levy, Asaf Cassel, Alon Cohen, and Yishay Mansour. Eluder-based regret for stochastic contextual MDPs. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning ...

  25. [25]

    Near-optimal regret for policy optimization in contextual mdps with general offline function approximation, 2026

    Orin Levy, Aviv Rosenberg, Alon Cohen, and Yishay Mansour. Near-optimal regret for policy optimization in contextual mdps with general offline function approximation, 2026

  26. [26]

    Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning, 2022

    Gen Li, Laixi Shi, Yuxin Chen, and Yuejie Chi. Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning, 2022

  27. [27]

    Minimax-optimal reward-agnostic exploration in reinforcement learning, 2024

    Gen Li, Yuling Yan, Yuxin Chen, and Jianqing Fan. Minimax-optimal reward-agnostic exploration in reinforcement learning, 2024

  28. [28]

    Lee, Yuejie Chi, and Yuxin Chen

    Gen Li, Wenhao Zhan, Jason D. Lee, Yuejie Chi, and Yuxin Chen. Reward-agnostic fine-tuning: Provable statistical benefits of hybrid reinforcement learning, 2023

  29. [29]

    Schapire

    Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. InInternational World Wide Web Conference (WWW), pages 661–670, 2010

  30. [30]

    Yuanzhi Li, Ruosong Wang, and Lin F. Yang. Settling the horizon-dependence of sample complexity in reinforcement learning, 2021

  31. [31]

    Empirical Bernstein Bounds and Sample Variance Penalization

    Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. InProceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009. Also available on arXiv:0907.3740. 12

  32. [32]

    Ucb momentum q-learning: Correcting the bias without forgetting, 2022

    Pierre Menard, Omar Darwiche Domingues, Xuedong Shang, and Michal Valko. Ucb momentum q-learning: Correcting the bias without forgetting, 2022

  33. [33]

    Markov decision processes with continuous side information

    Aditya Modi, Nan Jiang, Satinder Singh, and Ambuj Tewari. Markov decision processes with continuous side information. In Firdaus Janoos, Mehryar Mohri, and Karthik Sridharan, editors,Proceedings of Algorithmic Learning Theory, volume 83 ofProceedings of Machine Learning Research, pages 597–618. PMLR, 07–09 Apr 2018

  34. [34]

    No-regret exploration in contextual reinforcement learning

    Aditya Modi and Ambuj Tewari. No-regret exploration in contextual reinforcement learning. In Jonas Peters and David Sontag, editors,Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124 ofProceedings of Machine Learning Research, pages 829–838. PMLR, 03–06 Aug 2020

  35. [35]

    A unifying view of optimism in episodic reinforcement learning, 2020

    Gergely Neu and Ciara Pike-Burke. A unifying view of optimism in episodic reinforcement learning, 2020

  36. [36]

    Contextual MDP for online advertising

    Eric Schwartz, Jesse Bradshaw, and Brantley Matthews. Contextual MDP for online advertising. In ACM Conference on Economics and Computation, 2017

  37. [37]

    Non-asymptotic gap-dependent regret bounds for tabular mdps, 2019

    Max Simchowitz and Kevin Jamieson. Non-asymptotic gap-dependent regret bounds for tabular mdps, 2019

  38. [38]

    Model-based reinforcement learning with nearly tight exploration complexity bounds

    István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. InProceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1031–1038, 2010

  39. [39]

    Du, Lin F

    Ruosong Wang, Simon S. Du, Lin F. Yang, and Sham M. Kakade. Is long horizon reinforcement learning more difficult than short horizon reinforcement learning?, 2020

  40. [40]

    Haike Xu, Tengyu Ma, and Simon S. Du. Fine-grained gap-dependent bounds for tabular mdps via adaptive multi-step bootstrap, 2021

  41. [41]

    Linear contextual bandits with interference, 2024

    Yang Xu, Wenbin Lu, and Rui Song. Linear contextual bandits with interference, 2024

  42. [42]

    Q-learning with logarithmic regret

    Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. In Arindam Banerjee and Kenji Fukumizu, editors,Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 ofProceedings of Machine Learning Research, pages 1576–1584. PMLR, 13–15 Apr 2021

  43. [43]

    Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds

    Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. InInternational Conference on Machine Learning, pages 7304–7312, 2019

  44. [44]

    Contextual linear bandits with delay as payoff, 2025

    Mengxiao Zhang, Yingfei Wang, and Haipeng Luo. Contextual linear bandits with delay as payoff, 2025

  45. [45]

    Settling the sample complexity of online reinforcement learning

    Zihan Zhang, Yuxin Chen, Jason D Lee, and Simon S Du. Settling the sample complexity of online reinforcement learning. InThe Thirty Seventh Annual Conference on Learning Theory, pages 5213–5219. PMLR, 2024

  46. [46]

    Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon

    Zihan Zhang, Xiangyang Ji, and Simon Du. Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon. InConference on Learning Theory, pages 4528–4531, 2021

  47. [47]

    ∃n, nX k=1 Xk ≥3 nX k=1 Yk +llog 1 δ # ≤δ P

    Runlong Zhou, Zihan Zhang, and Simon S. Du. Sharp variance-dependent bounds in reinforcement learning: Best of both worlds in stochastic and deterministic environments, 2023. 13 A Notations and Technical Lemmas A.1 Notations 14 S, S=|S| State space and its size A, A=|A| Action space and its size H Horizon length K Number of learning episodes D Distributio...