Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning

Zihan Zhang; Zijun Chen

arxiv: 2605.15692 · v1 · pith:WODYYPXKnew · submitted 2026-05-15 · 💻 cs.LG · stat.ML

Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning

Zijun Chen , Zihan Zhang This is my paper

Pith reviewed 2026-05-20 20:18 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords reinforcement learningregret boundscontextual reinforcement learningepisodic RLMVP algorithmminimax regretgap-dependent regretaction sets

0 comments

The pith

The MVP algorithm extends to reinforcement learning with episode-dependent action sets and delivers a minimax regret bound of order sqrt(S A H^3 K log L) for adversarial contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies episodic reinforcement learning where reward and transition functions stay fixed but the set of allowed actions can change with each episode and is revealed at the start. It shows that the MVP algorithm carries over to this setting without major changes and still yields strong performance guarantees measured against the best policy for that episode. The guarantees include a worst-case regret that grows like the square root of the number of episodes, scaled by factors for states, actions, and horizon length, plus a logarithmic term when contexts are chosen adversarially. The same analysis gives a simpler bound without the log term when contexts arrive from a fixed distribution, plus a refined bound that shrinks when large performance gaps separate good and bad actions. A sympathetic reader would care because many real decision problems restrict which actions are feasible depending on the current situation, and these bounds give clearer predictions of how much learning is needed before the policy becomes reliable.

Core claim

In the setting of episodic reinforcement learning with fixed reward and transition functions but episode-dependent admissible action sets observed at the start of each episode, the MVP algorithm achieves a minimax regret bound of order sqrt(S A H^3 K log L) for adversarial contexts. This implies a bound of order sqrt(S A H^3 K) for stochastic contexts. The paper further derives a gap-dependent regret bound of order inf over p in [0,1) of (1 over Delta_min^p plus p K Delta_min^p) times log K times poly(S, A, H), where Delta_min^p is the global p-trimmed positive-gap floor over suboptimal state-action pairs at each step. These results follow from direct extension of the MVP analysis to the new

What carries the argument

The MVP algorithm adapted to episode-dependent admissible action sets observed before each episode begins, with separate regret analyses for adversarial versus stochastic context selection.

If this is right

Regret stays sublinear in the number of episodes even when contexts are chosen adversarially, up to a logarithmic factor in the number of possible contexts.
For a fixed distribution over contexts the algorithm reaches epsilon-optimal performance after roughly S A H^3 over epsilon squared episodes.
When suboptimality gaps are large the gap-dependent bound improves substantially over the minimax rate.
The same algorithm works without redesign for both the stochastic and adversarial context regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same style of analysis could extend to other reinforcement learning problems where action availability depends on external context signals observed before each decision period.
Practitioners facing time-varying constraints, such as in scheduling or resource allocation, may obtain reliable performance predictions by applying this regret form directly.
One could test whether the bounds remain valid when the number of possible contexts grows with the state space or when function approximation replaces tabular methods.
Links to online learning with time-varying feasible sets may allow transfer of these techniques to combinatorial bandit problems.

Load-bearing premise

Reward and transition functions remain unchanged across episodes while only the admissible action sets vary and are fully revealed at the start of each episode.

What would settle it

Run the extended MVP algorithm on a small known MDP with two states and horizon length two, vary the action sets adversarially over thousands of episodes, and check whether cumulative regret stays below the claimed order or exceeds it by a polynomial factor.

Figures

Figures reproduced from arXiv: 2605.15692 by Zihan Zhang, Zijun Chen.

**Figure 2.** Figure 2: Gap-Dependent Decomposition with p-trimmed positive gap 28 [PITH_FULL_IMAGE:figures/full_fig_p028_2.png] view at source ↗

**Figure 3.** Figure 3: Performance for Algorithm 1/Contextual MVP under different parameter settings with confidence range. The reward function is also shared across contexts and is given by r(s2, a4) = r(s4, a4) = r(s5, a4) = 1, r(s6, a4) = r(s7, a4) = r(s8, a4) = 1 and r(s, a) = 0 for all other state-action pairs. The two contexts differ only through the admissible action sets, suppose the transition are same across the horizo… view at source ↗

read the original abstract

We study episodic reinforcement learning with fixed reward and transition functions, but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value, $\sum_{k=1}^K [V^{*,M^k} - V^{\pi^k,M^k}]$, where $M^k$ represents the action context in the $k$-th episode. We show that the MVP algorithm naturally extends to this framework and enjoys strong theoretical guarantees. In particular, we establish a minimax regret bound of $\widetilde{O}(\sqrt{SAH^3K\log L})$ for adversarial contexts, where $L$ denotes the number of possible contexts. This result implies a regret bound of $\widetilde{O}(\sqrt{SAH^3K})$ for stochastic contexts. We further translate the stochastic regret guarantee into a sample complexity bound of $\widetilde{O}(SAH^3/\epsilon^2)$ for a fixed context distribution. In addition, we derive a gap-dependent regret bound of \[ \widetilde O\left( \inf_{p\in [0,1)} \left( \frac{1}{\Delta_{\min}^{p}} + pK\Delta_{\min}^{p} \right)\log K \cdot \mathrm{poly}(S,A,H) \right), \] where $\Delta_{\min}^{p}$ is the global $p$-trimmed positive-gap floor over suboptimal $(h,s,a)$ triples. This bound can substantially improve upon the minimax rate when the relevant suboptimality gaps are large.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript studies episodic reinforcement learning where reward and transition functions are fixed across episodes, but admissible action sets are episode-dependent and observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value. The authors show that the MVP algorithm extends naturally to this setting and establish a minimax regret bound of O~(sqrt(S A H^3 K log L)) for adversarial contexts (implying O~(sqrt(S A H^3 K)) for stochastic contexts), a corresponding sample-complexity bound of O~(S A H^3 / epsilon^2), and a gap-dependent regret bound of O~(inf_p (1/Delta_min^p + p K Delta_min^p) log K * poly(S,A,H)) using a p-trimmed positive-gap floor.

Significance. If the central claims hold, the results provide tighter regret guarantees for contextual action-set RL by incorporating the log L factor for covering contexts and a flexible gap-dependent form that improves on minimax rates when suboptimality gaps are large. The work builds on standard concentration and value-function arguments while extending MVP in a natural way; the absence of free parameters or invented entities in the bounds is a strength.

minor comments (3)

The definition of the global p-trimmed positive-gap floor Delta_min^p in the gap-dependent bound (abstract and corresponding theorem) would benefit from an explicit equation or paragraph clarifying how it is computed over suboptimal (h,s,a) triples.
Notation for M^k as the action context could be introduced more clearly in the problem formulation section to distinguish it from standard MDP components.
The sample-complexity translation from the stochastic regret bound assumes a fixed context distribution; a brief remark on dependence on the distribution's support size would improve clarity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. Since no specific major comments or criticisms were raised in the report, we have no points to address point-by-point at this time. We will incorporate any minor editorial or presentation improvements in the revised manuscript.

Circularity Check

0 steps flagged

No significant circularity detected in the derivation

full rationale

The paper extends the MVP algorithm to the setting of episode-dependent admissible action sets observed at the start of each episode, with fixed reward and transition functions, and derives minimax regret bounds of the form O~(sqrt(S A H^3 K log L)) for adversarial contexts along with the implied stochastic and gap-dependent forms. These bounds follow from standard concentration arguments and value-function analysis in episodic RL; the log L term arises directly from covering the finite context set, the stochastic implication drops the log L under distributional assumptions, and the inf_p trimming construction is a conventional device for balancing terms in gap-dependent analysis. No load-bearing step reduces a claimed prediction or first-principles result to a fitted parameter, self-definition, or self-citation chain by construction, and the central claims remain independently verifiable from the stated assumptions and standard RL tools.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claims rest on standard episodic MDP assumptions (fixed transitions and rewards) and the natural extension of MVP; no free parameters or new entities are introduced in the abstract.

axioms (2)

domain assumption Reward and transition functions are fixed across episodes while action sets vary and are observed at episode start.
Stated directly in the abstract as the problem setup.
standard math Standard concentration inequalities and value-function bounds from episodic RL apply after extending MVP.
Implicit in the derivation of the stated regret rates.

pith-pipeline@v0.9.0 · 5815 in / 1235 out tokens · 77830 ms · 2026-05-20T20:18:00.822424+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We establish a minimax regret bound of O~(sqrt(S A H^3 K log L)) for adversarial contexts... gap-dependent bound of O~(inf_p (1/Δ_min^p + p K Δ_min^p) log K · poly(S,A,H))

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 1 internal anchor

[1]

Schapire

Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits, 2014

work page 2014
[2]

Posterior sampling for reinforcement learning: worst-case regret bounds, 2020

Shipra Agrawal and Randy Jia. Posterior sampling for reinforcement learning: worst-case regret bounds, 2020

work page 2020
[3]

Minimax regret bounds for reinforcement learning

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. InProceedings of the 34th International Conference on Machine Learning, pages 263–272, 2017

work page 2017
[4]

Federated linear contextual bandits with heterogeneous clients, 2024

Ethan Blaser, Chuanhao Li, and Hongning Wang. Federated linear contextual bandits with heterogeneous clients, 2024

work page 2024
[5]

Strategic linear contextual bandits, 2024

Thomas Kleine Buening, Aadirupa Saha, Christos Dimitrakakis, and Haifeng Xu. Strategic linear contextual bandits, 2024

work page 2024
[6]

Murphy.Statistical Reinforcement Learning for Dynamic Treatment Regimes

Bibhas Chakraborty and Susan A. Murphy.Statistical Reinforcement Learning for Dynamic Treatment Regimes. Springer, 2013

work page 2013
[7]

Sharp gap-dependent variance-aware regret bounds for tabular MDPs

Shulun Chen, Runlong Zhou, Zihan Zhang, Maryam Fazel, and Simon Shaolei Du. Sharp gap-dependent variance-aware regret bounds for tabular MDPs. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

work page 2026
[8]

Contextual bandits with linear payoff functions

Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors,Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 ofProceedings of Machine Learning Research, pages 208–214, Fort Lauderdale, FL, USA, ...

work page 2011
[9]

Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning, 2018

Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning, 2018

work page 2018
[10]

Sample complexity characterization for linear contextual mdps, 2024

Junze Deng, Yuan Cheng, Shaofeng Zou, and Yingbin Liang. Sample complexity characterization for linear contextual mdps, 2024

work page 2024
[11]

Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited, 2020

Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, and Michal Valko. Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited, 2020

work page 2020
[12]

Q-learning with ucb exploration is sample efficient for infinite-horizon mdp, 2019

Kefan Dong, Yuanhao Wang, Xiaoyu Chen, and Liwei Wang. Q-learning with ucb exploration is sample efficient for infinite-horizon mdp, 2019

work page 2019
[13]

Sleeping reinforcement learning

Simone Drago, Marco Mussi, and Alberto Maria Metelli. Sleeping reinforcement learning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Proceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 1443...

work page 2025
[14]

Tight regret bounds for model-based reinforcement learning with greedy policies, 2019

Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, and Shie Mannor. Tight regret bounds for model-based reinforcement learning with greedy policies, 2019. 11

work page 2019
[15]

Efficient bias-span-constrained exploration-exploitation in reinforcement learning, 2018

Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning, 2018

work page 2018
[16]

Contextual markov decision processes

Assaf Hallak, Dotan Di Castro, and Shie Mannor. Contextual markov decision processes. InInternational Conference on Machine Learning (ICML), pages 2074–2082, 2015

work page 2074
[17]

Near-optimal regret bounds for reinforcement learning

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(51):1563–1600, 2010

work page 2010
[18]

Schapire

Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low bellman rank are pac-learnable, 2016

work page 2016
[19]

The complexity of MDP policy evaluation under parameter uncertainty.Journal of Machine Learning Research, 18(1):7120–7155, 2017

Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The complexity of MDP policy evaluation under parameter uncertainty.Journal of Machine Learning Research, 18(1):7120–7155, 2017

work page 2017
[20]

Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is q-learning provably efficient?, 2018

work page 2018
[21]

Best-of-both-worlds algorithms for linear contextual bandits, 2024

Yuko Kuroki, Alberto Rumi, Taira Tsuchiya, Fabio Vitale, and Nicolò Cesa-Bianchi. Best-of-both-worlds algorithms for linear contextual bandits, 2024

work page 2024
[22]

The epoch-greedy algorithm for contextual multi-armed bandits

John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 21st International Conference on Neural Information Processing Systems, NIPS’07, page 817–824, Red Hook, NY, USA, 2007. Curran Associates Inc

work page 2007
[23]

End-to-end training of deep visuomotor policies.Journal of Machine Learning Research, 17(39):1–40, 2016

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies.Journal of Machine Learning Research, 17(39):1–40, 2016

work page 2016
[24]

Eluder-based regret for stochastic contextual MDPs

Orin Levy, Asaf Cassel, Alon Cohen, and Yishay Mansour. Eluder-based regret for stochastic contextual MDPs. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning ...

work page 2024
[25]

Near-optimal regret for policy optimization in contextual mdps with general offline function approximation, 2026

Orin Levy, Aviv Rosenberg, Alon Cohen, and Yishay Mansour. Near-optimal regret for policy optimization in contextual mdps with general offline function approximation, 2026

work page 2026
[26]

Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning, 2022

Gen Li, Laixi Shi, Yuxin Chen, and Yuejie Chi. Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning, 2022

work page 2022
[27]

Minimax-optimal reward-agnostic exploration in reinforcement learning, 2024

Gen Li, Yuling Yan, Yuxin Chen, and Jianqing Fan. Minimax-optimal reward-agnostic exploration in reinforcement learning, 2024

work page 2024
[28]

Lee, Yuejie Chi, and Yuxin Chen

Gen Li, Wenhao Zhan, Jason D. Lee, Yuejie Chi, and Yuxin Chen. Reward-agnostic fine-tuning: Provable statistical benefits of hybrid reinforcement learning, 2023

work page 2023
[29]

Schapire

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. InInternational World Wide Web Conference (WWW), pages 661–670, 2010

work page 2010
[30]

Yuanzhi Li, Ruosong Wang, and Lin F. Yang. Settling the horizon-dependence of sample complexity in reinforcement learning, 2021

work page 2021
[31]

Empirical Bernstein Bounds and Sample Variance Penalization

Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. InProceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009. Also available on arXiv:0907.3740. 12

work page internal anchor Pith review Pith/arXiv arXiv 2009
[32]

Ucb momentum q-learning: Correcting the bias without forgetting, 2022

Pierre Menard, Omar Darwiche Domingues, Xuedong Shang, and Michal Valko. Ucb momentum q-learning: Correcting the bias without forgetting, 2022

work page 2022
[33]

Markov decision processes with continuous side information

Aditya Modi, Nan Jiang, Satinder Singh, and Ambuj Tewari. Markov decision processes with continuous side information. In Firdaus Janoos, Mehryar Mohri, and Karthik Sridharan, editors,Proceedings of Algorithmic Learning Theory, volume 83 ofProceedings of Machine Learning Research, pages 597–618. PMLR, 07–09 Apr 2018

work page 2018
[34]

No-regret exploration in contextual reinforcement learning

Aditya Modi and Ambuj Tewari. No-regret exploration in contextual reinforcement learning. In Jonas Peters and David Sontag, editors,Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124 ofProceedings of Machine Learning Research, pages 829–838. PMLR, 03–06 Aug 2020

work page 2020
[35]

A unifying view of optimism in episodic reinforcement learning, 2020

Gergely Neu and Ciara Pike-Burke. A unifying view of optimism in episodic reinforcement learning, 2020

work page 2020
[36]

Contextual MDP for online advertising

Eric Schwartz, Jesse Bradshaw, and Brantley Matthews. Contextual MDP for online advertising. In ACM Conference on Economics and Computation, 2017

work page 2017
[37]

Non-asymptotic gap-dependent regret bounds for tabular mdps, 2019

Max Simchowitz and Kevin Jamieson. Non-asymptotic gap-dependent regret bounds for tabular mdps, 2019

work page 2019
[38]

Model-based reinforcement learning with nearly tight exploration complexity bounds

István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. InProceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1031–1038, 2010

work page 2010
[39]

Du, Lin F

Ruosong Wang, Simon S. Du, Lin F. Yang, and Sham M. Kakade. Is long horizon reinforcement learning more difficult than short horizon reinforcement learning?, 2020

work page 2020
[40]

Haike Xu, Tengyu Ma, and Simon S. Du. Fine-grained gap-dependent bounds for tabular mdps via adaptive multi-step bootstrap, 2021

work page 2021
[41]

Linear contextual bandits with interference, 2024

Yang Xu, Wenbin Lu, and Rui Song. Linear contextual bandits with interference, 2024

work page 2024
[42]

Q-learning with logarithmic regret

Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. In Arindam Banerjee and Kenji Fukumizu, editors,Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 ofProceedings of Machine Learning Research, pages 1576–1584. PMLR, 13–15 Apr 2021

work page 2021
[43]

Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds

Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. InInternational Conference on Machine Learning, pages 7304–7312, 2019

work page 2019
[44]

Contextual linear bandits with delay as payoff, 2025

Mengxiao Zhang, Yingfei Wang, and Haipeng Luo. Contextual linear bandits with delay as payoff, 2025

work page 2025
[45]

Settling the sample complexity of online reinforcement learning

Zihan Zhang, Yuxin Chen, Jason D Lee, and Simon S Du. Settling the sample complexity of online reinforcement learning. InThe Thirty Seventh Annual Conference on Learning Theory, pages 5213–5219. PMLR, 2024

work page 2024
[46]

Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon

Zihan Zhang, Xiangyang Ji, and Simon Du. Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon. InConference on Learning Theory, pages 4528–4531, 2021

work page 2021
[47]

∃n, nX k=1 Xk ≥3 nX k=1 Yk +llog 1 δ # ≤δ P

Runlong Zhou, Zihan Zhang, and Simon S. Du. Sharp variance-dependent bounds in reinforcement learning: Best of both worlds in stochastic and deterministic environments, 2023. 13 A Notations and Technical Lemmas A.1 Notations 14 S, S=|S| State space and its size A, A=|A| Action space and its size H Horizon length K Number of learning episodes D Distributio...

work page 2023

[1] [1]

Schapire

Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits, 2014

work page 2014

[2] [2]

Posterior sampling for reinforcement learning: worst-case regret bounds, 2020

Shipra Agrawal and Randy Jia. Posterior sampling for reinforcement learning: worst-case regret bounds, 2020

work page 2020

[3] [3]

Minimax regret bounds for reinforcement learning

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. InProceedings of the 34th International Conference on Machine Learning, pages 263–272, 2017

work page 2017

[4] [4]

Federated linear contextual bandits with heterogeneous clients, 2024

Ethan Blaser, Chuanhao Li, and Hongning Wang. Federated linear contextual bandits with heterogeneous clients, 2024

work page 2024

[5] [5]

Strategic linear contextual bandits, 2024

Thomas Kleine Buening, Aadirupa Saha, Christos Dimitrakakis, and Haifeng Xu. Strategic linear contextual bandits, 2024

work page 2024

[6] [6]

Murphy.Statistical Reinforcement Learning for Dynamic Treatment Regimes

Bibhas Chakraborty and Susan A. Murphy.Statistical Reinforcement Learning for Dynamic Treatment Regimes. Springer, 2013

work page 2013

[7] [7]

Sharp gap-dependent variance-aware regret bounds for tabular MDPs

Shulun Chen, Runlong Zhou, Zihan Zhang, Maryam Fazel, and Simon Shaolei Du. Sharp gap-dependent variance-aware regret bounds for tabular MDPs. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

work page 2026

[8] [8]

Contextual bandits with linear payoff functions

Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors,Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 ofProceedings of Machine Learning Research, pages 208–214, Fort Lauderdale, FL, USA, ...

work page 2011

[9] [9]

Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning, 2018

Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning, 2018

work page 2018

[10] [10]

Sample complexity characterization for linear contextual mdps, 2024

Junze Deng, Yuan Cheng, Shaofeng Zou, and Yingbin Liang. Sample complexity characterization for linear contextual mdps, 2024

work page 2024

[11] [11]

Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited, 2020

Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, and Michal Valko. Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited, 2020

work page 2020

[12] [12]

Q-learning with ucb exploration is sample efficient for infinite-horizon mdp, 2019

Kefan Dong, Yuanhao Wang, Xiaoyu Chen, and Liwei Wang. Q-learning with ucb exploration is sample efficient for infinite-horizon mdp, 2019

work page 2019

[13] [13]

Sleeping reinforcement learning

Simone Drago, Marco Mussi, and Alberto Maria Metelli. Sleeping reinforcement learning. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Proceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 1443...

work page 2025

[14] [14]

Tight regret bounds for model-based reinforcement learning with greedy policies, 2019

Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, and Shie Mannor. Tight regret bounds for model-based reinforcement learning with greedy policies, 2019. 11

work page 2019

[15] [15]

Efficient bias-span-constrained exploration-exploitation in reinforcement learning, 2018

Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning, 2018

work page 2018

[16] [16]

Contextual markov decision processes

Assaf Hallak, Dotan Di Castro, and Shie Mannor. Contextual markov decision processes. InInternational Conference on Machine Learning (ICML), pages 2074–2082, 2015

work page 2074

[17] [17]

Near-optimal regret bounds for reinforcement learning

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(51):1563–1600, 2010

work page 2010

[18] [18]

Schapire

Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low bellman rank are pac-learnable, 2016

work page 2016

[19] [19]

The complexity of MDP policy evaluation under parameter uncertainty.Journal of Machine Learning Research, 18(1):7120–7155, 2017

Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The complexity of MDP policy evaluation under parameter uncertainty.Journal of Machine Learning Research, 18(1):7120–7155, 2017

work page 2017

[20] [20]

Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is q-learning provably efficient?, 2018

work page 2018

[21] [21]

Best-of-both-worlds algorithms for linear contextual bandits, 2024

Yuko Kuroki, Alberto Rumi, Taira Tsuchiya, Fabio Vitale, and Nicolò Cesa-Bianchi. Best-of-both-worlds algorithms for linear contextual bandits, 2024

work page 2024

[22] [22]

The epoch-greedy algorithm for contextual multi-armed bandits

John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 21st International Conference on Neural Information Processing Systems, NIPS’07, page 817–824, Red Hook, NY, USA, 2007. Curran Associates Inc

work page 2007

[23] [23]

End-to-end training of deep visuomotor policies.Journal of Machine Learning Research, 17(39):1–40, 2016

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies.Journal of Machine Learning Research, 17(39):1–40, 2016

work page 2016

[24] [24]

Eluder-based regret for stochastic contextual MDPs

Orin Levy, Asaf Cassel, Alon Cohen, and Yishay Mansour. Eluder-based regret for stochastic contextual MDPs. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning ...

work page 2024

[25] [25]

Near-optimal regret for policy optimization in contextual mdps with general offline function approximation, 2026

Orin Levy, Aviv Rosenberg, Alon Cohen, and Yishay Mansour. Near-optimal regret for policy optimization in contextual mdps with general offline function approximation, 2026

work page 2026

[26] [26]

Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning, 2022

Gen Li, Laixi Shi, Yuxin Chen, and Yuejie Chi. Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning, 2022

work page 2022

[27] [27]

Minimax-optimal reward-agnostic exploration in reinforcement learning, 2024

Gen Li, Yuling Yan, Yuxin Chen, and Jianqing Fan. Minimax-optimal reward-agnostic exploration in reinforcement learning, 2024

work page 2024

[28] [28]

Lee, Yuejie Chi, and Yuxin Chen

Gen Li, Wenhao Zhan, Jason D. Lee, Yuejie Chi, and Yuxin Chen. Reward-agnostic fine-tuning: Provable statistical benefits of hybrid reinforcement learning, 2023

work page 2023

[29] [29]

Schapire

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. InInternational World Wide Web Conference (WWW), pages 661–670, 2010

work page 2010

[30] [30]

Yuanzhi Li, Ruosong Wang, and Lin F. Yang. Settling the horizon-dependence of sample complexity in reinforcement learning, 2021

work page 2021

[31] [31]

Empirical Bernstein Bounds and Sample Variance Penalization

Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. InProceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009. Also available on arXiv:0907.3740. 12

work page internal anchor Pith review Pith/arXiv arXiv 2009

[32] [32]

Ucb momentum q-learning: Correcting the bias without forgetting, 2022

Pierre Menard, Omar Darwiche Domingues, Xuedong Shang, and Michal Valko. Ucb momentum q-learning: Correcting the bias without forgetting, 2022

work page 2022

[33] [33]

Markov decision processes with continuous side information

Aditya Modi, Nan Jiang, Satinder Singh, and Ambuj Tewari. Markov decision processes with continuous side information. In Firdaus Janoos, Mehryar Mohri, and Karthik Sridharan, editors,Proceedings of Algorithmic Learning Theory, volume 83 ofProceedings of Machine Learning Research, pages 597–618. PMLR, 07–09 Apr 2018

work page 2018

[34] [34]

No-regret exploration in contextual reinforcement learning

Aditya Modi and Ambuj Tewari. No-regret exploration in contextual reinforcement learning. In Jonas Peters and David Sontag, editors,Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124 ofProceedings of Machine Learning Research, pages 829–838. PMLR, 03–06 Aug 2020

work page 2020

[35] [35]

A unifying view of optimism in episodic reinforcement learning, 2020

Gergely Neu and Ciara Pike-Burke. A unifying view of optimism in episodic reinforcement learning, 2020

work page 2020

[36] [36]

Contextual MDP for online advertising

Eric Schwartz, Jesse Bradshaw, and Brantley Matthews. Contextual MDP for online advertising. In ACM Conference on Economics and Computation, 2017

work page 2017

[37] [37]

Non-asymptotic gap-dependent regret bounds for tabular mdps, 2019

Max Simchowitz and Kevin Jamieson. Non-asymptotic gap-dependent regret bounds for tabular mdps, 2019

work page 2019

[38] [38]

Model-based reinforcement learning with nearly tight exploration complexity bounds

István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. InProceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1031–1038, 2010

work page 2010

[39] [39]

Du, Lin F

Ruosong Wang, Simon S. Du, Lin F. Yang, and Sham M. Kakade. Is long horizon reinforcement learning more difficult than short horizon reinforcement learning?, 2020

work page 2020

[40] [40]

Haike Xu, Tengyu Ma, and Simon S. Du. Fine-grained gap-dependent bounds for tabular mdps via adaptive multi-step bootstrap, 2021

work page 2021

[41] [41]

Linear contextual bandits with interference, 2024

Yang Xu, Wenbin Lu, and Rui Song. Linear contextual bandits with interference, 2024

work page 2024

[42] [42]

Q-learning with logarithmic regret

Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. In Arindam Banerjee and Kenji Fukumizu, editors,Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 ofProceedings of Machine Learning Research, pages 1576–1584. PMLR, 13–15 Apr 2021

work page 2021

[43] [43]

Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds

Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. InInternational Conference on Machine Learning, pages 7304–7312, 2019

work page 2019

[44] [44]

Contextual linear bandits with delay as payoff, 2025

Mengxiao Zhang, Yingfei Wang, and Haipeng Luo. Contextual linear bandits with delay as payoff, 2025

work page 2025

[45] [45]

Settling the sample complexity of online reinforcement learning

Zihan Zhang, Yuxin Chen, Jason D Lee, and Simon S Du. Settling the sample complexity of online reinforcement learning. InThe Thirty Seventh Annual Conference on Learning Theory, pages 5213–5219. PMLR, 2024

work page 2024

[46] [46]

Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon

Zihan Zhang, Xiangyang Ji, and Simon Du. Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon. InConference on Learning Theory, pages 4528–4531, 2021

work page 2021

[47] [47]

∃n, nX k=1 Xk ≥3 nX k=1 Yk +llog 1 δ # ≤δ P

Runlong Zhou, Zihan Zhang, and Simon S. Du. Sharp variance-dependent bounds in reinforcement learning: Best of both worlds in stochastic and deterministic environments, 2023. 13 A Notations and Technical Lemmas A.1 Notations 14 S, S=|S| State space and its size A, A=|A| Action space and its size H Horizon length K Number of learning episodes D Distributio...

work page 2023