Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning

Harin Lee; Min-hwan Oh

arxiv: 2605.05102 · v3 · pith:K7DK3EPKnew · submitted 2026-05-06 · 💻 cs.LG · stat.ML

Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning

Harin Lee , Min-hwan Oh This is my paper

Pith reviewed 2026-06-30 23:56 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords multi-armed banditsreinforcement learningdistributional regretUCBVIexploration bonusstochastic boundsregret distribution

0 comments

The pith

A UCBVI-style algorithm with min-based exploration bonus achieves optimal distributional regret bounds in bandits and reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a unified framework for bounding the distribution of regret rather than just its expectation in stochastic multi-armed bandits and episodic reinforcement learning. It defines distributional regret bounds as probabilistic guarantees that hold uniformly for all confidence levels delta from 0 to 1. The authors introduce a simple algorithm using an exploration bonus of the form min of two terms with user-chosen parameters that depend on visit counts. This setup allows derivation of general bounds that trade off expected regret against tail behavior in both minimax and instance-dependent ways. As a key result, the framework proves a distributional regret bound of order square root of A T times log of 1 over delta for multi-armed bandits, settling a prior conjecture.

Core claim

For arbitrary parameter sequences, the algorithm yields gap-independent and gap-dependent distributional regret bounds that achieve optimal trade-offs between expected and distributional regret in both minimax and instance-dependent regimes for multi-armed bandits and episodic reinforcement learning.

What carries the argument

The exploration bonus min{c_{1,k}/N, c_{2,k}/√N} with user-specified parameters (c1,k, c2,k), which enables control over the full regret distribution through the choice of parameters.

If this is right

The bounds hold uniformly over all delta in (0,1], characterizing the entire regret distribution.
For multi-armed bandits with A arms and horizon T, distributional regret is bounded by O(√(AT) log(1/δ)).
Optimal trade-offs are achieved in minimax and instance-dependent regimes.
The parameters allow a principled way to balance expected performance, tail risk, and instance-dependent behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach might allow similar distributional analyses in other sequential decision problems like continuous control or non-stationary environments.
The confirmation of the conjecture opens the door to proving tight distributional bounds in more general settings beyond episodic RL.
Users could select parameters to prioritize low tail risk in applications where large regret events are costly.

Load-bearing premise

The specific form of the exploration bonus min{c1,k/N, c2,k/√N} with arbitrary user-specified parameters produces the general distributional regret bounds in the stochastic setting.

What would settle it

Running the algorithm on a multi-armed bandit instance with known optimal policy and checking if the observed regret exceeds the claimed O(√(AT) log(1/δ)) order with probability exceeding δ would falsify the bound.

read the original abstract

We study the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning through a unified framework. We formalize a distributional regret bound as a probabilistic guarantee that holds uniformly over all confidence levels $\delta \in (0,1]$, thereby characterizing the regret distribution across the full range of $\delta$. We present a simple UCBVI-style algorithm with exploration bonus $\min\{c_{1,k}/N, c_{2,k}/\sqrt{N}\}$, where $N$ denotes the visit count and $(c_{1,k},c_{2,k})$ are user-specified parameters. For arbitrary parameter sequences, we derive general gap-independent and gap-dependent distributional regret bounds, yielding a principled characterization of how the parameters control the trade-off between expected performance, tail risk, and instance-dependent behavior. In particular, our bounds achieve optimal trade-offs between expected and distributional regret in both minimax and instance-dependent regimes. As a special case, for multi-armed bandits with $A$ arms and horizon $T$, we obtain a distributional regret bound of order $\mathcal{O}(\sqrt{AT}\log(1/\delta))$, confirming the conjecture of Lattimore & Szepesv\'ari (2020, Section 17.1) for the first time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Confirms the distributional regret conjecture for bandits via a flexible UCBVI bonus and unifies it with episodic RL, though the derivations for arbitrary parameters need checking.

read the letter

This paper proves the conjecture from Lattimore and Szepesvari on a distributional regret bound of order O(sqrt(AT) log(1/delta)) for multi-armed bandits and builds a unified framework that covers both bandits and episodic RL.

They use a UCBVI-style algorithm whose exploration bonus is the min of two terms with user-specified constants c1,k and c2,k. For any sequence of those constants they derive both gap-independent and gap-dependent distributional regret bounds, which lets them describe the trade-off between expected regret and tail behavior in minimax and instance-dependent regimes. The unification and the explicit parameter control are the parts that work cleanly on the page.

The soft spot is the step that claims the chosen bonus form produces the stated general bounds for completely arbitrary parameter sequences. The abstract presents this as direct, but that is exactly where a referee would need to see the probability calculations and the handling of visit counts to confirm there are no hidden restrictions or gaps. Everything else in the high-level argument lines up without circularity.

The work is for theorists who already care about regret distributions rather than just expectations, especially those working on risk-sensitive RL or high-probability guarantees. A reader looking for new rates or a clean way to interpolate between mean and tail performance will find the framework useful.

Send it to peer review. Resolving the conjecture is a concrete contribution worth referee time even if the proofs need tightening.

Referee Report

0 major / 2 minor

Summary. The paper introduces a unified framework for distributional regret bounds in stochastic multi-armed bandits and episodic reinforcement learning. It proposes a UCBVI-style algorithm with exploration bonus min{c_{1,k}/N, c_{2,k}/√N} using arbitrary user-specified parameter sequences (c_{1,k}, c_{2,k}). General gap-independent and gap-dependent distributional regret bounds are derived that characterize trade-offs between expected regret, tail risk, and instance-dependent behavior. A key special case yields an O(√(AT) log(1/δ)) distributional regret bound for MAB, confirming the conjecture in Lattimore & Szepesvári (2020, §17.1).

Significance. If the derivations hold, the work is significant for providing the first confirmation of the cited conjecture on distributional regret and for enabling principled control of expected vs. distributional trade-offs via arbitrary parameters in both minimax and instance-dependent settings. The general bounds for user-specified sequences across MAB and RL represent a clear advance over expectation-only analyses.

minor comments (2)

The abstract and introduction would benefit from an explicit statement of the precise form of the distributional regret bound (e.g., the probability statement involving δ) to make the uniformity over δ ∈ (0,1] immediately clear.
Notation for the visit count N and the sequences (c_{1,k}, c_{2,k}) should be defined at first use with a short reminder of their dependence on the state-action pair or time index.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report lists no specific major comments to address.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper derives distributional regret bounds directly from a UCBVI-style algorithm whose exploration bonus uses arbitrary user-specified parameters (c_{1,k}, c_{2,k}). The resulting gap-independent and gap-dependent bounds are stated to hold for any such parameter sequences, and the MAB special case O(√(AT) log(1/δ)) is presented as a direct consequence that confirms an external conjecture (Lattimore & Szepesvári 2020). No step reduces a claimed prediction to a fitted quantity by construction, invokes a load-bearing self-citation, or imports a uniqueness result from the authors' prior work. The argument is therefore independent of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard domain assumptions in RL theory; no new entities or free parameters are introduced in the abstract.

axioms (1)

domain assumption Standard stochastic assumptions for multi-armed bandits and episodic MDPs (bounded rewards, independent transitions)
These are necessary for the regret analysis to hold but are standard in the field.

pith-pipeline@v0.9.1-grok · 5757 in / 1153 out tokens · 31348 ms · 2026-06-30T23:56:21.237039+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Minimax Quantile Lower Bounds for Interactive Statistical Decision Making with Privacy
cs.LG 2026-06 unverdicted novelty 7.0

Derives explicit minimax quantile lower bounds for Gaussian mean estimation and K-armed bandits under interactive decision making and MI privacy, with log(1/δ)/n and √(KT log(1/δ)) scalings.

Reference graph

Works this paper leans on

35 extracted references · 3 canonical work pages · cited by 1 Pith paper

[1]

Analysis of thompson sampling for the multi-armed bandit problem

Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on learning theory, pages 39--1. JMLR Workshop and Conference Proceedings, 2012

2012
[2]

Minimax policies for adversarial and stochastic bandits

Jean-Yves Audibert and S \'e bastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217--226, 2009

2009
[3]

Finite-time analysis of the multiarmed bandit problem

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47 0 (2): 0 235--256, 2002

2002
[4]

Minimax regret bounds for reinforcement learning

Mohammad Gheshlaghi Azar, Ian Osband, and R \'e mi Munos. Minimax regret bounds for reinforcement learning. In International conference on machine learning, pages 263--272. PMLR, 2017

2017
[5]

Regret analysis of stochastic and nonstochastic multi-armed bandit problems

S \'e bastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning , 5 0 (1): 0 1--122, 2012

2012
[6]

Implicit finite-horizon approximation and efficient optimal algorithms for stochastic shortest path

Liyu Chen, Mehdi Jafarnia-Jahromi, Rahul Jain, and Haipeng Luo. Implicit finite-horizon approximation and efficient optimal algorithms for stochastic shortest path. Advances in Neural Information Processing Systems, 34: 0 10849--10861, 2021

2021
[7]

Beyond value-function gaps: Improved instance-dependent regret bounds for episodic reinforcement learning

Christoph Dann, Teodor Vanislavov Marinov, Mehryar Mohri, and Julian Zimmert. Beyond value-function gaps: Improved instance-dependent regret bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 34: 0 1--12, 2021

2021
[8]

Anytime optimal algorithms in stochastic multi-armed bandits

R \'e my Degenne and Vianney Perchet. Anytime optimal algorithms in stochastic multi-armed bandits. In International Conference on Machine Learning, pages 1587--1595. PMLR, 2016

2016
[9]

Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited

Omar Darwiche Domingues, Pierre M \'e nard, Emilie Kaufmann, and Michal Valko. Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited. In Algorithmic Learning Theory, pages 578--598. PMLR, 2021

2021
[10]

The fragility of optimized bandit algorithms

Lin Fan and Peter W Glynn. The fragility of optimized bandit algorithms. Operations Research, 73 0 (6): 0 3173--3198, 2025

2025
[11]

On tail probabilities for martingales

David A Freedman. On tail probabilities for martingales. the Annals of Probability, pages 100--118, 1975

1975
[12]

Probability inequalities for sums of bounded random variables.Journal of the American Statistical Association, 58(301):13–30, 1963

Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58 0 (301): 0 13--30, 1963. doi:10.1080/01621459.1963.10500830

work page doi:10.1080/01621459.1963.10500830 1963
[13]

Multi-armed bandit with sub-exponential rewards

Huiwen Jia, Cong Shi, and Siqian Shen. Multi-armed bandit with sub-exponential rewards. Operations Research Letters, 49 0 (5): 0 728--733, 2021

2021
[14]

Thompson sampling with less exploration is fast and optimal

Tianyuan Jin, Xianglin Yang, Xiaokui Xiao, and Pan Xu. Thompson sampling with less exploration is fast and optimal. In International Conference on Machine Learning, pages 15239--15261. PMLR, 2023

2023
[15]

Tail distribution of regret in optimistic reinforcement learning

Sajad Khodadadian and Mehrdad Moharrami. Tail distribution of regret in optimistic reinforcement learning. arXiv preprint arXiv:2511.18247, 2025

work page arXiv 2025
[16]

Efficient learning by implicit exploration in bandit problems with side observations

Tom \'a s Koc \'a k, Gergely Neu, Michal Valko, and R \'e mi Munos. Efficient learning by implicit exploration in bandit problems with side observations. Advances in Neural Information Processing Systems, 27, 2014

2014
[17]

Asymptotically efficient adaptive allocation rules

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6 0 (1): 0 4--22, 1985

1985
[18]

Refining the confidence level for optimistic bandit strategies

Tor Lattimore. Refining the confidence level for optimistic bandit strategies. Journal of Machine Learning Research, 19 0 (20): 0 1--32, 2018

2018
[19]

Bandit algorithms

Tor Lattimore and Csaba Szepesv \'a ri. Bandit algorithms. Cambridge University Press, 2020

2020
[20]

Minimax optimal reinforcement learning with quasi-optimism

Harin Lee and Min - hwan Oh. Minimax optimal reinforcement learning with quasi-optimism. In The Thirteenth International Conference on Learning Representations, 2025

2025
[21]

A minimax and asymptotically optimal algorithm for stochastic bandits

Pierre M \'e nard and Aur \'e lien Garivier. A minimax and asymptotically optimal algorithm for stochastic bandits. In International Conference on Algorithmic Learning Theory, pages 223--237. PMLR, 2017

2017
[22]

Explore no more: Improved high-probability regret bounds for non-stochastic bandits

Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Advances in Neural Information Processing Systems, 28, 2015

2015
[23]

A simple and optimal policy design for online learning with safety against heavy-tailed risk

David Simchi-Levi, Zeyu Zheng, and Feng Zhu. A simple and optimal policy design for online learning with safety against heavy-tailed risk. Advances in Neural Information Processing Systems, 35: 0 33795--33805, 2022

2022
[24]

Regret distribution in stochastic bandits: Optimal trade-off between expectation and tail risk

David Simchi-Levi, Zeyu Zheng, and Feng Zhu. Regret distribution in stochastic bandits: Optimal trade-off between expectation and tail risk. arXiv preprint arXiv:2304.04341, 2023 a

work page arXiv 2023
[25]

Stochastic multi-armed bandits: Optimal trade-off among optimality, consistency, and tail risk

David Simchi-Levi, Zeyu Zheng, and Feng Zhu. Stochastic multi-armed bandits: Optimal trade-off among optimality, consistency, and tail risk. Advances in Neural Information Processing Systems, 36: 0 35619--35630, 2023 b

2023
[26]

A simple and optimal policy design with safety against heavy-tailed risk for stochastic bandits

David Simchi-Levi, Zeyu Zheng, and Feng Zhu. A simple and optimal policy design with safety against heavy-tailed risk for stochastic bandits. Management Science, 71 0 (7): 0 6298--6318, 2025

2025
[27]

Non-asymptotic gap-dependent regret bounds for tabular mdps

Max Simchowitz and Kevin G Jamieson. Non-asymptotic gap-dependent regret bounds for tabular mdps. Advances in Neural Information Processing Systems, 32, 2019

2019
[28]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018. URL http://incompleteideas.net/book/the-book-2nd.html

2018
[29]

From dirichlet to rubin: Optimistic exploration in rl without bonuses

Daniil Tiapkin, Denis Belomestny, Eric Moulines, Alexey Naumov, Sergey Samsonov, Yunhao Tang, Michal Valko, and Pierre M \'e nard. From dirichlet to rubin: Optimistic exploration in rl without bonuses. In International Conference on Machine Learning, pages 21380--21431. PMLR, 2022

2022
[30]

High-dimensional statistics: A non-asymptotic viewpoint, volume 48

Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019

2019
[31]

Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds

Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304--7312. PMLR, 2019

2019
[32]

Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon

Zihan Zhang, Xiangyang Ji, and Simon Du. Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon. In Conference on Learning Theory, pages 4528--4531. PMLR, 2021

2021
[33]

Settling the sample complexity of online reinforcement learning

Zihan Zhang, Yuxin Chen, Jason D Lee, and Simon S Du. Settling the sample complexity of online reinforcement learning. In The Thirty Seventh Annual Conference on Learning Theory, pages 5213--5219. PMLR, 2024

2024
[34]

Sharp variance-dependent bounds in reinforcement learning: Best of both worlds in stochastic and deterministic environments

Runlong Zhou, Zhang Zihan, and Simon Shaolei Du. Sharp variance-dependent bounds in reinforcement learning: Best of both worlds in stochastic and deterministic environments. In International Conference on Machine Learning, pages 42878--42914. PMLR, 2023

2023
[35]

Adaptive variance inflation in thompson sampling: Efficiency, safety, robustness, and beyond

Feng Zhu and David Simchi-Levi. Adaptive variance inflation in thompson sampling: Efficiency, safety, robustness, and beyond. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=IBhxvINfxv

2025

[1] [1]

Analysis of thompson sampling for the multi-armed bandit problem

Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on learning theory, pages 39--1. JMLR Workshop and Conference Proceedings, 2012

2012

[2] [2]

Minimax policies for adversarial and stochastic bandits

Jean-Yves Audibert and S \'e bastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217--226, 2009

2009

[3] [3]

Finite-time analysis of the multiarmed bandit problem

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47 0 (2): 0 235--256, 2002

2002

[4] [4]

Minimax regret bounds for reinforcement learning

Mohammad Gheshlaghi Azar, Ian Osband, and R \'e mi Munos. Minimax regret bounds for reinforcement learning. In International conference on machine learning, pages 263--272. PMLR, 2017

2017

[5] [5]

Regret analysis of stochastic and nonstochastic multi-armed bandit problems

S \'e bastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning , 5 0 (1): 0 1--122, 2012

2012

[6] [6]

Implicit finite-horizon approximation and efficient optimal algorithms for stochastic shortest path

Liyu Chen, Mehdi Jafarnia-Jahromi, Rahul Jain, and Haipeng Luo. Implicit finite-horizon approximation and efficient optimal algorithms for stochastic shortest path. Advances in Neural Information Processing Systems, 34: 0 10849--10861, 2021

2021

[7] [7]

Beyond value-function gaps: Improved instance-dependent regret bounds for episodic reinforcement learning

Christoph Dann, Teodor Vanislavov Marinov, Mehryar Mohri, and Julian Zimmert. Beyond value-function gaps: Improved instance-dependent regret bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 34: 0 1--12, 2021

2021

[8] [8]

Anytime optimal algorithms in stochastic multi-armed bandits

R \'e my Degenne and Vianney Perchet. Anytime optimal algorithms in stochastic multi-armed bandits. In International Conference on Machine Learning, pages 1587--1595. PMLR, 2016

2016

[9] [9]

Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited

Omar Darwiche Domingues, Pierre M \'e nard, Emilie Kaufmann, and Michal Valko. Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited. In Algorithmic Learning Theory, pages 578--598. PMLR, 2021

2021

[10] [10]

The fragility of optimized bandit algorithms

Lin Fan and Peter W Glynn. The fragility of optimized bandit algorithms. Operations Research, 73 0 (6): 0 3173--3198, 2025

2025

[11] [11]

On tail probabilities for martingales

David A Freedman. On tail probabilities for martingales. the Annals of Probability, pages 100--118, 1975

1975

[12] [12]

Probability inequalities for sums of bounded random variables.Journal of the American Statistical Association, 58(301):13–30, 1963

Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58 0 (301): 0 13--30, 1963. doi:10.1080/01621459.1963.10500830

work page doi:10.1080/01621459.1963.10500830 1963

[13] [13]

Multi-armed bandit with sub-exponential rewards

Huiwen Jia, Cong Shi, and Siqian Shen. Multi-armed bandit with sub-exponential rewards. Operations Research Letters, 49 0 (5): 0 728--733, 2021

2021

[14] [14]

Thompson sampling with less exploration is fast and optimal

Tianyuan Jin, Xianglin Yang, Xiaokui Xiao, and Pan Xu. Thompson sampling with less exploration is fast and optimal. In International Conference on Machine Learning, pages 15239--15261. PMLR, 2023

2023

[15] [15]

Tail distribution of regret in optimistic reinforcement learning

Sajad Khodadadian and Mehrdad Moharrami. Tail distribution of regret in optimistic reinforcement learning. arXiv preprint arXiv:2511.18247, 2025

work page arXiv 2025

[16] [16]

Efficient learning by implicit exploration in bandit problems with side observations

Tom \'a s Koc \'a k, Gergely Neu, Michal Valko, and R \'e mi Munos. Efficient learning by implicit exploration in bandit problems with side observations. Advances in Neural Information Processing Systems, 27, 2014

2014

[17] [17]

Asymptotically efficient adaptive allocation rules

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6 0 (1): 0 4--22, 1985

1985

[18] [18]

Refining the confidence level for optimistic bandit strategies

Tor Lattimore. Refining the confidence level for optimistic bandit strategies. Journal of Machine Learning Research, 19 0 (20): 0 1--32, 2018

2018

[19] [19]

Bandit algorithms

Tor Lattimore and Csaba Szepesv \'a ri. Bandit algorithms. Cambridge University Press, 2020

2020

[20] [20]

Minimax optimal reinforcement learning with quasi-optimism

Harin Lee and Min - hwan Oh. Minimax optimal reinforcement learning with quasi-optimism. In The Thirteenth International Conference on Learning Representations, 2025

2025

[21] [21]

A minimax and asymptotically optimal algorithm for stochastic bandits

Pierre M \'e nard and Aur \'e lien Garivier. A minimax and asymptotically optimal algorithm for stochastic bandits. In International Conference on Algorithmic Learning Theory, pages 223--237. PMLR, 2017

2017

[22] [22]

Explore no more: Improved high-probability regret bounds for non-stochastic bandits

Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Advances in Neural Information Processing Systems, 28, 2015

2015

[23] [23]

A simple and optimal policy design for online learning with safety against heavy-tailed risk

David Simchi-Levi, Zeyu Zheng, and Feng Zhu. A simple and optimal policy design for online learning with safety against heavy-tailed risk. Advances in Neural Information Processing Systems, 35: 0 33795--33805, 2022

2022

[24] [24]

Regret distribution in stochastic bandits: Optimal trade-off between expectation and tail risk

David Simchi-Levi, Zeyu Zheng, and Feng Zhu. Regret distribution in stochastic bandits: Optimal trade-off between expectation and tail risk. arXiv preprint arXiv:2304.04341, 2023 a

work page arXiv 2023

[25] [25]

Stochastic multi-armed bandits: Optimal trade-off among optimality, consistency, and tail risk

David Simchi-Levi, Zeyu Zheng, and Feng Zhu. Stochastic multi-armed bandits: Optimal trade-off among optimality, consistency, and tail risk. Advances in Neural Information Processing Systems, 36: 0 35619--35630, 2023 b

2023

[26] [26]

A simple and optimal policy design with safety against heavy-tailed risk for stochastic bandits

David Simchi-Levi, Zeyu Zheng, and Feng Zhu. A simple and optimal policy design with safety against heavy-tailed risk for stochastic bandits. Management Science, 71 0 (7): 0 6298--6318, 2025

2025

[27] [27]

Non-asymptotic gap-dependent regret bounds for tabular mdps

Max Simchowitz and Kevin G Jamieson. Non-asymptotic gap-dependent regret bounds for tabular mdps. Advances in Neural Information Processing Systems, 32, 2019

2019

[28] [28]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018. URL http://incompleteideas.net/book/the-book-2nd.html

2018

[29] [29]

From dirichlet to rubin: Optimistic exploration in rl without bonuses

Daniil Tiapkin, Denis Belomestny, Eric Moulines, Alexey Naumov, Sergey Samsonov, Yunhao Tang, Michal Valko, and Pierre M \'e nard. From dirichlet to rubin: Optimistic exploration in rl without bonuses. In International Conference on Machine Learning, pages 21380--21431. PMLR, 2022

2022

[30] [30]

High-dimensional statistics: A non-asymptotic viewpoint, volume 48

Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019

2019

[31] [31]

Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds

Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304--7312. PMLR, 2019

2019

[32] [32]

Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon

Zihan Zhang, Xiangyang Ji, and Simon Du. Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon. In Conference on Learning Theory, pages 4528--4531. PMLR, 2021

2021

[33] [33]

Settling the sample complexity of online reinforcement learning

Zihan Zhang, Yuxin Chen, Jason D Lee, and Simon S Du. Settling the sample complexity of online reinforcement learning. In The Thirty Seventh Annual Conference on Learning Theory, pages 5213--5219. PMLR, 2024

2024

[34] [34]

Sharp variance-dependent bounds in reinforcement learning: Best of both worlds in stochastic and deterministic environments

Runlong Zhou, Zhang Zihan, and Simon Shaolei Du. Sharp variance-dependent bounds in reinforcement learning: Best of both worlds in stochastic and deterministic environments. In International Conference on Machine Learning, pages 42878--42914. PMLR, 2023

2023

[35] [35]

Adaptive variance inflation in thompson sampling: Efficiency, safety, robustness, and beyond

Feng Zhu and David Simchi-Levi. Adaptive variance inflation in thompson sampling: Efficiency, safety, robustness, and beyond. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=IBhxvINfxv

2025