Quantile of Means: A Bonus-Free Ensemble Method for Minimax Optimal Reinforcement Learning

Asaf Cassel; Aviv Rosenberg

arxiv: 2606.20107 · v1 · pith:I4KLOMXRnew · submitted 2026-06-18 · 💻 cs.LG

Quantile of Means: A Bonus-Free Ensemble Method for Minimax Optimal Reinforcement Learning

Asaf Cassel , Aviv Rosenberg This is my paper

Pith reviewed 2026-06-26 18:16 UTC · model grok-4.3

classification 💻 cs.LG

keywords reinforcement learningensemble methodsregret boundsexplorationfinite-horizon MDPsquantile estimationvariance-dependent regret

0 comments

The pith

A quantile of means from an ensemble achieves minimax optimal regret bounds in finite-horizon MDPs without count-based bonuses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes replacing count-based uncertainty bonuses with a quantile of means drawn from an ensemble of value estimates. This approach extends a bandit method to finite-horizon MDPs and is shown to attain the optimal variance-dependent regret. A sympathetic reader would care because it gives theoretical support to a practical heuristic that avoids the computational difficulties of exact counts in large state spaces. If correct, it means ensemble methods can be used confidently for exploration in RL without sacrificing optimality.

Core claim

The quantile of means estimator, when applied to an ensemble in finite-horizon MDPs, yields a bonus-free exploration strategy that attains the minimax optimal regret bound which depends on the variance of the returns.

What carries the argument

The quantile of means, which selects a conservative estimate from multiple independent value function estimates to guide exploration.

If this is right

It achieves optimal variance-dependent regret bounds.
It eliminates the need for explicit count-based uncertainty estimates.
It extends ensemble methods from multi-armed bandits to MDPs while preserving optimality.
It provides theoretical justification for using ensembles in reinforcement learning exploration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar quantile ensembles might work in infinite-horizon or continuous state settings if the finite-horizon analysis can be adapted.
The method could be combined with deep neural network ensembles to scale to complex environments.
Testing the variance dependence in practice could reveal whether the theoretical improvement translates to faster learning in high-variance tasks.

Load-bearing premise

That the properties of the ensemble method carry over from bandits to MDPs without needing extra structure or assumptions on how the ensemble is built.

What would settle it

Running the algorithm on a finite-horizon MDP where the observed regret significantly exceeds the variance-dependent bound predicted by the theory.

read the original abstract

Optimal Reinforcement Learning (RL) algorithms typically rely on carefully constructed count-based uncertainty estimates to drive exploration. Although theoretically sound, such estimates are hard to compute in practical settings and therefore offer limited insight for designing exploration heuristics. Meanwhile, ensembling has emerged as a practical approach, but remains without theoretical justification. Building on a recent ensemble-based method for Multi-Armed Bandits, we propose a quantile-based ensemble method for finite-horizon Markov Decision Processes (MDPs). Our simple count-free approach achieves optimal variance-dependent regret bounds, providing theoretical grounding for ensemble-based exploration in RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper lifts quantile-of-means ensembles from bandits to finite-horizon MDPs and claims optimal variance-dependent regret without counts, but the horizon error handling is the part that needs verification.

read the letter

The core contribution is a direct application of the quantile-of-means idea to finite-horizon MDPs, producing a count-free method that the authors say matches known optimal regret bounds. This is new in the sense that the prior bandit work did not cover value-function backups or multi-step error accumulation.

What the paper does cleanly is remove the need for explicit visit counts while still targeting variance-dependent bounds, which aligns with practical interest in ensemble-based exploration. If the math closes, it supplies a concrete justification for why simple ensembling can work in RL without ad-hoc bonuses.

The soft spot is exactly the one flagged in the stress test. Bandit concentration does not automatically carry over to MDPs because quantile errors can compound across the horizon H through the transition kernel and value estimates. Without explicit lemmas that bound how those per-step quantile deviations propagate and interact with the unknown dynamics, the optimality claim rests on an implicit assumption that the bandit result suffices. I would check whether the proof introduces new concentration arguments or simply invokes the earlier result.

The work is aimed at RL theorists who track regret analyses and ensemble techniques. A reader already familiar with the bandit quantile paper will see the extension immediately and can judge the technical lift. It is worth sending to peer review so the details on error propagation get checked by people who work on finite-horizon bounds; the claim is sharp enough to merit that step even if revisions follow.

Referee Report

2 major / 2 minor

Summary. The paper proposes a quantile-of-means ensemble method for exploration in finite-horizon MDPs. Extending a recent count-free ensemble approach from multi-armed bandits, the method is claimed to achieve minimax-optimal variance-dependent regret bounds without relying on count-based bonuses or explicit uncertainty estimates.

Significance. If the central claims hold with rigorous proofs, the result would be significant: it supplies the first theoretical justification for a practical ensemble heuristic in RL while matching known optimal variance-dependent rates (typically of the form involving sqrt(V* H^3 S A T) or similar). This bridges the gap between count-based theory and ensemble practice and could influence algorithm design.

major comments (2)

[MDP analysis / main regret theorem] The extension from the MAB setting to finite-horizon MDPs requires explicit control of quantile-estimate errors across the horizon H and through value-function backups; without new concentration lemmas addressing propagation through the unknown transition kernel, the optimality claim rests on an implicit assumption that the per-step bandit guarantee lifts directly (see skeptic note on horizon-dependent concentration).
[§3 (method) and regret proof] It is unclear whether the ensemble construction (size, quantile level, or variance handling) is independent of state-action visit counts or requires additional assumptions on the MDP to retain the variance-dependent bound; any hidden dependence would undermine the 'count-free' and 'bonus-free' claims.

minor comments (2)

Clarify all assumptions on the MDP (finite horizon, bounded rewards, etc.) and ensemble parameters in the main text rather than deferring to the bandit reference.
Ensure experimental sections (if present) report variance-dependent metrics and compare against both count-based and other ensemble baselines with statistical significance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We are pleased that the potential significance of the result is recognized. Below we provide point-by-point responses to the major comments.

read point-by-point responses

Referee: [MDP analysis / main regret theorem] The extension from the MAB setting to finite-horizon MDPs requires explicit control of quantile-estimate errors across the horizon H and through value-function backups; without new concentration lemmas addressing propagation through the unknown transition kernel, the optimality claim rests on an implicit assumption that the per-step bandit guarantee lifts directly (see skeptic note on horizon-dependent concentration).

Authors: The proof of the main regret theorem (Theorem 4.1) does include dedicated concentration lemmas for the propagation of quantile estimation errors over the horizon. In particular, we develop a new martingale concentration result (Lemma 4.2) that accounts for the dependence through the unknown transition kernel in the value function backups. This allows the per-step guarantees to be lifted rigorously to the full horizon without additional assumptions. We can add a pointer to these lemmas in the main text if it helps clarify the structure. revision: partial
Referee: [§3 (method) and regret proof] It is unclear whether the ensemble construction (size, quantile level, or variance handling) is independent of state-action visit counts or requires additional assumptions on the MDP to retain the variance-dependent bound; any hidden dependence would undermine the 'count-free' and 'bonus-free' claims.

Authors: The ensemble parameters are chosen independently of visit counts. Section 3 specifies that the number of ensemble members and the quantile level are absolute constants that do not depend on n(s,a) or any other state-action specific quantities. The variance-dependent bound is derived directly from the quantile-of-means estimator without invoking count-based terms. No extra MDP assumptions are used. To address the lack of clarity, we will revise Section 3 to explicitly state the independence from visit counts. revision: yes

Circularity Check

0 steps flagged

No circularity identified from visible content

full rationale

The provided abstract and context contain no equations, derivations, or explicit load-bearing steps. The claim of extending a recent MAB ensemble method to MDPs is stated at a high level without any reduction to fitted parameters, self-definitional quantities, or unverified self-citations shown. Per the hard rules, circularity requires quoting specific paper text exhibiting reduction by construction; none is available here. This is the common honest non-finding when the derivation chain is not inspectable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5618 in / 899 out tokens · 25300 ms · 2026-06-26T18:16:52.441857+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

61 extracted references · 9 canonical work pages · 3 internal anchors

[1]

Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics , pages =

On Bayesian Upper Confidence Bounds for Bandit Problems , author =. Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics , pages =. 2012 , editor =

2012
[2]

Bootstrapping Upper Confidence Bound , volume =

Hao, Botao and Abbasi Yadkori, Yasin and Wen, Zheng and Cheng, Guang , booktitle =. Bootstrapping Upper Confidence Bound , volume =
[3]

Improved Regret of Linear Ensemble Sampling , volume =

Lee, Harin and Oh, Min-hwan , booktitle =. Improved Regret of Linear Ensemble Sampling , volume =. doi:10.52202/079017-2947 , editor =

work page doi:10.52202/079017-2947
[4]

Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees , volume =

Tiapkin, Daniil and Belomestny, Denis and Calandriello, Daniele and Moulines, Eric and Munos, Remi and Naumov, Alexey and Rowland, Mark and Valko, Michal and M\'. Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees , volume =. Advances in Neural Information Processing Systems , editor =
[5]

Tiapkin, Daniil and Belomestny, Denis and Moulines, Eric and Naumov, Alexey and Samsonov, Sergey and Tang, Yunhao and Valko, Michal and Menard, Pierre , booktitle =. From. 2022 , editor =

2022
[6]

2025 , editor =

Viel, Stefano and Viano, Luca and Cevher, Volkan , booktitle =. 2025 , editor =

2025
[7]

Advances in neural information processing systems , volume=

Is Q-learning provably efficient? , author=. Advances in neural information processing systems , volume=
[8]

Advances in neural information processing systems , volume=

Optimistic posterior sampling for reinforcement learning: worst-case regret bounds , author=. Advances in neural information processing systems , volume=
[9]

Advances in neural information processing systems , volume=

Worst-case regret bounds for exploration via randomized value functions , author=. Advances in neural information processing systems , volume=
[10]

Advances in Neural Information Processing Systems , volume=

Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=
[11]

Advances in neural information processing systems , volume=

Ensemble sampling , author=. Advances in neural information processing systems , volume=
[12]

Proceedings of the 32nd International Conference on Algorithmic Learning Theory , pages =

Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited , author =. Proceedings of the 32nd International Conference on Algorithmic Learning Theory , pages =. 2021 , editor =

2021
[13]

Proceedings of the 40th International Conference on Machine Learning , pages =

Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments , author =. Proceedings of the 40th International Conference on Machine Learning , pages =. 2023 , editor =

2023
[14]

nature , volume=

Human-level control through deep reinforcement learning , author=. nature , volume=. 2015 , publisher=

2015
[15]

International conference on machine learning , pages=

Trust region policy optimization , author=. International conference on machine learning , pages=. 2015 , organization=

2015
[16]

Proximal Policy Optimization Algorithms

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[17]

International Conference on Machine Learning , pages=

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor , author=. International Conference on Machine Learning , pages=. 2018 , organization=

2018
[18]

Advances in Neural Information Processing Systems , volume=

Learning to summarize with human feedback , author=. Advances in Neural Information Processing Systems , volume=
[19]

Advances in Neural Information Processing Systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , volume=
[20]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Batch ensemble for variance dependent regret in stochastic bandits , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[21]

International conference on machine learning , pages=

Sunrise: A simple unified framework for ensemble learning in deep reinforcement learning , author=. International conference on machine learning , pages=. 2021 , organization=

2021
[22]

Ross , title =

Xinyue Chen and Che Wang and Zijian Zhou and Keith W. Ross , title =. 9th International Conference on Learning Representations,
[23]

International conference on machine learning , pages=

Self-supervised exploration via disagreement , author=. International conference on machine learning , pages=. 2019 , organization=

2019
[24]

International Conference on Machine Learning , pages=

Near-optimal regret bounds for stochastic shortest path , author=. International Conference on Machine Learning , pages=. 2020 , organization=

2020
[25]

International conference on machine learning , pages=

Minimax regret bounds for reinforcement learning , author=. International conference on machine learning , pages=. 2017 , organization=

2017
[26]

Proceedings of the 36th International Conference on Machine Learning , pages =

Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds , author =. Proceedings of the 36th International Conference on Machine Learning , pages =. 2019 , editor =

2019
[27]

Reinforcement Learning with Lookahead Information , volume =

Merlis, Nadav , booktitle =. Reinforcement Learning with Lookahead Information , volume =
[28]

Proceedings of the 31st International Conference on Machine Learning , pages =

Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits , author =. Proceedings of the 31st International Conference on Machine Learning , pages =. 2014 , editor =

2014
[29]

Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics , pages=

Contextual bandit algorithms with supervised learning guarantees , author=. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics , pages=. 2011 , organization=

2011
[30]

Freedman , journal =

David A. Freedman , journal =. On Tail Probabilities for Martingales , volume =
[31]

Proceedings of the thirty-sixth annual ACM symposium on Theory of computing , pages=

On sums of independent random variables with unbounded variance, and estimating the average degree in a graph , author=. Proceedings of the thirty-sixth annual ACM symposium on Theory of computing , pages=
[32]

UCB Exploration via Q-Ensembles

Ucb exploration via q-ensembles , author=. arXiv preprint arXiv:1706.01502 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[33]

International Conference on Machine Learning , pages=

Ensemble bootstrapping for Q-Learning , author=. International Conference on Machine Learning , pages=. 2021 , organization=

2021
[34]

International Conference on Machine Learning , pages=

Generalization and exploration via randomized value functions , author=. International Conference on Machine Learning , pages=. 2016 , organization=

2016
[35]

Advances in Neural Information Processing Systems , volume=

(More) efficient reinforcement learning via posterior sampling , author=. Advances in Neural Information Processing Systems , volume=
[36]

Advances in neural information processing systems , volume=

Deep exploration via bootstrapped DQN , author=. Advances in neural information processing systems , volume=
[37]

arXiv preprint arXiv:2210.04157 , year=

The role of coverage in online reinforcement learning , author=. arXiv preprint arXiv:2210.04157 , year=

work page arXiv
[38]

arXiv preprint arXiv:2110.11202 , year=

Anti-Concentrated Confidence Bonuses for Scalable Exploration , author=. arXiv preprint arXiv:2110.11202 , year=

work page arXiv
[39]

arXiv preprint arXiv:2212.06132 , year=

Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes , author=. arXiv preprint arXiv:2212.06132 , year=

work page arXiv
[40]

Proceedings of the 25th Annual Conference on Learning Theory , pages =

Analysis of Thompson Sampling for the Multi-armed Bandit Problem , author =. Proceedings of the 25th Annual Conference on Learning Theory , pages =. 2012 , editor =

2012
[41]

Bulletin of the American Mathematical Society , volume=

Some aspects of the sequential design of experiments , author=. Bulletin of the American Mathematical Society , volume=. 1952 , publisher=

1952
[42]

Advances in applied mathematics , volume=

Asymptotically efficient adaptive allocation rules , author=. Advances in applied mathematics , volume=. 1985 , publisher=

1985
[43]

Machine learning , volume=

Finite-time analysis of the multiarmed bandit problem , author=. Machine learning , volume=. 2002 , publisher=

2002
[44]

Proceedings of the 24th Annual Conference on Learning Theory , pages =

A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences , author =. Proceedings of the 24th Annual Conference on Learning Theory , pages =. 2011 , editor =

2011
[45]

Integer Programming and Combinatorial Optimization: 7th International IPCO Conference Graz, Austria, June 9--11, 1999 Proceedings , pages=

On the number of iterations for Dantzig-Wolfe optimization and packing-covering approximation algorithms , author=. Integer Programming and Combinatorial Optimization: 7th International IPCO Conference Graz, Austria, June 9--11, 1999 Proceedings , pages=. 1999 , organization=

1999
[46]

ACM Transactions on Algorithms (TALG) , volume=

Heavy hitters and the structure of local privacy , author=. ACM Transactions on Algorithms (TALG) , volume=. 2019 , publisher=

2019
[47]

arXiv preprint arXiv:2311.08376 , year=

Ensemble sampling for linear bandits: small ensembles suffice , author=. arXiv preprint arXiv:2311.08376 , year=

work page arXiv
[48]

arXiv preprint arXiv:2308.05435 , year=

Another look at binomial and related distributions exceeding values close to their centre , author=. arXiv preprint arXiv:2308.05435 , year=

work page arXiv
[49]

The Annals of Mathematical Statistics , pages=

On the distribution of the number of successes in independent trials , author=. The Annals of Mathematical Statistics , pages=. 1956 , publisher=

1956
[50]

Theoretical Computer Science , volume=

Exploration--exploitation tradeoff using variance estimates in multi-armed bandits , author=. Theoretical Computer Science , volume=. 2009 , publisher=

2009
[51]

Residual Bootstrap Exploration for Bandit Algorithms , journal =

Chi. Residual Bootstrap Exploration for Bandit Algorithms , journal =
[52]

Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits , booktitle =

Branislav Kveton and Csaba Szepesv. Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits , booktitle =
[53]

Perturbed-History Exploration in Stochastic Multi-Armed Bandits , booktitle =

Branislav Kveton and Csaba Szepesv. Perturbed-History Exploration in Stochastic Multi-Armed Bandits , booktitle =
[54]

Bootstrapping Upper Confidence Bound , booktitle =

Botao Hao and Yasin Abbasi. Bootstrapping Upper Confidence Bound , booktitle =
[55]

Sub-sampling for Multi-armed Bandits , booktitle =

Akram Baransi and Odalric. Sub-sampling for Multi-armed Bandits , booktitle =
[56]

Sub-sampling for Efficient Non-Parametric Bandit Exploration , journal =

Dorian Baudry and Emilie Kaufmann and Odalric. Sub-sampling for Efficient Non-Parametric Bandit Exploration , journal =
[57]

Maximum Average Randomly Sampled: A Scale Free and Non-parametric Algorithm for Stochastic Bandits , volume =

Moravej Khorasani, Masoud and Weyer, Erik , booktitle =. Maximum Average Randomly Sampled: A Scale Free and Non-parametric Algorithm for Stochastic Bandits , volume =
[58]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Maximum average randomly sampled: a scale free and non-parametric algorithm for stochastic bandits , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=
[59]

On the absolute constants in the Berry-Esseen type inequalities for identically distributed summands

On the absolute constants in the Berry-Esseen type inequalities for identically distributed summands , author=. arXiv preprint arXiv:1111.6554 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[60]

The Annals of Statistics , volume=

THE MULTI-ARMED BANDIT PROBLEM , author=. The Annals of Statistics , volume=. 2020 , publisher=

2020
[61]

Advances in Neural Information Processing Systems , volume=

Reinforcement learning with a terminator , author=. Advances in Neural Information Processing Systems , volume=

[1] [1]

Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics , pages =

On Bayesian Upper Confidence Bounds for Bandit Problems , author =. Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics , pages =. 2012 , editor =

2012

[2] [2]

Bootstrapping Upper Confidence Bound , volume =

Hao, Botao and Abbasi Yadkori, Yasin and Wen, Zheng and Cheng, Guang , booktitle =. Bootstrapping Upper Confidence Bound , volume =

[3] [3]

Improved Regret of Linear Ensemble Sampling , volume =

Lee, Harin and Oh, Min-hwan , booktitle =. Improved Regret of Linear Ensemble Sampling , volume =. doi:10.52202/079017-2947 , editor =

work page doi:10.52202/079017-2947

[4] [4]

Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees , volume =

Tiapkin, Daniil and Belomestny, Denis and Calandriello, Daniele and Moulines, Eric and Munos, Remi and Naumov, Alexey and Rowland, Mark and Valko, Michal and M\'. Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees , volume =. Advances in Neural Information Processing Systems , editor =

[5] [5]

Tiapkin, Daniil and Belomestny, Denis and Moulines, Eric and Naumov, Alexey and Samsonov, Sergey and Tang, Yunhao and Valko, Michal and Menard, Pierre , booktitle =. From. 2022 , editor =

2022

[6] [6]

2025 , editor =

Viel, Stefano and Viano, Luca and Cevher, Volkan , booktitle =. 2025 , editor =

2025

[7] [7]

Advances in neural information processing systems , volume=

Is Q-learning provably efficient? , author=. Advances in neural information processing systems , volume=

[8] [8]

Advances in neural information processing systems , volume=

Optimistic posterior sampling for reinforcement learning: worst-case regret bounds , author=. Advances in neural information processing systems , volume=

[9] [9]

Advances in neural information processing systems , volume=

Worst-case regret bounds for exploration via randomized value functions , author=. Advances in neural information processing systems , volume=

[10] [10]

Advances in Neural Information Processing Systems , volume=

Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

[11] [11]

Advances in neural information processing systems , volume=

Ensemble sampling , author=. Advances in neural information processing systems , volume=

[12] [12]

Proceedings of the 32nd International Conference on Algorithmic Learning Theory , pages =

Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited , author =. Proceedings of the 32nd International Conference on Algorithmic Learning Theory , pages =. 2021 , editor =

2021

[13] [13]

Proceedings of the 40th International Conference on Machine Learning , pages =

Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments , author =. Proceedings of the 40th International Conference on Machine Learning , pages =. 2023 , editor =

2023

[14] [14]

nature , volume=

Human-level control through deep reinforcement learning , author=. nature , volume=. 2015 , publisher=

2015

[15] [15]

International conference on machine learning , pages=

Trust region policy optimization , author=. International conference on machine learning , pages=. 2015 , organization=

2015

[16] [16]

Proximal Policy Optimization Algorithms

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

International Conference on Machine Learning , pages=

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor , author=. International Conference on Machine Learning , pages=. 2018 , organization=

2018

[18] [18]

Advances in Neural Information Processing Systems , volume=

Learning to summarize with human feedback , author=. Advances in Neural Information Processing Systems , volume=

[19] [19]

Advances in Neural Information Processing Systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , volume=

[20] [20]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Batch ensemble for variance dependent regret in stochastic bandits , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[21] [21]

International conference on machine learning , pages=

Sunrise: A simple unified framework for ensemble learning in deep reinforcement learning , author=. International conference on machine learning , pages=. 2021 , organization=

2021

[22] [22]

Ross , title =

Xinyue Chen and Che Wang and Zijian Zhou and Keith W. Ross , title =. 9th International Conference on Learning Representations,

[23] [23]

International conference on machine learning , pages=

Self-supervised exploration via disagreement , author=. International conference on machine learning , pages=. 2019 , organization=

2019

[24] [24]

International Conference on Machine Learning , pages=

Near-optimal regret bounds for stochastic shortest path , author=. International Conference on Machine Learning , pages=. 2020 , organization=

2020

[25] [25]

International conference on machine learning , pages=

Minimax regret bounds for reinforcement learning , author=. International conference on machine learning , pages=. 2017 , organization=

2017

[26] [26]

Proceedings of the 36th International Conference on Machine Learning , pages =

Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds , author =. Proceedings of the 36th International Conference on Machine Learning , pages =. 2019 , editor =

2019

[27] [27]

Reinforcement Learning with Lookahead Information , volume =

Merlis, Nadav , booktitle =. Reinforcement Learning with Lookahead Information , volume =

[28] [28]

Proceedings of the 31st International Conference on Machine Learning , pages =

Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits , author =. Proceedings of the 31st International Conference on Machine Learning , pages =. 2014 , editor =

2014

[29] [29]

Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics , pages=

Contextual bandit algorithms with supervised learning guarantees , author=. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics , pages=. 2011 , organization=

2011

[30] [30]

Freedman , journal =

David A. Freedman , journal =. On Tail Probabilities for Martingales , volume =

[31] [31]

Proceedings of the thirty-sixth annual ACM symposium on Theory of computing , pages=

On sums of independent random variables with unbounded variance, and estimating the average degree in a graph , author=. Proceedings of the thirty-sixth annual ACM symposium on Theory of computing , pages=

[32] [32]

UCB Exploration via Q-Ensembles

Ucb exploration via q-ensembles , author=. arXiv preprint arXiv:1706.01502 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[33] [33]

International Conference on Machine Learning , pages=

Ensemble bootstrapping for Q-Learning , author=. International Conference on Machine Learning , pages=. 2021 , organization=

2021

[34] [34]

International Conference on Machine Learning , pages=

Generalization and exploration via randomized value functions , author=. International Conference on Machine Learning , pages=. 2016 , organization=

2016

[35] [35]

Advances in Neural Information Processing Systems , volume=

(More) efficient reinforcement learning via posterior sampling , author=. Advances in Neural Information Processing Systems , volume=

[36] [36]

Advances in neural information processing systems , volume=

Deep exploration via bootstrapped DQN , author=. Advances in neural information processing systems , volume=

[37] [37]

arXiv preprint arXiv:2210.04157 , year=

The role of coverage in online reinforcement learning , author=. arXiv preprint arXiv:2210.04157 , year=

work page arXiv

[38] [38]

arXiv preprint arXiv:2110.11202 , year=

Anti-Concentrated Confidence Bonuses for Scalable Exploration , author=. arXiv preprint arXiv:2110.11202 , year=

work page arXiv

[39] [39]

arXiv preprint arXiv:2212.06132 , year=

Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes , author=. arXiv preprint arXiv:2212.06132 , year=

work page arXiv

[40] [40]

Proceedings of the 25th Annual Conference on Learning Theory , pages =

Analysis of Thompson Sampling for the Multi-armed Bandit Problem , author =. Proceedings of the 25th Annual Conference on Learning Theory , pages =. 2012 , editor =

2012

[41] [41]

Bulletin of the American Mathematical Society , volume=

Some aspects of the sequential design of experiments , author=. Bulletin of the American Mathematical Society , volume=. 1952 , publisher=

1952

[42] [42]

Advances in applied mathematics , volume=

Asymptotically efficient adaptive allocation rules , author=. Advances in applied mathematics , volume=. 1985 , publisher=

1985

[43] [43]

Machine learning , volume=

Finite-time analysis of the multiarmed bandit problem , author=. Machine learning , volume=. 2002 , publisher=

2002

[44] [44]

Proceedings of the 24th Annual Conference on Learning Theory , pages =

A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences , author =. Proceedings of the 24th Annual Conference on Learning Theory , pages =. 2011 , editor =

2011

[45] [45]

Integer Programming and Combinatorial Optimization: 7th International IPCO Conference Graz, Austria, June 9--11, 1999 Proceedings , pages=

On the number of iterations for Dantzig-Wolfe optimization and packing-covering approximation algorithms , author=. Integer Programming and Combinatorial Optimization: 7th International IPCO Conference Graz, Austria, June 9--11, 1999 Proceedings , pages=. 1999 , organization=

1999

[46] [46]

ACM Transactions on Algorithms (TALG) , volume=

Heavy hitters and the structure of local privacy , author=. ACM Transactions on Algorithms (TALG) , volume=. 2019 , publisher=

2019

[47] [47]

arXiv preprint arXiv:2311.08376 , year=

Ensemble sampling for linear bandits: small ensembles suffice , author=. arXiv preprint arXiv:2311.08376 , year=

work page arXiv

[48] [48]

arXiv preprint arXiv:2308.05435 , year=

Another look at binomial and related distributions exceeding values close to their centre , author=. arXiv preprint arXiv:2308.05435 , year=

work page arXiv

[49] [49]

The Annals of Mathematical Statistics , pages=

On the distribution of the number of successes in independent trials , author=. The Annals of Mathematical Statistics , pages=. 1956 , publisher=

1956

[50] [50]

Theoretical Computer Science , volume=

Exploration--exploitation tradeoff using variance estimates in multi-armed bandits , author=. Theoretical Computer Science , volume=. 2009 , publisher=

2009

[51] [51]

Residual Bootstrap Exploration for Bandit Algorithms , journal =

Chi. Residual Bootstrap Exploration for Bandit Algorithms , journal =

[52] [52]

Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits , booktitle =

Branislav Kveton and Csaba Szepesv. Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits , booktitle =

[53] [53]

Perturbed-History Exploration in Stochastic Multi-Armed Bandits , booktitle =

Branislav Kveton and Csaba Szepesv. Perturbed-History Exploration in Stochastic Multi-Armed Bandits , booktitle =

[54] [54]

Bootstrapping Upper Confidence Bound , booktitle =

Botao Hao and Yasin Abbasi. Bootstrapping Upper Confidence Bound , booktitle =

[55] [55]

Sub-sampling for Multi-armed Bandits , booktitle =

Akram Baransi and Odalric. Sub-sampling for Multi-armed Bandits , booktitle =

[56] [56]

Sub-sampling for Efficient Non-Parametric Bandit Exploration , journal =

Dorian Baudry and Emilie Kaufmann and Odalric. Sub-sampling for Efficient Non-Parametric Bandit Exploration , journal =

[57] [57]

Maximum Average Randomly Sampled: A Scale Free and Non-parametric Algorithm for Stochastic Bandits , volume =

Moravej Khorasani, Masoud and Weyer, Erik , booktitle =. Maximum Average Randomly Sampled: A Scale Free and Non-parametric Algorithm for Stochastic Bandits , volume =

[58] [58]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Maximum average randomly sampled: a scale free and non-parametric algorithm for stochastic bandits , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

[59] [59]

On the absolute constants in the Berry-Esseen type inequalities for identically distributed summands

On the absolute constants in the Berry-Esseen type inequalities for identically distributed summands , author=. arXiv preprint arXiv:1111.6554 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[60] [60]

The Annals of Statistics , volume=

THE MULTI-ARMED BANDIT PROBLEM , author=. The Annals of Statistics , volume=. 2020 , publisher=

2020

[61] [61]

Advances in Neural Information Processing Systems , volume=

Reinforcement learning with a terminator , author=. Advances in Neural Information Processing Systems , volume=