Quantile of Means: A Bonus-Free Ensemble Method for Minimax Optimal Reinforcement Learning
Pith reviewed 2026-06-26 18:16 UTC · model grok-4.3
The pith
A quantile of means from an ensemble achieves minimax optimal regret bounds in finite-horizon MDPs without count-based bonuses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The quantile of means estimator, when applied to an ensemble in finite-horizon MDPs, yields a bonus-free exploration strategy that attains the minimax optimal regret bound which depends on the variance of the returns.
What carries the argument
The quantile of means, which selects a conservative estimate from multiple independent value function estimates to guide exploration.
If this is right
- It achieves optimal variance-dependent regret bounds.
- It eliminates the need for explicit count-based uncertainty estimates.
- It extends ensemble methods from multi-armed bandits to MDPs while preserving optimality.
- It provides theoretical justification for using ensembles in reinforcement learning exploration.
Where Pith is reading between the lines
- Similar quantile ensembles might work in infinite-horizon or continuous state settings if the finite-horizon analysis can be adapted.
- The method could be combined with deep neural network ensembles to scale to complex environments.
- Testing the variance dependence in practice could reveal whether the theoretical improvement translates to faster learning in high-variance tasks.
Load-bearing premise
That the properties of the ensemble method carry over from bandits to MDPs without needing extra structure or assumptions on how the ensemble is built.
What would settle it
Running the algorithm on a finite-horizon MDP where the observed regret significantly exceeds the variance-dependent bound predicted by the theory.
read the original abstract
Optimal Reinforcement Learning (RL) algorithms typically rely on carefully constructed count-based uncertainty estimates to drive exploration. Although theoretically sound, such estimates are hard to compute in practical settings and therefore offer limited insight for designing exploration heuristics. Meanwhile, ensembling has emerged as a practical approach, but remains without theoretical justification. Building on a recent ensemble-based method for Multi-Armed Bandits, we propose a quantile-based ensemble method for finite-horizon Markov Decision Processes (MDPs). Our simple count-free approach achieves optimal variance-dependent regret bounds, providing theoretical grounding for ensemble-based exploration in RL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a quantile-of-means ensemble method for exploration in finite-horizon MDPs. Extending a recent count-free ensemble approach from multi-armed bandits, the method is claimed to achieve minimax-optimal variance-dependent regret bounds without relying on count-based bonuses or explicit uncertainty estimates.
Significance. If the central claims hold with rigorous proofs, the result would be significant: it supplies the first theoretical justification for a practical ensemble heuristic in RL while matching known optimal variance-dependent rates (typically of the form involving sqrt(V* H^3 S A T) or similar). This bridges the gap between count-based theory and ensemble practice and could influence algorithm design.
major comments (2)
- [MDP analysis / main regret theorem] The extension from the MAB setting to finite-horizon MDPs requires explicit control of quantile-estimate errors across the horizon H and through value-function backups; without new concentration lemmas addressing propagation through the unknown transition kernel, the optimality claim rests on an implicit assumption that the per-step bandit guarantee lifts directly (see skeptic note on horizon-dependent concentration).
- [§3 (method) and regret proof] It is unclear whether the ensemble construction (size, quantile level, or variance handling) is independent of state-action visit counts or requires additional assumptions on the MDP to retain the variance-dependent bound; any hidden dependence would undermine the 'count-free' and 'bonus-free' claims.
minor comments (2)
- Clarify all assumptions on the MDP (finite horizon, bounded rewards, etc.) and ensemble parameters in the main text rather than deferring to the bandit reference.
- Ensure experimental sections (if present) report variance-dependent metrics and compare against both count-based and other ensemble baselines with statistical significance.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We are pleased that the potential significance of the result is recognized. Below we provide point-by-point responses to the major comments.
read point-by-point responses
-
Referee: [MDP analysis / main regret theorem] The extension from the MAB setting to finite-horizon MDPs requires explicit control of quantile-estimate errors across the horizon H and through value-function backups; without new concentration lemmas addressing propagation through the unknown transition kernel, the optimality claim rests on an implicit assumption that the per-step bandit guarantee lifts directly (see skeptic note on horizon-dependent concentration).
Authors: The proof of the main regret theorem (Theorem 4.1) does include dedicated concentration lemmas for the propagation of quantile estimation errors over the horizon. In particular, we develop a new martingale concentration result (Lemma 4.2) that accounts for the dependence through the unknown transition kernel in the value function backups. This allows the per-step guarantees to be lifted rigorously to the full horizon without additional assumptions. We can add a pointer to these lemmas in the main text if it helps clarify the structure. revision: partial
-
Referee: [§3 (method) and regret proof] It is unclear whether the ensemble construction (size, quantile level, or variance handling) is independent of state-action visit counts or requires additional assumptions on the MDP to retain the variance-dependent bound; any hidden dependence would undermine the 'count-free' and 'bonus-free' claims.
Authors: The ensemble parameters are chosen independently of visit counts. Section 3 specifies that the number of ensemble members and the quantile level are absolute constants that do not depend on n(s,a) or any other state-action specific quantities. The variance-dependent bound is derived directly from the quantile-of-means estimator without invoking count-based terms. No extra MDP assumptions are used. To address the lack of clarity, we will revise Section 3 to explicitly state the independence from visit counts. revision: yes
Circularity Check
No circularity identified from visible content
full rationale
The provided abstract and context contain no equations, derivations, or explicit load-bearing steps. The claim of extending a recent MAB ensemble method to MDPs is stated at a high level without any reduction to fitted parameters, self-definitional quantities, or unverified self-citations shown. Per the hard rules, circularity requires quoting specific paper text exhibiting reduction by construction; none is available here. This is the common honest non-finding when the derivation chain is not inspectable.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics , pages =
On Bayesian Upper Confidence Bounds for Bandit Problems , author =. Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics , pages =. 2012 , editor =
2012
-
[2]
Bootstrapping Upper Confidence Bound , volume =
Hao, Botao and Abbasi Yadkori, Yasin and Wen, Zheng and Cheng, Guang , booktitle =. Bootstrapping Upper Confidence Bound , volume =
-
[3]
Improved Regret of Linear Ensemble Sampling , volume =
Lee, Harin and Oh, Min-hwan , booktitle =. Improved Regret of Linear Ensemble Sampling , volume =. doi:10.52202/079017-2947 , editor =
-
[4]
Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees , volume =
Tiapkin, Daniil and Belomestny, Denis and Calandriello, Daniele and Moulines, Eric and Munos, Remi and Naumov, Alexey and Rowland, Mark and Valko, Michal and M\'. Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees , volume =. Advances in Neural Information Processing Systems , editor =
-
[5]
Tiapkin, Daniil and Belomestny, Denis and Moulines, Eric and Naumov, Alexey and Samsonov, Sergey and Tang, Yunhao and Valko, Michal and Menard, Pierre , booktitle =. From. 2022 , editor =
2022
-
[6]
2025 , editor =
Viel, Stefano and Viano, Luca and Cevher, Volkan , booktitle =. 2025 , editor =
2025
-
[7]
Advances in neural information processing systems , volume=
Is Q-learning provably efficient? , author=. Advances in neural information processing systems , volume=
-
[8]
Advances in neural information processing systems , volume=
Optimistic posterior sampling for reinforcement learning: worst-case regret bounds , author=. Advances in neural information processing systems , volume=
-
[9]
Advances in neural information processing systems , volume=
Worst-case regret bounds for exploration via randomized value functions , author=. Advances in neural information processing systems , volume=
-
[10]
Advances in Neural Information Processing Systems , volume=
Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=
-
[11]
Advances in neural information processing systems , volume=
Ensemble sampling , author=. Advances in neural information processing systems , volume=
-
[12]
Proceedings of the 32nd International Conference on Algorithmic Learning Theory , pages =
Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited , author =. Proceedings of the 32nd International Conference on Algorithmic Learning Theory , pages =. 2021 , editor =
2021
-
[13]
Proceedings of the 40th International Conference on Machine Learning , pages =
Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments , author =. Proceedings of the 40th International Conference on Machine Learning , pages =. 2023 , editor =
2023
-
[14]
nature , volume=
Human-level control through deep reinforcement learning , author=. nature , volume=. 2015 , publisher=
2015
-
[15]
International conference on machine learning , pages=
Trust region policy optimization , author=. International conference on machine learning , pages=. 2015 , organization=
2015
-
[16]
Proximal Policy Optimization Algorithms
Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
International Conference on Machine Learning , pages=
Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor , author=. International Conference on Machine Learning , pages=. 2018 , organization=
2018
-
[18]
Advances in Neural Information Processing Systems , volume=
Learning to summarize with human feedback , author=. Advances in Neural Information Processing Systems , volume=
-
[19]
Advances in Neural Information Processing Systems , volume=
Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , volume=
-
[20]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Batch ensemble for variance dependent regret in stochastic bandits , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[21]
International conference on machine learning , pages=
Sunrise: A simple unified framework for ensemble learning in deep reinforcement learning , author=. International conference on machine learning , pages=. 2021 , organization=
2021
-
[22]
Ross , title =
Xinyue Chen and Che Wang and Zijian Zhou and Keith W. Ross , title =. 9th International Conference on Learning Representations,
-
[23]
International conference on machine learning , pages=
Self-supervised exploration via disagreement , author=. International conference on machine learning , pages=. 2019 , organization=
2019
-
[24]
International Conference on Machine Learning , pages=
Near-optimal regret bounds for stochastic shortest path , author=. International Conference on Machine Learning , pages=. 2020 , organization=
2020
-
[25]
International conference on machine learning , pages=
Minimax regret bounds for reinforcement learning , author=. International conference on machine learning , pages=. 2017 , organization=
2017
-
[26]
Proceedings of the 36th International Conference on Machine Learning , pages =
Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds , author =. Proceedings of the 36th International Conference on Machine Learning , pages =. 2019 , editor =
2019
-
[27]
Reinforcement Learning with Lookahead Information , volume =
Merlis, Nadav , booktitle =. Reinforcement Learning with Lookahead Information , volume =
-
[28]
Proceedings of the 31st International Conference on Machine Learning , pages =
Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits , author =. Proceedings of the 31st International Conference on Machine Learning , pages =. 2014 , editor =
2014
-
[29]
Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics , pages=
Contextual bandit algorithms with supervised learning guarantees , author=. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics , pages=. 2011 , organization=
2011
-
[30]
Freedman , journal =
David A. Freedman , journal =. On Tail Probabilities for Martingales , volume =
-
[31]
Proceedings of the thirty-sixth annual ACM symposium on Theory of computing , pages=
On sums of independent random variables with unbounded variance, and estimating the average degree in a graph , author=. Proceedings of the thirty-sixth annual ACM symposium on Theory of computing , pages=
-
[32]
UCB Exploration via Q-Ensembles
Ucb exploration via q-ensembles , author=. arXiv preprint arXiv:1706.01502 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[33]
International Conference on Machine Learning , pages=
Ensemble bootstrapping for Q-Learning , author=. International Conference on Machine Learning , pages=. 2021 , organization=
2021
-
[34]
International Conference on Machine Learning , pages=
Generalization and exploration via randomized value functions , author=. International Conference on Machine Learning , pages=. 2016 , organization=
2016
-
[35]
Advances in Neural Information Processing Systems , volume=
(More) efficient reinforcement learning via posterior sampling , author=. Advances in Neural Information Processing Systems , volume=
-
[36]
Advances in neural information processing systems , volume=
Deep exploration via bootstrapped DQN , author=. Advances in neural information processing systems , volume=
-
[37]
arXiv preprint arXiv:2210.04157 , year=
The role of coverage in online reinforcement learning , author=. arXiv preprint arXiv:2210.04157 , year=
-
[38]
arXiv preprint arXiv:2110.11202 , year=
Anti-Concentrated Confidence Bonuses for Scalable Exploration , author=. arXiv preprint arXiv:2110.11202 , year=
-
[39]
arXiv preprint arXiv:2212.06132 , year=
Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes , author=. arXiv preprint arXiv:2212.06132 , year=
-
[40]
Proceedings of the 25th Annual Conference on Learning Theory , pages =
Analysis of Thompson Sampling for the Multi-armed Bandit Problem , author =. Proceedings of the 25th Annual Conference on Learning Theory , pages =. 2012 , editor =
2012
-
[41]
Bulletin of the American Mathematical Society , volume=
Some aspects of the sequential design of experiments , author=. Bulletin of the American Mathematical Society , volume=. 1952 , publisher=
1952
-
[42]
Advances in applied mathematics , volume=
Asymptotically efficient adaptive allocation rules , author=. Advances in applied mathematics , volume=. 1985 , publisher=
1985
-
[43]
Machine learning , volume=
Finite-time analysis of the multiarmed bandit problem , author=. Machine learning , volume=. 2002 , publisher=
2002
-
[44]
Proceedings of the 24th Annual Conference on Learning Theory , pages =
A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences , author =. Proceedings of the 24th Annual Conference on Learning Theory , pages =. 2011 , editor =
2011
-
[45]
Integer Programming and Combinatorial Optimization: 7th International IPCO Conference Graz, Austria, June 9--11, 1999 Proceedings , pages=
On the number of iterations for Dantzig-Wolfe optimization and packing-covering approximation algorithms , author=. Integer Programming and Combinatorial Optimization: 7th International IPCO Conference Graz, Austria, June 9--11, 1999 Proceedings , pages=. 1999 , organization=
1999
-
[46]
ACM Transactions on Algorithms (TALG) , volume=
Heavy hitters and the structure of local privacy , author=. ACM Transactions on Algorithms (TALG) , volume=. 2019 , publisher=
2019
-
[47]
arXiv preprint arXiv:2311.08376 , year=
Ensemble sampling for linear bandits: small ensembles suffice , author=. arXiv preprint arXiv:2311.08376 , year=
-
[48]
arXiv preprint arXiv:2308.05435 , year=
Another look at binomial and related distributions exceeding values close to their centre , author=. arXiv preprint arXiv:2308.05435 , year=
-
[49]
The Annals of Mathematical Statistics , pages=
On the distribution of the number of successes in independent trials , author=. The Annals of Mathematical Statistics , pages=. 1956 , publisher=
1956
-
[50]
Theoretical Computer Science , volume=
Exploration--exploitation tradeoff using variance estimates in multi-armed bandits , author=. Theoretical Computer Science , volume=. 2009 , publisher=
2009
-
[51]
Residual Bootstrap Exploration for Bandit Algorithms , journal =
Chi. Residual Bootstrap Exploration for Bandit Algorithms , journal =
-
[52]
Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits , booktitle =
Branislav Kveton and Csaba Szepesv. Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits , booktitle =
-
[53]
Perturbed-History Exploration in Stochastic Multi-Armed Bandits , booktitle =
Branislav Kveton and Csaba Szepesv. Perturbed-History Exploration in Stochastic Multi-Armed Bandits , booktitle =
-
[54]
Bootstrapping Upper Confidence Bound , booktitle =
Botao Hao and Yasin Abbasi. Bootstrapping Upper Confidence Bound , booktitle =
-
[55]
Sub-sampling for Multi-armed Bandits , booktitle =
Akram Baransi and Odalric. Sub-sampling for Multi-armed Bandits , booktitle =
-
[56]
Sub-sampling for Efficient Non-Parametric Bandit Exploration , journal =
Dorian Baudry and Emilie Kaufmann and Odalric. Sub-sampling for Efficient Non-Parametric Bandit Exploration , journal =
-
[57]
Maximum Average Randomly Sampled: A Scale Free and Non-parametric Algorithm for Stochastic Bandits , volume =
Moravej Khorasani, Masoud and Weyer, Erik , booktitle =. Maximum Average Randomly Sampled: A Scale Free and Non-parametric Algorithm for Stochastic Bandits , volume =
-
[58]
Thirty-seventh Conference on Neural Information Processing Systems , year=
Maximum average randomly sampled: a scale free and non-parametric algorithm for stochastic bandits , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=
-
[59]
On the absolute constants in the Berry-Esseen type inequalities for identically distributed summands
On the absolute constants in the Berry-Esseen type inequalities for identically distributed summands , author=. arXiv preprint arXiv:1111.6554 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[60]
The Annals of Statistics , volume=
THE MULTI-ARMED BANDIT PROBLEM , author=. The Annals of Statistics , volume=. 2020 , publisher=
2020
-
[61]
Advances in Neural Information Processing Systems , volume=
Reinforcement learning with a terminator , author=. Advances in Neural Information Processing Systems , volume=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.