Multi-Armed Bandits With Best-Action Queries
Pith reviewed 2026-05-12 02:53 UTC · model grok-4.3
The pith
Best-action queries reduce regret to min(T/k, sqrt(T-k)) in bandit feedback only for i.i.d. stochastic rewards
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the bandit-feedback model with k best-action queries, the minimax regret is tilde-O of min(T/k, sqrt(T-k)) for i.i.d. stochastic rewards, with a matching lower bound up to logarithmic factors. When rewards are stochastic but correlated among arms, or when chosen adversarially, every algorithm incurs regret at least Omega of sqrt(T-k).
What carries the argument
best-action queries that reveal the identity of the optimal arm in a round, combined with standard bandit feedback on the reward of the chosen arm
If this is right
- Algorithms exist that achieve regret scaling linearly with T/k for large k in the i.i.d. case, then transition to sqrt(T-k) scaling.
- In correlated or adversarial settings the queries do not reduce the leading sqrt(T) term even when k is linear in T.
- The characterization separates the i.i.d. stochastic regime from the others, so algorithm design must condition on the reward dependence structure.
- Matching upper and lower bounds up to logs give a tight picture of query value under each reward model.
Where Pith is reading between the lines
- Algorithm designers could add a preliminary test for independence to decide whether to rely on the queries for the stronger bound.
- The sqrt(T-k) term suggests treating queries as reducing the effective time horizon for exploration.
- The separation result may extend to other partial-feedback models where an oracle supplies limited side information.
Load-bearing premise
The improved upper bound holds only when rewards are stochastic and drawn independently and identically across arms; the lower bounds assume correlations that keep the best arm hidden from bandit observations alone.
What would settle it
Running any algorithm with k best-action queries on a family of correlated stochastic reward instances and observing regret o(sqrt(T-k)) would falsify the lower bound; conversely, failing to achieve O(min(T/k, sqrt(T-k))) regret on i.i.d. instances would falsify the positive result.
read the original abstract
We study \emph{multi-armed bandits} (MABs) augmented with \emph{best-action queries}, in which the learner may additionally query an oracle that reveals the best arm in the current round. This setting was recently characterized by Russo et al. [2024] in the \emph{full-feedback} model, where the learner observes the rewards of all arms after each round. They show that, in both \emph{stochastic} and \emph{adversarial} environments, $k$ best-action queries reduce the optimal $\widetilde{\mathcal{O}}(\sqrt{T})$ regret to $\widetilde{\mathcal{O}}(\min\{T/k,\sqrt{T}\})$. Whether this improvement extends to the more realistic \emph{bandit-feedback} model -- where the learner observes only the reward of the played arm -- was left as an open problem. We fully resolve this question. When rewards are stochastic but correlated among arms, we show that the full-feedback result does not carry over: any algorithm must incur regret at least $\Omega(\sqrt{T-k})$. This lower bound directly extends to adversarial environments. On the positive side, we show that $\widetilde{\mathcal{O}}(\min\{T/k,\sqrt{T-k}\})$ regret is still achievable when rewards are stochastic and i.i.d., and establish a matching lower bound, up to logarithmic factors. Together, these results provide a complete characterization of the benefits of \emph{best-action queries} in the \emph{bandit-feedback} model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies multi-armed bandits augmented with best-action queries (which reveal the current best arm) under bandit feedback. It resolves the open question from Russo et al. (2024) by proving that, for stochastic but correlated rewards, any algorithm incurs regret at least Ω(√(T-k)), with the same lower bound holding in adversarial environments. For i.i.d. stochastic rewards, it constructs an algorithm achieving Õ(min{T/k, √(T-k)}) regret and proves a matching lower bound up to logarithmic factors.
Significance. If the stated bounds hold, the work delivers a complete characterization of best-action queries in the bandit-feedback model, cleanly separating the i.i.d. case (where a near-full-feedback improvement is possible) from the correlated case (where it is not). The combination of algorithmic constructions and information-theoretic lower bounds, together with the explicit resolution of the prior open problem, constitutes a substantive contribution to the theory of partial monitoring and query-augmented bandits.
minor comments (2)
- [Abstract] Abstract: the phrase 'up to logarithmic factors' for the i.i.d. lower bound should be expanded in the main text (e.g., §4 or §5) to state the precise polylog(T) multiplier so that the tightness claim can be verified at a glance.
- The lower-bound construction for correlated rewards (which relies on an adversarial correlation structure that hides the best arm from bandit feedback) is central; a short intuitive paragraph explaining why bandit feedback cannot recover the full-feedback improvement would improve accessibility without altering the technical content.
Simulated Author's Rebuttal
We thank the referee for their positive evaluation of our work and for recommending minor revision. The referee's summary accurately captures our resolution of the open question from Russo et al. (2024) regarding best-action queries under bandit feedback. As no specific major comments were raised in the report, we have no points requiring rebuttal or manuscript changes at this stage.
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper derives new regret bounds for the bandit-feedback model with best-action queries by constructing explicit information-theoretic lower bounds (via adversarial correlation structures that hide the best arm) and providing matching algorithmic upper bounds for the i.i.d. stochastic case. These steps are independent of the cited Russo et al. [2024] result, which only posed the open problem; the new lower and upper bounds are derived directly from the model definitions and do not reduce to fitted parameters, self-definitions, or load-bearing self-citations. The separation between correlated and i.i.d. settings is a deliberate modeling choice with explicit constructions, making the overall characterization self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Standard definitions of regret in stochastic and adversarial multi-armed bandit settings
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We show that Õ(min{T/k,√(T-k)}) regret is still achievable when rewards are stochastic and i.i.d., and establish a matching lower bound
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits.Theoretical Computer Science, 410(19):1876–1902,
work page 1902
-
[2]
Bandit online linear optimization with hints and queries
Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, and Manish Purohit. Bandit online linear optimization with hints and queries. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Rese...
work page 2023
-
[3]
neurips.cc/paper_files/paper/2017/file/22b1f2e0983160db6f7bb9f62f4dbb39-Paper.pdf
URLhttps://proceedings. neurips.cc/paper_files/paper/2017/file/22b1f2e0983160db6f7bb9f62f4dbb39-Paper.pdf. Sébastien Gerchinovitz and Tor Lattimore. Refined lower bounds for adversarial bandits.Advances in Neural Information Processing Systems, 29,
work page 2017
-
[4]
URLhttps://proceedings.mlr.press/v37/kveton15.html
PMLR. URLhttps://proceedings.mlr.press/v37/kveton15.html. Silvio Lattanzi, Thomas Lavastida, Benjamin Moseley, and Sergei Vassilvitskii. Online scheduling via learned weights. In Shuchi Chawla, editor,Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, pages 1859–1877. SIAM,
work page 2020
-
[5]
URLhttps://doi.org/10.1137/1.9781611975994.114
doi: 10.1137/1.9781611975994.114. URLhttps://doi.org/10.1137/1.9781611975994.114. Tor Lattimore and Csaba Szepesvari. Bandit algorithms
-
[6]
ISSN 0004-5411. doi: 10.1145/3447579. URLhttps://doi.org/10.1145/3447579. Davide Maran, Francesco Bacchiocchi, Francesco Emanuele Stradi, Matteo Castiglioni, Nicola Gatti, and Marcello Restelli. Bandits with ranking feedback. In A. Globerson, L. Mackey, D. Bel- grave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Informa- tion Pr...
-
[7]
doi: 10.52202/079017-2562. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/ 936ce22b767cf1a1496083e4725d3b21-Paper-Conference.pdf. Michael Mitzenmacher and Sergei Vassilvitskii. Algorithms with predictions.Communications of the ACM, 65(7):33–35,
-
[8]
neurips.cc/paper_files/paper/2018/file/73a427badebe0e32caa2e1fc7530b7f3-Paper.pdf
URLhttps://proceedings. neurips.cc/paper_files/paper/2018/file/73a427badebe0e32caa2e1fc7530b7f3-Paper.pdf. Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors,Machine Learning, Proceedings of the Twenty-Fifth International Conference ...
work page 2018
-
[9]
URLhttps://doi.org/10.1145/1390156.1390255
doi: 10.1145/1390156.1390255. URLhttps://doi.org/10.1145/1390156.1390255. Matteo Russo, Andrea Celli, Riccardo Colini-Baldeschi, Federico Fusco, Daniel Haimovich, Dima Karamshuk, Stefano Leonardi, and Niek Tax. Online learning with sublinear best-action queries.Advances in Neural Information Processing Systems, 37:40407–40433,
-
[10]
Association for Computing Machinery. ISBN 9781605585161. doi: 10.1145/1553374.1553527. URLhttps://doi.org/10.1145/1553374.1553527. Yisong Yue and Thorsten Joachims. Beat the mean bandit. InProceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, page 241–248, Madison, WI, USA,
-
[11]
The K-armed dueling bandits problem , journal =
ISSN 0022-0000. doi: 10.1016/j.jcss.2011.12.028. URLhttps://doi.org/10.1016/j.jcss.2011.12.028. Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten Rijke. Relative upper confidence bound for the k- armedduelingbanditproblem. InEricP.XingandTonyJebara, editors,Proceedings of the 31st International Conference on Machine Learning, volume 32 ofProceedings...
-
[12]
URLhttps://proceedings.neurips.cc/ paper_files/paper/2015/file/9872ed9fc22fc182d371c3e9ed316094-Paper.pdf. 13 A Omitted Proofs of Section 3 Theorem 3.1.For every T≥ 1and k∈ [T− 1], and every randomized learning algorithmA that uses k best-action queries, there exists a MAB problem instance with two arms and stochastic correlated rewards such that the regr...
work page 2015
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.