Multi-Armed Bandits With Best-Action Queries

Alberto Marchesi; Francesco Bacchiocchi; Francesco Emanuele Stradi; Matteo Castiglioni

arxiv: 2605.08287 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Multi-Armed Bandits With Best-Action Queries

Francesco Bacchiocchi , Matteo Castiglioni , Alberto Marchesi , Francesco Emanuele Stradi This is my paper

Pith reviewed 2026-05-12 02:53 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords multi-armed banditsbest-action queriesbandit feedbackregret boundsstochastic rewardsadversarial banditsi.i.d. rewards

0 comments

The pith

Best-action queries reduce regret to min(T/k, sqrt(T-k)) in bandit feedback only for i.i.d. stochastic rewards

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper resolves an open question on whether best-action queries, which reveal the optimal arm each round, improve performance in the realistic bandit-feedback model where only the played arm's reward is seen. It establishes that the improvement from full feedback does not transfer in general. For rewards drawn independently and identically across arms, the regret becomes the minimum of T divided by k and the square root of T minus k, with matching lower bounds up to logs. For correlated stochastic rewards or adversarial ones, regret stays at least on the order of the square root of T minus k, so the queries add little value.

Core claim

In the bandit-feedback model with k best-action queries, the minimax regret is tilde-O of min(T/k, sqrt(T-k)) for i.i.d. stochastic rewards, with a matching lower bound up to logarithmic factors. When rewards are stochastic but correlated among arms, or when chosen adversarially, every algorithm incurs regret at least Omega of sqrt(T-k).

What carries the argument

best-action queries that reveal the identity of the optimal arm in a round, combined with standard bandit feedback on the reward of the chosen arm

If this is right

Algorithms exist that achieve regret scaling linearly with T/k for large k in the i.i.d. case, then transition to sqrt(T-k) scaling.
In correlated or adversarial settings the queries do not reduce the leading sqrt(T) term even when k is linear in T.
The characterization separates the i.i.d. stochastic regime from the others, so algorithm design must condition on the reward dependence structure.
Matching upper and lower bounds up to logs give a tight picture of query value under each reward model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Algorithm designers could add a preliminary test for independence to decide whether to rely on the queries for the stronger bound.
The sqrt(T-k) term suggests treating queries as reducing the effective time horizon for exploration.
The separation result may extend to other partial-feedback models where an oracle supplies limited side information.

Load-bearing premise

The improved upper bound holds only when rewards are stochastic and drawn independently and identically across arms; the lower bounds assume correlations that keep the best arm hidden from bandit observations alone.

What would settle it

Running any algorithm with k best-action queries on a family of correlated stochastic reward instances and observing regret o(sqrt(T-k)) would falsify the lower bound; conversely, failing to achieve O(min(T/k, sqrt(T-k))) regret on i.i.d. instances would falsify the positive result.

read the original abstract

We study \emph{multi-armed bandits} (MABs) augmented with \emph{best-action queries}, in which the learner may additionally query an oracle that reveals the best arm in the current round. This setting was recently characterized by Russo et al. [2024] in the \emph{full-feedback} model, where the learner observes the rewards of all arms after each round. They show that, in both \emph{stochastic} and \emph{adversarial} environments, $k$ best-action queries reduce the optimal $\widetilde{\mathcal{O}}(\sqrt{T})$ regret to $\widetilde{\mathcal{O}}(\min\{T/k,\sqrt{T}\})$. Whether this improvement extends to the more realistic \emph{bandit-feedback} model -- where the learner observes only the reward of the played arm -- was left as an open problem. We fully resolve this question. When rewards are stochastic but correlated among arms, we show that the full-feedback result does not carry over: any algorithm must incur regret at least $\Omega(\sqrt{T-k})$. This lower bound directly extends to adversarial environments. On the positive side, we show that $\widetilde{\mathcal{O}}(\min\{T/k,\sqrt{T-k}\})$ regret is still achievable when rewards are stochastic and i.i.d., and establish a matching lower bound, up to logarithmic factors. Together, these results provide a complete characterization of the benefits of \emph{best-action queries} in the \emph{bandit-feedback} model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Best-action queries help only for i.i.d. stochastic rewards under bandit feedback, with a clean separation from the correlated case that resolves the open problem.

read the letter

This paper answers the open question left by Russo et al. on best-action queries under bandit feedback: they help when rewards are i.i.d. stochastic but not when they are correlated. The new contribution is the complete characterization. In the i.i.d. case they achieve and match Õ(min{T/k, √(T-k)}) regret. In the correlated stochastic case they show a lower bound of Ω(√(T-k}), which also applies to adversarial settings. This separation is the main advance over prior full-feedback results. The paper derives these bounds specifically for the bandit-feedback model, filling the gap. They do this with information-theoretic lower bounds and algorithmic upper bounds. The approach looks direct and avoids self-referential issues. The claims are internally consistent. The soft spot is in the correlated lower bound, which uses a specific adversarial correlation to prevent bandit feedback from giving best-arm info. This is stated clearly but means the result is for worst-case correlations rather than arbitrary ones. The i.i.d. result has the expected log factors in the matching, which is standard but leaves a small gap. This work is for online learning researchers focused on bandits and the value of oracles. Readers who follow query-augmented models will find the bounds and separation useful. It is a focused contribution that resolves a specific open problem. It deserves a serious referee because the results are new and the analysis appears careful. I recommend putting it through peer review.

Referee Report

0 major / 2 minor

Summary. The manuscript studies multi-armed bandits augmented with best-action queries (which reveal the current best arm) under bandit feedback. It resolves the open question from Russo et al. (2024) by proving that, for stochastic but correlated rewards, any algorithm incurs regret at least Ω(√(T-k)), with the same lower bound holding in adversarial environments. For i.i.d. stochastic rewards, it constructs an algorithm achieving Õ(min{T/k, √(T-k)}) regret and proves a matching lower bound up to logarithmic factors.

Significance. If the stated bounds hold, the work delivers a complete characterization of best-action queries in the bandit-feedback model, cleanly separating the i.i.d. case (where a near-full-feedback improvement is possible) from the correlated case (where it is not). The combination of algorithmic constructions and information-theoretic lower bounds, together with the explicit resolution of the prior open problem, constitutes a substantive contribution to the theory of partial monitoring and query-augmented bandits.

minor comments (2)

[Abstract] Abstract: the phrase 'up to logarithmic factors' for the i.i.d. lower bound should be expanded in the main text (e.g., §4 or §5) to state the precise polylog(T) multiplier so that the tightness claim can be verified at a glance.
The lower-bound construction for correlated rewards (which relies on an adversarial correlation structure that hides the best arm from bandit feedback) is central; a short intuitive paragraph explaining why bandit feedback cannot recover the full-feedback improvement would improve accessibility without altering the technical content.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of our work and for recommending minor revision. The referee's summary accurately captures our resolution of the open question from Russo et al. (2024) regarding best-action queries under bandit feedback. As no specific major comments were raised in the report, we have no points requiring rebuttal or manuscript changes at this stage.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper derives new regret bounds for the bandit-feedback model with best-action queries by constructing explicit information-theoretic lower bounds (via adversarial correlation structures that hide the best arm) and providing matching algorithmic upper bounds for the i.i.d. stochastic case. These steps are independent of the cited Russo et al. [2024] result, which only posed the open problem; the new lower and upper bounds are derived directly from the model definitions and do not reduce to fitted parameters, self-definitions, or load-bearing self-citations. The separation between correlated and i.i.d. settings is a deliberate modeling choice with explicit constructions, making the overall characterization self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Relies on standard definitions of regret in stochastic and adversarial multi-armed bandits; no free parameters, invented entities, or non-standard axioms are introduced in the abstract.

axioms (1)

standard math Standard definitions of regret in stochastic and adversarial multi-armed bandit settings
Invoked throughout to define the performance measures and environment classes.

pith-pipeline@v0.9.0 · 5580 in / 1310 out tokens · 34417 ms · 2026-05-12T02:53:03.145390+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We show that Õ(min{T/k,√(T-k)}) regret is still achievable when rewards are stochastic and i.i.d., and establish a matching lower bound

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1]

Exploration–exploitation tradeoff using variance estimates in multi-armed bandits.Theoretical Computer Science, 410(19):1876–1902,

Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits.Theoretical Computer Science, 410(19):1876–1902,

work page 1902
[2]

Bandit online linear optimization with hints and queries

Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, and Manish Purohit. Bandit online linear optimization with hints and queries. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Rese...

work page 2023
[3]

neurips.cc/paper_files/paper/2017/file/22b1f2e0983160db6f7bb9f62f4dbb39-Paper.pdf

URLhttps://proceedings. neurips.cc/paper_files/paper/2017/file/22b1f2e0983160db6f7bb9f62f4dbb39-Paper.pdf. Sébastien Gerchinovitz and Tor Lattimore. Refined lower bounds for adversarial bandits.Advances in Neural Information Processing Systems, 29,

work page 2017
[4]

URLhttps://proceedings.mlr.press/v37/kveton15.html

PMLR. URLhttps://proceedings.mlr.press/v37/kveton15.html. Silvio Lattanzi, Thomas Lavastida, Benjamin Moseley, and Sergei Vassilvitskii. Online scheduling via learned weights. In Shuchi Chawla, editor,Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, pages 1859–1877. SIAM,

work page 2020
[5]

URLhttps://doi.org/10.1137/1.9781611975994.114

doi: 10.1137/1.9781611975994.114. URLhttps://doi.org/10.1137/1.9781611975994.114. Tor Lattimore and Csaba Szepesvari. Bandit algorithms

work page doi:10.1137/1.9781611975994.114
[6]

doi: 10.1145/3447579

ISSN 0004-5411. doi: 10.1145/3447579. URLhttps://doi.org/10.1145/3447579. Davide Maran, Francesco Bacchiocchi, Francesco Emanuele Stradi, Matteo Castiglioni, Nicola Gatti, and Marcello Restelli. Bandits with ranking feedback. In A. Globerson, L. Mackey, D. Bel- grave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Informa- tion Pr...

work page doi:10.1145/3447579
[7]

URL https://proceedings.neurips.cc/paper_files/paper/2024/file/ 936ce22b767cf1a1496083e4725d3b21-Paper-Conference.pdf

doi: 10.52202/079017-2562. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/ 936ce22b767cf1a1496083e4725d3b21-Paper-Conference.pdf. Michael Mitzenmacher and Sergei Vassilvitskii. Algorithms with predictions.Communications of the ACM, 65(7):33–35,

work page doi:10.52202/079017-2562 2024
[8]

neurips.cc/paper_files/paper/2018/file/73a427badebe0e32caa2e1fc7530b7f3-Paper.pdf

URLhttps://proceedings. neurips.cc/paper_files/paper/2018/file/73a427badebe0e32caa2e1fc7530b7f3-Paper.pdf. Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors,Machine Learning, Proceedings of the Twenty-Fifth International Conference ...

work page 2018
[9]

URLhttps://doi.org/10.1145/1390156.1390255

doi: 10.1145/1390156.1390255. URLhttps://doi.org/10.1145/1390156.1390255. Matteo Russo, Andrea Celli, Riccardo Colini-Baldeschi, Federico Fusco, Daniel Haimovich, Dima Karamshuk, Stefano Leonardi, and Niek Tax. Online learning with sublinear best-action queries.Advances in Neural Information Processing Systems, 37:40407–40433,

work page doi:10.1145/1390156.1390255
[10]

ISBN 9781605585161

Association for Computing Machinery. ISBN 9781605585161. doi: 10.1145/1553374.1553527. URLhttps://doi.org/10.1145/1553374.1553527. Yisong Yue and Thorsten Joachims. Beat the mean bandit. InProceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, page 241–248, Madison, WI, USA,

work page doi:10.1145/1553374.1553527
[11]

The K-armed dueling bandits problem , journal =

ISSN 0022-0000. doi: 10.1016/j.jcss.2011.12.028. URLhttps://doi.org/10.1016/j.jcss.2011.12.028. Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten Rijke. Relative upper confidence bound for the k- armedduelingbanditproblem. InEricP.XingandTonyJebara, editors,Proceedings of the 31st International Conference on Machine Learning, volume 32 ofProceedings...

work page doi:10.1016/j.jcss.2011.12.028 2011
[12]

URLhttps://proceedings.neurips.cc/ paper_files/paper/2015/file/9872ed9fc22fc182d371c3e9ed316094-Paper.pdf. 13 A Omitted Proofs of Section 3 Theorem 3.1.For every T≥ 1and k∈ [T− 1], and every randomized learning algorithmA that uses k best-action queries, there exists a MAB problem instance with two arms and stochastic correlated rewards such that the regr...

work page 2015

[1] [1]

Exploration–exploitation tradeoff using variance estimates in multi-armed bandits.Theoretical Computer Science, 410(19):1876–1902,

Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits.Theoretical Computer Science, 410(19):1876–1902,

work page 1902

[2] [2]

Bandit online linear optimization with hints and queries

Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, and Manish Purohit. Bandit online linear optimization with hints and queries. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Rese...

work page 2023

[3] [3]

neurips.cc/paper_files/paper/2017/file/22b1f2e0983160db6f7bb9f62f4dbb39-Paper.pdf

URLhttps://proceedings. neurips.cc/paper_files/paper/2017/file/22b1f2e0983160db6f7bb9f62f4dbb39-Paper.pdf. Sébastien Gerchinovitz and Tor Lattimore. Refined lower bounds for adversarial bandits.Advances in Neural Information Processing Systems, 29,

work page 2017

[4] [4]

URLhttps://proceedings.mlr.press/v37/kveton15.html

PMLR. URLhttps://proceedings.mlr.press/v37/kveton15.html. Silvio Lattanzi, Thomas Lavastida, Benjamin Moseley, and Sergei Vassilvitskii. Online scheduling via learned weights. In Shuchi Chawla, editor,Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, pages 1859–1877. SIAM,

work page 2020

[5] [5]

URLhttps://doi.org/10.1137/1.9781611975994.114

doi: 10.1137/1.9781611975994.114. URLhttps://doi.org/10.1137/1.9781611975994.114. Tor Lattimore and Csaba Szepesvari. Bandit algorithms

work page doi:10.1137/1.9781611975994.114

[6] [6]

doi: 10.1145/3447579

ISSN 0004-5411. doi: 10.1145/3447579. URLhttps://doi.org/10.1145/3447579. Davide Maran, Francesco Bacchiocchi, Francesco Emanuele Stradi, Matteo Castiglioni, Nicola Gatti, and Marcello Restelli. Bandits with ranking feedback. In A. Globerson, L. Mackey, D. Bel- grave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Informa- tion Pr...

work page doi:10.1145/3447579

[7] [7]

URL https://proceedings.neurips.cc/paper_files/paper/2024/file/ 936ce22b767cf1a1496083e4725d3b21-Paper-Conference.pdf

doi: 10.52202/079017-2562. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/ 936ce22b767cf1a1496083e4725d3b21-Paper-Conference.pdf. Michael Mitzenmacher and Sergei Vassilvitskii. Algorithms with predictions.Communications of the ACM, 65(7):33–35,

work page doi:10.52202/079017-2562 2024

[8] [8]

neurips.cc/paper_files/paper/2018/file/73a427badebe0e32caa2e1fc7530b7f3-Paper.pdf

URLhttps://proceedings. neurips.cc/paper_files/paper/2018/file/73a427badebe0e32caa2e1fc7530b7f3-Paper.pdf. Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors,Machine Learning, Proceedings of the Twenty-Fifth International Conference ...

work page 2018

[9] [9]

URLhttps://doi.org/10.1145/1390156.1390255

doi: 10.1145/1390156.1390255. URLhttps://doi.org/10.1145/1390156.1390255. Matteo Russo, Andrea Celli, Riccardo Colini-Baldeschi, Federico Fusco, Daniel Haimovich, Dima Karamshuk, Stefano Leonardi, and Niek Tax. Online learning with sublinear best-action queries.Advances in Neural Information Processing Systems, 37:40407–40433,

work page doi:10.1145/1390156.1390255

[10] [10]

ISBN 9781605585161

Association for Computing Machinery. ISBN 9781605585161. doi: 10.1145/1553374.1553527. URLhttps://doi.org/10.1145/1553374.1553527. Yisong Yue and Thorsten Joachims. Beat the mean bandit. InProceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, page 241–248, Madison, WI, USA,

work page doi:10.1145/1553374.1553527

[11] [11]

The K-armed dueling bandits problem , journal =

ISSN 0022-0000. doi: 10.1016/j.jcss.2011.12.028. URLhttps://doi.org/10.1016/j.jcss.2011.12.028. Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten Rijke. Relative upper confidence bound for the k- armedduelingbanditproblem. InEricP.XingandTonyJebara, editors,Proceedings of the 31st International Conference on Machine Learning, volume 32 ofProceedings...

work page doi:10.1016/j.jcss.2011.12.028 2011

[12] [12]

URLhttps://proceedings.neurips.cc/ paper_files/paper/2015/file/9872ed9fc22fc182d371c3e9ed316094-Paper.pdf. 13 A Omitted Proofs of Section 3 Theorem 3.1.For every T≥ 1and k∈ [T− 1], and every randomized learning algorithmA that uses k best-action queries, there exists a MAB problem instance with two arms and stochastic correlated rewards such that the regr...

work page 2015