pith. sign in

arxiv: 2605.08287 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Multi-Armed Bandits With Best-Action Queries

Pith reviewed 2026-05-12 02:53 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords multi-armed banditsbest-action queriesbandit feedbackregret boundsstochastic rewardsadversarial banditsi.i.d. rewards
0
0 comments X

The pith

Best-action queries reduce regret to min(T/k, sqrt(T-k)) in bandit feedback only for i.i.d. stochastic rewards

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper resolves an open question on whether best-action queries, which reveal the optimal arm each round, improve performance in the realistic bandit-feedback model where only the played arm's reward is seen. It establishes that the improvement from full feedback does not transfer in general. For rewards drawn independently and identically across arms, the regret becomes the minimum of T divided by k and the square root of T minus k, with matching lower bounds up to logs. For correlated stochastic rewards or adversarial ones, regret stays at least on the order of the square root of T minus k, so the queries add little value.

Core claim

In the bandit-feedback model with k best-action queries, the minimax regret is tilde-O of min(T/k, sqrt(T-k)) for i.i.d. stochastic rewards, with a matching lower bound up to logarithmic factors. When rewards are stochastic but correlated among arms, or when chosen adversarially, every algorithm incurs regret at least Omega of sqrt(T-k).

What carries the argument

best-action queries that reveal the identity of the optimal arm in a round, combined with standard bandit feedback on the reward of the chosen arm

If this is right

  • Algorithms exist that achieve regret scaling linearly with T/k for large k in the i.i.d. case, then transition to sqrt(T-k) scaling.
  • In correlated or adversarial settings the queries do not reduce the leading sqrt(T) term even when k is linear in T.
  • The characterization separates the i.i.d. stochastic regime from the others, so algorithm design must condition on the reward dependence structure.
  • Matching upper and lower bounds up to logs give a tight picture of query value under each reward model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Algorithm designers could add a preliminary test for independence to decide whether to rely on the queries for the stronger bound.
  • The sqrt(T-k) term suggests treating queries as reducing the effective time horizon for exploration.
  • The separation result may extend to other partial-feedback models where an oracle supplies limited side information.

Load-bearing premise

The improved upper bound holds only when rewards are stochastic and drawn independently and identically across arms; the lower bounds assume correlations that keep the best arm hidden from bandit observations alone.

What would settle it

Running any algorithm with k best-action queries on a family of correlated stochastic reward instances and observing regret o(sqrt(T-k)) would falsify the lower bound; conversely, failing to achieve O(min(T/k, sqrt(T-k))) regret on i.i.d. instances would falsify the positive result.

read the original abstract

We study \emph{multi-armed bandits} (MABs) augmented with \emph{best-action queries}, in which the learner may additionally query an oracle that reveals the best arm in the current round. This setting was recently characterized by Russo et al. [2024] in the \emph{full-feedback} model, where the learner observes the rewards of all arms after each round. They show that, in both \emph{stochastic} and \emph{adversarial} environments, $k$ best-action queries reduce the optimal $\widetilde{\mathcal{O}}(\sqrt{T})$ regret to $\widetilde{\mathcal{O}}(\min\{T/k,\sqrt{T}\})$. Whether this improvement extends to the more realistic \emph{bandit-feedback} model -- where the learner observes only the reward of the played arm -- was left as an open problem. We fully resolve this question. When rewards are stochastic but correlated among arms, we show that the full-feedback result does not carry over: any algorithm must incur regret at least $\Omega(\sqrt{T-k})$. This lower bound directly extends to adversarial environments. On the positive side, we show that $\widetilde{\mathcal{O}}(\min\{T/k,\sqrt{T-k}\})$ regret is still achievable when rewards are stochastic and i.i.d., and establish a matching lower bound, up to logarithmic factors. Together, these results provide a complete characterization of the benefits of \emph{best-action queries} in the \emph{bandit-feedback} model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript studies multi-armed bandits augmented with best-action queries (which reveal the current best arm) under bandit feedback. It resolves the open question from Russo et al. (2024) by proving that, for stochastic but correlated rewards, any algorithm incurs regret at least Ω(√(T-k)), with the same lower bound holding in adversarial environments. For i.i.d. stochastic rewards, it constructs an algorithm achieving Õ(min{T/k, √(T-k)}) regret and proves a matching lower bound up to logarithmic factors.

Significance. If the stated bounds hold, the work delivers a complete characterization of best-action queries in the bandit-feedback model, cleanly separating the i.i.d. case (where a near-full-feedback improvement is possible) from the correlated case (where it is not). The combination of algorithmic constructions and information-theoretic lower bounds, together with the explicit resolution of the prior open problem, constitutes a substantive contribution to the theory of partial monitoring and query-augmented bandits.

minor comments (2)
  1. [Abstract] Abstract: the phrase 'up to logarithmic factors' for the i.i.d. lower bound should be expanded in the main text (e.g., §4 or §5) to state the precise polylog(T) multiplier so that the tightness claim can be verified at a glance.
  2. The lower-bound construction for correlated rewards (which relies on an adversarial correlation structure that hides the best arm from bandit feedback) is central; a short intuitive paragraph explaining why bandit feedback cannot recover the full-feedback improvement would improve accessibility without altering the technical content.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of our work and for recommending minor revision. The referee's summary accurately captures our resolution of the open question from Russo et al. (2024) regarding best-action queries under bandit feedback. As no specific major comments were raised in the report, we have no points requiring rebuttal or manuscript changes at this stage.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper derives new regret bounds for the bandit-feedback model with best-action queries by constructing explicit information-theoretic lower bounds (via adversarial correlation structures that hide the best arm) and providing matching algorithmic upper bounds for the i.i.d. stochastic case. These steps are independent of the cited Russo et al. [2024] result, which only posed the open problem; the new lower and upper bounds are derived directly from the model definitions and do not reduce to fitted parameters, self-definitions, or load-bearing self-citations. The separation between correlated and i.i.d. settings is a deliberate modeling choice with explicit constructions, making the overall characterization self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Relies on standard definitions of regret in stochastic and adversarial multi-armed bandits; no free parameters, invented entities, or non-standard axioms are introduced in the abstract.

axioms (1)
  • standard math Standard definitions of regret in stochastic and adversarial multi-armed bandit settings
    Invoked throughout to define the performance measures and environment classes.

pith-pipeline@v0.9.0 · 5580 in / 1310 out tokens · 34417 ms · 2026-05-12T02:53:03.145390+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

  1. [1]

    Exploration–exploitation tradeoff using variance estimates in multi-armed bandits.Theoretical Computer Science, 410(19):1876–1902,

    Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits.Theoretical Computer Science, 410(19):1876–1902,

  2. [2]

    Bandit online linear optimization with hints and queries

    Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, and Manish Purohit. Bandit online linear optimization with hints and queries. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Rese...

  3. [3]

    neurips.cc/paper_files/paper/2017/file/22b1f2e0983160db6f7bb9f62f4dbb39-Paper.pdf

    URLhttps://proceedings. neurips.cc/paper_files/paper/2017/file/22b1f2e0983160db6f7bb9f62f4dbb39-Paper.pdf. Sébastien Gerchinovitz and Tor Lattimore. Refined lower bounds for adversarial bandits.Advances in Neural Information Processing Systems, 29,

  4. [4]

    URLhttps://proceedings.mlr.press/v37/kveton15.html

    PMLR. URLhttps://proceedings.mlr.press/v37/kveton15.html. Silvio Lattanzi, Thomas Lavastida, Benjamin Moseley, and Sergei Vassilvitskii. Online scheduling via learned weights. In Shuchi Chawla, editor,Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, pages 1859–1877. SIAM,

  5. [5]

    URLhttps://doi.org/10.1137/1.9781611975994.114

    doi: 10.1137/1.9781611975994.114. URLhttps://doi.org/10.1137/1.9781611975994.114. Tor Lattimore and Csaba Szepesvari. Bandit algorithms

  6. [6]

    doi: 10.1145/3447579

    ISSN 0004-5411. doi: 10.1145/3447579. URLhttps://doi.org/10.1145/3447579. Davide Maran, Francesco Bacchiocchi, Francesco Emanuele Stradi, Matteo Castiglioni, Nicola Gatti, and Marcello Restelli. Bandits with ranking feedback. In A. Globerson, L. Mackey, D. Bel- grave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Informa- tion Pr...

  7. [7]

    URL https://proceedings.neurips.cc/paper_files/paper/2024/file/ 936ce22b767cf1a1496083e4725d3b21-Paper-Conference.pdf

    doi: 10.52202/079017-2562. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/ 936ce22b767cf1a1496083e4725d3b21-Paper-Conference.pdf. Michael Mitzenmacher and Sergei Vassilvitskii. Algorithms with predictions.Communications of the ACM, 65(7):33–35,

  8. [8]

    neurips.cc/paper_files/paper/2018/file/73a427badebe0e32caa2e1fc7530b7f3-Paper.pdf

    URLhttps://proceedings. neurips.cc/paper_files/paper/2018/file/73a427badebe0e32caa2e1fc7530b7f3-Paper.pdf. Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors,Machine Learning, Proceedings of the Twenty-Fifth International Conference ...

  9. [9]

    URLhttps://doi.org/10.1145/1390156.1390255

    doi: 10.1145/1390156.1390255. URLhttps://doi.org/10.1145/1390156.1390255. Matteo Russo, Andrea Celli, Riccardo Colini-Baldeschi, Federico Fusco, Daniel Haimovich, Dima Karamshuk, Stefano Leonardi, and Niek Tax. Online learning with sublinear best-action queries.Advances in Neural Information Processing Systems, 37:40407–40433,

  10. [10]

    ISBN 9781605585161

    Association for Computing Machinery. ISBN 9781605585161. doi: 10.1145/1553374.1553527. URLhttps://doi.org/10.1145/1553374.1553527. Yisong Yue and Thorsten Joachims. Beat the mean bandit. InProceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, page 241–248, Madison, WI, USA,

  11. [11]

    The K-armed dueling bandits problem , journal =

    ISSN 0022-0000. doi: 10.1016/j.jcss.2011.12.028. URLhttps://doi.org/10.1016/j.jcss.2011.12.028. Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten Rijke. Relative upper confidence bound for the k- armedduelingbanditproblem. InEricP.XingandTonyJebara, editors,Proceedings of the 31st International Conference on Machine Learning, volume 32 ofProceedings...

  12. [12]

    URLhttps://proceedings.neurips.cc/ paper_files/paper/2015/file/9872ed9fc22fc182d371c3e9ed316094-Paper.pdf. 13 A Omitted Proofs of Section 3 Theorem 3.1.For every T≥ 1and k∈ [T− 1], and every randomized learning algorithmA that uses k best-action queries, there exists a MAB problem instance with two arms and stochastic correlated rewards such that the regr...