pith. sign in

arxiv: 2605.15663 · v1 · pith:CRU7G3E5new · submitted 2026-05-15 · 💻 cs.LG

On the Power of Adaptivity for varepsilon-Best Arm Identification in Linear Bandits

Pith reviewed 2026-05-20 20:49 UTC · model grok-4.3

classification 💻 cs.LG
keywords linear banditsbest arm identificationadaptivitysample complexityepsilon-best armaction setnorm estimationGaussian width
0
0 comments X

The pith

There exists an action set in linear bandits where adaptivity gives a polynomial improvement over non-adaptive methods for ε-best arm identification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies the minimax sample complexity of identifying an ε-best arm in a linear bandit with unknown reward vector θ and compact action set X spanning R^d. It first derives a non-adaptive fixed-design algorithm with sample complexity O(d log(1/δ)/ε² + w(X)²/ε²) and proves a matching lower bound that holds for every non-adaptive method. For several common action sets including the hypercube, ℓ2 ball, m-sets, and multi-task bandits, the authors show that adaptivity improves the rate by only logarithmic factors. The main result constructs a specific action set X for which adaptivity yields a polynomial-factor improvement, using an efficient adaptive subroutine that estimates the Euclidean norm of θ to additive accuracy ε with O(d log(1/δ)/ε²) samples.

Core claim

The central claim is that there exists a structured action set X for which any non-adaptive algorithm requires polynomially more samples than an adaptive algorithm to output an arm whose expected reward is within ε of the optimal arm with probability 1-δ. The separation is obtained by reducing an ℓ2-norm estimation problem to best-arm identification on this X; an adaptive procedure solves the norm estimation task using O(d log(1/δ)/ε²) samples from the unit ball and thereby achieves the improved rate on the constructed instance.

What carries the argument

An adaptive ℓ2-norm estimation algorithm that draws O(d log(1/δ)/ε²) samples from the unit ℓ2 ball and returns an estimate r̂ satisfying |r̂ - ||θ||₂| ≤ ε with high probability.

If this is right

  • Non-adaptive fixed-design methods are lower-bounded by Ω(d log(1/δ)/ε² + w(X)²/ε²) for every action set X.
  • For the hypercube, ℓ2 ball, m-sets, and multi-task bandits, adaptivity improves the non-adaptive rate by at most logarithmic factors.
  • On the specially constructed action set, the adaptive minimax sample complexity is polynomially smaller than the non-adaptive minimax sample complexity.
  • The adaptive norm estimator itself succeeds with O(d log(1/δ)/ε²) samples drawn from the unit ball.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation result indicates that geometric structure in the action set can make adaptivity essential for near-optimal sample efficiency in linear identification problems.
  • Similar polynomial gaps may exist in related settings such as linear regression with structured design matrices or contextual bandit identification.
  • Future constructions could aim to replace the explicit geometric X with more natural or randomly generated action sets that still exhibit the polynomial separation.
  • Validating the norm-estimation subroutine empirically in moderate dimensions would provide a concrete test of the O(d log(1/δ)/ε²) bound.

Load-bearing premise

The claim depends on the existence of a specific geometric action set X that permits embedding the norm-estimation problem into the best-arm identification task.

What would settle it

A direct calculation or simulation on the constructed action set showing that every adaptive algorithm still requires Ω(w(X)²/ε²) samples or that the proposed norm estimator needs more than O(d log(1/δ)/ε²) draws to achieve additive ε accuracy.

read the original abstract

We study the minimax sample complexity of $\varepsilon$-best arm identification in linear bandits. Given a compact action set $\mathcal{X}$ that spans $\mathbb{R}^d$ and an unknown reward vector $\theta\in\mathbb{R}^d$, the goal is to output an arm $\widehat{x}\in\mathcal{X}$ such that $\langle \widehat{x},\theta\rangle \ge \max_{x\in\mathcal{X}} \langle x,\theta\rangle - \varepsilon$ with probability at least $1-\delta$, using as few samples as possible. First, we present a non-adaptive fixed-design method with sample complexity $\mathcal{O}\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}+\frac{w(\mathcal{X})^2}{\varepsilon^2}\right)$, where $w(\mathcal{X})$ is a Gaussian width term dependent on $\mathcal{X}$, and we prove a matching lower bound $\Omega\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}+\frac{w(\mathcal{X})^2}{\varepsilon^2}\right)$ for all non-adaptive fixed-design methods. We then turn to adaptive sampling. We raise an important structural question: beyond the canonical basis, are there structured action sets for which adaptivity yields only logarithmic-factor improvements over the optimal non-adaptive rate? We answer in the affirmative for several natural action sets, namely the hypercube, the $\ell_2$ ball, $m$-sets, and multi-task multi-armed bandits. Finally, we provide the first construction of an action set $\mathcal{X}$ for which adaptivity yields a polynomial-factor improvement over every non-adaptive algorithm. A key ingredient behind this separation is an $\ell_2$-norm estimation subroutine: we design an adaptive algorithm that uses $\mathcal{O}\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}\right)$ samples from the unit $\ell_2$ ball in $\mathbb{R}^d$ and outputs an estimate $\widehat r$ satisfying $|\widehat r-\|\theta\|_2|\le \varepsilon$ with probability at least $1-\delta$, where $\theta$ is the unknown reward vector.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript studies the minimax sample complexity of ε-best arm identification in linear bandits over a compact action set X spanning R^d. It gives a non-adaptive fixed-design algorithm achieving O(d log(1/δ)/ε² + w(X)²/ε²) samples together with a matching Ω lower bound for all non-adaptive fixed-design methods. For several natural action sets (hypercube, ℓ₂ ball, m-sets, multi-task MAB) it shows that adaptivity improves the rate by only logarithmic factors. It then constructs a specific action set X for which adaptivity yields a polynomial-factor improvement over every non-adaptive algorithm, using as a key subroutine an adaptive procedure that estimates ||θ||₂ to additive ε accuracy with O(d log(1/δ)/ε²) samples from the unit ℓ₂ ball.

Significance. If the separation result holds, the paper supplies the first explicit geometric construction separating adaptive and non-adaptive rates by a polynomial factor in linear bandits, together with a reusable adaptive ℓ₂-norm estimator. The appearance of the Gaussian width w(X) as the precise geometric complexity term for the non-adaptive case is a clean and falsifiable contribution. These elements would strengthen the literature on when adaptivity helps in structured linear problems.

major comments (2)
  1. [Abstract] Abstract: the matching lower bound Ω(d log(1/δ)/ε² + w(X)²/ε²) is stated only for non-adaptive fixed-design methods, yet the central separation claim asserts a polynomial improvement “over every non-adaptive algorithm.” Randomized non-adaptive designs (fixed distributions over arms chosen independently of observations) are not covered by the fixed-design lower bound; if such designs can achieve rates closer to the adaptive O(d log(1/δ)/ε²) norm-estimation subroutine, the polynomial separation does not hold for the full class of non-adaptive algorithms.
  2. [Abstract] Abstract and the section describing the separating construction: the load-bearing modeling choice is the specific geometric action set X together with the reduction that embeds the ℓ₂-norm estimator into the best-arm identification problem. The manuscript must verify that this reduction preserves the polynomial gap for the reward distributions under consideration and that no simpler non-adaptive (possibly randomized) strategy on the same X can match the adaptive rate.
minor comments (2)
  1. Notation: the precise definition of the Gaussian width w(X) and its relation to the action set geometry should be recalled explicitly when first used, even if standard.
  2. The manuscript should clarify whether the O(d log(1/δ)/ε²) norm-estimation bound assumes sub-Gaussian noise or holds more generally.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our work. The comments regarding the precise scope of the non-adaptive lower bound and the need for verification in the separating construction are well-taken. We provide point-by-point responses below and will make the necessary revisions to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the matching lower bound Ω(d log(1/δ)/ε² + w(X)²/ε²) is stated only for non-adaptive fixed-design methods, yet the central separation claim asserts a polynomial improvement “over every non-adaptive algorithm.” Randomized non-adaptive designs (fixed distributions over arms chosen independently of observations) are not covered by the fixed-design lower bound; if such designs can achieve rates closer to the adaptive O(d log(1/δ)/ε²) norm-estimation subroutine, the polynomial separation does not hold for the full class of non-adaptive algorithms.

    Authors: We acknowledge the validity of this comment. Our current lower bound applies exclusively to non-adaptive fixed-design methods. To strengthen the separation claim, we will revise the manuscript to include an analysis showing that randomized non-adaptive algorithms cannot achieve a substantially better rate than fixed-design methods for the action sets considered, particularly in the separating construction. Alternatively, we will adjust the claim to refer specifically to non-adaptive fixed-design algorithms if a general extension proves technically challenging. This revision will ensure the polynomial improvement is accurately positioned relative to the class of non-adaptive methods. revision: yes

  2. Referee: [Abstract] Abstract and the section describing the separating construction: the load-bearing modeling choice is the specific geometric action set X together with the reduction that embeds the ℓ₂-norm estimator into the best-arm identification problem. The manuscript must verify that this reduction preserves the polynomial gap for the reward distributions under consideration and that no simpler non-adaptive (possibly randomized) strategy on the same X can match the adaptive rate.

    Authors: We agree that explicit verification of the reduction and its implications is necessary. In the revised version, we will add a dedicated subsection that details the embedding of the adaptive ℓ₂-norm estimator into the ε-best arm identification task on the constructed action set X. This will demonstrate that the reward distributions and the polynomial sample-complexity gap are preserved. We will further establish that, for this specific X, no non-adaptive strategy—including those based on randomized fixed distributions—can match the adaptive rate, thereby confirming that the separation is robust to the modeling choices. revision: yes

Circularity Check

0 steps flagged

No circularity: rates from concentration and explicit geometric construction

full rationale

The paper derives the non-adaptive fixed-design upper and lower bounds directly from Gaussian concentration and minimax arguments on the action set geometry (w(X) term defined externally). The adaptive separation result rests on an explicit construction of a separating action set X together with a first-principles adaptive norm-estimation subroutine whose sample complexity is obtained from standard vector concentration, not from fitting to the target best-arm quantity. No equation reduces to a prior result by definition, no parameter is fitted to the claimed separation, and no load-bearing step is justified solely by self-citation. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The paper relies on standard sub-Gaussian concentration and minimax arguments for linear bandits; the separating action set is introduced as a new geometric object whose properties are proved rather than postulated without evidence.

axioms (1)
  • domain assumption Rewards are sub-Gaussian with known variance proxy
    Standard assumption invoked for all concentration bounds in the sample-complexity derivations.
invented entities (1)
  • Separating action set X independent evidence
    purpose: To exhibit a geometry where adaptive sampling achieves polynomial improvement over any non-adaptive procedure
    The set is constructed explicitly in the paper; independent evidence consists of the matching lower bound for non-adaptive methods on that set.

pith-pipeline@v0.9.0 · 5945 in / 1541 out tokens · 69479 ms · 2026-05-20T20:49:03.186372+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

  1. [1]

    International Conference on Computational Learning Theory , pages=

    PAC bounds for multi-armed bandit and Markov decision processes , author=. International Conference on Computational Learning Theory , pages=. 2002 , organization=

  2. [2]

    Journal of Computer and System Sciences , volume=

    Combinatorial bandits , author=. Journal of Computer and System Sciences , volume=. 2012 , publisher=

  3. [3]

    Conference on Learning Theory , pages=

    Tight bounds for bandit combinatorial optimization , author=. Conference on Learning Theory , pages=. 2017 , organization=

  4. [4]

    arXiv preprint arXiv:2504.00461 , year=

    Efficient near-optimal algorithm for online shortest paths in directed acyclic graphs with bandit feedback against adaptive adversaries , author=. arXiv preprint arXiv:2504.00461 , year=

  5. [5]

    arXiv preprint arXiv:2510.17099 , year=

    On the Universal Near Optimality of Hedge in Combinatorial Settings , author=. arXiv preprint arXiv:2510.17099 , year=

  6. [6]

    Advances in neural information processing systems , volume=

    Best-arm identification in linear bandits , author=. Advances in neural information processing systems , volume=

  7. [7]

    Advances in Neural Information Processing Systems , volume=

    Optimal best-arm identification in linear bandits , author=. Advances in Neural Information Processing Systems , volume=

  8. [8]

    Advances in neural information processing systems , volume=

    Sequential experimental design for transductive linear bandits , author=. Advances in neural information processing systems , volume=

  9. [9]

    Advances in Neural Information Processing Systems , volume=

    An empirical process approach to the union bound: Practical algorithms for combinatorial and linear bandits , author=. Advances in Neural Information Processing Systems , volume=

  10. [10]

    International Conference on Machine Learning , pages=

    Best arm identification in linear bandits with linear dimension dependency , author=. International Conference on Machine Learning , pages=. 2018 , organization=

  11. [11]

    International Conference on Machine Learning , pages=

    Gamification of pure exploration for linear bandits , author=. International Conference on Machine Learning , pages=. 2020 , organization=

  12. [12]

    International Conference on Artificial Intelligence and Statistics , pages=

    A fully adaptive algorithm for pure exploration in linear bandits , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2018 , organization=

  13. [13]

    Advances in Neural Information Processing Systems , volume=

    Verification based solution for structured mab problems , author=. Advances in Neural Information Processing Systems , volume=

  14. [14]

    International Conference on Artificial Intelligence and Statistics , pages=

    Experimental design for regret minimization in linear bandits , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2021 , organization=

  15. [15]

    2020 , publisher=

    Bandit algorithms , author=. 2020 , publisher=

  16. [16]

    International conference on artificial intelligence and statistics , pages=

    Sublinear optimal policy value estimation in contextual bandits , author=. International conference on artificial intelligence and statistics , pages=. 2020 , organization=

  17. [17]

    The Annals of Statistics , volume=

    Testing Composite Hypotheses, Hermite Polynomials and Optimal Estimation of a Nonsmooth Functional , author=. The Annals of Statistics , volume=

  18. [18]

    Bernoulli , year=

    ON ESTIMATION OF NONSMOOTH FUNCTIONALS OF SPARSE NORMAL MEANS , author=. Bernoulli , year=

  19. [19]

    Probability Theory and Related Fields , volume=

    On estimation of L r-norms in Gaussian white noise models , author=. Probability Theory and Related Fields , volume=. 2020 , publisher=

  20. [20]

    Upper bounds via the oracle approach , author=

    Adaptive estimation of the L 2-norm of a probability density and related topics II. Upper bounds via the oracle approach , author=. The Annals of Statistics , volume=. 2025 , publisher=

  21. [21]

    Advances in Neural Information Processing Systems , volume=

    Assouad, Fano, and Le Cam with Interaction: A Unifying Lower Bound Framework and Characterization for Bandit Learnability , author=. Advances in Neural Information Processing Systems , volume=

  22. [22]

    Conference on Learning Theory , pages=

    On the complexity of bandit linear optimization , author=. Conference on Learning Theory , pages=. 2015 , organization=

  23. [23]

    2006 , publisher=

    Optimal design of experiments , author=. 2006 , publisher=

  24. [24]

    IEEE Transactions on Information Theory , volume=

    On the fundamental limits of adaptive sensing , author=. IEEE Transactions on Information Theory , volume=. 2012 , publisher=

  25. [25]

    1935 , publisher=

    The Design of Experiments , author=. 1935 , publisher=

  26. [26]

    2018 , publisher=

    High-dimensional probability: An introduction with applications in data science , author=. 2018 , publisher=

  27. [27]

    The gamma function and stirling’s formula , author=

  28. [28]

    Flavors of geometry , volume=

    An elementary introduction to modern convex geometry , author=. Flavors of geometry , volume=

  29. [29]

    Mathematical Programming , volume=

    Near-optimal discrete optimization for experimental design: A regret minimization approach , author=. Mathematical Programming , volume=. 2021 , publisher=

  30. [30]

    Machine learning , volume=

    Linear least-squares algorithms for temporal difference learning , author=. Machine learning , volume=. 1996 , publisher=

  31. [31]

    Conference on learning theory , pages=

    Provably efficient reinforcement learning with linear function approximation , author=. Conference on learning theory , pages=. 2020 , organization=