On the Power of Adaptivity for varepsilon-Best Arm Identification in Linear Bandits
Pith reviewed 2026-05-20 20:49 UTC · model grok-4.3
The pith
There exists an action set in linear bandits where adaptivity gives a polynomial improvement over non-adaptive methods for ε-best arm identification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that there exists a structured action set X for which any non-adaptive algorithm requires polynomially more samples than an adaptive algorithm to output an arm whose expected reward is within ε of the optimal arm with probability 1-δ. The separation is obtained by reducing an ℓ2-norm estimation problem to best-arm identification on this X; an adaptive procedure solves the norm estimation task using O(d log(1/δ)/ε²) samples from the unit ball and thereby achieves the improved rate on the constructed instance.
What carries the argument
An adaptive ℓ2-norm estimation algorithm that draws O(d log(1/δ)/ε²) samples from the unit ℓ2 ball and returns an estimate r̂ satisfying |r̂ - ||θ||₂| ≤ ε with high probability.
If this is right
- Non-adaptive fixed-design methods are lower-bounded by Ω(d log(1/δ)/ε² + w(X)²/ε²) for every action set X.
- For the hypercube, ℓ2 ball, m-sets, and multi-task bandits, adaptivity improves the non-adaptive rate by at most logarithmic factors.
- On the specially constructed action set, the adaptive minimax sample complexity is polynomially smaller than the non-adaptive minimax sample complexity.
- The adaptive norm estimator itself succeeds with O(d log(1/δ)/ε²) samples drawn from the unit ball.
Where Pith is reading between the lines
- The separation result indicates that geometric structure in the action set can make adaptivity essential for near-optimal sample efficiency in linear identification problems.
- Similar polynomial gaps may exist in related settings such as linear regression with structured design matrices or contextual bandit identification.
- Future constructions could aim to replace the explicit geometric X with more natural or randomly generated action sets that still exhibit the polynomial separation.
- Validating the norm-estimation subroutine empirically in moderate dimensions would provide a concrete test of the O(d log(1/δ)/ε²) bound.
Load-bearing premise
The claim depends on the existence of a specific geometric action set X that permits embedding the norm-estimation problem into the best-arm identification task.
What would settle it
A direct calculation or simulation on the constructed action set showing that every adaptive algorithm still requires Ω(w(X)²/ε²) samples or that the proposed norm estimator needs more than O(d log(1/δ)/ε²) draws to achieve additive ε accuracy.
read the original abstract
We study the minimax sample complexity of $\varepsilon$-best arm identification in linear bandits. Given a compact action set $\mathcal{X}$ that spans $\mathbb{R}^d$ and an unknown reward vector $\theta\in\mathbb{R}^d$, the goal is to output an arm $\widehat{x}\in\mathcal{X}$ such that $\langle \widehat{x},\theta\rangle \ge \max_{x\in\mathcal{X}} \langle x,\theta\rangle - \varepsilon$ with probability at least $1-\delta$, using as few samples as possible. First, we present a non-adaptive fixed-design method with sample complexity $\mathcal{O}\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}+\frac{w(\mathcal{X})^2}{\varepsilon^2}\right)$, where $w(\mathcal{X})$ is a Gaussian width term dependent on $\mathcal{X}$, and we prove a matching lower bound $\Omega\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}+\frac{w(\mathcal{X})^2}{\varepsilon^2}\right)$ for all non-adaptive fixed-design methods. We then turn to adaptive sampling. We raise an important structural question: beyond the canonical basis, are there structured action sets for which adaptivity yields only logarithmic-factor improvements over the optimal non-adaptive rate? We answer in the affirmative for several natural action sets, namely the hypercube, the $\ell_2$ ball, $m$-sets, and multi-task multi-armed bandits. Finally, we provide the first construction of an action set $\mathcal{X}$ for which adaptivity yields a polynomial-factor improvement over every non-adaptive algorithm. A key ingredient behind this separation is an $\ell_2$-norm estimation subroutine: we design an adaptive algorithm that uses $\mathcal{O}\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}\right)$ samples from the unit $\ell_2$ ball in $\mathbb{R}^d$ and outputs an estimate $\widehat r$ satisfying $|\widehat r-\|\theta\|_2|\le \varepsilon$ with probability at least $1-\delta$, where $\theta$ is the unknown reward vector.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies the minimax sample complexity of ε-best arm identification in linear bandits over a compact action set X spanning R^d. It gives a non-adaptive fixed-design algorithm achieving O(d log(1/δ)/ε² + w(X)²/ε²) samples together with a matching Ω lower bound for all non-adaptive fixed-design methods. For several natural action sets (hypercube, ℓ₂ ball, m-sets, multi-task MAB) it shows that adaptivity improves the rate by only logarithmic factors. It then constructs a specific action set X for which adaptivity yields a polynomial-factor improvement over every non-adaptive algorithm, using as a key subroutine an adaptive procedure that estimates ||θ||₂ to additive ε accuracy with O(d log(1/δ)/ε²) samples from the unit ℓ₂ ball.
Significance. If the separation result holds, the paper supplies the first explicit geometric construction separating adaptive and non-adaptive rates by a polynomial factor in linear bandits, together with a reusable adaptive ℓ₂-norm estimator. The appearance of the Gaussian width w(X) as the precise geometric complexity term for the non-adaptive case is a clean and falsifiable contribution. These elements would strengthen the literature on when adaptivity helps in structured linear problems.
major comments (2)
- [Abstract] Abstract: the matching lower bound Ω(d log(1/δ)/ε² + w(X)²/ε²) is stated only for non-adaptive fixed-design methods, yet the central separation claim asserts a polynomial improvement “over every non-adaptive algorithm.” Randomized non-adaptive designs (fixed distributions over arms chosen independently of observations) are not covered by the fixed-design lower bound; if such designs can achieve rates closer to the adaptive O(d log(1/δ)/ε²) norm-estimation subroutine, the polynomial separation does not hold for the full class of non-adaptive algorithms.
- [Abstract] Abstract and the section describing the separating construction: the load-bearing modeling choice is the specific geometric action set X together with the reduction that embeds the ℓ₂-norm estimator into the best-arm identification problem. The manuscript must verify that this reduction preserves the polynomial gap for the reward distributions under consideration and that no simpler non-adaptive (possibly randomized) strategy on the same X can match the adaptive rate.
minor comments (2)
- Notation: the precise definition of the Gaussian width w(X) and its relation to the action set geometry should be recalled explicitly when first used, even if standard.
- The manuscript should clarify whether the O(d log(1/δ)/ε²) norm-estimation bound assumes sub-Gaussian noise or holds more generally.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on our work. The comments regarding the precise scope of the non-adaptive lower bound and the need for verification in the separating construction are well-taken. We provide point-by-point responses below and will make the necessary revisions to the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the matching lower bound Ω(d log(1/δ)/ε² + w(X)²/ε²) is stated only for non-adaptive fixed-design methods, yet the central separation claim asserts a polynomial improvement “over every non-adaptive algorithm.” Randomized non-adaptive designs (fixed distributions over arms chosen independently of observations) are not covered by the fixed-design lower bound; if such designs can achieve rates closer to the adaptive O(d log(1/δ)/ε²) norm-estimation subroutine, the polynomial separation does not hold for the full class of non-adaptive algorithms.
Authors: We acknowledge the validity of this comment. Our current lower bound applies exclusively to non-adaptive fixed-design methods. To strengthen the separation claim, we will revise the manuscript to include an analysis showing that randomized non-adaptive algorithms cannot achieve a substantially better rate than fixed-design methods for the action sets considered, particularly in the separating construction. Alternatively, we will adjust the claim to refer specifically to non-adaptive fixed-design algorithms if a general extension proves technically challenging. This revision will ensure the polynomial improvement is accurately positioned relative to the class of non-adaptive methods. revision: yes
-
Referee: [Abstract] Abstract and the section describing the separating construction: the load-bearing modeling choice is the specific geometric action set X together with the reduction that embeds the ℓ₂-norm estimator into the best-arm identification problem. The manuscript must verify that this reduction preserves the polynomial gap for the reward distributions under consideration and that no simpler non-adaptive (possibly randomized) strategy on the same X can match the adaptive rate.
Authors: We agree that explicit verification of the reduction and its implications is necessary. In the revised version, we will add a dedicated subsection that details the embedding of the adaptive ℓ₂-norm estimator into the ε-best arm identification task on the constructed action set X. This will demonstrate that the reward distributions and the polynomial sample-complexity gap are preserved. We will further establish that, for this specific X, no non-adaptive strategy—including those based on randomized fixed distributions—can match the adaptive rate, thereby confirming that the separation is robust to the modeling choices. revision: yes
Circularity Check
No circularity: rates from concentration and explicit geometric construction
full rationale
The paper derives the non-adaptive fixed-design upper and lower bounds directly from Gaussian concentration and minimax arguments on the action set geometry (w(X) term defined externally). The adaptive separation result rests on an explicit construction of a separating action set X together with a first-principles adaptive norm-estimation subroutine whose sample complexity is obtained from standard vector concentration, not from fitting to the target best-arm quantity. No equation reduces to a prior result by definition, no parameter is fitted to the claimed separation, and no load-bearing step is justified solely by self-citation. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Rewards are sub-Gaussian with known variance proxy
invented entities (1)
-
Separating action set X
independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
non-adaptive fixed-design method with sample complexity O(d log(1/δ)/ε² + w(X)²/ε²) … matching lower bound Ω(d log(1/δ)/ε² + w(X)²/ε²)
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanalpha_pin_under_high_calibration unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
ℓ₂-norm estimation subroutine … O(d log(1/δ)/ε²) samples … |r̂ − ∥θ∥₂| ≤ ε
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
International Conference on Computational Learning Theory , pages=
PAC bounds for multi-armed bandit and Markov decision processes , author=. International Conference on Computational Learning Theory , pages=. 2002 , organization=
work page 2002
-
[2]
Journal of Computer and System Sciences , volume=
Combinatorial bandits , author=. Journal of Computer and System Sciences , volume=. 2012 , publisher=
work page 2012
-
[3]
Conference on Learning Theory , pages=
Tight bounds for bandit combinatorial optimization , author=. Conference on Learning Theory , pages=. 2017 , organization=
work page 2017
-
[4]
arXiv preprint arXiv:2504.00461 , year=
Efficient near-optimal algorithm for online shortest paths in directed acyclic graphs with bandit feedback against adaptive adversaries , author=. arXiv preprint arXiv:2504.00461 , year=
-
[5]
arXiv preprint arXiv:2510.17099 , year=
On the Universal Near Optimality of Hedge in Combinatorial Settings , author=. arXiv preprint arXiv:2510.17099 , year=
-
[6]
Advances in neural information processing systems , volume=
Best-arm identification in linear bandits , author=. Advances in neural information processing systems , volume=
-
[7]
Advances in Neural Information Processing Systems , volume=
Optimal best-arm identification in linear bandits , author=. Advances in Neural Information Processing Systems , volume=
-
[8]
Advances in neural information processing systems , volume=
Sequential experimental design for transductive linear bandits , author=. Advances in neural information processing systems , volume=
-
[9]
Advances in Neural Information Processing Systems , volume=
An empirical process approach to the union bound: Practical algorithms for combinatorial and linear bandits , author=. Advances in Neural Information Processing Systems , volume=
-
[10]
International Conference on Machine Learning , pages=
Best arm identification in linear bandits with linear dimension dependency , author=. International Conference on Machine Learning , pages=. 2018 , organization=
work page 2018
-
[11]
International Conference on Machine Learning , pages=
Gamification of pure exploration for linear bandits , author=. International Conference on Machine Learning , pages=. 2020 , organization=
work page 2020
-
[12]
International Conference on Artificial Intelligence and Statistics , pages=
A fully adaptive algorithm for pure exploration in linear bandits , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2018 , organization=
work page 2018
-
[13]
Advances in Neural Information Processing Systems , volume=
Verification based solution for structured mab problems , author=. Advances in Neural Information Processing Systems , volume=
-
[14]
International Conference on Artificial Intelligence and Statistics , pages=
Experimental design for regret minimization in linear bandits , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2021 , organization=
work page 2021
- [15]
-
[16]
International conference on artificial intelligence and statistics , pages=
Sublinear optimal policy value estimation in contextual bandits , author=. International conference on artificial intelligence and statistics , pages=. 2020 , organization=
work page 2020
-
[17]
The Annals of Statistics , volume=
Testing Composite Hypotheses, Hermite Polynomials and Optimal Estimation of a Nonsmooth Functional , author=. The Annals of Statistics , volume=
-
[18]
ON ESTIMATION OF NONSMOOTH FUNCTIONALS OF SPARSE NORMAL MEANS , author=. Bernoulli , year=
-
[19]
Probability Theory and Related Fields , volume=
On estimation of L r-norms in Gaussian white noise models , author=. Probability Theory and Related Fields , volume=. 2020 , publisher=
work page 2020
-
[20]
Upper bounds via the oracle approach , author=
Adaptive estimation of the L 2-norm of a probability density and related topics II. Upper bounds via the oracle approach , author=. The Annals of Statistics , volume=. 2025 , publisher=
work page 2025
-
[21]
Advances in Neural Information Processing Systems , volume=
Assouad, Fano, and Le Cam with Interaction: A Unifying Lower Bound Framework and Characterization for Bandit Learnability , author=. Advances in Neural Information Processing Systems , volume=
-
[22]
Conference on Learning Theory , pages=
On the complexity of bandit linear optimization , author=. Conference on Learning Theory , pages=. 2015 , organization=
work page 2015
- [23]
-
[24]
IEEE Transactions on Information Theory , volume=
On the fundamental limits of adaptive sensing , author=. IEEE Transactions on Information Theory , volume=. 2012 , publisher=
work page 2012
- [25]
-
[26]
High-dimensional probability: An introduction with applications in data science , author=. 2018 , publisher=
work page 2018
-
[27]
The gamma function and stirling’s formula , author=
-
[28]
An elementary introduction to modern convex geometry , author=. Flavors of geometry , volume=
-
[29]
Mathematical Programming , volume=
Near-optimal discrete optimization for experimental design: A regret minimization approach , author=. Mathematical Programming , volume=. 2021 , publisher=
work page 2021
-
[30]
Linear least-squares algorithms for temporal difference learning , author=. Machine learning , volume=. 1996 , publisher=
work page 1996
-
[31]
Conference on learning theory , pages=
Provably efficient reinforcement learning with linear function approximation , author=. Conference on learning theory , pages=. 2020 , organization=
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.