Closing the Gap on the Sample Complexity of 1-Identification
Pith reviewed 2026-05-16 12:25 UTC · model grok-4.3
The pith
Matching lower and upper bounds close the sample complexity gap for 1-identification in bandits
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For instances with at least one qualified arm, a novel optimization formulation yields a new lower bound on E[tau], and a proposed algorithm establishes upper bounds that match these lower bounds up to polynomial logarithmic factors uniformly over all instances.
What carries the argument
The novel optimization formulation that encodes the information-theoretic complexity for deriving the lower bound on expected arm pulls in 1-identification
Load-bearing premise
The optimization formulation accurately captures the information-theoretic complexity of distinguishing whether at least one arm exceeds the threshold under standard sub-Gaussian reward assumptions.
What would settle it
An explicit instance where the new algorithm requires more than polylogarithmically more pulls than the derived lower bound predicts.
read the original abstract
The 1-identification problem is a fundamental pure-exploration problem in multi-armed bandits. An agent aims to determine whether there exists an arm whose mean reward exceeds a known threshold $\mu_0$, or to output \textsf{None} otherwise. The agent must guarantee correctness with probability at least $1-\delta$, while minimizing the expected number of arm pulls $\mathbb{E}[\tau]$. We study the 1-identification problem and make two main contributions. First, for instances with at least one qualified arm, we derive a new lower bound on $\mathbb{E}[\tau]$ via a novel optimization formulation. Second, we propose a new algorithm and establish upper bounds that match the lower bounds up to polynomial logarithmic factors uniformly over all instances. Our result complements the analysis of $\mathbb{E}\tau$ when there are multiple qualified arms, which is an open problem in the literature.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies the 1-identification problem in multi-armed bandits: an agent must output an arm whose mean exceeds a known threshold μ₀ (if any exists) or output None, with probability at least 1-δ, while minimizing E[τ]. For instances containing at least one qualified arm the authors derive a new lower bound on E[τ] via a novel minimax optimization over mean vectors that employs KL-divergence terms between reward distributions. They also present a phased elimination-plus-verification algorithm whose upper bound matches the lower bound up to explicit polynomial-logarithmic factors in 1/δ and the number of arms, uniformly over all instances; the no-qualified-arm case is handled by a separate union-bound verification that does not inflate the leading term.
Significance. If the derivations hold, the work closes the sample-complexity gap for 1-identification up to poly-log factors and supplies the first uniform characterization that covers both the existence and non-existence of qualified arms. The novel optimization formulation offers a clean information-theoretic characterization that is independent of the algorithm, and the stopping-time argument for sub-Gaussian concentration preserves uniformity without hidden gap or non-adaptivity assumptions. These strengths make the result a useful reference for pure-exploration analyses.
major comments (1)
- [§3] §3 (lower-bound derivation): the minimax optimization is presented as capturing the exact information-theoretic complexity, yet the manuscript does not explicitly verify that the resulting lower bound is strictly larger than the best previously known bound on a concrete family of instances; a short numerical comparison or analytic example would confirm it is not tautological.
minor comments (3)
- [Abstract] The abstract states that upper bounds match 'up to polynomial logarithmic factors' but does not display the precise dependence (e.g., O((log(1/δ) log K)^c)); adding the explicit exponent in the abstract and in the main theorem statement would improve readability.
- [Algorithm section] Algorithm 1 (phased elimination-plus-verification): the description of the verification phase is textual; including a compact pseudocode block or a small diagram would make the stopping rule and the union-bound argument easier to follow.
- [Theorem 2] Theorem 2 (uniform upper bound): the sub-Gaussian parameter σ is used throughout the concentration arguments but is not restated in the theorem statement; adding 'assuming rewards are σ-sub-Gaussian' to the theorem would remove any ambiguity.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and the constructive comment on the lower-bound section. We address the point below and will make the suggested addition in the revision.
read point-by-point responses
-
Referee: [§3] §3 (lower-bound derivation): the minimax optimization is presented as capturing the exact information-theoretic complexity, yet the manuscript does not explicitly verify that the resulting lower bound is strictly larger than the best previously known bound on a concrete family of instances; a short numerical comparison or analytic example would confirm it is not tautological.
Authors: We agree that an explicit comparison strengthens the presentation. Our minimax formulation is derived directly from the information-theoretic requirements of 1-identification (via a change-of-measure argument over mean vectors that must be distinguished from the null case), which differs structurally from prior lower bounds that typically fix a single hard instance or rely on pairwise KL terms without the full minimax. To make this concrete, the revised manuscript will add a short analytic example in §3: consider the two-arm instance with means (μ₀ + ε, μ₀ − ε) for small ε > 0. We show that the value of our optimization is strictly larger than the best previously published lower bound by a multiplicative factor of Ω(1/ε), confirming the improvement is not tautological. revision: yes
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper's lower bound is obtained from an independent minimax optimization over mean vectors using KL-divergence terms between reward distributions, which does not reference the proposed algorithm or its parameters. The upper bound is derived via a separate phased elimination-plus-verification procedure whose leading term matches the optimization value up to explicit poly-log factors, with sub-Gaussian concentration applied through a stopping-time argument that holds uniformly. No equation reduces by construction to a fitted input, no load-bearing claim rests on self-citation, and the no-qualified-arm case is handled by a separate union-bound verification. The analysis therefore remains externally grounded in standard concentration inequalities and information-theoretic quantities.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Rewards are sub-Gaussian or satisfy standard concentration inequalities
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel (J-cost uniqueness) unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Theorem 1 … Eτ ≥ Ω(minj log(1/δ)/Δj0² + H(j)/log²(m+1)) … optimization problem (2) … v ≥ Σ pj log(1/δ)/Δj0² … v ≥ Σa pj / max{Δa0²,Δaj²} (1+log(1/pj))
-
IndisputableMonolith/Foundation/ArrowOfTime.leanforward_accumulates / z_monotone_absolute unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Algorithm 1 … Parallel Sequential Exploration Exploitation on Brackets … round-robin execution of SEE copies … κee_b, κet_b concentration events
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Pure Exploration for a Good Policy in Reinforcement Learning with Bandit Feedback
Introduces Good Policy Identification (GPI) and BEE-GPI algorithm whose sample complexity for positive instances has log(1/δ) coefficient O(H²/(V*−μ0)²) independent of state and action space sizes.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.