Closing the Gap on the Sample Complexity of 1-Identification

Wang Chi Cheung; Zitian Li

arxiv: 2601.15620 · v2 · pith:DE7VXBWMnew · submitted 2026-01-22 · 💻 cs.LG

Closing the Gap on the Sample Complexity of 1-Identification

Zitian Li , Wang Chi Cheung This is my paper

Pith reviewed 2026-05-16 12:25 UTC · model grok-4.3

classification 💻 cs.LG

keywords 1-identificationmulti-armed banditspure explorationsample complexitylower boundupper boundthreshold identification

0 comments

The pith

Matching lower and upper bounds close the sample complexity gap for 1-identification in bandits

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In the 1-identification problem, an agent must decide whether any arm mean exceeds a known threshold or output None, while guaranteeing correctness with probability at least 1-delta and minimizing expected pulls. The paper derives a new lower bound on expected pulls for instances with at least one qualified arm by introducing a novel optimization formulation that encodes the information complexity. It then gives an algorithm whose upper bounds match the lower bounds up to polynomial logarithmic factors uniformly over all instances. This provides a near-tight characterization that fills the previous gap for the single-qualified-arm case.

Core claim

For instances with at least one qualified arm, a novel optimization formulation yields a new lower bound on E[tau], and a proposed algorithm establishes upper bounds that match these lower bounds up to polynomial logarithmic factors uniformly over all instances.

What carries the argument

The novel optimization formulation that encodes the information-theoretic complexity for deriving the lower bound on expected arm pulls in 1-identification

Load-bearing premise

The optimization formulation accurately captures the information-theoretic complexity of distinguishing whether at least one arm exceeds the threshold under standard sub-Gaussian reward assumptions.

What would settle it

An explicit instance where the new algorithm requires more than polylogarithmically more pulls than the derived lower bound predicts.

read the original abstract

The 1-identification problem is a fundamental pure-exploration problem in multi-armed bandits. An agent aims to determine whether there exists an arm whose mean reward exceeds a known threshold $\mu_0$, or to output \textsf{None} otherwise. The agent must guarantee correctness with probability at least $1-\delta$, while minimizing the expected number of arm pulls $\mathbb{E}[\tau]$. We study the 1-identification problem and make two main contributions. First, for instances with at least one qualified arm, we derive a new lower bound on $\mathbb{E}[\tau]$ via a novel optimization formulation. Second, we propose a new algorithm and establish upper bounds that match the lower bounds up to polynomial logarithmic factors uniformly over all instances. Our result complements the analysis of $\mathbb{E}\tau$ when there are multiple qualified arms, which is an open problem in the literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper closes the sample complexity gap for 1-identification with a new optimization lower bound and near-matching uniform upper bounds.

read the letter

The main thing to know is that this paper closes a specific gap in the sample complexity of 1-identification in multi-armed bandits. They derive a new lower bound using a novel optimization formulation for cases with at least one qualified arm, and they give an algorithm whose upper bound matches it up to polynomial log factors uniformly over all instances. The paper does well by providing these uniform guarantees. The lower bound comes from a minimax optimization over possible mean vectors with KL-divergence terms. The algorithm uses phased elimination and verification, applying sub-Gaussian concentration via stopping times to keep the bound uniform. The no-qualified-arm case is handled separately without inflating the complexity beyond the log factors. This complements the open problem for multiple qualified arms. A soft spot is the remaining poly-log gap between lower and upper bounds. It's explicit and standard, but it means the result is not fully tight. The optimization formulation appears independent and not circular with the algorithm, which is positive. Without the full proofs, one might worry about edge cases in the instance constructions, but the provided stress-test suggests the derivations are clean. This is for researchers working on pure-exploration bandits and sample complexity bounds. A reader interested in information-theoretic lower bounds or adaptive algorithms for identification would get concrete value from the new formulation and the uniform analysis. It deserves a serious referee. The contributions are focused and the approach seems solid enough to merit detailed review.

Referee Report

1 major / 3 minor

Summary. The manuscript studies the 1-identification problem in multi-armed bandits: an agent must output an arm whose mean exceeds a known threshold μ₀ (if any exists) or output None, with probability at least 1-δ, while minimizing E[τ]. For instances containing at least one qualified arm the authors derive a new lower bound on E[τ] via a novel minimax optimization over mean vectors that employs KL-divergence terms between reward distributions. They also present a phased elimination-plus-verification algorithm whose upper bound matches the lower bound up to explicit polynomial-logarithmic factors in 1/δ and the number of arms, uniformly over all instances; the no-qualified-arm case is handled by a separate union-bound verification that does not inflate the leading term.

Significance. If the derivations hold, the work closes the sample-complexity gap for 1-identification up to poly-log factors and supplies the first uniform characterization that covers both the existence and non-existence of qualified arms. The novel optimization formulation offers a clean information-theoretic characterization that is independent of the algorithm, and the stopping-time argument for sub-Gaussian concentration preserves uniformity without hidden gap or non-adaptivity assumptions. These strengths make the result a useful reference for pure-exploration analyses.

major comments (1)

[§3] §3 (lower-bound derivation): the minimax optimization is presented as capturing the exact information-theoretic complexity, yet the manuscript does not explicitly verify that the resulting lower bound is strictly larger than the best previously known bound on a concrete family of instances; a short numerical comparison or analytic example would confirm it is not tautological.

minor comments (3)

[Abstract] The abstract states that upper bounds match 'up to polynomial logarithmic factors' but does not display the precise dependence (e.g., O((log(1/δ) log K)^c)); adding the explicit exponent in the abstract and in the main theorem statement would improve readability.
[Algorithm section] Algorithm 1 (phased elimination-plus-verification): the description of the verification phase is textual; including a compact pseudocode block or a small diagram would make the stopping rule and the union-bound argument easier to follow.
[Theorem 2] Theorem 2 (uniform upper bound): the sub-Gaussian parameter σ is used throughout the concentration arguments but is not restated in the theorem statement; adding 'assuming rewards are σ-sub-Gaussian' to the theorem would remove any ambiguity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment and the constructive comment on the lower-bound section. We address the point below and will make the suggested addition in the revision.

read point-by-point responses

Referee: [§3] §3 (lower-bound derivation): the minimax optimization is presented as capturing the exact information-theoretic complexity, yet the manuscript does not explicitly verify that the resulting lower bound is strictly larger than the best previously known bound on a concrete family of instances; a short numerical comparison or analytic example would confirm it is not tautological.

Authors: We agree that an explicit comparison strengthens the presentation. Our minimax formulation is derived directly from the information-theoretic requirements of 1-identification (via a change-of-measure argument over mean vectors that must be distinguished from the null case), which differs structurally from prior lower bounds that typically fix a single hard instance or rely on pairwise KL terms without the full minimax. To make this concrete, the revised manuscript will add a short analytic example in §3: consider the two-arm instance with means (μ₀ + ε, μ₀ − ε) for small ε > 0. We show that the value of our optimization is strictly larger than the best previously published lower bound by a multiplicative factor of Ω(1/ε), confirming the improvement is not tautological. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's lower bound is obtained from an independent minimax optimization over mean vectors using KL-divergence terms between reward distributions, which does not reference the proposed algorithm or its parameters. The upper bound is derived via a separate phased elimination-plus-verification procedure whose leading term matches the optimization value up to explicit poly-log factors, with sub-Gaussian concentration applied through a stopping-time argument that holds uniformly. No equation reduces by construction to a fitted input, no load-bearing claim rests on self-citation, and the no-qualified-arm case is handled by a separate union-bound verification. The analysis therefore remains externally grounded in standard concentration inequalities and information-theoretic quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Paper relies on standard bandit assumptions for reward distributions and concentration; the novel optimization is the main new element but its validity depends on those assumptions.

axioms (1)

domain assumption Rewards are sub-Gaussian or satisfy standard concentration inequalities
Invoked implicitly for both lower and upper bound analyses in bandit pure exploration.

pith-pipeline@v0.9.0 · 5445 in / 1127 out tokens · 38580 ms · 2026-05-16T12:25:13.321790+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (J-cost uniqueness) unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 1 … Eτ ≥ Ω(minj log(1/δ)/Δj0² + H(j)/log²(m+1)) … optimization problem (2) … v ≥ Σ pj log(1/δ)/Δj0² … v ≥ Σa pj / max{Δa0²,Δaj²} (1+log(1/pj))
IndisputableMonolith/Foundation/ArrowOfTime.lean forward_accumulates / z_monotone_absolute unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Algorithm 1 … Parallel Sequential Exploration Exploitation on Brackets … round-robin execution of SEE copies … κee_b, κet_b concentration events

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Pure Exploration for a Good Policy in Reinforcement Learning with Bandit Feedback
cs.LG 2026-05 unverdicted novelty 7.0

Introduces Good Policy Identification (GPI) and BEE-GPI algorithm whose sample complexity for positive instances has log(1/δ) coefficient O(H²/(V*−μ0)²) independent of state and action space sizes.