Introduces Good Policy Identification (GPI) and BEE-GPI algorithm whose sample complexity for positive instances has log(1/δ) coefficient O(H²/(V*−μ0)²) independent of state and action space sizes.
Closing the Gap on the Sample Complexity of 1-Identification
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
The 1-identification problem is a fundamental pure-exploration problem in multi-armed bandits. An agent aims to determine whether there exists an arm whose mean reward exceeds a known threshold $\mu_0$, or to output \textsf{None} otherwise. The agent must guarantee correctness with probability at least $1-\delta$, while minimizing the expected number of arm pulls $\mathbb{E}[\tau]$. We study the 1-identification problem and make two main contributions. First, for instances with at least one qualified arm, we derive a new lower bound on $\mathbb{E}[\tau]$ via a novel optimization formulation. Second, we propose a new algorithm and establish upper bounds that match the lower bounds up to polynomial logarithmic factors uniformly over all instances. Our result complements the analysis of $\mathbb{E}\tau$ when there are multiple qualified arms, which is an open problem in the literature.
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Pure Exploration for a Good Policy in Reinforcement Learning with Bandit Feedback
Introduces Good Policy Identification (GPI) and BEE-GPI algorithm whose sample complexity for positive instances has log(1/δ) coefficient O(H²/(V*−μ0)²) independent of state and action space sizes.