Pure Exploration for a Good Policy in Reinforcement Learning with Bandit Feedback
Pith reviewed 2026-05-25 04:46 UTC · model grok-4.3
The pith
Identifying any policy above a reward threshold requires sample complexity independent of state and action space sizes in episodic RL.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BEE-GPI solves the fixed-confidence Good Policy Identification problem and achieves expected sample complexity upper bounds in which the coefficient of log(1/δ) for positive instances is O(H²/(V* - μ0)²), independent of state and action space sizes, with matching lower bounds confirming the necessity of the gap term and the near-optimality of the algorithm.
What carries the argument
The BEE-GPI algorithm, which performs targeted exploration to certify existence or non-existence of a policy whose value meets the given threshold μ0.
Load-bearing premise
Episodes are independent and the learner observes only the total reward at the end of each episode.
What would settle it
An episodic MDP with a fixed positive gap V* - μ0 where every algorithm requires a number of episodes that grows with the number of states or actions.
Figures
read the original abstract
Pure exploration in episodic Reinforcement Learning has primarily focused on Best Policy Identification (BPI), which seeks to identify a (near)-optimal policy with high confidence. Motivated by practical settings where a ``good enough'' policy suffices, we study an alternate objective of Good Policy Identification (GPI). For a given reward threshold $\mu_0$, GPI only requires identifying a policy with expected reward in an episode at least $\mu_0$ if such a policy exists (positive instance), or declaring None if no such policy exists (negative instance). We formalize GPI under the fixed-confidence setting. We require the output to be correct with probability $\geq 1-\delta$, and seek to minimize the expected sample complexity, which is the expected number of episodes explored for the output. We propose a novel algorithm BEE-GPI, and derive theoretically-grounded upper bounds on its sample complexity for positive and negative instances. Notably, for positive instances, the coefficient of $\log 1/\delta$ in our upper bound is $O(H^2/(V^* - \mu_0)^2)$, where $H$ is the episode length and $V^*$ is the optimal expected reward in an episode. The coefficient does not depend on the action and state space sizes otherwise, in sharp contrast to the sample complexity in BPI. We further establish lower bound results to show the near-optimality of BEE-GPI and the necessity of the $1/(V^* -\mu)^2$ term. Numerical experiments further validate the efficiency of our approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Good Policy Identification (GPI) as an alternative objective to Best Policy Identification (BPI) in episodic MDPs with bandit feedback. GPI requires identifying a policy whose expected episodic reward is at least a given threshold μ₀ (positive instance) or correctly declaring that no such policy exists (negative instance), under the fixed-confidence criterion that the output is correct with probability ≥1−δ. The authors propose the BEE-GPI algorithm, derive upper bounds on its expected sample complexity, and claim that for positive instances the leading coefficient of log(1/δ) is O(H²/(V*−μ₀)²) with no additional dependence on |S| or |A|. They also supply matching lower bounds establishing near-optimality of BEE-GPI and the necessity of the 1/(V*−μ₀)² factor, together with numerical experiments.
Significance. If the stated upper-bound independence from state-action space sizes holds, the result supplies a practically useful relaxation of pure exploration whose sample complexity scales with the gap to a user-specified threshold rather than the full optimality gap. The lower-bound results add value by confirming that the gap dependence cannot be improved. The work therefore addresses a realistic regime in which identifying any sufficiently good policy is acceptable, potentially broadening the applicability of high-confidence exploration beyond the computationally prohibitive requirements of BPI.
major comments (2)
- [Abstract and upper-bound analysis] Abstract (and the upper-bound analysis for positive instances): the claim that the coefficient of log(1/δ) is O(H²/(V*−μ₀)²) with no |S| or |A| factors is load-bearing for the asserted 'sharp contrast to BPI'. The lower bound only establishes necessity of the 1/(V*−μ₀)² term; the upper-bound proof for BEE-GPI must be checked to confirm that no policy-class covering factor (e.g., |A|^H or a covering number linear in |S|·|A|) multiplies the leading coefficient, given that each episode returns only a scalar reward.
- [Numerical experiments] The experimental section: without reported scaling plots or tables that vary |S| and |A| while holding H and (V*−μ₀) fixed, it is impossible to verify that the claimed space-size independence is observed in practice rather than masked by implementation details of the policy-selection subroutine.
minor comments (2)
- [Problem formulation] Clarify in the problem formulation whether μ₀ is treated as a known constant or must be estimated, and ensure the notation V* (optimal value) versus μ₀ (threshold) is used consistently in all theorems.
- [Related work] The related-work discussion of BPI sample complexity should cite the specific bound (including its |S|·|A| dependence) that is being contrasted.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of our results on Good Policy Identification. We address each major comment below.
read point-by-point responses
-
Referee: [Abstract and upper-bound analysis] Abstract (and the upper-bound analysis for positive instances): the claim that the coefficient of log(1/δ) is O(H²/(V*−μ₀)²) with no |S| or |A| factors is load-bearing for the asserted 'sharp contrast to BPI'. The lower bound only establishes necessity of the 1/(V*−μ₀)² term; the upper-bound proof for BEE-GPI must be checked to confirm that no policy-class covering factor (e.g., |A|^H or a covering number linear in |S|·|A|) multiplies the leading coefficient, given that each episode returns only a scalar reward.
Authors: The upper-bound analysis for positive instances (detailed in the proof of the sample-complexity bound for BEE-GPI) proceeds by maintaining a candidate policy whose value is estimated directly from scalar episode returns. The selection rule and concentration arguments are constructed so that the leading log(1/δ) term depends only on H and the gap V*−μ₀; no explicit covering number over policies or states/actions enters the coefficient. The bandit-feedback setting is accounted for by using the total episodic reward as the sole observation, and the algorithm never enumerates or covers the full policy class. We are happy to expand the proof sketch with an explicit step-by-step accounting of all terms in a revision if the referee finds the current presentation insufficiently transparent. revision: partial
-
Referee: [Numerical experiments] The experimental section: without reported scaling plots or tables that vary |S| and |A| while holding H and (V*−μ₀) fixed, it is impossible to verify that the claimed space-size independence is observed in practice rather than masked by implementation details of the policy-selection subroutine.
Authors: We agree that explicit scaling experiments would provide stronger empirical support for the claimed independence. In the revised manuscript we will add tables and plots that systematically vary |S| and |A| while keeping H and the gap V*−μ₀ fixed, reporting wall-clock episodes required by BEE-GPI. revision: yes
Circularity Check
No significant circularity; bounds derived from algorithm analysis without reduction to inputs
full rationale
The paper introduces the GPI objective and BEE-GPI algorithm, then states upper bounds on sample complexity for positive instances as O(H²/(V*−μ₀)² log(1/δ)) with explicit independence from |S| and |A| in the leading coefficient. This is presented as a consequence of the analysis under bandit feedback and fixed-confidence, not by redefining terms or fitting parameters to the target quantity. No self-citations are invoked as load-bearing for the uniqueness or form of the bound; lower bounds are separately established to show necessity of the 1/(V*−μ₀)² term. The derivation chain remains self-contained against standard episodic MDP assumptions and does not reduce any prediction to a fitted input or prior ansatz by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Episodic Markov decision process with independent episodes and bandit feedback
Reference graph
Works this paper leans on
-
[1]
Advances in Neural Information Processing Systems , volume=
Sample complexity reduction via policy difference estimation in tabular reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=
-
[2]
Advances in Neural Information Processing Systems , volume=
Instance-dependent near-optimal policy identification in linear mdps via online experiment design , author=. Advances in Neural Information Processing Systems , volume=
-
[3]
arXiv preprint arXiv:2311.05638 , year=
Towards instance-optimality in online pac reinforcement learning , author=. arXiv preprint arXiv:2311.05638 , year=
-
[4]
The Journal of Machine Learning Research , volume=
On the complexity of best-arm identification in multi-armed bandit models , author=. The Journal of Machine Learning Research , volume=. 2016 , publisher=
work page 2016
-
[5]
Advances in neural information processing systems , volume=
Near instance-optimal pac reinforcement learning for deterministic mdps , author=. Advances in neural information processing systems , volume=
-
[6]
International Conference on Machine Learning , pages=
Adaptive sampling for best policy identification in markov decision processes , author=. International Conference on Machine Learning , pages=. 2021 , organization=
work page 2021
-
[7]
Advances in Neural Information Processing Systems , volume=
Navigating to the best policy in markov decision processes , author=. Advances in Neural Information Processing Systems , volume=
-
[8]
Conference on Learning Theory , pages=
Beyond no regret: Instance-dependent pac reinforcement learning , author=. Conference on Learning Theory , pages=. 2022 , organization=
work page 2022
-
[9]
Algorithmic Learning Theory , pages=
Adaptive reward-free exploration , author=. Algorithmic Learning Theory , pages=. 2021 , organization=
work page 2021
-
[10]
Proceedings of the 37th International Conference on Machine Learning , pages =
Reward-Free Exploration for Reinforcement Learning , author =. Proceedings of the 37th International Conference on Machine Learning , pages =. 2020 , editor =
work page 2020
-
[11]
Advances in neural information processing systems , volume=
Near-optimal regret bounds for reinforcement learning , author=. Advances in neural information processing systems , volume=
-
[12]
Journal of Machine Learning Research , year =
Thomas Jaksch and Ronald Ortner and Peter Auer , title =. Journal of Machine Learning Research , year =
-
[13]
International conference on machine learning , pages=
Minimax regret bounds for reinforcement learning , author=. International conference on machine learning , pages=. 2017 , organization=
work page 2017
-
[14]
Advances in Neural Information Processing Systems , volume=
Sample complexity of episodic fixed-horizon reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=
-
[15]
International Conference on Machine Learning , pages=
Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds , author=. International Conference on Machine Learning , pages=. 2019 , organization=
work page 2019
-
[16]
Advances in Neural Information Processing Systems , volume=
Non-asymptotic gap-dependent regret bounds for tabular mdps , author=. Advances in Neural Information Processing Systems , volume=
-
[17]
Concentration Inequalities: A Nonasymptotic Theory of Independence , author=. 2013 , publisher=
work page 2013
-
[18]
Proceedings of the 42nd International Conference on Machine Learning , pages =
Near Optimal Non-asymptotic Sample Complexity of 1-Identification , author =. Proceedings of the 42nd International Conference on Machine Learning , pages =. 2025 , editor =
work page 2025
-
[19]
Advances in Neural Information Processing Systems , volume=
Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=
-
[20]
International Conference on Machine Learning , pages=
Fast active learning for pure exploration in reinforcement learning , author=. International Conference on Machine Learning , pages=. 2021 , organization=
work page 2021
-
[21]
Proceedings of the 32nd International Conference on Algorithmic Learning Theory , pages =
Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited , author =. Proceedings of the 32nd International Conference on Algorithmic Learning Theory , pages =. 2021 , editor =
work page 2021
-
[22]
International Conference on Algorithmic Learning Theory , pages=
Optimistic pac reinforcement learning: the instance-dependent view , author=. International Conference on Algorithmic Learning Theory , pages=. 2023 , organization=
work page 2023
-
[23]
Proceedings of the seventh annual conference on Computational learning theory , pages=
Efficient reinforcement learning , author=. Proceedings of the seventh annual conference on Computational learning theory , pages=
-
[24]
Advances in neural information processing systems , volume=
Finite-sample convergence rates for Q-learning and indirect algorithms , author=. Advances in neural information processing systems , volume=
-
[25]
Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model , author=. Machine learning , volume=. 2013 , publisher=
work page 2013
-
[26]
Proceedings of Thirty Third Conference on Learning Theory , pages =
Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal , author =. Proceedings of Thirty Third Conference on Learning Theory , pages =. 2020 , editor =
work page 2020
-
[27]
Advances in Neural Information Processing Systems , volume=
Near-optimal time and sample complexities for solving Markov decision processes with a generative model , author=. Advances in Neural Information Processing Systems , volume=
-
[28]
Naval Research Logistics (NRL) , volume=
Variance reduced value iteration and faster algorithms for solving Markov decision processes , author=. Naval Research Logistics (NRL) , volume=. 2023 , publisher=
work page 2023
-
[29]
Breaking the sample size barrier in model-based reinforcement learning with a generative model , author=. Operations Research , volume=. 2024 , publisher=
work page 2024
-
[30]
International Conference on Artificial Intelligence and Statistics , pages=
The true sample complexity of identifying good arms , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2020 , organization=
work page 2020
-
[31]
Advances in Neural Information Processing Systems , volume=
Pure exploration with multiple correct answers , author=. Advances in Neural Information Processing Systems , volume=
-
[32]
Good Arm Identification via Bandit Feedback , author=. Machine Learning , number=
-
[33]
Closing the Gap on the Sample Complexity of 1-Identification
Closing the Gap on the Sample Complexity of 1-Identification , author=. arXiv preprint arXiv:2601.15620 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[34]
Journal of Machine Learning Research , volume=
An anytime algorithm for good arm identification , author=. Journal of Machine Learning Research , volume=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.