pith. sign in

arxiv: 2605.23182 · v1 · pith:PXZLOZVSnew · submitted 2026-05-22 · 💻 cs.LG

Pure Exploration for a Good Policy in Reinforcement Learning with Bandit Feedback

Pith reviewed 2026-05-25 04:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords reinforcement learningpure explorationgood policy identificationbandit feedbackepisodic MDPsample complexityfixed confidence
0
0 comments X

The pith

Identifying any policy above a reward threshold requires sample complexity independent of state and action space sizes in episodic RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Good Policy Identification as a relaxed goal compared to Best Policy Identification in episodic reinforcement learning with bandit feedback. GPI requires finding a policy whose expected episode reward meets or exceeds a threshold μ0 if one exists, or correctly declaring that none does. The authors develop the BEE-GPI algorithm and show upper bounds where, for positive instances, the leading coefficient of log(1/δ) scales as O(H²/(V* - μ0)²) without additional dependence on the number of states or actions. Lower bounds establish that the gap dependence is necessary and that the algorithm is near-optimal for both positive and negative instances.

Core claim

BEE-GPI solves the fixed-confidence Good Policy Identification problem and achieves expected sample complexity upper bounds in which the coefficient of log(1/δ) for positive instances is O(H²/(V* - μ0)²), independent of state and action space sizes, with matching lower bounds confirming the necessity of the gap term and the near-optimality of the algorithm.

What carries the argument

The BEE-GPI algorithm, which performs targeted exploration to certify existence or non-existence of a policy whose value meets the given threshold μ0.

Load-bearing premise

Episodes are independent and the learner observes only the total reward at the end of each episode.

What would settle it

An episodic MDP with a fixed positive gap V* - μ0 where every algorithm requires a number of episodes that grows with the number of states or actions.

Figures

Figures reproduced from arXiv: 2605.23182 by Wang Chi Cheung, Zitian Li.

Figure 1
Figure 1. Figure 1: Numerical comparison between BEE-GPI and BPI-UCR [PITH_FULL_IMAGE:figures/full_fig_p009_1.png] view at source ↗
read the original abstract

Pure exploration in episodic Reinforcement Learning has primarily focused on Best Policy Identification (BPI), which seeks to identify a (near)-optimal policy with high confidence. Motivated by practical settings where a ``good enough'' policy suffices, we study an alternate objective of Good Policy Identification (GPI). For a given reward threshold $\mu_0$, GPI only requires identifying a policy with expected reward in an episode at least $\mu_0$ if such a policy exists (positive instance), or declaring None if no such policy exists (negative instance). We formalize GPI under the fixed-confidence setting. We require the output to be correct with probability $\geq 1-\delta$, and seek to minimize the expected sample complexity, which is the expected number of episodes explored for the output. We propose a novel algorithm BEE-GPI, and derive theoretically-grounded upper bounds on its sample complexity for positive and negative instances. Notably, for positive instances, the coefficient of $\log 1/\delta$ in our upper bound is $O(H^2/(V^* - \mu_0)^2)$, where $H$ is the episode length and $V^*$ is the optimal expected reward in an episode. The coefficient does not depend on the action and state space sizes otherwise, in sharp contrast to the sample complexity in BPI. We further establish lower bound results to show the near-optimality of BEE-GPI and the necessity of the $1/(V^* -\mu)^2$ term. Numerical experiments further validate the efficiency of our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Good Policy Identification (GPI) as an alternative objective to Best Policy Identification (BPI) in episodic MDPs with bandit feedback. GPI requires identifying a policy whose expected episodic reward is at least a given threshold μ₀ (positive instance) or correctly declaring that no such policy exists (negative instance), under the fixed-confidence criterion that the output is correct with probability ≥1−δ. The authors propose the BEE-GPI algorithm, derive upper bounds on its expected sample complexity, and claim that for positive instances the leading coefficient of log(1/δ) is O(H²/(V*−μ₀)²) with no additional dependence on |S| or |A|. They also supply matching lower bounds establishing near-optimality of BEE-GPI and the necessity of the 1/(V*−μ₀)² factor, together with numerical experiments.

Significance. If the stated upper-bound independence from state-action space sizes holds, the result supplies a practically useful relaxation of pure exploration whose sample complexity scales with the gap to a user-specified threshold rather than the full optimality gap. The lower-bound results add value by confirming that the gap dependence cannot be improved. The work therefore addresses a realistic regime in which identifying any sufficiently good policy is acceptable, potentially broadening the applicability of high-confidence exploration beyond the computationally prohibitive requirements of BPI.

major comments (2)
  1. [Abstract and upper-bound analysis] Abstract (and the upper-bound analysis for positive instances): the claim that the coefficient of log(1/δ) is O(H²/(V*−μ₀)²) with no |S| or |A| factors is load-bearing for the asserted 'sharp contrast to BPI'. The lower bound only establishes necessity of the 1/(V*−μ₀)² term; the upper-bound proof for BEE-GPI must be checked to confirm that no policy-class covering factor (e.g., |A|^H or a covering number linear in |S|·|A|) multiplies the leading coefficient, given that each episode returns only a scalar reward.
  2. [Numerical experiments] The experimental section: without reported scaling plots or tables that vary |S| and |A| while holding H and (V*−μ₀) fixed, it is impossible to verify that the claimed space-size independence is observed in practice rather than masked by implementation details of the policy-selection subroutine.
minor comments (2)
  1. [Problem formulation] Clarify in the problem formulation whether μ₀ is treated as a known constant or must be estimated, and ensure the notation V* (optimal value) versus μ₀ (threshold) is used consistently in all theorems.
  2. [Related work] The related-work discussion of BPI sample complexity should cite the specific bound (including its |S|·|A| dependence) that is being contrasted.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our results on Good Policy Identification. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract and upper-bound analysis] Abstract (and the upper-bound analysis for positive instances): the claim that the coefficient of log(1/δ) is O(H²/(V*−μ₀)²) with no |S| or |A| factors is load-bearing for the asserted 'sharp contrast to BPI'. The lower bound only establishes necessity of the 1/(V*−μ₀)² term; the upper-bound proof for BEE-GPI must be checked to confirm that no policy-class covering factor (e.g., |A|^H or a covering number linear in |S|·|A|) multiplies the leading coefficient, given that each episode returns only a scalar reward.

    Authors: The upper-bound analysis for positive instances (detailed in the proof of the sample-complexity bound for BEE-GPI) proceeds by maintaining a candidate policy whose value is estimated directly from scalar episode returns. The selection rule and concentration arguments are constructed so that the leading log(1/δ) term depends only on H and the gap V*−μ₀; no explicit covering number over policies or states/actions enters the coefficient. The bandit-feedback setting is accounted for by using the total episodic reward as the sole observation, and the algorithm never enumerates or covers the full policy class. We are happy to expand the proof sketch with an explicit step-by-step accounting of all terms in a revision if the referee finds the current presentation insufficiently transparent. revision: partial

  2. Referee: [Numerical experiments] The experimental section: without reported scaling plots or tables that vary |S| and |A| while holding H and (V*−μ₀) fixed, it is impossible to verify that the claimed space-size independence is observed in practice rather than masked by implementation details of the policy-selection subroutine.

    Authors: We agree that explicit scaling experiments would provide stronger empirical support for the claimed independence. In the revised manuscript we will add tables and plots that systematically vary |S| and |A| while keeping H and the gap V*−μ₀ fixed, reporting wall-clock episodes required by BEE-GPI. revision: yes

Circularity Check

0 steps flagged

No significant circularity; bounds derived from algorithm analysis without reduction to inputs

full rationale

The paper introduces the GPI objective and BEE-GPI algorithm, then states upper bounds on sample complexity for positive instances as O(H²/(V*−μ₀)² log(1/δ)) with explicit independence from |S| and |A| in the leading coefficient. This is presented as a consequence of the analysis under bandit feedback and fixed-confidence, not by redefining terms or fitting parameters to the target quantity. No self-citations are invoked as load-bearing for the uniqueness or form of the bound; lower bounds are separately established to show necessity of the 1/(V*−μ₀)² term. The derivation chain remains self-contained against standard episodic MDP assumptions and does not reduce any prediction to a fitted input or prior ansatz by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the standard episodic MDP model with bandit feedback and the fixed-confidence pure-exploration framework; no free parameters, invented entities, or non-standard axioms are introduced in the abstract.

axioms (1)
  • domain assumption Episodic Markov decision process with independent episodes and bandit feedback
    The problem setting and sample complexity analysis are defined within this standard RL model.

pith-pipeline@v0.9.0 · 5810 in / 1337 out tokens · 36760 ms · 2026-05-25T04:46:09.428439+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 1 internal anchor

  1. [1]

    Advances in Neural Information Processing Systems , volume=

    Sample complexity reduction via policy difference estimation in tabular reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

  2. [2]

    Advances in Neural Information Processing Systems , volume=

    Instance-dependent near-optimal policy identification in linear mdps via online experiment design , author=. Advances in Neural Information Processing Systems , volume=

  3. [3]

    arXiv preprint arXiv:2311.05638 , year=

    Towards instance-optimality in online pac reinforcement learning , author=. arXiv preprint arXiv:2311.05638 , year=

  4. [4]

    The Journal of Machine Learning Research , volume=

    On the complexity of best-arm identification in multi-armed bandit models , author=. The Journal of Machine Learning Research , volume=. 2016 , publisher=

  5. [5]

    Advances in neural information processing systems , volume=

    Near instance-optimal pac reinforcement learning for deterministic mdps , author=. Advances in neural information processing systems , volume=

  6. [6]

    International Conference on Machine Learning , pages=

    Adaptive sampling for best policy identification in markov decision processes , author=. International Conference on Machine Learning , pages=. 2021 , organization=

  7. [7]

    Advances in Neural Information Processing Systems , volume=

    Navigating to the best policy in markov decision processes , author=. Advances in Neural Information Processing Systems , volume=

  8. [8]

    Conference on Learning Theory , pages=

    Beyond no regret: Instance-dependent pac reinforcement learning , author=. Conference on Learning Theory , pages=. 2022 , organization=

  9. [9]

    Algorithmic Learning Theory , pages=

    Adaptive reward-free exploration , author=. Algorithmic Learning Theory , pages=. 2021 , organization=

  10. [10]

    Proceedings of the 37th International Conference on Machine Learning , pages =

    Reward-Free Exploration for Reinforcement Learning , author =. Proceedings of the 37th International Conference on Machine Learning , pages =. 2020 , editor =

  11. [11]

    Advances in neural information processing systems , volume=

    Near-optimal regret bounds for reinforcement learning , author=. Advances in neural information processing systems , volume=

  12. [12]

    Journal of Machine Learning Research , year =

    Thomas Jaksch and Ronald Ortner and Peter Auer , title =. Journal of Machine Learning Research , year =

  13. [13]

    International conference on machine learning , pages=

    Minimax regret bounds for reinforcement learning , author=. International conference on machine learning , pages=. 2017 , organization=

  14. [14]

    Advances in Neural Information Processing Systems , volume=

    Sample complexity of episodic fixed-horizon reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

  15. [15]

    International Conference on Machine Learning , pages=

    Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds , author=. International Conference on Machine Learning , pages=. 2019 , organization=

  16. [16]

    Advances in Neural Information Processing Systems , volume=

    Non-asymptotic gap-dependent regret bounds for tabular mdps , author=. Advances in Neural Information Processing Systems , volume=

  17. [17]

    2013 , publisher=

    Concentration Inequalities: A Nonasymptotic Theory of Independence , author=. 2013 , publisher=

  18. [18]

    Proceedings of the 42nd International Conference on Machine Learning , pages =

    Near Optimal Non-asymptotic Sample Complexity of 1-Identification , author =. Proceedings of the 42nd International Conference on Machine Learning , pages =. 2025 , editor =

  19. [19]

    Advances in Neural Information Processing Systems , volume=

    Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

  20. [20]

    International Conference on Machine Learning , pages=

    Fast active learning for pure exploration in reinforcement learning , author=. International Conference on Machine Learning , pages=. 2021 , organization=

  21. [21]

    Proceedings of the 32nd International Conference on Algorithmic Learning Theory , pages =

    Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited , author =. Proceedings of the 32nd International Conference on Algorithmic Learning Theory , pages =. 2021 , editor =

  22. [22]

    International Conference on Algorithmic Learning Theory , pages=

    Optimistic pac reinforcement learning: the instance-dependent view , author=. International Conference on Algorithmic Learning Theory , pages=. 2023 , organization=

  23. [23]

    Proceedings of the seventh annual conference on Computational learning theory , pages=

    Efficient reinforcement learning , author=. Proceedings of the seventh annual conference on Computational learning theory , pages=

  24. [24]

    Advances in neural information processing systems , volume=

    Finite-sample convergence rates for Q-learning and indirect algorithms , author=. Advances in neural information processing systems , volume=

  25. [25]

    Machine learning , volume=

    Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model , author=. Machine learning , volume=. 2013 , publisher=

  26. [26]

    Proceedings of Thirty Third Conference on Learning Theory , pages =

    Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal , author =. Proceedings of Thirty Third Conference on Learning Theory , pages =. 2020 , editor =

  27. [27]

    Advances in Neural Information Processing Systems , volume=

    Near-optimal time and sample complexities for solving Markov decision processes with a generative model , author=. Advances in Neural Information Processing Systems , volume=

  28. [28]

    Naval Research Logistics (NRL) , volume=

    Variance reduced value iteration and faster algorithms for solving Markov decision processes , author=. Naval Research Logistics (NRL) , volume=. 2023 , publisher=

  29. [29]

    Operations Research , volume=

    Breaking the sample size barrier in model-based reinforcement learning with a generative model , author=. Operations Research , volume=. 2024 , publisher=

  30. [30]

    International Conference on Artificial Intelligence and Statistics , pages=

    The true sample complexity of identifying good arms , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2020 , organization=

  31. [31]

    Advances in Neural Information Processing Systems , volume=

    Pure exploration with multiple correct answers , author=. Advances in Neural Information Processing Systems , volume=

  32. [32]

    Machine Learning , number=

    Good Arm Identification via Bandit Feedback , author=. Machine Learning , number=

  33. [33]

    Closing the Gap on the Sample Complexity of 1-Identification

    Closing the Gap on the Sample Complexity of 1-Identification , author=. arXiv preprint arXiv:2601.15620 , year=

  34. [34]

    Journal of Machine Learning Research , volume=

    An anytime algorithm for good arm identification , author=. Journal of Machine Learning Research , volume=