pith. sign in

arxiv: 2506.20904 · v2 · submitted 2025-06-26 · 💻 cs.LG · cs.IT· math.IT· math.OC· stat.ML

Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL

Pith reviewed 2026-05-19 08:18 UTC · model grok-4.3

classification 💻 cs.LG cs.ITmath.ITmath.OCstat.ML
keywords offline reinforcement learningaverage-reward MDPssingle-policy coveragesample complexityhitting radiusweakly communicating MDPspessimistic value iteration
0
0 comments X

The pith

Offline average-reward RL achieves its first fully single-policy sample complexity bound using only the target policy's bias span and hitting radius.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes sharp performance guarantees for offline reinforcement learning in average-reward Markov decision processes that depend solely on properties of the target policy. Previous approaches required additional complexity measures uniform over all policies, such as mixing times, but this work introduces a novel policy hitting radius to capture the necessary transient coverage. By developing an algorithm based on pessimistic discounted value iteration with quantile clipping, it obtains sample complexity bounds that are the first to be fully single-policy. This also allows handling general weakly communicating MDPs, which prior work restricted with structural assumptions. The authors further demonstrate through hard examples that coverage beyond the stationary distribution is necessary, and provide nearly matching lower bounds.

Core claim

In average-reward MDPs, an algorithm using pessimistic value iteration with quantile clipping achieves near-optimal sample complexity that depends only on the target policy's bias span and its hitting radius, under a coverage condition that the hitting radius is finite, and this extends to general weakly communicating MDPs with matching lower bounds.

What carries the argument

The policy hitting radius, a transient quantity measuring the time to cover states under the target policy, which defines a single-policy coverage condition stronger than stationary distribution coverage.

If this is right

  • Sample complexity scales only with the bias span and hitting radius of the target policy rather than uniform measures over all policies.
  • The algorithm applies to general weakly communicating MDPs without restrictive structural assumptions from prior work.
  • Coverage assumptions must address transient behavior beyond the stationary distribution of the target policy.
  • Implementation requires no prior knowledge of problem parameters such as the span or radius.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Data collection strategies in practice should prioritize states that the target policy reaches quickly rather than only its long-run distribution.
  • Similar transient coverage measures could extend to other offline RL settings where stationary assumptions fall short.
  • The necessity of the hitting radius suggests testing whether performance degrades predictably as this radius grows in controlled experiments.

Load-bearing premise

The data distribution provides coverage such that the target policy's hitting radius is finite.

What would settle it

Construct an MDP and offline dataset where the hitting radius is infinite despite good stationary coverage of the target policy, and check whether the algorithm fails to achieve the claimed sample complexity.

Figures

Figures reproduced from arXiv: 2506.20904 by Guy Zamir, Matthew Zurek, Yudong Chen.

Figure 1
Figure 1. Figure 1: An MDP P parameterized by m, T, and an empirical MDP Pb which has constant probability of being sampled from P. Each solid arrow indicates an action and is annotated with its reward. Arrows which split into multiple dashed arrows indicate possible stochastic transitions, and each dashed arrow is annotated with the associated probabilities. For this sample size function n, with constant probability we will … view at source ↗
Figure 2
Figure 2. Figure 2: Diagram of the MDP (P(0,0), r). Arrows splitting into multiple dashed arrows indicate stochastic transitions, and each dashed arrow is annotated with the associated probability. Blue arrows represent action 0 and red arrows represent action 1. In state 1, the red arrow also represents actions 2, . . . , A − 1 (which are all identical). The reward function does not depend on the action, and is +1 in state 0… view at source ↗
Figure 3
Figure 3. Figure 3: Diagram of the MDP (P(0,...,0), r) only including actions 0 and 1. Arrows splitting into multiple dashed arrows indicate stochastic transitions, and each dashed arrow is annotated with the associated prob￾ability. Blue arrows represent action 0 and red arrows represent action 1. The reward is 0 at state 0 and the reward is 1 at all other states. In general, the MDP (Pθ, r) is similar, except in each state … view at source ↗
read the original abstract

We study offline reinforcement learning in average-reward MDPs, which presents increased challenges from the perspectives of distribution shift and non-uniform coverage, and has been relatively underexamined from a theoretical perspective. While previous work obtains performance guarantees under single-policy data coverage assumptions, such guarantees utilize additional complexity measures which are uniform over all policies, such as the uniform mixing time. We develop sharp guarantees depending only on the target policy, specifically the bias span and a novel policy hitting radius, yielding the first fully single-policy sample complexity bound for average-reward offline RL. We are also the first to handle general weakly communicating MDPs, contrasting restrictive structural assumptions made in prior work. To achieve this, we introduce an algorithm based on pessimistic discounted value iteration enhanced by a novel quantile clipping technique, which enables the use of a sharper empirical-span-based penalty function. Our algorithm also does not require any prior parameter knowledge for its implementation. Remarkably, we show via hard examples that learning under our conditions requires coverage assumptions beyond the stationary distribution of the target policy, distinguishing single-policy complexity measures from previously examined cases. We also develop lower bounds nearly matching our main result.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper studies offline RL for average-reward MDPs and introduces an algorithm based on pessimistic discounted value iteration augmented by quantile clipping. It claims the first fully single-policy sample-complexity bounds that depend only on the target policy's bias span and a novel hitting radius, applicable to general weakly communicating MDPs, together with nearly matching lower bounds. The work also shows via hard instances that stationary-distribution coverage is insufficient.

Significance. If the concentration arguments for the empirical span and the hitting-radius coverage hold, the result would constitute a meaningful advance: it removes the need for uniform mixing times or other policy-uniform complexity measures that appeared in prior average-reward offline RL analyses and relaxes the structural assumptions required by earlier work on weakly communicating MDPs. The explicit demonstration that stationary coverage alone is not enough is also a useful clarification for the field.

major comments (3)
  1. [§4.3] §4.3 (Definition of hitting radius and its use in Lemma 4.4): the deviation bound for the empirical bias span is stated to scale with the hitting radius R_π; however, the proof sketch invokes a covering argument over transient trajectories whose length is controlled by R_π, yet the dependence on the data distribution's support outside the recurrent class is not made fully explicit. This step is load-bearing for the single-policy claim.
  2. [§5.2] §5.2, Theorem 5.3 (upper bound): the quantile-clipping threshold is chosen to calibrate the pessimistic penalty; the analysis claims this choice eliminates hidden dependence on other policies' mixing behavior, but the argument relies on the finiteness of R_π under the stated coverage condition without an independent verification that the clipping step preserves the single-policy property for all weakly communicating MDPs.
  3. [§6] §6 (lower-bound construction): the hard instance demonstrates that stationary coverage is insufficient, yet it is not shown whether the constructed MDP remains weakly communicating when the hitting radius is forced to be large; if the instance inadvertently strengthens the communication assumption, the separation from prior work would be weaker than claimed.
minor comments (3)
  1. [Eq. (17)] Notation for the empirical span estimator in Eq. (17) is introduced without an explicit comparison to the population span; adding a short remark would improve readability.
  2. [Assumption 3.2] The statement of Assumption 3.2 on the data distribution could usefully include a short sentence clarifying that the hitting radius is with respect to the target policy only.
  3. [Appendix] A few typographical inconsistencies appear in the appendix (e.g., subscript π vs. π* in the concentration lemmas); these do not affect the argument but should be corrected.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. We are glad that the significance of achieving the first fully single-policy sample-complexity bounds for average-reward offline RL in general weakly communicating MDPs is recognized. We address each major comment below, indicating the revisions we will make to improve clarity and rigor.

read point-by-point responses
  1. Referee: [§4.3] §4.3 (Definition of hitting radius and its use in Lemma 4.4): the deviation bound for the empirical bias span is stated to scale with the hitting radius R_π; however, the proof sketch invokes a covering argument over transient trajectories whose length is controlled by R_π, yet the dependence on the data distribution's support outside the recurrent class is not made fully explicit. This step is load-bearing for the single-policy claim.

    Authors: We agree that the dependence should be stated more explicitly. The covering argument in the appendix bounds the number of transient trajectories by the cardinality of the support of the empirical measure induced by the data distribution; under the single-policy coverage assumption this support is contained in the states reachable from the initial distribution within R_π steps under the target policy. We will add a clarifying sentence to the main-text discussion of Lemma 4.4 and expand the appendix proof to explicitly track the support restriction, confirming that no coverage outside the data distribution is used. revision: yes

  2. Referee: [§5.2] §5.2, Theorem 5.3 (upper bound): the quantile-clipping threshold is chosen to calibrate the pessimistic penalty; the analysis claims this choice eliminates hidden dependence on other policies' mixing behavior, but the argument relies on the finiteness of R_π under the stated coverage condition without an independent verification that the clipping step preserves the single-policy property for all weakly communicating MDPs.

    Authors: The quantile threshold is computed directly from the empirical distribution of observed returns under the given data; because the coverage condition guarantees that all states visited by the target policy within its hitting radius R_π receive sufficient samples, the clipping operation remains a function only of the target policy's transient behavior. We will insert a new supporting lemma in the appendix that formally verifies preservation of the single-policy property after clipping, for any weakly communicating MDP satisfying the coverage assumption. revision: partial

  3. Referee: [§6] §6 (lower-bound construction): the hard instance demonstrates that stationary coverage is insufficient, yet it is not shown whether the constructed MDP remains weakly communicating when the hitting radius is forced to be large; if the instance inadvertently strengthens the communication assumption, the separation from prior work would be weaker than claimed.

    Authors: We thank the referee for this important point. The hard instance is a weakly communicating MDP with one recurrent class and a transient chain whose transition probabilities are tuned so that the target policy has large hitting radius while a different policy reaches the recurrent class in finite expected time. We will add an explicit paragraph in Section 6 verifying that the MDP satisfies the definition of weak communication (existence of a policy that reaches the recurrent class from every state with positive probability) while the hitting radius for the target policy remains arbitrarily large. revision: yes

Circularity Check

0 steps flagged

Derivation is self-contained via first-principles concentration on novel hitting radius

full rationale

The paper introduces the policy hitting radius as a transient coverage measure strictly stronger than stationary distribution coverage, then applies standard concentration inequalities to bound empirical span deviation and calibrate the pessimistic penalty under quantile clipping. The main sample-complexity upper bound and matching lower bound are obtained directly from these quantities without any fitted parameter being relabeled as a prediction, without self-citation chains that carry the central claim, and without any ansatz or uniqueness theorem imported from prior work by the same authors. Hard-instance constructions are provided to show necessity of the radius, confirming the argument does not collapse to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The result rests on standard MDP axioms plus the new hitting-radius coverage assumption and the weakly-communicating structure; no free parameters are fitted to data because the algorithm is parameter-free.

axioms (2)
  • domain assumption The MDP is weakly communicating (finite bias span for every policy).
    Invoked to ensure the average-reward value function is well-defined and to contrast with prior restrictive structural assumptions.
  • domain assumption The offline dataset satisfies a coverage condition quantified by the target policy's hitting radius.
    This is the load-bearing single-policy coverage assumption used to control transient deviation terms.
invented entities (1)
  • Policy hitting radius independent evidence
    purpose: Quantifies the transient coverage needed beyond the stationary distribution to reach states that affect the average reward.
    New complexity measure introduced to obtain single-policy bounds; independent evidence would be a matching lower bound construction showing it is necessary.

pith-pipeline@v0.9.0 · 5745 in / 1522 out tokens · 34270 ms · 2026-05-19T08:18:14.326943+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

  1. [1]

    (b) Constant shift: For anyc ∈ R, bT π pe(Q + c1) = bT π pe(Q) + γc1

    For any fixed stationary deterministic policyπ, the analogous statements hold forbT π pe: (a) Monotonicity: If Q ≥ Q′ then bT π pe(Q) ≥ bT π pe(Q′). (b) Constant shift: For anyc ∈ R, bT π pe(Q + c1) = bT π pe(Q) + γc1. (c) γ-contractivity: bT π pe is a γ-contraction and has a unique fixed pointbQπ pe. (d) Boundedness: 0 ≤ bQπ pe ≤ 1 1−γ 1

  2. [2]

    For any fixed stationary deterministic policyπ, we havebQ⋆ pe ≥ bQπ pe. Proof. We note that a few steps are similar to Li et al. [2023, Lemma 1], but our new choice of penalty requires much more involved analysis. We define an auxiliary operatorT pe : RS → RSA by, for anyV ∈ RS, T pe(V )(s, a) := r(s, a) + γ max n bPsaTβ(s,a)(bPsa, V ) − b(s, a, V ), min ...

  3. [3]

    Again note thatbQ = bQK. By the definition ofK = log( 2ntot 1−γ ) 1−γ , as well as the fact thatlog(1/γ) ≥ 1 − γ for any γ, we have γK = eK log(γ) ≤ e log( 2ntot 1−γ ) 1−γ log(γ) = e log( 1−γ 2ntot ) 1−γ log(1/γ) ≤ elog( 1−γ 2ntot ) = 1 − γ 2ntot . Using this bound,γ-contractivity, and the fact that0 ≤ bQ⋆ pe ≤ 1 1−γ 1 from Lemma B.1, we have bQK − bQ⋆ pe...

  4. [4]

    , Sn(s,a) sa drawn from P (· | s, a)

    For all u ∈ U, Xu is independent of all samplesS1 sa, . . . , Sn(s,a) sa drawn from P (· | s, a)

  5. [5]

    Also assume n(s, a) ≥ 2

    Almost surely there exists someu⋆ ∈ U such that ∥X − Xu⋆ ∥∞ ≤ 1 ntot . Also assume n(s, a) ≥ 2. Then with probability at least1 − 6δ′, we have that (bPsa − Psa)X ≤ ∥X∥span s log |U |/δ′ 2n(s, a) + 2 ntot 1 + s log |U |/δ′ 2n(s, a) ! (11) (bPsa − Psa)X ≤ s 2VPsa [X] log(|U |/δ′) n(s, a) + ∥X∥span log(|U |/δ′) 3n(s, a) + 1 ntot 2 + s 2 log(|U |/δ′) n(s, a) ...

  6. [6]

    , Sn(s,a) sa , and 2) almost surely there exists someu ∈ U2 sa such that Tβ(s,a)(bPsa,bV ⋆ pe) − X2 u ∞ ≤ 1 ntot

    (LOO constructions forTβ(s,a)(bPsa,bV ⋆ pe)) For eachs, a, there exists a setU2 sa ⊆ R with |U2 sa| ≤ S ntot 1−γ and random vectors(X2 u)u∈U 2sa such that 1) for allu ∈ U2 sa, X2 u is independent fromS1 sa, . . . , Sn(s,a) sa , and 2) almost surely there exists someu ∈ U2 sa such that Tβ(s,a)(bPsa,bV ⋆ pe) − X2 u ∞ ≤ 1 ntot

  7. [7]

    For eachs, a, there exists a setU3 sa ⊆ R with |U3 sa| ≤ ntot 1−γ and random vectors(X3 u)u∈U 3sa such that 1) for allu ∈ U3 sa, X3 u is independent fromS1 sa,

    (LOO construction forbV π pe) Fix any policyπ. For eachs, a, there exists a setU3 sa ⊆ R with |U3 sa| ≤ ntot 1−γ and random vectors(X3 u)u∈U 3sa such that 1) for allu ∈ U3 sa, X3 u is independent fromS1 sa, . . . , Sn(s,a) sa , and 2) almost surely there exists someu ∈ U3 sa such that bV π pe − X3 u ∞ ≤ 1 ntot

  8. [8]

    For each s, a, there exists a setU4 sa ⊆ R with |U4 sa| ≤ S ntot 1−γ and random vectors(X4 u)u∈U 4sa such that 1) for allu ∈ U4 sa, X4 u is independent from S1 sa,

    (LOO constructions forTβ(s,a)(bPsa,bV π pe)) Fix any policyπ. For each s, a, there exists a setU4 sa ⊆ R with |U4 sa| ≤ S ntot 1−γ and random vectors(X4 u)u∈U 4sa such that 1) for allu ∈ U4 sa, X4 u is independent from S1 sa, . . . , Sn(s,a) sa , and 2) almost surely there exists someu ∈ U4 sa such that Tβ(s,a)(bPsa,bV π pe) − X4 u ∞ ≤ 1 ntot . Proof. We ...

  9. [9]

    equal to r(s, a) + γbPsaTβ(s,a)(bPsa, MbQ) − γb(s, a, MbQ) or 2) equal tor(s, a) + γ mins′(MbQ)(s′). In the simpler case 2, we thus have that bTpe(bQ)(s, a) = r(s, a) + γ min s′ (MbQ)(s′) ≤ r(s, a) + γPsaMbQ = r(s, a) + γPsaMbπbQ = Tbπ(bQ)(s, a) using the facts that mins′ V (s′) ≤ PsaV for any V ∈ RS (since Psa is a probability distribution) and that MbQ ...

  10. [10]

    Combining the two cases we have shown that Tbπ(bQ) ≥ bQ as desired

    = MbQ+ 1 2ntot 1), and the final equality is from the definition ofbπ (from Algorithm 1) since it is greedy with respect tobQ. Combining the two cases we have shown that Tbπ(bQ) ≥ bQ as desired. Combining the two cases we have shown thatTbπ(bQ) ≥ bQ as desired. B.5 Policy hitting radius lemmas In this subsection we establish some key properties regarding ...

  11. [11]

    (ii) is due to (46), and(iii) uses V π⋆ − 1 1−γ ρπ⋆ ∞ ≤ V π⋆ span due to Zurek and Chen [2025a, Lemma 6]

    (47) 41 Furthermore we have bV ⋆ pe (i) ≥ bV π⋆ pe (ii) ≥ V π⋆ − ε1 (iii) ≥ 1 1 − γ ρπ⋆ − V π⋆ span 1 − ε1 (48) where (i) is due to Lemma B.1 which givesbQ⋆ pe ≥ bQπ⋆ pe, which impliesbV ⋆ pe = MbQ⋆ pe ≥ M π⋆ bQ⋆ pe ≥ M π⋆ bQπ⋆ pe using monotonicity ofM π⋆ . (ii) is due to (46), and(iii) uses V π⋆ − 1 1−γ ρπ⋆ ∞ ≤ V π⋆ span due to Zurek and Chen [2025a, Le...

  12. [12]

    Furthermore, using Zurek and Chen [2025a, Lemma 26] we have (sinceρπ⋆ is constant) that V π⋆ span ≤ 2 hπ⋆ span

    Thus ρbπ ≥ (1 − γ) min s∈S Vbπ(s)1 (i) ≥ (1 − γ) min s∈S bV ⋆ pe(s)1 − 1 − γ 2ntot 1 (ii) ≥ min s∈S ρπ⋆ (s)1 − (1 − γ) V π⋆ span 1 − (1 − γ)ε1 − 1 − γ 2ntot 1 (iii) ≥ ρπ⋆ − (1 − γ) V π⋆ span 1 − 1 − γ 2ntot 1 − vuut6144S ∥V π⋆ ∥span + 1 log 6S2Antot (1−γ)δ m 1 − 1920S V π⋆ span + 1 log 6S2Antot (1−γ)δ m 1 − 12 ntot 1 (iv) ≥ ρπ⋆ − vuut6144S ∥V π⋆ ∥span + 1...

  13. [13]

    Note that action 2 in state 0 is added to keep the diameter bounded byT, and actions3,

    In words, our choice ofA is one that is sufficiently large so that randomly guessing the optimal actionb in state 1 will not yield a good policy. Note that action 2 in state 0 is added to keep the diameter bounded byT, and actions3, . . . , A− 1 in state 0 simply keep the action space independent of the state, consistent with our upper bounds. Since actio...

  14. [14]

    We have shown that when the dataset does not contain any useful transitions, there must be at least one MDP where the algorithm is likely to make a poor guess

    Indeed, if this were not the case, we would have E "X a∈A ˆπ(a|1) B # = X a∈A E ˆπ(a|1) B ≥ X a∈A E ˆπ(a|1) Fa ∩ B P(Fa | B) > X a∈A 4 A · 1 4 = 1, which is a contradiction because we always haveP a∈A ˆπ(a|1) = 1. We have shown that when the dataset does not contain any useful transitions, there must be at least one MDP where the algorithm is likely to ma...

  15. [15]

    In particular, under the the eventE c i′ ∩ F c a′ we have ρˆπ (i′,a′) < 1

  16. [16]

    In summary, we have shown that max θ∈Θ Pθ,n ρ∗ θ − ρA (D) θ ≥ 1 2 ≥ δ, as desired

    Subsequently, forθ′ = (i′, a′), we have Pθ′,n ρˆπ θ′ < 1 2 ≥ Pθ′,n(E c i′ ∩ F c a′) ≥ Pθ′,n(E c i′ ∩ F c a′ ∩ B) = P(E c i′ ∩ F c a′|B)Pθ′(B) ≥ 1 4 · 4δ = δ, where the final inequality follows from Lemma C.1. In summary, we have shown that max θ∈Θ Pθ,n ρ∗ θ − ρA (D) θ ≥ 1 2 ≥ δ, as desired. C.1 Auxiliary lemmas Lemma C.1. For all θ ∈ Θ, we havePθ,n(B) ≥ 4...

  17. [17]

    For any δ ∈ 0, 1 e , we have 1 − 1 x ⌈ x 2 log( 1 δ)⌉ ≥ δ

  18. [18]

    For any δ ∈ 0, 1 e3 , we have 1 − 1 x ⌈ x 6 log( 1 δ)⌉ ≥ δ1/3

  19. [19]

    46 Proof

    For any δ ∈ 0, 1 e9 , we have 1 − 1 3x ⌈ x 6 log( 1 δ)⌉ ≥ 4δ1/3. 46 Proof. We will prove claim 1 by showing that 1 − 1 x x 2 log( 1 δ)+1 ≥ δ. For anyx ≥ 4 and δ ∈ 0, 1 e , we have log 1 − 1 x x 2 log( 1 δ)+1! = x 2 log 1 δ + 1 log 1 − 1 x ≥ x 2 log 1 δ + 1 − 1 x − 1 x2 = 1 2 + 1 2x log δ − 1 x − 1 x2 ≥ 5 8 log δ − 5 16 = 5 8 log δ + 5 16 log 1 e ≥ 5 8 + 5...

  20. [20]

    The set of states isS = {0, 1,

    Let p = 1−ε D and q = 1 D. The set of states isS = {0, 1, . . . , S′} and the set of actions isA = {0, 1, . . . , S′}. The reward function r : S × A → [0, 1] is defined to be 1 whens ̸= 0 and a ≤ 1, and 0 otherwise. We define an index set Θ = {0, 1}S′ . For eachθ ∈ Θ, we define the transition matrixPθ as follows: s a Pθ(s′|s, a) 0 0 (1 − q)I(s′ = s) + q S...

  21. [21]

    At state 0, which has a reward of 0, the agent will likely be trapped for a long time before returning to one of states1, . . . , S′. See Figure 3 for a diagram of the MDP(Pθ, r) for θ = (0, . . . ,0). We now state some easily verifiable facts about the MDP(Pθ, r): • The MDP hasS states, is unichain, and has diameter1 q + 1 1/2 = D + 2 = T. • There is a u...

  22. [22]

    D.1 Auxiliary lemmas Lemma D.1

    Thus, plugging back into Equation 52 yields max θ Pθ,n ρ∗ θ − ρA (D) θ > c3 r T S m ! ≥ 1 64 , with c3 = 2−17. D.1 Auxiliary lemmas Lemma D.1. Let π be a stationary policy on MDPMθ. Then ρπ θ ≤ 1 + ε2 2 − ε(1 − Lπ θ /S′) . Proof. A routine computation (see Lemma D.7) yields ρπ θ = q S′ PS′ s=1 1 κs 1 + q S′ PS′ s=1 1 κs , where κs = Lπ θ (s)q +(1 − Lπ θ (...

  23. [23]

    Lemma D.2 (Gilbert-VarshamovLemma[Massart,2007, Lemma4.7])

    Consequently, ρπ θ ≤ λ 1 + λ ≤ 1 1−ε(1−Lπ θ /S′) + ε2 1 + 1 1−ε(1−Lπ θ /S′) + ε2 ≤ 1 + ε2 2 − ε(1 − Lπ θ /S′) . Lemma D.2 (Gilbert-VarshamovLemma[Massart,2007, Lemma4.7]) . Let d ≥ 8. There existsΩd ⊂ {0, 1}d such that |Ωd| ≥ 2d/8 and ∥ω − ω′∥1 ≥ d/8 for all ω ̸= ω′ ∈ Ωd. Lemma D.3. For any θ, θ′ ∈ Θ, we have KL(Pθ,n ∥ Pθ′,n) ≤ S′ 16 − 1 log 2. Proof. Let...

  24. [24]

    Incase(55), wehave Qβ(bPsa, V ′) = Qβ(bPsa, V )ontheentireinterval U andalsothatforany V ′(x) ∈ U, Qβ(bPsa, V ) > V ′(x) (since the(1−β)-percentile is achieved at some states′ ∈ S >), soTβ(bPsa, V ′)(x) = V ′(x) and Tβ(bPsa, V ′)(s′) = Tβ(bPsa, V )(s′) for all s′ ̸= x. Therefore dTβ(bPsa, V ′)(s′) dV ′(x) V ′(x)=V (x) = ( 1 s′ = x 0 otherwise and deT1(V ′...

  25. [25]

    Thus Tβ(bPsa, V ′)(s′) = V ′(x) if s′ ∈ S > ∪ {x}, and Tβ(bPsa, V ′)(s′) = V ′(s′) = V (s′) for s′ ∈ S<

    In case (56), we haveQβ(bPsa, V ′) = V ′(x) on the entire intervalU. Thus Tβ(bPsa, V ′)(s′) = V ′(x) if s′ ∈ S > ∪ {x}, and Tβ(bPsa, V ′)(s′) = V ′(s′) = V (s′) for s′ ∈ S<. Thus dTβ(bPsa, V ′)(s′) dV ′(x) V ′(x)=V (x) = ( 1 s′ ∈ S > ∪ {x} 0 otherwise 54 and deT1(V ′) dV ′(x) V ′(x)=V (x) = d dV ′(x) V ′(x)=V (x) bPsaTβ(bPsa, V ′) − β Tβ(bPsa, V ′) span =...

  26. [26]

    Also mins′ V ′(s′) < V ′(x) on U, so mins′ V ′(s′) = mins′ V (s′) on U

    In case (57), we haveQβ(bPsa, V ′) = Qβ(bPsa, V ) and also that Tβ(bPsa, V ′)(x) = Qβ(bPsa, V ) < V ′(x) (since V ′(x) < Q β(bPsa, V ) in this case), so Tβ(bPsa, V ′) = Tβ(bPsa, V ) on the interval U. Also mins′ V ′(s′) < V ′(x) on U, so mins′ V ′(s′) = mins′ V (s′) on U. Thus dTβ(bPsa, V ′)(s′) dV ′(x) V ′(x)=V (x) = 0 for all s′ ∈ S, and deT1(V ′) dV ′(...

  27. [27]

    And by a similar argument, g(z) < r 2 would imply z ≤ z − δ

    Indeed, if we hadg(z) > r 2, then by continuity of g there would existδ > 0 such that g(t) > r 2 for t ∈ [z, z + δ], which would imply thatz ≥ z + δ. And by a similar argument, g(z) < r 2 would imply z ≤ z − δ. At z the right-derivative is nonnegative, so there existsw ∈ (z, y] such that f(t)−f(z) t−z > r 2 for allt ∈ (z, w]. Consequently, for allt ∈ (z, ...