In summary, we have shown that $\max_{\theta \in \Theta} P_{\theta,n}\left( \rho^*_\theta - \rho^{A(D)}_\theta \ge \frac{1}{2} \right) \ge \delta$, as desired

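Subsequently, for $\theta' = (i', a')$, we have $P_{\theta',n}\left( \rho^{\hat{\pi}}_{\theta'} < \frac{1}{2} \right) \ge P_{\theta',n}(E^c_{i'} \cap F^c_{a'}) \ge P_{\theta',n}(E^c_{i'} \cap F^c_{a'} \cap B) = P(E^c_{i'} \cap F^c_{a'} \mid B)\, P_{\theta'}(B) \ge \frac{1}{4} \cdot 4\delta = \delta$, where the final inequa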

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL

cs.LG · 2025-06-26 · conditional · novelty 8.0

First fully single-policy sample complexity bound for average-reward offline RL via bias span and policy hitting radius in weakly communicating MDPs.

citing papers explorer

Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL

cs.LG · 2025-06-26 · conditional · none · ref 16

First fully single-policy sample complexity bound for average-reward offline RL via bias span and policy hitting radius in weakly communicating MDPs.
