First fully single-policy sample complexity bound for average-reward offline RL via bias span and policy hitting radius in weakly communicating MDPs.
In summary, we have shown that $\max_{\theta \in \Theta} P_{\theta,n}\left( \rho^*_\theta - \rho^{\mathcal{A}(D)}_\theta \ge \tfrac{1}{2} \right) \ge \delta$, as desired.
One Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2025 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers:
- Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL
First fully single-policy sample complexity bound for average-reward offline RL via bias span and policy hitting radius in weakly communicating MDPs.