First fully single-policy sample complexity bound for average-reward offline RL via bias span and policy hitting radius in weakly communicating MDPs.
Also $\min_{s'} V'(s') < V'(x)$ on $U$, so $\min_{s'} V'(s') = \min_{s'} V(s')$ on $U$.
1 Pith paper cites this work. Polarity classification is still being indexed.
Fields: cs.LG (1)
Years: 2025 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers:
Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL