First fully single-policy sample complexity bound for average-reward offline RL via bias span and policy hitting radius in weakly communicating MDPs.
We have shown that when the dataset does not contain any useful transitions, there must be at least one MDP where the algorithm is likely to make a poor guess.
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.LG (1)
years: 2025 (1)
verdicts: CONDITIONAL (1)
representative citing papers:
Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL