We have shown that when the dataset does not contain any useful transitions, there must be at least one MDP where the algorithm is likely to make a poor guess.

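Indeed, if this were not the case, we would have
$$\mathbb{E}\Big[\sum_{a\in\mathcal{A}} \hat{\pi}(a\mid 1) \,\Big|\, B\Big] = \sum_{a\in\mathcal{A}} \mathbb{E}\big[\hat{\pi}(a\mid 1) \mid B\big] \ge \sum_{a\in\mathcal{A}} \mathbb{E}\big[\hat{\pi}(a\mid 1) \mid F_a \cap B\big]\,\mathbb{P}(F_a \mid B) > \sum_{a\in\mathcal{A}} \frac{4}{A}\cdot\frac{1}{4} = 1,$$
which is a contradiction because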


representative citing papers

Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL

cs.LG · 2025-06-26 · conditional · novelty 8.0

First fully single-policy sample complexity bound for average-reward offline RL via bias span and policy hitting radius in weakly communicating MDPs.



