arXiv preprint arXiv:2107.06226

Pessimistic model-based offline reinforcement learning under partial coverage · 2021 · arXiv 2107.06226

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

representative citing papers

Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

The paper establishes the first Õ(ε⁻¹) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.

Provably Efficient Offline-to-Online Value Adaptation with General Function Approximation

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

Offline-to-online value adaptation in RL has a minimax lower bound matching pure online learning in hard cases, yet O2O-LSVI improves sample complexity under a novel structural condition on pretrained Q-functions.

On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Offline KL-regularized MABs require sample complexity scaling as O(η S A C^π*/ε) for large regularization and Ω(S A C^π*/ε²) for small regularization, with matching lower bounds across the full range.

citing papers explorer

Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
cs.LG · 2026-05-09 · unverdicted · none · ref 19
The paper establishes the first Õ(ε⁻¹) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.

Provably Efficient Offline-to-Online Value Adaptation with General Function Approximation
cs.LG · 2026-04-15 · unverdicted · none · ref 10
Offline-to-online value adaptation in RL has a minimax lower bound matching pure online learning in hard cases, yet O2O-LSVI improves sample complexity under a novel structural condition on pretrained Q-functions.

On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization
cs.LG · 2026-05-04 · unverdicted · none · ref 50
Offline KL-regularized MABs require sample complexity scaling as O(η S A C^π*/ε) for large regularization and Ω(S A C^π*/ε²) for small regularization, with matching lower bounds across the full range.
