Long-Horizon Model-Based Offline Reinforcement Learning Without Explicit Conservatism

2025 · cs.LG · arXiv 2512.04341

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Popular offline reinforcement learning (RL) methods rely on explicit conservatism, penalizing out-of-dataset actions or restricting rollout horizons. We question the universality of this principle and revisit a complementary Bayesian perspective for test-time adaptation. By modeling a posterior over world models and training a history-dependent agent to maximize expected return, the Bayesian approach directly addresses epistemic uncertainty without explicit conservatism. We first illustrate in a bandit setting that Bayesianism excels on low-quality datasets where conservatism fails. Scaling to realistic tasks, we find that long-horizon rollouts are essential to control value overestimation once conservatism is removed. We introduce design choices that enable learning from long-horizon rollouts while mitigating compounding model errors, yielding our algorithm, NEUBAY, grounded in the neutral Bayesian principle. On D4RL and NeoRL benchmarks, NEUBAY is competitive with leading conservative algorithms, achieving new state-of-the-art on 7 datasets with rollout horizons of several hundred steps. Finally, we characterize datasets by quality and coverage to identify when NEUBAY is preferable to conservative methods.
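The contrast the abstract draws between explicit conservatism and a neutral Bayesian treatment of uncertainty can be illustrated with a toy offline bandit. The sketch below is not the paper's experiment or the NEUBAY algorithm: the dataset counts, the uniform Beta(1, 1) prior, the count-based penalty, and all function names are assumptions chosen for illustration. It only shows how a pessimistic score and a posterior-mean score can rank the same arms differently when a poorly performing arm dominates the data.

```python
import math

# Hypothetical offline dataset: (successes, failures) observed per arm.
# arm_a is heavily sampled but mediocre; arm_b is rarely sampled but promising.
dataset = {"arm_a": (30, 70), "arm_b": (2, 1)}


def posterior_mean(successes, failures, prior_a=1.0, prior_b=1.0):
    """Expected reward under a Beta posterior (assumed uniform Beta(1, 1) prior)."""
    return (successes + prior_a) / (successes + failures + prior_a + prior_b)


def pessimistic_score(successes, failures, penalty=1.0):
    """Empirical mean minus an assumed count-based uncertainty penalty."""
    n = successes + failures
    if n == 0:
        return float("-inf")  # never choose an arm with no data
    return successes / n - penalty / math.sqrt(n)


# Neutral Bayesian choice: maximize expected reward under the posterior.
bayes_choice = max(dataset, key=lambda arm: posterior_mean(*dataset[arm]))

# Conservative choice: maximize the penalized (lower-bound style) score.
conservative_choice = max(dataset, key=lambda arm: pessimistic_score(*dataset[arm]))

print("Bayesian choice:", bayes_choice)
print("Conservative choice:", conservative_choice)
```

With these assumed counts, the penalized score favors the well-covered arm while the posterior mean favors the sparsely covered one. Which choice yields higher true reward depends on the unknown arm means, and this single-step sketch omits the history-dependent test-time adaptation the abstract describes.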

representative citing papers

Understanding Rollout Error in Graph World Models

cs.AI · 2026-06-26 · unverdicted · novelty 4.0

Develops graph rollout bounds separating topology and model error sources and proposes Error-Aware GWM with spectral regularization and consistency terms for dynamic graphs.