MBDPO reformulates policy optimization as a diffusion process over searched trajectories in latent world models to reduce misalignment between search and value learning.
The Surprising Difficulty of Search in Model-Based Reinforcement Learning
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
This paper investigates search in model-based reinforcement learning (RL). Conventional wisdom holds that long-term predictions and compounding errors are the primary obstacles for model-based RL. We challenge this view, showing that search is not a drop-in replacement for a learned policy. Surprisingly, we find that search can harm performance even when the model is highly accurate. Instead, we show that mitigating overestimation bias matters more than improving model or value function accuracy. Building on this insight, we identify that taking the minimum over an ensemble of value functions effectively addresses this bias and enables effective search, achieving state-of-the-art performance across multiple popular benchmark domains.
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization
MBDPO reformulates policy optimization as a diffusion process over searched trajectories in latent world models to reduce misalignment between search and value learning.