MBDPO reformulates policy optimization as a diffusion process over searched trajectories in latent world models to reduce misalignment between search and value learning.
Bootstrap off-policy with world model. arXiv preprint arXiv:2511.00423, 2025.
2 Pith papers cite this work.