pith. machine review for the scientific record.

arxiv: 1906.08649 · v1 · submitted 2019-06-20 · 💻 cs.LG · cs.AI · cs.RO · stat.ML

Recognition: unknown

Exploring Model-based Planning with Policy Networks

Authors on Pith: no claims yet
classification 💻 cs.LG · cs.AI · cs.RO · stat.ML
keywords planning · policy · action · optimization · control · environments · model-based · network
read the original abstract

Model-based reinforcement learning (MBRL) with model-predictive control or online planning has shown great potential for locomotion control tasks in terms of both sample efficiency and asymptotic performance. Despite these initial successes, existing planning methods search over candidate sequences generated randomly in the action space, which is inefficient in complex high-dimensional environments. In this paper, we propose a novel MBRL algorithm, model-based policy planning (POPLIN), that combines policy networks with online planning. More specifically, we formulate action planning at each time-step as an optimization problem using neural networks. We experiment with both optimization w.r.t. action sequences initialized from the policy network and online optimization directly w.r.t. the parameters of the policy network. We show that POPLIN obtains state-of-the-art performance in the MuJoCo benchmarking environments, being about 3x more sample efficient than state-of-the-art algorithms such as PETS, TD3, and SAC. To explain the effectiveness of our algorithm, we show that the optimization surface in parameter space is smoother than in action space. Furthermore, we find that the distilled policy network can be effectively applied without the expensive model-predictive control at test time in some environments, such as Cheetah. Code is released at https://github.com/WilsonWangTHU/POPLIN.
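The planning scheme the abstract describes can be illustrated with a short sketch: a cross-entropy-method (CEM) planner whose search distribution is seeded by the policy network, in the spirit of the action-space variant (POPLIN-A). This is a minimal reconstruction, not the released implementation; `dynamics_model`, `reward_fn`, `policy_net`, and all hyperparameters below are assumed placeholders.

```python
# Minimal sketch of policy-seeded CEM planning (POPLIN-A flavor).
# Assumed interfaces: policy_net(state) -> action, dynamics_model(state, action)
# -> next_state, reward_fn(state, action) -> float. Hyperparameters illustrative.
import numpy as np

def poplin_a_plan(state, dynamics_model, reward_fn, policy_net,
                  horizon=30, pop_size=500, n_elites=50, n_iters=5, noise_std=0.1):
    # Seed the search: roll the policy through the learned model to obtain a
    # reference action sequence, rather than sampling actions uniformly.
    mean, s = [], state
    for _ in range(horizon):
        a = policy_net(s)
        mean.append(a)
        s = dynamics_model(s, a)
    mean = np.stack(mean)                      # (horizon, action_dim)
    std = np.full_like(mean, noise_std)

    for _ in range(n_iters):
        # Sample candidate sequences around the current mean and score each by
        # its predicted return under the learned dynamics model.
        cands = mean + std * np.random.randn(pop_size, *mean.shape)
        returns = np.empty(pop_size)
        for i, seq in enumerate(cands):
            s, ret = state, 0.0
            for a in seq:
                ret += reward_fn(s, a)
                s = dynamics_model(s, a)
            returns[i] = ret
        # Refit the sampling distribution to the highest-return elites.
        elites = cands[np.argsort(returns)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0)

    return mean[0]   # model-predictive control: execute only the first action
```

The parameter-space variant (POPLIN-P) runs the same CEM loop over perturbations of the policy network's weights rather than over the action sequence, which is where the smoother optimization surface in parameter space comes into play.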

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    WOMBET generates reliable prior data with world-model uncertainty penalization and transfers it to target tasks via adaptive offline-online sampling, yielding better sample efficiency than baselines.

  2. Mastering Atari with Discrete World Models

    cs.LG 2020-10 accept novelty 7.0

    DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.

  3. Dream to Control: Learning Behaviors by Latent Imagination

    cs.LG 2019-12 accept novelty 7.0

    Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.

  4. Hyperfastrl: Hypernetwork-based reinforcement learning for unified control of parametric chaotic PDEs

    cs.CE 2026-04 unverdicted novelty 6.0

    Hypernetworks map a forcing parameter directly to policy weights in an RL framework, enabling unified stabilization of the Kuramoto-Sivashinsky equation across regimes with KAN architectures showing strongest extrapolation.
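The core idea summarized in item 4 — a hypernetwork mapping a scalar forcing parameter directly to policy weights — can be sketched as follows. All shapes, layer choices, and names here are illustrative assumptions; the cited paper reports KAN architectures among others, which this plain-MLP sketch does not reproduce.

```python
# Illustrative hypernetwork: a tiny MLP maps a scalar forcing parameter to the
# flat weight vector of a small policy MLP. All shapes are placeholder assumptions.
import numpy as np

rng = np.random.default_rng(0)

OBS, HID, ACT = 8, 32, 1                       # assumed policy dimensions
N_WEIGHTS = OBS * HID + HID + HID * ACT + ACT  # total policy parameters

# Hypernetwork parameters: 1 -> 64 -> N_WEIGHTS.
W1 = rng.normal(scale=0.1, size=(1, 64)); b1 = np.zeros(64)
W2 = rng.normal(scale=0.1, size=(64, N_WEIGHTS)); b2 = np.zeros(N_WEIGHTS)

def hypernet(forcing_param):
    """Map one scalar regime parameter to a full policy weight vector."""
    h = np.tanh(np.array([forcing_param]) @ W1 + b1)
    return h @ W2 + b2

def policy(obs, theta):
    """Unpack the flat weight vector into the policy MLP and evaluate it."""
    i = 0
    w1 = theta[i:i + OBS * HID].reshape(OBS, HID); i += OBS * HID
    c1 = theta[i:i + HID];                         i += HID
    w2 = theta[i:i + HID * ACT].reshape(HID, ACT); i += HID * ACT
    c2 = theta[i:i + ACT]
    return np.tanh(obs @ w1 + c1) @ w2 + c2

theta = hypernet(0.5)                 # policy weights for one forcing regime
action = policy(np.zeros(OBS), theta) # control action for an example state
```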