arxiv: 2011.04021 · v2 · pith:FYIKN2JAnew · submitted 2020-11-08 · 💻 cs.AI · cs.LG

On the role of planning in model-based deep reinforcement learning

classification 💻 cs.AI cs.LG
keywords planning, MBRL, learning, model-based, deep, drive, generalization, methods
Model-based planning is often thought to be necessary for deep, careful reasoning and generalization in artificial agents. While recent successes of model-based reinforcement learning (MBRL) with deep function approximation have strengthened this hypothesis, the resulting diversity of model-based methods has also made it difficult to track which components drive success and why. In this paper, we seek to disentangle the contributions of recent methods by focusing on three questions: (1) How does planning benefit MBRL agents? (2) Within planning, what choices drive performance? (3) To what extent does planning improve generalization? To answer these questions, we study the performance of MuZero (Schrittwieser et al., 2019), a state-of-the-art MBRL algorithm with strong connections and overlapping components with many other MBRL algorithms. We perform a number of interventions and ablations of MuZero across a wide range of environments, including control tasks, Atari, and 9x9 Go. Our results suggest the following: (1) Planning is most useful in the learning process, both for policy updates and for providing a more useful data distribution. (2) Using shallow trees with simple Monte-Carlo rollouts is as performant as more complex methods, except in the most difficult reasoning tasks. (3) Planning alone is insufficient to drive strong generalization. These results indicate where and how to utilize planning in reinforcement learning settings, and highlight a number of open questions for future MBRL research.

This paper has not been read by Pith yet.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning to Theorize the World from Observation

    cs.LG 2026-05 unverdicted novelty 7.0

    NEO is a probabilistic neural model that induces compositional programs as a learned Language of Thought from non-textual observations and executes them via a shared transition model to enable explanation-driven gener...

  2. Decoupled Guidance Diffusion for Adaptive Offline Safe Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    SDGD uses cost-conditioned classifier-free guidance plus reward guidance with feasible trajectory relabeling to generate safe high-reward trajectories that adapt to changing safety budgets in offline RL.

  3. Finding the Time to Think: Learning Planning Budgets in Real-Time RL

    cs.LG 2026-06 unverdicted novelty 6.0

    A learned gating policy selects state-dependent planning budgets in variable-delay real-time RL and outperforms fixed-budget and heuristic baselines across Pac-Man, Tetris, Snake, Speed Hex, and Speed Go.

  4. Finding the Time to Think: Learning Planning Budgets in Real-Time RL

    cs.LG 2026-06 unverdicted novelty 6.0

    Trains a gating policy to select state-dependent planning budgets in variable-delay real-time RL, outperforming fixed-budget and heuristic baselines across Pac-Man, Tetris, Snake, Speed Hex, and Speed Go.

  5. Learning to Theorize the World from Observation

    cs.LG 2026-05 unverdicted novelty 6.0

    NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.