pith. machine review for the scientific record.

arxiv: 1911.08265 · v2 · submitted 2019-11-19 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links · Lean Theorem

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 23:52 UTC · model grok-4.3

classification 💻 cs.LG · stat.ML
keywords MuZero · model-based reinforcement learning · Monte Carlo tree search · Atari · Go · chess · learned dynamics · planning

The pith

MuZero achieves superhuman performance in Atari, Go, chess and shogi by learning a model that predicts only the reward, policy and value needed for planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MuZero, which learns a model to predict the quantities most useful for planning rather than attempting to reconstruct full environment dynamics. This model is applied iteratively inside tree search to select actions in environments where the rules or physics are completely unknown. Evaluated on 57 Atari games it sets a new state of the art; on Go, chess and shogi it matches the performance of AlphaZero while receiving no game rules. A sympathetic reader cares because the result shows that effective long-horizon planning is possible in high-dimensional visual domains without hand-crafted simulators.

Core claim

MuZero learns a model that, when applied iteratively, predicts the reward, the action-selection policy, and the value function. When this model is used inside tree-based search, the resulting agent reaches superhuman performance across visually complex domains without any knowledge of their underlying dynamics, and matches AlphaZero on Go, chess and shogi without being given the game rules.
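
The circularity audit below names the three components this claim rests on: a representation function encodes past observations into a hidden state, a dynamics function maps a hidden state and action to a reward and next hidden state, and a prediction function maps a hidden state to a policy and value. A minimal PyTorch sketch of that interface follows; the class, method names, and single-linear-layer networks here are illustrative placeholders, not the paper's residual convolutional architecture.

```python
# Sketch of MuZero's representation/dynamics/prediction decomposition.
# Everything here (names, shapes, linear layers) is illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MuZeroNets(nn.Module):
    def __init__(self, obs_dim: int, num_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.num_actions = num_actions
        self.h = nn.Linear(obs_dim, hidden_dim)                       # representation
        self.g = nn.Linear(hidden_dim + num_actions, hidden_dim + 1)  # dynamics
        self.f = nn.Linear(hidden_dim, num_actions + 1)               # prediction

    def initial_inference(self, obs: torch.Tensor):
        """Root of the search: encode observations, predict policy and value."""
        s = torch.tanh(self.h(obs))
        out = self.f(s)
        return s, out[..., :-1], out[..., -1]  # hidden state, policy logits, value

    def recurrent_inference(self, s: torch.Tensor, action: int):
        """One imagined step: no environment call, only the learned model."""
        a = F.one_hot(torch.tensor(action), self.num_actions).float()
        out = self.g(torch.cat([s, a], dim=-1))
        reward, s_next = out[..., 0], torch.tanh(out[..., 1:])
        out = self.f(s_next)
        return reward, s_next, out[..., :-1], out[..., -1]
```

Planning calls initial_inference once at the root and recurrent_inference along each simulated path; the environment itself is never queried inside the tree, which is the precise sense in which the dynamics remain unknown to the agent.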

What carries the argument

The MuZero learned model that iteratively predicts reward, policy and value inside Monte Carlo tree search.

If this is right

  • Planning methods can now be applied to domains that lack perfect simulators, such as real-world control tasks.
  • A new state of the art is reached on the full set of 57 Atari games.
  • Superhuman performance is obtained in Go, chess and shogi with zero prior knowledge of the rules.
  • Predictions can be limited to reward, policy and value rather than full next-state reconstruction while still enabling effective search.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Targeted prediction of planning quantities may be sufficient for many sequential decision problems where full model learning is intractable.
  • The same architecture could be tested in continuous or partially observable settings where accumulating model error has historically limited planning.
  • If prediction accuracy holds at longer horizons, similar learned models might reduce the sample complexity gap between model-based and model-free methods in visual domains.

Load-bearing premise

The learned predictions remain accurate enough over many steps to support planning even when the true dynamics are unknown and high-dimensional.
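
If one wanted to probe this premise directly, a simple diagnostic is to unroll the learned model along a logged trajectory and watch how its reward predictions drift from the rewards actually observed. A hedged sketch, reusing the hypothetical interface above (this diagnostic is ours, not the paper's):

```python
# Diagnostic sketch: measure how k-step reward predictions drift from logged
# rewards as the learned model is applied iteratively, with no env access.
def k_step_reward_errors(nets, observations, actions, rewards, k: int = 5):
    s, _, _ = nets.initial_inference(observations[0])
    errors = []
    for step in range(min(k, len(actions))):
        pred_reward, s, _, _ = nets.recurrent_inference(s, actions[step])
        errors.append(abs(float(pred_reward) - float(rewards[step])))
    return errors  # the premise is that these stay small enough for planning
```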

What would settle it

A direct comparison showing that MuZero's performance in Go collapses below AlphaZero's once the true game rules are withheld and only the learned model is available to the search, or evidence that prediction error compounds quickly enough with unroll depth to prevent superhuman play on any Atari game.

read the original abstract

Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. When evaluated on 57 different Atari games - the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled - our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces the MuZero algorithm, which learns a model to iteratively predict reward, policy, and value for use inside Monte Carlo tree search. This enables planning in environments with unknown dynamics. The method achieves a new state of the art on 57 Atari games and matches the superhuman performance of AlphaZero on Go, chess, and shogi without being given the game rules or dynamics.

Significance. If the results hold, this is a significant advance for model-based RL: it demonstrates that a learned model can support effective long-horizon planning in high-dimensional, visually complex domains where prior model-based methods have struggled. The large-scale evaluation (57 Atari games plus three board games), direct comparisons to AlphaZero and prior SOTA agents, and reported training curves provide strong empirical grounding.

minor comments (3)
  1. §3.2 (MuZero algorithm): the mapping from the three learned heads (reward, policy, value) to the MCTS backup and selection steps could be stated more explicitly, perhaps with an additional equation or annotated diagram (a hedged sketch follows this list).
  2. Table 2 (Atari results): while median human-normalized scores are given, adding per-game statistical significance or variance across seeds would strengthen the 'new state of the art' claim.
  3. Figure 4 (board-game learning curves): including the AlphaZero curve on the same plot would make the matching-performance claim easier to assess visually.
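
To make the mapping in comment 1 concrete: in our reading, the policy head supplies the prior in the pUCT selection rule MuZero inherits from AlphaZero, the reward head is folded in at every edge during backup, and the value head bootstraps the leaf. A hedged sketch follows; the constants c1 = 1.25, c2 = 19652 and discount 0.997 are the paper's reported values, but the Node layout is hypothetical and the paper's min-max normalization of Q over the tree is omitted for brevity.

```python
import math

# Sketch of pUCT selection and value backup using MuZero's three heads.
C1, C2 = 1.25, 19652.0

class Node:
    def __init__(self, prior: float):
        self.prior = prior        # policy head p, evaluated at the parent
        self.reward = 0.0         # reward head r, predicted by the dynamics model
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = {}        # action -> Node

    def q(self) -> float:
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def puct_score(parent: "Node", child: "Node") -> float:
    # Selection: mean value estimate plus a prior-weighted exploration bonus.
    pb_c = C1 + math.log((parent.visit_count + C2 + 1) / C2)
    bonus = child.prior * pb_c * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return child.q() + bonus

def backup(search_path: list, leaf_value: float, discount: float = 0.997):
    # Backup: bootstrap from the value head at the leaf, folding in the reward
    # head's prediction at each edge on the way back to the root.
    value = leaf_value
    for node in reversed(search_path):
        node.value_sum += value
        node.visit_count += 1
        value = node.reward + discount * value
```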

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work, the recognition of its significance for model-based RL, and the recommendation for minor revision. We appreciate the detailed summary highlighting the key contributions of MuZero in learning models that directly support planning without access to environment dynamics.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The MuZero paper defines an algorithmic procedure (representation, dynamics, and prediction functions trained via a combined loss on observed rewards, policies, and values) and validates it through direct empirical evaluation on external benchmarks (57 Atari games and board games) against independent baselines such as AlphaZero and human performance. No load-bearing derivation step equates a claimed result to its own fitted inputs by construction, nor does any central claim reduce to a self-citation chain or renamed empirical pattern; the reported superhuman performance is measured externally and is not tautological with the training objectives.
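
To make the audited objective concrete, our transcription of the paper's equation 1 is below; every target is an externally observed or search-derived quantity (the per-term losses are the cross-entropy and squared-error choices the paper specifies per domain), which is what keeps the objective from being tautological with the performance claim.

```latex
% Our transcription of MuZero's training loss (the paper's eq. 1):
% u_{t+k} = observed reward, z_{t+k} = n-step return target,
% \pi_{t+k} = MCTS search policy; r_t^k, v_t^k, p_t^k are the model's
% predictions after unrolling k steps from time t.
l_t(\theta) = \sum_{k=0}^{K}
    \Big[ \ell^{r}\big(u_{t+k},\, r_t^{k}\big)
        + \ell^{v}\big(z_{t+k},\, v_t^{k}\big)
        + \ell^{p}\big(\pi_{t+k},\, p_t^{k}\big) \Big]
  + c\,\lVert \theta \rVert^{2}
```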

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that a neural network can learn a sufficiently accurate planning model from self-play data alone; no new physical or mathematical axioms are introduced beyond standard neural network approximation and RL convergence assumptions.

free parameters (1)
  • network architecture and optimizer hyperparameters
    Layer sizes, learning rates, and regularization coefficients chosen per domain to achieve reported performance.
axioms (1)
  • domain assumption: Iterated application of the learned model inside tree search yields predictions accurate enough for superhuman planning
    Invoked throughout the search and training sections; no formal guarantee is provided.
invented entities (1)
  • Learned dynamics model with reward-policy-value heads (no independent evidence)
    purpose: To supply the quantities needed by MCTS without access to true environment rules or simulator
    Core new component introduced by the paper; no independent falsifiable prediction outside performance metrics is given.

pith-pipeline@v0.9.0 · 5537 in / 1341 out tokens · 45939 ms · 2026-05-16T23:52:41.366359+00:00 · methodology

discussion (0)


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PMCTS: Particle Monte Carlo Tree Search for Principled Parallelized Inference Time Scaling

    cs.LG 2026-05 unverdicted novelty 7.0

    PMCTS is the first principled parallel MCTS algorithm that preserves formal policy improvement guarantees and scales with parallel compute.

  2. Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under $\tau$-Mixing

    stat.ML 2026-05 unverdicted novelty 7.0

    Finite-sample risk bounds for DQN with ReLU networks are extended to τ-mixing data, showing an extra dimensionality penalty in the convergence rate due to dependence.

  3. Latent State Design for World Models under Sufficiency Constraints

    cs.AI 2026-05 unverdicted novelty 7.0

    World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

  4. Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.

  5. Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum

    cs.LG 2026-02 unverdicted novelty 7.0

    Single-timescale actor-critic with STORM momentum and a recent-sample buffer achieves optimal O(ε^{-2}) sample complexity for ε-optimal policies in finite discounted MDPs.

  6. Variance-Aware Prior-Based Tree Policies for Monte Carlo Tree Search

    cs.LG 2025-12 unverdicted novelty 7.0

    Inverse-RPO derives two variance-aware prior-based UCT policies from UCB-V that outperform PUCT on benchmarks with no extra cost.

  7. Latent Chain-of-Thought World Modeling for End-to-End Driving

    cs.CV 2025-12 unverdicted novelty 7.0

    LCDrive unifies chain-of-thought reasoning and action selection for end-to-end driving by interleaving action-proposal tokens and latent world-model tokens that predict action outcomes, yielding faster inference and b...

  8. Training Agents Inside of Scalable World Models

    cs.AI 2025-09 conditional novelty 7.0

    Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

  9. Mastering Diverse Domains through World Models

    cs.AI 2023-01 unverdicted novelty 7.0

    DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.

  10. Mastering Atari with Discrete World Models

    cs.LG 2020-10 accept novelty 7.0

    DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.

  11. Dream to Control: Learning Behaviors by Latent Imagination

    cs.LG 2019-12 accept novelty 7.0

    Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.

  12. Plan Before You Trade: Inference-Time Optimization for RL Trading Agents

    cs.LG 2026-05 unverdicted novelty 6.0

    FPILOT optimizes pre-trained RL trading policies at inference time using forecasted price trajectories to improve portfolio allocations and risk-adjusted returns on the DJ30 benchmark.

  13. Multi-scale Predictive Representations for Goal-conditioned Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Ms.PR applies multi-scale predictive supervision to enforce goal-directed alignment in latent spaces for offline GCRL, yielding improved representation quality and performance on vision and state-based tasks.

  14. Quantum Hierarchical Reinforcement Learning via Variational Quantum Circuits

    cs.LG 2026-05 unverdicted novelty 6.0

    Hybrid agent with variational quantum circuits for feature extraction in hierarchical RL outperforms classical baselines with 66% parameter savings, but quantum value estimation degrades results.

  15. Is Conditional Generative Modeling all you need for Decision-Making?

    cs.LG 2022-11 unverdicted novelty 6.0

    Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.

  16. Interpretable experiential learning based on state history and global feedback

    cs.LG 2026-05 unverdicted novelty 4.0

    A transition graph model with utility and evidence counts learns behaviors from state history and feedback, showing performance comparable to neural networks on Atari Breakout.

  17. Reproducibility study on how to find Spurious Correlations, Shortcut Learning, Clever Hans or Group-Distributional non-robustness and how to fix them

    cs.LG 2026-04 unverdicted novelty 4.0

    XAI-based correction methods outperform non-XAI baselines for fixing spurious correlations in DNNs, with Counterfactual Knowledge Distillation most effective, but all are limited by reliance on unavailable group label...

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 17 Pith papers · 6 internal anchors

  1. [1]

    Surprising negative results for generative adversarial tree search

    Kamyar Azizzadenesheli, Brandon Yang, Weitang Liu, Emma Brunskill, Zachary C. Lipton, and Animashree Anandkumar. Surprising negative results for generative adversarial tree search. CoRR, abs/1806.05780, 2018

  2. [2]

    The arcade learning environment: An evaluation platform for general agents

    Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013

  3. [3]

    Superhuman AI for heads-up no-limit poker: Libratus beats top professionals

    Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018

  4. [4]

    Learning and Querying Fast Generative Models for Reinforcement Learning

    Lars Buesing, Theophane Weber, Sebastien Racaniere, SM Eslami, Danilo Rezende, David P Reichert, Fabio Viola, Frederic Besse, Karol Gregor, Demis Hassabis, et al. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006, 2018

  5. [5]

    Deep blue

    Murray Campbell, A. Joseph Hoane, Jr., and Feng-hsiung Hsu. Deep blue. Artif. Intell., 134(1-2):57–83, January 2002

  6. [6]

    Whole-history rating: A Bayesian rating system for players of time-varying strength

    R. Coulom. Whole-history rating: A Bayesian rating system for players of time-varying strength. In International Conference on Computers and Games, pages 113–124, 2008

  7. [7]

    Efficient selectivity and backup operators in monte-carlo tree search

    Rémi Coulom. Efficient selectivity and backup operators in monte-carlo tree search. In International conference on computers and games, pages 72–83. Springer, 2006

  8. [8]

    Pilco: A model-based and data-efficient approach to policy search

    MP. Deisenroth and CE. Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, pages 465–472. Omnipress, 2011

  9. [9]

    Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures

    Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In Proceedings of the International Conference on Machine Learning (ICML), 2018

  10. [10]

    TreeQN and ATreeC: Differentiable tree planning for deep reinforcement learning

    Gregory Farquhar, Tim Rocktaeschel, Maximilian Igl, and Shimon Whiteson. TreeQN and ATreeC: Differentiable tree planning for deep reinforcement learning. In International Conference on Learning Representations, 2018

  11. [11]

    DeepMDP: Learning continuous latent space models for representation learning

    Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G. Bellemare. DeepMDP: Learning continuous latent space models for representation learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2170–2...

  12. [12]

    https://cloud.google.com/tpu/

    Cloud tpu. https://cloud.google.com/tpu/. Accessed: 2019

  13. [13]

    Recurrent world models facilitate policy evolution

    David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 2455–2467, USA, 2018. Curran Associates Inc.

  14. [14]

    Learning Latent Dynamics for Planning from Pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018

  15. [15]

    Identity mappings in deep residual networks

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In 14th European Conference on Computer Vision, pages 630–645, 2016

  16. [16]

    Learning continuous control policies by stochastic value gradients

    Nicolas Heess, Greg Wayne, David Silver, Timothy Lillicrap, Yuval Tassa, and Tom Erez. Learning continuous control policies by stochastic value gradients. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 2944–2952, Cambridge, MA, USA, 2015. MIT Press

  17. [17]

    Rainbow: Combining improvements in deep reinforcement learning

    Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018

  18. [18]

    Distributed prioritized experience replay

    Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and David Silver. Distributed prioritized experience replay. In International Conference on Learning Representations, 2018

  19. [19]

    Reinforcement Learning with Unsupervised Auxiliary Tasks

    Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016

  20. [20]

    Model-based reinforcement learning for atari

    Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019

  21. [21]

    Recurrent experience replay in distributed reinforcement learning

    Steven Kapturowski, Georg Ostrovski, Will Dabney, John Quan, and Remi Munos. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2019

  22. [22]

    Bandit based monte-carlo planning

    Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer, 2006

  23. [23]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012

  24. [24]

    Learning neural network policies with guided policy search under unknown dynamics

    Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1071–1079. Curran Associates, Inc., 2014

  25. [25]

    Human-level control through deep reinforcement learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015

  26. [26]

    Deepstack: Expert-level artificial intelligence in heads-up no-limit poker

    Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017

  27. [27]

    Massively Parallel Methods for Deep Reinforcement Learning

    Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, Shane Legg, Volodymyr Mnih, Koray Kavukcuoglu, and David Silver. Massively parallel methods for deep reinforcement learning. CoRR, abs/1507.04296, 2015

  28. [28]

    Value prediction network

    Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In Advances in Neural Information Processing Systems, pages 6118–6128, 2017

  29. [29]

    OpenAI Five

    OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018

  30. [30]

    Observe and Look Further: Achieving Consistent Performance on Atari

    Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado van Hasselt, John Quan, Mel Večerík, et al. Observe and look further: Achieving consistent performance on atari. arXiv preprint arXiv:1805.11593, 2018

  31. [31]

    Markov Decision Processes: Discrete Stochastic Dynamic Programming

    Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994

  32. [32]

    Multi-armed bandits with episode context

    Christopher D Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011

  33. [33]

    Single-player monte-carlo tree search

    Maarten PD Schadd, Mark HM Winands, H Jaap Van Den Herik, Guillaume MJ-B Chaslot, and Jos WHM Uiterwijk. Single-player monte-carlo tree search. In International Conference on Computers and Games, pages 1–12. Springer, 2008

  34. [34]

    A world championship caliber checkers program

    Jonathan Schaeffer, Joseph Culberson, Norman Treloar, Brent Knight, Paul Lu, and Duane Szafron. A world championship caliber checkers program. Artificial Intelligence, 53(2-3):273–289, 1992

  35. [35]

    Prioritized experience replay

    Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In International Conference on Learning Representations, Puerto Rico, 2016

  36. [36]

    Off-policy actor-critic with shared experience replay

    Simon Schmitt, Matteo Hessel, and Karen Simonyan. Off-policy actor-critic with shared experience replay. arXiv preprint arXiv:1909.11583, 2019

  37. [37]

    Planning chemical syntheses with deep neural networks and symbolic ai

    Marwin HS Segler, Mike Preuss, and Mark P Waller. Planning chemical syntheses with deep neural networks and symbolic ai. Nature, 555(7698):604, 2018

  38. [38]

    Mastering the game of go with deep neural networks and tree search

    David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the g...

  39. [39]

    A general reinforcement learning algorithm that masters chess, shogi, and go through self-play

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018

  40. [40]

    Mastering the game of go without human knowledge

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of go without human knowledge. Nature, 550:354–359, October 2017

  41. [41]

    The predictron: End-to-end learning and planning

    David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, et al. The predictron: End-to-end learning and planning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3191–3199. JMLR.org, 2017

  42. [42]

    Reinforcement Learning: An Introduction

    Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018

  43. [43]

    Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning

    Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999

  44. [44]

    Value iteration networks

    Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162, 2016

  45. [45]

    When to use parametric models in reinforcement learning?

    Hado van Hasselt, Matteo Hessel, and John Aslanides. When to use parametric models in reinforcement learning? arXiv preprint arXiv:1906.05243, 2019

  46. [46]

    Grandmaster level in StarCraft II using multi-agent reinforcement learning

    Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, pages 1–5, 2019

  47. [47]

    Planning and scheduling

    I Vlahavas and I Refanidis. Planning and scheduling. EETN, Greece, Tech. Rep, 2013

  48. [48]

    From Pixels to Torques: Policy Learning with Deep Dynamical Models

    Niklas Wahlström, Thomas B. Schön, and Marc Peter Deisenroth. From pixels to torques: Policy learning with deep dynamical models. CoRR, abs/1502.02251, 2015

  49. [49]

    Embed to control: A locally linear latent dynamics model for control from raw images

    Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pages 2746–2754, Cambridge, MA, USA, 2015. MIT Press. Supplementary Materi...

  50. [50]

    AlphaZero had access to a perfect simulator of the true dynamics process

    State transitions. AlphaZero had access to a perfect simulator of the true dynamics process. In contrast, MuZero employs a learned dynamics model within its search. Under this model, each node in the tree is represented by a corresponding hidden state; by providing a hidden state s_{k−1} and an action a_k to the model the search algorithm can transition to a ne...

  51. [51]

    AlphaZero used the set of legal actions obtained from the simulator to mask the prior produced by the network everywhere in the search tree

    Actions available. AlphaZero used the set of legal actions obtained from the simulator to mask the prior produced by the network everywhere in the search tree. MuZero only masks legal actions at the root of the search tree where the environment can be queried, but does not perform any masking within the search tree. This is possible because the network ra...

  52. [52]

    AlphaZero stopped the search at tree nodes representing terminal states and used the terminal value provided by the simulator instead of the value produced by the network

    Terminal nodes. AlphaZero stopped the search at tree nodes representing terminal states and used the terminal value provided by the simulator instead of the value produced by the network. MuZero does not give special treatment to terminal nodes and always uses the value predicted by the network. Inside the tree, the search can proceed past a terminal no...

  53. [53]

    In the experiments reported in this paper, we always unroll for K = 5 steps

    This ensures that the total gradient applied to the dynamics function stays constant. In the experiments reported in this paper, we always unroll for K = 5 steps. For a detailed illustration, see Figure 1. To improve the learning process and bound the activations, we also scale the hidden state to the same range as the action input ([0, 1]): s_scaled = s − mi...