pith. machine review for the scientific record.

arxiv: 1911.08265 · v2 · submitted 2019-11-19 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links · Lean Theorem

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 23:52 UTC · model grok-4.3

classification 💻 cs.LG · stat.ML
keywords MuZero · model-based reinforcement learning · Monte Carlo tree search · Atari · Go · chess · learned dynamics · planning

The pith

MuZero achieves superhuman performance in Atari, Go, chess and shogi by learning a model that predicts only the reward, policy and value needed for planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MuZero, which learns a model to predict the quantities most useful for planning rather than attempting to reconstruct full environment dynamics. This model is applied iteratively inside tree search to select actions in environments where the rules or physics are completely unknown. Evaluated on 57 Atari games it sets a new state of the art; on Go, chess and shogi it matches the performance of AlphaZero while receiving no game rules. A sympathetic reader cares because the result shows that effective long-horizon planning is possible in high-dimensional visual domains without hand-crafted simulators.

Core claim

MuZero learns a model that, when applied iteratively, predicts the reward, the action-selection policy, and the value function. When this model is used inside tree-based search, the resulting agent reaches superhuman performance across visually complex domains without any knowledge of their underlying dynamics, and matches AlphaZero on Go, chess and shogi without being given the game rules.
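
The circularity audit below names the three components this claim rests on: a representation function encodes past observations into a hidden state, a dynamics function maps a hidden state and action to a reward and next hidden state, and a prediction function maps a hidden state to a policy and value. A minimal PyTorch sketch of that interface follows; the class, method names, and single-linear-layer networks here are illustrative placeholders, not the paper's residual convolutional architecture.

```python
# Sketch of MuZero's representation/dynamics/prediction decomposition.
# Everything here (names, shapes, linear layers) is illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MuZeroNets(nn.Module):
    def __init__(self, obs_dim: int, num_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.num_actions = num_actions
        self.h = nn.Linear(obs_dim, hidden_dim)                       # representation
        self.g = nn.Linear(hidden_dim + num_actions, hidden_dim + 1)  # dynamics
        self.f = nn.Linear(hidden_dim, num_actions + 1)               # prediction

    def initial_inference(self, obs: torch.Tensor):
        """Root of the search: encode observations, predict policy and value."""
        s = torch.tanh(self.h(obs))
        out = self.f(s)
        return s, out[..., :-1], out[..., -1]  # hidden state, policy logits, value

    def recurrent_inference(self, s: torch.Tensor, action: int):
        """One imagined step: no environment call, only the learned model."""
        a = F.one_hot(torch.tensor(action), self.num_actions).float()
        out = self.g(torch.cat([s, a], dim=-1))
        reward, s_next = out[..., 0], torch.tanh(out[..., 1:])
        out = self.f(s_next)
        return reward, s_next, out[..., :-1], out[..., -1]
```

Planning calls initial_inference once at the root and recurrent_inference along each simulated path; the environment itself is never queried inside the tree, which is the precise sense in which the dynamics remain unknown to the agent.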

What carries the argument

The MuZero learned model that iteratively predicts reward, policy and value inside Monte Carlo tree search.

If this is right

  • Planning methods can now be applied to domains that lack perfect simulators, such as real-world control tasks.
  • A new state of the art is reached on the full set of 57 Atari games.
  • Superhuman performance is obtained in Go, chess and shogi with zero prior knowledge of the rules.
  • Predictions can be limited to reward, policy and value rather than full next-state reconstruction while still enabling effective search.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Targeted prediction of planning quantities may be sufficient for many sequential decision problems where full model learning is intractable.
  • The same architecture could be tested in continuous or partially observable settings where accumulating model error has historically limited planning.
  • If prediction accuracy holds at longer horizons, similar learned models might reduce the sample complexity gap between model-based and model-free methods in visual domains.

Load-bearing premise

The learned predictions remain accurate enough over many steps to support planning even when the true dynamics are unknown and high-dimensional.
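
If one wanted to probe this premise directly, a simple diagnostic is to unroll the learned model along a logged trajectory and watch how its reward predictions drift from the rewards actually observed. A hedged sketch, reusing the hypothetical interface above (this diagnostic is ours, not the paper's):

```python
# Diagnostic sketch: measure how k-step reward predictions drift from logged
# rewards as the learned model is applied iteratively, with no env access.
def k_step_reward_errors(nets, observations, actions, rewards, k: int = 5):
    s, _, _ = nets.initial_inference(observations[0])
    errors = []
    for step in range(min(k, len(actions))):
        pred_reward, s, _, _ = nets.recurrent_inference(s, actions[step])
        errors.append(abs(float(pred_reward) - float(rewards[step])))
    return errors  # the premise is that these stay small enough for planning
```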

What would settle it

A direct comparison showing that MuZero's performance in Go collapses below AlphaZero's once the true game rules are withheld and only the learned model is available to the search, or evidence that prediction error compounds quickly enough with unroll depth to prevent superhuman play on any Atari game.

read the original abstract

Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. When evaluated on 57 different Atari games - the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled - our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces the MuZero algorithm, which learns a model to iteratively predict reward, policy, and value for use inside Monte Carlo tree search. This enables planning in environments with unknown dynamics. The method achieves a new state of the art on 57 Atari games and matches the superhuman performance of AlphaZero on Go, chess, and shogi without being given the game rules or dynamics.

Significance. If the results hold, this is a significant advance for model-based RL: it demonstrates that a learned model can support effective long-horizon planning in high-dimensional, visually complex domains where prior model-based methods have struggled. The large-scale evaluation (57 Atari games plus three board games), direct comparisons to AlphaZero and prior SOTA agents, and reported training curves provide strong empirical grounding.

minor comments (3)
  1. §3.2 (MuZero algorithm): the mapping from the three learned heads (reward, policy, value) to the MCTS backup and selection steps could be stated more explicitly, perhaps with an additional equation or annotated diagram (a hedged sketch follows this list).
  2. Table 2 (Atari results): while median human-normalized scores are given, adding per-game statistical significance or variance across seeds would strengthen the 'new state of the art' claim.
  3. Figure 4 (board-game learning curves): including the AlphaZero curve on the same plot would make the matching-performance claim easier to assess visually.
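
To make the mapping in comment 1 concrete: in our reading, the policy head supplies the prior in the pUCT selection rule MuZero inherits from AlphaZero, the reward head is folded in at every edge during backup, and the value head bootstraps the leaf. A hedged sketch follows; the constants c1 = 1.25, c2 = 19652 and discount 0.997 are the paper's reported values, but the Node layout is hypothetical and the paper's min-max normalization of Q over the tree is omitted for brevity.

```python
import math

# Sketch of pUCT selection and value backup using MuZero's three heads.
C1, C2 = 1.25, 19652.0

class Node:
    def __init__(self, prior: float):
        self.prior = prior        # policy head p, evaluated at the parent
        self.reward = 0.0         # reward head r, predicted by the dynamics model
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = {}        # action -> Node

    def q(self) -> float:
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def puct_score(parent: "Node", child: "Node") -> float:
    # Selection: mean value estimate plus a prior-weighted exploration bonus.
    pb_c = C1 + math.log((parent.visit_count + C2 + 1) / C2)
    bonus = child.prior * pb_c * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return child.q() + bonus

def backup(search_path: list, leaf_value: float, discount: float = 0.997):
    # Backup: bootstrap from the value head at the leaf, folding in the reward
    # head's prediction at each edge on the way back to the root.
    value = leaf_value
    for node in reversed(search_path):
        node.value_sum += value
        node.visit_count += 1
        value = node.reward + discount * value
```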

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work, the recognition of its significance for model-based RL, and the recommendation for minor revision. We appreciate the detailed summary highlighting the key contributions of MuZero in learning models that directly support planning without access to environment dynamics.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The MuZero paper defines an algorithmic procedure (representation, dynamics, and prediction functions trained via a combined loss on observed rewards, policies, and values) and validates it through direct empirical evaluation on external benchmarks (57 Atari games and board games) against independent baselines such as AlphaZero and human performance. No load-bearing derivation step equates a claimed result to its own fitted inputs by construction, nor does any central claim reduce to a self-citation chain or renamed empirical pattern; the reported superhuman performance is measured externally and is not tautological with the training objectives.
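
To make the audited objective concrete, our transcription of the paper's equation 1 is below; every target is an externally observed or search-derived quantity (the per-term losses are the cross-entropy and squared-error choices the paper specifies per domain), which is what keeps the objective from being tautological with the performance claim.

```latex
% Our transcription of MuZero's training loss (the paper's eq. 1):
% u_{t+k} = observed reward, z_{t+k} = n-step return target,
% \pi_{t+k} = MCTS search policy; r_t^k, v_t^k, p_t^k are the model's
% predictions after unrolling k steps from time t.
l_t(\theta) = \sum_{k=0}^{K}
    \Big[ \ell^{r}\big(u_{t+k},\, r_t^{k}\big)
        + \ell^{v}\big(z_{t+k},\, v_t^{k}\big)
        + \ell^{p}\big(\pi_{t+k},\, p_t^{k}\big) \Big]
  + c\,\lVert \theta \rVert^{2}
```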

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that a neural network can learn a sufficiently accurate planning model from self-play data alone; no new physical or mathematical axioms are introduced beyond standard neural network approximation and RL convergence assumptions.

free parameters (1)
  • network architecture and optimizer hyperparameters
    Layer sizes, learning rates, and regularization coefficients chosen per domain to achieve reported performance.
axioms (1)
  • domain assumption: Iterated application of the learned model inside tree search yields predictions accurate enough for superhuman planning
    Invoked throughout the search and training sections; no formal guarantee is provided.
invented entities (1)
  • Learned dynamics model with reward-policy-value heads (no independent evidence)
    purpose: To supply the quantities needed by MCTS without access to true environment rules or simulator
    Core new component introduced by the paper; no independent falsifiable prediction outside performance metrics is given.

pith-pipeline@v0.9.0 · 5537 in / 1341 out tokens · 45939 ms · 2026-05-16T23:52:41.366359+00:00 · methodology

discussion (0)


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PMCTS: Particle Monte Carlo Tree Search for Principled Parallelized Inference Time Scaling

    cs.LG 2026-05 unverdicted novelty 7.0

    PMCTS is the first principled parallel MCTS algorithm that preserves formal policy improvement guarantees and scales with parallel compute.

  2. Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under $\tau$-Mixing

    stat.ML 2026-05 unverdicted novelty 7.0

    Finite-sample risk bounds for DQN with ReLU networks are extended to τ-mixing data, showing an extra dimensionality penalty in the convergence rate due to dependence.

  3. Latent State Design for World Models under Sufficiency Constraints

    cs.AI 2026-05 unverdicted novelty 7.0

    World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

  4. Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.

  5. Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum

    cs.LG 2026-02 unverdicted novelty 7.0

    Single-timescale actor-critic with STORM momentum and a recent-sample buffer achieves optimal O(ε^{-2}) sample complexity for ε-optimal policies in finite discounted MDPs.

  6. Variance-Aware Prior-Based Tree Policies for Monte Carlo Tree Search

    cs.LG 2025-12 unverdicted novelty 7.0

    Inverse-RPO derives two variance-aware prior-based UCT policies from UCB-V that outperform PUCT on benchmarks with no extra cost.

  7. Latent Chain-of-Thought World Modeling for End-to-End Driving

    cs.CV 2025-12 unverdicted novelty 7.0

    LCDrive unifies chain-of-thought reasoning and action selection for end-to-end driving by interleaving action-proposal tokens and latent world-model tokens that predict action outcomes, yielding faster inference and b...

  8. Training Agents Inside of Scalable World Models

    cs.AI 2025-09 conditional novelty 7.0

    Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

  9. Mastering Diverse Domains through World Models

    cs.AI 2023-01 unverdicted novelty 7.0

    DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.

  10. Mastering Atari with Discrete World Models

    cs.LG 2020-10 accept novelty 7.0

    DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.

  11. Dream to Control: Learning Behaviors by Latent Imagination

    cs.LG 2019-12 accept novelty 7.0

    Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.

  12. Plan Before You Trade: Inference-Time Optimization for RL Trading Agents

    cs.LG 2026-05 unverdicted novelty 6.0

    FPILOT optimizes pre-trained RL trading policies at inference time using forecasted price trajectories to improve portfolio allocations and risk-adjusted returns on the DJ30 benchmark.

  13. Multi-scale Predictive Representations for Goal-conditioned Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Ms.PR applies multi-scale predictive supervision to enforce goal-directed alignment in latent spaces for offline GCRL, yielding improved representation quality and performance on vision and state-based tasks.

  14. Quantum Hierarchical Reinforcement Learning via Variational Quantum Circuits

    cs.LG 2026-05 unverdicted novelty 6.0

    Hybrid agent with variational quantum circuits for feature extraction in hierarchical RL outperforms classical baselines with 66% parameter savings, but quantum value estimation degrades results.

  15. Is Conditional Generative Modeling all you need for Decision-Making?

    cs.LG 2022-11 unverdicted novelty 6.0

    Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.

  16. Interpretable experiential learning based on state history and global feedback

    cs.LG 2026-05 unverdicted novelty 4.0

    A transition graph model with utility and evidence counts learns behaviors from state history and feedback, showing performance comparable to neural networks on Atari Breakout.

  17. Reproducibility study on how to find Spurious Correlations, Shortcut Learning, Clever Hans or Group-Distributional non-robustness and how to fix them

    cs.LG 2026-04 unverdicted novelty 4.0

    XAI-based correction methods outperform non-XAI baselines for fixing spurious correlations in DNNs, with Counterfactual Knowledge Distillation most effective, but all are limited by reliance on unavailable group label...

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 17 Pith papers · 6 internal anchors

  1. [1]

    Surprising negative results for generative adversarial tree search

    Kamyar Azizzadenesheli, Brandon Yang, Weitang Liu, Emma Brunskill, Zachary C. Lipton, and Animashree Anandkumar. Surprising negative results for generative adversarial tree search. CoRR, abs/1806.05780, 2018

  2. [2]

    The arcade learning environment: An evaluation platform for general agents

    Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013

  3. [3]

    Superhuman AI for heads-up no-limit poker: Libratus beats top professionals

    Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018

  4. [4]

    Learning and Querying Fast Generative Models for Reinforcement Learning

    Lars Buesing, Theophane Weber, Sebastien Racaniere, SM Eslami, Danilo Rezende, David P Reichert, Fabio Viola, Frederic Besse, Karol Gregor, Demis Hassabis, et al. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006, 2018

  5. [5]

    Deep blue

    Murray Campbell, A. Joseph Hoane, Jr., and Feng-hsiung Hsu. Deep blue. Artif. Intell., 134(1-2):57–83, January 2002

  6. [6]

    Whole-history rating: A Bayesian rating system for players of time-varying strength

    R. Coulom. Whole-history rating: A Bayesian rating system for players of time-varying strength. In International Conference on Computers and Games, pages 113–124, 2008

  7. [7]

    Efficient selectivity and backup operators in monte-carlo tree search

    Rémi Coulom. Efficient selectivity and backup operators in monte-carlo tree search. In International conference on computers and games, pages 72–83. Springer, 2006

  8. [8]

    Pilco: A model-based and data-efficient approach to policy search

    MP. Deisenroth and CE. Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, pages 465–472. Omnipress, 2011

  9. [9]

    Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures

    Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In Proceedings of the International Conference on Machine Learning (ICML), 2018

  10. [10]

    TreeQN and ATreeC: Differentiable tree planning for deep reinforcement learning

    Gregory Farquhar, Tim Rocktaeschel, Maximilian Igl, and Shimon Whiteson. TreeQN and ATreeC: Differentiable tree planning for deep reinforcement learning. In International Conference on Learning Representations, 2018

  11. [11]

    DeepMDP: Learning continuous latent space models for representation learning

    Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G. Bellemare. DeepMDP: Learning continuous latent space models for representation learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2170–2...

  12. [12]

    https://cloud.google.com/tpu/

    Cloud tpu. https://cloud.google.com/tpu/. Accessed: 2019

  13. [13]

    Recurrent world models facilitate policy evolution

    David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 2455–2467, USA, 2018. Curran Associates Inc.

  14. [14]

    Learning Latent Dynamics for Planning from Pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018

  15. [15]

    Identity mappings in deep residual networks

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In 14th European Conference on Computer Vision, pages 630–645, 2016

  16. [16]

    Learning continuous control policies by stochastic value gradients

    Nicolas Heess, Greg Wayne, David Silver, Timothy Lillicrap, Yuval Tassa, and Tom Erez. Learning continuous control policies by stochastic value gradients. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 2944–2952, Cambridge, MA, USA, 2015. MIT Press

  17. [17]

    Rainbow: Combining improvements in deep reinforcement learning

    Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018

  18. [18]

    Distributed prioritized experience replay

    Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and David Silver. Distributed prioritized experience replay. In International Conference on Learning Representations, 2018

  19. [19]

    Reinforcement Learning with Unsupervised Auxiliary Tasks

    Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016

  20. [20]

    Model-based reinforcement learning for atari

    Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019

  21. [21]

    Recurrent experience replay in distributed reinforcement learning

    Steven Kapturowski, Georg Ostrovski, Will Dabney, John Quan, and Remi Munos. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2019

  22. [22]

    Bandit based monte-carlo planning

    Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer, 2006

  23. [23]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012

  24. [24]

    Learning neural network policies with guided policy search under unknown dynamics

    Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1071–1079. Curran Associates, Inc., 2014

  25. [25]

    Human-level control through deep reinforcement learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015

  26. [26]

    Deepstack: Expert-level artificial intelligence in heads-up no-limit poker

    Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017

  27. [27]

    Massively Parallel Methods for Deep Reinforcement Learning

    Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, Shane Legg, Volodymyr Mnih, Koray Kavukcuoglu, and David Silver. Massively parallel methods for deep reinforcement learning. CoRR, abs/1507.04296, 2015

  28. [28]

    Value prediction network

    Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In Advances in Neural Information Processing Systems, pages 6118–6128, 2017

  29. [29]

    OpenAI Five

    OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018

  30. [30]

    Observe and Look Further: Achieving Consistent Performance on Atari

    Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado van Hasselt, John Quan, Mel Večerík, et al. Observe and look further: Achieving consistent performance on atari. arXiv preprint arXiv:1805.11593, 2018

  31. [31]

    Markov Decision Processes: Discrete Stochastic Dynamic Programming

    Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994

  32. [32]

    Multi-armed bandits with episode context

    Christopher D Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011

  33. [33]

    Single-player monte-carlo tree search

    Maarten PD Schadd, Mark HM Winands, H Jaap Van Den Herik, Guillaume MJ-B Chaslot, and Jos WHM Uiterwijk. Single-player monte-carlo tree search. In International Conference on Computers and Games, pages 1–12. Springer, 2008

  34. [34]

    A world championship caliber checkers program

    Jonathan Schaeffer, Joseph Culberson, Norman Treloar, Brent Knight, Paul Lu, and Duane Szafron. A world championship caliber checkers program. Artificial Intelligence, 53(2-3):273–289, 1992

  35. [35]

    Prioritized experience replay

    Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In International Conference on Learning Representations, Puerto Rico, 2016

  36. [36]

    Off-policy actor-critic with shared experience replay

    Simon Schmitt, Matteo Hessel, and Karen Simonyan. Off-policy actor-critic with shared experience replay. arXiv preprint arXiv:1909.11583, 2019

  37. [37]

    Planning chemical syntheses with deep neural networks and symbolic ai

    Marwin HS Segler, Mike Preuss, and Mark P Waller. Planning chemical syntheses with deep neural networks and symbolic ai. Nature, 555(7698):604, 2018

  38. [38]

    Mastering the game of go with deep neural networks and tree search

    David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the g...

  39. [39]

    A general reinforcement learning algorithm that masters chess, shogi, and go through self-play

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018

  40. [40]

    Mastering the game of go without human knowledge

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of go without human knowledge. Nature, 550:354–359, October 2017

  41. [41]

    The predictron: End-to-end learning and planning

    David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, et al. The predictron: End-to-end learning and planning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3191–3199. JMLR.org, 2017

  42. [42]

    Reinforcement Learning: An Introduction

    Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018

  43. [43]

    Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning

    Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999

  44. [44]

    Value iteration networks

    Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162, 2016

  45. [45]

    When to use parametric models in reinforcement learning?

    Hado van Hasselt, Matteo Hessel, and John Aslanides. When to use parametric models in reinforcement learning? arXiv preprint arXiv:1906.05243, 2019

  46. [46]

    Grandmaster level in StarCraft II using multi-agent reinforcement learning

    Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, pages 1–5, 2019

  47. [47]

    Planning and scheduling

    I Vlahavas and I Refanidis. Planning and scheduling. EETN, Greece, Tech. Rep, 2013

  48. [48]

    From Pixels to Torques: Policy Learning with Deep Dynamical Models

    Niklas Wahlström, Thomas B. Schön, and Marc Peter Deisenroth. From pixels to torques: Policy learning with deep dynamical models. CoRR, abs/1502.02251, 2015

  49. [49]

    Embed to control: A locally linear latent dynamics model for control from raw images

    Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pages 2746–2754, Cambridge, MA, USA, 2015. MIT Press. Supplementary Materi...

  50. [50]

    AlphaZero had access to a perfect simulator of the true dynamics process

    State transitions. AlphaZero had access to a perfect simulator of the true dynamics process. In contrast, MuZero employs a learned dynamics model within its search. Under this model, each node in the tree is represented by a corresponding hidden state; by providing a hidden state s_{k−1} and an action a_k to the model the search algorithm can transition to a ne...

  51. [51]

    AlphaZero used the set of legal actions obtained from the simulator to mask the prior produced by the network everywhere in the search tree

    Actions available. AlphaZero used the set of legal actions obtained from the simulator to mask the prior produced by the network everywhere in the search tree. MuZero only masks legal actions at the root of the search tree where the environment can be queried, but does not perform any masking within the search tree. This is possible because the network ra...

  52. [52]

    AlphaZero stopped the search at tree nodes representing terminal states and used the terminal value provided by the simulator instead of the value produced by the network

    Terminal nodes. AlphaZero stopped the search at tree nodes representing terminal states and used the terminal value provided by the simulator instead of the value produced by the network. MuZero does not give special treatment to terminal nodes and always uses the value predicted by the network. Inside the tree, the search can proceed past a terminal no...

  53. [53]

    In the experiments reported in this paper, we always unroll for K = 5 steps

    This ensures that the total gradient applied to the dynamics function stays constant. In the experiments reported in this paper, we always unroll for K = 5 steps. For a detailed illustration, see Figure 1. To improve the learning process and bound the activations, we also scale the hidden state to the same range as the action input ([0, 1]): s_scaled = s − mi...