Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

arxiv: 1712.01815 · v1 · pith:AU5IKNFXnew · submitted 2017-12-05 · 💻 cs.AI · cs.LG

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis

keywords chess game superhuman achieved algorithm alphazero domain games

The game of chess is the most widely-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. In contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go, by tabula rasa reinforcement learning from games of self-play. In this paper, we generalise this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains. Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved within 24 hours a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and convincingly defeated a world-champion program in each case.

Forward citations

Cited by 45 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
cs.CL 2023-04 accept novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
Generative Language Modeling for Automated Theorem Proving
cs.LG 2020-09 unverdicted novelty 8.0

GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.
AI safety via debate
stat.ML 2018-05 conditional novelty 8.0

AI agents trained through competitive debate can allow polynomial-time human judges to oversee PSPACE-level questions, with MNIST experiments boosting sparse classifier accuracy from 59% to 89% using only 6 pixels.
Less Effort, Shorter Proofs: Reinforcement Learning for Security Protocol Analysis in Tamarin
cs.CR 2026-05 unverdicted novelty 7.0

An RL-guided MCTS proof search for Tamarin finds more and shorter proofs than standard search across 16 protocol models.
ASH: Agents that Self-Hone via Embodied Learning
cs.AI 2026-05 unverdicted novelty 7.0

ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.
JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
cs.AI 2026-04 unverdicted novelty 7.0

COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
Causal inference for social network formation
econ.EM 2026-04 conditional novelty 7.0

Random team assignments in a professional firm reveal that indirect ties strongly increase new direct tie formation, while effects of degree and local density are smaller and less robust.
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
cs.SE 2026-04 unverdicted novelty 7.0

AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
Advantage-Guided Diffusion for Model-Based Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 7.0

Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.
Learning Interactive Real-World Simulators
cs.AI 2023-10 conditional novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models
cs.AI 2023-10 unverdicted novelty 7.0

LATS integrates Monte Carlo Tree Search with language models using in-context learning, value functions, and self-reflection to achieve 92.7% pass@1 on HumanEval and competitive web navigation performance.
On the Measure of Intelligence
cs.AI 2019-11 unverdicted novelty 7.0

Intelligence is skill-acquisition efficiency, and the ARC benchmark measures human-like general fluid intelligence by testing abstraction and reasoning with minimal, innate-like priors.
Solving Rubik's Cube with a Robot Hand
cs.LG 2019-10 accept novelty 7.0

Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.
Inductive general game playing
cs.AI 2019-06 unverdicted novelty 7.0

Introduces the IGGP problem and dataset from 50 GGP games, showing existing ILP systems solve at most 40% of tasks perfectly.
GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

GAE suffers from amplified variance in imperfect-info self-play RL; VRPO with Q-boosting and multi-step Expected SARSA(λ) reduces it and improves performance on mid-to-large games.
Toward Modeling Player-Specific Chess Behaviors
cs.AI 2026-05 unverdicted novelty 6.0

Champion-specific embeddings and limited MCTS in Maia-2 reduce average Jensen-Shannon divergence to 16 historical chess champions' move distributions in a new latent-space metric, even as standard move accuracy falls.
Evaluating the False Trust Engendered by LLM Explanations
cs.HC 2026-05 unverdicted novelty 6.0

A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.
Scaling Self-Play with Self-Guidance
cs.LG 2026-04 unverdicted novelty 6.0

SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.
AlphaCNOT: Learning CNOT Minimization with Model-Based Planning
cs.AI 2026-04 unverdicted novelty 6.0

AlphaCNOT combines reinforcement learning with Monte Carlo Tree Search planning to reduce CNOT gate counts by up to 32% versus heuristics in quantum circuit synthesis.
Computer Architecture's AlphaZero Moment: Automated Discovery in an Encircled World
cs.AR 2026-03 conditional novelty 6.0

Automated architectural discovery engines can outperform human design teams by exploring massive design spaces and compressing development cycles from months to weeks.
Probabilistic Language Tries: A Unified Framework for Compression, Decision Policies, and Execution Reuse
cs.LG 2026-03 unverdicted novelty 6.0

Probabilistic language tries unify compression, sequential decision making, and inference caching by making explicit the prefix structure of any generative model over sequences.
Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning
cs.RO 2026-02 unverdicted novelty 6.0

R&B-EnCoRe uses self-supervised importance-weighted variational inference to distill action-predictive reasoning datasets that improve VLA performance on manipulation, navigation, and driving tasks without external verifiers.
Toward Training Superintelligent Software Agents through Self-Play SWE-RL
cs.SE 2025-12 unverdicted novelty 6.0

Self-play RL on bug injection and repair in sandboxed repositories yields +10.4 and +7.8 point gains on SWE-bench Verified and Pro while outperforming human-data baselines.
Learning to Plan, Planning to Learn: Adaptive Hierarchical RL-MPC for Sample-Efficient Decision Making
cs.LG 2025-12 unverdicted novelty 6.0

An adaptive RL-MPC framework uses RL to inform MPPI sampling and aggregates MPPI samples for value estimation, delivering up to 72% higher success rates and 2.1x faster convergence on tasks like race driving and Lunar...
Olmo 3
cs.CL 2025-12 accept novelty 6.0

Olmo 3 delivers fully open 7B and 32B language models with complete training artifacts, positioning the 32B variant as the strongest open thinking model released to date.
Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning
cs.LG 2025-10 unverdicted novelty 6.0

MINTO sets bootstrapped targets to the minimum of online and target network estimates, yielding faster stable value learning across online/offline RL and discrete/continuous actions.
Scalable Option Learning in High-Throughput Environments
cs.LG 2025-08 unverdicted novelty 6.0

SOL is a new hierarchical RL algorithm that reaches 35x higher throughput and outperforms flat agents when trained on 30 billion frames in NetHack while showing positive scaling.
General Agentic Planning Through Simulative Reasoning with World Models
cs.AI 2025-07 conditional novelty 6.0

SiRA uses LLM world models for simulative reasoning to achieve up to 124% higher task completion and 32.2% navigation success versus reactive baselines in web environments.
LIMO: Less is More for Reasoning
cs.CL 2025-02 unverdicted novelty 6.0

LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already ...
Reasoning with Language Model is Planning with World Model
cs.CL 2023-05 unverdicted novelty 6.0

RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.
Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
General Board Game Playing for Education and Research in Generic AI Game Learning
cs.AI 2019-07 unverdicted novelty 6.0

GBG framework standardizes board game AI interfaces and shows a generic TD(λ)-n-tuple agent outperforming MCTS on multiple games for education and research.
Near-optimal Bayesian Solution For Unknown Discrete Markov Decision Process
cs.LG 2019-06 unverdicted novelty 6.0

BUCRL is the first polynomial-time Bayesian algorithm for unknown discrete MDPs with bounded diameter that attains Õ(√(DSAT)) frequentist regret with high probability.
D²Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 5.0

D²Evo mines medium-difficulty anchors from the current model, trains a Questioner to generate matching questions, and jointly optimizes Solver and Questioner for progressive gains, outperforming baselines on math reas...
Evaluating the False Trust Engendered by LLM Explanations
cs.HC 2026-05 unverdicted novelty 5.0

LLM reasoning traces and post-hoc explanations increase false trust in incorrect predictions, whereas contrastive dual explanations enhance users' ability to distinguish correct from incorrect AI outputs.
PAWN: Piece Value Analysis with Neural Networks
cs.LG 2026-04 unverdicted novelty 5.0

A CNN autoencoder that encodes the entire chessboard state improves MLP prediction of relative piece values by 16% MAE reduction to roughly 0.65 pawns using 12 million Stockfish-labeled positions from grandmaster games.
Optimal control of the future via prospective learning with control
stat.ML 2025-11 unverdicted novelty 5.0

Prospective Learning with Control proves ERM asymptotically achieves the Bayes optimal policy in non-stationary reset-free settings and outperforms time-aware RL on a 1D foraging benchmark.
Scalable Multi-Task Learning through Spiking Neural Networks with Adaptive Task-Switching Policy for Intelligent Autonomous Agents
cs.NE 2025-04 unverdicted novelty 5.0

SwitchMT uses adaptive task-switching in deep spiking Q-networks with active dendrites to reduce task interference in multi-task RL, achieving competitive Atari scores without added network complexity.
Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
cs.CV 2025-03 unverdicted novelty 5.0

Time-R1 applies RL with verifiable rewards to post-train LVLMs for temporal video grounding, reaching state-of-the-art results on multiple datasets using only 2.5K samples while also improving general video capabilities.
SEM-CTRL: Semantically Controlled Decoding
cs.CL 2025-03 unverdicted novelty 5.0

SEM-CTRL integrates token-level MCTS with Answer Set Grammars to enforce rich context-sensitive syntactic and semantic constraints on off-the-shelf LLM decoders, enabling guaranteed valid completions.
Reinforcement Learning with Foundation Priors: Let the Embodied Agent Efficiently Learn on Its Own
cs.RO 2023-10 unverdicted novelty 5.0

RLFP and the FAC algorithm combine foundation-model priors for policy, value, and rewards to produce sample-efficient robotic RL that reaches 86% real-robot success after one hour and 100% success on 7/8 Meta-world ta...
Growing Action Spaces
cs.LG 2019-06 unverdicted novelty 5.0

A curriculum of growing action spaces combined with simultaneous off-policy value estimation accelerates learning in large multi-agent action spaces.
EfficientTDMPC: Improved MPC Objectives for Sample-Efficient Continuous Control
cs.LG 2026-05 unverdicted novelty 4.0

EfficientTDMPC extends the TD-MPC family with model ensembles, return averaging, and uncertainty penalties to reach SOTA sample efficiency on hard continuous control benchmarks in low-data regimes.
On Multi-Agent Learning in Team Sports Games
cs.MA 2019-06 unverdicted novelty 3.0

Describes a hierarchical RL method for multi-agent learning in team sports games aiming for human-like agents, reporting preliminary results that show promise.