Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences

Benjamin Walker; Terry Lyons

arxiv: 2605.30100 · v1 · pith:DZQLV6DEnew · submitted 2026-05-28 · 💻 cs.LG

Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences

Benjamin Walker , Terry Lyons This is my paper

Pith reviewed 2026-06-29 08:57 UTC · model grok-4.3

classification 💻 cs.LG

keywords chessstate trackingworld modelsrecurrent modelstransformersbenchmarkout-of-distributionboard state prediction

0 comments

The pith

Recurrent models outperform Transformers at tracking exact chess board states from move sequences, with a random-play split revealing scale limitations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Chess-World-Model, a benchmark built from 10 million real chess games that requires models to predict the exact board state after a sequence of legal moves. It pairs a held-out real-game test split with an out-of-distribution split drawn from uniformly random legal play, designed to check whether models learn the actual transition rules or rely on patterns common in human games. Under a matched training protocol, recurrent architectures including Mamba-3 and Gated DeltaNet achieve higher accuracy than a causal Transformer at the 3 million and 8 million parameter scales. Real-game accuracy saturates once models exceed 18 million parameters, yet the random split continues to separate the models up to 40 million parameters. Ablations further show that reducing the expressiveness of state-transition mechanisms hurts performance on the random split for every recurrent model tested.

Core claim

The recurrent models strongly outperform the Transformer at 3 and 8 million parameters. Real-game performance saturates above 18 million parameters, but the random-uniform split remains discriminative up to 40 million, exposing failures otherwise hidden by scale. Additionally, ablations show that less expressive state-transition mechanisms reduce performance on the out-of-distribution split for all three recurrent models.

What carries the argument

Chess-World-Model benchmark that supplies move sequences and requires exact next-board-state prediction, using a real-game split plus a uniformly random legal-play split to compare causal Transformer, block-diagonal SLiCE, Mamba-3, and Gated DeltaNet under identical training conditions.

If this is right

Recurrent models learn chess transition rules more effectively than Transformers when parameter count is limited to 3 or 8 million.
Accuracy on real human games plateaus once models exceed 18 million parameters across the tested architectures.
The uniformly random legal play split continues to differentiate model performance up to 40 million parameters.
Reducing the expressiveness of state-transition matrices lowers accuracy on the random split for all recurrent models tested.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be applied to other state-tracking domains to check whether scale hides similar rule-learning gaps.
Architectural choices for handling state updates may matter more than raw size when the task requires following explicit transition rules.
Models that pass the random split may generalize better to novel planning or simulation tasks outside chess.

Load-bearing premise

The uniformly random legal play split isolates whether models learn the transition rules rather than shortcuts from common human positions, and all models were trained under a truly matched interface and protocol.

What would settle it

A result in which a Transformer reaches the same accuracy as the recurrent models on the random-uniform split once both reach 40 million parameters would show that the split does not expose architecture-specific failures hidden by scale.

Figures

Figures reproduced from arXiv: 2605.30100 by Benjamin Walker, Terry Lyons.

**Figure 2.** Figure 2: CHESS-WORLD-MODEL as an aligned move-to-state prediction task. The input sequence consists of a start token followed by UCI moves. Each input prefix is aligned with the complete chess state obtained after applying that prefix. The target state contains the piece occupying each square, the side to move, castling rights, en passant information, and the halfmove and fullmove counters. 2.3 Move and state encod… view at source ↗

**Figure 3.** Figure 3: Test performance versus parameter count. The left panels show cross-entropy loss, and the [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Effect of reducing the expressivity of the recurrent state-transition mechanism. Filled [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

read the original abstract

World models require state tracking, which is the ability to maintain a correct latent state across action sequences. Existing benchmarks are often synthetic or language-based, limiting their value as tests of structured state updates in realistic domains. We introduce Chess-World-Model, a large-scale state-tracking benchmark built from 10 million real chess games, where models predict the exact board state reached after a sequence of legal moves. Alongside a held-out real-game split, we include an out-of-distribution split from uniformly random legal play, which tests whether models learn the transition rules rather than shortcuts from common human positions. Prior theoretical and empirical work has shown that Transformers struggle to state-track, while input-dependent linear RNNs require expressive state-transition matrices to do so. We therefore benchmark a causal Transformer, block-diagonal SLiCE, Mamba-3, and Gated DeltaNet with negative eigenvalues under a matched interface and training protocol. The recurrent models strongly outperform the Transformer at 3 and 8 million parameters. Real-game performance saturates above 18 million parameters, but the random-uniform split remains discriminative up to 40 million, exposing failures otherwise hidden by scale. Additionally, ablations show that less expressive state-transition mechanisms reduce performance on the out-of-distribution split for all three recurrent models. Together, these results establish Chess-World-Model as a practical large-scale benchmark for state tracking that exposes failures model scale would otherwise conceal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

New chess benchmark shows recurrent models beating transformers on random legal moves, but the OOD split needs checks to confirm it truly isolates rule learning.

read the letter

The paper's main contribution is a 10M-game chess benchmark with a held-out real-game split and a uniform random legal-play split meant to test state tracking without human-position shortcuts. It benchmarks a causal Transformer against block-diagonal SLiCE, Mamba-3, and Gated DeltaNet under matched conditions, reporting that the recurrent models pull ahead at 3M and 8M parameters while real-game accuracy saturates above 18M but the random split stays useful up to 40M. Ablations on less expressive transition matrices also hurt OOD performance.

This is useful because chess gives a structured, verifiable domain with clear state updates, and the scale plus the explicit OOD split is more realistic than most synthetic tests. The saturation pattern on real games versus continued discrimination on random play is a concrete observation worth noting.

The soft spot is the central assumption about the random split. The abstract claims it removes shortcuts from common positions, but without reported statistics on move distributions, piece densities, or encoding artifacts between the two splits, it's possible some residual regularities still allow non-rule solutions. The stress-test note flags exactly this, and the abstract alone does not provide the quantitative verification needed to close it. Matched training protocol is asserted but not detailed enough here to rule out small differences in tokenization or sampling.

The work is aimed at researchers building world models or testing sequence architectures on structured domains. It is coherent on its own terms and shows clear thinking about why existing benchmarks fall short. I would bring it to a reading group for the benchmark construction and the scale comparisons.

It deserves peer review because the benchmark itself is new and the empirical pattern is worth checking with full methods and verification data.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces Chess-World-Model, a benchmark built from 10 million real chess games for evaluating exact board-state prediction after legal move sequences. It includes a held-out real-game split and an OOD split drawn from uniformly random legal play, intended to test transition-rule learning rather than positional shortcuts. Under a matched training protocol, the authors compare a causal Transformer to three recurrent architectures (block-diagonal SLiCE, Mamba-3, Gated DeltaNet) and report that recurrent models strongly outperform the Transformer at 3 M and 8 M parameters; real-game performance saturates above 18 M parameters while the random split remains discriminative up to 40 M parameters. Ablations on state-transition expressiveness are also presented.

Significance. If the central empirical claims hold after verification of the OOD split, the benchmark would supply a realistic, large-scale testbed for state tracking that reveals architectural differences hidden by scale on in-distribution data. The scale (10 M games), the controlled OOD construction, and the explicit comparison of state-transition mechanisms constitute concrete strengths for an empirical contribution in this area.

major comments (1)

[Abstract] Abstract (and the description of the OOD split): the claim that the uniformly random legal-play split isolates transition-rule learning from human-position shortcuts is load-bearing for the reported performance gaps and saturation behavior, yet the manuscript provides no quantitative verification (e.g., move-distribution statistics, piece-density comparisons, or legality-pattern analysis) that residual regularities have been eliminated. Without such evidence the outperformance and scale-discriminative results on the random split cannot be unambiguously attributed to state-tracking capacity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thoughtful review and for highlighting the importance of rigorously verifying the properties of the OOD split. We address the comment below and will strengthen the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract (and the description of the OOD split): the claim that the uniformly random legal-play split isolates transition-rule learning from human-position shortcuts is load-bearing for the reported performance gaps and saturation behavior, yet the manuscript provides no quantitative verification (e.g., move-distribution statistics, piece-density comparisons, or legality-pattern analysis) that residual regularities have been eliminated. Without such evidence the outperformance and scale-discriminative results on the random split cannot be unambiguously attributed to state-tracking capacity.

Authors: We agree that explicit quantitative verification would make the isolation claim more robust and unambiguous. By construction, the OOD split is generated via uniform sampling over legal moves at each step (starting from the initial position), which produces move distributions, piece densities, and position regularities that differ systematically from human games; however, we did not include supporting statistics in the original submission. In the revision we will add: (i) histograms comparing move-type frequencies (e.g., pawn vs. piece moves, captures) between splits, (ii) average piece-density maps across the board, and (iii) analysis of common legality patterns (e.g., frequency of checks or castling). These will be placed in a new subsection of the data section and referenced from the abstract. We believe this addition will directly address the concern while preserving the central empirical findings. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or self-referential predictions

full rationale

The paper introduces an empirical benchmark for state tracking using chess move sequences and compares model families (Transformer vs. recurrent variants) under matched training protocols. No equations, parameter-fitting steps presented as predictions, or derivation chains appear in the provided text. Claims rest on experimental results from real-game and uniform-random splits rather than any self-definition, fitted-input renaming, or load-bearing self-citation. The central performance gaps are externally falsifiable via the described splits and ablations, satisfying the criteria for a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that chess move sequences provide a valid test of general state tracking and that the random legal play split removes position shortcuts. No free parameters or invented entities are introduced.

axioms (1)

domain assumption Legal chess moves define deterministic state transitions that can be exactly computed from move sequences
The benchmark requires models to output the exact board state reached after legal moves; this presupposes that the rules produce unique, verifiable states.

pith-pipeline@v0.9.1-grok · 5783 in / 1322 out tokens · 36107 ms · 2026-06-29T08:57:02.318765+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

6 extracted references · 3 canonical work pages · 3 internal anchors

[1]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Assran, M., Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y . LeCun, and N. Ballas (2023). “Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Assran, M., A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Komeili, M. Muckl...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Gaussian Error Linear Units (GELUs)

Fiekas, N. (2024).python-chess: A Chess Library for Python.URL: https : / / github . com / niklasf/python-chess. Grazzi, R., J. Siems, J. K. H. Franke, A. Zela, F. Hutter, and M. Pontil (2025). “Unlocking State- Tracking in Linear RNNs Through Negative Eigenvalues”. In:Proceedings of the 13th International Conference on Learning Representations (ICLR). Gu...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

GLU Variants Improve Transformer

Meyer-Kahlen, S. and R. Huber (2000).Universal Chess Interface. https://www.shredderchess. com/chess- features/uci- universal- chess- interface.html. UCI protocol specifica- tion. Nanda, N., A. Lee, and M. Wattenberg (2023). “Emergent Linear Representations in World Models of Self-Supervised Sequence Models”. In:Proceedings of the 6th BlackboxNLP Workshop...

work page internal anchor Pith review Pith/arXiv arXiv 2000
[4]

with β1 = 0.9, global gradient clipping at 1.0 (Pascanu et al., 2013), dropout 0.1 (Srivastava et al., 2014), feed-forward multiplier 4, and batch size

2013
[5]

On GPUs supporting tensor cores, TF32 matrix multiplication is enabled

The value of β2 is selected by the sweep. On GPUs supporting tensor cores, TF32 matrix multiplication is enabled. Mamba-3 is trained with bfloat16 mixed precision, while the other model families are trained without mixed-precision autocasting. The first-stage sweep consists of 4×4×4×3×3×3 = 1728 one-epoch runs over model family, model size, learning rate,...

2017
[6]

The diagonal SLiCE ablation replaces the block-diagonal transition with a diagonal transition while keeping the rest of the configuration fixed

Each layer uses a GELU feed-forward MLP (Hendrycks et al., 2016), RMSNorm (Zhang et al., 2019), and dropout on the residual path (Srivastava et al., 2014). The diagonal SLiCE ablation replaces the block-diagonal transition with a diagonal transition while keeping the rest of the configuration fixed. Gated DeltaNet.The main Gated DeltaNet model uses the sa...

2016

[1] [1]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Assran, M., Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y . LeCun, and N. Ballas (2023). “Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Assran, M., A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Komeili, M. Muckl...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Gaussian Error Linear Units (GELUs)

Fiekas, N. (2024).python-chess: A Chess Library for Python.URL: https : / / github . com / niklasf/python-chess. Grazzi, R., J. Siems, J. K. H. Franke, A. Zela, F. Hutter, and M. Pontil (2025). “Unlocking State- Tracking in Linear RNNs Through Negative Eigenvalues”. In:Proceedings of the 13th International Conference on Learning Representations (ICLR). Gu...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

GLU Variants Improve Transformer

Meyer-Kahlen, S. and R. Huber (2000).Universal Chess Interface. https://www.shredderchess. com/chess- features/uci- universal- chess- interface.html. UCI protocol specifica- tion. Nanda, N., A. Lee, and M. Wattenberg (2023). “Emergent Linear Representations in World Models of Self-Supervised Sequence Models”. In:Proceedings of the 6th BlackboxNLP Workshop...

work page internal anchor Pith review Pith/arXiv arXiv 2000

[4] [4]

with β1 = 0.9, global gradient clipping at 1.0 (Pascanu et al., 2013), dropout 0.1 (Srivastava et al., 2014), feed-forward multiplier 4, and batch size

2013

[5] [5]

On GPUs supporting tensor cores, TF32 matrix multiplication is enabled

The value of β2 is selected by the sweep. On GPUs supporting tensor cores, TF32 matrix multiplication is enabled. Mamba-3 is trained with bfloat16 mixed precision, while the other model families are trained without mixed-precision autocasting. The first-stage sweep consists of 4×4×4×3×3×3 = 1728 one-epoch runs over model family, model size, learning rate,...

2017

[6] [6]

The diagonal SLiCE ablation replaces the block-diagonal transition with a diagonal transition while keeping the rest of the configuration fixed

Each layer uses a GELU feed-forward MLP (Hendrycks et al., 2016), RMSNorm (Zhang et al., 2019), and dropout on the residual path (Srivastava et al., 2014). The diagonal SLiCE ablation replaces the block-diagonal transition with a diagonal transition while keeping the rest of the configuration fixed. Gated DeltaNet.The main Gated DeltaNet model uses the sa...

2016