Chessformer: A Unified Architecture for Chess Modeling

Ashton Anderson; Daniel Monroe; George Eilender; Philip Chalmers; Zhenwei Tang

arxiv: 2605.19091 · v1 · pith:HW3A2YIGnew · submitted 2026-05-18 · 💻 cs.LG

Pith reviewed 2026-05-20 12:24 UTC · model grok-4.3

classification 💻 cs.LG

keywords chess · transformer · positional encoding · human move prediction · chess engine · interpretability · geometric attention · unified architecture

The pith

A single transformer architecture called Chessformer advances chess move prediction, engine strength, and interpretability at the same time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether the three main tasks in chess modeling—maximizing playing strength, predicting human moves, and enabling interpretability—require entirely separate architectures or can be handled by one design. It presents Chessformer as an encoder-only transformer that tokenizes the board into squares and adds a dynamic positional encoding to match the geometry of chess. The work shows this single model reaches new accuracy levels on human play while also strengthening a top engine and supporting direct interpretability. A sympathetic reader would care because the results suggest that fitting model structure to the domain's spatial layout can remove the usual trade-offs between these goals.

Core claim

Chessformer is an encoder-only transformer that represents board squares as tokens, augments self-attention with a novel dynamic positional encoding called Geometric Attention Bias (GAB) that adapts to domain-specific geometry, and predicts actions with an attention-based source-destination policy head. On human move prediction it reaches 57.1 percent accuracy with fewer than a quarter of the parameters of prior work. When integrated into Leela Chess Zero it adds over 100 Elo and secures tournament victories over Stockfish. Its square-token design makes attention patterns and activations directly attributable to individual board squares, supporting granular interpretability analyses.
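As a rough illustration of what a source-destination policy head over square tokens could look like, the sketch below scores every (source square, destination square) pair with a scaled dot product and picks the best legal pair. It is not the paper's implementation: the square indexing (a1 = 0 through h8 = 63), the width `d_model`, the random vectors, and the helper names are all assumptions made for this example.

```python
import math
import random

FILES = "abcdefgh"

def square_name(index):
    """Map a token index 0..63 to algebraic notation (assumed layout: a1=0, b1=1, ..., h8=63)."""
    return FILES[index % 8] + str(index // 8 + 1)

def policy_logits(src_vecs, dst_vecs):
    """Score every (source, destination) square pair with a scaled dot product."""
    scale = 1.0 / math.sqrt(len(src_vecs[0]))
    return [[scale * sum(s * t for s, t in zip(src, dst)) for dst in dst_vecs]
            for src in src_vecs]

def best_move(logits, legal_moves):
    """Return the legal (source, destination) index pair with the highest logit."""
    return max(legal_moves, key=lambda move: logits[move[0]][move[1]])

rng = random.Random(0)
d_model = 8  # hypothetical width, chosen only to keep the example small
# Stand-ins for per-square source and destination projections of the encoder output.
src_vecs = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(64)]
dst_vecs = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(64)]

logits = policy_logits(src_vecs, dst_vecs)   # 64 x 64 grid of move scores
legal = [(12, 28), (12, 20), (6, 21)]        # e2e4, e2e3, g1f3 under the assumed indexing
src, dst = best_move(logits, legal)
print(square_name(src) + square_name(dst))
```

Because each logit is indexed by a pair of squares, a score can be read back as "from this square to that square", which is the property the interpretability claim relies on.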

What carries the argument

Geometric Attention Bias (GAB), a dynamic positional encoding added to self-attention that adapts to the specific geometry and relationships among chessboard squares.
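To make "a bias added to self-attention" concrete, here is a minimal single-head sketch in which a bias matrix is added to the scaled dot-product scores before the softmax. How GAB actually computes its bias from the input is not described on this page, so the bias here is simply passed in; the function names and toy values are assumptions for illustration only.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention(queries, keys, values, bias):
    """Compute softmax(Q K^T / sqrt(d) + bias) V for a single head."""
    d = len(queries[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + bias[i][j]
                  for j, k in enumerate(keys)]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * values[j][c] for j in range(len(values)))
                        for c in range(len(values[0]))])
    return outputs, weights

# Toy case with three tokens: zero queries isolate the effect of the bias term.
queries = [[0.0, 0.0]] * 3
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
bias = [[0.0, 0.0, 5.0],   # token 0 is pushed toward token 2
        [0.0, 0.0, 0.0],   # token 1 attends uniformly
        [0.0, 0.0, 0.0]]
outputs, weights = biased_attention(queries, keys, values, bias)
print([round(w, 3) for w in weights[0]])
```

With square tokens, each row and column of `bias` corresponds to a board square, so a learned bias can be drawn as a board-shaped map per query square.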

If this is right

Human move prediction accuracy reaches 57.1 percent while using substantially fewer parameters than previous leading models.
Integration into a leading open-source engine produces more than 100 Elo of additional strength and tournament wins against top engines.
Square-token tokenization allows attention weights and activations to be traced directly to specific board squares for fine-grained analysis.
Aligning tokenization, positional encoding, and output head with the board's spatial structure yields simultaneous improvements on performance, human compatibility, and transparency.
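For a sense of scale, the standard logistic Elo model converts a rating difference into an expected score. The sketch below applies it to a 100-point gap; the page does not say how the Elo gain was measured, so this only illustrates what such a difference means under that standard formula.

```python
def expected_score(elo_diff):
    """Expected score of the stronger side under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

# A 100-point advantage corresponds to an expected score of roughly 64%.
print(round(expected_score(100), 2))  # 0.64
```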

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same geometric bias approach may transfer to other grid-based or spatial decision domains such as Go or certain video games.
Unified architectures could reduce the engineering cost of building separate systems for strength, prediction, and explanation in complex games.
Direct square-level attributions may help researchers study which board features drive human-like or superhuman decisions.

Load-bearing premise

The gains across prediction accuracy, Elo strength, and interpretability are due to the architecture itself rather than differences in training data, compute, or evaluation setup compared with earlier models.

What would settle it

A controlled replication that trains the strongest prior models on exactly the same data and compute budget as Chessformer and finds no remaining gap in move-matching accuracy or Elo rating.

Figures

Figures reproduced from arXiv: 2605.19091 by Ashton Anderson, Daniel Monroe, George Eilender, Philip Chalmers, Zhenwei Tang.

**Figure 2.** Move-matching accuracy on the ALLIE-AUGMENTED test set. The final input for our human emulation models consists of 64 tokens, a concatenation of representations of the current and n past board states and two strength embeddings of dimension 128. This results in a depth of 12 × (1 + n) + 2 × 128, which is 352 for the n = 7 hyperparameter choice used in our main training and ablation runs. Despite the dimens…
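The per-token depth quoted in the Figure 2 text can be checked with a few lines of arithmetic. The formula and the n = 7 value come from the caption; the parameter names are made up here, and reading the 12 as one channel per piece type and color is an assumption.

```python
def input_depth(n_history, per_board=12, strength_dim=128, n_strength=2):
    """Per-token input depth: per_board channels for each of (1 + n) boards, plus strength embeddings.

    per_board=12 is taken from the caption; interpreting it as 6 piece types x 2 colors
    is an assumption of this sketch.
    """
    return per_board * (1 + n_history) + n_strength * strength_dim

# n = 7 is the setting the caption reports for the main training and ablation runs.
assert input_depth(7) == 352
print(input_depth(7))
```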

**Figure 3.** GAB bias maps in L14H11 of LeelaCF in the early and late game. The GAB bias for this head transitions from modeling a wide range of movement in the early game (left) to king movement in the late game (right).

**Figure 4.** Torch-like pseudocode for GAB. (Adjacent text from Appendix A.4, Transcoder Training: For interpretability purposes, we train a cross-layer transcoder on MLP activations collected from layers 3 and 4 (in other words, the 4th and 5th layers) of an earlier checkpoint of MAIA-3. The transcoder consists of encoders for each layer and decoders going between the two layers (including between each layer and itself), trained on reconstruction a…)

**Figure 5.** Move-matching accuracies of MAIA-3 for pairs of skill levels on the ALLIE-AUGMENTED test set, described in Appendix A.1. [Plot residue: x-axis game rating, 600 to 2800; y-axis move-matching accuracy (%), 45.0 to 62.5; series n = 7 and n = 0.]

**Figure 6.** Human move-matching accuracies on the ALLIE-AUGMENTED test set by number of history positions n (left), position encoding (middle), and scale (right). History information helps most for weaker play, while scale and effective position encodings have a large effect for stronger play. We omit results for n = 31 history positions as they are virtually identical to those for n = 7, and also omit ALLIE-ADAPTIVE-…

**Figure 7.** Human move-matching perplexity on the ALLIE-AUGMENTED test set by number of history positions n (left), position encoding (middle), and scale (right). We omit results for n = 31 history positions as they are virtually identical to those for n = 7.

**Figure 8.** Annotations for features 0-9 of layer 3. L3F0000: Square that the active player can advance a pawn to in order to attack an enemy bishop. L3F0001: Active player’s knight, usually under attack. L3F0002: Square on the side of the board that is controlled by the active player’s rook or queen. L3F0003: Vacant square adjacent to a rook in the corner. L3F0004: An enemy pawn in the corner, in front of the active …

**Figure 9.** Annotations for features 10-19 of layer 3 of MAIA-3. L3F0010: Queenside activation square for active player’s knight in Queen’s gambit structures. L3F0011: Square on b3, b6, f3, or f6 in the opening that have been weakened by the lack of a supporting pawn. L3F0012: Square that the active player’s knight can move to to give check. L3F0013: Enemy rook checking the active player’s king. L3F0014: Active player…

**Figure 10.** Annotations for features 0-9 of layer 4 of MAIA-3. L4F0000: Not interpretable. L4F0001: Active player’s bishop on a strong diagonal, often paired up with a queen. L4F0002: Enemy center pawn targeted for capture in the opening. L4F0003: Square deep in opponent’s territory attacked either by two rooks or a rook and a queen. L4F0004: Either long castling or tension between active player’s f6 pawn and opponen…

**Figure 11.** Annotations for features 10-19 of layer 4 of MAIA-3. L4F0010: Not interpretable. L4F0011: Enemy pawn attacking or threatening to attack an active player’s minor piece. L4F0012: Not fully interpretable; miscellaneous key squares in endgames. L4F0013: Active player’s vulnerable king in the corner. L4F0014: Square that is or will be controlled by enemy pawn, especially if it is close to promotion. L4F0015: N…

**Figure 12.** Layer 4 head 4 MAIA-3 GAB and DPA maps, left and right respectively.

**Figure 13.** Layer 4 head 5 MAIA-3 GAB and DPA maps, left and right respectively.

**Figure 14.** Additional Leela-CF GAB maps from layer 3.

**Figure 15.** Additional Leela-CF DPA maps from layer 3.

Original abstract

Chess has long served as a canonical testbed for artificial intelligence, but modeling approaches for its central tasks have diverged. Maximizing playing strength, predicting human play, and enabling interpretability are typically solved with disparate architectures, and these designs are often misaligned with the geometry of the domain. This raises the natural question of whether these objectives require separate modeling paradigms, or if there exists a single architecture that supports them simultaneously. We introduce Chessformer, a unified architecture that advances the state of the art on all three central goals in chess modeling. Chessformer is an encoder-only transformer that represents board squares as tokens, augments self-attention with a novel dynamic positional encoding called Geometric Attention Bias (GAB) that adapts to domain-specific geometry, and predicts actions with an attention-based source-destination policy head. We evaluate Chessformer on each front. First, we develop MAIA-3, a family of models for human move prediction that reaches 57.1% move-matching accuracy, significantly surpassing the previous state of the art with fewer than a quarter of the parameters. Second, we integrate Chessformer into Leela Chess Zero, a leading open-source engine, adding over 100 Elo of playing strength and resulting in tournament victories over Stockfish in major computer chess competitions. Third, we show that Chessformer's square-token design makes attention patterns and activations directly attributable to board squares, enabling granular interpretability analyses that prior architectures do not naturally support. More broadly, our results demonstrate that aligning a model's tokenization, positional encoding, and output design with the underlying structure of a domain can yield simultaneous gains in performance, human compatibility, and interpretability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Chessformer, an encoder-only transformer for chess modeling that tokenizes board squares, augments self-attention with a novel Geometric Attention Bias (GAB), and employs an attention-based source-destination policy head. It claims this single architecture simultaneously advances the state of the art on human move prediction (57.1% accuracy with fewer than a quarter of prior parameters), playing strength (over 100 Elo gain when integrated into Leela Chess Zero, including tournament wins against Stockfish), and interpretability via direct square-level attribution.

Significance. If the results hold under transparent and matched experimental conditions, the work is significant for demonstrating that domain-aligned tokenization and positional encodings can yield joint gains across performance, human compatibility, and interpretability in a structured domain. The parameter efficiency and successful engine integration provide concrete, reproducible-style evidence that could guide similar unified modeling efforts elsewhere.

major comments (2)

[§4] §4 (Human Move Prediction): the 57.1% move-matching accuracy is presented as surpassing prior SOTA, yet the section provides no explicit comparison table or text detailing prior accuracies, training game counts, or compute budgets relative to the cited baselines; without these controls the attribution of gains to the unified architecture and GAB remains provisional.
[§5] §5 (Engine Integration): the >100 Elo claim and tournament victories over Stockfish are load-bearing for the playing-strength advance, but the manuscript does not report the exact LC0 version/patch, time controls, game counts, or implementation differences versus the baseline engine; this leaves open whether the improvement stems from Chessformer or from unstated experimental variations.

minor comments (2)

[Abstract] The abstract and §3 could more precisely quantify the parameter reduction (e.g., exact prior model sizes) rather than stating 'fewer than a quarter.'
[Interpretability Analysis] Figure captions in the interpretability section would benefit from explicit labels indicating which attention heads or layers are visualized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper to incorporate additional experimental details for improved transparency and reproducibility.

Referee: [§4] §4 (Human Move Prediction): the 57.1% move-matching accuracy is presented as surpassing prior SOTA, yet the section provides no explicit comparison table or text detailing prior accuracies, training game counts, or compute budgets relative to the cited baselines; without these controls the attribution of gains to the unified architecture and GAB remains provisional.

Authors: We agree that an explicit side-by-side comparison strengthens the presentation. In the revised manuscript we have added a new table in Section 4 that reports move-prediction accuracies, training-game counts, and parameter counts for all cited baselines alongside our results. While hardware-specific compute budgets are not uniformly reported in prior work and therefore cannot be matched exactly, we now discuss parameter count and training data volume as the most comparable efficiency metrics and note that Chessformer achieves its accuracy with substantially fewer parameters.
revision: yes

Referee: [§5] §5 (Engine Integration): the >100 Elo claim and tournament victories over Stockfish are load-bearing for the playing-strength advance, but the manuscript does not report the exact LC0 version/patch, time controls, game counts, or implementation differences versus the baseline engine; this leaves open whether the improvement stems from Chessformer or from unstated experimental variations.

Authors: We appreciate the request for precise experimental controls. The revised Section 5 now specifies the exact Leela Chess Zero version and patch, the time controls used for both training and evaluation matches, the total number of games played in the reported tournaments, and a clear description of the integration (only the policy network was replaced; all other engine components remained unchanged). These additions confirm that the Elo gains and tournament results are attributable to the Chessformer policy head.
revision: yes

Circularity Check

0 steps flagged

No significant circularity: claims rest on empirical evaluations rather than self-referential derivations

The paper introduces Chessformer as a new architecture (encoder-only transformer with square-token representation, Geometric Attention Bias, and attention-based policy head) and reports three separate empirical results: 57.1% human-move accuracy, >100 Elo gain when integrated into LC0, and improved interpretability via direct square attribution. None of these outcomes are derived from equations that reduce by construction to fitted parameters, self-defined quantities, or prior self-citations. The work contains no mathematical derivation chain, uniqueness theorems, or ansatzes smuggled via self-reference; performance numbers come from standard training and benchmarking procedures. This is the normal case of an empirical ML paper whose central claims are falsifiable against external data and baselines.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The design rests on standard transformer assumptions plus the introduction of Geometric Attention Bias as a new component whose benefits are demonstrated empirically rather than derived from prior independent evidence.

free parameters (1)

Transformer hyperparameters and GAB scaling factors
Standard model size, attention heads, and any scaling in the geometric bias are tuned to achieve the reported performance numbers.

axioms (1)

domain assumption: Self-attention mechanisms can be effectively augmented with domain-specific geometric biases to capture chessboard structure
This premise underpins the introduction and claimed effectiveness of Geometric Attention Bias.

invented entities (1)

Geometric Attention Bias (GAB) · no independent evidence
purpose: Dynamic positional encoding that adapts self-attention to chess-specific geometry
Newly proposed component whose independent validation outside this work is not provided in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Chessformer is an encoder-only transformer that represents board squares as tokens, augments self-attention with a novel dynamic positional encoding called Geometric Attention Bias (GAB) ... and predicts actions with an attention-based source-destination policy head.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We introduce Chessformer, a unified architecture that advances the state of the art on all three central goals in chess modeling.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages

[1]

Murray Campbell, A

https://transformer- circuits.pub/2023/monosemantic-features/index.html. Murray Campbell, A. Joseph Hoane Jr., and Feng-hsiung Hsu. Deep blue.Artificial Intelligence, 134(1–2):57–83,

work page 2023
[2]

Joseph Hoane, and Feng-hsiung Hsu

doi: 10.1016/S0004-3702(01)00129-1. R´emi Coulom. Whole-history rating: A bayesian rating system for players of time-varying strength. InComputers and Games,

work page doi:10.1016/s0004-3702(01)00129-1
[3]

URLhttps://proceedings.neurips.cc/paper_files/paper/ 2024/file/2b8f4db0464cc5b6e9d5e6bea4b9f308-Paper-Conference.pdf

doi: 10.52202/ 079017-0768. URLhttps://proceedings.neurips.cc/paper_files/paper/ 2024/file/2b8f4db0464cc5b6e9d5e6bea4b9f308-Paper-Conference.pdf. Steven J. Edwards. Standard: Portable game notation specification and implementation guide,

work page 2024
[4]

URLhttps://ia802908.us.archive.org/26/items/ pgn-standard-1994-03-12/PGN_standard_1994-03-12.txt. Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taiga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, Aviral Kumar, and Rishabh Agarwal. Stop regressing: Training value functions via classification for scal...

work page 1994
[6]

2023 , archivePrefix=

URLhttps://arxiv.org/abs/2305.01610. Karim Hamade, Reid McIlroy-Young, Siddhartha Sen, Jon Kleinberg, and Ashton Anderson. Designing skill-compatible AI: Methodologies and frameworks in chess. InThe Twelfth International Conference on Learning Representations,

work page arXiv
[7]

Deep residual learning for im- age recognition

doi: 10.1109/CVPR.2016.90. Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7132–7141,

work page doi:10.1109/cvpr.2016.90 2016
[8]

doi: 10.1109/CVPR.2018. 00745. 11 Published as a conference paper at ICLR 2026 Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. In Ricardo Silva and Amir Globerson (eds.),34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, 3...

work page doi:10.1109/cvpr.2018 2018
[9]

URLhttps://proceedings.neurips.cc/paper_files/paper/ 2024/file/37d9f19150fce07bced2a81fc87d47a6-Paper-Conference.pdf

doi: 10.52202/ 079017-0987. URLhttps://proceedings.neurips.cc/paper_files/paper/ 2024/file/37d9f19150fce07bced2a81fc87d47a6-Paper-Conference.pdf. Adam Karvonen. Emergent world models and latent variable estimation in chess-playing language models. InFirst Conference on Language Modeling, August

work page 2024
[10]

Understanding how chess-playing language models compute linear board representations

Aaron Mei. Understanding how chess-playing language models compute linear board representations. InICML 2025 Workshop on Methods and Opportunities at Small Scale,

work page 2025
[11]

Anian Ruoss, Gr ´egoire Del ´etang, Sourabh Medapati, Jordi Grau-Moya, Li Kevin Wenliang, Elliot Catt, John Reid, Cannada A

Accessed: 2025-11-29. Anian Ruoss, Gr ´egoire Del ´etang, Sourabh Medapati, Jordi Grau-Moya, Li Kevin Wenliang, Elliot Catt, John Reid, Cannada A. Lewis, Joel Veness, and Tim Genewein. Amortized planning with large-scale transformers: A case study on chess. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.),Advance...

work page 2025
[12]

URLhttps://proceedings.neurips.cc/paper_files/paper/ 2024/file/78f0db30c39c850de728c769f42fc903-Paper-Conference.pdf

doi: 10.52202/ 079017-2102. URLhttps://proceedings.neurips.cc/paper_files/paper/ 2024/file/78f0db30c39c850de728c769f42fc903-Paper-Conference.pdf. David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm th...

work page 2024
[13]

Stockfish testing framework.https://tests.stockfishchess.org/ tests

12 Published as a conference paper at ICLR 2026 Stockfish Team. Stockfish testing framework.https://tests.stockfishchess.org/ tests. Accessed: 2025-11-22. Stockfish Team. Stockfish 15.https://stockfishchess.org/blog/2022/ stockfish-15/, April

work page 2026
[14]

Stockfish Team

Accessed: 2025-11-19. Stockfish Team. Stockfish 17.https://stockfishchess.org/blog/2024/ stockfish-17/, September

work page 2025
[15]

Stockfish Team

Accessed: 2025-11-19. Stockfish Team. Regression tests.https://github.com/official-stockfish/ Stockfish/wiki/Regression-Tests,

work page 2025
[16]

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu

Accessed: 2026-05-13. Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomput., 568(C), mar

work page 2026
[17]

Neurocomputing 568, 127063

ISSN 0925-2312. doi: 10.1016/j.neucom.2023.127063. URLhttps://doi.org/10.1016/j. neucom.2023.127063. Zhenwei Tang, Difan Jiao, Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, and Ashton Anderson. Maia-2: A unified model for human-ai alignment in chess. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.),Advances ...

work page doi:10.1016/j.neucom.2023.127063 2023
[18]

URLhttps://proceedings.neurips.cc/paper_files/paper/ 2024/file/250190819ff1dda47cd23cecc0c5a69b-Paper-Conference.pdf

doi: 10.52202/ 079017-0659. URLhttps://proceedings.neurips.cc/paper_files/paper/ 2024/file/250190819ff1dda47cd23cecc0c5a69b-Paper-Conference.pdf. Zhenwei Tang, Difan Jiao, Eric Xue, Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, and Ashton Anderson. Learning to imitate with less: Efficient individual behavior modeling in chess. InInternational Confere...

work page arXiv 2024
[19]

Human- aligned chess with a bit of search

Yiming Zhang, Athul Jacob, Vivian Lai, Daniel Fried, and Daphne Ippolito. Human- aligned chess with a bit of search. In Y . Yue, A. Garg, N. Peng, F. Sha, and R. Yu (eds.),International Conference on Representation Learning, volume 2025, pp. 4815–4836,

work page 2025
[20]

URLhttps://proceedings.iclr.cc/paper_files/paper/2025/file/ 0ef1afa0daa888d695dcd5e9513bafa3-Paper-Conference.pdf. 13 Published as a conference paper at ICLR 2026 A IMPLEMENTATIONDETAILS A.1 HUMANEMULATION All human-prediction models were trained with the AdamW optimizer on the dataset described in Section

work page 2025
[21]

We organize the raw game data into chunks of 20,000 games

For each game, we compute the average Elo of the two players and assign the game to the corresponding bin. We organize the raw game data into chunks of 20,000 games. For each chunk, we iterate through games sequentially and distribute them into bins until each bin accumulates 10 games. The process terminates when either all games in the chunk are consumed...

work page 2025
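The chunked, rating-balanced sampling described in the entry above can be sketched as follows. The chunk size of 20,000 and the cap of 10 games per bin come from the text; the 100-point bin width, the tuple layout, and the function names are assumptions, and the early-termination rule is left out because the source text is truncated before it is fully stated.

```python
from collections import defaultdict

def bin_index(white_elo, black_elo, bin_width):
    """Assign a game to a bin by the average Elo of its two players."""
    return int((white_elo + black_elo) / 2) // bin_width

def sample_chunk(chunk, bin_width=100, per_bin=10):
    """Walk one chunk in order, keeping at most per_bin games in each rating bin."""
    bins = defaultdict(list)
    for white_elo, black_elo, game_id in chunk:
        b = bin_index(white_elo, black_elo, bin_width)
        if len(bins[b]) < per_bin:
            bins[b].append(game_id)
    return bins

def sample_games(games, chunk_size=20000, bin_width=100, per_bin=10):
    """Apply the per-chunk cap to consecutive chunks and collect the kept game ids."""
    selected = []
    for start in range(0, len(games), chunk_size):
        bins = sample_chunk(games[start:start + chunk_size], bin_width, per_bin)
        for b in sorted(bins):
            selected.extend(bins[b])
    return selected

# Toy data: 25 games near 1500 and 3 games near 2050, as (white_elo, black_elo, game_id).
games = [(1500, 1500, i) for i in range(25)] + [(2000, 2100, 100 + i) for i in range(3)]
print(len(sample_games(games)))
```

Capping each bin per chunk keeps common rating ranges from dominating the sample while still admitting every game from sparsely populated ranges.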
[22]

Following the Leela Chess Zero setup, checkpoints were …

Table 4: Human Move-Matching Training Configuration for Reproducibility

| Parameter | Value |
| --- | --- |
| Training setup: batch size train | 128 |
| Training setup: batch size val | 16 |
| Training setup: gradient accumulation steps | 4 |
| Training setup: num workers | 8 |
| Optimization: lr | 5×10^−5 |
| Optimization: min lr | 1×10^−5 |
| Optimization: wd (weight decay) | 1×10^−6 |
| Optimization: grad c... | |

work page 2026
[23]

Each was trained for 1.4 million steps on a single A100 GPU with a batch size of 2048 in approximately four days

The 2.5M model has embedding dimension and MLP dimension 192, with all else held constant. Each was trained for 1.4 million steps on a single A100 GPU with a batch size of 2048 in approximately four days. The learning rate was held constant at 5×10^−4. A.3 SPECIAL MOVES A source and destination square are suff...

work page 2048
[24]

Figure 4: Torch-like pseudocode for GAB. A.4 TRANSCODERTRAINING For interpretability purposes, we train a cross-layer transcoder on MLP activations collected from layers 3 and 4 (in other words, the 4th and 5th layers) of an earlier checkpoint of MAIA-3. The transcoder consists of encoders for each layer and decoders going between the two layers (includin...

work page 2023
[25]

We use the same data to sample the top-activating tokens for each feature

For training data, we use blitz games from lichess played during July 2019, filtered in the exact same way as in our base model training pipeline. We use the same data to sample the top-activating tokens for each feature. At the end of training, our transcoder achieves a reconstruction MSE of 1.6% and a sparsity of 0.90. B IMPLEMENTATIONDETAILS FORTOURNAM...

work page 2019
[26]

divide and conquer

The bias for querying square (i_1, j_1) and key square (i_2, j_2) is thus f(i_2 − i_1, j_2 − j_1), where f_{a,b} is defined for −7 ≤ a, b ≤ 7. This adds 15 × 15 parameters per attention head. D TOKENIZATION A number of tokenization schemes have been proposed for chess. We review some of these and attempt to give insight into why our recipe, a square-based representation with a st...

work page 2025
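The relative bias described in the entry above, one learned value per (rank offset, file offset) pair, can be expanded into a 64 × 64 attention bias as in the sketch below. This illustrates the static table from that passage, not GAB itself; the square indexing (index = 8 × rank + file) and the placeholder table values are assumptions.

```python
def square_coords(index):
    """Split a square index 0..63 into (rank, file), assuming index = 8 * rank + file."""
    return index // 8, index % 8

def expand_relative_bias(table):
    """Build a 64 x 64 bias where entry [q][k] is f(i2 - i1, j2 - j1).

    table[a + 7][b + 7] stores f_{a,b} for -7 <= a, b <= 7 (15 x 15 values per head).
    """
    bias = [[0.0] * 64 for _ in range(64)]
    for q in range(64):
        i1, j1 = square_coords(q)
        for k in range(64):
            i2, j2 = square_coords(k)
            bias[q][k] = table[i2 - i1 + 7][j2 - j1 + 7]
    return bias

# Placeholder table with a distinct value per offset so lookups are easy to check.
table = [[float(15 * a + b) for b in range(15)] for a in range(15)]
bias = expand_relative_bias(table)

# Two square pairs with the same offset (+1 rank, +1 file) share one parameter.
print(bias[0][9] == bias[54][63])
```

Because the table depends only on the offset between squares, the same 225 values are reused for every query square regardless of the position on the board.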
[27]

18 Published as a conference paper at ICLR 2026 Table 7: Position encoding ablations for human emulation

The impact of scale on modeling performance is several times higher for strong play than it is for weak play.

Table 7: Position encoding ablations for human emulation.

| Position encoding | Policy loss | Value loss | Policy accuracy (%) | Value accuracy (%) | FLOPs | #Params |
| --- | --- | --- | --- | --- | --- | --- |
| Absolute | 1.418 | 0.754 | 54.7±0.1 | 62.6±0.1 | 268M | 4.58M |
| Relative bias | 1.420 | 0.754 | 54.6±0.1 | 6... | | |

work page 2026
[28]

L3F0001: Active player’s knight, usually under attack

20 Published as a conference paper at ICLR 2026 F ADDITIONALRESULTS F.1 TOP-ACTIVATEDTOKENS FORTRANSCODER Figure 8:Annotations for features 0-9 of layer 3.L3F0000: Square that the active player can advance a pawn to in order to attack an enemy bishop. L3F0001: Active player’s knight, usually under attack. L3F0002: Square on the side of the board that is c...

work page 2026
[29]

26 Published as a conference paper at ICLR 2026 Figure 15: Additional Leela-CF DPA maps from layer

work page 2026

[1] [1]

Murray Campbell, A

https://transformer- circuits.pub/2023/monosemantic-features/index.html. Murray Campbell, A. Joseph Hoane Jr., and Feng-hsiung Hsu. Deep blue.Artificial Intelligence, 134(1–2):57–83,

work page 2023

[2] [2]

Joseph Hoane, and Feng-hsiung Hsu

doi: 10.1016/S0004-3702(01)00129-1. R´emi Coulom. Whole-history rating: A bayesian rating system for players of time-varying strength. InComputers and Games,

work page doi:10.1016/s0004-3702(01)00129-1

[3] [3]

URLhttps://proceedings.neurips.cc/paper_files/paper/ 2024/file/2b8f4db0464cc5b6e9d5e6bea4b9f308-Paper-Conference.pdf

doi: 10.52202/ 079017-0768. URLhttps://proceedings.neurips.cc/paper_files/paper/ 2024/file/2b8f4db0464cc5b6e9d5e6bea4b9f308-Paper-Conference.pdf. Steven J. Edwards. Standard: Portable game notation specification and implementation guide,

work page 2024

[4] [4]

URLhttps://ia802908.us.archive.org/26/items/ pgn-standard-1994-03-12/PGN_standard_1994-03-12.txt. Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taiga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, Aviral Kumar, and Rishabh Agarwal. Stop regressing: Training value functions via classification for scal...

work page 1994

[5] [6]

2023 , archivePrefix=

URLhttps://arxiv.org/abs/2305.01610. Karim Hamade, Reid McIlroy-Young, Siddhartha Sen, Jon Kleinberg, and Ashton Anderson. Designing skill-compatible AI: Methodologies and frameworks in chess. InThe Twelfth International Conference on Learning Representations,

work page arXiv

[6] [7]

Deep residual learning for im- age recognition

doi: 10.1109/CVPR.2016.90. Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7132–7141,

work page doi:10.1109/cvpr.2016.90 2016

[7] [8]

doi: 10.1109/CVPR.2018. 00745. 11 Published as a conference paper at ICLR 2026 Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. In Ricardo Silva and Amir Globerson (eds.),34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, 3...

work page doi:10.1109/cvpr.2018 2018

[8] [9]

URLhttps://proceedings.neurips.cc/paper_files/paper/ 2024/file/37d9f19150fce07bced2a81fc87d47a6-Paper-Conference.pdf

doi: 10.52202/ 079017-0987. URLhttps://proceedings.neurips.cc/paper_files/paper/ 2024/file/37d9f19150fce07bced2a81fc87d47a6-Paper-Conference.pdf. Adam Karvonen. Emergent world models and latent variable estimation in chess-playing language models. InFirst Conference on Language Modeling, August

work page 2024

[9] [10]

Understanding how chess-playing language models compute linear board representations

Aaron Mei. Understanding how chess-playing language models compute linear board representations. In ICML 2025 Workshop on Methods and Opportunities at Small Scale,

work page 2025

[10] [11]

Anian Ruoss, Grégoire Delétang, Sourabh Medapati, Jordi Grau-Moya, Li Kevin Wenliang, Elliot Catt, John Reid, Cannada A

Accessed: 2025-11-29. Anian Ruoss, Grégoire Delétang, Sourabh Medapati, Jordi Grau-Moya, Li Kevin Wenliang, Elliot Catt, John Reid, Cannada A. Lewis, Joel Veness, and Tim Genewein. Amortized planning with large-scale transformers: A case study on chess. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Advance...

work page 2025

[11] [12]

URL https://proceedings.neurips.cc/paper_files/paper/2024/file/78f0db30c39c850de728c769f42fc903-Paper-Conference.pdf

doi: 10.52202/079017-2102. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/78f0db30c39c850de728c769f42fc903-Paper-Conference.pdf. David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm th...

work page 2024

[12] [13]

Stockfish testing framework. https://tests.stockfishchess.org/tests

Stockfish Team. Stockfish testing framework. https://tests.stockfishchess.org/tests. Accessed: 2025-11-22. Stockfish Team. Stockfish 15. https://stockfishchess.org/blog/2022/stockfish-15/, April

work page 2026

[13] [14]

Stockfish Team

Accessed: 2025-11-19. Stockfish Team. Stockfish 17. https://stockfishchess.org/blog/2024/stockfish-17/, September

work page 2025

[14] [15]

Stockfish Team

Accessed: 2025-11-19. Stockfish Team. Regression tests. https://github.com/official-stockfish/Stockfish/wiki/Regression-Tests,

work page 2025

[15] [16]

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu

Accessed: 2026-05-13. Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomput., 568(C), March

work page 2026

[16] [17]

Neurocomputing 568, 127063

ISSN 0925-2312. doi: 10.1016/j.neucom.2023.127063. URL https://doi.org/10.1016/j.neucom.2023.127063. Zhenwei Tang, Difan Jiao, Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, and Ashton Anderson. Maia-2: A unified model for human-AI alignment in chess. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Advances ...

work page doi:10.1016/j.neucom.2023.127063 2023

[17] [18]

URL https://proceedings.neurips.cc/paper_files/paper/2024/file/250190819ff1dda47cd23cecc0c5a69b-Paper-Conference.pdf

doi: 10.52202/079017-0659. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/250190819ff1dda47cd23cecc0c5a69b-Paper-Conference.pdf. Zhenwei Tang, Difan Jiao, Eric Xue, Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, and Ashton Anderson. Learning to imitate with less: Efficient individual behavior modeling in chess. In International Confere...

work page arXiv 2024

[18] [19]

Human-aligned chess with a bit of search

Yiming Zhang, Athul Jacob, Vivian Lai, Daniel Fried, and Daphne Ippolito. Human-aligned chess with a bit of search. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (eds.), International Conference on Representation Learning, volume 2025, pp. 4815–4836,

work page 2025

[19] [20]

URL https://proceedings.iclr.cc/paper_files/paper/2025/file/0ef1afa0daa888d695dcd5e9513bafa3-Paper-Conference.pdf.

A IMPLEMENTATION DETAILS

A.1 HUMAN EMULATION

All human-prediction models were trained with the AdamW optimizer on the dataset described in Section

work page 2025

[20] [21]

We organize the raw game data into chunks of 20,000 games

For each game, we compute the average Elo of the two players and assign the game to the corresponding bin. We organize the raw game data into chunks of 20,000 games. For each chunk, we iterate through games sequentially and distribute them into bins until each bin accumulates 10 games. The process terminates when either all games in the chunk are consumed...
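The binning procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name `balanced_sample`, the `bin_edges` list, and the tuple layout of a game are hypothetical, and the early-exit rule (stop once every bin is full) is an assumption, since the source text is truncated at that point.

```python
import bisect


def balanced_sample(chunk, bin_edges, per_bin=10):
    """Distribute games from one chunk into Elo bins, up to `per_bin` each.

    chunk:     list of (white_elo, black_elo, payload) tuples (hypothetical layout).
    bin_edges: sorted list of Elo boundaries (hypothetical); yields len + 1 bins.
    """
    bins = [[] for _ in range(len(bin_edges) + 1)]
    for white_elo, black_elo, payload in chunk:
        # Bin by the average rating of the two players.
        avg_elo = (white_elo + black_elo) / 2
        idx = bisect.bisect_right(bin_edges, avg_elo)
        if len(bins[idx]) < per_bin:
            bins[idx].append(payload)
        # Assumed stopping rule: quit early once every bin has reached its quota.
        if all(len(b) >= per_bin for b in bins):
            break
    return bins
```

In the setting described in the text, `chunk` would hold 20,000 games and `per_bin` would be 10; games that fall into an already full bin are skipped.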

work page 2025

[21] [22]

Following the Leela Chess Zero setup, checkpoints were

Table 4: Human Move-Matching Training Configuration for Reproducibility

Parameter                      Value
Training Setup
  batch size train             128
  batch size val               16
  gradient accumulation steps  4
  num workers                  8
Optimization
  lr                           5×10⁻⁵
  min lr                       1×10⁻⁵
  wd (weight decay)            1×10⁻⁶
  grad c...

work page 2026

[22] [23]

Each was trained for 1.4 million steps on a single A100 GPU with a batch size of 2048 in approximately four days

The 2.5M model has embedding dimension and MLP dimension 192, with all else held constant. Each was trained for 1.4 million steps on a single A100 GPU with a batch size of 2048 in approximately four days. The learning rate was held constant at 5×10⁻⁴.

A.3 SPECIAL MOVES

A source and destination square are suff...

work page 2048

[23] [24]

Figure 4: Torch-like pseudocode for GAB.

A.4 TRANSCODER TRAINING

For interpretability purposes, we train a cross-layer transcoder on MLP activations collected from layers 3 and 4 (in other words, the 4th and 5th layers) of an earlier checkpoint of MAIA-3. The transcoder consists of encoders for each layer and decoders going between the two layers (includin...

work page 2023

[24] [25]

We use the same data to sample the top-activating tokens for each feature

For training data, we use blitz games from Lichess played during July 2019, filtered in exactly the same way as in our base model training pipeline. We use the same data to sample the top-activating tokens for each feature. At the end of training, our transcoder achieves a reconstruction MSE of 1.6% and a sparsity of 0.90.

B IMPLEMENTATION DETAILS FOR TOURNAM...

work page 2019

[25] [26]

divide and conquer

The bias for query square $(i_1, j_1)$ and key square $(i_2, j_2)$ is thus $f_{(i_2 - i_1,\, j_2 - j_1)}$, where $f_{a,b}$ is defined for $-7 \le a, b \le 7$. This adds $15 \times 15$ parameters per attention head.

D TOKENIZATION

A number of tokenization schemes have been proposed for chess. We review some of these and attempt to give insight into why our recipe, a square-based representation with a st...
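To make the relative-offset bias table concrete, the sketch below expands a 15×15 table of per-offset values into the 64×64 additive bias that one attention head would see. It illustrates only the static relative bias defined by the formula above, not GAB itself. The helper names `make_table` and `relative_bias`, the row-major square indexing, and the random initialization are assumptions made for illustration.

```python
import random

BOARD = 8             # squares per rank and per file
SPAN = 2 * BOARD - 1  # offsets range over -7..7, so 15 values per axis


def make_table(seed=0):
    """Stand-in for one head's learned 15x15 table f[a][b] (random values here)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(SPAN)] for _ in range(SPAN)]


def relative_bias(table):
    """Expand the offset table into a 64x64 (query, key) bias matrix."""
    bias = [[0.0] * (BOARD * BOARD) for _ in range(BOARD * BOARD)]
    for q in range(BOARD * BOARD):
        i1, j1 = divmod(q, BOARD)  # assumed row-major square index
        for k in range(BOARD * BOARD):
            i2, j2 = divmod(k, BOARD)
            # Shift offsets from -7..7 into 0..14 to index the table.
            bias[q][k] = table[i2 - i1 + BOARD - 1][j2 - j1 + BOARD - 1]
    return bias
```

Each entry of the resulting matrix would be added to the corresponding attention logit before the softmax. Because the value depends only on the offset between the two squares, any two query-key pairs with the same displacement share a parameter, which is why the table needs only 15 × 15 = 225 entries per head.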

work page 2025

[26] [27]

Table 7: Position encoding ablations for human emulation

The impact of scale on modeling performance is several times higher for strong play than it is for weak play.

Table 7: Position encoding ablations for human emulation.

               Loss            Accuracy (%)
               Policy  Value   Policy     Value      FLOPs  #Params
Absolute       1.418   0.754   54.7±0.1   62.6±0.1   268M   4.58M
Relative bias  1.420   0.754   54.6±0.1   6...

work page 2026

[27] [28]

L3F0001: Active player’s knight, usually under attack

F ADDITIONAL RESULTS

F.1 TOP-ACTIVATED TOKENS FOR TRANSCODER

Figure 8: Annotations for features 0-9 of layer 3. L3F0000: Square that the active player can advance a pawn to in order to attack an enemy bishop. L3F0001: Active player’s knight, usually under attack. L3F0002: Square on the side of the board that is c...

work page 2026

[28] [29]

Figure 15: Additional Leela-CF DPA maps from layer

work page 2026