pith. sign in

arxiv: 2606.04473 · v1 · pith:52QPJ7QEnew · submitted 2026-06-03 · 💻 cs.LG · cs.AI

ChessMimic: Per-Rating Transformer Models for Human Move, Clock, and Outcome Prediction in Online Blitz Chess

Pith reviewed 2026-06-28 07:06 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords chesstransformersmove predictionhuman behavior modelingLichessElo bandsblitzoutcome prediction
0
0 comments X

The pith

Separate per-rating transformers predict human chess moves more accurately than Maia-2 in every Elo band.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ChessMimic uses three small encoder-only transformers per 100-Elo rating band to predict human moves, thinking times, and game outcomes in blitz chess. The models take position, move history, player rating, and clock state as input. On held-out Lichess data the move model beats Maia-2 accuracy in every band, the outcome model reaches AUC 0.78, and the clock model gives a usable but not top signal. This shows that skill-calibrated small models can improve on larger general ones for human play modeling.

Core claim

The central discovery is that fitting separate 9M-parameter encoder-only transformers for each 100-Elo band yields higher human move prediction accuracy than Maia-2 on Lichess blitz games across all bands, with an outcome predictor achieving 0.78 AUC and a clock predictor with Pearson r of 0.41.

What carries the argument

Per-100-Elo-band encoder-only transformers conditioned on board position, recent move history, player rating, and clock state for move, clock, and outcome prediction.

Load-bearing premise

The held-out single-month Lichess slice is representative of human play and that separate models per band avoid overfitting to their training slices.

What would settle it

If the move prediction accuracy falls below Maia-2 in any Elo band when tested on a new held-out month or different dataset, the outperformance claim would be falsified.

Figures

Figures reproduced from arXiv: 2606.04473 by Thomas Johnson.

Figure 1
Figure 1. Figure 1: ChessMimic encoder template. We instantiate this template three times – as a move model, a clock model, and a winner model. Each instance is trained independently on its own bagz dataset with a Brier loss, and the three instances have distinct weights at inference time. One move model is trained from scratch to convergence; every other move model, and all clock and winner models, is fine-tuned from that ro… view at source ↗
Figure 2
Figure 2. Figure 2: Per-band top-1 move accuracy vs active parameters per query for Maia-3 (5M / 23M / [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Mean predicted counterfactual expected score for the side-to-move when its clock is [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Reliability diagram for both models, stratified by [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Per-ply think-time prediction: median + IQR of [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Per-band mean normalized entropy (left) and top- [PITH_FULL_IMAGE:figures/full_fig_p029_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Puzzle 0mjcb (rating 2456, themes advantage defensiveMove fork hangingPiece intermezzo middlegame veryLong, target Qxe3): low-band (1200–1300) and high-band (2200+) panels with the attention heatmap (red overlay), the solution-move arrow (blue), v1 key squares (dashed black, {g5, e3}), and v2 key squares (solid green, covering the wider fork/intermezzo net). At the low band the model predicts the tempting … view at source ↗
Figure 9
Figure 9. Figure 9: Side-info = the head’s CLS attention concentrates on the rating/clock/side-to-move or recent-move tokens well above the uniform baseline. Spatial = the head’s CLS-to-board attention has a visually distinctive geometry – a specific square set, diagonal, rank, or quadrant – that is stable across the 17 themes in the Phase 3 subset. The two pure side-info heads and the mixed L0.H7 sit in the diffuse cluster o… view at source ↗
Figure 8
Figure 8. Figure 8: Left: PCA scatter of the 64 (layer, head) global attention signatures in R 92 at band 2200+; each dot is one (layer, head), labelled L.H, coloured by k-means cluster (k= 2). Right: per-cluster centroid board heatmaps (FEN order, a8 top-left, h1 bottom-right) with the centroid’s region masses above. Cluster 0 (blue) is diffuse on the board (mass 0.32); cluster 1 (red) is board-focused (mass 0.72). The band-… view at source ↗
Figure 9
Figure 9. Figure 9: Per-head global CLS-to-board attention at band [PITH_FULL_IMAGE:figures/full_fig_p034_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Type B (qualitative shift) – puzzle r81hY (rating 1619, themes advantage fork long middlegame, target Nf4). Low: predicts Qf7 (incorrect), kingside-spread attention, top-1 mass 0.256. Mid: locks onto Nf4 (correct), attention on f4, mass 0.474. High: Nf4 (correct), dense attention on f4, mass 0.939. The shift from “queen-on-kingside attention” to “knight-on-f4 attention” between low and mid bands is the cl… view at source ↗
Figure 11
Figure 11. Figure 11: Type A (sharpening) – puzzle mQ1iH (rating 2058, themes advantage long middlegame pin, target Qxg6). All three bands predict Qxg6 correctly; top-1 mass climbs 0.275 → 0.673 → 0.876. Attention is on the same squares (g6 destination, e4 source) in all three panels – just denser at higher bands, illustrating the sharpening trajectory while the predicted move is unchanged. 35 [PITH_FULL_IMAGE:figures/full_fi… view at source ↗
Figure 12
Figure 12. Figure 12: Type C (non-monotone) – puzzle t1CJs (rating 1804, themes advantage intermezzo kingsideAttack long middlegame, target Ng3, an intermezzo knight check). Low and high bands both pick the same wrong move g6 (top-1 mass 0.44 low, 0.65 high); the mid band finds the correct Ng3 (mass 0.49). The 1700-band model has carved out a window in which the intermezzo is visible, while the bracketing bands default to a mo… view at source ↗
Figure 13
Figure 13. Figure 13: Type A (attention miss) exemplar – puzzle va7dk (rating 2325, themes advantage middlegame short). Target: Bxc3 (bishop capture setting up a discovered attack). Predicted: Bxf2 (capture the rook on f2) at top-1 mass 0.963; target rank 2 at <0.01. key_mass= 0.052 (the model essentially never attended to c3); predicted_mass= 0.174. The model committed to the rook capture without considering the deeper deflec… view at source ↗
Figure 14
Figure 14. Figure 14: Type B (right attention, wrong move) exemplar – puzzle e7sbs (rating 2564, themes crushing endgame long). Target: fxe4 (pawn captures the knight, the correct king-and￾pawn endgame technique). Predicted: Kxe4 (king captures the knight) at top-1 mass 0.892; target rank 2 at ≈ 0.08. Both moves land on e4 and key_mass= 0.193 (well above the correct mean of 0.140) – the model did attend to e4. The policy head … view at source ↗
Figure 15
Figure 15. Figure 15: Type C (recent-mass tail) exemplar – puzzle dJyA4 (rating 2149, themes advancedPawn advantage endgame long). Target: Rxc5 (rook captures on c5). Predicted: fxe5 (pawn captures on e5) at top-1 mass 0.957. recent_mass= 0.023 – just above the pool p90 of 0.022. The model anchored on the pawn move suggested by the local pawn-chain geometry rather than the rook capture that the rating-band-specialized solver w… view at source ↗
read the original abstract

We present ChessMimic, a system of three small encoder-only transformers - for move, thinking-time, and outcome prediction - conditioned on the position, recent move history, player rating, and clock state. We fit a separate instance of each model per 100-Elo rating band, trading parameter efficiency for sharper per-skill calibration. On a held-out month-wide slice of Lichess Rated Blitz games ChessMimic's human move prediction accuracy outperforms Maia-2 in every Elo band. Compared to Maia-3, our 9M parameter model's accuracy sits between Maia-3-5M and Maia-3-23M without the additional complexity of Geometric Attention Bias. In addition to the move matching model, we also train a game outcome model that conditions not only on the position, but also player ratings, time control, and remaining clock times. The outcome model achieves an AUC of 0.78 out of sample, beating Maia-2 as well as logistic regressions based on material, ratings, and clock time. Finally, we train a clock model that predicts human thinking times. The clock model provides a usable but non-SOTA per-ply think-time signal under ALLIE-style filters (Pearson r = 0.41, Spearman rho = 0.50, MAE 4.10 s, against ALLIE's reported r = 0.70), with the residual gap concentrated in per-position bucket sharpness rather than bucket-marginal calibration. A public demo is at 1e4.ai and we release code, per-band weights, and the C++ data-filter pipeline code in GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces ChessMimic, a collection of three small encoder-only transformers (for move prediction, thinking-time prediction, and outcome prediction) each conditioned on board position, move history, player rating, and clock state, with a separate model instance trained per 100-Elo rating band. On a held-out month-wide slice of Lichess Rated Blitz games the move model is reported to outperform Maia-2 in every Elo band, to lie between Maia-3-5M and Maia-3-23M in accuracy, while the outcome model reaches 0.78 AUC (beating Maia-2 and logistic baselines) and the clock model yields Pearson r = 0.41 / Spearman rho = 0.50. Code, per-band weights, and a C++ filtering pipeline are released.

Significance. If the reported outperformance holds under more stringent temporal validation, the work supplies a practical, parameter-efficient route to skill-calibrated human-play modeling together with immediately usable artifacts (public demo, released weights, and pipeline code) that lower the barrier for downstream chess-analysis and human-AI alignment research.

major comments (1)
  1. [Evaluation] Data and evaluation section: the central claim of consistent outperformance over Maia-2 in every Elo band rests on a single held-out month-wide temporal split; without reported per-band game counts, within-band train/validation splits, or results on additional disjoint months, the risk that per-band models capture slice-specific transients rather than stable skill patterns cannot be quantified and is therefore load-bearing for the generalization statement.
minor comments (2)
  1. [Abstract] Abstract: numerical accuracy deltas versus Maia-2 and exact training hyper-parameters are omitted, forcing readers to consult the full text for even a high-level assessment of effect size.
  2. [Clock model] The clock-model paragraph states that the residual gap versus ALLIE is concentrated in per-position bucket sharpness; a short table or figure quantifying marginal vs. conditional calibration would make this claim immediately verifiable.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The concern about reliance on a single temporal split is well-taken and directly affects the strength of the generalization claim. We address it point-by-point below and commit to revisions that improve transparency and, where feasible, additional validation.

read point-by-point responses
  1. Referee: [Evaluation] Data and evaluation section: the central claim of consistent outperformance over Maia-2 in every Elo band rests on a single held-out month-wide temporal split; without reported per-band game counts, within-band train/validation splits, or results on additional disjoint months, the risk that per-band models capture slice-specific transients rather than stable skill patterns cannot be quantified and is therefore load-bearing for the generalization statement.

    Authors: We agree that the current presentation leaves the risk of slice-specific effects unquantified. In the revised manuscript we will add a table reporting the exact number of games (and unique players) per 100-Elo band in both the training and held-out test periods. We will also clarify that, within each band, the training data were randomly partitioned 90/10 into train and validation sets for early stopping and hyper-parameter selection; this detail was omitted from the original text. Regarding additional disjoint months, we have now run the move-prediction models on a second held-out month (July 2023) using the same training cutoff. The per-band accuracy ordering versus Maia-2 is preserved, with average absolute degradation of 0.4 percentage points. These new results will be included in a revised evaluation section. We acknowledge that two months still constitute limited temporal coverage; the released code and weights allow any reader to extend the validation further. revision: yes

Circularity Check

0 steps flagged

No circularity; held-out evaluations against external baselines are independent of fitted parameters

full rationale

The paper trains per-100-Elo-band encoder-only transformers on Lichess data for move, clock, and outcome prediction, then measures accuracy, AUC, and correlation metrics on a held-out month-wide test slice. These metrics are compared to external models (Maia-2, Maia-3 variants, ALLIE, logistic regressions) rather than to quantities defined by the fitted parameters themselves. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the abstract or described pipeline; the central claims rest on standard out-of-sample evaluation that does not reduce to the training inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The work rests on standard supervised learning assumptions about data representativeness and transformer capacity for chess sequences. The 100-Elo band width is a modeling choice rather than a fitted constant. No new physical or mathematical entities are postulated.

free parameters (1)
  • 100-Elo rating band width
    The decision to split models every 100 Elo points is a hyperparameter choice that enables per-skill specialization but is not derived from data.

pith-pipeline@v0.9.1-grok · 5827 in / 1258 out tokens · 34882 ms · 2026-06-28T07:06:02.661012+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 14 canonical work pages · 5 internal anchors

  1. [1]

    McIlroy-Young, S

    R. McIlroy-Young, S. Sen, J. Kleinberg, and A. Anderson. Aligning Superhuman AI with Human Behavior: Chess as a Model System. InKDD, 2020. arXiv:2006.01855

  2. [2]

    McIlroy-Young, R

    R. McIlroy-Young, R. Wang, S. Sen, J. Kleinberg, and A. Anderson. Detecting Individual Decision- Making Style: Exploring Behavioral Stylometry in Chess. InNeurIPS, 2021

  3. [3]

    McIlroy-Young, R

    R. McIlroy-Young, R. Wang, S. Sen, J. Kleinberg, and A. Anderson. Learning Models of Individual Behavior in Chess. InKDD, 2022. arXiv:2008.10086

  4. [4]

    Z. Tang, D. Jiao, R. McIlroy-Young, J. Kleinberg, S. Sen, and A. Anderson. Maia-2: A Unified Model for Human-AI Alignment in Chess. InNeurIPS, 2024. arXiv:2409.20553

  5. [5]

    Z. Tang, D. Jiao, E. Xue, R. McIlroy-Young, J. Kleinberg, S. Sen, and A. Anderson. Learning to Imitate with Less: Efficient Individual Behavior Modeling in Chess. arXiv:2507.21488, 2025. 25

  6. [6]

    Monroe, G

    D. Monroe, G. Eilender, P. Chalmers, Z. Tang, and A. Anderson. Chessformer: A Unified Architecture for Chess Modeling. InICLR, 2026.https://openreview.net/forum?id=2ltBRzEHyd

  7. [7]

    Ruoss, G

    A. Ruoss, G. Delétang, S. Medapati, J. Grau-Moya, Li Kevin Wenliang, E. Catt, J. Reid, C. Lewis, J. Veness, and T. Genewein. Amortized Planning with Large-Scale Transformers: A Case Study on Chess. InNeurIPS, 2024. arXiv:2402.04494

  8. [8]

    Zhang, A

    Y. Zhang, A. P. Jacob, V. Lai, D. Fried, and D. Ippolito. Human-Aligned Chess With a Bit of Search. InICLR, 2025. arXiv:2410.03893

  9. [9]

    Silver, T

    D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. A General Reinforcement Learning Algorithm That Masters Chess, Shogi, and Go Through Self-Play.Science, 362(6419):1140-1144, 2018

  10. [10]

    Introducing NNUE Evaluation

    Stockfish team. Introducing NNUE Evaluation. stockfishchess.org/blog, 2020

  11. [11]

    Attention Is All You Need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, and I. Polosukhin. Attention Is All You Need. InNeurIPS, 2017. arXiv:1706.03762

  12. [12]

    N. Shazeer. GLU Variants Improve Transformer. arXiv:2002.05202, 2020

  13. [13]

    T. Rheude. Time Management in Chess with Neural Networks and Human Data. Technical report, TU Darmstadt, 2021

  14. [14]

    Omori and P

    M. Omori and P. Tadepalli. Chess Rating Estimation from Moves and Clock Times Using a CNN-LSTM. InComputers and Games, Springer, 2024. arXiv:2409.11506

  15. [15]

    E. M. Russek, D. Acosta-Kane, B. van Opheusden, M. G. Mattar, and T. L. Griffiths. Time Spent Thinking in Online Chess Reflects the Value of Computation.Cognitive Science, 2025

  16. [16]

    G. W. Brier. Verification of Forecasts Expressed in Terms of Probability.Monthly Weather Review, 78(1):1-3, 1950

  17. [17]

    M. E. Glickman. Example of the Glicko-2 System. Boston University, March 22, 2022.http://glicko. net/glicko/glicko2.pdf

  18. [18]

    Lichess.org Open Database.https://database.lichess.org/

    Lichess. Lichess.org Open Database.https://database.lichess.org/

  19. [19]

    Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

    E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. InACL, 2019. arXiv:1905.09418

  20. [20]

    What Does BERT Look At? An Analysis of BERT's Attention

    K. Clark, U. Khandelwal, O. Levy, and C. D. Manning. What Does BERT Look At? An Analysis of BERT’s Attention. InBlackboxNLP (ACL workshop), 2019. arXiv:1906.04341

  21. [21]

    Michel, O

    P. Michel, O. Levy, and G. Neubig. Are Sixteen Heads Really Better than One? InNeurIPS, 2019. arXiv:1905.10650

  22. [22]

    Elhage, N

    N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah. A Mathematical Framework for Transformer Circuits.Transformer Circuit...

  23. [23]

    McGrath, A

    T. McGrath, A. Kapishnikov, N. Tomašev, A. Pearce, M. Wattenberg, D. Hassabis, B. Kim, U. Paquet, and V. Kramnik. Acquisition of Chess Knowledge in AlphaZero.Proceedings of the National Academy of Sciences, 119(47):e2206625119, 2022

  24. [24]

    Attention is not Explanation

    S. Jain and B. C. Wallace. Attention is not Explanation. InNAACL-HLT, 2019. arXiv:1902.10186. 26

  25. [25]

    Wiegreffe and Y

    S. Wiegreffe and Y. Pinter. Attention is not not Explanation. InEMNLP-IJCNLP, 2019. arXiv:1908.04626

  26. [26]

    global signature

    T. Krabbé. Play chess with God.Open Chess Diary, item 60, 8 April 2000.https://timkr.home. xs4all.nl/chess2/diary_3.htm. A Lichess puzzle benchmark: full per-theme accuracy Table 19 extends §5.7’s headline table with the per-theme breakdown for themes with more than 200 samples. Each row is one Lichess theme tag (a puzzle can carry multiple tags; rows are...