arxiv: 2602.04447 · v2 · submitted 2026-02-04 · 💻 cs.LG · cs.AI

Mixture of Masters: Sparse Chess Language Models with Player Routing

Pith reviewed 2026-05-16 07:55 UTC · model grok-4.3

keywords: chess · mixture of experts · language models · grandmaster emulation · gating network · sparse models · game AI · style routing

The pith

Mixture of grandmaster chess experts with dynamic routing outperforms dense models on unseen games.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Dense chess language models trained on aggregated games from many players tend to blur stylistic differences and suppress distinctive strategies. The paper introduces Mixture-of-Masters, which trains separate small expert networks on individual grandmasters' games and adds a gating network that chooses which expert to activate for each move according to the board position. This routing lets the system switch between contrasting styles, such as aggressive or solid play, without averaging them together. A sympathetic reader would care because the approach claims to deliver both stronger play against engines and greater control over the generated moves.

Core claim

Mixture-of-Masters uses multiple small GPT experts, each trained to emulate one grandmaster, together with a post-hoc learnable gating network that selects the most suitable expert for the current game state; when tested against Stockfish on unseen standard games, this architecture outperforms both dense single-expert networks and GPT models trained on pooled data while preserving move variety, style control, and interpretability.

What carries the argument

The post-hoc learnable gating network that selects the appropriate grandmaster expert for each move based on the game state.
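The routing idea can be illustrated with a minimal sketch. It assumes a linear gate over a hand-made state feature vector and mixes the selected experts' move distributions at the output level; the paper's Figure 1 caption describes top-k routing inside decoder blocks, so this is a simplification, and every name here (`route_top_k`, `mix_move_distributions`, the toy weights and experts) is hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate_logits(state, weights):
    """Linear gate (an assumption): one score per expert from state features."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]

def route_top_k(state, weights, k=1):
    """Return (expert_index, renormalized_weight) pairs for the k best experts."""
    probs = softmax(gate_logits(state, weights))
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]

def mix_move_distributions(routes, expert_outputs):
    """Weighted average of the selected experts' next-move distributions."""
    mixed = {}
    for idx, weight in routes:
        for move, p in expert_outputs[idx].items():
            mixed[move] = mixed.get(move, 0.0) + weight * p
    return mixed

# Toy setup: 3 hypothetical experts, 4 hypothetical state features.
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
state = [0.2, 2.0, 0.5, 0.0]
expert_outputs = [
    {"e4": 0.7, "d4": 0.3},
    {"Nf3": 0.6, "c4": 0.4},
    {"d4": 0.9, "e4": 0.1},
]

routes = route_top_k(state, weights, k=2)
mixed = mix_move_distributions(routes, expert_outputs)
```

With `k=1` the gate reduces to hard selection of a single expert per move, which is the setting where the chosen expert can be read off directly for interpretability.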

If this is right

  • The routed model produces more stylistically varied moves than a single dense network trained on all data.
  • Explicit expert selection supplies direct interpretability and the ability to bias toward a chosen grandmaster's tendencies.
  • Rare but effective strategies associated with specific players are retained rather than averaged away.
  • Performance gains appear on unseen standard games when compared with both individual experts and aggregated GPT baselines.
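The second bullet, biasing play toward a chosen grandmaster, could be realized in several ways. One simple possibility, sketched here as an assumption rather than the paper's mechanism, is to add a constant to the preferred expert's gate logit before normalizing. The expert labels and the `strength` parameter are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def biased_gate(logits, preferred, strength=2.0):
    """Shift one expert's logit upward, then renormalize the gate."""
    shifted = [x + strength if i == preferred else x
               for i, x in enumerate(logits)]
    return softmax(shifted)

# Hypothetical gate scores for three experts on one position.
experts = ["Tal", "Petrosian", "other"]
logits = [0.2, 2.0, 0.5]

plain = softmax(logits)                               # unbiased routing
steered = biased_gate(logits, preferred=0, strength=3.0)

plain_choice = experts[plain.index(max(plain))]
steered_choice = experts[steered.index(max(steered))]
```

A soft bias like this leaves the gate free to override the preference when another expert scores much higher; setting a very large `strength` approaches the fixed-expert setting.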

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same routing pattern could be tested in other turn-based strategy games to check whether persona separation improves generalization beyond chess.
  • If the gate generalizes, it might allow controlled style transfer in sequential generation tasks such as story writing or music composition.
  • Further experiments could measure whether the model maintains strength when forced to use only one fixed expert across an entire game.

Load-bearing premise

The gating network can pick the right grandmaster persona for every position without introducing inconsistent moves or overfitting to the training games.

What would settle it

A large-scale match on hundreds of held-out standard games in which the Mixture-of-Masters model fails to exceed the win rate of a dense baseline trained on the same total number of games.
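A match of this kind would typically be read through a significance test on the two win rates. The sketch below uses a one-sided two-proportion z-test with a normal approximation; it treats draws as non-wins and games as independent, both of which are simplifying assumptions, and the counts in the example are illustrative, not results from the paper.

```python
import math

def two_proportion_z(wins_a, games_a, wins_b, games_b):
    """One-sided z-test for H1: win rate of A exceeds win rate of B."""
    p_a = wins_a / games_a
    p_b = wins_b / games_b
    pooled = (wins_a + wins_b) / (games_a + games_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / games_a + 1.0 / games_b))
    z = (p_a - p_b) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value

# Illustrative counts only: routed model vs. dense baseline, 400 games each.
z, p_value = two_proportion_z(wins_a=180, games_a=400, wins_b=150, games_b=400)
routed_model_better = p_value < 0.05
```

If the routed model's p-value stays above the chosen threshold on a sufficiently large held-out match, the outperformance claim would not be supported at that sample size.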

Figures

Figures reproduced from arXiv: 2602.04447 by Davide Freddi, Giacomo Frisoni, Gianluca Moro, Lorenzo Molfetta.

Figure 1: Illustration of MoM. First, multiple decoder-only chess language models are trained to emulate the game decisions of specific grandmasters. Then, their layers are combined into a sparse language model by alternating uniform weight merging and top-k routing for next move prediction.
Figure 2: Overview of the visual chess player identification system. Left: During training, game embeddings are processed through contrastive learning against GM-specific centroids to enforce intra-player similarity and inter-player distinctiveness. Right: The visual encoding pipeline processes consecutive chess board frames to extract and temporally aggregate spatial patch tokens (in blue), with positional and temp…
Figure 3: Ablation studies. (a) Effect of seed model on expert FIDEScore; SSL-only, Stockfish 1, pooled over 10 runs. (b) Effect of expert count k on game results; MoM (top-5 exp. by FIDEScore), Stockfish 0, pooled over 10 runs. (c) Effect of RL on legality; Karvonen seed, Stockfish 1, pooled over 10 runs.
Figure 4: Style Consistency (left): Relative change in cosine distance when computing expert-specific centroids from random subsamples of played games; Style Acquisition (right): Recall of style-similarity retrieval mapping of played games to the correct real-GM centroid.
Figure 5: Comparison between MoM (top-5 experts by FIDEScore, SSL+RL, Karvonen seed) and baselines. FIDEScore after battling Stockfish at increasing difficulties; average results after 10 runs for each level.
Figure 6: Visualization of how MoM's activated experts vary when playing a game at test time against Stockfish. Decoder block top-1 routing paths for two distinct board states. MoM (White) dynamically adjusts expert utilization in response to the evolving position.
Figure 7: Geographic distribution of survey participants by affiliation country.
Figure 8: Distribution of responses to the statement that professional chess players are recognizable by their moves alone. The horizontal stacked bar represents the proportion of respondents on a five-point Likert scale (from Strongly Disagree to Strongly Agree). The question was framed to omit any mention of the high accuracy rates achieved by prior AI studies, ensuring that responses would reflect participants’ g…
Figure 9: Perceived contribution of gameplay attributes to player recognizability. The donut charts display the percentage of respondents who selected each given factor.
Figure 10: Expert consensus on the existence of playing style. Binary question.
Figure 11: Validation of conventional playing style categories. The donut charts show the percentage of respondents who endorsed each of the proposed style categories as valid and useful.
Figure 12: Perceived importance of visual patterns in style recognition. Binary question.
Figure 13: Distribution of responses to the statement that modern elite grandmasters exhibit a dominant playing style. The horizontal stacked bar represents the proportion of respondents on a five-point Likert scale (from Strongly Disagree to Strongly Agree).
Figure 14: Distribution of style category assignments for the ten grandmasters featured in this study. One choice per grandmaster. Each subplot displays the percentage of respondents assigning a dominant style category to a specific grandmaster.
Figure 15: Perceptions of AI’s impact on stylistic diversity and the enduring importance of human variability. Each horizontal stacked bar represents the proportion of respondents on a five-point Likert scale (from Strongly Disagree to Strongly Agree).
Figure 16: Distribution of unique games (○) by move.
Figure 17: Win Rate comparison between merging algorithms.
Figure 18: GRPO training curves for all 10 grandmaster experts. Each curve shows the mean reward (legality + syntax) over training steps. All experts demonstrate positive reward trajectories, transitioning from negative starting values (indicating frequent illegal moves post-SSL) to positive convergence points (indicating predominantly legal move generation). The heterogeneous endpoints reflect differences in SSL in…
Original abstract

Modern chess language models are dense transformers trained on millions of games played by thousands of high-rated individuals. However, these monolithic networks tend to collapse into mode-averaged behavior, where stylistic boundaries are blurred, and rare but effective strategies are suppressed. To counteract homogenization, we introduce Mixture-of-Masters (MoM), the first chess mixture-of-experts model with small-sized GPT experts emulating world-class grandmasters. For each move, a post-hoc learnable gating network selects the most appropriate persona to channel depending on the game state, allowing MoM to switch its style dynamically, e.g., Tal's offensive vocation or Petrosian's defensive solidity. When evaluated against Stockfish on unseen standard games, MoM outperforms both dense individual expert networks and popular GPT baselines trained on aggregated data, while ensuring generation variety, control, and interpretability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Mixture-of-Masters (MoM), a sparse mixture-of-experts chess language model consisting of small GPT experts each emulating a specific grandmaster. A post-hoc learnable gating network routes each move to the appropriate expert based on the game state, enabling dynamic style switching (e.g., Tal's aggression or Petrosian's defense). The central empirical claim is that MoM outperforms both dense single-expert networks and aggregated GPT baselines when evaluated against Stockfish on unseen standard games, while also delivering greater generation variety, control, and interpretability.

Significance. If the results hold, this work would provide evidence that post-hoc expert routing can counteract mode collapse in language models trained on heterogeneous player data, preserving distinct stylistic behaviors without sacrificing strength. The approach offers a concrete path toward more interpretable and controllable chess agents, with potential extension to other sequential decision domains where maintaining persona-specific policies matters.

major comments (2)
  1. [Method section on gating network] The post-hoc router is trained after the experts and conditioned only on board state, yet no routing accuracy metrics, confusion matrices, or ablations (e.g., oracle vs. learned routing on held-out positions) are reported. This is load-bearing for the outperformance claim: if the router fails to recover player-specific distributions, the model collapses to parameter averaging, and the reported gains in variety and strength cannot be attributed to dynamic control.
  2. [Experimental results] The abstract asserts outperformance against Stockfish and baselines on unseen games but supplies no win rates, Elo differences, centipawn-loss statistics, number of games evaluated, or significance tests. Without these quantitative details and the training hyperparameters, the central empirical claim cannot be assessed.
minor comments (2)
  1. The number and exact parameter counts of the individual GPT experts are not stated, making it difficult to assess the sparsity benefit.
  2. Formal notation for the gating function g(s) and the expert selection rule would improve clarity and reproducibility.
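One way the notation requested in minor comment 2 could look is sketched below. Only g(s) comes from the report; the feature map \phi, the gate parameters W, the number of experts M, the top-k index set \mathcal{T}_k(s), and the expert policies p_i are assumptions introduced for illustration, not the authors' definitions.

```latex
% Gate: a distribution over M experts given game state s (assumed linear-softmax form)
g(s) = \operatorname{softmax}\bigl(W\,\phi(s)\bigr) \in \Delta^{M-1}

% Selection rule: indices of the k largest gate probabilities
\mathcal{T}_k(s) = \operatorname*{arg\,top\text{-}k}_{i \in \{1,\dots,M\}} \; g_i(s)

% Routed policy: renormalized mixture of the selected experts' move distributions
p(a \mid s) = \sum_{i \in \mathcal{T}_k(s)}
  \frac{g_i(s)}{\sum_{j \in \mathcal{T}_k(s)} g_j(s)} \; p_i(a \mid s)
```

For k = 1 the routed policy reduces to p(a | s) = p_{i^*}(a | s) with i^* = \arg\max_i g_i(s), which is the hard-selection reading of "selects the most suitable expert".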

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, agreeing that additional details are needed to strengthen the claims, and we will incorporate the suggested improvements in the revised manuscript.

Point-by-point responses
  1. Referee: [Method section on gating network] The post-hoc router is trained after the experts and conditioned only on board state, yet no routing accuracy metrics, confusion matrices, or ablations (e.g., oracle vs. learned routing on held-out positions) are reported. This is load-bearing for the outperformance claim: if the router fails to recover player-specific distributions, the model collapses to parameter averaging, and the reported gains in variety and strength cannot be attributed to dynamic control.

    Authors: We agree that quantitative validation of the gating network is essential to support the outperformance and style-control claims. In the revised manuscript we will add routing accuracy on held-out positions, confusion matrices for player prediction, and an ablation comparing learned routing against an oracle router that knows the true player identity. These results are intended to show whether the router recovers distinct player-specific distributions rather than collapsing to parameter averaging. revision: yes

  2. Referee: [Experimental results] The abstract asserts outperformance against Stockfish and baselines on unseen games but supplies no win rates, Elo differences, centipawn-loss statistics, number of games evaluated, or significance tests. Without these quantitative details and the training hyperparameters, the central empirical claim cannot be assessed.

    Authors: We acknowledge that the current results section lacks the requested quantitative details. The revised version will report win rates, Elo differences, centipawn-loss statistics, the exact number of evaluation games, and statistical significance tests. We will also include the full training hyperparameters for the experts and the gating network to enable reproducibility and proper assessment of the claims. revision: yes
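Two of the metrics promised in these responses are straightforward to compute, and a minimal sketch may help fix what would be reported. It assumes routing is scored as a classification problem against the true player identity, and it uses the standard logistic Elo model to turn a match score into a rating gap. The function names and the toy labels are hypothetical, and no numbers here come from the paper.

```python
import math
from collections import Counter

def routing_confusion(true_ids, routed_ids, num_experts):
    """Rows: true player index. Columns: expert chosen by the gate."""
    counts = Counter(zip(true_ids, routed_ids))
    return [[counts[(t, r)] for r in range(num_experts)]
            for t in range(num_experts)]

def routing_accuracy(matrix):
    """Fraction of positions routed to the expert matching the true player."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0

def elo_difference(score):
    """Elo gap implied by a match score in (0, 1) under the logistic model.

    The score is (wins + 0.5 * draws) / games.
    """
    if not 0.0 < score < 1.0:
        raise ValueError("score must lie strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)

# Toy held-out positions: true player index vs. expert picked by the gate.
true_ids = [0, 0, 1, 1, 2, 2]
routed_ids = [0, 1, 1, 1, 2, 0]

matrix = routing_confusion(true_ids, routed_ids, num_experts=3)
accuracy = routing_accuracy(matrix)
gap = elo_difference(0.75)
```

Comparing `accuracy` for the learned gate against an oracle that always picks the true player's expert would give the ablation described in the first response.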

Circularity Check

0 steps flagged

No circularity: empirical model with external evaluation

The paper introduces MoM as an empirical mixture-of-experts architecture for chess language modeling, with no equations, derivations, or first-principles claims that could reduce to self-definition or fitted inputs. The central result (outperformance vs. Stockfish on unseen games) rests on post-training evaluation against an external engine and held-out data, not on any internal construction where a prediction is forced by the fitting process itself. No self-citations are invoked as load-bearing uniqueness theorems, and the gating network is presented as a standard trainable component without ansatz smuggling or renaming of known results. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are detailed beyond standard transformer training assumptions and the introduction of the gating network as a learnable component.

pith-pipeline@v0.9.0 · 5443 in / 1180 out tokens · 51238 ms · 2026-05-16T07:55:38.463873+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 5.0

    The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.

  2. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 3.0

    This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.
