Reproducing AlphaZero on Tablut: Self-Play RL for an Asymmetric Board Game
Pith reviewed 2026-05-10 19:57 UTC · model grok-4.3
The pith
Separate policy and value heads for each role let AlphaZero self-play succeed on the asymmetric game Tablut.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A modified AlphaZero network with distinct policy and value heads for the attacker and defender roles, plus C4 data augmentation, an expanded replay buffer, and 25 percent of games played against past checkpoints, improves steadily on Tablut over 100 self-play iterations. It reaches a BayesElo rating of 1235 relative to a randomly initialized baseline, while policy entropy and the average number of remaining pieces both fall.
What carries the argument
Role-specific policy and value heads sharing one residual trunk, stabilized by larger replay buffer, C4 augmentation, and periodic checkpoint play to prevent catastrophic forgetting between opposing objectives.
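The head split can be illustrated with a toy forward pass. This is a minimal sketch and not the paper's network: a single ReLU layer stands in for the residual trunk, and the role names, layer sizes, and function names (`init_params`, `forward`) are all hypothetical. It only shows the routing idea, where both roles share trunk features but each reads its own policy and value head.

```python
import math
import random

ROLES = ("attacker", "defender")  # hypothetical role labels


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def init_params(rng, n_features, n_hidden, n_actions):
    """One shared trunk, plus one policy head and one value head per role."""
    return {
        "trunk": init_layer(rng, n_features, n_hidden),
        "policy": {role: init_layer(rng, n_hidden, n_actions) for role in ROLES},
        "value": {role: init_layer(rng, n_hidden, 1) for role in ROLES},
    }


def forward(params, features, role):
    """Run the shared trunk, then only the heads belonging to `role`."""
    # Assumption: a single ReLU layer stands in for the residual tower.
    hidden = [max(0.0, h) for h in linear(*params["trunk"], features)]
    policy = softmax(linear(*params["policy"][role], hidden))
    value = math.tanh(linear(*params["value"][role], hidden)[0])
    return policy, value


# Usage: the same position features are evaluated once per role.
params = init_params(random.Random(0), n_features=8, n_hidden=16, n_actions=5)
attacker_policy, attacker_value = forward(params, [1.0] * 8, "attacker")
defender_policy, defender_value = forward(params, [1.0] * 8, "defender")
```

Because the heads are indexed by role, a gradient step on attacker targets leaves the defender heads untouched and affects the defender only through the shared trunk.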
If this is right
- Training remains stable across 100 iterations instead of diverging due to role conflict.
- Policy entropy drops and games end with fewer pieces, showing more focused decisions.
- The resulting agent outperforms a randomly initialized baseline by a clear rating margin.
- The same self-play loop works once the network no longer forces one head to represent two opposing goals.
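The policy-entropy signal mentioned in the list above can be made concrete with a few lines of code. This sketch assumes the metric is the Shannon entropy of the move distribution in nats; the page does not say which distribution (raw network policy or search visit counts) the paper measures, and the example distributions are made up.

```python
import math


def policy_entropy(probs):
    """Shannon entropy in nats; zero-probability moves contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical move distributions over four legal moves.
uniform = [0.25, 0.25, 0.25, 0.25]   # undecided policy, maximal entropy
peaked = [0.97, 0.01, 0.01, 0.01]    # focused policy, low entropy

early = policy_entropy(uniform)
late = policy_entropy(peaked)
```

A falling value of this quantity shows that the policy is concentrating on fewer moves. On its own it does not show that those moves are good ones.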
Where Pith is reading between the lines
- The same head split could be tested on other games with fixed unequal roles, such as certain historical variants or asymmetric strategy titles.
- If the shared trunk proves sufficient here, designers of multi-agent systems might try minimal separation only where objectives directly oppose.
- Longer training runs could check whether the rating keeps rising or plateaus once the stabilizations are in place.
Load-bearing premise
Splitting the heads and applying the listed stabilizations will resolve conflicting evaluations without introducing new biases that block convergence to strong play.
What would settle it
If, after the modifications, the model shows no rise in win rate against a fixed opponent, or continues to forget one role while training the other, the claim that the framework transfers would not hold.
Original abstract
This work investigates the adaptation of the AlphaZero reinforcement learning algorithm to Tablut, an asymmetric historical board game featuring unequal piece counts and distinct player objectives (king capture versus king escape). While the original AlphaZero architecture successfully leverages a single policy and value head for symmetric games, applying it to asymmetric environments forces the network to learn two conflicting evaluation functions, which can hinder learning efficiency and performance. To address this, the core architecture is modified to use separate policy and value heads for each player role, while maintaining a shared residual trunk to learn common board features. During training, the asymmetric structure introduced training instabilities, notably catastrophic forgetting between the attacker and defender roles. These issues were mitigated by applying C4 data augmentation, increasing the replay buffer size, and having the model play 25 percent of training games against randomly sampled past checkpoints. Over 100 self-play iterations, the modified model demonstrated steady improvement, achieving a BayesElo rating of 1235 relative to a randomly initialized baseline. Training metrics also showed a significant decrease in policy entropy and average remaining pieces, reflecting increasingly focused and decisive play. Ultimately, the experiments confirm that AlphaZero's self-play framework can transfer to highly asymmetric games, provided that distinct policy/value heads and robust stabilization techniques are employed.
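The checkpoint-play rule from the abstract can be sketched as a per-game opponent choice. This is an illustration under assumptions: the function name `pick_opponent`, the integer checkpoint identifiers, and the use of uniform sampling are hypothetical. The abstract states only that 25 percent of training games are played against randomly sampled past checkpoints.

```python
import random


def pick_opponent(rng, past_checkpoints, past_fraction=0.25):
    """Return a past checkpoint id for this game, or None for pure self-play."""
    if past_checkpoints and rng.random() < past_fraction:
        # Assumption: past checkpoints are sampled uniformly.
        return rng.choice(past_checkpoints)
    return None


# Usage: schedule opponents for a batch of training games.
rng = random.Random(0)
saved = [10, 20, 30]  # hypothetical iteration numbers of saved checkpoints
schedule = [pick_opponent(rng, saved) for _ in range(8)]
```

Which role the past checkpoint plays in such a game is not stated here, so the sketch leaves that choice to the caller.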
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that AlphaZero's self-play RL framework can be adapted to the asymmetric board game Tablut (with unequal pieces and opposing objectives) by replacing the single policy/value head with separate heads per role on a shared residual trunk. Training is stabilized with C4 data augmentation, an enlarged replay buffer, and 25% of training games played against past checkpoints. The reported outcome is steady improvement over 100 iterations to a BayesElo of 1235 versus a randomly initialized baseline, plus drops in policy entropy and average remaining pieces.
Significance. If the result holds, the work provides an empirical demonstration that self-play methods can transfer to highly asymmetric games with role-specific goals, which would be useful for extending RL to other imbalanced or multi-objective settings. The reported training metrics and single-run improvement constitute a concrete data point, though the absence of controls reduces the strength of the transfer claim.
major comments (3)
- [Experiments / Results] The reported experiments consist of one training run that applies the dual-head architecture together with all three stabilization techniques at once. No ablation is shown against a single-head baseline or against variants omitting individual stabilizations (e.g., without the 25% checkpoint games), which directly undermines the central claim that distinct heads plus the listed stabilizations are what enable successful transfer.
- [Evaluation / Metrics] The BayesElo of 1235 is given relative to a random baseline, but the text supplies no information on the number of evaluation games, standard error, or statistical significance of the rating; likewise, the entropy and piece-count trends are internal metrics that do not test whether the learned policies actually avoid the conflicting-evaluation problem the dual-head design was introduced to solve.
- [Discussion / Abstract] The weakest assumption, that separate heads plus the stabilizations resolve role conflicts without introducing new biases, is not probed; the paper would need at least one controlled comparison (single-head vs. dual-head under otherwise identical conditions) to substantiate that the observed convergence is attributable to the proposed changes rather than to other factors.
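To put the rating gap and the missing error bars from the evaluation comment into perspective, the sketch below uses the plain logistic Elo curve and a binomial standard error. Both are simplifying assumptions: BayesElo fits ratings with its own model, which also accounts for draws, and that model is not reproduced here. The function names are hypothetical.

```python
import math


def elo_expected_score(rating_diff):
    """Expected score of the stronger side under the standard logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


def score_standard_error(score, n_games):
    """Binomial standard error of an observed score fraction (draws ignored)."""
    return math.sqrt(score * (1.0 - score) / n_games)


# A 1235-point gap implies an expected score very close to 1 under this curve.
expected = elo_expected_score(1235)

# Hypothetical example: uncertainty of a 90 percent score over 200 games.
stderr = score_standard_error(0.90, 200)
```

Under this curve a 1235-point gap corresponds to an expected score above 0.999, so games against the baseline alone say little about differences between later checkpoints. This is one reason the number of evaluation games and the pairing scheme matter.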
minor comments (1)
- [Abstract / Methods] The phrase 'C4 data augmentation' appears without definition or citation; a brief description of the augmentation procedure and why it is labeled 'C4' would improve clarity.
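One plausible reading of the undefined term, offered as an assumption rather than as the paper's definition, is that C4 denotes the cyclic group of the four rotations of the square board (0, 90, 180, and 270 degrees). Under that reading, each stored position yields four training samples. The helper names and the tiny 2x2 board are hypothetical.

```python
def rotate90(board):
    """Rotate a square board (a list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*board[::-1])]


def c4_orbit(board):
    """Return the board under rotations of 0, 90, 180 and 270 degrees."""
    views = [[list(row) for row in board]]
    for _ in range(3):
        views.append(rotate90(views[-1]))
    return views


# Usage on a toy 2x2 board; a Tablut board would be 9x9.
views = c4_orbit([[1, 2],
                  [3, 4]])
```

In a full pipeline the policy target would have to be rotated with the same transformation as the board so that each move still points at the correct squares.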
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below, acknowledging limitations in the current experimental design while outlining how we will strengthen the presentation and claims in revision.
- Referee: The reported experiments consist of one training run that applies the dual-head architecture together with all three stabilization techniques at once. No ablation is shown against a single-head baseline or against variants omitting individual stabilizations (e.g., without the 25% checkpoint games), which directly undermines the central claim that distinct heads plus the listed stabilizations are what enable successful transfer.
  Authors: We agree that presenting only the combined configuration without ablations limits the strength of causal claims about the dual-head design and each stabilization. The experiments were intended as a feasibility demonstration for adapting self-play RL to Tablut rather than a full factorial study. In the revised manuscript we will add ablation experiments, including a single-head baseline and variants omitting individual stabilizations, to better isolate contributions.
  revision: yes
- Referee: The BayesElo of 1235 is given relative to a random baseline, but the text supplies no information on the number of evaluation games, standard error, or statistical significance of the rating; likewise, the entropy and piece-count trends are internal metrics that do not test whether the learned policies actually avoid the conflicting-evaluation problem the dual-head design was introduced to solve.
  Authors: We will revise the evaluation section to report the number of games used for the BayesElo computation along with standard errors and any significance testing performed. The entropy and remaining-piece metrics are indeed internal training signals that reflect policy sharpening and decisive play; they do not directly measure resolution of role conflicts. We will add explicit discussion of this limitation and note that the observed stable improvement is consistent with the dual-head motivation but does not constitute a direct test.
  revision: partial
- Referee: The weakest assumption, that separate heads plus the stabilizations resolve role conflicts without introducing new biases, is not probed; the paper would need at least one controlled comparison (single-head vs. dual-head under otherwise identical conditions) to substantiate that the observed convergence is attributable to the proposed changes rather than to other factors.
  Authors: This is a fair critique. The manuscript frames the dual-head architecture as addressing conflicting evaluations and reports successful training, yet lacks the direct controlled comparison. We will revise the discussion and abstract to present the results more cautiously as an empirical demonstration of transfer rather than definitive mechanistic proof, and we will include a single-head baseline experiment in the revision where compute permits.
  revision: partial
Circularity Check
Empirical self-play training results contain no circular derivations
The paper reports an empirical reproduction of AlphaZero on Tablut using a modified dual-head architecture plus stabilization techniques (C4 augmentation, enlarged replay buffer, 25% games vs past checkpoints). All claims rest on observed training curves, BayesElo scores against a random baseline, and internal metrics such as policy entropy and piece count over 100 iterations. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters, self-citations, or renamed inputs. The work is therefore self-contained as an experimental study.
Axiom & Free-Parameter Ledger
free parameters (3)
- percentage of training games against past checkpoints
- replay buffer size
- number of self-play iterations
axioms (2)
- domain assumption: Shared residual trunk can learn common board features despite asymmetric player roles
- ad hoc to paper: C4 data augmentation and checkpoint play stabilize asymmetric self-play training