Reproducing AlphaZero on Tablut: Self-Play RL for an Asymmetric Board Game

Tambet Matiisen; Tõnis Lees

arxiv: 2604.05476 · v1 · submitted 2026-04-07 · 💻 cs.LG


Pith reviewed 2026-05-10 19:57 UTC · model grok-4.3

classification 💻 cs.LG

keywords: AlphaZero, Tablut, asymmetric board games, self-play, reinforcement learning, policy and value heads, game AI


The pith

Separate policy and value heads for each role let AlphaZero self-play succeed on the asymmetric game Tablut.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether AlphaZero's self-play method can be adapted to Tablut, where one side must capture the king and the other must escape with it. A standard network with one policy head and one value head has to learn two conflicting evaluation functions at once, which the authors argue hinders learning. They split the heads by role while keeping a shared residual trunk for board features, then add C4 data augmentation, a larger replay buffer, and games against older checkpoints (25 percent of training games) to counter catastrophic forgetting between roles. Over 100 self-play iterations the model improves steadily, reaching a BayesElo rating of 1235 relative to a randomly initialized baseline. This suggests the core self-play loop remains usable once the architecture accounts for unequal objectives.

Core claim

A modified AlphaZero network with distinct policy and value heads for the attacker and defender roles, plus C4 data augmentation, an expanded replay buffer, and 25 percent of training games played against randomly sampled past checkpoints, improves steadily on Tablut over 100 self-play iterations. It reaches a BayesElo rating of 1235 relative to a randomly initialized baseline, while policy entropy and the average number of remaining pieces both fall.

What carries the argument

Role-specific policy and value heads sharing one residual trunk, stabilized by larger replay buffer, C4 augmentation, and periodic checkpoint play to prevent catastrophic forgetting between opposing objectives.
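
As an illustration of the routing idea, the sketch below shares one trunk across roles and selects a policy head and a value head by role. It is a toy stand-in under stated assumptions: a single linear layer replaces the paper's residual trunk, and `DualHeadNet`, `ROLES`, and all layer sizes are hypothetical names and values, not taken from the paper.

```python
import math
import random

ROLES = ("attacker", "defender")  # hypothetical role labels


def make_linear(n_in, n_out, rng):
    """Random weight matrix stored as n_out rows of n_in weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class DualHeadNet:
    """Shared trunk with one policy head and one value head per role."""

    def __init__(self, n_inputs, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)
        # Assumption: one linear layer stands in for the residual trunk.
        self.trunk = make_linear(n_inputs, n_hidden, rng)
        self.policy_heads = {r: make_linear(n_hidden, n_actions, rng) for r in ROLES}
        self.value_heads = {r: make_linear(n_hidden, 1, rng) for r in ROLES}

    def forward(self, board_features, role):
        hidden = relu(apply_linear(self.trunk, board_features))  # shared features
        policy = softmax(apply_linear(self.policy_heads[role], hidden))
        value = math.tanh(apply_linear(self.value_heads[role], hidden)[0])
        return policy, value
```

Calling `forward(features, "attacker")` and `forward(features, "defender")` on the same position reuses the trunk computation but yields separate move distributions and value estimates, so neither head has to represent both objectives.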

If this is right

Training remains stable across 100 iterations instead of diverging due to role conflict.
Policy entropy drops and games end with fewer pieces, showing more focused decisions.
The resulting agent outperforms a randomly initialized baseline by a clear rating margin.
The same self-play loop works once the network no longer forces one head to represent two opposing goals.
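
To make the entropy point concrete, here is a minimal sketch of Shannon entropy over a move distribution. The function name and the two example distributions are illustrative assumptions; the paper's exact metric definition is not quoted here.

```python
import math


def policy_entropy(probs):
    """Shannon entropy (in nats) of a move-probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical distributions over four legal moves.
uniform = [0.25, 0.25, 0.25, 0.25]   # undecided policy
peaked = [0.97, 0.01, 0.01, 0.01]    # policy committed to one move

entropy_uniform = policy_entropy(uniform)  # equals ln(4)
entropy_peaked = policy_entropy(peaked)    # much smaller than ln(4)
```

A uniform policy attains the maximum entropy for its number of moves, so a falling average is consistent with the network concentrating probability on fewer candidate moves.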

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same head split could be tested on other games with fixed unequal roles, such as certain historical variants or asymmetric strategy titles.
If the shared trunk proves sufficient here, designers of multi-agent systems might try minimal separation only where objectives directly oppose.
Longer training runs could check whether the rating keeps rising or plateaus once the stabilizations are in place.

Load-bearing premise

Splitting the heads and applying the listed stabilizations will resolve conflicting evaluations without introducing new biases that block convergence to strong play.

What would settle it

If, after the modifications, the model showed no rise in win rate against a fixed opponent, or kept forgetting one role while training the other, the claim that the framework transfers would not hold.

Figures

Figures reproduced from arXiv: 2604.05476 by Tambet Matiisen, Tõnis Lees.

**Figure 2.** Tablut board’s initial state.

**Figure 3.** Elo progression across three cumulative training configurations. Each successive run adds ...

**Figure 4.** Detailed training metrics and game statistics recorded over 100 self-play iterations.

Original abstract

This work investigates the adaptation of the AlphaZero reinforcement learning algorithm to Tablut, an asymmetric historical board game featuring unequal piece counts and distinct player objectives (king capture versus king escape). While the original AlphaZero architecture successfully leverages a single policy and value head for symmetric games, applying it to asymmetric environments forces the network to learn two conflicting evaluation functions, which can hinder learning efficiency and performance. To address this, the core architecture is modified to use separate policy and value heads for each player role, while maintaining a shared residual trunk to learn common board features. During training, the asymmetric structure introduced training instabilities, notably catastrophic forgetting between the attacker and defender roles. These issues were mitigated by applying C4 data augmentation, increasing the replay buffer size, and having the model play 25 percent of training games against randomly sampled past checkpoints. Over 100 self-play iterations, the modified model demonstrated steady improvement, achieving a BayesElo rating of 1235 relative to a randomly initialized baseline. Training metrics also showed a significant decrease in policy entropy and average remaining pieces, reflecting increasingly focused and decisive play. Ultimately, the experiments confirm that AlphaZero's self-play framework can transfer to highly asymmetric games, provided that distinct policy/value heads and robust stabilization techniques are employed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

They adapted AlphaZero to Tablut with role-specific heads and stabilizations and saw improvement, but a single run without controls leaves the necessity of those changes untested.


The paper adapts AlphaZero to Tablut by giving the network separate policy and value heads for the attacker and defender roles while sharing the residual trunk. The authors also add C4 augmentation, a larger replay buffer, and 25 percent of training games against past checkpoints to handle forgetting between roles. After 100 self-play iterations the model reaches a BayesElo rating of 1235 relative to a randomly initialized baseline, with falling policy entropy and fewer pieces left on the board on average. That is the concrete result they report.

Referee Report

3 major / 1 minor

Summary. The paper claims that AlphaZero's self-play RL framework can be adapted to the asymmetric board game Tablut (with unequal pieces and opposing objectives) by replacing the single policy/value head with separate heads per role on a shared residual trunk, combined with stabilization via C4 data augmentation, enlarged replay buffer, and 25% of training games against past checkpoints; this yields steady improvement over 100 iterations to a BayesElo of 1235 versus random baseline, plus drops in policy entropy and average remaining pieces.

Significance. If the result holds, the work provides an empirical demonstration that self-play methods can transfer to highly asymmetric games with role-specific goals, which would be useful for extending RL to other imbalanced or multi-objective settings. The reported training metrics and single-run improvement constitute a concrete data point, though the absence of controls reduces the strength of the transfer claim.

major comments (3)

[Experiments / Results] The reported experiments consist of one training run that applies the dual-head architecture together with all three stabilization techniques at once. No ablation is shown against a single-head baseline or against variants omitting individual stabilizations (e.g., without the 25% checkpoint games), which directly undermines the central claim that distinct heads plus the listed stabilizations are what enable successful transfer.
[Evaluation / Metrics] The BayesElo of 1235 is given relative to a random baseline, but the text supplies no information on the number of evaluation games, standard error, or statistical significance of the rating; likewise, the entropy and piece-count trends are internal metrics that do not test whether the learned policies actually avoid the conflicting-evaluation problem the dual-head design was introduced to solve.
[Discussion / Abstract] The weakest assumption—that separate heads plus the stabilizations resolve role conflicts without introducing new biases—is not probed; the paper would need at least one controlled comparison (single-head vs. dual-head under otherwise identical conditions) to substantiate that the observed convergence is attributable to the proposed changes rather than to other factors.
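
To put the evaluation comment in numbers, the sketch below uses the classic Elo logistic formula and a binomial standard error. Both are approximations chosen for illustration: BayesElo fits its own model (including draws), and the game counts shown are hypothetical, since the number of evaluation games is not reported.

```python
import math


def elo_expected_score(rating_diff):
    """Expected score of the stronger side under the classic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


def win_rate_standard_error(win_rate, n_games):
    """Binomial standard error of an observed win rate (draws ignored)."""
    return math.sqrt(win_rate * (1.0 - win_rate) / n_games)


expected = elo_expected_score(1235)  # very close to 1.0

# Hypothetical evaluation sizes, for illustration only.
se_100_games = win_rate_standard_error(0.5, 100)
se_1000_games = win_rate_standard_error(0.5, 1000)
```

Under this approximation a 1235-point gap corresponds to an expected score above 99.9 percent, which is why the number of games and the resulting uncertainty matter when interpreting a rating of this size.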

minor comments (1)

[Abstract / Methods] The phrase 'C4 data augmentation' appears without definition or citation; a brief description of the augmentation procedure and why it is labeled 'C4' would improve clarity.
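
One plausible reading, offered here as an assumption because the paper's definition is not quoted on this page, is that C4 denotes the cyclic group of the four 90-degree rotations of the square board. Under that reading, each training position yields four samples. The sketch simplifies the policy target to a per-square plane; a real move encoding would also need its direction indices permuted.

```python
def rotate90(grid):
    """Rotate a square grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def c4_augment(board, policy_plane):
    """Return the four rotated (board, policy_plane) pairs, original first.

    Assumption: the policy target is a per-square plane that rotates
    together with the board.
    """
    samples = []
    for _ in range(4):
        samples.append((board, policy_plane))
        board = rotate90(board)
        policy_plane = rotate90(policy_plane)
    return samples
```

The game outcome used as the value target is unchanged by rotation, so each rotated pair can reuse the label of the original position.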

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below, acknowledging limitations in the current experimental design while outlining how we will strengthen the presentation and claims in revision.


Referee: The reported experiments consist of one training run that applies the dual-head architecture together with all three stabilization techniques at once. No ablation is shown against a single-head baseline or against variants omitting individual stabilizations (e.g., without the 25% checkpoint games), which directly undermines the central claim that distinct heads plus the listed stabilizations are what enable successful transfer.

Authors: We agree that presenting only the combined configuration without ablations limits the strength of causal claims about the dual-head design and each stabilization. The experiments were intended as a feasibility demonstration for adapting self-play RL to Tablut rather than a full factorial study. In the revised manuscript we will add ablation experiments, including a single-head baseline and variants omitting individual stabilizations, to better isolate contributions. revision: yes
Referee: The BayesElo of 1235 is given relative to a random baseline, but the text supplies no information on the number of evaluation games, standard error, or statistical significance of the rating; likewise, the entropy and piece-count trends are internal metrics that do not test whether the learned policies actually avoid the conflicting-evaluation problem the dual-head design was introduced to solve.

Authors: We will revise the evaluation section to report the number of games used for the BayesElo computation along with standard errors and any significance testing performed. The entropy and remaining-piece metrics are indeed internal training signals that reflect policy sharpening and decisive play; they do not directly measure resolution of role conflicts. We will add explicit discussion of this limitation and note that the observed stable improvement is consistent with the dual-head motivation but does not constitute a direct test. revision: partial
Referee: The weakest assumption—that separate heads plus the stabilizations resolve role conflicts without introducing new biases—is not probed; the paper would need at least one controlled comparison (single-head vs. dual-head under otherwise identical conditions) to substantiate that the observed convergence is attributable to the proposed changes rather than to other factors.

Authors: This is a fair critique. The manuscript frames the dual-head architecture as addressing conflicting evaluations and reports successful training, yet lacks the direct controlled comparison. We will revise the discussion and abstract to present the results more cautiously as an empirical demonstration of transfer rather than definitive mechanistic proof, and we will include a single-head baseline experiment in the revision where compute permits. revision: partial

Circularity Check

0 steps flagged

Empirical self-play training results contain no circular derivations


The paper reports an empirical reproduction of AlphaZero on Tablut using a modified dual-head architecture plus stabilization techniques (C4 augmentation, enlarged replay buffer, 25% games vs past checkpoints). All claims rest on observed training curves, BayesElo scores against a random baseline, and internal metrics such as policy entropy and piece count over 100 iterations. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters, self-citations, or renamed inputs. The work is therefore self-contained as an experimental study.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard AlphaZero assumptions plus ad-hoc choices for handling asymmetry; limited information available from abstract.

free parameters (3)

percentage of training games against past checkpoints: set to 25 percent to mitigate catastrophic forgetting
replay buffer size: increased from default to address training instabilities
number of self-play iterations: run for 100 iterations to demonstrate improvement
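
The first two parameters can be sketched as follows. This is a minimal illustration, not the authors' implementation: `pick_opponent`, `make_replay_buffer`, the checkpoint names, and the buffer capacity are all hypothetical, and the paper's actual buffer size is not stated on this page.

```python
import random
from collections import deque


def pick_opponent(current, past_checkpoints, rng, past_fraction=0.25):
    """Return a random past checkpoint with probability past_fraction,
    otherwise the current model (ordinary self-play)."""
    if past_checkpoints and rng.random() < past_fraction:
        return rng.choice(past_checkpoints)
    return current


def make_replay_buffer(capacity):
    """Fixed-size buffer that silently drops the oldest samples."""
    return deque(maxlen=capacity)


rng = random.Random(0)
checkpoints = ["ckpt_10", "ckpt_20", "ckpt_30"]  # hypothetical names
opponents = [pick_opponent("current", checkpoints, rng) for _ in range(10_000)]
past_share = sum(o != "current" for o in opponents) / len(opponents)

buffer = make_replay_buffer(capacity=3)  # tiny capacity, for illustration
for position in range(5):
    buffer.append(position)
# buffer now holds only the three most recent positions: 2, 3, 4
```

A larger capacity keeps positions from older iterations in the training mix for longer, which is the intuition behind using buffer size as a lever against forgetting.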

axioms (2)

domain assumption: a shared residual trunk can learn common board features despite asymmetric player roles
Invoked to justify keeping a single trunk while splitting heads
ad hoc to paper: C4 data augmentation and checkpoint play stabilize asymmetric self-play training
Used to mitigate instabilities without further justification


