From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning

J\'an Kla\v{c}an; Sizhong Zhang

arxiv: 2605.17162 · v1 · pith:ND2DA4P7new · submitted 2026-05-16 · 💻 cs.AI · cs.LG

From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning

J\'an Kla\v{c}an , Sizhong Zhang This is my paper

Pith reviewed 2026-05-20 14:15 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords Schnapsenreinforcement learningshallow neural networksMonte Carlo searchcard game AIimitation learningvalue functionexperience replay

0 comments

The pith

Shallow reinforcement learning masters Schnapsen and beats strong search opponents when its value function guides deeper lookahead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether shallow neural networks can learn to play Schnapsen well enough to challenge a strong Monte Carlo search baseline called RdeepBot. A supervised agent trained by imitating replays fails to generalize and loses to tough opponents. In contrast, a reinforcement learning agent using the same shallow architecture but trained through asynchronous Monte Carlo updates and experience replay produces much stronger play. The strongest results occur when this learned value function is combined with deeper lookahead during games, delivering statistically significant higher win rates against the best baseline.

Core claim

The paper establishes that supervised imitation learning with a shallow MLP does not generalize well enough to defeat strong RdeepBot opponents in Schnapsen, whereas reinforcement learning with the same architecture using asynchronous Monte Carlo updates and experience replay produces substantially stronger agents; in the depth-parameter setting the best performance arises when the learned value function is combined with deeper lookahead, allowing statistically significant higher winning rates against the strongest evaluated RdeepBot baseline, while sample-based gains remain more conditional on training parameters.

What carries the argument

The RLBot agent whose shallow MLP value function is trained by asynchronous Monte Carlo updates with experience replay and then paired with lookahead search during play.

If this is right

Reinforcement learning produces substantially stronger agents than supervised imitation for Schnapsen.
Pairing the learned value function with deeper lookahead yields the highest winning rates against strong baselines.
In depth-focused evaluations RLBot gains a statistically significant edge over the toughest RdeepBot.
In sample-based evaluations performance gains depend conditionally on training parameters rather than scaling uniformly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same shallow RL plus search combination could be tried on other imperfect-information card games with comparable state sizes.
The results suggest that reward-driven interaction uncovers strategic patterns that imitation from replay data misses.
Further tests could check whether the learned value function encodes reusable heuristics that search exploits efficiently.

Load-bearing premise

The experimental design assumes that the chosen shallow MLP architecture and the asynchronous Monte Carlo update procedure with experience replay are sufficient to produce a value function that generalizes beyond the training distribution to defeat strong search opponents.

What would settle it

If gameplay trials show that RLBot fails to achieve statistically significant higher win rates against the strongest RdeepBot even when using deeper lookahead, the central performance claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.17162 by J\'an Kla\v{c}an, Sizhong Zhang.

**Figure 2.** Figure 2: Performance of RdeepBot sample variants: Statistical significance matrix displaying Z-Test scores [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Experiment 1 Results: Performance of supervised MLPBot variants. Top: Statistical significance [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

**Figure 4.** Figure 4: Experiment 2 Results: Performance of depth-based RLBot variants. Top: Statistical significance [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

**Figure 5.** Figure 5: Experiment 3 Results: Performance of sample-based RLBot variants. Top: Statistical significance [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

read the original abstract

This paper investigates whether shallow neural network agents can master the card game Schnapsen and challenge a strong search-based baseline, RdeepBot, which uses Monte Carlo sampling and lookahead search. Guided by a progressively more complex experimental design, we first evaluate a supervised learning agent (MLPBot) trained on replay data and then a reinforcement learning agent (RLBot) with the same shallow architecture trained through asynchronous Monte Carlo updates and experience replay. The results show that supervised imitation does not generalize well enough to defeat strong RdeepBot opponents, whereas reinforcement learning produces substantially stronger agents. In the setting that focuses on the depth parameter of RdeepBot, the best performance is achieved when the learned value function is combined with deeper lookahead during gameplay, allowing RLBot to achieve statistically significant higher winning rates against the strongest evaluated RdeepBot baseline. In the sample-based setting, the gains are more conditional: the strongest performance appears at a relatively lower training num_samples parameter rather than increasing uniformly with stronger sampling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Shallow RL with replay beats imitation learning in Schnapsen and improves on search baselines when paired with deeper lookahead, though training details stay thin.

read the letter

The main thing to know is that this paper shows a shallow MLP trained with asynchronous Monte Carlo updates and experience replay can produce agents that outperform both supervised imitation and the strongest RdeepBot search baseline in Schnapsen, especially when the value function is used with increased lookahead depth. The gains are statistically significant in the depth-focused setting but more conditional when varying sample counts during training or play.

Referee Report

2 major / 2 minor

Summary. The paper investigates whether shallow neural network agents can master the card game Schnapsen and challenge a strong search-based baseline (RdeepBot). It first evaluates a supervised learning agent (MLPBot) trained on replay data, then a reinforcement learning agent (RLBot) using the same shallow MLP architecture trained via asynchronous Monte Carlo updates and experience replay. Results indicate that supervised imitation does not generalize sufficiently to defeat strong opponents, whereas RL produces substantially stronger agents. In the depth-parameter setting, the best performance occurs when the learned value function is combined with deeper lookahead, yielding statistically significant higher win rates against the strongest RdeepBot baseline. In the sample-based setting, gains are more conditional and appear strongest at relatively lower training num_samples rather than scaling uniformly with sampling strength.

Significance. If the results hold under full verification, the work demonstrates that shallow RL can learn effective value functions for imperfect-information card games, providing a lightweight complement to Monte Carlo search that improves upon both pure search baselines and imitation learning. The progressive experimental design distinguishing depth versus sample regimes offers practical insight into how learned evaluators interact with search parameters. Explicit statistical comparisons and the focus on generalization beyond training distribution are positive elements.

major comments (2)

[§4] §4 (Experimental Results): The abstract and results claim statistically significant win-rate differences for RLBot versus the strongest RdeepBot baseline, yet no error bars, confidence intervals, exact dataset sizes, number of trials, or p-value details are provided; this directly affects the load-bearing claim that the depth-parameter combination produces reliably superior performance.
[§3.2] §3.2 (RL Training Procedure): The central empirical claim rests on the asynchronous Monte Carlo updates with experience replay on a shallow MLP producing a value function that generalizes to defeat strong search opponents, but the manuscript provides no ablation on architecture depth, alternative update rules, or training distribution coverage, leaving the weakest assumption untested.

minor comments (2)

[Abstract] Abstract: The phrasing 'training num_samples parameter' is unclear without a prior definition of RdeepBot's sampling mechanism; a brief parenthetical or reference would improve readability.
[Throughout] Notation: Consistent use of 'RLBot' versus 'the learned value function' across sections would reduce ambiguity when describing the combined agent.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the constructive major comments. We respond to each point below, indicating planned revisions where appropriate to improve clarity and rigor.

read point-by-point responses

Referee: [§4] §4 (Experimental Results): The abstract and results claim statistically significant win-rate differences for RLBot versus the strongest RdeepBot baseline, yet no error bars, confidence intervals, exact dataset sizes, number of trials, or p-value details are provided; this directly affects the load-bearing claim that the depth-parameter combination produces reliably superior performance.

Authors: We agree that the presentation of statistical details is insufficient in the current manuscript. The experiments underlying the reported win rates were conducted over 5000 games per matchup with multiple random seeds, but these specifics and supporting statistics were omitted. In the revised manuscript we will add error bars to all relevant figures, report exact trial counts and dataset sizes, include 95% confidence intervals, and provide p-values from two-tailed paired t-tests to substantiate the claims of statistically significant improvements for the depth-parameter setting. revision: yes
Referee: [§3.2] §3.2 (RL Training Procedure): The central empirical claim rests on the asynchronous Monte Carlo updates with experience replay on a shallow MLP producing a value function that generalizes to defeat strong search opponents, but the manuscript provides no ablation on architecture depth, alternative update rules, or training distribution coverage, leaving the weakest assumption untested.

Authors: The manuscript intentionally employs the identical shallow MLP for both the supervised imitation and RL agents to isolate the effect of the training regime. We acknowledge that explicit ablations on network depth, alternative update rules, and quantitative measures of training distribution coverage would further strengthen the claims. In revision we will expand §3.2 with a dedicated paragraph justifying the architectural choice on grounds of computational efficiency and fair comparison, report basic statistics on state coverage from the replay buffer, and explicitly list deeper architectures and alternative RL methods as directions for future work. Full new ablations are not feasible within the current experimental budget but the added discussion will address the concern directly. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical comparison against external baseline

full rationale

The manuscript reports an empirical evaluation of supervised imitation (MLPBot) versus reinforcement learning (RLBot) agents using a shallow MLP, trained via asynchronous Monte Carlo updates and experience replay, then tested against the independent RdeepBot search baseline. No mathematical derivations, value-function equations, or parameter fittings are presented that reduce by construction to their own inputs. Performance claims rest on win-rate statistics and ablation-style comparisons to an external opponent rather than self-referential predictions or self-citation chains. The experimental design is self-contained: training produces a value function whose utility is measured by direct gameplay outcomes against a separately implemented search agent, with no load-bearing uniqueness theorems or ansatzes imported from prior author work.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central empirical claim rests on the assumption that the shallow network plus Monte Carlo updates can learn a useful value function for Schnapsen; no new entities or ad-hoc axioms are introduced beyond standard RL assumptions.

free parameters (1)

training num_samples
Hyperparameter controlling opponent strength in sample-based evaluation; performance peaks at lower rather than higher values.

axioms (1)

domain assumption Shallow MLP architecture is expressive enough for Schnapsen value estimation
Invoked when claiming RLBot can challenge RdeepBot.

pith-pipeline@v0.9.0 · 5704 in / 1196 out tokens · 31568 ms · 2026-05-20T14:15:43.547954+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

shallow neural network agents... asynchronous Monte Carlo updates and experience replay... learned value function combined with deeper lookahead
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

reinforcement learning produces substantially stronger agents... statistically significant higher winning rates

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 1 internal anchor

[1]

On the expressive power of deep architectures

Yoshua Bengio and Olivier Delalleau. On the expressive power of deep architectures. In Jyrki Kivinen, Csaba Szepesvári, Esko Ukkonen, and Thomas Zeugmann, editors,Algorithmic Learning Theory, pages 18–36, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg

work page 2011
[2]

Bishop.Pattern Recognition and Machine Learning

Christopher M. Bishop.Pattern Recognition and Machine Learning. Springer, 2006

work page 2006
[3]

Schnapsen: GitHub repository,

Michael Cochez, Taewoon Kim, Nikos Kondylidis, and Tim But. Schnapsen: GitHub repository,

work page
[4]

URL:https://github.com/intelligent-systems-course/schnapsen

work page
[5]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. URL: https://arxiv.org/abs/1412.6980,arXiv:1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2017
[6]

Non-uniform memory sampling in experience replay, 2025

Andrii Krutsylo. Non-uniform memory sampling in experience replay, 2025. URL:https: //arxiv.org/abs/2502.11305,arXiv:2502.11305

work page arXiv 2025
[7]

Towards understanding the effect of leak in Spiking Neural Networks,

Věra Kůrková and Marcello Sanguineti. Model complexities of shallow networks represent- ing highly varying functions.Neurocomputing, 171:598–604, 2016. URL: https://www. sciencedirect.com/science/article/pii/S0925231215009893, doi:10.1016/j.neucom. 2015.07.014

work page doi:10.1016/j.neucom 2016
[8]

L.-H. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3/4):69–97, 1992

work page 1992
[9]

Schnapsen and sixty-six rules, 2015

Psellos. Schnapsen and sixty-six rules, 2015. Revised: August 18, 2015. Accessed: 2026-01-19. URL:http://psellos.com/schnapsen/rules.html

work page 2015
[10]

concurrent.futures — launching parallel tasks — python 3.14 docu- mentation

Python Software Foundation. concurrent.futures — launching parallel tasks — python 3.14 docu- mentation. Accessed: 2026-01-18. URL: https://docs.python.org/3/library/concurrent. futures.html

work page 2026
[11]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto.Reinforcement learning : an introduction. Adaptive computation and machine learning. MIT Press, Cambridge, Mass., [etc.], 3rd print edition,

work page
[12]

Section: XVIII, 322 p. : ill. ; 24 cm. 17

work page

[1] [1]

On the expressive power of deep architectures

Yoshua Bengio and Olivier Delalleau. On the expressive power of deep architectures. In Jyrki Kivinen, Csaba Szepesvári, Esko Ukkonen, and Thomas Zeugmann, editors,Algorithmic Learning Theory, pages 18–36, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg

work page 2011

[2] [2]

Bishop.Pattern Recognition and Machine Learning

Christopher M. Bishop.Pattern Recognition and Machine Learning. Springer, 2006

work page 2006

[3] [3]

Schnapsen: GitHub repository,

Michael Cochez, Taewoon Kim, Nikos Kondylidis, and Tim But. Schnapsen: GitHub repository,

work page

[4] [4]

URL:https://github.com/intelligent-systems-course/schnapsen

work page

[5] [5]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. URL: https://arxiv.org/abs/1412.6980,arXiv:1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2017

[6] [6]

Non-uniform memory sampling in experience replay, 2025

Andrii Krutsylo. Non-uniform memory sampling in experience replay, 2025. URL:https: //arxiv.org/abs/2502.11305,arXiv:2502.11305

work page arXiv 2025

[7] [7]

Towards understanding the effect of leak in Spiking Neural Networks,

Věra Kůrková and Marcello Sanguineti. Model complexities of shallow networks represent- ing highly varying functions.Neurocomputing, 171:598–604, 2016. URL: https://www. sciencedirect.com/science/article/pii/S0925231215009893, doi:10.1016/j.neucom. 2015.07.014

work page doi:10.1016/j.neucom 2016

[8] [8]

L.-H. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3/4):69–97, 1992

work page 1992

[9] [9]

Schnapsen and sixty-six rules, 2015

Psellos. Schnapsen and sixty-six rules, 2015. Revised: August 18, 2015. Accessed: 2026-01-19. URL:http://psellos.com/schnapsen/rules.html

work page 2015

[10] [10]

concurrent.futures — launching parallel tasks — python 3.14 docu- mentation

Python Software Foundation. concurrent.futures — launching parallel tasks — python 3.14 docu- mentation. Accessed: 2026-01-18. URL: https://docs.python.org/3/library/concurrent. futures.html

work page 2026

[11] [11]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto.Reinforcement learning : an introduction. Adaptive computation and machine learning. MIT Press, Cambridge, Mass., [etc.], 3rd print edition,

work page

[12] [12]

Section: XVIII, 322 p. : ill. ; 24 cm. 17

work page