GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning

Gabriele Farina; Zhiyuan Fan

arxiv: 2605.19235 · v1 · pith:ROILB5ESnew · submitted 2026-05-19 · 💻 cs.LG · cs.GT

Pith reviewed 2026-05-20 07:45 UTC · model grok-4.3

classification 💻 cs.LG cs.GT

keywords reinforcement learning · multi-agent learning · imperfect information · variance reduction · advantage estimation · self-play · policy optimization · Q-boosting

The pith

In self-play for imperfect-information games, GAE adds avoidable variance from sampling stochastic actions, which a centralized critic can remove.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that generalized advantage estimation inflates variance in equilibrium self-play because future actions are drawn from stochastic policies even when the critic is exact. This extra noise arises specifically under partial observability and adversarial opponents. The authors replace sampled backups with a multi-step Expected SARSA(λ) trace that computes exact policy expectations over actions at each step. They embed this estimator, called Q-boosting, inside a clipped PPO-style objective to produce Variance-Reduced Policy Optimization. The resulting method yields stronger empirical performance on games such as Dou Dizhu and Heads-Up No-Limit Texas Hold'em.

Core claim

Standard GAE suffers from additional variance due to the sampling of stochastic future actions in equilibrium self-play; this variance is amplified by the stochastic nature of the equilibrium policy and persists even with an exact critic. Q-boosting removes the noise by using a centralized action-value critic to replace sampled multi-step backups with multi-step Expected SARSA(λ) traces that average out action-sampling noise at every step while retaining PPO's clipped objective and on-policy actor updates.
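A minimal numeric sketch of this claim, using the matching-pennies setup described in Figure 1. Everything here is an illustrative assumption rather than the paper's code: the function names, the choice γ = λ = 1, the hand-written exact critic, and the trace form (the textbook Expected SARSA(λ) recursion, which may differ in detail from the paper's definition).

```python
import random
import statistics

GAMMA = 1.0  # assumed: undiscounted two-step game
LAM = 1.0    # assumed: full multi-step trace

def payoff(a1, a2):
    """Player 1's reward: +1 if the actions match, -1 otherwise."""
    return 1.0 if a1 == a2 else -1.0

def sample_advantages(n, seed=0):
    """Return (GAE samples, Expected SARSA(lambda) samples) for player 1's action."""
    rng = random.Random(seed)
    gae, boosted = [], []
    for _ in range(n):
        a1 = rng.choice("ht")
        a2 = rng.choice("ht")  # player 2's equilibrium policy mixes uniformly
        r = payoff(a1, a2)     # the only reward, received after player 2 acts
        # Exact values under the uniform equilibrium (hand-written critic).
        v0 = 0.0                                            # root value
        v1 = 0.5 * payoff(a1, "h") + 0.5 * payoff(a1, "t")  # E_pi[Q(s1, .)] = 0
        q0 = GAMMA * v1                                     # Q(s0, a1)
        q1 = payoff(a1, a2)                                 # Q(s1, a2)
        # GAE: state-value TD errors, so the sampled a2 leaks into d1.
        d0 = 0.0 + GAMMA * v1 - v0
        d1 = r - v1
        gae.append(d0 + GAMMA * LAM * d1)
        # Expected SARSA(lambda): action-value TD errors with expected backups.
        e0 = 0.0 + GAMMA * v1 - q0
        e1 = r - q1
        boosted.append((q0 - v0) + e0 + GAMMA * LAM * e1)
    return gae, boosted

gae, boosted = sample_advantages(10_000)
print(statistics.pvariance(gae), statistics.pvariance(boosted))
```

Under these assumptions every GAE sample is +1 or −1, while every Expected SARSA(λ) sample is exactly 0, which is the true advantage of either action at the uniform equilibrium. The critic is exact in both cases, so the gap comes only from how the sampled second-player action enters the estimate.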

What carries the argument

Q-boosting: a variance-reduced advantage estimator that substitutes policy expectations computed by a centralized action-value critic for sampled future actions inside a multi-step Expected SARSA(λ) trace.
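As a rough sketch of what such an estimator could look like on a single recorded trajectory: the function name `q_boosted_advantages`, its argument layout, and the exact recursion are assumptions made for illustration (a standard backward Expected SARSA(λ) pass), not the authors' implementation. The critic outputs `q_values[t][a]` and the policy probabilities `policies[t][a]` are taken as given, and the episode is assumed to terminate after the last step.

```python
def expected_value(policy, q):
    """Policy expectation of the critic: sum_a pi(a|s) * Q(s, a)."""
    return sum(p * x for p, x in zip(policy, q))

def q_boosted_advantages(rewards, actions, q_values, policies,
                         gamma=0.99, lam=0.95):
    """Backward Expected SARSA(lambda) pass over one trajectory (hypothetical helper).

    rewards[t]   reward received after step t
    actions[t]   index of the action actually taken at step t
    q_values[t]  critic outputs Q(s_t, a) for every action a
    policies[t]  action probabilities pi(a | s_t)
    """
    T = len(rewards)
    advantages = [0.0] * T
    trace = 0.0
    for t in reversed(range(T)):
        # Expected backup: average the critic over the next policy instead of
        # using the value attached to the sampled next action.
        v_next = expected_value(policies[t + 1], q_values[t + 1]) if t + 1 < T else 0.0
        q_taken = q_values[t][actions[t]]
        delta = rewards[t] + gamma * v_next - q_taken
        trace = delta + gamma * lam * trace
        # Advantage = corrected action value minus the expected-value baseline.
        advantages[t] = q_taken - expected_value(policies[t], q_values[t]) + trace
    return advantages

# Matching pennies from player 1's reward perspective, both players choosing "h".
adv = q_boosted_advantages(
    rewards=[0.0, 1.0], actions=[0, 0],
    q_values=[[0.0, 0.0], [1.0, -1.0]],
    policies=[[0.5, 0.5], [0.5, 0.5]],
    gamma=1.0, lam=1.0,
)
print(adv)  # [0.0, 1.0]
```

With `lam=0` the recursion collapses to a one-step expected backup, `r_t + gamma * E[Q(s_{t+1}, .)] - E[Q(s_t, .)]`; larger `lam` mixes in more of the observed rewards while still subtracting the critic's value of each sampled action.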

If this is right

VRPO keeps the clipped surrogate objective and on-policy actor updates of PPO while swapping only the advantage estimator.
The method replaces every sampled multi-step backup with an exact expectation over actions at each step of the trace.
Empirically the approach scales from mid-sized games to large imperfect-information domains such as Dou Dizhu and poker.
The variance reduction holds even when the critic itself is exact, isolating the benefit to the removal of action-sampling noise.
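The first point above can be illustrated with a small sketch of PPO's clipped surrogate, written so that the advantage values are just an input. The function name and the default `clip_eps=0.2` are assumptions for illustration; the paper's actual hyper-parameters are not stated in this review.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped objective (to be maximized) over a batch of samples.

    The advantages may come from GAE or from any other estimator; nothing
    else in the objective depends on how they were computed.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Identical policies give ratio 1, so the objective is the mean advantage.
print(clipped_surrogate([0.0, 0.0], [0.0, 0.0], [1.0, -3.0]))  # -1.0
```

Swapping the estimator then amounts to changing what is passed as `advantages`, which is the sense in which the rest of the PPO update can stay untouched.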

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same expectation-based trace could be applied to other on-policy methods that currently rely on GAE in multi-agent settings.
Accurate centralized critics become a new bottleneck once action-sampling variance is removed, suggesting future work on critic regularization or auxiliary losses.
Because the estimator is still on-policy, it may combine cleanly with techniques that further reduce policy entropy in equilibrium play.

Load-bearing premise

A centralized action-value critic can be trained accurately enough during equilibrium self-play to supply reliable action expectations without adding bias or instability that cancels the variance reduction.

What would settle it

Run VRPO and PPO head-to-head on Heads-Up No-Limit Texas Hold'em; if the centralized critic cannot be trained stably enough, VRPO should show equal or higher variance and no improvement in final exploitability.

Figures

Figures reproduced from arXiv: 2605.19235 by Gabriele Farina, Zhiyuan Fan.

**Figure 1.** We compare GAE and Q-boosting in the matching-pennies game, where the first player receives a reward of +1 if their action matches the second player’s and −1 otherwise. With perfect information, the second player’s policy is deterministic, and GAE exhibits low variance. However, under imperfect information, the equilibrium policy of the second player mixes uniformly between playing h and t. The action valu…

**Figure 2.** Exact exploitability (lower is better) of agents in various games under a shared training…

**Figure 3.** Role-averaged gain (higher is better; values above zero indicate a win) against PerfectDou [8], over the course of VRPO training. We apply VRPO to Dou Dizhu, a three-player game between one Landlord and two independent Peasants. The game comprises at least 10^53 information sets, with each information set averaging a size of 10^23 [7, 26]. Dou Dizhu is a standard benchmark for self-play RL, with prior age…

**Figure 4.** Comparison of the standard deviation of the used advantage Â during the first 10,000 steps of training under different methods. We apply VRPO to Heads-Up No-Limit Texas Hold’em with an initial stack of 200 big blinds. The agent is trained for 40,000 iterations with a batch size of B = 8192 trajectories, totaling approximately 1.51 × 10^9 timesteps. Training takes roughly 63 hours on 4× RTX 5090 GPUs. We co…

**Figure 5.** Actor architecture. The observation history is encoded into features by a decoder-only…

**Figure 6.** Actor learning rate schedule η_actor during training (Left: Dou Dizhu; Right: HUNL200).

**Figure 7.** Regularization coefficient schedule α during training (Left: Dou Dizhu; Right: HUNL200).

**Figure 8.** Standard deviation of the estimated advantage

**Figure 9.** Clipped fraction according to the PPO clipping threshold during training (Left: Dou Dizhu; …

**Figure 10.** KL divergence to the reference policy KL(π ∥ π_ref) during training (Left: Dou Dizhu; Right: HUNL200).

**Figure 11.** KL divergence to the uniform policy KL(π ∥ Unif) during training (Left: Dou Dizhu; Right: HUNL200).

**Figure 12.** Average return of the first player under the current policy

**Figure 13.** Average gameplay length under the current policy

**Figure 14.** Exact exploitability (lower is better) of agents in APTTT (Abrupt Phantom Tic-Tac-Toe), …

**Figure 15.** Exact exploitability (lower is better) of agents in APTTT (Abrupt Phantom Tic-Tac-Toe), …

Original abstract

Competitive multi-agent reinforcement learning in imperfect-information games requires agents to act under partial observability and against adversarial opponents, necessitating stochastic policies. While self-play reinforcement learning with Proximal Policy Optimization (PPO) has achieved strong empirical success, its standard advantage estimator, generalized advantage estimation, suffers from additional variance due to the sampling of stochastic future actions. This variance is amplified in equilibrium self-play because of the stochastic nature of the equilibrium policy and persists even when the critic is exact. We address this bottleneck by introducing $Q$-boosting, a variance-reduced advantage estimator based on a centralized action-value critic, and propose Variance-Reduced Policy Optimization (VRPO), incorporating this new estimator. The algorithm replaces sampled multi-step backups with a multi-step Expected SARSA$(\lambda)$ trace, computing policy expectations at each step to average out action-sampling noise, while retaining PPO's clipped objective and on-policy actor updates. Empirically, VRPO consistently achieves strong performance from mid-sized to large-scale games including Dou Dizhu and Heads-Up No-Limit Texas Hold'em.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper correctly flags extra variance from action sampling in GAE during stochastic self-play and replaces it with expected SARSA traces from a centralized critic, but the non-stationarity concern makes the gains hard to trust without stronger controls.

The paper's main point is that GAE in PPO self-play for imperfect-info games picks up unnecessary variance because future actions are sampled from stochastic policies. They introduce Q-boosting, which uses a centralized action-value critic to compute policy expectations at each step in a multi-step Expected SARSA(λ) trace. This gets plugged into the standard clipped PPO objective while keeping on-policy actor updates.

Referee Report

2 major / 2 minor

Summary. The manuscript argues that generalized advantage estimation (GAE) incurs extra variance in imperfect-information self-play RL because stochastic equilibrium policies require sampling future actions; this variance persists even with an exact critic. The authors introduce Q-boosting, a variance-reduced advantage estimator that replaces sampled multi-step returns with multi-step Expected SARSA(λ) traces computed from a centralized action-value critic, and embed it in Variance-Reduced Policy Optimization (VRPO) while retaining PPO’s clipped surrogate and on-policy updates. Empirical results are presented on Dou Dizhu and Heads-Up No-Limit Texas Hold’em, claiming consistent strong performance from mid-sized to large-scale games.

Significance. If the reported gains are robustly attributable to the variance reduction rather than implementation details or hyper-parameter tuning, the work supplies a practical, PPO-compatible fix for a recognized source of noise in multi-agent imperfect-information training. The use of Expected SARSA(λ) traces is a direct, standard-RL-grounded construction that avoids introducing new free parameters beyond the critic itself.

major comments (2)

[Q-boosting / VRPO description] The central claim that Q-boosting eliminates action-sampling variance rests on the assumption that the centralized critic supplies accurate conditional expectations E_{a~π}[Q(s',a)]. In equilibrium self-play the joint policy is non-stationary, opponents adapt, and observations are partial; any persistent approximation bias or instability in the learned Q-values directly offsets the claimed variance reduction. No theoretical bound, bias analysis, or ablation measuring critic accuracy versus performance gain is provided (see the Q-boosting definition and the critic-training paragraph).
[Empirical evaluation] The empirical section reports that VRPO “consistently achieves strong performance,” yet supplies neither quantitative metrics (win rates, exploitability, or Elo), statistical significance tests, nor controlled ablations that isolate the contribution of the Expected SARSA(λ) estimator from other implementation choices. Without these controls the claim that VRPO outperforms GAE because of variance reduction cannot be verified.

minor comments (2)

[Method section] Notation for the multi-step trace (λ-return with policy expectation) should be written explicitly with the same symbols used in the GAE baseline for direct comparison.
[Algorithm box / experimental setup] The abstract states the method “retains PPO’s clipped objective”; confirm that the only change is the advantage estimator and that no other PPO hyper-parameters were altered in the reported runs.
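To make the first minor comment concrete, one way the two estimators could be written with shared symbols is sketched below. This is the textbook form of GAE next to a standard Expected SARSA(λ) trace, offered as an assumption about what a side-by-side presentation might look like; the superscript ES and the symbol $\bar{V}$ are introduced here for illustration and are not taken from the manuscript.

```latex
\begin{align*}
\delta^{V}_{t} &= r_t + \gamma V(s_{t+1}) - V(s_t), &
\hat{A}^{\mathrm{GAE}}_{t} &= \sum_{l \ge 0} (\gamma\lambda)^{l}\, \delta^{V}_{t+l}, \\
\bar{V}(s) &= \sum_{a} \pi(a \mid s)\, Q(s,a), &
\delta^{Q}_{t} &= r_t + \gamma \bar{V}(s_{t+1}) - Q(s_t,a_t), \\
\hat{A}^{\mathrm{ES}}_{t} &= Q(s_t,a_t) - \bar{V}(s_t) + \sum_{l \ge 0} (\gamma\lambda)^{l}\, \delta^{Q}_{t+l}.
\end{align*}
```

In this form the only structural difference is that each TD error subtracts $Q(s_t,a_t)$ rather than $V(s_t)$, so with an accurate critic the sampled action's own advantage no longer appears as noise in later terms of the sum.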

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We address each major comment below and indicate the revisions made to the manuscript.

Referee: [Q-boosting / VRPO description] The central claim that Q-boosting eliminates action-sampling variance rests on the assumption that the centralized critic supplies accurate conditional expectations E_{a~π}[Q(s',a)]. In equilibrium self-play the joint policy is non-stationary, opponents adapt, and observations are partial; any persistent approximation bias or instability in the learned Q-values directly offsets the claimed variance reduction. No theoretical bound, bias analysis, or ablation measuring critic accuracy versus performance gain is provided (see the Q-boosting definition and the critic-training paragraph).

Authors: We agree that the variance-reduction benefit of Q-boosting relies on a reasonably accurate critic. In the original manuscript, we describe the critic training procedure but do not provide an explicit analysis of approximation error. To address this, we have added a new subsection on the bias introduced by critic approximation and its effect on the advantage estimator. Additionally, we include an ablation in which we vary the number of critic updates and plot performance against critic loss, showing a correlation between critic accuracy and VRPO gains. We do not provide a theoretical bound on the bias, since deriving tight bounds in non-stationary multi-agent settings is challenging and beyond the scope of this work, but the empirical evidence supports that the net effect is a reduction in variance. revision: partial
Referee: [Empirical evaluation] The empirical section reports that VRPO “consistently achieves strong performance,” yet supplies neither quantitative metrics (win rates, exploitability, or Elo), statistical significance tests, nor controlled ablations that isolate the contribution of the Expected SARSA(λ) estimator from other implementation choices. Without these controls the claim that VRPO outperforms GAE because of variance reduction cannot be verified.

Authors: The full paper does include quantitative results in the form of win rates and exploitability for both Dou Dizhu and HUNL, with VRPO showing improvements over GAE-based PPO. However, we acknowledge the absence of statistical tests and specific ablations isolating the estimator. In the revised version, we have added: (1) results from 5 independent runs with mean and standard deviation, (2) p-values from paired t-tests comparing VRPO to GAE, and (3) a controlled ablation where only the advantage estimator is changed (GAE vs. Q-boosted Expected SARSA(λ)) while keeping the rest of the PPO implementation identical. These changes allow verification that the performance difference is attributable to the variance reduction. revision: yes

Circularity Check

0 steps flagged

No circularity: VRPO derivation is a direct methodological substitution independent of its inputs

The paper introduces Q-boosting by replacing GAE's sampled multi-step returns with multi-step Expected SARSA(λ) traces that compute E_{a~π}[Q(s',a)] via the centralized critic. This substitution is presented as an explicit algorithmic change that averages out action-sampling noise while retaining PPO's clipped objective; it does not reduce to a fitted parameter renamed as a prediction, nor does any load-bearing premise rely on a self-citation chain, a uniqueness theorem from the same authors, or an ansatz smuggled in via prior work. The derivation chain remains self-contained against external RL benchmarks (Expected SARSA is a standard temporal-difference method), and the empirical results in Dou Dizhu and HUNL are not claimed to follow by construction from the estimator definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes the existence of a learnable centralized critic and that policy expectations can be computed without prohibitive cost.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 6 internal anchors

[1] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. Advances in neural information processing systems, 20, 2007.
[2] Noam Brown, Tuomas Sandholm, and Strategic Machine. Libratus: The superhuman AI for no-limit poker. In IJCAI, pages 5226–5228, 2017.
[3] Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. Deep counterfactual regret minimization. In International conference on machine learning, pages 793–802. PMLR, 2019.
[4] Noam Brown, Anton Bakhtin, Adam Lerer, and Qucheng Gong. Combining deep reinforcement learning and search for imperfect-information games. Advances in neural information processing systems, 33:17057–17069, 2020.
[5] Julien Perolat, Bart De Vylder, Daniel Hennes, Eugene Tarassov, Florian Strub, Vincent de Boer, Paul Muller, Jerome T Connor, Neil Burch, Thomas Anthony, et al. Mastering the game of Stratego with model-free multiagent reinforcement learning. Science, 378(6623):990–996, 2022.
[6] Samuel Sokota, Eugene Vinitsky, Hengyuan Hu, J Zico Kolter, and Gabriele Farina. Superhuman AI for Stratego using self-play reinforcement learning and test-time search. arXiv preprint arXiv:2511.07312, 2025.
[7] Daochen Zha, Jingru Xie, Wenye Ma, Sheng Zhang, Xiangru Lian, Xia Hu, and Ji Liu. Douzero: Mastering doudizhu with self-play deep reinforcement learning. In International conference on machine learning, pages 12333–12344. PMLR, 2021.
[8] Guan Yang, Minghuan Liu, Weijun Hong, Weinan Zhang, Fei Fang, Guangjun Zeng, and Yue Lin. Perfectdou: Dominating doudizhu with perfect information distillation. Advances in neural information processing systems, 35:34954–34965, 2022.
[9] Mingyang Liu, Asuman Ozdaglar, Tiancheng Yu, and Kaiqing Zhang. The power of regularization in solving extensive-form games. arXiv preprint arXiv:2206.09495, 2022.
[10] Mingyang Liu, Gabriele Farina, and Asuman Ozdaglar. A policy-gradient approach to solving imperfect-information games with iterate convergence. arXiv preprint arXiv:2408.00751, 2024.
[11] Max Rudolph, Nathan Lichtle, Sobhan Mohammadpour, Alexandre Bayen, J Zico Kolter, Amy Zhang, Gabriele Farina, Eugene Vinitsky, and Samuel Sokota. Reevaluating policy gradient methods for imperfect-information games. arXiv preprint arXiv:2502.08938, 2025.
[12] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[13] Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of PPO in cooperative multi-agent games. Advances in neural information processing systems, 35:24611–24624, 2022.
[14] Enmin Zhao, Renye Yan, Jinqiu Li, Kai Li, and Junliang Xing. AlphaHoldem: High-performance artificial intelligence for heads-up no-limit poker via end-to-end reinforcement learning. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pages 4689–4697, 2022.
[15] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
[16] Harm Van Seijen, Hado Van Hasselt, Shimon Whiteson, and Marco Wiering. A theoretical and empirical analysis of expected SARSA. In 2009 IEEE symposium on adaptive dynamic programming and reinforcement learning, pages 177–184. IEEE, 2009.
[17] Tim Roughgarden. Algorithmic game theory. Communications of the ACM, 53(7):78–86, 2010.
[18] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
[19] Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pages 759–766, 2000.
[20] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. Advances in neural information processing systems, 29, 2016.
[21] Samuel Sokota, Ryan D’Orazio, J Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. arXiv preprint arXiv:2206.05825, 2022.
[22] Finnegan Southey, Michael P Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes’ bluff: Opponent modelling in poker. arXiv preprint arXiv:1207.1411, 2012.
[23] Christian Schroeder De Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip HS Torr, Mingfei Sun, and Shimon Whiteson. Is independent learning all you need in the Starcraft multi-agent challenge? arXiv preprint arXiv:2011.09533, 2020.
[24] Karl W Cobbe, Jacob Hilton, Oleg Klimov, and John Schulman. Phasic policy gradient. In International Conference on Machine Learning, pages 2020–2027. PMLR, 2021.
[25] Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João G.M. Araújo. Cleanrl: High-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research, 23(274):1–18, 2022. URL http://jmlr.org/papers/v23/21-1342.html
[26] Daochen Zha, Kwei-Herng Lai, Yuanpu Cao, Songyi Huang, Ruzhe Wei, Junyu Guo, and Xia Hu. RLCard: A toolkit for reinforcement learning in card games. arXiv preprint arXiv:1910.04376, 2019.
[27] Qiqi Jiang, Kuangzheng Li, Boyao Du, Hao Chen, and Hai Fang. Deltadou: Expert-level Doudizhu AI through self-play. In IJCAI, pages 1265–1271, 2019.
[28] Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinicius Zambaldi, Satyaki Upadhyay, Julien Pérolat, Sriram Srinivasan, Finbarr Timbers, Karl Tuyls, Shayegan Omidshafiei, et al. Openspiel: A framework for reinforcement learning in games. arXiv preprint arXiv:1908.09453, 2019.
[29] ericgjackson. slumbot2019: Implementations of cfr for solving a variety of holdem-like poker games. GitHub repository, 2023. URL https://github.com/ericgjackson/slumbot2019. Version/commit: a74c99d (Sep 18, 2023). Accessed: 2026-01-27.
[30] Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas hold’em. In IJCAI, volume 15, pages 645–652, 2015.
[31] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.
[32] George W Brown. Iterative solution of games by fictitious play. Act. Anal. Prod Allocation, 13(1):374, 1951.
[33] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
[34] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
[35] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. Advances in neural information processing systems, 30, 2017.
[36] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering Chess and Shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
[37] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters Chess, Shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
[38] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/
[39] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[40] … and applying an action MLP, similar to the state MLP but with only 2 layers. Action scores are computed by summing the elementwise product between the 256-dimensional state feature and each action feature, followed by a learned linear projection. The actor architecture is shown in Figure 5. …

Appendix A Additional Related Work

One of the major breakthroughs in large-scale equilibrium ...

and applying an action MLP, similar to the state MLP but with only 2 layers. Action scores are computed by summing the elementwise product between the 256-dimensional state feature and each action feature, followed by a learned linear projection. The actor architecture is shown in Figure 5.

[Figure 5: actor architecture. Recoverable labels: action logits, inner product, State MLP, 256 feats, concat, Causal Tran...]
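As a rough illustration of the scoring step described above, the sketch below computes one logit per action as the inner product (the sum of the elementwise product) between a state feature and each action feature, and then applies a scalar affine map standing in for the learned linear projection. The function name `action_logits`, the `weight` and `bias` parameters, and the reading of the projection as a scalar affine map are assumptions made for illustration; the source does not specify them, and the toy features here are 3-dimensional rather than 256-dimensional.

```python
def action_logits(state_feat, action_feats, weight=1.0, bias=0.0):
    """Score each action against a single state feature vector.

    Hypothetical sketch: `weight` and `bias` stand in for a learned
    linear projection, which the source does not describe in detail.
    """
    logits = []
    for feat in action_feats:
        if len(feat) != len(state_feat):
            raise ValueError("state and action features must share a dimension")
        # Sum of the elementwise product, i.e. the inner product.
        inner = sum(s * a for s, a in zip(state_feat, feat))
        # Assumed form of the projection: a scalar affine map.
        logits.append(weight * inner + bias)
    return logits


# Toy example with 3-dimensional features (the paper uses 256 dimensions).
state = [1.0, 2.0, 0.5]
actions = [[1.0, 0.0, 0.0], [0.0, 1.0, 2.0]]
logits = action_logits(state, actions, weight=0.5, bias=0.0)
print(logits)
```

Scoring actions by an inner product lets the same state feature be compared against a variable number of legal actions, which is one plausible reason for this design in card games where the legal action set changes every turn.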
