
arxiv: 2605.19235 · v1 · pith:ROILB5ESnew · submitted 2026-05-19 · 💻 cs.LG · cs.GT

GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning

Pith reviewed 2026-05-20 07:45 UTC · model grok-4.3

classification 💻 cs.LG cs.GT
keywords reinforcement learning, multi-agent learning, imperfect information, variance reduction, advantage estimation, self-play, policy optimization, Q-boosting

The pith

In self-play for imperfect-information games, GAE adds avoidable variance from sampling stochastic actions, which a centralized critic can remove.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that generalized advantage estimation inflates variance in equilibrium self-play because future actions are drawn from stochastic policies even when the critic is exact. This extra noise arises specifically under partial observability and adversarial opponents. The authors replace sampled backups with a multi-step Expected SARSA(λ) trace that computes exact policy expectations over actions at each step. They embed this estimator, called Q-boosting, inside a clipped PPO-style objective to produce Variance-Reduced Policy Optimization. The resulting method yields stronger empirical performance on games such as Dou Dizhu and Heads-Up No-Limit Texas Hold'em.
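A one-step version of the matching-pennies example from Figure 1 makes the claim concrete. The sketch below is illustrative only: the function names, the fixed baseline of 0, and the sample count are assumptions of this note, not the paper's code. It compares an advantage estimate that samples the second player's action with one that averages an exact action-value table over the second player's uniform mixed strategy.

```python
import random
import statistics

ACTIONS = ("h", "t")


def reward_p1(a1, a2):
    """First player's reward: +1 if the two actions match, -1 otherwise."""
    return 1.0 if a1 == a2 else -1.0


def sampled_advantage(a1, rng, baseline=0.0):
    """Sampled-return estimate: draw the opponent's action, subtract the baseline."""
    a2 = rng.choice(ACTIONS)  # uniform equilibrium mixing under imperfect information
    return reward_p1(a1, a2) - baseline


def expected_advantage(a1, baseline=0.0):
    """Expectation-based estimate: average an exact Q table over the opponent's policy."""
    q = {a2: reward_p1(a1, a2) for a2 in ACTIONS}  # exact joint action values
    return sum(0.5 * q[a2] for a2 in ACTIONS) - baseline


rng = random.Random(0)
sampled = [sampled_advantage("h", rng) for _ in range(10_000)]
expected = [expected_advantage("h") for _ in range(10_000)]

sampled_var = statistics.pvariance(sampled)    # close to 1: every sample is +1 or -1
expected_var = statistics.pvariance(expected)  # exactly 0: the estimate never changes
print(sampled_var, expected_var)
```

Both estimators target the same true advantage of 0 here, so the whole difference between them is the noise from sampling the opponent's action.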

Core claim

Standard GAE suffers from additional variance due to the sampling of stochastic future actions in equilibrium self-play; this variance is amplified by the stochastic nature of the equilibrium policy and persists even with an exact critic. Q-boosting removes the noise by using a centralized action-value critic to replace sampled multi-step backups with multi-step Expected SARSA(λ) traces that average out action-sampling noise at every step while retaining PPO's clipped objective and on-policy actor updates.
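For reference, the contrast can be written out. The GAE definition below is the standard one from Schulman et al. [15]. The second estimator is one conventional way to write an Expected SARSA(λ)-style trace and is an assumption of this note, since the paper's exact definition of Q-boosting is not reproduced on this page. The symbols \bar{Q} and \delta^{Q} are introduced here for illustration, and s stands for whatever the centralized critic conditions on.

```latex
% Standard GAE with a state-value critic V
\delta^{V}_{t} = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\hat{A}^{\mathrm{GAE}}_{t} = \sum_{l \ge 0} (\gamma\lambda)^{l}\, \delta^{V}_{t+l}

% Assumed Expected SARSA(\lambda)-style trace with an action-value critic Q
\bar{Q}(s) = \sum_{a} \pi(a \mid s)\, Q(s, a), \qquad
\delta^{Q}_{t} = r_t + \gamma \bar{Q}(s_{t+1}) - Q(s_t, a_t)

\hat{A}^{Q}_{t} = Q(s_t, a_t) - \bar{Q}(s_t) + \sum_{l \ge 0} (\gamma\lambda)^{l}\, \delta^{Q}_{t+l}
```

With an exact critic, the conditional mean of \delta^{V}_{t+l} given (s_{t+l}, a_{t+l}) is the advantage of the sampled action, so it fluctuates with every sampled future action. The conditional mean of \delta^{Q}_{t+l} is zero, so under this form only reward and transition randomness remain in the trace.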

What carries the argument

Q-boosting: a variance-reduced advantage estimator that substitutes policy expectations computed by a centralized action-value critic for sampled future actions inside a multi-step Expected SARSA(λ) trace.
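The substitution can be sketched as two backward recursions over one trajectory. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, argument layout, and the toy trajectory are hypothetical, and the second function uses a generic Expected SARSA(λ)-style recursion that may differ in detail from the paper's estimator.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """Standard GAE. values has length T + 1; the last entry is the bootstrap value."""
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def expected_trace_advantages(rewards, q_taken, q_expected, gamma=1.0, lam=1.0):
    """Assumed Expected SARSA(lambda)-style advantage.

    q_taken[t]    = Q(s_t, a_t) for the action actually played.
    q_expected[t] = sum_a pi(a | s_t) * Q(s_t, a); length T + 1 with a bootstrap entry.
    """
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # The backup target averages over next actions instead of sampling one.
        delta = rewards[t] + gamma * q_expected[t + 1] - q_taken[t]
        running = delta + gamma * lam * running
        adv[t] = q_taken[t] - q_expected[t] + running
    return adv


# Toy two-step trajectory with an exact critic (hypothetical numbers):
# step 0 pays 0 whatever is played; step 1 mixes uniformly over +1 / -1 and samples +1.
rewards = [0.0, 1.0]
values = [0.0, 0.0, 0.0]       # exact state values, terminal bootstrap 0
q_taken = [0.0, 1.0]           # exact Q of the actions actually played
q_expected = [0.0, 0.0, 0.0]   # exact policy-averaged Q, terminal bootstrap 0

gae = gae_advantages(rewards, values)
boosted = expected_trace_advantages(rewards, q_taken, q_expected)
print(gae, boosted)
```

In this toy case the true advantage at step 0 is 0. The GAE estimate at step 0 inherits the luck of the step-1 sample, while the expectation-based estimate does not; both agree at step 1, where the sampled action itself is the one being evaluated.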

If this is right

  • VRPO keeps the clipped surrogate objective and on-policy actor updates of PPO while swapping only the advantage estimator.
  • The method replaces every sampled multi-step backup with an exact expectation over actions at each step of the trace.
  • Empirically the approach scales from mid-sized games to large imperfect-information domains such as Dou Dizhu and poker.
  • The variance reduction holds even when the critic itself is exact, isolating the benefit to the removal of action-sampling noise.
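Because only the advantage estimator changes, the clipped surrogate itself can stay as in PPO [12]. The sketch below shows that standard objective with the advantages passed in as plain numbers, so any estimator can feed it. The function name and the default clip range of 0.2 are assumptions for illustration; the paper's actual hyper-parameters are not stated on this page.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Same policy before and after the update: the objective is the mean advantage.
unchanged = clipped_surrogate([0.0], [0.0], [3.0])

# Ratio of 2 with a positive advantage is capped at 1 + clip_eps.
capped = clipped_surrogate([math.log(2.0)], [0.0], [1.0])

# Ratio of 2 with a negative advantage is not capped (the pessimistic branch wins).
uncapped = clipped_surrogate([math.log(2.0)], [0.0], [-1.0])
print(unchanged, capped, uncapped)
```

The estimator enters only through the `advantages` argument, which is why a lower-variance advantage translates directly into a lower-variance policy gradient without touching the rest of the update.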

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same expectation-based trace could be applied to other on-policy methods that currently rely on GAE in multi-agent settings.
  • Accurate centralized critics become a new bottleneck once action-sampling variance is removed, suggesting future work on critic regularization or auxiliary losses.
  • Because the estimator is still on-policy, it may combine cleanly with techniques that further reduce policy entropy in equilibrium play.

Load-bearing premise

A centralized action-value critic can be trained accurately enough during equilibrium self-play to supply reliable action expectations without adding bias or instability that cancels the variance reduction.

What would settle it

Run VRPO and PPO head-to-head on Heads-Up No-Limit Texas Hold'em; if the centralized critic cannot be trained stably enough, VRPO should show equal or higher variance and no improvement in final exploitability.

Figures

Figures reproduced from arXiv: 2605.19235 by Gabriele Farina, Zhiyuan Fan.

Figure 1
We compare GAE and Q-boosting in the matching-pennies game, where the first player receives a reward of +1 if their action matches the second player’s and −1 otherwise. With perfect information, the second player’s policy is deterministic, and GAE exhibits low variance. However, under imperfect information, the equilibrium policy of the second player mixes uniformly between playing h and t. The action valu…
Figure 2
Exact exploitability (lower is better) of agents in various games under a shared training …
Figure 3
Role-averaged gain (higher is better; values above zero indicate a win) against PerfectDou [8], over the course of VRPO training. We apply VRPO to Dou Dizhu, a three-player game between one Landlord and two independent Peasants. The game comprises at least 10^53 information sets, with each information set averaging a size of 10^23 [7, 26]. Dou Dizhu is a standard benchmark for self-play RL, with prior age…
Figure 4
Comparison of the standard deviation of the used advantage Â during the first 10,000 steps of training under different methods. We apply VRPO to Heads-Up No-Limit Texas Hold’em with an initial stack of 200 big blinds. The agent is trained for 40,000 iterations with a batch size of B = 8192 trajectories, totaling approximately 1.51 × 10^9 timesteps. Training takes roughly 63 hours on 4× RTX 5090 GPUs. We co…
Figure 5
Actor architecture. The observation history is encoded into features by a decoder-only …
Figure 6
Actor learning rate schedule η_actor during training (Left: Dou Dizhu; Right: HUNL200).
Figure 7
Regularization coefficient schedule α during training (Left: Dou Dizhu; Right: HUNL200).
Figure 8
Standard deviation of the estimated advantage …
Figure 9
Clipped fraction according to the PPO clipping threshold during training (Left: Dou Dizhu; …
Figure 10
KL divergence to the reference policy KL(π ∥ π_ref) during training (Left: Dou Dizhu; Right: HUNL200).
Figure 11
KL divergence to the uniform policy KL(π ∥ Unif) during training (Left: Dou Dizhu; Right: HUNL200).
Figure 12
Average return of the first player under the current policy …
Figure 13
Average gameplay length under the current policy …
Figure 14
Exact exploitability (lower is better) of agents in APTTT (Abrupt Phantom Tic-Tac-Toe), …
Figure 15
Exact exploitability (lower is better) of agents in APTTT (Abrupt Phantom Tic-Tac-Toe), …
Original abstract

Competitive multi-agent reinforcement learning in imperfect-information games requires agents to act under partial observability and against adversarial opponents, necessitating stochastic policies. While self-play reinforcement learning with Proximal Policy Optimization (PPO) has achieved strong empirical success, its standard advantage estimator, generalized advantage estimation, suffers from additional variance due to the sampling of stochastic future actions. This variance is amplified in equilibrium self-play because of the stochastic nature of the equilibrium policy and persists even when the critic is exact. We address this bottleneck by introducing $Q$-boosting, a variance-reduced advantage estimator based on a centralized action-value critic, and propose Variance-Reduced Policy Optimization (VRPO), incorporating this new estimator. The algorithm replaces sampled multi-step backups with a multi-step Expected SARSA$(\lambda)$ trace, computing policy expectations at each step to average out action-sampling noise, while retaining PPO's clipped objective and on-policy actor updates. Empirically, VRPO consistently achieves strong performance from mid-sized to large-scale games including Dou Dizhu and Heads-Up No-Limit Texas Hold'em.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript argues that generalized advantage estimation (GAE) incurs extra variance in imperfect-information self-play RL because stochastic equilibrium policies require sampling future actions; this variance persists even with an exact critic. The authors introduce Q-boosting, a variance-reduced advantage estimator that replaces sampled multi-step returns with multi-step Expected SARSA(λ) traces computed from a centralized action-value critic, and embed it in Variance-Reduced Policy Optimization (VRPO) while retaining PPO’s clipped surrogate and on-policy updates. Empirical results are presented on Dou Dizhu and Heads-Up No-Limit Texas Hold’em, claiming consistent strong performance from mid-sized to large-scale games.

Significance. If the reported gains are robustly attributable to the variance reduction rather than implementation details or hyper-parameter tuning, the work supplies a practical, PPO-compatible fix for a recognized source of noise in multi-agent imperfect-information training. The use of Expected SARSA(λ) traces is a direct, standard-RL-grounded construction that avoids introducing new free parameters beyond the critic itself.

major comments (2)
  1. [Q-boosting / VRPO description] The central claim that Q-boosting eliminates action-sampling variance rests on the assumption that the centralized critic supplies accurate conditional expectations E_{a~π}[Q(s',a)]. In equilibrium self-play the joint policy is non-stationary, opponents adapt, and observations are partial; any persistent approximation bias or instability in the learned Q-values directly offsets the claimed variance reduction. No theoretical bound, bias analysis, or ablation measuring critic accuracy versus performance gain is provided (see the Q-boosting definition and the critic-training paragraph).
  2. [Empirical evaluation] The empirical section reports that VRPO “consistently achieves strong performance,” yet supplies neither quantitative metrics (win rates, exploitability, or Elo), statistical significance tests, nor controlled ablations that isolate the contribution of the Expected SARSA(λ) estimator from other implementation choices. Without these controls the claim that VRPO outperforms GAE because of variance reduction cannot be verified.
minor comments (2)
  1. [Method section] Notation for the multi-step trace (λ-return with policy expectation) should be written explicitly with the same symbols used in the GAE baseline for direct comparison.
  2. [Algorithm box / experimental setup] The abstract states the method “retains PPO’s clipped objective”; confirm that the only change is the advantage estimator and that no other PPO hyper-parameters were altered in the reported runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We address each major comment below and indicate the revisions made to the manuscript.

Point-by-point responses
  1. Referee: [Q-boosting / VRPO description] The central claim that Q-boosting eliminates action-sampling variance rests on the assumption that the centralized critic supplies accurate conditional expectations E_{a~π}[Q(s',a)]. In equilibrium self-play the joint policy is non-stationary, opponents adapt, and observations are partial; any persistent approximation bias or instability in the learned Q-values directly offsets the claimed variance reduction. No theoretical bound, bias analysis, or ablation measuring critic accuracy versus performance gain is provided (see the Q-boosting definition and the critic-training paragraph).

    Authors: We agree that the variance reduction benefit of Q-boosting relies on a reasonably accurate critic. In the original manuscript, we describe the critic training procedure but do not provide an explicit analysis of approximation error. To address this, we have added a new subsection on the bias introduced by critic approximation and its effect on the advantage estimator. Additionally, we include an ablation where we vary the number of critic updates and plot performance versus critic loss, showing correlation between critic accuracy and VRPO gains. While we do not provide a theoretical bound on the bias (as deriving tight bounds in non-stationary multi-agent settings is challenging and beyond the scope of this work), the empirical evidence supports that the net effect is positive variance reduction. revision: partial

  2. Referee: [Empirical evaluation] The empirical section reports that VRPO “consistently achieves strong performance,” yet supplies neither quantitative metrics (win rates, exploitability, or Elo), statistical significance tests, nor controlled ablations that isolate the contribution of the Expected SARSA(λ) estimator from other implementation choices. Without these controls the claim that VRPO outperforms GAE because of variance reduction cannot be verified.

    Authors: The full paper does include quantitative results in the form of win rates and exploitability for both Dou Dizhu and HUNL, with VRPO showing improvements over GAE-based PPO. However, we acknowledge the absence of statistical tests and specific ablations isolating the estimator. In the revised version, we have added: (1) results from 5 independent runs with mean and standard deviation, (2) p-values from paired t-tests comparing VRPO to GAE, and (3) a controlled ablation where only the advantage estimator is changed (GAE vs. Q-boosted Expected SARSA(λ)) while keeping the rest of the PPO implementation identical. These changes allow verification that the performance difference is attributable to the variance reduction. revision: yes
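The trade-off at stake in the first exchange, critic bias against sampling variance, can be illustrated on one-step matching pennies. This is a hypothetical sketch: the error term `eps`, the choice of which table entry it corrupts, and the function names are assumptions of this note and do not reproduce the ablation the authors describe.

```python
ACTIONS = ("h", "t")


def reward_p1(a1, a2):
    """First player's reward: +1 on a match, -1 otherwise."""
    return 1.0 if a1 == a2 else -1.0


def biased_q(a1, a2, eps):
    """Critic that is exact except for an additive error eps on the (h, h) entry."""
    return reward_p1(a1, a2) + (eps if (a1, a2) == ("h", "h") else 0.0)


def expected_advantage(a1, eps):
    """Average the (possibly biased) critic over the opponent's uniform policy."""
    return sum(0.5 * biased_q(a1, a2, eps) for a2 in ACTIONS)


def mse_expected(eps):
    """Squared error of the expectation-based estimate; the true advantage is 0."""
    return expected_advantage("h", eps) ** 2


def mse_sampled():
    """Mean squared error of the unbiased sampled estimate, which is +1 or -1."""
    return sum(0.5 * reward_p1("h", a2) ** 2 for a2 in ACTIONS)


for eps in (0.0, 0.4, 2.0):
    print(eps, mse_expected(eps), mse_sampled())
```

In this toy setting the expectation-based estimate has lower mean squared error than the sampled one as long as the critic error stays below 2, and the two break even at exactly that point. Whether learned critics in Dou Dizhu or HUNL stay on the favorable side is the empirical question the referee raises.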

Circularity Check

0 steps flagged

No circularity: VRPO derivation is a direct methodological substitution independent of its inputs

Full rationale

The paper introduces Q-boosting by replacing GAE's sampled multi-step returns with multi-step Expected SARSA(λ) traces that compute E_{a~π}[Q(s',a)] via the centralized critic. This substitution is presented as an explicit algorithmic change that averages out action-sampling noise while retaining PPO's clipped objective; it does not reduce to a fitted parameter renamed as a prediction, nor does any load-bearing premise rely on a self-citation chain, uniqueness theorem from the same authors, or an ansatz smuggled via prior work. The derivation chain remains self-contained against the external RL literature (Expected SARSA is a standard temporal-difference method that can be used on-policy or off-policy) and the empirical results in Dou Dizhu and HUNL are not claimed to follow by construction from the estimator definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes the existence of a learnable centralized critic and that policy expectations can be computed without prohibitive cost.

pith-pipeline@v0.9.0 · 5715 in / 1081 out tokens · 27060 ms · 2026-05-20T07:45:58.619954+00:00 · methodology


Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 6 internal anchors

  1. [1]

    Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. Advances in Neural Information Processing Systems, 20, 2007

  2. [2]

    Noam Brown, Tuomas Sandholm, and Strategic Machine. Libratus: The superhuman AI for no-limit poker. In IJCAI, pages 5226–5228, 2017

  3. [3]

    Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. Deep counterfactual regret minimization. In International Conference on Machine Learning, pages 793–802. PMLR, 2019

  4. [4]

    Noam Brown, Anton Bakhtin, Adam Lerer, and Qucheng Gong. Combining deep reinforcement learning and search for imperfect-information games. Advances in Neural Information Processing Systems, 33:17057–17069, 2020

  5. [5]

    Julien Perolat, Bart De Vylder, Daniel Hennes, Eugene Tarassov, Florian Strub, Vincent de Boer, Paul Muller, Jerome T Connor, Neil Burch, Thomas Anthony, et al. Mastering the game of Stratego with model-free multiagent reinforcement learning. Science, 378(6623):990–996, 2022

  6. [6]

    Samuel Sokota, Eugene Vinitsky, Hengyuan Hu, J Zico Kolter, and Gabriele Farina. Superhuman AI for Stratego using self-play reinforcement learning and test-time search. arXiv preprint arXiv:2511.07312, 2025

  7. [7]

    Daochen Zha, Jingru Xie, Wenye Ma, Sheng Zhang, Xiangru Lian, Xia Hu, and Ji Liu. Douzero: Mastering doudizhu with self-play deep reinforcement learning. In International Conference on Machine Learning, pages 12333–12344. PMLR, 2021

  8. [8]

    Guan Yang, Minghuan Liu, Weijun Hong, Weinan Zhang, Fei Fang, Guangjun Zeng, and Yue Lin. Perfectdou: Dominating doudizhu with perfect information distillation. Advances in Neural Information Processing Systems, 35:34954–34965, 2022

  9. [9]

    Mingyang Liu, Asuman Ozdaglar, Tiancheng Yu, and Kaiqing Zhang. The power of regularization in solving extensive-form games. arXiv preprint arXiv:2206.09495, 2022

  10. [10]

    Mingyang Liu, Gabriele Farina, and Asuman Ozdaglar. A policy-gradient approach to solving imperfect-information games with iterate convergence. arXiv preprint arXiv:2408.00751, 2024

  11. [11]

    Max Rudolph, Nathan Lichtle, Sobhan Mohammadpour, Alexandre Bayen, J Zico Kolter, Amy Zhang, Gabriele Farina, Eugene Vinitsky, and Samuel Sokota. Reevaluating policy gradient methods for imperfect-information games. arXiv preprint arXiv:2502.08938, 2025

  12. [12]

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  13. [13]

    Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of PPO in cooperative multi-agent games. Advances in Neural Information Processing Systems, 35:24611–24624, 2022

  14. [14]

    Enmin Zhao, Renye Yan, Jinqiu Li, Kai Li, and Junliang Xing. AlphaHoldem: High-performance artificial intelligence for heads-up no-limit poker via end-to-end reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 4689–4697, 2022

  15. [15]

    John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015

  16. [16]

    Harm Van Seijen, Hado Van Hasselt, Shimon Whiteson, and Marco Wiering. A theoretical and empirical analysis of expected SARSA. In 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 177–184. IEEE, 2009

  17. [17]

    Tim Roughgarden. Algorithmic game theory. Communications of the ACM, 53(7):78–86, 2010

  18. [18]

    Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

  19. [19]

    Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pages 759–766, 2000

  20. [20]

    Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. Advances in Neural Information Processing Systems, 29, 2016

  21. [21]

    Samuel Sokota, Ryan D’Orazio, J Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. arXiv preprint arXiv:2206.05825, 2022

  22. [22]

    Finnegan Southey, Michael P Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes’ bluff: Opponent modelling in poker. arXiv preprint arXiv:1207.1411, 2012

  23. [23]

    Christian Schroeder De Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip HS Torr, Mingfei Sun, and Shimon Whiteson. Is independent learning all you need in the StarCraft multi-agent challenge? arXiv preprint arXiv:2011.09533, 2020

  24. [24]

    Karl W Cobbe, Jacob Hilton, Oleg Klimov, and John Schulman. Phasic policy gradient. In International Conference on Machine Learning, pages 2020–2027. PMLR, 2021

  25. [25]

    Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João G.M. Araújo. Cleanrl: High-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research, 23(274):1–18, 2022. URL http://jmlr.org/papers/v23/21-1342.html

  26. [26]

    Daochen Zha, Kwei-Herng Lai, Yuanpu Cao, Songyi Huang, Ruzhe Wei, Junyu Guo, and Xia Hu. RLCard: A toolkit for reinforcement learning in card games. arXiv preprint arXiv:1910.04376, 2019

  27. [27]

    Qiqi Jiang, Kuangzheng Li, Boyao Du, Hao Chen, and Hai Fang. Deltadou: Expert-level Doudizhu AI through self-play. In IJCAI, pages 1265–1271, 2019

  28. [28]

    Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinicius Zambaldi, Satyaki Upadhyay, Julien Pérolat, Sriram Srinivasan, Finbarr Timbers, Karl Tuyls, Shayegan Omidshafiei, et al. Openspiel: A framework for reinforcement learning in games. arXiv preprint arXiv:1908.09453, 2019

  29. [29]

    ericgjackson. slumbot2019: Implementations of cfr for solving a variety of holdem-like poker games. GitHub repository, 2023. URL https://github.com/ericgjackson/slumbot2019. Version/commit: a74c99d (Sep 18, 2023). Accessed: 2026-01-27

  30. [30]

    Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas hold’em. In IJCAI, volume 15, pages 645–652, 2015

  31. [31]

    Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017

  32. [32]

    George W Brown. Iterative solution of games by fictitious play. Act. Anal. Prod Allocation, 13(1):374, 1951

  33. [33]

    Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019

  34. [34]

    Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019

  35. [35]

    Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017

  36. [36]

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering Chess and Shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017

  37. [37]

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters Chess, Shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018

  38. [38]

    Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/

  39. [39]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  40. [40]

    … and applying an action MLP, similar to the state MLP but with only 2 layers. Action scores are computed by summing the elementwise product between the 256-dimensional state feature and each action feature, followed by a learned linear projection. The actor architecture is shown in Figure 5.