NashPG: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria

Chang Xu; Cl\'ement L. Canonne; Eason Yu; Nguyen H. Tran; Stefano V. Albrecht; Tzu Hao Liu; Yunke Wang

arxiv: 2510.18183 · v2 · submitted 2025-10-21 · 💻 cs.LG · cs.GT

NashPG: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria

Eason Yu , Tzu Hao Liu , Cl\'ement L. Canonne , Yunke Wang , Chang Xu , Nguyen H. Tran , Stefano V. Albrecht This is my paper

Pith reviewed 2026-05-18 05:44 UTC · model grok-4.3

classification 💻 cs.LG cs.GT

keywords Nash equilibriapolicy gradientregularizationextensive-form gameszero-sum gamesBregman divergencemulti-agent reinforcement learningimperfect information

0 comments

The pith

Placing iteratively refined regularization inside the policy-gradient objective guarantees strictly monotonic reduction in Bregman divergence toward Nash equilibria in two-player zero-sum extensive-form games.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a multi-round regularization procedure can be added to policy-gradient methods to locate Nash equilibria in two-player zero-sum extensive-form games. It proves that the procedure produces strictly monotonic decreases in Bregman divergence to the equilibrium set and ensures convergence to an actual equilibrium. NashPG realizes this idea by embedding the regularization term directly into the standard policy-gradient objective, so it runs with ordinary policy-gradient updates. This design avoids the need for full game-tree enumeration or non-policy-gradient inner solvers that limited earlier methods. On benchmark games the approach matches or beats prior model-free exploitability levels and extends to larger domains such as Battleship and No-Limit Texas Hold'em.

Core claim

A novel multi-round regularization procedure guarantees strictly monotonic reduction in Bregman divergence to Nash equilibria and eventual convergence to one in two-player zero-sum extensive-form games; the regularization can be placed directly inside a standard policy-gradient objective to produce the practical NashPG algorithm.

What carries the argument

The multi-round regularization term that is refined across rounds and inserted into the policy-gradient objective, enforcing monotonic progress measured by Bregman divergence.

If this is right

NashPG runs with ordinary policy-gradient code and needs no custom inner solvers.
The method reaches comparable or lower exploitability than prior model-free algorithms on standard benchmark games.
It scales to large imperfect-information domains such as Battleship and No-Limit Texas Hold'em while producing higher head-to-head payoffs.
The regularization framework supplies a scalable policy-gradient route to Nash equilibria that earlier full-enumeration or non-gradient approaches lacked.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same regularization placement might stabilize training in broader classes of multi-agent games beyond two-player zero-sum settings.
Adaptive schedules for how the regularization strength changes across rounds could accelerate convergence in practice.
The monotonic-divergence property offers a diagnostic that could be monitored during training to detect when an agent is drifting away from equilibrium play.

Load-bearing premise

That the regularization term can be inserted directly into a standard policy-gradient objective while still preserving the strictly monotonic reduction in Bregman divergence without extra assumptions on the game tree or the inner solver.

What would settle it

A concrete run on any two-player zero-sum extensive-form game in which the Bregman divergence to the Nash set increases or fails to decrease monotonically during NashPG updates would falsify the guarantee.

Figures

Figures reproduced from arXiv: 2510.18183 by Chang Xu, Cl\'ement L. Canonne, Eason Yu, Nguyen H. Tran, Stefano V. Albrecht, Tzu Hao Liu, Yunke Wang.

**Figure 2.** Figure 2: Exploitability in real-world games. Battleship and Texas Hold’em show negative values (theoretically [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Elo ratings in large real-world games. N [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Exploitability under annealing different initial values for [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

**Figure 5.** Figure 5: Maximum GPU memory usage during training across environments and algorithms. [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗

**Figure 6.** Figure 6: Total wall-clock time during training across all evaluated environments and algorithms. [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗

read the original abstract

Finding Nash equilibria in two-player zero-sum imperfect-information games remains a central challenge in multi-agent reinforcement learning. Recent multi-round regularization methods offer a promising direction, yet existing approaches either require full enumeration of the game tree or rely on non-policy-gradient inner solvers that underperform in practice, leaving a scalable policy-gradient-based solution open. In this paper, we propose a novel multi-round regularization procedure and show that it guarantees strictly monotonic reduction in Bregman divergence to Nash equilibria and eventual convergence to one in two-player zero-sum extensive-form games. Guided by this framework, we develop a practical algorithm, Nash Policy Gradient (NashPG), which places the regularization directly in the policy optimization objective and is implemented using standard policy gradient methods. Empirically, NashPG achieves comparable or lower exploitability than prior model-free methods on classic benchmark games and scales to large domains such as Battleship and No-Limit Texas Hold'em, where it attains higher average payoff in head-to-head play.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

NashPG puts iterative regularization inside policy gradients for Nash equilibria in extensive-form games and scales empirically, but the monotonic Bregman claim looks tied to exact updates that stochastic gradients may not preserve.

read the letter

Hi there, the main point with this paper is that it develops NashPG, which uses a multi-round regularization scheme inside a policy gradient method to find Nash equilibria in two-player zero-sum extensive-form games. They claim this gives strictly monotonic reduction in Bregman divergence and convergence, plus it scales practically to large games. What the paper does well is take an existing regularization idea and make it work with standard policy gradients without full tree enumeration. This addresses a gap where prior methods were either too slow or not gradient-based. The empirical results look promising, with comparable or better exploitability on classic benchmarks and good performance in head-to-head on No-Limit Texas Hold'em and Battleship. That's a real strength for showing practical reach. The soft spots are in how the theory connects to the implementation. The convergence guarantee relies on the regularization leading to exact descent steps on the divergence. But policy gradients use sampling, which adds noise and potential bias in extensive-form settings. Without a clear analysis of how much this approximation affects the monotonicity, the strict guarantee might not hold in the stochastic case. I'd check the proof for any handling of this error. This is the kind of paper for people in multi-agent RL who want methods that work on big imperfect-info games. It could be valuable for those trying to combine theoretical properties with scalable optimization. I recommend putting it through peer review. The scaling results and the attempt at a gradient-friendly framework make it worth a serious look, even if revisions are needed on the theory side.

Referee Report

3 major / 2 minor

Summary. The paper proposes NashPG, a policy-gradient algorithm for two-player zero-sum extensive-form games that embeds an iteratively refined regularization procedure directly into the policy optimization objective. It claims this multi-round regularization guarantees strictly monotonic reduction in Bregman divergence to a Nash equilibrium and eventual convergence, while remaining compatible with standard policy-gradient methods. The work includes empirical evaluation showing competitive or superior exploitability and payoff on benchmarks including Battleship and No-Limit Texas Hold'em.

Significance. If the claimed monotonicity and convergence guarantees can be shown to survive the transition from exact regularized best-response updates to stochastic policy-gradient approximations, the result would be significant: it would supply a scalable, model-free method with provable properties for imperfect-information games, extending prior regularization approaches that either enumerate the full game tree or rely on non-PG inner solvers.

major comments (3)

[§3] §3 (Theoretical Analysis), Theorem on monotonic Bregman reduction: the proof establishes strictly monotonic decrease only under exact minimization of the regularized objective at each round. NashPG replaces this with stochastic policy-gradient estimators; the manuscript provides no error bound, variance control, or additional assumptions on tree depth or sampling that would ensure the one-step update remains a descent step on the Bregman divergence, so the strict monotonicity guarantee does not automatically transfer.
[§4] §4 (Algorithm Description), Eq. for the regularized policy-gradient objective: the placement of the iteratively refined regularization term inside a standard PG loss is presented as preserving the theoretical property, yet the derivation appears to assume the inner solver returns the exact minimizer; the stochastic nature of the gradient estimator introduces bias and variance that are not bounded in the convergence argument.
[§5] §5 (Experiments), Tables on exploitability: while results report lower exploitability than prior model-free baselines on several domains, the absence of reported gradient variance, confidence intervals on the monotonicity metric, or ablation on the number of regularization rounds leaves open whether observed performance is consistent with the claimed strict decrease in Bregman divergence.

minor comments (2)

[§2] The recursive definition of the refined regularization term is introduced in the main text but its dependence on previous-round policies could be stated more explicitly with a numbered equation for immediate reference.
[§5] Figure captions for the large-game scaling plots should include the precise number of training iterations and the exact exploitability metric used, rather than referring readers to the appendix for these details.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important distinctions between exact and stochastic optimization that we will clarify in the revision. Below we respond point by point to the major comments.

read point-by-point responses

Referee: [§3] §3 (Theoretical Analysis), Theorem on monotonic Bregman reduction: the proof establishes strictly monotonic decrease only under exact minimization of the regularized objective at each round. NashPG replaces this with stochastic policy-gradient estimators; the manuscript provides no error bound, variance control, or additional assumptions on tree depth or sampling that would ensure the one-step update remains a descent step on the Bregman divergence, so the strict monotonicity guarantee does not automatically transfer.

Authors: We agree that the proof of strict monotonic Bregman reduction in Theorem 1 is stated for exact minimization of the regularized objective at each iteration. The stochastic policy-gradient implementation in NashPG introduces approximation error, and the manuscript does not supply concentration bounds or variance controls that would preserve strict per-step monotonicity. In the revised manuscript we will explicitly qualify the theorem statement to apply only under exact inner minimization and add a short discussion noting that the stochastic version relies on empirical convergence rather than a transferred guarantee. We do not claim a rigorous stochastic monotonicity result at this time. revision: partial
Referee: [§4] §4 (Algorithm Description), Eq. for the regularized policy-gradient objective: the placement of the iteratively refined regularization term inside a standard PG loss is presented as preserving the theoretical property, yet the derivation appears to assume the inner solver returns the exact minimizer; the stochastic nature of the gradient estimator introduces bias and variance that are not bounded in the convergence argument.

Authors: The derivation in §4 substitutes the iteratively refined regularized objective into the standard policy-gradient estimator. We acknowledge that the accompanying convergence discussion implicitly relies on properties that hold for exact minimization. We will revise the text to state clearly that the theoretical monotonicity and convergence results are proven for the exact case, while the practical NashPG algorithm uses stochastic gradients whose bias and variance are not bounded in the current analysis. The revised wording will emphasize that empirical performance on the benchmarks is offered as supporting evidence rather than a formal transfer of the guarantee. revision: yes
Referee: [§5] §5 (Experiments), Tables on exploitability: while results report lower exploitability than prior model-free baselines on several domains, the absence of reported gradient variance, confidence intervals on the monotonicity metric, or ablation on the number of regularization rounds leaves open whether observed performance is consistent with the claimed strict decrease in Bregman divergence.

Authors: We will add confidence intervals computed over multiple independent runs to the exploitability tables. We will also include an ablation on the number of regularization rounds and report gradient variance estimates for the main domains. These additions will help readers assess consistency with the claimed behavior. Because the original experiments did not record per-run monotonicity metrics, we cannot retroactively supply confidence intervals on that specific quantity without new runs; we will therefore focus the revision on the requested ablations and variance reporting. revision: partial

Circularity Check

0 steps flagged

No circularity: derivation chain is self-contained

full rationale

The paper derives a multi-round regularization procedure whose strictly monotonic Bregman-divergence reduction is stated as a first-principles guarantee for exact regularized best-response updates in two-player zero-sum extensive-form games; the NashPG algorithm is then obtained by substituting a standard policy-gradient estimator into that same objective. No equation or claim reduces by construction to a fitted parameter, a self-citation that itself depends on the target result, or a renaming of an already-known empirical pattern. The theoretical guarantee is presented as independent of the subsequent empirical implementation, satisfying the criteria for a non-circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on a novel regularization procedure whose detailed construction and proof are not supplied in the abstract; the ledger therefore records the minimal domain assumptions needed to state the result.

axioms (1)

domain assumption The setting is restricted to two-player zero-sum extensive-form games with imperfect information.
The convergence statement is explicitly limited to this class of games.

invented entities (1)

Multi-round regularization procedure no independent evidence
purpose: To enforce monotonic reduction in Bregman divergence when embedded in policy-gradient updates.
New procedure introduced by the paper; no independent evidence supplied in the abstract.

pith-pipeline@v0.9.0 · 5726 in / 1278 out tokens · 57866 ms · 2026-05-18T05:44:50.432118+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 2 internal anchors

[1]

Oliehoek

Ariyan Bighashdel, Yongzhao Wang, Stephen McAleer, Rahul Savani, and Frans A. Oliehoek. Policy space response oracles: A survey. InIJCAI, pages 7951–7961. ijcai.org, 2024

work page 2024
[2]

JAX: composable transfor- mations of Python+NumPy programs, 2018

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transfor- mations of Python+NumPy programs, 2018. URLhttp://github.com/jax-ml/jax

work page 2018
[3]

Iterative solution of games by fictitious play.Act

George W Brown. Iterative solution of games by fictitious play.Act. Anal. Prod Allocation, 13(1):374, 1951

work page 1951
[4]

Superhuman ai for heads-up no-limit poker: Libratus beats top profession- als.Science, 359(6374):418–424, 2018

Noam Brown and Tuomas Sandholm. Superhuman ai for heads-up no-limit poker: Libratus beats top profession- als.Science, 359(6374):418–424, 2018. doi: 10.1126/science.aao1733. URLhttps://www.science.org/ doi/abs/10.1126/science.aao1733

work page doi:10.1126/science.aao1733 2018
[5]

Superhuman ai for multiplayer poker.Science, 365(6456):885–890, 2019

Noam Brown and Tuomas Sandholm. Superhuman ai for multiplayer poker.Science, 365(6456):885–890, 2019. doi: 10.1126/science.aay2400. URLhttps://www.science.org/doi/abs/10.1126/science.aay2400

work page doi:10.1126/science.aay2400 2019
[6]

Deep counterfactual regret minimization

Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. Deep counterfactual regret minimization. In ICML, volume 97 ofProceedings of Machine Learning Research, pages 793–802. PMLR, 2019

work page 2019
[7]

A comprehensive survey of multiagent reinforcement learning.IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2): 156–172, 2008

Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning.IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2): 156–172, 2008

work page 2008
[8]

Fast policy extragradient methods for competitive games with entropy regularization

Shicong Cen, Yuting Wei, and Yuejie Chi. Fast policy extragradient methods for competitive games with entropy regularization. InNeurIPS, pages 27952–27964, 2021

work page 2021
[9]

Last-iterate convergence: Zero-sum games and constrained min- max optimization

Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min- max optimization. InITCS, volume 124 ofLIPIcs, pages 27:1–27:18. Schloss Dagstuhl - Leibniz-Zentrum f ¨ur Informatik, 2019

work page 2019
[10]

Training gans with optimism

Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training gans with optimism. In ICLR (Poster). OpenReview.net, 2018

work page 2018
[11]

Springer, 2003

Francisco Facchinei and Jong-Shi Pang.Finite-dimensional variational inequalities and complementarity prob- lems. Springer, 2003

work page 2003
[12]

Solving for best responses in extensive-form games using reinforcement learning methods

Amy Greenwald, Jiacui Li, Eric Sodomka, and Michael Littman. Solving for best responses in extensive-form games using reinforcement learning methods. InProceedings of the conference on reinforcement learning and decision making, pages 116–120. Citeseer, 2013

work page 2013
[13]

Deep Reinforcement Learning from Self-Play in Imperfect-Information Games

Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games.CoRR, abs/1603.01121, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[14]

Learning in perturbed asymmetric games.Games Econ

Josef Hofbauer and Ed Hopkins. Learning in perturbed asymmetric games.Games Econ. Behav., 52(1):133–152, 2005

work page 2005
[15]

Adaptive learning in continuous games: Optimal regret bounds and convergence to nash equilibrium

Yu-Guan Hsieh, Kimon Antonakopoulos, and Panayotis Mertikopoulos. Adaptive learning in continuous games: Optimal regret bounds and convergence to nash equilibrium. InCOLT, volume 134 ofProceedings of Machine Learning Research, pages 2388–2422. PMLR, 2021. 9 Nash Policy GradientA PREPRINT

work page 2021
[16]

Extensive games and the problem of information.Contributions to the Theory of Games, 2(28): 193–216, 1953

Harold W Kuhn. Extensive games and the problem of information.Contributions to the Theory of Games, 2(28): 193–216, 1953

work page 1953
[17]

A unified game-theoretic approach to multiagent reinforcement learning

Marc Lanctot, Vin ´ıcius Flores Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien P ´erolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In NIPS, pages 4190–4203, 2017

work page 2017
[19]

URLhttp://arxiv.org/abs/1908.09453

work page arXiv 1908
[20]

A class of gap functions for variational inequalities.Mathematical Programming, 64(1):53–79, 1994

Torbj ¨orn Larsson and Michael Patriksson. A class of gap functions for variational inequalities.Mathematical Programming, 64(1):53–79, 1994

work page 1994
[21]

Last-iterate convergence in extensive-form games

Chung-Wei Lee, Christian Kroer, and Haipeng Luo. Last-iterate convergence in extensive-form games. In NeurIPS, pages 14293–14305, 2021

work page 2021
[22]

Sample complexity of asynchronous q-learning: Sharper analysis and variance reduction

Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous q-learning: Sharper analysis and variance reduction. InNeurIPS, 2020

work page 2020
[23]

A survey of nash equilibrium strategy solving based on cfr.Archives of Computational Methods in Engineering, 28(4), 2021

Huale Li, Xuan Wang, Fengwei Jia, Yifan Li, and Qian Chen. A survey of nash equilibrium strategy solving based on cfr.Archives of Computational Methods in Engineering, 28(4), 2021

work page 2021
[24]

Ozdaglar, Tiancheng Yu, and Kaiqing Zhang

Mingyang Liu, Asuman E. Ozdaglar, Tiancheng Yu, and Kaiqing Zhang. The power of regularization in solving extensive-form games. InICLR. OpenReview.net, 2023

work page 2023
[25]

Lanier, Roy Fox, and Pierre Baldi

Stephen McAleer, John B. Lanier, Roy Fox, and Pierre Baldi. Pipeline PSRO: A scalable approach for finding approximate nash equilibria in large games. InNeurIPS, 2020

work page 2020
[26]

ESCHER: eschewing impor- tance sampling in games by computing a history value function to estimate regret

Stephen Marcus McAleer, Gabriele Farina, Marc Lanctot, and Tuomas Sandholm. ESCHER: eschewing impor- tance sampling in games by computing a history value function to estimate regret. InICLR. OpenReview.net, 2023

work page 2023
[27]

Wang, Pierre Baldi, Tuomas Sandholm, and Roy Fox

Stephen Marcus McAleer, JB Lanier, Kevin A. Wang, Pierre Baldi, Tuomas Sandholm, and Roy Fox. Toward optimal policy population growth in two-player zero-sum games. InICLR. OpenReview.net, 2024

work page 2024
[28]

McKelvey and Thomas R

Richard D. McKelvey and Thomas R. Palfrey. Quantal response equilibria for normal form games.Games and Economic Behavior, 10(1):6–38, 1995. ISSN 0899-8256. doi: https://doi.org/10.1006/game.1995.1023. URL https://www.sciencedirect.com/science/article/pii/S0899825685710238

work page doi:10.1006/game.1995.1023 1995
[29]

Ortega, Neil Burch, Thomas W

Julien P ´erolat, R ´emi Munos, Jean-Baptiste Lespiau, Shayegan Omidshafiei, Mark Rowland, Pedro A. Ortega, Neil Burch, Thomas W. Anthony, David Balduzzi, Bart De Vylder, Georgios Piliouras, Marc Lanctot, and Karl Tuyls. From poincar ´e recurrence to convergence in imperfect information games: Finding equilibrium via reg- ularization. InICML, volume 139 o...

work page 2021
[30]

Reevaluating policy gradient methods for imperfect-information games.arXiv preprint arXiv:2502.08938, 2025

Max Rudolph, Nathan Lichtle, Sobhan Mohammadpour, Alexandre Bayen, J. Zico Kolter, Amy Zhang, Gabriele Farina, Eugene Vinitsky, and Samuel Sokota. Reevaluating policy gradient methods for imperfect-information games, 2025. URLhttps://arxiv.org/abs/2502.08938

work page arXiv 2025
[31]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.CoRR, abs/1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[32]

If multi-agent learning is the answer, what is the question? Artificial intelligence, 171(7):365–377, 2007

Yoav Shoham, Rob Powers, and Trond Grenager. If multi-agent learning is the answer, what is the question? Artificial intelligence, 171(7):365–377, 2007

work page 2007
[33]

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Masterin...

work page 2016
[34]

SIMA Team, Maria Abi Raad, Arun Ahuja, Catarina Barros, Frederic Besse, Andrew Bolt, Adrian Bolton, Bethanie Brownfield, Gavin Buttimore, Max Cant, et al

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanc- tot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play.Science, 362 (6419):1140–1144, 2018. doi: 10.1...

work page doi:10.1126/science.aar6404 2018
[35]

Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer

Samuel Sokota, Ryan D’Orazio, J. Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. InICLR. OpenReview.net, 2023

work page 2023
[36]

DREAM: deep regret minimization with advantage baselines and model-free learning.CoRR, abs/2006.10410, 2020

Eric Steinberger, Adam Lerer, and Noam Brown. DREAM: deep regret minimization with advantage baselines and model-free learning.CoRR, abs/2006.10410, 2020

work page arXiv 2006
[37]

Grandmaster level in starcraft ii using multi-agent reinforcement learning.nature, 575(7782):350–354, 2019

Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Micha ¨el Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning.nature, 575(7782):350–354, 2019

work page 2019
[38]

Bowling, and Carmelo Piccione

Martin Zinkevich, Michael Johanson, Michael H. Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. InNIPS, pages 1729–1736. Curran Associates, Inc., 2007. 11 Nash Policy GradientA PREPRINT A Theoretical Analysis In this appendix, we provide detailed proofs for the main theoretical results presented in Section 4.1 and S...

work page 2007

[1] [1]

Oliehoek

Ariyan Bighashdel, Yongzhao Wang, Stephen McAleer, Rahul Savani, and Frans A. Oliehoek. Policy space response oracles: A survey. InIJCAI, pages 7951–7961. ijcai.org, 2024

work page 2024

[2] [2]

JAX: composable transfor- mations of Python+NumPy programs, 2018

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transfor- mations of Python+NumPy programs, 2018. URLhttp://github.com/jax-ml/jax

work page 2018

[3] [3]

Iterative solution of games by fictitious play.Act

George W Brown. Iterative solution of games by fictitious play.Act. Anal. Prod Allocation, 13(1):374, 1951

work page 1951

[4] [4]

Superhuman ai for heads-up no-limit poker: Libratus beats top profession- als.Science, 359(6374):418–424, 2018

Noam Brown and Tuomas Sandholm. Superhuman ai for heads-up no-limit poker: Libratus beats top profession- als.Science, 359(6374):418–424, 2018. doi: 10.1126/science.aao1733. URLhttps://www.science.org/ doi/abs/10.1126/science.aao1733

work page doi:10.1126/science.aao1733 2018

[5] [5]

Superhuman ai for multiplayer poker.Science, 365(6456):885–890, 2019

Noam Brown and Tuomas Sandholm. Superhuman ai for multiplayer poker.Science, 365(6456):885–890, 2019. doi: 10.1126/science.aay2400. URLhttps://www.science.org/doi/abs/10.1126/science.aay2400

work page doi:10.1126/science.aay2400 2019

[6] [6]

Deep counterfactual regret minimization

Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. Deep counterfactual regret minimization. In ICML, volume 97 ofProceedings of Machine Learning Research, pages 793–802. PMLR, 2019

work page 2019

[7] [7]

A comprehensive survey of multiagent reinforcement learning.IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2): 156–172, 2008

Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning.IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2): 156–172, 2008

work page 2008

[8] [8]

Fast policy extragradient methods for competitive games with entropy regularization

Shicong Cen, Yuting Wei, and Yuejie Chi. Fast policy extragradient methods for competitive games with entropy regularization. InNeurIPS, pages 27952–27964, 2021

work page 2021

[9] [9]

Last-iterate convergence: Zero-sum games and constrained min- max optimization

Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min- max optimization. InITCS, volume 124 ofLIPIcs, pages 27:1–27:18. Schloss Dagstuhl - Leibniz-Zentrum f ¨ur Informatik, 2019

work page 2019

[10] [10]

Training gans with optimism

Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training gans with optimism. In ICLR (Poster). OpenReview.net, 2018

work page 2018

[11] [11]

Springer, 2003

Francisco Facchinei and Jong-Shi Pang.Finite-dimensional variational inequalities and complementarity prob- lems. Springer, 2003

work page 2003

[12] [12]

Solving for best responses in extensive-form games using reinforcement learning methods

Amy Greenwald, Jiacui Li, Eric Sodomka, and Michael Littman. Solving for best responses in extensive-form games using reinforcement learning methods. InProceedings of the conference on reinforcement learning and decision making, pages 116–120. Citeseer, 2013

work page 2013

[13] [13]

Deep Reinforcement Learning from Self-Play in Imperfect-Information Games

Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games.CoRR, abs/1603.01121, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[14] [14]

Learning in perturbed asymmetric games.Games Econ

Josef Hofbauer and Ed Hopkins. Learning in perturbed asymmetric games.Games Econ. Behav., 52(1):133–152, 2005

work page 2005

[15] [15]

Adaptive learning in continuous games: Optimal regret bounds and convergence to nash equilibrium

Yu-Guan Hsieh, Kimon Antonakopoulos, and Panayotis Mertikopoulos. Adaptive learning in continuous games: Optimal regret bounds and convergence to nash equilibrium. InCOLT, volume 134 ofProceedings of Machine Learning Research, pages 2388–2422. PMLR, 2021. 9 Nash Policy GradientA PREPRINT

work page 2021

[16] [16]

Extensive games and the problem of information.Contributions to the Theory of Games, 2(28): 193–216, 1953

Harold W Kuhn. Extensive games and the problem of information.Contributions to the Theory of Games, 2(28): 193–216, 1953

work page 1953

[17] [17]

A unified game-theoretic approach to multiagent reinforcement learning

Marc Lanctot, Vin ´ıcius Flores Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien P ´erolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In NIPS, pages 4190–4203, 2017

work page 2017

[18] [19]

URLhttp://arxiv.org/abs/1908.09453

work page arXiv 1908

[19] [20]

A class of gap functions for variational inequalities.Mathematical Programming, 64(1):53–79, 1994

Torbj ¨orn Larsson and Michael Patriksson. A class of gap functions for variational inequalities.Mathematical Programming, 64(1):53–79, 1994

work page 1994

[20] [21]

Last-iterate convergence in extensive-form games

Chung-Wei Lee, Christian Kroer, and Haipeng Luo. Last-iterate convergence in extensive-form games. In NeurIPS, pages 14293–14305, 2021

work page 2021

[21] [22]

Sample complexity of asynchronous q-learning: Sharper analysis and variance reduction

Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous q-learning: Sharper analysis and variance reduction. InNeurIPS, 2020

work page 2020

[22] [23]

A survey of nash equilibrium strategy solving based on cfr.Archives of Computational Methods in Engineering, 28(4), 2021

Huale Li, Xuan Wang, Fengwei Jia, Yifan Li, and Qian Chen. A survey of nash equilibrium strategy solving based on cfr.Archives of Computational Methods in Engineering, 28(4), 2021

work page 2021

[23] [24]

Ozdaglar, Tiancheng Yu, and Kaiqing Zhang

Mingyang Liu, Asuman E. Ozdaglar, Tiancheng Yu, and Kaiqing Zhang. The power of regularization in solving extensive-form games. InICLR. OpenReview.net, 2023

work page 2023

[24] [25]

Lanier, Roy Fox, and Pierre Baldi

Stephen McAleer, John B. Lanier, Roy Fox, and Pierre Baldi. Pipeline PSRO: A scalable approach for finding approximate nash equilibria in large games. InNeurIPS, 2020

work page 2020

[25] [26]

ESCHER: eschewing impor- tance sampling in games by computing a history value function to estimate regret

Stephen Marcus McAleer, Gabriele Farina, Marc Lanctot, and Tuomas Sandholm. ESCHER: eschewing impor- tance sampling in games by computing a history value function to estimate regret. InICLR. OpenReview.net, 2023

work page 2023

[26] [27]

Wang, Pierre Baldi, Tuomas Sandholm, and Roy Fox

Stephen Marcus McAleer, JB Lanier, Kevin A. Wang, Pierre Baldi, Tuomas Sandholm, and Roy Fox. Toward optimal policy population growth in two-player zero-sum games. InICLR. OpenReview.net, 2024

work page 2024

[27] [28]

McKelvey and Thomas R

Richard D. McKelvey and Thomas R. Palfrey. Quantal response equilibria for normal form games.Games and Economic Behavior, 10(1):6–38, 1995. ISSN 0899-8256. doi: https://doi.org/10.1006/game.1995.1023. URL https://www.sciencedirect.com/science/article/pii/S0899825685710238

work page doi:10.1006/game.1995.1023 1995

[28] [29]

Ortega, Neil Burch, Thomas W

Julien P ´erolat, R ´emi Munos, Jean-Baptiste Lespiau, Shayegan Omidshafiei, Mark Rowland, Pedro A. Ortega, Neil Burch, Thomas W. Anthony, David Balduzzi, Bart De Vylder, Georgios Piliouras, Marc Lanctot, and Karl Tuyls. From poincar ´e recurrence to convergence in imperfect information games: Finding equilibrium via reg- ularization. InICML, volume 139 o...

work page 2021

[29] [30]

Reevaluating policy gradient methods for imperfect-information games.arXiv preprint arXiv:2502.08938, 2025

Max Rudolph, Nathan Lichtle, Sobhan Mohammadpour, Alexandre Bayen, J. Zico Kolter, Amy Zhang, Gabriele Farina, Eugene Vinitsky, and Samuel Sokota. Reevaluating policy gradient methods for imperfect-information games, 2025. URLhttps://arxiv.org/abs/2502.08938

work page arXiv 2025

[30] [31]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.CoRR, abs/1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[31] [32]

If multi-agent learning is the answer, what is the question? Artificial intelligence, 171(7):365–377, 2007

Yoav Shoham, Rob Powers, and Trond Grenager. If multi-agent learning is the answer, what is the question? Artificial intelligence, 171(7):365–377, 2007

work page 2007

[32] [33]

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Masterin...

work page 2016

[33] [34]

SIMA Team, Maria Abi Raad, Arun Ahuja, Catarina Barros, Frederic Besse, Andrew Bolt, Adrian Bolton, Bethanie Brownfield, Gavin Buttimore, Max Cant, et al

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanc- tot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play.Science, 362 (6419):1140–1144, 2018. doi: 10.1...

work page doi:10.1126/science.aar6404 2018

[34] [35]

Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer

Samuel Sokota, Ryan D’Orazio, J. Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. InICLR. OpenReview.net, 2023

work page 2023

[35] [36]

DREAM: deep regret minimization with advantage baselines and model-free learning.CoRR, abs/2006.10410, 2020

Eric Steinberger, Adam Lerer, and Noam Brown. DREAM: deep regret minimization with advantage baselines and model-free learning.CoRR, abs/2006.10410, 2020

work page arXiv 2006

[36] [37]

Grandmaster level in starcraft ii using multi-agent reinforcement learning.nature, 575(7782):350–354, 2019

Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Micha ¨el Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning.nature, 575(7782):350–354, 2019

work page 2019

[37] [38]

Bowling, and Carmelo Piccione

Martin Zinkevich, Michael Johanson, Michael H. Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. InNIPS, pages 1729–1736. Curran Associates, Inc., 2007. 11 Nash Policy GradientA PREPRINT A Theoretical Analysis In this appendix, we provide detailed proofs for the main theoretical results presented in Section 4.1 and S...

work page 2007