Reevaluating Policy Gradient Methods for Imperfect-Information Games

Alexandre Bayen; Amy Zhang; Eugene Vinitsky; Gabriele Farina; J. Zico Kolter; Max Rudolph; Nathan Lichtle; Samuel Sokota; Sobhan Mohammadpour

arxiv: 2502.08938 · v4 · pith:57EGOQ3Xnew · submitted 2025-02-13 · 💻 cs.LG





keywords games, gradient, imperfect-information, methods, policy, algorithms, approaches, cfr-based


In the past decade, motivated by the putative failure of naive self-play deep reinforcement learning (DRL) in adversarial imperfect-information games, researchers have developed numerous DRL algorithms based on fictitious play (FP), double oracle (DO), and counterfactual regret minimization (CFR). In light of recent results of the magnetic mirror descent algorithm, we hypothesize that simpler generic policy gradient methods like PPO are competitive with or superior to these FP-, DO-, and CFR-based DRL approaches. To facilitate the resolution of this hypothesis, we implement and release the first broadly accessible exact exploitability computations for five large games. Using these games, we conduct the largest-ever exploitability comparison of DRL algorithms for imperfect-information games. Over 7000 training runs, we find that FP-, DO-, and CFR-based approaches fail to outperform generic policy gradient methods. Code is available at https://github.com/nathanlct/IIG-RL-Benchmark and https://github.com/gabrfarina/exp-a-spiel.
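The abstract's central metric, exploitability, may be easier to picture on a toy example. The sketch below is an illustration only and is not taken from the released code: it assumes a two-player zero-sum matrix game (rock-paper-scissors) rather than the large extensive-form games studied in the paper, and it assumes the common convention of reporting NashConv divided by the number of players. The function name `exploitability` and the constants `RPS`, `UNIFORM`, and `ALWAYS_ROCK` are hypothetical.

```python
from typing import Sequence


def exploitability(payoff: Sequence[Sequence[float]],
                   row_strategy: Sequence[float],
                   col_strategy: Sequence[float]) -> float:
    """Average best-response gain in a two-player zero-sum matrix game.

    payoff[i][j] is the row player's payoff when row plays i and column
    plays j; the column player receives the negative of that value.
    Assumed convention: exploitability = NashConv / 2.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])

    # Expected payoff of each pure row action against the column mix.
    row_values = [
        sum(payoff[i][j] * col_strategy[j] for j in range(n_cols))
        for i in range(n_rows)
    ]
    # Expected payoff (to the column player) of each pure column action
    # against the row mix.
    col_values = [
        -sum(row_strategy[i] * payoff[i][j] for i in range(n_rows))
        for j in range(n_cols)
    ]

    # In a zero-sum game the players' current values cancel, so NashConv
    # reduces to the sum of the two best-response values.
    nash_conv = max(row_values) + max(col_values)
    return nash_conv / 2.0


# Rock-paper-scissors, actions ordered (rock, paper, scissors).
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]
UNIFORM = [1 / 3, 1 / 3, 1 / 3]
ALWAYS_ROCK = [1.0, 0.0, 0.0]

if __name__ == "__main__":
    print(exploitability(RPS, UNIFORM, UNIFORM))          # equilibrium profile
    print(exploitability(RPS, ALWAYS_ROCK, ALWAYS_ROCK))  # exploitable profile
```

Under these assumptions, the uniform profile is a Nash equilibrium and scores zero, while a profile in which both players always play rock scores one, because each player could gain a full point by switching to paper. Computing the same quantity exactly in large imperfect-information games requires a best-response traversal of the game tree, which is what the released implementations are described as providing.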



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

GAE suffers from amplified variance in imperfect-info self-play RL; VRPO with Q-boosting and multi-step Expected SARSA(λ) reduces it and improves performance on mid-to-large games.

NashPG: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria
cs.LG 2025-10 unverdicted novelty 6.0

NashPG is a policy-gradient method with iteratively refined regularization that guarantees monotonic convergence to Nash equilibria in two-player zero-sum extensive-form games and scales to large benchmarks.