Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121

Johannes Heinrich, David Silver · 2016 · cs.LG · arXiv 1603.01121

7 Pith papers cite this work. Polarity classification is still indexing.

abstract

Many real-world applications can be described as large-scale games of imperfect information. To deal with these challenging domains, prior work has focused on computing Nash equilibria in a handcrafted abstraction of the domain. In this paper we introduce the first scalable end-to-end approach to learning approximate Nash equilibria without prior domain knowledge. Our method combines fictitious self-play with deep reinforcement learning. When applied to Leduc poker, Neural Fictitious Self-Play (NFSP) approached a Nash equilibrium, whereas common reinforcement learning methods diverged. In Limit Texas Holdem, a poker game of real-world scale, NFSP learnt a strategy that approached the performance of state-of-the-art, superhuman algorithms based on significant domain expertise.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO

cs.LG · 2026-04-04 · conditional · novelty 7.0

PPO in a new competitive game fails first because of five implementation bugs and then because of competitive overfitting, in which self-play stays near 50% while generalization drops to 21.6%; mixing in 20% random opponents restores generalization to 77.1%.

Dota 2 with Large Scale Deep Reinforcement Learning

cs.LG · 2019-12-13 · accept · novelty 7.0

OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.

Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

cs.RO · 2026-05-21 · conditional · novelty 6.0

Multi-agent RL with league self-play trains quadrotors to exceed champion human performance in multi-player races above 22 m/s while cutting collisions by 50% and generalizing zero-shot to safer human interaction.

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

cs.AI · 2026-05-16 · unverdicted · novelty 6.0

PopuLoRA shows that co-evolving populations of LoRA adapters through cross-evaluated self-play can outperform compute-matched single-agent baselines on multiple code and math reasoning benchmarks.

NashPG: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria

cs.LG · 2025-10-21 · unverdicted · novelty 6.0

NashPG is a policy-gradient method with iteratively refined regularization that guarantees monotonic convergence to Nash equilibria in two-player zero-sum extensive-form games and scales to large benchmarks.

Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

cs.CL · 2026-04-03 · unverdicted · novelty 6.0

Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.

StratFormer: Adaptive Opponent Modeling and Exploitation in Imperfect-Information Games

cs.AI · 2026-04-28 · unverdicted · novelty 5.0

StratFormer uses a two-phase curriculum with dual-turn tokens and bucket-rate features to model and exploit opponents in Leduc Hold'em, gaining +0.106 BB/hand on average over GTO while keeping near-equilibrium safety.

citing papers explorer

Showing 7 of 7 citing papers.

Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO · cs.LG · 2026-04-04 · conditional · none · ref 2
Dota 2 with Large Scale Deep Reinforcement Learning · cs.LG · 2019-12-13 · accept · none · ref 38
Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning · cs.RO · 2026-05-21 · conditional · none · ref 36 · internal anchor
PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play · cs.AI · 2026-05-16 · unverdicted · none · ref 14 · internal anchor
NashPG: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria · cs.LG · 2025-10-21 · unverdicted · none · ref 13 · internal anchor
Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution · cs.CL · 2026-04-03 · unverdicted · none · ref 13
StratFormer: Adaptive Opponent Modeling and Exploitation in Imperfect-Information Games · cs.AI · 2026-04-28 · unverdicted · none · ref 10
