pith. machine review for the scientific record.

arXiv: 1603.01121 · v2 · submitted 2016-03-03 · 💻 cs.LG · cs.AI · cs.GT

Recognition: unknown

Deep Reinforcement Learning from Self-Play in Imperfect-Information Games

classification 💻 cs.LG · cs.AI · cs.GT
keywords learning, domain, Nash, reinforcement, self-play, approached, deep, equilibria

Many real-world applications can be described as large-scale games of imperfect information. To deal with these challenging domains, prior work has focused on computing Nash equilibria in a handcrafted abstraction of the domain. In this paper we introduce the first scalable end-to-end approach to learning approximate Nash equilibria without prior domain knowledge. Our method combines fictitious self-play with deep reinforcement learning. When applied to Leduc poker, Neural Fictitious Self-Play (NFSP) approached a Nash equilibrium, whereas common reinforcement learning methods diverged. In Limit Texas Holdem, a poker game of real-world scale, NFSP learnt a strategy that approached the performance of state-of-the-art, superhuman algorithms based on significant domain expertise.
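As a rough illustration of the fictitious-play idea the abstract builds on, the sketch below runs classical tabular fictitious play on a small two-player zero-sum matrix game. It is not NFSP: the exact best responses and exact action counts used here stand in for the learned components that the abstract's combination with deep reinforcement learning implies. The function name `fictitious_play`, the payoff-matrix representation, and the simultaneous-update scheme are assumptions made for this example only.

```python
def fictitious_play(payoff, iterations=10000):
    """Classical fictitious play on a two-player zero-sum matrix game.

    payoff[i][j] is the row player's payoff when row plays i and column
    plays j; the column player receives -payoff[i][j].
    Returns the empirical (average) strategies of both players.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])

    # Action counts define each player's empirical average strategy.
    # Both players start with one arbitrary play of action 0.
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    row_counts[0] += 1
    col_counts[0] += 1

    for _ in range(iterations):
        # Value of each row action against the column player's history.
        row_values = [
            sum(payoff[i][j] * col_counts[j] for j in range(n_cols))
            for i in range(n_rows)
        ]
        # Value of each column action against the row player's history
        # (negated because the game is zero-sum).
        col_values = [
            -sum(payoff[i][j] * row_counts[i] for i in range(n_rows))
            for j in range(n_cols)
        ]

        # Each player best-responds to the opponent's average strategy;
        # ties are broken in favour of the lowest action index.
        best_row = max(range(n_rows), key=row_values.__getitem__)
        best_col = max(range(n_cols), key=col_values.__getitem__)

        row_counts[best_row] += 1
        col_counts[best_col] += 1

    total = iterations + 1
    return (
        [count / total for count in row_counts],
        [count / total for count in col_counts],
    )


if __name__ == "__main__":
    # Matching pennies: the unique Nash equilibrium mixes 50/50.
    matching_pennies = [[1, -1], [-1, 1]]
    row_avg, col_avg = fictitious_play(matching_pennies)
    print("row average strategy:", row_avg)
    print("column average strategy:", col_avg)
```

In matching pennies the individual best responses keep cycling, but the average strategies drift toward the uniform mix, which is the sense in which a fictitious-play learner "approaches" a Nash equilibrium.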

This paper has not been read by Pith yet.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO

    cs.LG · 2026-04 · conditional · novelty 7.0

    PPO in a new competitive game fails due to five implementation bugs and then competitive overfitting where self-play stays near 50% but generalization drops to 21.6%; mixing 20% random opponents restores generalizatio...

  2. Dota 2 with Large Scale Deep Reinforcement Learning

    cs.LG · 2019-12 · accept · novelty 7.0

    OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.

  3. Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

    cs.CL · 2026-04 · unverdicted · novelty 6.0

    Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.

  4. StratFormer: Adaptive Opponent Modeling and Exploitation in Imperfect-Information Games

    cs.AI · 2026-04 · unverdicted · novelty 5.0

    StratFormer uses a two-phase curriculum with dual-turn tokens and bucket-rate features to model and exploit opponents in Leduc Hold'em, gaining +0.106 BB/hand on average over GTO while keeping near-equilibrium safety.