A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games

Samuel Sokota, Ryan D’Orazio, J Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, Christian Kroer · 2022 · arXiv 2206.05825

3 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Outbidding and Outbluffing Elite Humans: Mastering Liar's Poker via Self-Play and Reinforcement Learning

cs.AI · 2025-11-05 · unverdicted · novelty 7.0

Solly is the first AI to achieve elite human-level play in reduced-format Liar's Poker via self-play actor-critic reinforcement learning, outperforming both world-class humans and large language models on win rate and equity while developing non-exploitable strategies.

GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

GAE suffers from amplified variance in imperfect-info self-play RL; VRPO with Q-boosting and multi-step Expected SARSA(λ) reduces it and improves performance on mid-to-large games.

Multiplayer Nash Preference Optimization

cs.AI · 2025-09-27 · unverdicted · novelty 6.0

MNPO extends NLHF to multiplayer Nash games, inheriting equilibrium guarantees while showing empirical gains on instruction-following benchmarks under diverse preferences.

citing papers explorer

Showing 3 of 3 citing papers.

Outbidding and Outbluffing Elite Humans: Mastering Liar's Poker via Self-Play and Reinforcement Learning
cs.AI · 2025-11-05 · unverdicted · none · ref 9

GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning
cs.LG · 2026-05-19 · unverdicted · none · ref 21

Multiplayer Nash Preference Optimization
cs.AI · 2025-09-27 · unverdicted · none · ref 24
