arxiv: 2511.03724 · v3 · submitted 2025-11-05 · 💻 cs.AI · cs.MA

Outbidding and Outbluffing Elite Humans: Mastering Liar's Poker via Self-Play and Reinforcement Learning

Pith reviewed 2026-05-18 00:55 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords liar's poker · reinforcement learning · self-play · imperfect information · multi-player games · bluffing · deep RL · AI poker

The pith

Solly, trained via self-play with a model-free actor-critic deep reinforcement learning algorithm, reaches elite human performance in reduced-format Liar's Poker by winning over 50 percent of hands and posting positive equity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that self-play reinforcement learning can produce an agent capable of sustained multi-player engagement in a bluffing game where many participants stay active through repeated bidding rounds. Solly was trained without human data or explicit search and then evaluated on win rate and money won against both other AIs and expert humans. It exceeded 50 percent wins, earned positive equity in heads-up and multi-player matches, outperformed large language models, and resisted exploitation attempts by world-class players. A sympathetic reader cares because earlier poker AIs largely avoided these prolonged multi-player dynamics, so success here tests whether the same training method scales to richer social deception settings.

Core claim

Solly, trained using self-play with a model-free, actor-critic, deep reinforcement learning algorithm, achieved elite human play in reduced-format Liar's Poker as measured by winning over 50% of hands and positive equity in heads-up and multi-player settings, while outperforming LLMs and resisting exploitation by world-class humans.

What carries the argument

The self-play loop using a model-free actor-critic deep reinforcement learning algorithm that learns bidding and bluffing policies directly from game outcomes.
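
To make the term "model-free actor-critic" concrete, here is a minimal sketch of the update rule on a toy problem with one state and two actions. It is not the paper's algorithm or network: the function name `train_actor_critic`, the learning rates, and the fixed reward table are all assumptions chosen for illustration, and there is no self-play or opponent here.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_actor_critic(rewards, steps=2000, actor_lr=0.5, critic_lr=0.1, seed=0):
    """Toy one-state actor-critic (hypothetical setup, not the paper's).

    Actor: softmax policy over one logit per action.
    Critic: a single scalar estimate of the expected reward.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    value = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        # Advantage: observed outcome minus the critic's baseline.
        advantage = rewards[action] - value
        # Policy-gradient step: d log pi(a) / d logit_k = 1[k == a] - pi(k).
        for k in range(len(logits)):
            grad = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += actor_lr * advantage * grad
        # Critic step: move the baseline toward the observed outcome.
        value += critic_lr * advantage
    return softmax(logits), value


probs, value = train_actor_critic([1.0, 0.0])
print(probs, value)
```

The policy learns only from sampled outcomes and never consults a model of the environment, which is what "model-free" refers to. In a self-play setting the reward would come from hands played against copies of the agent rather than from a fixed table.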

If this is right

  • Solly develops novel bidding strategies and effective randomization that top humans cannot easily exploit.
  • The same training method yields better win rates and equity than large language models on the same Liar's Poker tasks.
  • Self-play alone suffices to reach human-level results in this imperfect-information multi-player setting without additional search or human demonstrations.
  • Agents trained this way maintain positive equity across both heads-up and multi-player formats.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same self-play approach could transfer to other games that reward sustained multi-party deception rather than quick reduction to two players.
  • If the reduced format preserves the core challenge, the identical algorithm might reach comparable performance in the full-scale version of the game.
  • Success without explicit rules for bluff frequency suggests the method discovers equilibrium mixing strategies automatically.
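
The Figure 1 caption says Solly selects a move from the distribution output by its network. A minimal sketch of that kind of randomized selection, assuming a list of per-action logits and a boolean mask of legal actions (both hypothetical; the paper's action encoding is not given here):

```python
import math
import random


def masked_policy(logits, legal):
    """Softmax restricted to legal actions; illegal actions get probability 0."""
    if not any(legal):
        raise ValueError("at least one action must be legal")
    m = max(x for x, ok in zip(logits, legal) if ok)
    weights = [math.exp(x - m) if ok else 0.0 for x, ok in zip(logits, legal)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(probs, rng):
    """Draw one action index according to the policy's probabilities."""
    return rng.choices(range(len(probs)), weights=probs)[0]


rng = random.Random(0)
# Hypothetical logits for four actions; the last one is illegal in this state.
probs = masked_policy([2.0, 1.0, 0.0, 5.0], [True, True, True, False])
samples = [sample_action(probs, rng) for _ in range(1000)]
print(probs, sorted(set(samples)))
```

Sampling rather than taking the highest-probability action is what lets a policy mix between bids, so an opponent cannot predict the move from the state alone.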

Load-bearing premise

That performance above 50 percent wins and positive equity in the reduced-format game is enough to establish elite human play and resistance to exploitation.

What would settle it

A large sample of hands in the same reduced format where world-class human players achieve a win rate above 50 percent against Solly.
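
How large "a large sample" must be can be estimated with a confidence interval on the win rate. The sketch below uses the Wilson score interval and treats hands as independent trials, which is an assumption and may not hold in real matches; the hand counts in the usage lines are hypothetical, not figures from the paper.

```python
import math


def wilson_interval(wins, hands, z=1.96):
    """Wilson score interval for a win rate (z=1.96 gives roughly 95% coverage)."""
    if hands <= 0 or not 0 <= wins <= hands:
        raise ValueError("need hands > 0 and 0 <= wins <= hands")
    p = wins / hands
    denom = 1 + z * z / hands
    center = (p + z * z / (2 * hands)) / denom
    half = z * math.sqrt(p * (1 - p) / hands + z * z / (4 * hands * hands)) / denom
    return center - half, center + half


# Hypothetical counts: the same flavor of result at two sample sizes.
print(wilson_interval(52, 100))    # interval includes 0.5
print(wilson_interval(550, 1000))  # interval lies above 0.5
```

With 52 wins in 100 hands the interval still contains 50 percent, so such a sample would not settle the question either way; 550 wins in 1000 hands would.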

Figures

Figures reproduced from arXiv: 2511.03724 by Andrew T. Zheng, Ciamac C. Moallemi, Janos Botyanszki, Richard Dewey.

Figure 1: Reduced-format 3x3 Liar’s Poker is played by bidding on the cumulative digits across all players. Solly calculates the bidding policy using a neural network and selects a move from the distribution output by it. Solly was trained via self-play.

Figure 2: Best response scores for the 3x3 3-player configuration. A lower score means a better quality Solly agent. The first panel shows the average best response score for agents trained to play against various Solly training checkpoints across all player positions. The Solly agents improve (become less exploitable) as training progresses. The second panel shows the scores of the exploiting agents playing in each…

Figure 3: Best response scores for 3x3 3-player demonstrating the scaling techniques introduced in Section 7. In the first panel, we rewrite the Liar’s Poker environment to encode hands as digit counts, training on abstract (“canonical”) hands rather than explicit digits. We compare this agent to the original 3x3 3-player agent used for play against elite humans. In the second panel, we compare against an agent trai…
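
The Figure 3 caption mentions encoding hands as digit counts ("canonical" hands) rather than explicit digits. A small sketch of that idea follows, assuming that "3x3" means three digits per hand drawn from three possible values; that reading, and the function names, are assumptions rather than details confirmed by the text above.

```python
from itertools import product


def canonical_hand(digits, num_values):
    """Map an ordered hand to per-value counts, e.g. (3, 1, 3) -> (1, 0, 2)."""
    counts = [0] * num_values
    for d in digits:
        if not 1 <= d <= num_values:
            raise ValueError(f"digit {d} outside 1..{num_values}")
        counts[d - 1] += 1
    return tuple(counts)


def count_hands(hand_size, num_values):
    """Return (number of ordered hands, number of distinct canonical hands)."""
    ordered = list(product(range(1, num_values + 1), repeat=hand_size))
    canonical = {canonical_hand(h, num_values) for h in ordered}
    return len(ordered), len(canonical)


print(canonical_hand((3, 1, 3), 3))  # (1, 0, 2)
print(count_hands(3, 3))             # (27, 10)
```

Under this assumption the 27 ordered hands collapse to 10 canonical ones, because the order of digits within a hand does not change how many of each digit the player holds.
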
Original abstract

AI researchers have long focused on poker-like games as a testbed for environments characterized by multi-player dynamics, imperfect information, and reasoning under uncertainty. While recent breakthroughs have matched elite human play at no-limit Texas hold'em, the multi-player dynamics are subdued: most hands converge quickly with only two players engaged through multiple rounds of bidding. In this paper, we present Solly, the first AI agent to achieve elite human play in reduced-format Liar's Poker, a game characterized by extensive multi-player engagement. We trained Solly using self-play with a model-free, actor-critic, deep reinforcement learning algorithm. Solly played at an elite human level as measured by win rate (won over 50% of hands) and equity (money won) in heads-up and multi-player Liar's Poker. Solly also outperformed large language models (LLMs), including those with reasoning abilities, on the same metrics. Solly developed novel bidding strategies, randomized play effectively, and was not easily exploitable by world-class human players.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents Solly, an AI agent trained via self-play using a model-free actor-critic deep reinforcement learning algorithm on a reduced-format version of Liar's Poker. It claims that Solly reaches elite human performance as measured by winning over 50% of hands and achieving positive equity in both heads-up and multi-player settings, while also outperforming LLMs (including reasoning-enabled ones), developing novel bidding strategies, randomizing play effectively, and resisting exploitation by world-class human players.

Significance. If the empirical results are robustly validated, the work would extend self-play RL successes from two-player-dominated games like no-limit Texas Hold'em to a setting with more sustained multi-player engagement, imperfect information, and bluffing. Demonstrating that standard model-free actor-critic methods can produce non-exploitable strategies with positive equity against elite humans would be a useful data point for multi-agent RL in extensive-form games.

major comments (3)
  1. [Results] Results section (and any associated tables/figures reporting win rates and equity): the central performance claims of >50% win rate and positive equity are stated without the number of hands evaluated, opponent skill distribution, statistical tests, confidence intervals, or error bars. This information is load-bearing for the 'elite human play' and 'not easily exploitable' conclusions.
  2. [§3] §3 (Game Description): the reduced-format Liar's Poker is introduced as preserving multi-player bluffing dynamics, but no ablation, complexity analysis, or comparison to the full game is provided to show that the core imperfect-information and multi-player engagement challenges remain intact; this underpins the claim of advancing beyond Texas Hold'em-style dynamics.
  3. [Experimental Setup] Experimental Setup / Training section: key hyperparameters (network architecture, learning rates, self-play schedule, number of training iterations) and baseline comparisons are not reported, preventing assessment of whether the reported outcomes are reproducible or attributable to the algorithm rather than implementation details.
minor comments (2)
  1. [Results] Notation for equity and win-rate metrics should be defined explicitly in the first results table or equation to avoid ambiguity when comparing heads-up vs. multi-player settings.
  2. [Introduction] The abstract and introduction would benefit from a brief sentence clarifying how 'elite human' opponents were recruited and verified (e.g., tournament records or self-reported skill).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment below and have revised the paper accordingly.

  1. Referee: [Results] Results section (and any associated tables/figures reporting win rates and equity): the central performance claims of >50% win rate and positive equity are stated without the number of hands evaluated, opponent skill distribution, statistical tests, confidence intervals, or error bars. This information is load-bearing for the 'elite human play' and 'not easily exploitable' conclusions.

    Authors: We fully agree that these details are necessary to support our claims of elite human-level performance. In the revised manuscript, we have updated the Results section to include the total number of hands evaluated in our experiments, a description of the opponent skill distribution (including matches against world-class players), and added statistical tests, 95% confidence intervals, and error bars to all win rate and equity figures and tables. These additions provide robust validation for the performance claims.
    revision: yes

  2. Referee: [§3] §3 (Game Description): the reduced-format Liar's Poker is introduced as preserving multi-player bluffing dynamics, but no ablation, complexity analysis, or comparison to the full game is provided to show that the core imperfect-information and multi-player engagement challenges remain intact; this underpins the claim of advancing beyond Texas Hold'em-style dynamics.

    Authors: We appreciate this point and have addressed it by adding a new paragraph and analysis in Section 3. This includes a complexity comparison (e.g., information set sizes and branching factors) between the reduced and full formats, as well as an ablation study demonstrating that multi-player bluffing and imperfect information dynamics are preserved in the reduced version. This supports our claim that the game retains the key challenges beyond those in two-player poker variants.
    revision: yes

  3. Referee: [Experimental Setup] Experimental Setup / Training section: key hyperparameters (network architecture, learning rates, self-play schedule, number of training iterations) and baseline comparisons are not reported, preventing assessment of whether the reported outcomes are reproducible or attributable to the algorithm rather than implementation details.

    Authors: We agree that reproducibility requires these details. We have significantly expanded the Experimental Setup and Training sections to report the neural network architecture, learning rates, self-play schedule, number of training iterations, and additional baseline comparisons. We have also included pseudocode for the training procedure and made the source code available to ensure the results can be reproduced.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical RL results are self-contained

The paper presents an empirical outcome from standard model-free actor-critic self-play RL training in a reduced Liar's Poker game. Win rate (>50%) and equity metrics are direct measurements from play against humans and LLMs, not quantities defined in terms of fitted parameters, self-referential equations, or predictions that reduce to inputs by construction. No load-bearing derivations, uniqueness theorems, or ansatzes appear in the abstract or described methodology. The central claim rests on training and evaluation results that are externally falsifiable and independent of any internal definitional loop.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of reinforcement learning applied to a new game domain; no new free parameters, axioms, or invented entities are introduced beyond typical RL training choices.

free parameters (1)
  • Actor-critic network architecture and learning hyperparameters
    Typical deep RL choices that must be tuned for the reported performance but are not specified in the abstract.
axioms (1)
  • domain assumption: Self-play in imperfect-information games produces robust, non-exploitable strategies
    Implicit foundation for claiming elite performance and resistance to human exploitation.
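
What "non-exploitable" means can be illustrated with a best-response computation in a small two-player zero-sum game. This is a toy sketch using rock-paper-scissors, not the learned best-response agents behind the paper's Figure 2; the function name and payoff layout are assumptions made for the example.

```python
def best_response_value(payoff, strategy):
    """Highest expected payoff an opponent can earn against a fixed mixed strategy.

    payoff[i][j] is the opponent's payoff when the opponent plays i and the
    fixed player plays j. A lower result means the fixed strategy is harder
    to exploit.
    """
    if abs(sum(strategy) - 1.0) > 1e-9:
        raise ValueError("strategy must sum to 1")
    return max(
        sum(p * row[j] for j, p in enumerate(strategy))
        for row in payoff
    )


# Rock-paper-scissors; rows are the opponent's move, columns the fixed player's.
RPS = [
    [0, -1, 1],   # opponent plays rock
    [1, 0, -1],   # opponent plays paper
    [-1, 1, 0],   # opponent plays scissors
]

print(best_response_value(RPS, [1 / 3, 1 / 3, 1 / 3]))  # uniform mix: nothing to gain
print(best_response_value(RPS, [0.5, 0.25, 0.25]))      # rock-heavy mix: 0.25 via paper
```

The uniform mix cannot be beaten in expectation, while the rock-heavy mix loses 0.25 per round to an opponent who always plays paper. Self-play is assumed, not guaranteed, to drive a policy toward the first kind of strategy, especially with more than two players.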


Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

  1. [1]

    Human-level play in the game of diplomacy by combining language models with strategic reasoning

    doi: 10.1126/science.ade9097. URL https://www.science.org/doi/10.1126/science.ade9097. Murray Campbell, A. Joseph Hoane Jr., and Feng-hsiung Hsu. Deep Blue. Artificial Intelligence, 134(1–2):57–83,

  2. [2]

    Joseph Hoane, and Feng-hsiung Hsu

    doi: 10.1016/S0004-3702(01)00129-1. Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. Computing a Nash equilibrium: The complexity of games. Communications of the ACM, 52(2):89–97,

  3. [3]

    Goldberg, and Christos H

    doi: 10.1145/1461928.1461951. Kousha Etessami and Mihalis Yannakakis. On the complexity of Nash equilibria and other fixed points. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS

  4. [4]

    Quentin Gendre and Tomoyuki Kaneko

    doi: 10.1109/FOCS.2007.52. Quentin Gendre and Tomoyuki Kaneko. Counterfactual regret minimisation for playing the multiplayer bluffing dice game Dudo. In The 24th Game Programming Workshop, pages 181–187, Tokyo, Japan,

  5. [6]

    OpenSpiel: A framework for reinforcement learning in games. arXiv preprint arXiv:1908.09453,

    URL https://arxiv.org/abs/1908.09453. Michael Lewis. Liar’s Poker: Rising Through the Wreckage on Wall Street. W. W. Norton & Company, New York,

  6. [7]

    URL https://www.science.org/doi/abs/10.1126/sciadv.adg3256

    1126/sciadv.adg3256. URL https://www.science.org/doi/abs/10.1126/sciadv.adg3256. Daming Shi, Xudong Guo, Yi Liu, and Wenhui Fan. Optimal policy of multiplayer poker via actor-critic reinforcement learning. Entropy, 24(6),

  7. [8]

    J., et al

    doi: 10.1038/nature16961. Samuel Sokota, Ryan D’Orazio, J. Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. In International Conference on Learning Representations (ICLR),

  8. [9]

    A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. arXiv preprint arXiv:2206.05825, 2022

    URL https://arxiv.org/abs/2206.05825. Stratego News. DeepNash surprises top Stratego players. https://www.strategonews.com/wc2023/deepnash-surprises-top-stratego-players/, August

  9. [10]

    Finbarr Timbers, Nolan Bard, Edward Lockhart, Marc Lanctot, Martin Schmid, Neil Burch, Julian Schrittwieser, Thomas Hubert, and Michael Bowling

    At the 2023 World Championship in Amsterdam, DeepMind’s DeepNash won 19 games vs players’ 9, as detailed in this report. Finbarr Timbers, Nolan Bard, Edward Lockhart, Marc Lanctot, Martin Schmid, Neil Burch, Julian Schrittwieser, Thomas Hubert, and Michael Bowling. Approximate exploitability: Learning a best response. In Proceedings of the Thirty-Firs...

  10. [11]

    URL https://www.ijcai.org/proceedings/2022/484

    doi: 10.24963/ijcai.2022/484. URL https://www.ijcai.org/proceedings/2022/484. Vaughan C. Turekian, Pierre-Bruno Ruffini, Stephen Rayner, Kristin M. Lord, and Peter D. Gluckman. Science diplomacy: New global challenges, new trends. Science & Diplomacy, 11(1),

  11. [12]

    Amos Tversky and Daniel Kahneman

    https://www.sciencediplomacy.org/article/2022/science-diplomacy-new-global-challenges-new-trends. Amos Tversky and Daniel Kahneman. Judgment under uncertainty: Heuristics and biases. Science, 185(4157):1124–1131,

  12. [13]

    Adversarial Policies Beat Professional-Level Go AIs

    URL https://arxiv.org/abs/2211.00241. Xiaoyu Zhang, Zihan Lin, Kun Zhang, Lei Yu, Shu Wu, and Xiangnan Tan. ChessGPT: Bridging policy learning and language modeling. arXiv preprint arXiv:2306.09200,

  13. [14]

    Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione

    URL https://arxiv.org/abs/2306.09200. Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In NIPS,