Outbidding and Outbluffing Elite Humans: Mastering Liar's Poker via Self-Play and Reinforcement Learning
Pith reviewed 2026-05-18 00:55 UTC · model grok-4.3
The pith
Solly, trained via self-play with a model-free actor-critic deep reinforcement learning algorithm, reaches elite human performance in reduced-format Liar's Poker by winning over 50 percent of hands and posting positive equity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Solly, trained using self-play with a model-free, actor-critic, deep reinforcement learning algorithm, achieved elite human play in reduced-format Liar's Poker as measured by winning over 50% of hands and positive equity in heads-up and multi-player settings, while outperforming LLMs and resisting exploitation by world-class humans.
What carries the argument
The self-play loop using a model-free actor-critic deep reinforcement learning algorithm that learns bidding and bluffing policies directly from game outcomes.
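As a rough illustration of what a model-free actor-critic update looks like, the sketch below trains a softmax policy (the actor) and a scalar value estimate (the critic) on a hypothetical single-state, two-action game. It is not Solly's algorithm: the toy payoff function, the learning rates, and the names `train_actor_critic` and `toy_payoff` are assumptions made for illustration, and there is no self-play opponent or neural network. The only point is that both updates use sampled outcomes and never a model of the game.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_payoff(action, rng):
    """Hypothetical game: action 1 wins 70% of the time, action 0 wins 30%."""
    win_prob = 0.7 if action == 1 else 0.3
    return 1.0 if rng.random() < win_prob else 0.0


def train_actor_critic(reward_fn, n_actions, episodes=5000,
                       lr_actor=0.1, lr_critic=0.1, seed=0):
    """One-step actor-critic on a single-state problem (illustrative only)."""
    rng = random.Random(seed)
    logits = [0.0] * n_actions  # actor parameters
    value = 0.0                 # critic: estimated expected payoff
    for _ in range(episodes):
        probs = softmax(logits)
        action = rng.choices(range(n_actions), weights=probs)[0]
        reward = reward_fn(action, rng)
        # Advantage: how much better the outcome was than the critic expected.
        advantage = reward - value
        # Critic update: move the estimate toward the observed payoff.
        value += lr_critic * advantage
        # Actor update: gradient of log pi(action) w.r.t. the logits
        # is one_hot(action) - probs.
        for a in range(n_actions):
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr_actor * advantage * grad
    return softmax(logits), value


if __name__ == "__main__":
    policy, value = train_actor_critic(toy_payoff, n_actions=2)
    print("policy:", policy, "value estimate:", value)
```

In a self-play setting, the payoff would instead come from playing the current policy against copies of itself, which this sketch deliberately leaves out.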
If this is right
- Solly develops novel bidding strategies and effective randomization that top humans cannot easily exploit.
- The same training method yields better win rates and equity than large language models on the same Liar's Poker tasks.
- Self-play alone suffices to reach human-level results in this imperfect-information multi-player setting without additional search or human demonstrations.
- Agents trained this way maintain positive equity across both heads-up and multi-player formats.
Where Pith is reading between the lines
- The same self-play approach could transfer to other games that reward sustained multi-party deception rather than quick reduction to two players.
- If the reduced format preserves the core challenge, the identical algorithm might reach comparable performance in the full-scale version of the game.
- Success without explicit rules for bluff frequency suggests the method discovers equilibrium mixing strategies automatically.
Load-bearing premise
That a win rate above 50 percent and positive equity in the reduced-format game are enough to establish elite human play and resistance to exploitation.
What would settle it
A large sample of hands in the same reduced format where world-class human players achieve a win rate above 50 percent against Solly.
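How large "a large sample" must be depends on how big an edge the humans are assumed to have. The sketch below uses the standard normal-approximation sample-size formula for a one-sided test of a proportion. It assumes heads-up play with a 50 percent baseline, independent hands, and a hypothetical true human win rate `p_alt`; none of these figures come from the paper, and `hands_needed` is an illustrative name.

```python
import math
from statistics import NormalDist


def hands_needed(p_null, p_alt, alpha=0.05, power=0.8):
    """Approximate number of hands needed to detect a win rate of p_alt
    against a baseline of p_null with a one-sided test.

    Assumes independent hands and the normal approximation to the binomial.
    """
    if not (0.0 < p_null < p_alt < 1.0):
        raise ValueError("require 0 < p_null < p_alt < 1")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)         # quantile for desired power
    numerator = (z_alpha * math.sqrt(p_null * (1.0 - p_null))
                 + z_beta * math.sqrt(p_alt * (1.0 - p_alt)))
    return math.ceil((numerator / (p_alt - p_null)) ** 2)


if __name__ == "__main__":
    # Hypothetical human edges over Solly; smaller edges need more hands.
    for p_alt in (0.52, 0.55, 0.60):
        print(p_alt, hands_needed(0.5, p_alt))
```

Consecutive hands against an adaptive opponent may not be independent, so the result is best read as a lower bound on the sample needed.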
Original abstract
AI researchers have long focused on poker-like games as a testbed for environments characterized by multi-player dynamics, imperfect information, and reasoning under uncertainty. While recent breakthroughs have matched elite human play at no-limit Texas hold'em, the multi-player dynamics are subdued: most hands converge quickly with only two players engaged through multiple rounds of bidding. In this paper, we present Solly, the first AI agent to achieve elite human play in reduced-format Liar's Poker, a game characterized by extensive multi-player engagement. We trained Solly using self-play with a model-free, actor-critic, deep reinforcement learning algorithm. Solly played at an elite human level as measured by win rate (won over 50% of hands) and equity (money won) in heads-up and multi-player Liar's Poker. Solly also outperformed large language models (LLMs), including those with reasoning abilities, on the same metrics. Solly developed novel bidding strategies, randomized play effectively, and was not easily exploitable by world-class human players.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Solly, an AI agent trained via self-play using a model-free actor-critic deep reinforcement learning algorithm on a reduced-format version of Liar's Poker. It claims that Solly reaches elite human performance as measured by winning over 50% of hands and achieving positive equity in both heads-up and multi-player settings, while also outperforming LLMs (including reasoning-enabled ones), developing novel bidding strategies, randomizing play effectively, and resisting exploitation by world-class human players.
Significance. If the empirical results are robustly validated, the work would extend self-play RL successes from two-player-dominated games like no-limit Texas Hold'em to a setting with more sustained multi-player engagement, imperfect information, and bluffing. Demonstrating that standard model-free actor-critic methods can produce non-exploitable strategies with positive equity against elite humans would be a useful data point for multi-agent RL in extensive-form games.
major comments (3)
- [Results] Results section (and any associated tables/figures reporting win rates and equity): the central performance claims of >50% win rate and positive equity are stated without the number of hands evaluated, opponent skill distribution, statistical tests, confidence intervals, or error bars. This information is load-bearing for the 'elite human play' and 'not easily exploitable' conclusions.
- [§3] §3 (Game Description): the reduced-format Liar's Poker is introduced as preserving multi-player bluffing dynamics, but no ablation, complexity analysis, or comparison to the full game is provided to show that the core imperfect-information and multi-player engagement challenges remain intact; this underpins the claim of advancing beyond Texas Hold'em-style dynamics.
- [Experimental Setup] Experimental Setup / Training section: key hyperparameters (network architecture, learning rates, self-play schedule, number of training iterations) and baseline comparisons are not reported, preventing assessment of whether the reported outcomes are reproducible or attributable to the algorithm rather than implementation details.
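The first major comment asks for confidence intervals on the reported win rates. One common choice is the Wilson score interval for a binomial proportion, sketched below. The hand counts in the usage example are hypothetical and are not taken from the paper; `wilson_interval` is an illustrative name, and the calculation assumes independent hands.

```python
import math
from statistics import NormalDist


def wilson_interval(wins, hands, confidence=0.95):
    """Wilson score interval for a win rate, assuming independent hands."""
    if hands <= 0 or not (0 <= wins <= hands):
        raise ValueError("require hands > 0 and 0 <= wins <= hands")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = wins / hands
    denom = 1.0 + z * z / hands
    center = (p_hat + z * z / (2.0 * hands)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / hands + z * z / (4.0 * hands * hands)
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical counts: the same 54% / 52% style win rates can be
    # convincing or inconclusive depending on the number of hands played.
    print(wilson_interval(540, 1000))
    print(wilson_interval(52, 100))
```

A win rate counts as credibly above 50 percent only when the lower bound of the interval exceeds 0.5, which is why the number of hands matters as much as the point estimate.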
minor comments (2)
- [Results] Notation for equity and win-rate metrics should be defined explicitly in the first results table or equation to avoid ambiguity when comparing heads-up vs. multi-player settings.
- [Introduction] The abstract and introduction would benefit from a brief sentence clarifying how 'elite human' opponents were recruited and verified (e.g., tournament records or self-reported skill).
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment below and have revised the paper accordingly.
- Referee: [Results] Results section (and any associated tables/figures reporting win rates and equity): the central performance claims of >50% win rate and positive equity are stated without the number of hands evaluated, opponent skill distribution, statistical tests, confidence intervals, or error bars. This information is load-bearing for the 'elite human play' and 'not easily exploitable' conclusions.
  Authors: We fully agree that these details are necessary to support our claims of elite human-level performance. In the revised manuscript, we have updated the Results section to include the total number of hands evaluated in our experiments, a description of the opponent skill distribution (including matches against world-class players), and statistical tests, 95% confidence intervals, and error bars for all win rate and equity figures and tables. These additions provide robust validation for the performance claims.
  Revision: yes
- Referee: [§3] §3 (Game Description): the reduced-format Liar's Poker is introduced as preserving multi-player bluffing dynamics, but no ablation, complexity analysis, or comparison to the full game is provided to show that the core imperfect-information and multi-player engagement challenges remain intact; this underpins the claim of advancing beyond Texas Hold'em-style dynamics.
  Authors: We appreciate this point and have addressed it by adding a new paragraph and analysis in Section 3. This includes a complexity comparison (e.g., information set sizes and branching factors) between the reduced and full formats, as well as an ablation study demonstrating that multi-player bluffing and imperfect information dynamics are preserved in the reduced version. This supports our claim that the game retains the key challenges beyond those in two-player poker variants.
  Revision: yes
- Referee: [Experimental Setup] Experimental Setup / Training section: key hyperparameters (network architecture, learning rates, self-play schedule, number of training iterations) and baseline comparisons are not reported, preventing assessment of whether the reported outcomes are reproducible or attributable to the algorithm rather than implementation details.
  Authors: We agree that reproducibility requires these details. We have significantly expanded the Experimental Setup and Training sections to report the neural network architecture, learning rates, self-play schedule, number of training iterations, and additional baseline comparisons. We have also included pseudocode for the training procedure and made the source code available to ensure the results can be reproduced.
  Revision: yes
Circularity Check
No significant circularity; empirical RL results are self-contained
The paper presents an empirical outcome from standard model-free actor-critic self-play RL training in a reduced Liar's Poker game. Win rate (>50%) and equity metrics are direct measurements from play against humans and LLMs, not quantities defined in terms of fitted parameters, self-referential equations, or predictions that reduce to inputs by construction. No load-bearing derivations, uniqueness theorems, or ansatzes appear in the abstract or described methodology. The central claim rests on training and evaluation results that are externally falsifiable and independent of any internal definitional loop.
Axiom & Free-Parameter Ledger
free parameters (1)
- Actor-critic network architecture and learning hyperparameters
axioms (1)
- domain assumption: Self-play in imperfect-information games produces robust, non-exploitable strategies