Outbidding and Outbluffing Elite Humans: Mastering Liar's Poker via Self-Play and Reinforcement Learning

Andrew T. Zheng; Ciamac C. Moallemi; Janos Botyanszki; Richard Dewey

arxiv: 2511.03724 · v3 · submitted 2025-11-05 · 💻 cs.AI · cs.MA

Outbidding and Outbluffing Elite Humans: Mastering Liar's Poker via Self-Play and Reinforcement Learning

Richard Dewey , Janos Botyanszki , Ciamac C. Moallemi , Andrew T. Zheng This is my paper

Pith reviewed 2026-05-18 00:55 UTC · model grok-4.3

classification 💻 cs.AI cs.MA

keywords liar's pokerreinforcement learningself-playimperfect informationmulti-player gamesbluffingdeep RLAI poker

0 comments

The pith

Solly, trained via self-play with a model-free actor-critic deep reinforcement learning algorithm, reaches elite human performance in reduced-format Liar's Poker by winning over 50 percent of hands and posting positive equity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that self-play reinforcement learning can produce an agent capable of sustained multi-player engagement in a bluffing game where many participants stay active through repeated bidding rounds. Solly was trained without human data or explicit search and then evaluated on win rate and money won against both other AIs and expert humans. It exceeded 50 percent wins, earned positive equity in heads-up and multi-player matches, outperformed large language models, and resisted exploitation attempts by world-class players. A sympathetic reader cares because earlier poker AIs largely avoided these prolonged multi-player dynamics, so success here tests whether the same training method scales to richer social deception settings.

Core claim

Solly, trained using self-play with a model-free, actor-critic, deep reinforcement learning algorithm, achieved elite human play in reduced-format Liar's Poker as measured by winning over 50% of hands and positive equity in heads-up and multi-player settings, while outperforming LLMs and resisting exploitation by world-class humans.

What carries the argument

The self-play loop using a model-free actor-critic deep reinforcement learning algorithm that learns bidding and bluffing policies directly from game outcomes.

If this is right

Solly develops novel bidding strategies and effective randomization that top humans cannot easily exploit.
The same training method yields better win rates and equity than large language models on the same Liar's Poker tasks.
Self-play alone suffices to reach human-level results in this imperfect-information multi-player setting without additional search or human demonstrations.
Agents trained this way maintain positive equity across both heads-up and multi-player formats.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same self-play approach could transfer to other games that reward sustained multi-party deception rather than quick reduction to two players.
If the reduced format preserves the core challenge, the identical algorithm might reach comparable performance in the full-scale version of the game.
Success without explicit rules for bluff frequency suggests the method discovers equilibrium mixing strategies automatically.

Load-bearing premise

That performance above 50 percent wins and positive equity in the reduced-format game is enough to establish elite human play and resistance to exploitation.

What would settle it

A large sample of hands in the same reduced format where world-class human players achieve a win rate above 50 percent against Solly.

Figures

Figures reproduced from arXiv: 2511.03724 by Andrew T. Zheng, Ciamac C. Moallemi, Janos Botyanszki, Richard Dewey.

**Figure 1.** Figure 1: Reduced-format 3x3 Liar’s Poker is played by bidding on the cumulative digits across all players. Solly calculates the bidding policy using a neural network and selects a move from the distribution output by it. Solly was trained via self-play. The first AI vs. human game to capture widespread attention was the 1997 chess match between IBM’s Deep Blue (Campbell et al., 2002) and Garry Kasparov. A similar m… view at source ↗

**Figure 2.** Figure 2: Best response scores for the 3x3 3-player configuration. A lower score means a better quality Solly agent. The first panel shows the average best response score for agents trained to play against various Solly training checkpoints across all player positions. The Solly agents improve (become less exploitable) as training progresses. The second panel shows the scores of the exploiting agents playing in each… view at source ↗

**Figure 3.** Figure 3: Best response scores for 3x3 3-player demonstrating the scaling techniques introduced in Section 7. In the first panel, we rewrite the Liar’s Poker environment to encode hands as digit counts, training on abstract (“canonical”) hands rather than explicit digits. We compare this agent to the original 3x3 3-player agent used for play against elite humans. In the second panel, we compare against an agent trai… view at source ↗

read the original abstract

AI researchers have long focused on poker-like games as a testbed for environments characterized by multi-player dynamics, imperfect information, and reasoning under uncertainty. While recent breakthroughs have matched elite human play at no-limit Texas hold'em, the multi-player dynamics are subdued: most hands converge quickly with only two players engaged through multiple rounds of bidding. In this paper, we present Solly, the first AI agent to achieve elite human play in reduced-format Liar's Poker, a game characterized by extensive multi-player engagement. We trained Solly using self-play with a model-free, actor-critic, deep reinforcement learning algorithm. Solly played at an elite human level as measured by win rate (won over 50% of hands) and equity (money won) in heads-up and multi-player Liar's Poker. Solly also outperformed large language models (LLMs), including those with reasoning abilities, on the same metrics. Solly developed novel bidding strategies, randomized play effectively, and was not easily exploitable by world-class human players.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Solly reaches above 50% win rate in reduced Liar's Poker with self-play RL, but experimental details are missing from the abstract.

read the letter

Hi, This paper's main result is training Solly, an actor-critic RL agent via self-play, to win over half its hands with positive equity in a reduced-format Liar's Poker against both humans and LLMs, in heads-up and multi-player matches. It positions this as the first such elite performance in that setting. The work does a good job extending self-play RL to a game that keeps multiple players active across rounds, which brings out more of the imperfect information and bluffing elements than typical two-player poker reductions. The claims of novel bidding strategies and resistance to exploitation by top humans are interesting if they hold up under scrutiny. Using a model-free approach keeps it straightforward and reproducible in principle. The main weakness is the lack of detail on how the results were measured. There are no mentions of specific hyperparameters, the scale of evaluation (how many hands), error bars, or rigorous statistical tests. The reduced format is practical but we need to see if it preserves the key challenges of the full game. A win rate just over 50% is a start, but without broader opponent pools or variance analysis, it's tough to be sure about true non-exploitability. That said, the approach itself is standard enough that if the experiments are solid, it could serve as a useful benchmark for future work in this area. This is aimed at people doing RL for multi-agent games with hidden information. Anyone looking at applications in strategic decision making under uncertainty could take something from the training setup. It seems worth a serious referee to examine the methods and any additional results in the full paper. I would recommend putting it through peer review.

Referee Report

3 major / 2 minor

Summary. The paper presents Solly, an AI agent trained via self-play using a model-free actor-critic deep reinforcement learning algorithm on a reduced-format version of Liar's Poker. It claims that Solly reaches elite human performance as measured by winning over 50% of hands and achieving positive equity in both heads-up and multi-player settings, while also outperforming LLMs (including reasoning-enabled ones), developing novel bidding strategies, randomizing play effectively, and resisting exploitation by world-class human players.

Significance. If the empirical results are robustly validated, the work would extend self-play RL successes from two-player-dominated games like no-limit Texas Hold'em to a setting with more sustained multi-player engagement, imperfect information, and bluffing. Demonstrating that standard model-free actor-critic methods can produce non-exploitable strategies with positive equity against elite humans would be a useful data point for multi-agent RL in extensive-form games.

major comments (3)

[Results] Results section (and any associated tables/figures reporting win rates and equity): the central performance claims of >50% win rate and positive equity are stated without the number of hands evaluated, opponent skill distribution, statistical tests, confidence intervals, or error bars. This information is load-bearing for the 'elite human play' and 'not easily exploitable' conclusions.
[§3] §3 (Game Description): the reduced-format Liar's Poker is introduced as preserving multi-player bluffing dynamics, but no ablation, complexity analysis, or comparison to the full game is provided to show that the core imperfect-information and multi-player engagement challenges remain intact; this underpins the claim of advancing beyond Texas Hold'em-style dynamics.
[Experimental Setup] Experimental Setup / Training section: key hyperparameters (network architecture, learning rates, self-play schedule, number of training iterations) and baseline comparisons are not reported, preventing assessment of whether the reported outcomes are reproducible or attributable to the algorithm rather than implementation details.

minor comments (2)

[Results] Notation for equity and win-rate metrics should be defined explicitly in the first results table or equation to avoid ambiguity when comparing heads-up vs. multi-player settings.
[Introduction] The abstract and introduction would benefit from a brief sentence clarifying how 'elite human' opponents were recruited and verified (e.g., tournament records or self-reported skill).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment below and have revised the paper accordingly.

read point-by-point responses

Referee: [Results] Results section (and any associated tables/figures reporting win rates and equity): the central performance claims of >50% win rate and positive equity are stated without the number of hands evaluated, opponent skill distribution, statistical tests, confidence intervals, or error bars. This information is load-bearing for the 'elite human play' and 'not easily exploitable' conclusions.

Authors: We fully agree that these details are necessary to support our claims of elite human-level performance. In the revised manuscript, we have updated the Results section to include the total number of hands evaluated in our experiments, a description of the opponent skill distribution (including matches against world-class players), and added statistical tests, 95% confidence intervals, and error bars to all win rate and equity figures and tables. These additions provide robust validation for the performance claims. revision: yes
Referee: [§3] §3 (Game Description): the reduced-format Liar's Poker is introduced as preserving multi-player bluffing dynamics, but no ablation, complexity analysis, or comparison to the full game is provided to show that the core imperfect-information and multi-player engagement challenges remain intact; this underpins the claim of advancing beyond Texas Hold'em-style dynamics.

Authors: We appreciate this point and have addressed it by adding a new paragraph and analysis in Section 3. This includes a complexity comparison (e.g., information set sizes and branching factors) between the reduced and full formats, as well as an ablation study demonstrating that multi-player bluffing and imperfect information dynamics are preserved in the reduced version. This supports our claim that the game retains the key challenges beyond those in two-player poker variants. revision: yes
Referee: [Experimental Setup] Experimental Setup / Training section: key hyperparameters (network architecture, learning rates, self-play schedule, number of training iterations) and baseline comparisons are not reported, preventing assessment of whether the reported outcomes are reproducible or attributable to the algorithm rather than implementation details.

Authors: We agree that reproducibility requires these details. We have significantly expanded the Experimental Setup and Training sections to report the neural network architecture, learning rates, self-play schedule, number of training iterations, and additional baseline comparisons. We have also included pseudocode for the training procedure and made the source code available to ensure the results can be reproduced. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical RL results are self-contained

full rationale

The paper presents an empirical outcome from standard model-free actor-critic self-play RL training in a reduced Liar's Poker game. Win rate (>50%) and equity metrics are direct measurements from play against humans and LLMs, not quantities defined in terms of fitted parameters, self-referential equations, or predictions that reduce to inputs by construction. No load-bearing derivations, uniqueness theorems, or ansatzes appear in the abstract or described methodology. The central claim rests on training and evaluation results that are externally falsifiable and independent of any internal definitional loop.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions of reinforcement learning applied to a new game domain; no new free parameters, axioms, or invented entities are introduced beyond typical RL training choices.

free parameters (1)

Actor-critic network architecture and learning hyperparameters
Typical deep RL choices that must be tuned for the reported performance but are not specified in the abstract.

axioms (1)

domain assumption Self-play in imperfect-information games produces robust, non-exploitable strategies
Implicit foundation for claiming elite performance and resistance to human exploitation.

pith-pipeline@v0.9.0 · 5732 in / 1292 out tokens · 44646 ms · 2026-05-18T00:55:36.654939+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

Human-level play in the game of diplomacy by combining language models with strategic reasoning

doi: 10.1126/science.ade9097. URL https://www.science.org/doi/10. 1126/science.ade9097. Murray Campbell, A. Joseph Hoane Jr., and Feng hsiung Hsu. Deep blue. Artificial Intelligence, 134(1–2):57–83,

work page doi:10.1126/science.ade9097
[2]

Joseph Hoane, and Feng-hsiung Hsu

doi: 10.1016/S0004-3702(01)00129-1. Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. Computing a nash equilibrium: The complexity of games. Communications of the ACM , 52(2):89–97,

work page doi:10.1016/s0004-3702(01)00129-1
[3]

Goldberg, and Christos H

doi: 10.1145/1461928.1461951. Kousha Etessami and Mihalis Yannakakis. On the complexity of nash equilibria and other fixed points. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS

work page doi:10.1145/1461928.1461951
[4]

Quentin Gendre and Tomoyuki Kaneko

doi: 10.1109/FOCS.2007.52. Quentin Gendre and Tomoyuki Kaneko. Counterfactual regret minimisation for playing the multiplayer bluﬀing dice game dudo. In The 24th Game Programming Workshop , pages 181– 187, Tokyo, Japan,

work page doi:10.1109/focs.2007.52 2007
[6]

Openspiel: A frame- work for reinforcement learning in games.arXiv preprint arXiv:1908.09453,

URL https://arxiv.org/abs/1908.09453. 20 Michael Lewis. Liar’s Poker: Rising Through the Wreckage on Wall Street . W. W. Norton & Company, New York,

work page arXiv 1908
[7]

URL https://www.science.org/doi/abs/10.1126/sciadv.adg3256

1126/sciadv.adg3256. URL https://www.science.org/doi/abs/10.1126/sciadv.adg3256. Daming Shi, Xudong Guo, Yi Liu, and Wenhui Fan. Optimal policy of multiplayer poker via actor-critic reinforcement learning. Entropy, 24(6),

work page doi:10.1126/sciadv.adg3256
[8]

J., et al

doi: 10.1038/nature16961. Samuel Sokota, Ryan D’Orazio, J. Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. In International Conference on Learning Representations (ICLR),

work page doi:10.1038/nature16961
[9]

A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games.arXiv preprint arXiv:2206.05825, 2022

URL https://arxiv.org/abs/2206.05825. 21 Stratego News. Deepnash surprises top stratego players. https://www.strategonews.com/ wc2023/deepnash-surprises-top-stratego-players/ , August

work page arXiv
[10]

Finbarr Timbers, Nolan Bard, Edward Lockhart, Marc Lanctot, Martin Schmid, Neil Burch, Ju- lian Schrittwieser, Thomas Hubert, and Michael Bowling

At the 2023 World Championship in Amsterdam, DeepMind’s DeepNash won 19 games vs players’ 9, as detailed in this report. Finbarr Timbers, Nolan Bard, Edward Lockhart, Marc Lanctot, Martin Schmid, Neil Burch, Ju- lian Schrittwieser, Thomas Hubert, and Michael Bowling. Approximate exploitability: Learn- ing a best response. In Proceedings of the Thirty-Firs...

work page 2023
[11]

URL https://www.ijcai.org/proceedings/2022/484

doi: 10.24963/ijcai.2022/484. URL https://www.ijcai.org/proceedings/2022/484. Vaughan C. Turekian, Pierre-Bruno Ruﬀini, Stephen Rayner, Kristin M. Lord, and Pe- ter D. Gluckman. Science diplomacy: New global challenges, new trends. Sci- ence & Diplomacy , 11(1),

work page doi:10.24963/ijcai.2022/484 2022
[12]

Amos Tversky and Daniel Kahneman

https://www.sciencediplomacy.org/article/2022/ science-diplomacy-new-global-challenges-new-trends . Amos Tversky and Daniel Kahneman. Judgment under uncertainty: Heuristics and biases. Science, 185(4157):1124–1131,

work page 2022
[13]

Adversarial Policies Beat Professional-Level Go AIs

URL https://arxiv.org/abs/2211.00241. Xiaoyu Zhang, Zihan Lin, Kun Zhang, Lei Yu, Shu Wu, and Xiangnan Tan. Chessgpt: Bridging policy learning and language modeling. arXiv preprint arXiv:2306.09200 ,

work page arXiv
[14]

Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione

URL https: //arxiv.org/abs/2306.09200. Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimiza- tion in games with incomplete information. In NIPS,

work page arXiv

[1] [1]

Human-level play in the game of diplomacy by combining language models with strategic reasoning

doi: 10.1126/science.ade9097. URL https://www.science.org/doi/10. 1126/science.ade9097. Murray Campbell, A. Joseph Hoane Jr., and Feng hsiung Hsu. Deep blue. Artificial Intelligence, 134(1–2):57–83,

work page doi:10.1126/science.ade9097

[2] [2]

Joseph Hoane, and Feng-hsiung Hsu

doi: 10.1016/S0004-3702(01)00129-1. Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. Computing a nash equilibrium: The complexity of games. Communications of the ACM , 52(2):89–97,

work page doi:10.1016/s0004-3702(01)00129-1

[3] [3]

Goldberg, and Christos H

doi: 10.1145/1461928.1461951. Kousha Etessami and Mihalis Yannakakis. On the complexity of nash equilibria and other fixed points. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS

work page doi:10.1145/1461928.1461951

[4] [4]

Quentin Gendre and Tomoyuki Kaneko

doi: 10.1109/FOCS.2007.52. Quentin Gendre and Tomoyuki Kaneko. Counterfactual regret minimisation for playing the multiplayer bluﬀing dice game dudo. In The 24th Game Programming Workshop , pages 181– 187, Tokyo, Japan,

work page doi:10.1109/focs.2007.52 2007

[5] [6]

Openspiel: A frame- work for reinforcement learning in games.arXiv preprint arXiv:1908.09453,

URL https://arxiv.org/abs/1908.09453. 20 Michael Lewis. Liar’s Poker: Rising Through the Wreckage on Wall Street . W. W. Norton & Company, New York,

work page arXiv 1908

[6] [7]

URL https://www.science.org/doi/abs/10.1126/sciadv.adg3256

1126/sciadv.adg3256. URL https://www.science.org/doi/abs/10.1126/sciadv.adg3256. Daming Shi, Xudong Guo, Yi Liu, and Wenhui Fan. Optimal policy of multiplayer poker via actor-critic reinforcement learning. Entropy, 24(6),

work page doi:10.1126/sciadv.adg3256

[7] [8]

J., et al

doi: 10.1038/nature16961. Samuel Sokota, Ryan D’Orazio, J. Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. In International Conference on Learning Representations (ICLR),

work page doi:10.1038/nature16961

[8] [9]

A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games.arXiv preprint arXiv:2206.05825, 2022

URL https://arxiv.org/abs/2206.05825. 21 Stratego News. Deepnash surprises top stratego players. https://www.strategonews.com/ wc2023/deepnash-surprises-top-stratego-players/ , August

work page arXiv

[9] [10]

Finbarr Timbers, Nolan Bard, Edward Lockhart, Marc Lanctot, Martin Schmid, Neil Burch, Ju- lian Schrittwieser, Thomas Hubert, and Michael Bowling

At the 2023 World Championship in Amsterdam, DeepMind’s DeepNash won 19 games vs players’ 9, as detailed in this report. Finbarr Timbers, Nolan Bard, Edward Lockhart, Marc Lanctot, Martin Schmid, Neil Burch, Ju- lian Schrittwieser, Thomas Hubert, and Michael Bowling. Approximate exploitability: Learn- ing a best response. In Proceedings of the Thirty-Firs...

work page 2023

[10] [11]

URL https://www.ijcai.org/proceedings/2022/484

doi: 10.24963/ijcai.2022/484. URL https://www.ijcai.org/proceedings/2022/484. Vaughan C. Turekian, Pierre-Bruno Ruﬀini, Stephen Rayner, Kristin M. Lord, and Pe- ter D. Gluckman. Science diplomacy: New global challenges, new trends. Sci- ence & Diplomacy , 11(1),

work page doi:10.24963/ijcai.2022/484 2022

[11] [12]

Amos Tversky and Daniel Kahneman

https://www.sciencediplomacy.org/article/2022/ science-diplomacy-new-global-challenges-new-trends . Amos Tversky and Daniel Kahneman. Judgment under uncertainty: Heuristics and biases. Science, 185(4157):1124–1131,

work page 2022

[12] [13]

Adversarial Policies Beat Professional-Level Go AIs

URL https://arxiv.org/abs/2211.00241. Xiaoyu Zhang, Zihan Lin, Kun Zhang, Lei Yu, Shu Wu, and Xiangnan Tan. Chessgpt: Bridging policy learning and language modeling. arXiv preprint arXiv:2306.09200 ,

work page arXiv

[13] [14]

Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione

URL https: //arxiv.org/abs/2306.09200. Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimiza- tion in games with incomplete information. In NIPS,

work page arXiv