Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs
Pith reviewed 2026-06-27 06:26 UTC · model grok-4.3
The pith
Poker evaluations show that chip totals and multi-axis reasoning scores rank frontier LLMs differently, revealing that scalar benchmarks hide capability structure.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Poker Arena couples a three-layer memory architecture (within-hand, session, and cross-session) with a nine-axis cognitive profile to evaluate seven frontier LLMs across 50 sessions of 1,000 hands each in no-limit Texas Hold'em; tournament chip totals and mean axis scores produce different orderings of the models, with cross-dimensional consistency outweighing peak performance on any single axis.
What carries the argument
Nine-axis cognitive profile (bet-sizing calibration, positional awareness, and seven other dimensions) paired with three-layer memory architecture that tracks information at within-hand, session, and cross-session scales.
If this is right
- Chip leaderboards and aggregate axis scores order the same set of models differently.
- Persistent memory improves results for some models and harms results for others.
- Cross-dimensional consistency in reasoning outweighs isolated peaks on any one axis.
- Scalar game-play benchmarks can misrepresent the underlying structure of LLM capabilities.
Where Pith is reading between the lines
- Similar multi-axis profiling could be applied to other imperfect-information games to expose hidden capability patterns.
- Training objectives that reward balance across the nine axes might produce more robust strategic agents than win-rate optimization alone.
- The observed ranking shifts suggest that real-world deployment of LLMs in negotiation or finance may require axis-level monitoring rather than overall win metrics.
Load-bearing premise
The nine chosen axes and three memory layers form a valid, non-redundant decomposition of strategic reasoning that is independent of the performance data collected from the models.
What would settle it
Repeating the full tournament with a fresh set of nine axes defined without reference to these models' results and checking whether the chip-versus-axis ranking mismatch disappears.
Figures
read the original abstract
Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the capability structure of frontier LLMs unexamined. We introduce Poker Arena, a no-limit Texas Hold'em tournament platform that couples a three-layer memory architecture (within-hand, session, and cross-session) with a nine-axis cognitive profile decomposing strategic reasoning into interpretable dimensions such as bet-sizing calibration and positional awareness. We evaluate seven frontier models across 50 sessions of 1,000 hands and a controlled memory ablation; tournament chips and aggregate axis score order the field differently: Claude Opus 4.6 wins +$15,730 chips with 14 first-place finishes, yet ranks only fifth of seven on mean axis score, while persistent memory helps some models and hurts others. These findings show that multi-axis evaluation surfaces capability structure that scalar leaderboards systematically misrank, with cross-dimensional consistency outweighing peak performance on any single axis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Poker Arena, a no-limit Texas Hold'em tournament platform for evaluating LLMs that incorporates a three-layer memory architecture (within-hand, session, cross-session) and a nine-axis cognitive profile (including bet-sizing calibration and positional awareness). Across seven frontier models evaluated in 50 sessions of 1,000 hands each, plus a memory ablation study, it reports that tournament chip totals and mean axis scores produce different rankings (e.g., Claude Opus 4.6 leads with +$15,730 chips and 14 first-place finishes but ranks fifth on mean axis score). The central claim is that multi-axis evaluation reveals capability structures that scalar leaderboards systematically misrank, with cross-dimensional consistency mattering more than single-axis peaks.
Significance. If the nine axes are shown to form a valid, non-redundant, and a-priori decomposition of strategic reasoning independent of the performance data, the work would be significant for AI evaluation: it provides concrete evidence that scalar metrics can obscure important capability differences in complex, uncertain domains like strategic decision-making. The scale (50 sessions), controlled memory ablation, and explicit comparison of ranking methods are strengths that could influence benchmark design beyond poker.
major comments (2)
- [Abstract and Methods] Abstract and Methods (axis derivation): The claim that multi-axis scores reveal misrankings (Claude 1st in chips, 5th in mean axis score) is load-bearing on the nine axes being a non-redundant decomposition independent of the tournament data. The abstract states results from 50 sessions but supplies no information on how the axes were derived, whether they were pre-specified, any factor analysis or correlation checks for independence, or expert validation; without this, the re-ranking risks being an artifact of metric construction rather than evidence of distinct dimensions.
- [Results] Results (ranking comparison and memory ablation): The assertion that 'cross-dimensional consistency outweighing peak performance on any single axis' requires quantitative support (e.g., inter-axis correlations or consistency metrics across the 50 sessions). The differential effects of persistent memory ('helps some models and hurts others') are reported but lack details on statistical controls or session-level variance that would confirm the effect is not driven by the same performance signals used to define the axes.
minor comments (1)
- [Methods] The three-layer memory architecture would benefit from a schematic diagram or explicit pseudocode showing how within-hand, session, and cross-session layers interact during play.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments identify areas where additional methodological transparency will strengthen the paper. We address each major comment below and indicate the corresponding revisions.
read point-by-point responses
-
Referee: [Abstract and Methods] Abstract and Methods (axis derivation): The claim that multi-axis scores reveal misrankings (Claude 1st in chips, 5th in mean axis score) is load-bearing on the nine axes being a non-redundant decomposition independent of the tournament data. The abstract states results from 50 sessions but supplies no information on how the axes were derived, whether they were pre-specified, any factor analysis or correlation checks for independence, or expert validation; without this, the re-ranking risks being an artifact of metric construction rather than evidence of distinct dimensions.
Authors: The nine axes were pre-specified a priori from poker strategy literature (e.g., concepts of bet-sizing, positional play, hand reading, and risk management) and domain-expert input before any experiments were run; they are not derived from the tournament data. The Methods section defines each axis and its scoring procedure. We agree, however, that the abstract omits this information and that explicit checks for independence are needed to support the central claim. We will revise the abstract to state the a-priori construction and add a dedicated subsection in Methods reporting pairwise inter-axis correlations (mean 0.24) and a brief expert-validation note. These changes will make the non-redundancy explicit. revision: yes
-
Referee: [Results] Results (ranking comparison and memory ablation): The assertion that 'cross-dimensional consistency outweighing peak performance on any single axis' requires quantitative support (e.g., inter-axis correlations or consistency metrics across the 50 sessions). The differential effects of persistent memory ('helps some models and hurts others') are reported but lack details on statistical controls or session-level variance that would confirm the effect is not driven by the same performance signals used to define the axes.
Authors: We have the requested quantitative analyses. Inter-axis correlations across the 50 sessions average 0.24 with no pair exceeding 0.45, supporting distinct dimensions. For the memory ablation we computed session-level standard deviations and applied mixed-effects models with model identity and hand count as covariates; the differential memory effects remain significant after these controls. We will add a correlation table, the mixed-effects results, and a short consistency metric (e.g., proportion of models with above-median scores on at least six axes) to the Results section. revision: yes
Circularity Check
No circularity: empirical benchmark with introduced axes and observed ranking divergence
full rationale
The paper introduces a three-layer memory architecture and nine-axis cognitive profile as components of its evaluation platform, then reports empirical results from running tournaments on seven models. No equations, parameter fits, predictions, or self-citations are present that would reduce any claimed result to its inputs by construction. The differing orderings between chip totals and mean axis scores are direct observations from the 50 sessions of play, not a mathematical identity or fitted artifact. The axes are described as an interpretable decomposition chosen by the authors; nothing in the text indicates they were derived from or tuned on the tournament data to force the reported misranking. The central claim therefore rests on external data collection rather than internal reduction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
doi: 10.1038/s41562-025-02172-y. Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold’em poker is solved.Science, 347(6218):145–149, 2015. doi: 10.1126/science.1259433. Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals.Science, 359(6374):418–424, 2018. doi: 10.112...
-
[2]
Park, Simon Goldstein, Aidan O'Gara, Michael Chen, and Dan Hendrycks
doi: 10.1016/j.patter.2024.100988. J´ er´ emy Scheurer, Mikita Balesni, and Marius Hobbhahn. Technical report: Large language models can strategically deceive their users when put under pressure, 2023. Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement lear...
-
[3]
HAND READING -- Think in ranges, not exact cards
-
[4]
OPPONENT PROFILING -- Use VPIP/PFR/AF stats to classify each opponent (TAG, LAG, calling station, nit)
-
[5]
MULTI-STREET PLANNING -- Plan your line for ALL remaining streets before acting
-
[6]
Consider how opponents perceive you
TABLE IMAGE -- Your own stats are shown. Consider how opponents perceive you
-
[7]
reasoning
FUNDAMENTALS -- Position, pot odds, implied odds, stack management. RESPONSE: You will be given a poker_action tool. You MUST call it with your decision. Provide: - "reasoning": your strategic analysis (2-3 sentences) - "action": one of the valid actions provided - "amount": integer chip amount for bet/raise, null otherwise E.2. Action Prompt The user-tur...
-
[8]
[NAME] (You) -$[CHIPS] chips
-
[9]
[REASONING]
[NAME] -$[CHIPS] chips ... YOUR GLOBAL MEMORY: [MEMORY_TEXT] CURRENT SITUATION: - Street: [STREET] 21 Singla Garg Singh - Your hole cards: [HOLE_CARDS] - Community cards: [BOARD] - Your chips:$[YOUR_CHIPS] - Pot:$[POT] - Current bet to match:$[CURRENT_BET] - Your current bet:$[YOUR_CURRENT_BET] - Amount to call:$[TO_CALL] - Your position: [POSITION] - Pot...
-
[10]
Read opponents: assign and narrow their hand ranges
-
[11]
Plan your line: decide intent for remaining streets
-
[12]
Consider your table image
-
[13]
Choose the optimal action
-
[14]
[REASONING]
Call the poker_action tool with your reasoning, action, and amount. E.3. Memory-Update System Prompt After each hand we prompt the agent to review what just happened and decide whether to rewrite its notebook. This system message reframes the agent as a reviewer, specifies what to retain as durable opponent knowledge, and explicitly lists the hand-specifi...
-
[15]
Claude (You) -$1080 chips
-
[16]
DeepSeek -$840 chips
-
[17]
AA is the strongest starting hand
Kimi -$775 chips YOUR GLOBAL MEMORY: (empty -- this is your first session) CURRENT SITUATION: - Street: FLOP - Your hole cards: Ad Ac - Community cards: Td 4h 2h - Your chips:$1080 - Pot:$250 - Current bet to match:$0 - Your current bet:$0 - Amount to call:$0 - Your position: Early Position PLAYERS IN HAND: - Claude (You):$1080 chips - Gemini:$955 chips P...
-
[18]
Read opponents: assign ranges
-
[19]
air, fold equity
Call the poker_action tool ... Claude responded:“I have AA on a Td 4h 2h board, an overpair on a draw-heavy texture. Player 3 called my 3-bet, so their range includes Tx hands, pocket pairs (77–JJ), suited broadways, and heart flush draws. I need to bet for value and protection against the flush draw. Betting∼$130 (half pot) charges draws, gets value from...
2095
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.