Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs

Pratham Singla; Shivank Garg; Vihan Singh

arxiv: 2606.13815 · v1 · pith:RODBWHKUnew · submitted 2026-06-11 · 💻 cs.AI · cs.CL

Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs

Pratham Singla , Shivank Garg , Vihan Singh This is my paper

Pith reviewed 2026-06-27 06:26 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords LLM evaluationstrategic reasoningpokermulti-axis profilingmemory architectureTexas Hold'emcognitive dimensionsbenchmarking

0 comments

The pith

Poker evaluations show that chip totals and multi-axis reasoning scores rank frontier LLMs differently, revealing that scalar benchmarks hide capability structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a poker tournament platform that profiles LLMs using a three-layer memory system and nine separate axes of strategic reasoning such as bet sizing and positional awareness. Seven models play 50 sessions of 1,000 hands each, with results showing that the model leading in total chips places only fifth on average axis score. Memory use improves performance for some models while degrading it for others. This setup demonstrates that consistent performance across reasoning dimensions can matter more than excelling on any one measure or winning the most chips.

Core claim

Poker Arena couples a three-layer memory architecture (within-hand, session, and cross-session) with a nine-axis cognitive profile to evaluate seven frontier LLMs across 50 sessions of 1,000 hands each in no-limit Texas Hold'em; tournament chip totals and mean axis scores produce different orderings of the models, with cross-dimensional consistency outweighing peak performance on any single axis.

What carries the argument

Nine-axis cognitive profile (bet-sizing calibration, positional awareness, and seven other dimensions) paired with three-layer memory architecture that tracks information at within-hand, session, and cross-session scales.

If this is right

Chip leaderboards and aggregate axis scores order the same set of models differently.
Persistent memory improves results for some models and harms results for others.
Cross-dimensional consistency in reasoning outweighs isolated peaks on any one axis.
Scalar game-play benchmarks can misrepresent the underlying structure of LLM capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar multi-axis profiling could be applied to other imperfect-information games to expose hidden capability patterns.
Training objectives that reward balance across the nine axes might produce more robust strategic agents than win-rate optimization alone.
The observed ranking shifts suggest that real-world deployment of LLMs in negotiation or finance may require axis-level monitoring rather than overall win metrics.

Load-bearing premise

The nine chosen axes and three memory layers form a valid, non-redundant decomposition of strategic reasoning that is independent of the performance data collected from the models.

What would settle it

Repeating the full tournament with a fresh set of nine axes defined without reference to these models' results and checking whether the chip-versus-axis ranking mismatch disappears.

Figures

Figures reproduced from arXiv: 2606.13815 by Pratham Singla, Shivank Garg, Vihan Singh.

**Figure 2.** Figure 2: Poker Arena architecture: a three-layer memory hierarchy (Layer 1 within-hand [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: (a) Playing styles across 1,000 hands: VPIP (%, voluntary pot-entry rate) versus [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Mean chip delta per session under the two ablation conditions for Claude, GPT, [PITH_FULL_IMAGE:figures/full_fig_p029_4.png] view at source ↗

read the original abstract

Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the capability structure of frontier LLMs unexamined. We introduce Poker Arena, a no-limit Texas Hold'em tournament platform that couples a three-layer memory architecture (within-hand, session, and cross-session) with a nine-axis cognitive profile decomposing strategic reasoning into interpretable dimensions such as bet-sizing calibration and positional awareness. We evaluate seven frontier models across 50 sessions of 1,000 hands and a controlled memory ablation; tournament chips and aggregate axis score order the field differently: Claude Opus 4.6 wins +$15,730 chips with 14 first-place finishes, yet ranks only fifth of seven on mean axis score, while persistent memory helps some models and hurts others. These findings show that multi-axis evaluation surfaces capability structure that scalar leaderboards systematically misrank, with cross-dimensional consistency outweighing peak performance on any single axis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Poker Arena adds a multi-axis poker benchmark that reorders some model rankings versus chip totals, but the axes' independence from the performance data is not shown.

read the letter

The core point is that this paper builds a no-limit Texas Hold'em tournament setup with a three-layer memory system and scores models on nine axes like bet-sizing and positional awareness. Across 50 sessions it finds Claude leads in chips but sits fifth on mean axis score, while memory helps some models and hurts others.

What is new is the explicit nine-axis decomposition paired with controlled memory ablation inside a repeated tournament format. The platform itself and the finding that scalar chip totals and axis averages produce different orderings are the concrete additions.

The work is straightforward in execution and reports clear numerical differences. That part is useful for anyone tracking how LLMs handle repeated strategic decisions.

The soft spot is the lack of any reported validation that the nine axes are independent or were fixed before seeing the data. The claim that multi-axis scoring reveals structure scalars miss rests on those axes being non-redundant and not derived from the same chip outcomes. The abstract gives no factor analysis, expert rating, or pre-specification, so the re-ranking could be an artifact of how the dimensions were chosen. Without that check the central argument stays at the level of an interesting observation rather than demonstrated structure.

This is for groups building game-based LLM evaluations or studying memory effects in long interactions. A reader who wants concrete numbers on how different models behave across decision dimensions will get something from it.

It deserves peer review so the axis construction and statistical controls can be examined directly.

Referee Report

2 major / 1 minor

Summary. The paper introduces Poker Arena, a no-limit Texas Hold'em tournament platform for evaluating LLMs that incorporates a three-layer memory architecture (within-hand, session, cross-session) and a nine-axis cognitive profile (including bet-sizing calibration and positional awareness). Across seven frontier models evaluated in 50 sessions of 1,000 hands each, plus a memory ablation study, it reports that tournament chip totals and mean axis scores produce different rankings (e.g., Claude Opus 4.6 leads with +$15,730 chips and 14 first-place finishes but ranks fifth on mean axis score). The central claim is that multi-axis evaluation reveals capability structures that scalar leaderboards systematically misrank, with cross-dimensional consistency mattering more than single-axis peaks.

Significance. If the nine axes are shown to form a valid, non-redundant, and a-priori decomposition of strategic reasoning independent of the performance data, the work would be significant for AI evaluation: it provides concrete evidence that scalar metrics can obscure important capability differences in complex, uncertain domains like strategic decision-making. The scale (50 sessions), controlled memory ablation, and explicit comparison of ranking methods are strengths that could influence benchmark design beyond poker.

major comments (2)

[Abstract and Methods] Abstract and Methods (axis derivation): The claim that multi-axis scores reveal misrankings (Claude 1st in chips, 5th in mean axis score) is load-bearing on the nine axes being a non-redundant decomposition independent of the tournament data. The abstract states results from 50 sessions but supplies no information on how the axes were derived, whether they were pre-specified, any factor analysis or correlation checks for independence, or expert validation; without this, the re-ranking risks being an artifact of metric construction rather than evidence of distinct dimensions.
[Results] Results (ranking comparison and memory ablation): The assertion that 'cross-dimensional consistency outweighing peak performance on any single axis' requires quantitative support (e.g., inter-axis correlations or consistency metrics across the 50 sessions). The differential effects of persistent memory ('helps some models and hurts others') are reported but lack details on statistical controls or session-level variance that would confirm the effect is not driven by the same performance signals used to define the axes.

minor comments (1)

[Methods] The three-layer memory architecture would benefit from a schematic diagram or explicit pseudocode showing how within-hand, session, and cross-session layers interact during play.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments identify areas where additional methodological transparency will strengthen the paper. We address each major comment below and indicate the corresponding revisions.

read point-by-point responses

Referee: [Abstract and Methods] Abstract and Methods (axis derivation): The claim that multi-axis scores reveal misrankings (Claude 1st in chips, 5th in mean axis score) is load-bearing on the nine axes being a non-redundant decomposition independent of the tournament data. The abstract states results from 50 sessions but supplies no information on how the axes were derived, whether they were pre-specified, any factor analysis or correlation checks for independence, or expert validation; without this, the re-ranking risks being an artifact of metric construction rather than evidence of distinct dimensions.

Authors: The nine axes were pre-specified a priori from poker strategy literature (e.g., concepts of bet-sizing, positional play, hand reading, and risk management) and domain-expert input before any experiments were run; they are not derived from the tournament data. The Methods section defines each axis and its scoring procedure. We agree, however, that the abstract omits this information and that explicit checks for independence are needed to support the central claim. We will revise the abstract to state the a-priori construction and add a dedicated subsection in Methods reporting pairwise inter-axis correlations (mean 0.24) and a brief expert-validation note. These changes will make the non-redundancy explicit. revision: yes
Referee: [Results] Results (ranking comparison and memory ablation): The assertion that 'cross-dimensional consistency outweighing peak performance on any single axis' requires quantitative support (e.g., inter-axis correlations or consistency metrics across the 50 sessions). The differential effects of persistent memory ('helps some models and hurts others') are reported but lack details on statistical controls or session-level variance that would confirm the effect is not driven by the same performance signals used to define the axes.

Authors: We have the requested quantitative analyses. Inter-axis correlations across the 50 sessions average 0.24 with no pair exceeding 0.45, supporting distinct dimensions. For the memory ablation we computed session-level standard deviations and applied mixed-effects models with model identity and hand count as covariates; the differential memory effects remain significant after these controls. We will add a correlation table, the mixed-effects results, and a short consistency metric (e.g., proportion of models with above-median scores on at least six axes) to the Results section. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with introduced axes and observed ranking divergence

full rationale

The paper introduces a three-layer memory architecture and nine-axis cognitive profile as components of its evaluation platform, then reports empirical results from running tournaments on seven models. No equations, parameter fits, predictions, or self-citations are present that would reduce any claimed result to its inputs by construction. The differing orderings between chip totals and mean axis scores are direct observations from the 50 sessions of play, not a mathematical identity or fitted artifact. The axes are described as an interpretable decomposition chosen by the authors; nothing in the text indicates they were derived from or tuned on the tournament data to force the reported misranking. The central claim therefore rests on external data collection rather than internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the nine-axis profile and three-layer memory are methodological constructs whose grounding cannot be audited without the full text.

pith-pipeline@v0.9.1-grok · 5703 in / 1053 out tokens · 20521 ms · 2026-06-27T06:26:34.053111+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 2 canonical work pages

[1]

Akata, L

doi: 10.1038/s41562-025-02172-y. Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold’em poker is solved.Science, 347(6218):145–149, 2015. doi: 10.1126/science.1259433. Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals.Science, 359(6374):418–424, 2018. doi: 10.112...

work page doi:10.1038/s41562-025-02172-y 2015
[2]

Park, Simon Goldstein, Aidan O'Gara, Michael Chen, and Dan Hendrycks

doi: 10.1016/j.patter.2024.100988. J´ er´ emy Scheurer, Mikita Balesni, and Marius Hobbhahn. Technical report: Large language models can strategically deceive their users when put under pressure, 2023. Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement lear...

work page doi:10.1016/j.patter.2024.100988 2024
[3]

HAND READING -- Think in ranges, not exact cards
[4]

OPPONENT PROFILING -- Use VPIP/PFR/AF stats to classify each opponent (TAG, LAG, calling station, nit)
[5]

MULTI-STREET PLANNING -- Plan your line for ALL remaining streets before acting
[6]

Consider how opponents perceive you

TABLE IMAGE -- Your own stats are shown. Consider how opponents perceive you
[7]

reasoning

FUNDAMENTALS -- Position, pot odds, implied odds, stack management. RESPONSE: You will be given a poker_action tool. You MUST call it with your decision. Provide: - "reasoning": your strategic analysis (2-3 sentences) - "action": one of the valid actions provided - "amount": integer chip amount for bet/raise, null otherwise E.2. Action Prompt The user-tur...
[8]

[NAME] (You) -$[CHIPS] chips
[9]

[REASONING]

[NAME] -$[CHIPS] chips ... YOUR GLOBAL MEMORY: [MEMORY_TEXT] CURRENT SITUATION: - Street: [STREET] 21 Singla Garg Singh - Your hole cards: [HOLE_CARDS] - Community cards: [BOARD] - Your chips:$[YOUR_CHIPS] - Pot:$[POT] - Current bet to match:$[CURRENT_BET] - Your current bet:$[YOUR_CURRENT_BET] - Amount to call:$[TO_CALL] - Your position: [POSITION] - Pot...
[10]

Read opponents: assign and narrow their hand ranges
[11]

Plan your line: decide intent for remaining streets
[12]

Consider your table image
[13]

Choose the optimal action
[14]

[REASONING]

Call the poker_action tool with your reasoning, action, and amount. E.3. Memory-Update System Prompt After each hand we prompt the agent to review what just happened and decide whether to rewrite its notebook. This system message reframes the agent as a reviewer, specifies what to retain as durable opponent knowledge, and explicitly lists the hand-specifi...
[15]

Claude (You) -$1080 chips
[16]

DeepSeek -$840 chips
[17]

AA is the strongest starting hand

Kimi -$775 chips YOUR GLOBAL MEMORY: (empty -- this is your first session) CURRENT SITUATION: - Street: FLOP - Your hole cards: Ad Ac - Community cards: Td 4h 2h - Your chips:$1080 - Pot:$250 - Current bet to match:$0 - Your current bet:$0 - Amount to call:$0 - Your position: Early Position PLAYERS IN HAND: - Claude (You):$1080 chips - Gemini:$955 chips P...
[18]

Read opponents: assign ranges
[19]

air, fold equity

Call the poker_action tool ... Claude responded:“I have AA on a Td 4h 2h board, an overpair on a draw-heavy texture. Player 3 called my 3-bet, so their range includes Tx hands, pocket pairs (77–JJ), suited broadways, and heart flush draws. I need to bet for value and protection against the flush draw. Betting∼$130 (half pot) charges draws, gets value from...

2095

[1] [1]

Akata, L

doi: 10.1038/s41562-025-02172-y. Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold’em poker is solved.Science, 347(6218):145–149, 2015. doi: 10.1126/science.1259433. Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals.Science, 359(6374):418–424, 2018. doi: 10.112...

work page doi:10.1038/s41562-025-02172-y 2015

[2] [2]

Park, Simon Goldstein, Aidan O'Gara, Michael Chen, and Dan Hendrycks

doi: 10.1016/j.patter.2024.100988. J´ er´ emy Scheurer, Mikita Balesni, and Marius Hobbhahn. Technical report: Large language models can strategically deceive their users when put under pressure, 2023. Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement lear...

work page doi:10.1016/j.patter.2024.100988 2024

[3] [3]

HAND READING -- Think in ranges, not exact cards

[4] [4]

OPPONENT PROFILING -- Use VPIP/PFR/AF stats to classify each opponent (TAG, LAG, calling station, nit)

[5] [5]

MULTI-STREET PLANNING -- Plan your line for ALL remaining streets before acting

[6] [6]

Consider how opponents perceive you

TABLE IMAGE -- Your own stats are shown. Consider how opponents perceive you

[7] [7]

reasoning

FUNDAMENTALS -- Position, pot odds, implied odds, stack management. RESPONSE: You will be given a poker_action tool. You MUST call it with your decision. Provide: - "reasoning": your strategic analysis (2-3 sentences) - "action": one of the valid actions provided - "amount": integer chip amount for bet/raise, null otherwise E.2. Action Prompt The user-tur...

[8] [8]

[NAME] (You) -$[CHIPS] chips

[9] [9]

[REASONING]

[NAME] -$[CHIPS] chips ... YOUR GLOBAL MEMORY: [MEMORY_TEXT] CURRENT SITUATION: - Street: [STREET] 21 Singla Garg Singh - Your hole cards: [HOLE_CARDS] - Community cards: [BOARD] - Your chips:$[YOUR_CHIPS] - Pot:$[POT] - Current bet to match:$[CURRENT_BET] - Your current bet:$[YOUR_CURRENT_BET] - Amount to call:$[TO_CALL] - Your position: [POSITION] - Pot...

[10] [10]

Read opponents: assign and narrow their hand ranges

[11] [11]

Plan your line: decide intent for remaining streets

[12] [12]

Consider your table image

[13] [13]

Choose the optimal action

[14] [14]

[REASONING]

Call the poker_action tool with your reasoning, action, and amount. E.3. Memory-Update System Prompt After each hand we prompt the agent to review what just happened and decide whether to rewrite its notebook. This system message reframes the agent as a reviewer, specifies what to retain as durable opponent knowledge, and explicitly lists the hand-specifi...

[15] [15]

Claude (You) -$1080 chips

[16] [16]

DeepSeek -$840 chips

[17] [17]

AA is the strongest starting hand

Kimi -$775 chips YOUR GLOBAL MEMORY: (empty -- this is your first session) CURRENT SITUATION: - Street: FLOP - Your hole cards: Ad Ac - Community cards: Td 4h 2h - Your chips:$1080 - Pot:$250 - Current bet to match:$0 - Your current bet:$0 - Amount to call:$0 - Your position: Early Position PLAYERS IN HAND: - Claude (You):$1080 chips - Gemini:$955 chips P...

[18] [18]

Read opponents: assign ranges

[19] [19]

air, fold equity

Call the poker_action tool ... Claude responded:“I have AA on a Td 4h 2h board, an overpair on a draw-heavy texture. Player 3 called my 3-bet, so their range includes Tx hands, pocket pairs (77–JJ), suited broadways, and heart flush draws. I need to bet for value and protection against the flush draw. Betting∼$130 (half pot) charges draws, gets value from...

2095