GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

Alex Kenich; Anany Kotawala; Kia Ghods; Vartan Shadarevian

arXiv: 2605.23238 · v1 · pith:WN2VDZSWnew · submitted 2026-05-22 · 💻 cs.AI · cs.GT · cs.LG · cs.MA

Pith reviewed 2026-05-25 04:39 UTC · model grok-4.3

classification 💻 cs.AI · cs.GT · cs.LG · cs.MA

keywords strategic reasoning · large language models · capability profiles · procedural game generation · zero-sum games · jaggedness measure · benchmark evaluation · imperfect information

The pith

GENSTRAT generates fresh card games to expose distinct strategic profiles and local volatility among LLMs with near-identical overall scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes GENSTRAT as a method that samples procedurally generated two-player zero-sum imperfect-information card games on demand to evaluate LLM strategic reasoning. It decomposes performance into six capability axes and adds a jaggedness metric that flags unpredictable performance shifts between similar games. Evaluation across nine models in over 36,000 matches shows that average scores improve with newer models yet top performers differ sharply in their profiles and volatility. A sympathetic reader cares because LLMs now act as agents in real marketplaces where average rankings alone cannot predict behavior in specific deployments. The approach resists benchmark saturation and contamination by drawing new instances each time.

Core claim

GENSTRAT draws from a distribution of two-player zero-sum imperfect-information card games that can be generated fresh for each evaluation run. It pairs this distribution with a capability-profile method that measures competence on state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness, plus a jaggedness measure of within-distribution smoothness. When nine frontier and open-weight models compete in a head-to-head tournament on 50 sampled games, newer models achieve higher average scores, but models with near-identical overall strength display qualitatively different profiles, and two of the top three models prove more locally volatile than the third.
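
A minimal sketch of how a capability profile of this kind could be computed, following the description in the Figure 3 caption (per-model OLS slopes of per-game strength on six z-scored axes, with a per-model intercept). The function name `capability_profile`, the array layout, and the synthetic data are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

AXES = ["state_space", "temporal_depth", "information_sensitivity",
        "opponent_modeling", "risk", "brittleness"]

def capability_profile(strength, axes):
    """Per-model OLS slopes of per-game strength on z-scored axes.

    strength: (n_models, n_games) per-game strength estimates.
    axes:     (n_games, n_axes) raw axis scores for each game.
    Returns a (n_models, n_axes) array of slopes.
    """
    z = (axes - axes.mean(axis=0)) / axes.std(axis=0)
    design = np.column_stack([np.ones(len(z)), z])  # intercept + six axes
    # One least-squares fit per model; each column of strength.T is one model.
    coef, *_ = np.linalg.lstsq(design, strength.T, rcond=None)
    return coef[1:].T  # drop the intercept row

# Synthetic stand-in data: 9 models, 50 games, 6 axes (hypothetical values).
rng = np.random.default_rng(0)
axes = rng.normal(size=(50, len(AXES)))
raw = rng.normal(size=(9, 50))
strength = raw - raw.mean(axis=0)  # strength relative to the across-model mean
slopes = capability_profile(strength, axes)
```

Because every model shares the same design matrix and the strengths are centered across models within each game, the fitted slopes sum to zero across models on every axis, which is consistent with the property stated in the Figure 3 caption.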

What carries the argument

The procedurally generated distribution of two-player zero-sum imperfect-information card games, combined with the six-axis capability profile and the jaggedness measure of local volatility.
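
The exact definition of the jaggedness measure is not given in this summary. The sketch below is one plausible reading, labeled as an assumption: for each game, compare a model's stakes-normalized performance with its performance on the k nearest games in z-scored axis space, then average the absolute gaps. The function name `local_jaggedness` and the choice k = 3 are hypothetical.

```python
import numpy as np

def local_jaggedness(perf, axes, k=3):
    """Mean absolute performance gap between each game and its k nearest
    neighbours in z-scored axis space (an assumed definition, not the paper's).

    perf: (n_games,) stakes-normalized performance of one model.
    axes: (n_games, n_axes) axis scores of the games.
    """
    z = (axes - axes.mean(axis=0)) / axes.std(axis=0)
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)             # a game is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]  # indices of the k closest games
    gaps = np.abs(perf[:, None] - perf[nearest])
    return float(gaps.mean())

# Synthetic illustration: a smooth performance surface versus a noisy one.
rng = np.random.default_rng(1)
axes = rng.normal(size=(50, 6))
flat = np.zeros(50)
smooth = 0.05 * axes.sum(axis=1)
jagged = smooth + rng.normal(scale=1.0, size=50)
```

Under this reading, a model whose advantage varies smoothly with the axes scores low, while a model whose advantage jumps between similar games scores high.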

If this is right

Evaluators can draw new games indefinitely, keeping benchmarks fresh and resistant to contamination.
Deployment choices can prioritize specific capability gaps or low volatility rather than overall ranking alone.
Models that appear equivalent on aggregate scores can be distinguished by their smoothness across strategically similar situations.
Benchmark saturation is avoided because the generator produces new instances rather than reusing fixed canonical games.
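
To illustrate what drawing fresh games on demand can look like in practice, here is a minimal sketch of a seeded game-specification sampler. Every field name, value range, and option label below is a hypothetical placeholder; the paper's actual generator and its parameters are not described in this summary.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class GameSpec:
    """Hypothetical specification of one two-player card game."""
    seed: int
    deck_size: int
    hand_size: int
    ante: int
    betting_rounds: int
    post_betting_event: str

def sample_game(seed: int) -> GameSpec:
    """Deterministically map a seed to a game specification."""
    rng = random.Random(seed)
    deck_size = rng.randint(3, 12)
    # Both players must be able to receive a full hand from the deck.
    hand_size = rng.randint(1, min(3, deck_size // 2))
    return GameSpec(
        seed=seed,
        deck_size=deck_size,
        hand_size=hand_size,
        ante=rng.randint(1, 3),
        betting_rounds=rng.randint(1, 3),
        post_betting_event=rng.choice(["none", "hand_swap", "peek"]),
    )

# A new evaluation run can simply use seeds that were never published.
fresh_games = [sample_game(seed) for seed in range(100)]
```

Keying each game to a seed keeps a run reproducible for auditing while still letting evaluators move to unseen seeds whenever contamination is a concern.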

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same generation-plus-profile approach could be applied to multi-agent or non-zero-sum settings to test whether the six axes remain sufficient.
Real-world logs from LLM-mediated auctions could serve as a direct test of whether the generated distribution predicts observed behavior.
Jaggedness may correlate with specific failure modes in high-stakes decisions, offering a practical filter before deployment.

Load-bearing premise

The distribution of procedurally generated card games is representative enough of the strategic environments that LLMs actually face in deployments such as marketplaces and auctions.

What would settle it

A follow-up experiment that applies the same profile and jaggedness analysis to a separate collection of real auction or marketplace traces and finds that all models with close overall scores produce identical profiles and zero jaggedness.

Figures

Figures reproduced from arXiv: 2605.23238 by Alex Kenich, Anany Kotawala, Kia Ghods, Vartan Shadarevian.

**Figure 1.** The 50 selected games in a scatter plot with state space (log10 info-states) and information sensitivity as axes (r = 0.65). A scatter covering the full 2,000-game pool against the selected 50 is reported in Appendix E. Plot annotation: Seed 10173: 3 cards, 1-card hands, 1 position, hand swap post-betting peek auction (closest to Kuhn…

**Figure 2.** Per-axis distributions of the 50 benchmark games. Kernel-density estimates with tick-marked observations and tertile cuts (dashed). FPS achieves broad per-axis coverage on all six axes rather than concentrating on the two diagonal axes in …

**Figure 3.** Capability profile. Per-model OLS slopes $\hat{\beta}_{m,a}$ of per-game strength $\hat{\alpha}_{m,g}$ on the six z-scored axes, fit with a per-model intercept on the 50 benchmark games. An outward slope indicates that the model’s chip lead over the across-model mean grows with that axis, while an inward slope indicates that the lead shrinks. Units are chips/game per σ of axis. Slopes sum to zero across the nine models on every axis b…

**Figure 4.** Local jaggedness $J_m$, per model. Bars sorted ascending, with horizontal whiskers showing bias-corrected paired-cluster bootstrap 95% confidence intervals (B = 500). Higher $J_m$ means the stakes-normalized per-game performance surface swings more between axis-space-similar games. Interpreting $J_m$: llama-3.3-70b-together is the most locally jagged model, with a central $J_m$ of 0.152 and a 95% confidence interval t…

**Figure 5.** Pairwise Pearson correlation of the six complexity axes across the 50 benchmark games. The strongest positive pair is state space and information sensitivity, with Pearson r = 0.65, followed by state space and temporal depth at r = 0.57 and information sensitivity and opponent modeling at r = 0.50. Risk and brittleness are nearly independent of the remaining axes. In both cases the absolute correlation wit…
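
A short sketch of how a pairwise Pearson correlation matrix like the one in Figure 5 could be computed and queried for its strongest positive pair. The data below are synthetic and deliberately constructed so that two axes are strongly coupled; they do not reproduce the paper's values, and the helper names are assumptions.

```python
import numpy as np

AXES = ["state_space", "temporal_depth", "information_sensitivity",
        "opponent_modeling", "risk", "brittleness"]

def axis_correlations(axes):
    """Pearson correlation matrix between axis columns, shape (n_axes, n_axes)."""
    return np.corrcoef(axes, rowvar=False)

def strongest_pair(corr, names):
    """Return the most positively correlated pair of distinct axes."""
    rows, cols = np.triu_indices(len(names), k=1)  # upper triangle, no diagonal
    best = int(np.argmax(corr[rows, cols]))
    return names[rows[best]], names[cols[best]], float(corr[rows[best], cols[best]])

# Synthetic axis scores for 50 games; information sensitivity is made to track
# state space so that the pair stands out (illustrative only).
rng = np.random.default_rng(0)
axes = rng.normal(size=(50, len(AXES)))
axes[:, 2] = axes[:, 0] + 0.3 * rng.normal(size=50)

corr = axis_correlations(axes)
pair = strongest_pair(corr, AXES)
```

Checking the off-diagonal entries this way is a quick diagnostic for how far the six axes can be interpreted independently of one another.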

**Figure 6.** The 2,000-game accepted pool vs. the 50 FPS-selected benchmark games in two diagnostic …

**Figure 7.** Capability profile in absolute units. Predicted per-game strength $\hat{\alpha}_{m,g^{*}_{a}}$ from the same multivariate OLS used for …

**Figure 8.** Reversal significance in axis space. The 50 benchmark games plotted in the state-space versus information-sensitivity plane (matching …
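
Figures 2 and 6 describe the 50 benchmark games as FPS-selected from the 2,000-game pool, but this summary does not expand the acronym. Assuming it denotes farthest-point sampling in the six-axis space, the selection step could be sketched as follows. The function names, the starting index, and the synthetic pool are assumptions for illustration.

```python
import numpy as np

def farthest_point_sample(points, n_select, start=0):
    """Greedy farthest-point sampling: repeatedly add the point that is
    farthest from everything selected so far.

    points: (n_points, n_dims) array, assumed already z-scored per axis.
    Returns the list of selected row indices.
    """
    selected = [start]
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(n_select - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        # Track each point's distance to its closest selected point.
        min_dist = np.minimum(min_dist, np.linalg.norm(points - points[nxt], axis=1))
    return selected

def min_pairwise_distance(points):
    """Smallest distance between any two distinct rows."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    return float(dist.min())

# Synthetic pool: 2,000 games described by 6 axis scores (hypothetical values).
rng = np.random.default_rng(0)
pool = rng.normal(size=(2000, 6))
chosen = farthest_point_sample(pool, 50)
```

Compared with taking an arbitrary 50 games, this greedy rule spreads the selection across the axis space, which matches the broad per-axis coverage that the Figure 2 caption attributes to FPS.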

Original abstract

Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings. Anticipating their behavior in any specific deployment is hard. Existing strategic-reasoning benchmarks evaluate models on fixed canonical games. These benchmarks may saturate as the frontier improves, and they do not allow evaluators to generalize with confidence from benchmark performance to the varied and messy strategic environments that actual deployments involve. We introduce GENSTRAT, which uses procedurally generated strategic environments to address these challenges. Concretely, we generate a distribution of two-player zero-sum imperfect-information card games. The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination. We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness). We also introduce a jaggedness measure of within-distribution smoothness that detects when a model's advantage jumps unpredictably between strategically similar games. We sample 50 benchmark games from a 2,000-game generated pool and evaluate nine frontier and open-weight LLMs in a head-to-head tournament with over 36,000 matches. Newer frontier-tier models score higher on average. Beyond that average, models with near-identical overall strength show qualitatively different capability profiles, and two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength. Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide.
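
As a rough sanity check on the scale reported in the abstract, the arithmetic below works out how many matches each model pairing would play per game. It assumes a full round robin among the nine models with matches spread evenly over pairings and games; the abstract states neither assumption, and "over 36,000" means the true figure is at least this large.

```python
from math import comb

n_models = 9
n_games = 50
total_matches = 36_000  # lower bound: the abstract says "over 36,000"

pairings = comb(n_models, 2)                     # 36 distinct model pairs
cells = pairings * n_games                       # 1800 (pair, game) combinations
matches_per_cell = total_matches // cells        # 20 matches per pair per game

print(pairings, cells, matches_per_cell)         # prints: 36 1800 20
```

Under these assumptions, each pair of models would meet roughly 20 times on every benchmark game, which gives a sense of how much repetition underlies each per-game strength estimate.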

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces GENSTRAT, which generates a distribution of two-player zero-sum imperfect-information card games to evaluate LLM strategic reasoning in an evergreen, contamination-resistant manner. It pairs this with a six-axis capability profile (state space, temporal depth, information sensitivity, opponent modeling, risk, brittleness) and a jaggedness measure of local volatility within the distribution. A tournament of nine frontier and open-weight models across over 36,000 matches on 50 sampled games shows newer models scoring higher on average, but models with near-identical overall strength exhibiting qualitatively different profiles and differing local volatility (e.g., gpt-5 and claude more volatile than gemini-3.1-pro). The central claim is that these metrics supply a deployment-relevant diagnostic beyond aggregate rankings.

Significance. If the generated game distribution proves representative and the axes capture transferable distinctions, the framework would enable nuanced, non-saturating evaluation of strategic competence relevant to LLM deployment as economic agents. The reported empirical differences in profiles and jaggedness among top models illustrate the value of moving beyond single-score leaderboards. The approach's procedural generation and scale (36k matches) are strengths, but significance depends on addressing external validity.

major comments (2)

[Abstract] The claim that 'the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide' is not supported by any evidence in the manuscript. All reported results are internal to the fixed sample of 50 games from the generated pool; there are no comparisons of model behavior on GENSTRAT versus any marketplace, auction, or bidding task, nor any mapping of game parameters to real deployment features.
[Abstract] No information is provided on how the six axes were selected, validated, or operationalized, nor on the criteria used to choose game parameters or test the generator against external strategic environments. This leaves the decomposition of competence and the representativeness of the distribution ungrounded.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We agree that stronger grounding is needed for the deployment-relevance claim and for the selection of the capability axes. We will revise the manuscript accordingly and respond to each point below.

Point-by-point responses

Referee: [Abstract] The claim that 'the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide' is not supported by any evidence in the manuscript. All reported results are internal to the fixed sample of 50 games from the generated pool; there are no comparisons of model behavior on GENSTRAT versus any marketplace, auction, or bidding task, nor any mapping of game parameters to real deployment features.

Authors: We agree that the manuscript contains no direct empirical comparisons between GENSTRAT performance and behavior in external tasks such as auctions or marketplaces, nor any explicit mapping of game parameters to deployment features. The abstract statement is therefore unsupported as currently phrased. In revision we will qualify the claim (e.g., replace with “may supply a deployment-relevant diagnostic”) and add an explicit limitations paragraph noting the absence of external validation while outlining planned follow-up work to test transfer.
Revision: yes

Referee: [Abstract] No information is provided on how the six axes were selected, validated, or operationalized, nor on the criteria used to choose game parameters or test the generator against external strategic environments. This leaves the decomposition of competence and the representativeness of the distribution ungrounded.

Authors: The six axes are motivated by core dimensions in game-theoretic treatments of strategic reasoning, yet the manuscript does not document their selection process, operationalization details, or the sampling criteria for the 2,000-game pool. We will add a dedicated methods subsection that (a) cites the relevant literature for each axis, (b) describes the generator parameters used to instantiate them (e.g., deck size and branching factor for state space; revelation schedule for information sensitivity), and (c) states the diversity criteria applied during sampling. We will also acknowledge that no external-environment validation was performed and treat this as a limitation.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical tournament results on generated games

The paper reports empirical outcomes from head-to-head tournaments on a fixed sample of 50 procedurally generated games drawn from a 2,000-game pool, with over 36,000 matches across nine LLMs. Capability profiles are decomposed across six axes and jaggedness is measured from within-distribution performance variation; neither is shown to reduce by any equation or definition to fitted parameters or prior self-citations. The deployment-relevance claim is an interpretive assertion about the game distribution, not a mathematical derivation that collapses to its inputs. No load-bearing self-citation chains, self-definitional constructs, or fitted-input predictions appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on the untested premise that card-game distributions capture deployment-relevant strategic structure and that the six named axes are the right decomposition; no free parameters or invented entities are quantified in the abstract.

[1] [1]

Project Vend: Can Claude run a small shop? (And why does that matter?)

Anthropic. Project Vend: Can Claude run a small shop? (And why does that matter?). Anthropic Research, 2025

work page 2025

[2] [2]

Project Deal: our Claude-run marketplace experiment

Anthropic. Project Deal: our Claude-run marketplace experiment. Anthropic, 2026

work page 2026

[3] [3]

Algorithmic Collusion by Large Language Models,

Sara Fish, Yannai A Gonczarowski, and Ran I Shorrer. Algorithmic Collusion by Large Language Models,

work page

[4] [4]

arXiv:2404.00806v5, revised 2026

work page arXiv 2026

[5] [5]

Suspicion Agent: Playing Imperfect Information Games with Theory of Mind Aware GPT-4

Jiaxian Guo, Bo Yang, Paul Yoo, Bill Yuchen Lin, Yusuke Iwasawa, and Yutaka Matsuo. Suspicion Agent: Playing Imperfect Information Games with Theory of Mind Aware GPT-4. InFirst Conference on Language Modeling (COLM), 2024

work page 2024

[6] [6]

PokerGPT: An End-to-End Lightweight Solver for Multi-Player Texas Hold’em via Large Language Model, 2024

Chenghao Huang, Yanbo Cao, Yinlong Wen, Tao Zhou, and Yanru Zhang. PokerGPT: An End-to-End Lightweight Solver for Multi-Player Texas Hold’em via Large Language Model, 2024. arXiv:2401.06781. 13

work page arXiv 2024

[7] [7]

AvalonBench: Evaluating LLMs Playing the Game of Avalon, 2023

Jonathan Light, Min Cai, Sheng Shen, and Ziniu Hu. AvalonBench: Evaluating LLMs Playing the Game of Avalon, 2023. arXiv:2310.05036v3

work page arXiv 2023

[8] [8]

Miller, Sasha Mitts, Adithya Renduchintala, Stephen Roller, Dirk Rowe, Weiyan Shi, Joe Spisak, Alexander Wei, David Wu, Hugh Zhang, and Markus Zijlstra

Meta Fundamental AI Research Diplomacy Team (FAIR), Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath, Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sasha Mitts, Adithya Renduchintala, Stephen Roller, Dirk Rowe, Weiyan...

work page 2022

[9] [9]

GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, and Kaidi Xu. GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations. InAdvances in Neural Information Processing Systems (NeurIPS), volume 37, pages 28219–28253, 2024

work page 2024

[10] [10]

Gamebench: Evaluating strategic reasoning abilities of llm agents.arXiv preprint arXiv:2406.06613, 2024

Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, Joshua Clymer, and Arjun Yadav. GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents, 2024. arXiv:2406.06613v2

work page arXiv 2024

[11] [11]

Leveraging Procedural Generation to Benchmark Reinforcement Learning

Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. Leveraging Procedural Generation to Benchmark Reinforcement Learning. InInternational Conference on Machine Learning (ICML), volume 119 ofProceedings of Machine Learning Research, pages 2048–2056. PMLR, 2020

work page 2048

[12] [12]

Minigrid & Miniworld: Modular & Customizable Reinforcement Learning Environments for Goal-Oriented Tasks

Maxime Chevalier-Boisvert, Bolun Dai, Mark Towers, Rodrigo Perez-Vicente, Lucas Willems, Salem Lahlou, Suman Pal, Pablo Samuel Castro, and J K Terry. Minigrid & Miniworld: Modular & Customizable Reinforcement Learning Environments for Goal-Oriented Tasks. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, volume 3...

work page 2023

[13] [13]

A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018

work page 2018

[14] [14]

Superhuman AI for heads-up no-limit poker: Libratus beats top professionals.Science, 359(6374):418–424, 2018

Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals.Science, 359(6374):418–424, 2018

work page 2018

[15] [15]

Superhuman AI for multiplayer poker.Science, 365(6456):885–890, 2019

Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker.Science, 365(6456):885–890, 2019

work page 2019

[16] [16]

Connor, Neil Burch, Thomas Anthony, Stephen McAleer, Romuald Elie, Sarah H

Julien Perolat, Bart De Vylder, Daniel Hennes, Eugene Tarassov, Florian Strub, Vincent de Boer, Paul Muller, Jerome T. Connor, Neil Burch, Thomas Anthony, Stephen McAleer, Romuald Elie, Sarah H. Cen, Zhe Wang, Audrunas Gruslys, Aleksandra Malysheva, Mina Khan, Sherjil Ozair, Finbarr Timbers, Toby Pohlen, Tom Eccles, Mark Rowland, Marc Lanctot, Jean-Baptis...

work page 2022

[17] Elif Akata, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz. Playing repeated games with large language models. Nature Human Behaviour, 9(7):1380–1390, 2025.

[18] Nunzio Lorè and Babak Heydari. Strategic behavior of large language models and the role of game structure versus contextual framing. Scientific Reports, 14(1):18490, 2024.

[19] Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Prafull Sharma, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, and Thomas L. Griffiths. Evaluating Language Models’ Evaluations of Games. In International Conference on Learning Representations (ICLR), 2026.

[20] Minhua Lin, Enyan Dai, Hui Liu, Xianfeng Tang, Yuliang Yan, Zhenwei Dai, Jingying Zeng, Zhiwei Zhang, Fali Wang, Hongcheng Gao, Chen Luo, Xiang Zhang, Qi He, and Suhang Wang. How Far Are LLMs from Professional Poker Players? Revisiting Game-Theoretic Reasoning with Agentic Tool Use. In International Conference on Learning Representations (ICLR), 2026.

[21] James W. A. Strachan, Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti, Saurabh Gupta, Krati Saxena, Alessandro Rufo, Stefano Panzeri, Guido Manzi, Michael S. A. Graziano, and Cristina Becchio. Testing theory of mind in large language models and humans. Nature Human Behaviour, 8(7):1285–1295, 2024.

[22] Tomer Ullman. Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks, 2023. arXiv:2302.08399v5.

[23] Hsieh-Ting Lin and Tsung-Yu Hou. Readable Minds: Emergent Theory-of-Mind-Like Behavior in LLM Poker Agents, 2026. arXiv:2604.04157.

[24] Vivek Verma, David Huang, William Chen, Dan Klein, and Nicholas Tomlin. Measuring General Intelligence with Generated Games, 2025. arXiv:2505.07215.

[25] Keyon Vafa, Ashesh Rambachan, and Sendhil Mullainathan. Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function. In International Conference on Machine Learning (ICML), volume 235 of Proceedings of Machine Learning Research, pages 48919–48937. PMLR, 2024.

[26] Keyon Vafa, Justin Y. Chen, Ashesh Rambachan, Jon Kleinberg, and Sendhil Mullainathan. Evaluating the World Model Implicit in a Generative Model. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, pages 26941–26975, 2024.

[27] Keyon Vafa, Peter G. Chang, Ashesh Rambachan, and Sendhil Mullainathan. What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models. In International Conference on Machine Learning (ICML), 2025.

[28] Noor Shaker, Julian Togelius, and Mark J Nelson. Procedural Content Generation in Games. Springer, 2016.

[29] Harold W. Kuhn. A simplified two-person poker. In H. W. Kuhn and A. W. Tucker, editors, Contributions to the Theory of Games, Vol. I, number 24 in Annals of Mathematics Studies, pages 97–103. Princeton University Press, 1950.

[30] Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes’ Bluff: Opponent Modelling in Poker. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI), pages 550–558, 2005.

[31] I. M. Sobol’. On the distribution of points in a cube and the approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics, 7(4):86–112, 1967.

[32] Ralph Allan Bradley and Milton E. Terry. Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, 39(3/4):324–345, 1952.

[33] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems (NeurIPS), volume 20, pages 1729–1736, 2007.

[34] Oskari Tammelin. Solving Large Imperfect Information Games Using CFR+, 2014. arXiv:1407.5042.

[35] Yoav Benjamini and Yosef Hochberg. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300, 1995.

A Modular game design

Each GBG is assembled from modular components composed by the parameterized GBG builder:

• Core layer. GameState (play...

1. **Setup**: shuffle deck, deal cards
2. **Ante Phase**: each player antes into the pot
3. **Betting Phase**: players may bet, check, fold, or call
4. **Post-Betting Phase**
5. **Showdown**: reveal hands and settle the pot

## Phase Details

### Setup
- Shuffle the Deck.
- Deal 1 card face-down to each player (Alice, Bob) (or as many as remain if the deck is short). Each player sees only their own cards.

### Ante Phase
- Alice antes 2 chips (moved from Alice's chip stack into the pot).
- Bob antes 2 chips (moved from Bob's chip st...