GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models
Pith reviewed 2026-05-25 04:39 UTC · model grok-4.3
The pith
GENSTRAT generates fresh card games to expose distinct strategic profiles and local volatility among LLMs with near-identical overall scores.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GENSTRAT draws from a distribution of two-player zero-sum imperfect-information card games that can be generated fresh for each evaluation run. It pairs this distribution with a capability-profile method that measures competence on state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness, plus a jaggedness measure of within-distribution smoothness. When nine frontier and open-weight models compete in a head-to-head tournament on 50 sampled games, newer models achieve higher average scores, but models with near-identical overall strength display qualitatively different profiles, and two of the top three models prove more locally volatile than the third.
What carries the argument
The procedurally generated distribution of two-player zero-sum imperfect-information card games, combined with the six-axis capability profile and the jaggedness measure of local volatility.
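This summary does not reproduce the paper's formula for jaggedness. As a rough illustration of the idea (how much a model's advantage jumps between strategically similar games), the sketch below assumes each game has a feature vector and a scalar advantage, and averages the absolute advantage gap between each game and its k nearest neighbours. The function name `jaggedness`, the Euclidean distance, the choice of k, and the toy data are all assumptions, not the authors' definition.

```python
import math

def jaggedness(features, advantage, k=2):
    """Mean absolute advantage gap between each game and its k nearest
    neighbours in feature space (hypothetical definition, for illustration)."""
    n = len(features)
    if n != len(advantage) or n <= k:
        raise ValueError("need more than k games with matching advantages")
    total = 0.0
    for i in range(n):
        # Sort the other games by Euclidean distance to game i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(features[i], features[j]),
        )
        # Average the advantage jump to the k closest games.
        total += sum(abs(advantage[i] - advantage[j]) for j in others[:k]) / k
    return total / n

# Toy data (made up): four games on a line in a 1-D feature space.
games = [(0.0,), (1.0,), (2.0,), (3.0,)]
smooth = [0.10, 0.12, 0.14, 0.16]   # advantage drifts gently
jagged = [0.10, 0.60, 0.05, 0.55]   # advantage jumps between neighbours

print(jaggedness(games, smooth, k=1))
print(jaggedness(games, jagged, k=1))
```

Two models with the same mean advantage can score very differently under a measure of this kind, which is the property the jaggedness axis is meant to expose.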
If this is right
- Evaluators can draw new games indefinitely, keeping benchmarks fresh and resistant to contamination.
- Deployment choices can prioritize specific capability gaps or low volatility rather than overall ranking alone.
- Models that appear equivalent on aggregate scores can be distinguished by their smoothness across strategically similar situations.
- Benchmark saturation is avoided because the generator produces new instances rather than reusing fixed canonical games.
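To make the distinction between an aggregate score and a capability profile concrete, the sketch below compares two hypothetical models whose mean score across the six axes is identical but whose per-axis scores differ. All numbers are invented for illustration and are not results from the paper; the 0-to-1 scale and the helper names are likewise assumptions.

```python
AXES = ["state space", "temporal depth", "information sensitivity",
        "opponent modeling", "risk", "brittleness"]

# Invented per-axis scores on an assumed 0-1 scale (not from the paper).
model_a = dict(zip(AXES, [0.70, 0.50, 0.60, 0.40, 0.65, 0.75]))
model_b = dict(zip(AXES, [0.50, 0.70, 0.60, 0.80, 0.55, 0.45]))

def mean_score(profile):
    """Aggregate score: the plain average over all axes."""
    return sum(profile.values()) / len(profile)

def largest_gaps(p, q, top=2):
    """Axes on which two profiles differ most, largest gap first."""
    gaps = {axis: p[axis] - q[axis] for axis in p}
    return sorted(gaps.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]

print(round(mean_score(model_a), 3), round(mean_score(model_b), 3))
print(largest_gaps(model_a, model_b))
```

A single-score ranking would treat these two toy models as tied, while the per-axis gaps point to where a deployment might prefer one over the other.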
Where Pith is reading between the lines
- The same generation-plus-profile approach could be applied to multi-agent or non-zero-sum settings to test whether the six axes remain sufficient.
- Real-world logs from LLM-mediated auctions could serve as a direct test of whether the generated distribution predicts observed behavior.
- Jaggedness may correlate with specific failure modes in high-stakes decisions, offering a practical filter before deployment.
Load-bearing premise
The distribution of procedurally generated card games is representative enough of the strategic environments that LLMs actually face in deployments such as marketplaces and auctions.
What would settle it
A follow-up experiment that applies the same profile and jaggedness analysis to a separate collection of real auction or marketplace traces and finds that all models with close overall scores produce identical profiles and zero jaggedness.
Original abstract
Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings. Anticipating their behavior in any specific deployment is hard. Existing strategic-reasoning benchmarks evaluate models on fixed canonical games. These benchmarks may saturate as the frontier improves, and they do not allow evaluators to generalize with confidence from benchmark performance to the varied and messy strategic environments that actual deployments involve. We introduce GENSTRAT, which uses procedurally generated strategic environments to address these challenges. Concretely, we generate a distribution of two-player zero-sum imperfect-information card games. The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination. We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness). We also introduce a jaggedness measure of within-distribution smoothness that detects when a model's advantage jumps unpredictably between strategically similar games. We sample 50 benchmark games from a 2,000-game generated pool and evaluate nine frontier and open-weight LLMs in a head-to-head tournament with over 36,000 matches. Newer frontier-tier models score higher on average. Beyond that average, models with near-identical overall strength show qualitatively different capability profiles, and two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength. Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide.
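As a back-of-the-envelope check on the tournament size, the arithmetic below assumes a full round-robin in which every pair of the nine models meets on every one of the 50 games, with matches spread evenly. The abstract states only "over 36,000 matches", so the per-pairing figure is an inference under those assumptions, not a number reported by the authors.

```python
from math import comb

n_models = 9
n_games = 50
total_matches = 36_000  # lower bound quoted in the abstract ("over 36,000")

pairs = comb(n_models, 2)          # unordered model pairings
cells = pairs * n_games            # (pairing, game) combinations
per_cell = total_matches / cells   # matches per pairing per game, if spread evenly

print(pairs, cells, per_cell)
```

Under those assumptions, 36 pairings on 50 games give 1,800 pairing-game cells, so each cell would hold at least about 20 matches.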
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GENSTRAT, which generates a distribution of two-player zero-sum imperfect-information card games to evaluate LLM strategic reasoning in an evergreen, contamination-resistant manner. It pairs this with a six-axis capability profile (state space, temporal depth, information sensitivity, opponent modeling, risk, brittleness) and a jaggedness measure of local volatility within the distribution. A tournament of nine frontier and open-weight models across over 36,000 matches on 50 sampled games shows newer models scoring higher on average, but models with near-identical overall strength exhibiting qualitatively different profiles and differing local volatility (e.g., gpt-5 and claude more volatile than gemini-3.1-pro). The central claim is that these metrics supply a deployment-relevant diagnostic beyond aggregate rankings.
Significance. If the generated game distribution proves representative and the axes capture transferable distinctions, the framework would enable nuanced, non-saturating evaluation of strategic competence relevant to LLM deployment as economic agents. The reported empirical differences in profiles and jaggedness among top models illustrate the value of moving beyond single-score leaderboards. The approach's procedural generation and scale (36k matches) are strengths, but significance depends on addressing external validity.
major comments (2)
- [Abstract] The claim that 'the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide' is not supported by any evidence in the manuscript. All reported results are internal to the fixed sample of 50 games from the generated pool; there are no comparisons of model behavior on GENSTRAT versus any marketplace, auction, or bidding task, nor any mapping of game parameters to real deployment features.
- [Abstract] No information is provided on how the six axes were selected, validated, or operationalized, nor on the criteria used to choose game parameters or test the generator against external strategic environments. This leaves the decomposition of competence and the representativeness of the distribution ungrounded.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract. We agree that stronger grounding is needed for the deployment-relevance claim and for the selection of the capability axes. We will revise the manuscript accordingly and respond to each point below.
Point-by-point responses
- Referee: [Abstract] The claim that 'the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide' is not supported by any evidence in the manuscript. All reported results are internal to the fixed sample of 50 games from the generated pool; there are no comparisons of model behavior on GENSTRAT versus any marketplace, auction, or bidding task, nor any mapping of game parameters to real deployment features.
Authors: We agree that the manuscript contains no direct empirical comparisons between GENSTRAT performance and behavior in external tasks such as auctions or marketplaces, nor any explicit mapping of game parameters to deployment features. The abstract statement is therefore unsupported as currently phrased. In revision we will qualify the claim (e.g., replace with “may supply a deployment-relevant diagnostic”) and add an explicit limitations paragraph noting the absence of external validation while outlining planned follow-up work to test transfer.
Revision: yes
- Referee: [Abstract] No information is provided on how the six axes were selected, validated, or operationalized, nor on the criteria used to choose game parameters or test the generator against external strategic environments. This leaves the decomposition of competence and the representativeness of the distribution ungrounded.
Authors: The six axes are motivated by core dimensions in game-theoretic treatments of strategic reasoning, yet the manuscript does not document their selection process, operationalization details, or the sampling criteria for the 2,000-game pool. We will add a dedicated methods subsection that (a) cites the relevant literature for each axis, (b) describes the generator parameters used to instantiate them (e.g., deck size and branching factor for state space; revelation schedule for information sensitivity), and (c) states the diversity criteria applied during sampling. We will also acknowledge that no external-environment validation was performed and treat this as a limitation.
Revision: yes
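The rebuttal names deck size, branching factor, and revelation schedule as examples of generator parameters. A minimal sketch of how a seeded generator might draw such specifications on demand is shown below. The `GameSpec` class, its field names, and the value ranges are hypothetical; the actual GENSTRAT builder is not described in this summary. Only the pool size (2,000) and benchmark size (50) come from the text.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class GameSpec:
    """Hypothetical parameter bundle for one generated card game."""
    deck_size: int          # assumed knob for state-space size
    max_actions: int        # assumed knob for branching factor
    reveal_rounds: tuple    # assumed revelation schedule: rounds after which a card is shown

def draw_game(rng: random.Random) -> GameSpec:
    """Draw one game specification from illustrative (made-up) ranges."""
    deck_size = rng.randint(3, 52)
    max_actions = rng.randint(2, 6)
    n_rounds = rng.randint(1, 4)
    # Reveal a card after a random subset of rounds, kept in order.
    reveal = tuple(r for r in range(1, n_rounds + 1) if rng.random() < 0.5)
    return GameSpec(deck_size, max_actions, reveal)

def draw_pool(seed: int, size: int) -> list:
    """A new seed yields a fresh pool; the same seed reproduces a pool exactly."""
    rng = random.Random(seed)
    return [draw_game(rng) for _ in range(size)]

pool = draw_pool(seed=0, size=2000)              # pool size taken from the text
benchmark = random.Random(1).sample(pool, 50)    # benchmark size taken from the text
print(len(pool), len(benchmark), benchmark[0])
```

Seeding the generator keeps a given evaluation reproducible, while changing the seed gives the fresh draw that the contamination-resistance argument relies on.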
Circularity Check
No circularity: empirical tournament results on generated games
Full rationale
The paper reports empirical outcomes from head-to-head tournaments on a fixed sample of 50 procedurally generated games drawn from a 2000-game pool, with over 36,000 matches across nine LLMs. Capability profiles are decomposed across six axes and jaggedness is measured from within-distribution performance variation; neither is shown to reduce by any equation or definition to fitted parameters or prior self-citations. The deployment-relevance claim is an interpretive assertion about the game distribution, not a mathematical derivation that collapses to its inputs. No load-bearing self-citation chains, self-definitional constructs, or fitted-input predictions appear in the provided text.
A. Modular game design
Each GBG is assembled from modular components composed by the parameterized GBG builder:
• Core layer. GameState (play...

1. **Setup**: shuffle deck, deal cards
2. **Ante Phase**: each player antes into the pot
3. **Betting Phase**: players may bet, check, fold, or call
4. **Post-Betting Phase**
5. **Showdown**: reveal hands and settle the pot

## Phase Details

### Setup
- Shuffle the Deck.
- Deal 1 card face-down to each player (Alice, Bob) (or as many as remain if the deck is short). Each player sees only their own cards.

### Ante Phase
- Alice antes 2 chips (moved from Alice's chip stack into the pot).
- Bob antes 2 chips (moved from Bob's chip st...
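The setup and ante steps listed above are concrete enough to sketch. The snippet below shuffles a deck, deals one face-down card to each of Alice and Bob (or fewer if the deck runs short), and moves a 2-chip ante from each stack into the pot. The three-card deck, the starting stack of 20 chips, and all function names are assumptions; the excerpt is truncated before it specifies them.

```python
import random

def setup_and_ante(deck, stacks, ante=2, cards_each=1, seed=None):
    """Run the Setup and Ante phases; returns (hands, stacks, pot, remaining deck)."""
    rng = random.Random(seed)
    deck = list(deck)
    rng.shuffle(deck)                      # Setup: shuffle the deck
    hands = {}
    for player in stacks:
        # Deal face-down; slicing deals as many as remain if the deck is short.
        hands[player], deck = deck[:cards_each], deck[cards_each:]
    stacks = dict(stacks)                  # copy so the caller's dict is untouched
    pot = 0
    for player in stacks:
        if stacks[player] < ante:
            raise ValueError(f"{player} cannot cover the ante")
        stacks[player] -= ante             # Ante: chips leave the player's stack...
        pot += ante                        # ...and enter the pot
    return hands, stacks, pot, deck

hands, stacks, pot, rest = setup_and_ante(
    deck=["J", "Q", "K"],                  # assumed three-card deck
    stacks={"Alice": 20, "Bob": 20},       # assumed starting stacks
    seed=7,
)
print(hands, stacks, pot, rest)
```

Each player's hand is kept in a separate entry so that a later betting phase could show a player only their own cards, matching the imperfect-information setting.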