pith. machine review for the scientific record.

arxiv: 2605.10410 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: 2 Lean theorem links

Equilibrium Residuals Expose Three Regimes of Matrix-Game Strategic Reasoning in Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:47 UTC · model grok-4.3

classification 💻 cs.LG
keywords large language models · matrix games · Nash equilibrium · exploitability · strategic reasoning · residual training · procedural evaluation

The pith

Language models learn approximate Nash computation for matrix games through residual training but hit an output formatting bottleneck on larger instances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models recognize familiar game-theory benchmarks yet fail on anonymous zero-sum payoff matrices, with success dropping to 34% on 2×2, 18% on 3×3, and 2% on 5×5 versions. Supervised fine-tuning on 2×2 and 3×3 games raises performance on unseen 5×5 to 7×7 matrices to 61%, while exploitability-reward training reaches 37% on average. The paper proves the exploitability residual is 2-Lipschitz continuous in payoff perturbations, in contrast to the discontinuous behavior of exact linear-programming equilibrium selectors. A dominated-action padding test shows trained models solve embedded 3×3 games inside larger matrices while controls fail, isolating the output interface as the remaining constraint rather than a deficit in strategic reasoning itself.

Core claim

Procedurally generated anonymous matrix games separate three regimes of LLM behavior: semantic recall of named games, acquisition of approximate Nash computation via training on small instances, and a persistent output-formatting bottleneck that prevents reliable scaling. Supervised fine-tuning transfers from 2×2 and 3×3 to larger sizes, and the 2-Lipschitz property of the exploitability residual accounts for stable generalization under payoff shifts where exact solvers would break.

What carries the argument

The exploitability residual: the maximum combined gain the players could obtain by best-responding to the model's output strategies. Because this quantity varies smoothly with the payoffs, it supplies a continuous training signal for approximate equilibrium finding.
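As a concrete sketch (illustrative code, not the paper's implementation), the residual for a candidate strategy pair in a zero-sum game can be computed directly:

```python
import numpy as np

def exploitability(A, x, y):
    """Exploitability residual of strategies (x, y) in the zero-sum
    game with row-player payoff matrix A. It is the combined gain
    available to both players from best-responding, and it equals
    zero exactly at a Nash equilibrium."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    value = x @ A @ y
    row_gain = np.max(A @ y) - value   # row player's best-response gain
    col_gain = value - np.min(x @ A)   # column player's best-response gain
    return row_gain + col_gain

# Matching pennies: uniform play is the unique equilibrium.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(exploitability(A, [0.5, 0.5], [0.5, 0.5]))  # 0.0
print(exploitability(A, [1.0, 0.0], [0.5, 0.5]))  # 1.0: pure play is exploitable
```

Unlike an exact equilibrium certificate, this residual is defined for any strategy pair, which is what lets it serve as a reward.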

If this is right

  • Residual training transfers across game sizes and payoff shifts because the exploitability signal remains continuous.
  • Procedural anonymous evaluation is required to measure genuine strategic reasoning instead of semantic recall.
  • The output formatting step, not internal computation, is the primary limit on handling larger matrices.
  • Models acquire generalizable approximate equilibrium finding that applies to embedded subgames within bigger instances.
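The embedded-subgame setup in the last bullet can be sketched as follows. This is an assumed generator matching the figure's description (a 3×3 core surrounded by iteratively dominated actions), not the paper's exact pipeline; `pad_with_dominated` is a hypothetical helper name.

```python
import numpy as np

def pad_with_dominated(core, N, rng):
    """Embed a k×k zero-sum core in an N×N matrix whose extra actions
    are iteratively dominated: padded rows are strictly dominated for
    the row player outright, and once they are eliminated the padded
    columns are strictly dominated for the (minimizing) column player.
    Iterated elimination therefore recovers the core game."""
    k = core.shape[0]
    lo, hi = core.min(), core.max()
    A = np.empty((N, N))
    A[:k, :k] = core
    # Padded columns pay the row player more than any core entry.
    A[:k, k:] = hi + 1 + rng.random((k, N - k))
    # Padded rows pay the row player less than any core entry.
    A[k:, :] = lo - 1 - rng.random((N - k, N))
    return A

rng = np.random.default_rng(0)
core = np.array([[0.0, 1.0, -1.0], [-1.0, 0.0, 1.0], [1.0, -1.0, 0.0]])  # RPS
A = pad_with_dominated(core, 8, rng)
assert np.all(A[0] > A[3])           # padded row strictly dominated by a core row
assert np.all(A[:3, 0] < A[:3, 4])   # after row elimination, padded column dominated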

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Structured output mechanisms could bypass the formatting bottleneck and unlock higher performance on large games.
  • Different regimes call for different fixes: prompting for recall, residual objectives for computation, and interface engineering for deployment.
  • The continuity advantage may extend to other domains where exact solutions are brittle but approximate residuals are robust.

Load-bearing premise

Performance gains after training on 2×2 and 3×3 games reflect acquisition of approximate Nash computation rather than overfitting to matrix size, formatting conventions, or residual-specific artifacts.

What would settle it

If models trained on small games fail to solve 3x3 subgames embedded via dominated-action padding inside larger matrices at rates clearly above random controls, or if small payoff perturbations cause large discontinuous drops in residual-trained output quality.
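The perturbation half of that test can be probed numerically. A minimal sketch, assuming the residual is evaluated at fixed output strategies: for any payoff perturbation D, the residual should move by at most 2·max|D|, consistent with the claimed 2-Lipschitz bound (the paper's theorem presumably covers the general case; this only spot-checks it).

```python
import numpy as np

def exploitability(A, x, y):
    value = x @ A @ y
    return (np.max(A @ y) - value) + (value - np.min(x @ A))

rng = np.random.default_rng(1)
A = rng.uniform(-1, 1, (5, 5))
x = rng.dirichlet(np.ones(5))   # fixed mixed strategies on the simplex
y = rng.dirichlet(np.ones(5))

worst_ratio = 0.0
for _ in range(1000):
    D = 0.05 * rng.uniform(-1, 1, (5, 5))   # small payoff perturbation
    change = abs(exploitability(A + D, x, y) - exploitability(A, x, y))
    worst_ratio = max(worst_ratio, change / np.abs(D).max())

assert worst_ratio <= 2.0   # 2-Lipschitz in the max-norm of the perturbation
```

A vertex-returning LP selector has no analogous guarantee: an arbitrarily small perturbation of a degenerate game can flip which equilibrium vertex it returns.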

Figures

Figures reproduced from arXiv: 2605.10410 by Binhan Luo, Ching-Wen Ma, Jyh-Shing Roger Jang, Wenhua Nie, Zijie Meng.

Figure 1. Three regimes of matrix-game strategic reasoning in LLMs. (a) Base model performs well on named games but falls to 34%, 18%, and 2% on random 2×2, 3×3, and 5×5 matrices (memorization gap). (b) SFT and VERGE trained on 2×2–3×3 generalize to 7×7 OOD, far exceeding maximin and base. (c) 3×3 games embedded in 12×12 matrices succeed for both SFT and VERGE, while dense and random-padded 12×12 games fail (caption truncated at source).
Figure 2. Dominated-padding experiment: s@0.10 as a function of padded matrix size N for 3×3 → N embeddings with iteratively dominated actions versus dense random N×N games and random-padded negative controls. Dominated embeddings remain far above both controls through N=20, at 0.97±.01 even through 20×20 padding, while dense 12×12 games achieve 0.01±.01 and random-padded 12×12 games only 0.06 (caption truncated at source).
read the original abstract

Large language models can score well on named game-theory benchmarks while failing on the same strategic computation once semantic cues are removed. We show this gap with procedurally generated zero-sum matrix games: a model that recognizes familiar games drops to 34%, 18%, and 2% success on anonymous $2{\times}2$, $3{\times}3$, and $5{\times}5$ payoff matrices. The benchmark separates semantic recall, learned approximate Nash computation, and an output-interface bottleneck that limits scale. Training only on $2{\times}2$ and $3{\times}3$ games, supervised fine-tuning raises unseen $5{\times}5$--$7{\times}7$ success from 2% to 61%, while exploitability-reward training averages 37% with high seed variance. We prove that the exploitability residual is $2$-Lipschitz in payoff perturbations, unlike discontinuous vertex-returning LP equilibrium selectors, explaining why residual training can transfer under payoff shifts even when formatting instability limits mean performance. A dominated-action padding experiment provides causal evidence: trained models solve $3{\times}3$ games embedded in much larger matrices, while random-padded controls fail and dense $12{\times}12$ games remain near failure. Procedural evaluation is therefore necessary for measuring strategic reasoning, and residual rewards expose a real but format-limited route to approximate equilibrium computation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs succeed on named game-theory benchmarks but drop sharply (to 34%, 18%, 2%) on procedurally generated anonymous zero-sum matrix games of increasing size. Supervised fine-tuning on 2×2 and 3×3 games raises success on unseen 5×5–7×7 matrices from 2% to 61%, while exploitability-reward training reaches 37% (high seed variance). A 2-Lipschitz continuity proof for the exploitability residual (contrasted with discontinuous LP selectors) is offered to explain transfer under payoff shifts, and a dominated-action padding experiment is presented as causal evidence that models solve embedded 3×3 subgames in larger matrices while random-padded controls fail.

Significance. If the central claims hold, the work supplies both a concrete training route to approximate equilibrium computation in LLMs and a mathematical reason (Lipschitz continuity of the residual) why such training can generalize across payoff perturbations where vertex-based solvers cannot. The emphasis on procedural generation and the padding control also strengthens the case that semantic recall and formatting artifacts must be separated from genuine strategic reasoning.

major comments (2)
  1. [Dominated-action padding experiment] Dominated-action padding experiment (abstract and associated results section): trained models succeed on 3×3 games embedded in larger matrices while random-padded controls fail, yet both conditions employ the identical payoff-generation pipeline and output representation for probability vectors. This does not isolate acquisition of approximate Nash computation from consistent artifacts in matrix presentation or formatting conventions, which is load-bearing for the claim that performance gains reflect strategic reasoning rather than size- or format-specific adaptation.
  2. [Training results and exploitability-reward training] Exploitability-reward training results (abstract): the reported 37% average success on larger games is accompanied by high seed variance. Without additional ablations that vary output formatting independently of the residual objective or that compare against size-matched but non-strategic baselines, it remains possible that the observed transfer is driven by procedural regularities rather than equilibrium computation.
minor comments (2)
  1. The title refers to 'three regimes' of strategic reasoning, but the abstract does not delineate them explicitly; ensure the main text defines these regimes with clear operational criteria and supporting figures or tables.
  2. Report all success rates with standard errors, confidence intervals, or per-seed distributions to allow readers to assess reliability in light of the noted seed variance.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, clarifying the role of the controls and theoretical results while noting where the manuscript will be revised for greater precision.

read point-by-point responses
  1. Referee: Dominated-action padding experiment (abstract and associated results section): trained models succeed on 3×3 games embedded in larger matrices while random-padded controls fail, yet both conditions employ the identical payoff-generation pipeline and output representation for probability vectors. This does not isolate acquisition of approximate Nash computation from consistent artifacts in matrix presentation or formatting conventions, which is load-bearing for the claim that performance gains reflect strategic reasoning rather than size- or format-specific adaptation.

    Authors: We appreciate the referee's point on potential confounds. The random-padded and dominated-action conditions are matched exactly on payoff generation, matrix dimensions, and output format. The sole systematic difference is the strategic structure: dominated-action padding embeds a 3×3 subgame whose Nash equilibrium can be computed from the training distribution, while random padding contains no such structure. Because models were trained exclusively on small anonymous games and succeed selectively on the former, the performance gap indicates transfer of approximate equilibrium computation to embedded subgames rather than format adaptation. We will revise the results section to state this matching and differential more explicitly. revision: partial

  2. Referee: Exploitability-reward training results (abstract): the reported 37% average success on larger games is accompanied by high seed variance. Without additional ablations that vary output formatting independently of the residual objective or that compare against size-matched but non-strategic baselines, it remains possible that the observed transfer is driven by procedural regularities rather than equilibrium computation.

    Authors: The high seed variance for exploitability-reward training is reported in the manuscript and reflects known instability of that objective. Supervised fine-tuning on the same small games yields stable gains to 61% on larger instances. The 2-Lipschitz continuity of the exploitability residual (contrasted with discontinuous LP selectors) supplies a theoretical reason why residual training transfers under payoff shifts. While we did not run separate ablations isolating output formatting or non-strategic baselines, the procedural anonymous generation already removes semantic cues. We will add a limitations paragraph acknowledging the variance and the value of such ablations for future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity; mathematical proof and experiments remain independent of fitted inputs

full rationale

The paper's core derivation is a mathematical proof that the exploitability residual is 2-Lipschitz in payoff perturbations, presented as independent of the empirical training results on 2×2/3×3 games. No equation reduces the claimed transferability or generalization to quantities defined by the same fitted parameters or self-citations. The dominated-action padding experiment and procedural generation are described as providing causal evidence without reducing to renaming or ansatz smuggling. The derivation chain is self-contained relative to external benchmarks and does not invoke load-bearing self-citations or uniqueness theorems from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the mathematical proof of 2-Lipschitz continuity of the exploitability residual and on empirical observations from supervised and reward-based training; the abstract introduces no free parameters, no new postulated entities, and relies only on standard game-theoretic background.

axioms (1)
  • standard math: Existence of Nash equilibria in finite zero-sum matrix games
    Implicit background for defining exploitability and equilibrium computation.
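Stated explicitly, the background fact is von Neumann's minimax theorem: for any finite payoff matrix $A$ and strategy simplices $\Delta_m$, $\Delta_n$,

```latex
\max_{x \in \Delta_m} \min_{y \in \Delta_n} x^{\top} A y
  \;=\;
\min_{y \in \Delta_n} \max_{x \in \Delta_m} x^{\top} A y .
```

The common value is the game value, and the exploitability residual of a strategy pair is zero exactly when both sides attain it, which is what makes the residual a well-defined training target.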

pith-pipeline@v0.9.0 · 5573 in / 1480 out tokens · 54305 ms · 2026-05-12T04:47:38.990038+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: the paper's claim is directly supported by a theorem in the formal canon.
supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: the paper appears to rely on the theorem as machinery.
contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 5 internal anchors
