arxiv: 2604.18596 · v2 · submitted 2026-04-01 · ⚛️ physics.soc-ph · cs.GT

Large language models converge on competitive rationality but diverge on cooperation across providers and generations

Pith reviewed 2026-05-13 21:39 UTC · model grok-4.3

classification ⚛️ physics.soc-ph cs.GT
keywords large language models · cooperation · game theory · strategic behavior · provider differences · model generations · economic interactions · autonomous agents

The pith

Large language models converge on competitive and coordination behaviors but diverge dramatically in cooperation rates across different providers and generations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests how large language models act as strategic agents in economic games on behalf of humans. It reveals that while models from seven developers converge closely on competitive and coordination choices, their rates of cooperation range from under 2 percent to over 70 percent. Provider identity stands out as the main factor driving these differences, and cooperation levels shift unpredictably from one model generation to the next. These findings matter because models are now used in negotiations and decisions that have real economic stakes. The results cannot be predicted from standard capability measures.

Core claim

In 51,906 trials across 38 games, 25 models show low variation in coordination and strategic depth (coefficients of variation of 0.06 and 0.11) but a 48-fold spread in cooperation, from 1.5 percent (GPT-5 Nano) to 71.5 percent (Claude Opus 4.6). Anthropic frontier models sustain 57 percent cooperation even in the final round of finitely repeated games, while OpenAI's latest models cooperate far less. Provider identity predicts cooperative disposition better than other factors, and the disposition is unstable across generations: OpenAI cooperation fell from 50.3 to 1.5 percent over four model generations, while Google's rose from 8.3 to 56.8 percent.
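
The 48-fold figure can be checked from the two extreme cooperation rates quoted in the abstract (1.5 and 71.5 per cent). The sketch below does that arithmetic and shows one common way to compute a coefficient of variation. The function names, the use of the population standard deviation, and the sample scores are assumptions made for illustration; the paper's exact convention is not stated here.

```python
from statistics import mean, pstdev


def fold_spread(rates):
    """Ratio of the largest to the smallest rate (all rates must be positive)."""
    if min(rates) <= 0:
        raise ValueError("fold spread is undefined for non-positive rates")
    return max(rates) / min(rates)


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (an assumed convention)."""
    return pstdev(values) / mean(values)


# Extreme cooperation rates quoted in the abstract, in per cent.
lowest, highest = 1.5, 71.5
print(round(fold_spread([lowest, highest]), 1))  # 47.7, reported as 48-fold

# Hypothetical per-model coordination scores, for illustration only.
hypothetical_scores = [0.80, 0.85, 0.90, 0.95]
print(round(coefficient_of_variation(hypothetical_scores), 3))
```

A ratio of extremes is sensitive to the smallest value: a shift of one percentage point in the lowest rate would change the fold figure substantially, which is one reason interval estimates matter for this claim.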

What carries the argument

A set of 38 canonical games elicits strategic decisions from the models and reveals distinct, provider-specific cooperative personalities.

If this is right

  • Cooperative outcomes in any interaction mediated by these models will depend primarily on which provider's model is chosen.
  • Updates to models can cause large, unpredictable changes in cooperative behavior.
  • Standard benchmarks of model capability will not reveal these strategic differences.
  • Models from different providers will tend to produce similar competitive and coordination results.
  • Endgame behavior in repeated interactions will vary by model family, sometimes contradicting theoretical predictions of defection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training pipelines appear to embed stable but distinct economic personalities into models from each provider.
  • Selecting an LLM for tasks involving negotiation or resource allocation may require testing its cooperative tendencies rather than relying on capability scores.
  • Real-world economic impacts from LLM agents could be mitigated by choosing models with desired strategic profiles or by developing methods to adjust them.

Load-bearing premise

The strategic dispositions measured in these laboratory games will generalize to the real-world economic interactions where LLMs serve as autonomous agents.

What would settle it

A direct test would be to deploy different provider models in actual negotiation tasks and measure whether the observed cooperation rates match the game-trial predictions, or to check if a new model generation reverses its prior cooperation level.

Original abstract

As language models are deployed as autonomous agents that negotiate, cooperate, and compete on behalf of human principals, their strategic dispositions acquire direct economic consequences. Here we show, across 51,906 game-theoretic trials generating 826,990 strategic decisions from 25 large language models spanning seven developers and 38 canonical games, that models converge on competitive and coordination behaviour (coefficient of variation 0.06 for coordination, 0.11 for strategic depth) while diverging 48-fold on cooperation, from 1.5 per cent (GPT-5 Nano) to 71.5 per cent (Claude Opus 4.6). Provider identity is the dominant predictor of cooperative disposition, and this divergence is generationally unstable: OpenAI cooperation fell from 50.3 to 1.5 per cent across four model generations while Google cooperation rose from 8.3 to 56.8 per cent. Endgame analysis reveals that Anthropic frontier models sustain 57 per cent cooperation in the final round of finitely repeated games, where backward induction predicts zero, while the newest Google models cooperate throughout but universally defect when punishment becomes impossible. These strategic personalities are shaped by training pipelines, shift unpredictably across model versions, and cannot be inferred from capability benchmarks, yet they determine the cooperative outcomes of every economic interaction these models mediate. The complete dataset and an interactive explorer for the data are publicly available at https://felipemaffonso.github.io/strategic-personalities/.
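
The statement that backward induction predicts zero cooperation in the final round can be illustrated with a finitely repeated prisoner's dilemma. The sketch below is not the paper's code: the payoff values and function names are hypothetical, chosen only to satisfy the usual prisoner's dilemma ordering. It solves the last round first and carries the resulting continuation payoff back one round at a time.

```python
# Hypothetical stage-game payoffs (row player, column player).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
ACTIONS = ("C", "D")


def stage_equilibria(continuation):
    """Pure-strategy Nash equilibria of one round plus a fixed continuation payoff."""
    found = []
    for a in ACTIONS:
        for b in ACTIONS:
            u1 = PAYOFF[(a, b)][0] + continuation[0]
            u2 = PAYOFF[(a, b)][1] + continuation[1]
            best1 = all(u1 >= PAYOFF[(x, b)][0] + continuation[0] for x in ACTIONS)
            best2 = all(u2 >= PAYOFF[(a, y)][1] + continuation[1] for y in ACTIONS)
            if best1 and best2:
                found.append((a, b))
    return found


def backward_induction(rounds):
    """Solve from the last round to the first; return the play path and total payoffs."""
    continuation = (0, 0)
    path = []
    for _ in range(rounds):
        equilibria = stage_equilibria(continuation)
        if len(equilibria) != 1:
            raise ValueError("sketch assumes a unique stage equilibrium")
        a, b = equilibria[0]
        path.append((a, b))
        continuation = (
            continuation[0] + PAYOFF[(a, b)][0],
            continuation[1] + PAYOFF[(a, b)][1],
        )
    return list(reversed(path)), continuation


path, totals = backward_induction(5)
print(path)    # every round is ('D', 'D')
print(totals)  # (5, 5)
```

Because later play does not depend on earlier actions, the continuation payoff is a constant and cannot change the best response, so mutual defection unravels to every round. The 57 per cent final-round cooperation reported for Anthropic frontier models is a departure from this benchmark.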

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript reports an empirical study of 25 large language models across seven providers in 51,906 trials spanning 38 canonical games, generating 826,990 decisions. It claims convergence on competitive and coordination behaviors (coefficients of variation 0.06 and 0.11) alongside a 48-fold divergence in cooperation rates (1.5% for GPT-5 Nano to 71.5% for Claude Opus 4.6), with provider identity as the dominant predictor. Generational shifts are documented (e.g., OpenAI cooperation declining from 50.3% to 1.5%), endgame analysis shows deviations from backward induction (Anthropic models sustaining 57% cooperation in final rounds), and the authors conclude these dispositions determine cooperative outcomes in all economic interactions mediated by LLMs. The full dataset and interactive explorer are released publicly.

Significance. If the empirical patterns prove robust, the work provides a large-scale, reproducible mapping of strategic dispositions in frontier LLMs that could inform risk assessment for deploying these models as autonomous agents in economic settings. Credit is due for the scale (over 50k trials), public data release, and falsifiable predictions about provider-level and generational effects, which enable direct follow-up testing.

major comments (2)
  1. [Abstract and Discussion] Abstract and concluding discussion: the central claim that observed dispositions 'determine the cooperative outcomes of every economic interaction these models mediate' is load-bearing yet unsupported by the experimental scope. All 51,906 trials use fixed finite-horizon matrix games with explicit payoff tables; no results address open-ended bargaining, evolving-state repeated play, or natural-language principal-agent delegation, leaving the extrapolation without bridging evidence.
  2. [Results] Results section (cooperation rates and CV calculations): the reported 48-fold divergence, coefficients of variation (0.06 for coordination, 0.11 for strategic depth), and specific percentages (e.g., 57% final-round cooperation) are presented without visible error bars, confidence intervals, or statistical tests for provider dominance. This absence directly affects verifiability of the convergence/divergence claims and the assertion that provider identity is the dominant predictor.
minor comments (1)
  1. [Methods] The public data link and interactive explorer are valuable; include a brief methods subsection describing exact prompt templates, game encoding, and trial randomization to facilitate replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We respond to each major comment below and outline the revisions we will make to address the concerns raised.

Point-by-point responses
  1. Referee: [Abstract and Discussion] Abstract and concluding discussion: the central claim that observed dispositions 'determine the cooperative outcomes of every economic interaction these models mediate' is load-bearing yet unsupported by the experimental scope. All 51,906 trials use fixed finite-horizon matrix games with explicit payoff tables; no results address open-ended bargaining, evolving-state repeated play, or natural-language principal-agent delegation, leaving the extrapolation without bridging evidence.

    Authors: We acknowledge the limitation in experimental scope to finite-horizon matrix games. These canonical games allow precise measurement of strategic dispositions, which we argue form the basis for behavior in more complex economic interactions. However, to strengthen the manuscript, we will revise the abstract and discussion to explicitly note the scope of our findings and qualify the extrapolation, emphasizing the need for future studies on open-ended and natural-language settings. This revision will be partial, as we maintain that the observed patterns provide valuable insights into LLM-mediated cooperation. revision: partial

  2. Referee: [Results] Results section (cooperation rates and CV calculations): the reported 48-fold divergence, coefficients of variation (0.06 for coordination, 0.11 for strategic depth), and specific percentages (e.g., 57% final-round cooperation) are presented without visible error bars, confidence intervals, or statistical tests for provider dominance. This absence directly affects verifiability of the convergence/divergence claims and the assertion that provider identity is the dominant predictor.

    Authors: We agree that statistical support is necessary for the reported metrics. In the revised version, we will include error bars and confidence intervals for all key statistics, such as cooperation rates and coefficients of variation. Additionally, we will perform and report statistical tests to substantiate that provider identity is the dominant predictor, for example through regression models controlling for other factors. This will directly address the verifiability concerns. revision: yes
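
As a rough illustration of the kind of interval estimate requested in the second major comment, the sketch below computes a Wilson score interval for a cooperation rate. It is an assumption that such an interval would suit the data: the trial counts are hypothetical, and the method treats decisions as independent, which may not hold for rounds within the same repeated game.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts chosen to match the quoted rates of 1.5% and 71.5%.
for label, cooperated, total in [("low", 15, 1000), ("high", 715, 1000)]:
    low, high = wilson_interval(cooperated, total)
    print(f"{label}: {cooperated / total:.3f} [{low:.3f}, {high:.3f}]")
```

The Wilson interval is used here instead of the simpler normal approximation because it behaves better for rates near zero, such as the lowest cooperation rate in the study. Clustered or model-level resampling would be a more conservative choice if decisions within a trial are correlated.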

Circularity Check

0 steps flagged

No significant circularity in this empirical measurement study

full rationale

The paper reports direct experimental results from 51,906 game-theoretic trials across 25 LLMs and 38 canonical games, measuring cooperation rates, coordination, and strategic depth without any derivations, equations, fitted parameters, or first-principles claims. Provider identity as predictor and generational shifts are presented as observed patterns from the data, not reduced to self-defined quantities. No self-citations serve as load-bearing uniqueness theorems, and the analysis remains self-contained against external benchmarks with no ansatzes or renamings of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the work consists of direct empirical observation of model outputs in game settings.

pith-pipeline@v0.9.0 · 5566 in / 1093 out tokens · 42584 ms · 2026-05-13T21:39:44.081554+00:00 · methodology
