Large language models converge on competitive rationality but diverge on cooperation across providers and generations
Pith reviewed 2026-05-13 21:39 UTC · model grok-4.3
The pith
Large language models converge on competitive and coordination behaviors but diverge dramatically in cooperation rates across different providers and generations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In 51,906 trials across 38 games, 25 models show low variation in competitive and coordination behavior but a 48-fold spread in cooperation, from 1.5% (GPT-5 Nano) to 71.5% (Claude Opus 4.6). Anthropic frontier models sustain 57% cooperation even in the final round of finitely repeated games, while OpenAI's latest models cooperate far less. Provider identity is the strongest predictor of cooperative disposition, and the trait shifts across generations: OpenAI cooperation fell from 50.3% to 1.5% over four model generations, while Google cooperation rose from 8.3% to 56.8%.
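The two summary statistics behind this claim are easy to reproduce. The following is a minimal sketch, not the authors' code: the fold spread uses the two cooperation endpoints reported in the abstract (1.5 and 71.5 per cent), while `coordination_scores` holds made-up placeholder values. The paper does not say whether its coefficient of variation uses the population or the sample standard deviation; the population form is assumed here.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)

def fold_spread(values):
    """Ratio of the largest value to the smallest value."""
    return max(values) / min(values)

# Endpoints reported in the abstract (per cent cooperation).
cooperation_endpoints = [1.5, 71.5]
print(round(fold_spread(cooperation_endpoints), 1))  # 47.7, i.e. roughly 48-fold

# Hypothetical placeholder scores, NOT taken from the paper's dataset.
coordination_scores = [0.80, 0.85, 0.90, 0.82, 0.88]
print(round(coefficient_of_variation(coordination_scores), 3))
```

A fold spread depends only on the two extreme models, so it is sensitive to a single outlier, whereas the coefficient of variation summarises dispersion across all models.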
What carries the argument
The set of 38 canonical games used to elicit strategic decisions from the models, revealing distinct provider-specific cooperative personalities.
If this is right
- Cooperative outcomes in any interaction mediated by these models will depend primarily on which provider's model is chosen.
- Updates to models can cause large, unpredictable changes in cooperative behavior.
- Standard benchmarks of model capability will not reveal these strategic differences.
- Models will tend to produce similar competitive and coordination results regardless of provider.
- Endgame behavior in repeated interactions will vary by model family, sometimes contradicting theoretical predictions of defection.
Where Pith is reading between the lines
- Training pipelines appear to embed stable but distinct economic personalities into models from each provider.
- Selecting an LLM for tasks involving negotiation or resource allocation may require testing its cooperative tendencies rather than relying on capability scores.
- Real-world economic impacts from LLM agents could be mitigated by choosing models with desired strategic profiles or by developing methods to adjust them.
Load-bearing premise
The strategic dispositions measured in these laboratory games will generalize to the real-world economic interactions where LLMs serve as autonomous agents.
What would settle it
A direct test would be to deploy different provider models in actual negotiation tasks and measure whether the observed cooperation rates match the game-trial predictions, or to check if a new model generation reverses its prior cooperation level.
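One way to frame such a comparison is a two-proportion z-test between the cooperation rate seen in game trials and the rate seen in a deployed task. This is a sketch under stated assumptions: the function name and the counts in the usage lines are hypothetical, and the paper does not prescribe this test.

```python
from math import erfc, sqrt

def two_proportion_z_test(coop_a, n_a, coop_b, n_b):
    """Return (z, two-sided p-value) for the difference between two rates.

    Assumes independent trials and a pooled rate strictly between 0 and 1.
    """
    p_a, p_b = coop_a / n_a, coop_b / n_b
    pooled = (coop_a + coop_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    p_value = erfc(abs(z) / sqrt(2))  # P(|Z| > |z|) for a standard normal
    return z, p_value

# Hypothetical counts: cooperative choices in game trials vs. a field task.
z, p = two_proportion_z_test(715, 1000, 650, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value would indicate that the field rate differs from the game-trial rate; it would not by itself say which setting better reflects the model's disposition.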
Original abstract
As language models are deployed as autonomous agents that negotiate, cooperate, and compete on behalf of human principals, their strategic dispositions acquire direct economic consequences. Here we show, across 51,906 game-theoretic trials generating 826,990 strategic decisions from 25 large language models spanning seven developers and 38 canonical games, that models converge on competitive and coordination behaviour (coefficient of variation 0.06 for coordination, 0.11 for strategic depth) while diverging 48-fold on cooperation, from 1.5 per cent (GPT-5 Nano) to 71.5 per cent (Claude Opus 4.6). Provider identity is the dominant predictor of cooperative disposition, and this divergence is generationally unstable: OpenAI cooperation fell from 50.3 to 1.5 per cent across four model generations while Google cooperation rose from 8.3 to 56.8 per cent. Endgame analysis reveals that Anthropic frontier models sustain 57 per cent cooperation in the final round of finitely repeated games, where backward induction predicts zero, while the newest Google models cooperate throughout but universally defect when punishment becomes impossible. These strategic personalities are shaped by training pipelines, shift unpredictably across model versions, and cannot be inferred from capability benchmarks, yet they determine the cooperative outcomes of every economic interaction these models mediate. The complete dataset and an interactive explorer for the data are publicly available at https://felipemaffonso.github.io/strategic-personalities/.
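The backward-induction benchmark mentioned in the abstract can be made concrete. The sketch below assumes a standard prisoner's dilemma with illustrative payoffs (5, 3, 1, 0); these numbers and the function names are hypothetical and are not the payoff tables used in the paper.

```python
# Row player's payoffs in a one-shot prisoner's dilemma (hypothetical values
# with the standard ordering T > R > P > S).
PAYOFF = {
    ("C", "C"): 3,  # R: mutual cooperation
    ("C", "D"): 0,  # S: sucker's payoff
    ("D", "C"): 5,  # T: temptation to defect
    ("D", "D"): 1,  # P: mutual defection
}
ACTIONS = ("C", "D")

def best_response(opponent_action, continuation_value):
    """Action maximising stage payoff plus a fixed continuation value."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, opponent_action)] + continuation_value)

def backward_induction(rounds):
    """Solve from the last round backwards; return the predicted action per round."""
    plan = []
    continuation = 0  # nothing follows the final round
    for _ in range(rounds):
        # Later play is already fixed, so the continuation value does not depend
        # on today's action and the stage-game best response decides.
        choices = {best_response(opp, continuation) for opp in ACTIONS}
        assert len(choices) == 1, "no dominant action in the stage game"
        action = choices.pop()
        plan.append(action)
        continuation += PAYOFF[(action, action)]  # symmetric game: both play it
    return list(reversed(plan))

print(backward_induction(5))  # ['D', 'D', 'D', 'D', 'D']
```

Under these assumptions the predicted cooperation rate is zero in every round, including the last, which is the baseline against which the reported 57 per cent final-round cooperation is measured.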
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an empirical study of 25 large language models across seven providers in 51,906 trials spanning 38 canonical games, generating 826,990 decisions. It claims convergence on competitive and coordination behaviors (coefficients of variation 0.06 and 0.11) alongside a 48-fold divergence in cooperation rates (1.5% for GPT-5 Nano to 71.5% for Claude Opus 4.6), with provider identity as the dominant predictor. Generational shifts are documented (e.g., OpenAI cooperation declining from 50.3% to 1.5%), endgame analysis shows deviations from backward induction (Anthropic models sustaining 57% cooperation in final rounds), and the authors conclude these dispositions determine cooperative outcomes in all economic interactions mediated by LLMs. The full dataset and interactive explorer are released publicly.
Significance. If the empirical patterns prove robust, the work provides a large-scale, reproducible mapping of strategic dispositions in frontier LLMs that could inform risk assessment for deploying these models as autonomous agents in economic settings. Credit is due for the scale (over 50k trials), public data release, and falsifiable predictions about provider-level and generational effects, which enable direct follow-up testing.
major comments (2)
- [Abstract and Discussion] Abstract and concluding discussion: the central claim that observed dispositions 'determine the cooperative outcomes of every economic interaction these models mediate' is load-bearing yet unsupported by the experimental scope. All 51,906 trials use fixed finite-horizon matrix games with explicit payoff tables; no results address open-ended bargaining, evolving-state repeated play, or natural-language principal-agent delegation, leaving the extrapolation without bridging evidence.
- [Results] Results section (cooperation rates and CV calculations): the reported 48-fold divergence, coefficients of variation (0.06 for coordination, 0.11 for strategic depth), and specific percentages (e.g., 57% final-round cooperation) are presented without visible error bars, confidence intervals, or statistical tests for provider dominance. This absence directly affects verifiability of the convergence/divergence claims and the assertion that provider identity is the dominant predictor.
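To illustrate the kind of uncertainty estimate this comment asks for, here is a minimal sketch of a Wilson score interval for a single cooperation rate. The counts in the usage line are hypothetical: the manuscript reports 57% final-round cooperation but this summary does not give the number of trials behind it.

```python
from math import sqrt

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

# Hypothetical example: 57 cooperative choices in 100 final-round trials.
low, high = wilson_interval(57, 100)
print(f"{low:.3f} to {high:.3f}")
```

The Wilson form is chosen here because it behaves sensibly for rates near zero, such as the 1.5% figure, where the simpler normal approximation can produce a negative lower bound.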
minor comments (1)
- [Methods] The public data link and interactive explorer are valuable; include a brief methods subsection describing exact prompt templates, game encoding, and trial randomization to facilitate replication.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We respond to each major comment below and outline the revisions we will make to address the concerns raised.
- Referee: [Abstract and Discussion] Abstract and concluding discussion: the central claim that observed dispositions 'determine the cooperative outcomes of every economic interaction these models mediate' is load-bearing yet unsupported by the experimental scope. All 51,906 trials use fixed finite-horizon matrix games with explicit payoff tables; no results address open-ended bargaining, evolving-state repeated play, or natural-language principal-agent delegation, leaving the extrapolation without bridging evidence.
  Authors: We acknowledge the limitation in experimental scope to finite-horizon matrix games. These canonical games allow precise measurement of strategic dispositions, which we argue form the basis for behavior in more complex economic interactions. However, to strengthen the manuscript, we will revise the abstract and discussion to explicitly note the scope of our findings and qualify the extrapolation, emphasizing the need for future studies on open-ended and natural-language settings. This revision will be partial, as we maintain that the observed patterns provide valuable insights into LLM-mediated cooperation.
  Revision: partial
- Referee: [Results] Results section (cooperation rates and CV calculations): the reported 48-fold divergence, coefficients of variation (0.06 for coordination, 0.11 for strategic depth), and specific percentages (e.g., 57% final-round cooperation) are presented without visible error bars, confidence intervals, or statistical tests for provider dominance. This absence directly affects verifiability of the convergence/divergence claims and the assertion that provider identity is the dominant predictor.
  Authors: We agree that statistical support is necessary for the reported metrics. In the revised version, we will include error bars and confidence intervals for all key statistics, such as cooperation rates and coefficients of variation. Additionally, we will perform and report statistical tests to substantiate that provider identity is the dominant predictor, for example through regression models controlling for other factors. This will directly address the verifiability concerns.
  Revision: yes
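A simple first check on the "dominant predictor" claim is the share of variance in per-model cooperation rates that provider membership explains (eta squared from a one-way decomposition). This sketch is a simpler stand-in for the regression models the authors mention, not their analysis; the provider labels and rates below are hypothetical.

```python
from statistics import mean

def eta_squared(groups):
    """Share of total variance explained by group membership.

    `groups` maps a group label to a list of values. Assumes the values are
    not all identical, otherwise the total sum of squares is zero.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_between / ss_total

# Hypothetical per-model cooperation rates grouped by provider.
rates_by_provider = {
    "provider_a": [0.70, 0.65, 0.60],
    "provider_b": [0.05, 0.10, 0.02],
    "provider_c": [0.40, 0.55],
}
print(round(eta_squared(rates_by_provider), 3))
```

A value near 1 would mean models from the same provider cluster tightly; comparing it with the same statistic for other groupings (for example model size or release year) would show whether provider is in fact the strongest grouping.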
Circularity Check
No significant circularity in this empirical measurement study
The paper reports direct experimental results from 51,906 game-theoretic trials across 25 LLMs and 38 canonical games, measuring cooperation rates, coordination, and strategic depth without any derivations, equations, fitted parameters, or first-principles claims. Provider identity as predictor and generational shifts are presented as observed patterns from the data, not reduced to self-defined quantities. No self-citations serve as load-bearing uniqueness theorems, and the analysis remains self-contained against external benchmarks with no ansatzes or renamings of known results.
Extended Data Figures
Extended Data Figure 1 | Behavioural radar profiles for all 25 models. Eight-dimensional radar charts showing normalised scores on cooperation, coordination, fairness, strategic depth, trust, competitivenes...