
arXiv: 2605.29874 · v1 · pith:ZTAENMFSnew · submitted 2026-05-28 · 💻 cs.MA · cs.AI · cs.GT

Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension

Pith reviewed 2026-06-29 00:09 UTC · model grok-4.3

classification 💻 cs.MA · cs.AI · cs.GT
keywords LLM agents · cooperation · Iterated Prisoner's Dilemma · evolutionary dynamics · multi-agent systems · prompt engineering · provider differences

The pith

Provider identity, not model generation, best predicts equilibrium cooperation in next-generation LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper extends prior work on LLM agents in evolutionary games by testing four newer models from three providers in the Iterated Prisoner's Dilemma. It shows that the cooperative bias seen in earlier models carries over, but that equilibrium behavior differs markedly by provider. Self-Refine prompting raises the ICD capability index in every model, and the apparent improvement in noise robustness over earlier models is not statistically confirmed. The results suggest that a model's provider is a stronger correlate of its equilibrium behavior than its generation.

Core claim

Next-generation LLM agents maintain cooperative biases in balanced conditions across providers, yet diverge substantially under biased populations: Gemini 2.5 Flash reaches up to 77% aggressive equilibria, while GPT-5.4 Mini reaches 70% cooperative equilibria under Self-Refine. Self-Refine prompting raises ICD in all models, and the gap in noise sensitivity relative to prior-generation models is not statistically significant once the predecessor's unreported sampling error is propagated.

What carries the argument

The Moran process evolutionary simulation with 500 iterations per condition, applied across prompting styles (Default, Prose, Self-Refine) and population compositions in the Iterated Prisoner's Dilemma.
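
As a rough illustration of the machinery named above, the following is a minimal sketch of a two-type Moran process repeated 500 times from a fixed starting population. It is not the paper's code. The payoff table (tit-for-tat versus always-defect over ten rounds with T=5, R=3, P=1, S=0), the population size of 10, the reading of "balanced" as equal starting counts, and the names `fitness`, `moran_run`, and `equilibrium_shares` are all assumptions made for this example; in the paper the strategies are generated by LLMs.

```python
import random

# Assumed per-round payoffs between two strategy types in a 10-round IPD
# (tit-for-tat as "coop", always-defect as "aggr", with T=5, R=3, P=1, S=0).
# These numbers are illustrative only; the paper's strategies are LLM-generated.
PAYOFF = {
    ("coop", "coop"): 3.0,
    ("coop", "aggr"): 0.9,
    ("aggr", "coop"): 1.4,
    ("aggr", "aggr"): 1.0,
}
TYPES = ("coop", "aggr")


def fitness(kind, counts):
    """Mean payoff of one `kind` individual against everyone else."""
    others = dict(counts)
    others[kind] -= 1  # an individual does not play against itself
    total = sum(others.values())
    return sum(PAYOFF[(kind, o)] * n for o, n in others.items()) / total


def moran_run(n_coop, n_aggr, rng, max_steps=100_000):
    """Run one Moran process until a type fixes and return that type."""
    counts = {"coop": n_coop, "aggr": n_aggr}
    for _ in range(max_steps):
        for kind in TYPES:
            if counts[kind] == sum(counts.values()):
                return kind
        # Birth: choose a type with probability proportional to count * fitness.
        weights = [counts[k] * fitness(k, counts) for k in TYPES]
        born = rng.choices(TYPES, weights=weights)[0]
        # Death: choose a uniformly random individual to be replaced.
        died = rng.choices(TYPES, weights=[counts[k] for k in TYPES])[0]
        counts[born] += 1
        counts[died] -= 1
    return "unresolved"


def equilibrium_shares(n_coop, n_aggr, n_runs=500, seed=0):
    """Fraction of independent runs that end in each outcome."""
    rng = random.Random(seed)
    tally = {"coop": 0, "aggr": 0, "unresolved": 0}
    for _ in range(n_runs):
        tally[moran_run(n_coop, n_aggr, rng)] += 1
    return {k: v / n_runs for k, v in tally.items()}


if __name__ == "__main__":
    print("balanced:", equilibrium_shares(5, 5))
    print("biased:  ", equilibrium_shares(2, 8))
```

The shares returned by `equilibrium_shares` play the role of the equilibrium percentages quoted on this page, under the assumption that each of the 500 iterations is an independent run to fixation.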

If this is right

  • Cooperative equilibria are favored in nine of twelve model-prompt combinations under balanced noiseless conditions.
  • Self-Refine prompting raises ICD (the Index of Differential Capabilities) in all tested models.
  • Cross-provider divergence is large: Gemini 2.5 Flash reaches up to 77% aggressive equilibria under biased conditions, while GPT-5.4 Mini reaches 70% cooperative equilibria under Self-Refine.
  • Noise sensitivity remains a challenge with no confirmed reduction in newer models.
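
To give a sense of how precise a percentage from 500 iterations can be, here is a short sketch that computes a normal-approximation (Wald) confidence interval for the 77% figure quoted above. It assumes that each of the 500 iterations is an independent run with a binary outcome, which this page does not confirm; the helper name `proportion_ci` is hypothetical.

```python
import math


def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion.

    Returns (low, high, standard_error). Assumes n independent binary trials.
    """
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se, se


if __name__ == "__main__":
    # 0.77 and n=500 are the values quoted in the text; independence is assumed.
    low, high, se = proportion_ci(0.77, 500)
    print(f"SE = {se:.4f}, 95% CI = [{low:.3f}, {high:.3f}]")
    # prints: SE = 0.0188, 95% CI = [0.733, 0.807]
```

Under these assumptions a single reported percentage carries an uncertainty of roughly plus or minus 4 percentage points, which is worth keeping in mind when comparing conditions that differ by only a few points.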

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If provider effects dominate, then mixed-provider agent swarms may exhibit unpredictable cooperation levels.
  • Developers could select models from specific providers to encourage desired equilibrium behaviors in agent systems.
  • Testing additional prompting methods or larger population sizes could clarify the noise robustness question.

Load-bearing premise

That the sampling error of the earlier study, which was not reported, can be propagated accurately enough to conclude that the noise sensitivity gap is not statistically significant.

What would settle it

A re-analysis or new experiment with full sampling details from the predecessor study showing a statistically significant reduction in noise sensitivity for the 2025-2026 models.

Figures

Figures reproduced from arXiv: 2605.29874 by Francisco León Zúñiga Bolívar (Institución Universitaria Colegio Mayor del Cauca).

Figure 1: Example Moran process trajectory for Claude 4.6 De…

Figure 2: Index of Differential Capabilities (ICD) per model and prompt style. Lower values indicate a larger cooperative payoff advantage. Shaded bands show the range of ICD values reported by Willis et al. for ChatGPT-4o (purple) and Claude 3.5 Sonnet (brown); markers indicate per-model averages. One notable exception to the monotonic Default→Prose→Refine trend is GPT-5.4 Mini: its Refine ICD (0.577) is substantial…

Figure 3: Moran process equilibrium distributions across all 48 conditions (12 model–prompt combinations…
Original abstract

Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or do scale and provider diversity reshape equilibrium behaviour in competitive multi-agent settings? Willis et al. established a benchmark for this question using evolutionary game theory and the Iterated Prisoner's Dilemma (IPD), finding consistent cooperative biases in ChatGPT-4o and Claude 3.5 Sonnet. We extend this benchmark to four frontier models released in 2025-2026 - Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, and GPT-5.4 Mini - applying the identical protocol across three prompting styles (Default, Prose, Self-Refine) and four population compositions (balanced and biased, with and without noise). Cooperative bias persists across providers (H1): nine of twelve model-prompt combinations favour cooperative equilibria in balanced noiseless conditions. Cross-provider divergence is substantial (H3): Gemini 2.5 Flash reaches up to 77% aggressive equilibria under biased conditions, while GPT-5.4 Mini reaches 70% cooperative equilibria under Self-Refine. Support for aggressive capability parity is partial (H2): Self-Refine raises ICD in all models and Claude Sonnet 4.6 Refine achieves the highest ICD in the dataset (0.913), but Default and Prose prompts show no systematic narrowing. Evidence on noise robustness is directionally positive but not robustly confirmed (H4): with n=500 Moran iterations per condition, average noise sensitivity is approximately 6 percentage points for Claude Sonnet 4.6 versus 13 pp for Claude 3.5 Sonnet, but this cross-study gap is not statistically significant once the predecessor's unreported sampling error is propagated. Provider identity, rather than model generation, is the strongest correlate of equilibrium outcomes; noise remains a universal challenge regardless of model size or vintage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript extends Willis et al.'s evolutionary game theory benchmark on LLM agents in the Iterated Prisoner's Dilemma to four 2025-2026 frontier models (Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, GPT-5.4 Mini) under Default/Prose/Self-Refine prompts and balanced/biased/noise conditions. It reports persistent cooperative bias in nine of twelve model-prompt combinations (H1), partial support for capability parity via Self-Refine (H2), substantial cross-provider divergence (H3), and non-significant noise-sensitivity differences after error propagation (H4), concluding that provider identity rather than model generation is the dominant correlate and that noise remains a universal challenge.

Significance. If the empirical patterns hold under verifiable statistics, the work supplies a timely cross-provider extension that isolates provider effects from generational scaling in multi-agent cooperation, with direct relevance to robust LLM agent design. The reuse of the prior protocol and n=500 Moran iterations per condition enable direct comparability.

major comments (3)
  1. [Abstract (H4)] Abstract (H4 paragraph): the claim that the observed 6 pp vs 13 pp noise-sensitivity gap is not statistically significant rests on propagating unreported sampling variance, run count, and error structure from Willis et al.; because those details are unavailable, the non-significance result cannot be independently verified and directly weakens support for the 'universal noise' conclusion that underpins the provider-over-generation ranking.
  2. [Results (H3)] Results section on H3 and provider comparisons: the assertion that 'provider identity, rather than model generation, is the strongest correlate' is presented without reported statistical tests, effect-size comparisons, or variance decomposition that would quantify the relative explanatory power of provider vs. vintage; the tabulated equilibrium percentages alone do not establish dominance.
  3. [Methods] Methods (sampling and error propagation): the manuscript states n=500 iterations per condition but provides no explicit error-propagation formula, assumed distribution for the predecessor study, or sensitivity analysis showing how alternative variance assumptions would affect the H4 p-value; this omission renders the cross-study non-significance claim non-reproducible from the given text.
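
The kind of sensitivity analysis requested in the third major comment could look roughly like the sketch below. It is not the authors' method. It assumes that each noise sensitivity is a single difference between a noiseless and a noisy equilibrium proportion, that all runs are independent, and that every proportion sits at the worst case p = 0.5; the predecessor's run count `n_prev` is unknown and is simply varied. The function names are hypothetical.

```python
import math


def worst_case_se(n_runs):
    """Upper bound on the SE of (p_noiseless - p_noisy), each from n_runs runs.

    Uses p = 0.5 for both proportions, which maximises binomial variance.
    """
    return math.sqrt(2 * 0.25 / n_runs)


def gap_z_score(sens_new, sens_prev, n_new, n_prev):
    """z-score for the difference between two noise sensitivities."""
    se_gap = math.sqrt(worst_case_se(n_new) ** 2 + worst_case_se(n_prev) ** 2)
    return (sens_prev - sens_new) / se_gap


if __name__ == "__main__":
    # 0.06 and 0.13 are the sensitivities quoted in the abstract.
    # n_prev is NOT reported, so several assumed values are tried.
    for n_prev in (100, 500, 2000, 10_000):
        z = gap_z_score(0.06, 0.13, n_new=500, n_prev=n_prev)
        print(f"assumed n_prev={n_prev:>6}: z = {z:.2f}")
```

Under these particular assumptions the z-score stays below the conventional 1.96 threshold for small assumed run counts and rises above it only when the predecessor's run count is assumed to be very large, which illustrates why the verdict cannot be checked without the missing details.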
minor comments (2)
  1. [Abstract] Abstract: the phrase 'nine of twelve model-prompt combinations' is stated without enumerating which combinations meet the cooperative criterion, reducing immediate interpretability.
  2. Figure or table captions (equilibrium percentages): axis labels and legend entries for the four population compositions should explicitly repeat the prompt-style abbreviations used in the text to avoid cross-referencing.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for these constructive comments. We agree that greater statistical transparency and reproducibility are required, particularly for cross-study claims and the provider-generation comparison. We will revise the manuscript to address each point.

Point-by-point responses
  1. Referee: [Abstract (H4)] Abstract (H4 paragraph): the claim that the observed 6 pp vs 13 pp noise-sensitivity gap is not statistically significant rests on propagating unreported sampling variance, run count, and error structure from Willis et al.; because those details are unavailable, the non-significance result cannot be independently verified and directly weakens support for the 'universal noise' conclusion that underpins the provider-over-generation ranking.

    Authors: We accept this point. The non-significance assessment depends on unreported details from Willis et al., preventing independent verification. We will revise the abstract to remove the statistical claim, stating only that the gap is directionally smaller while noting that significance cannot be confirmed from available data, and will accordingly qualify the 'universal noise' conclusion. revision: yes

  2. Referee: [Results (H3)] Results section on H3 and provider comparisons: the assertion that 'provider identity, rather than model generation, is the strongest correlate' is presented without reported statistical tests, effect-size comparisons, or variance decomposition that would quantify the relative explanatory power of provider vs. vintage; the tabulated equilibrium percentages alone do not establish dominance.

    Authors: The referee is correct that the claim rests on descriptive percentages without formal tests or decomposition. We will add statistical comparisons (e.g., regression or ANOVA with provider and generation as predictors) and effect-size metrics in the revised results section to quantify relative explanatory power. revision: yes

  3. Referee: [Methods] Methods (sampling and error propagation): the manuscript states n=500 iterations per condition but provides no explicit error-propagation formula, assumed distribution for the predecessor study, or sensitivity analysis showing how alternative variance assumptions would affect the H4 p-value; this omission renders the cross-study non-significance claim non-reproducible from the given text.

    Authors: We agree the methods section is insufficiently detailed. We will add the explicit propagation formula, the assumed distribution from the predecessor, and a sensitivity analysis under alternative variance assumptions. revision: yes
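
One simple way to quantify "relative explanatory power", as promised in the second response, is a per-factor eta-squared (between-group sum of squares divided by total sum of squares). The sketch below is only an illustration of that statistic, not the analysis the authors will run. The rows in `toy` are made-up placeholder values with anonymous labels, not results from the paper, and the function name `eta_squared` is hypothetical.

```python
from collections import defaultdict


def eta_squared(rows, factor, outcome="y"):
    """Share of the outcome's total variance explained by grouping on `factor`."""
    ys = [r[outcome] for r in rows]
    grand_mean = sum(ys) / len(ys)
    ss_total = sum((y - grand_mean) ** 2 for y in ys)
    groups = defaultdict(list)
    for r in rows:
        groups[r[factor]].append(r[outcome])
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return ss_between / ss_total


# Placeholder data only: NOT taken from the paper. "y" stands in for a
# cooperative-equilibrium share; providers and generations are anonymised.
toy = [
    {"provider": "A", "generation": "old", "y": 0.70},
    {"provider": "A", "generation": "new", "y": 0.66},
    {"provider": "B", "generation": "old", "y": 0.30},
    {"provider": "B", "generation": "new", "y": 0.34},
]

if __name__ == "__main__":
    for factor in ("provider", "generation"):
        print(factor, round(eta_squared(toy, factor), 3))
```

With real per-condition outcomes in place of `toy`, comparing the two eta-squared values would give a first descriptive answer to the referee's question; a full two-way ANOVA or regression would additionally handle unbalanced designs and interactions.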

Circularity Check

0 steps flagged

No significant circularity; purely empirical extension

Full rationale

The paper conducts fresh Moran-process simulations on four new 2025-2026 models under the Willis et al. protocol. All equilibrium percentages, ICD values, and hypothesis outcomes (H1-H4) are direct counts from the n=500 iterations per condition. No equations, fitted parameters, or self-citations reduce any reported result to a quantity defined inside the paper. The H4 error-propagation step references external unreported variance but does not alter or tautologically force the new measurements themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical extension of an existing benchmark protocol and introduces no new free parameters, axioms, or invented entities beyond the standard IPD payoff matrix and Moran process already defined in the referenced prior work.

axioms (1)
  • Domain assumption: the Iterated Prisoner's Dilemma payoff structure and Moran process update rules are identical to those used in Willis et al.
    Basis: the abstract states that the identical protocol is applied across all new model-prompt combinations.



Reference graph

Works this paper leans on

  1. Gati Aher, Rosa I. Arriaga, and Adam Tauman Kalai. 2023. Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies. arXiv:2208.10264 [cs.CL]
  2. Elif Akata, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz. 2025. Playing repeated games with large language models. Nature Human Behaviour 9 (2025), 1380–1390. doi:10.1038/s41562-025-02172-y. Published version of arXiv:2305.16867
  3. Robert Axelrod. 1984. The Evolution of Cooperation. Basic Books, New York
  4. Robert Axelrod and William D. Hamilton. 1981. The Evolution of Cooperation. Science 211, 4489 (1981), 1390–1396
  5. Philip Brookins and Jason M. DeBacker. 2023. Playing Games with GPT: What Can We Learn about a Large Language Model from Canonical Strategic Games? arXiv:2305.10912 [econ.GN]
  6. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]
  7. I. De Zarzà, J. De Curtò, Gemma Roig, Pietro Manzoni, and Carlos T. Calafate. 2023. Emergent Cooperation and Strategy Adaptation in Multi-Agent Systems: An Extended Coevolutionary Theory with LLMs. Electronics 12, 12 (2023), 2722
  8. Caoyun Fan, Jindou Chen, Yaohui Jin, and Hao He. 2024. Can Large Language Models Serve as Rational Players in Game Theory: A Systematic Analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 17960–17967
  9. Fulin Guo. 2023. GPT Agents in Game Theory Experiments. arXiv:2305.05516 [econ.GN]
  10. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. International Conference on Learning Representations (2021). arXiv:2009.03300
  11. Vincent Knight, Owen Campbell, Marc Harper, Karol Langner, James Campbell, Thomas Campbell, Alex Carney, Martin Chorley, Cameron Davidson-Pilon, Kristian Glass, et al. 2016. An Open Framework for the Reproducible Study of the Iterated Prisoner’s Dilemma. Journal of Open Research Software 4, 1 (2016), e35
  12. Joel Z. Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. 2017. Multi-Agent Reinforcement Learning in Sequential Social Dilemmas. In Proceedings of the 16th International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS). 464–473
  13. Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-Refine: Iterative Refinement with Self-Feedback. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36
  14. Patrick A. P. Moran. 1958. Random Processes in Genetics. Mathematical Proceedings of the Cambridge Philosophical Society 54, 1 (1958), 60–71
  15. Martin A. Nowak. 2006. Evolutionary Dynamics: Exploring the Equations of Life. Harvard University Press, Cambridge, MA
  16. Martin A. Nowak, Akira Sasaki, Christine Taylor, and Drew Fudenberg. 2004. Emergence of cooperation and evolutionary stability in finite populations. Nature 428, 6983 (2004), 646–650. doi:10.1038/nature02414
  17. Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442 [cs.HC]
  18. Kenneth Payne and Baptiste Alloui-Cros. 2025. Strategic Intelligence in Large Language Models: Evidence from Evolutionary Game Theory. arXiv:2507.02618 [cs.AI]
  19. Giorgio Piatti, Zhijing Jin, Max Kleiman-Weiner, Bernhard Schölkopf, Mrinmaya Sachan, and Rada Mihalcea. 2024. Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents. In Advances in Neural Information Processing Systems (NeurIPS 2024). arXiv:2404.16698 [cs.AI]
  20. Haoran Sun, Yusen Wu, Peng Wang, Wei Chen, Yukun Cheng, Xiaotie Deng, and Xu Chu. 2025. Game Theory Meets Large Language Models: A Systematic Survey with Taxonomy and New Frontiers. In Proceedings of IJCAI 2025. arXiv:2502.09053
  21. Arne Traulsen, Martin A. Nowak, and Jorge M. Pacheco. 2006. Stochastic dynamics of invasion and fixation. Physical Review E 74 (2006), 011909. doi:10.1103/PhysRevE.74.011909
  22. Aron Vallinder and Edward Hughes. 2024. Cultural Evolution of Cooperation among LLM Agents. arXiv:2412.10270 [cs.MA]. Extended Abstract at AAMAS 2025
  23. Lindi M. Wahl and Martin A. Nowak. 1999. The Continuous Prisoner’s Dilemma: II. Linear Reactive Strategies with Noise. Journal of Theoretical Biology 200, 3 (1999), 323–338
  24. Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024. A survey on large language model based autonomous agents. Frontiers of Computer Science 18, 6 (2024), 186345
  25. George Willis, Yali Du, Joel Z. Leibo, and Michael Luck. 2025. Do LLM Agents Cooperate or Defect? Evolutionary Dynamics in Multi-Agent Systems. arXiv:2501.16173 [cs.GT]
  26. Richard Willis, Jianing Zhao, Yali Du, and Joel Z. Leibo. 2026. Evaluating Collective Behaviour of Hundreds of LLM Agents. arXiv:2602.16662 [cs.MA]
  27. Jianzhong Wu and Robert Axelrod. 1995. How to Cope with Noise in the Iterated Prisoner’s Dilemma. Journal of Conflict Resolution 39, 1 (1995), 183–189
  28. Julian Yocum, Phillip Christoffersen, Mehul Damani, Justin Svegliato, Dylan Hadfield-Menell, and Stuart Russell. 2023. Mitigating Generative Agent Social Dilemmas. In Foundation Models for Decision Making Workshop, NeurIPS