
arxiv: 2606.30454 · v1 · pith:K2YGPTIOnew · submitted 2026-06-29 · ⚛️ physics.soc-ph · cs.AI

Collective cooperation without individual fidelity in LLM agents

Pith reviewed 2026-06-30 03:16 UTC · model grok-4.3

classification ⚛️ physics.soc-ph cs.AI
keywords LLM agents · Prisoner's Dilemma · cooperation dynamics · social networks · macro-micro dissociation · human-AI comparison · collective behavior · conditional cooperation

The pith

LLM agents reproduce human-like cooperation rates in networks but diverge in individual behaviors and decision rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs nine open-weight LLMs through the exact same networked Prisoner's Dilemma protocol, payoffs, and topologies used in a prior human experiment. One model captures the macro pattern of early cooperation decline followed by stabilization. At the individual level, however, the LLMs show narrower behavioral variation and different rules for when to cooperate based on neighbors' actions. Introducing random agents narrows some micro gaps yet leaves the mismatch in decision logic intact. The work therefore demonstrates that group-level agreement with human data can occur without matching the underlying distributions or mechanisms.
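To make the setup concrete, the following is a minimal sketch of one round of a networked Prisoner's Dilemma. It is not the paper's implementation: the payoff values, the ring topology, and the names `PAYOFF`, `ring_lattice`, `play_round`, and `cooperation_rate` are all illustrative assumptions, as is the rule that each agent plays a single action against every neighbor.

```python
# Hypothetical payoffs (not taken from the paper): (my_action, neighbor_action) -> my payoff.
# They only need to satisfy the Prisoner's Dilemma ordering T > R > P > S.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def ring_lattice(n, k=1):
    """Adjacency dict for a ring in which each node links to its k nearest
    neighbors on each side. Assumes n > 2 * k. The topology is illustrative."""
    return {i: sorted({(i + d) % n for d in range(-k, k + 1) if d != 0})
            for i in range(n)}


def play_round(graph, actions):
    """Assumption: each agent plays one action against all of its neighbors
    and collects the sum of the pairwise payoffs."""
    return {i: sum(PAYOFF[(actions[i], actions[j])] for j in graph[i])
            for i in graph}


def cooperation_rate(actions):
    """Fraction of agents that cooperated in this round (the macro observable)."""
    return sum(a == "C" for a in actions.values()) / len(actions)


graph = ring_lattice(6, k=1)
actions = {0: "C", 1: "C", 2: "D", 3: "C", 4: "D", 5: "D"}
payoffs = play_round(graph, actions)   # {0: 3, 1: 3, 2: 10, 3: 0, 4: 6, 5: 6}
rate = cooperation_rate(actions)       # 0.5
```

In this toy round the defector at node 2, surrounded by two cooperators, earns the most, which is the incentive that drives cooperation down over repeated rounds.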

Core claim

LLM populations can exhibit human-like collective cooperation dynamics in a networked Prisoner's Dilemma while underestimating individual heterogeneity and producing distinct conditional cooperation patterns, revealing a macro-micro dissociation where aggregate agreement does not imply fidelity at the level of behavioral distributions or decision rules.

What carries the argument

The macro-micro dissociation: aggregate cooperation trajectories match human benchmarks while individual heterogeneity and conditional cooperation rules do not.
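A toy illustration of why aggregate agreement does not pin down individual behavior: the sketch below builds two hypothetical populations with identical per-round cooperation rates but very different spreads of individual cooperation frequency. The data and the function names `macro_trajectory` and `individual_frequencies` are invented for illustration and do not come from the paper.

```python
from statistics import mean, pstdev


def macro_trajectory(history):
    """history[t][i] is 1 if agent i cooperated in round t, else 0.
    Returns the cooperation rate in each round."""
    return [mean(round_actions) for round_actions in history]


def individual_frequencies(history):
    """Fraction of rounds in which each agent cooperated."""
    n_agents = len(history[0])
    return [mean(r[i] for r in history) for i in range(n_agents)]


# Toy population A: two agents always cooperate, two always defect.
heterogeneous = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0]]
# Toy population B: every agent cooperates in exactly half of the rounds.
homogeneous = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]]

same_macro = macro_trajectory(heterogeneous) == macro_trajectory(homogeneous)
spread_a = pstdev(individual_frequencies(heterogeneous))  # 0.5
spread_b = pstdev(individual_frequencies(homogeneous))    # 0.0
```

Both populations cooperate at a rate of 0.5 in every round, yet only the first contains distinct behavioral types. A comparison limited to the aggregate trajectory would treat them as equivalent.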

If this is right

  • Validation of LLM agents as human surrogates must compare aggregate dynamics, individual heterogeneity, and context-dependent decision rules rather than outcome agreement alone.
  • Selected models can reproduce the early decline and later stabilization of cooperation rates across the network.
  • LLM populations underestimate individual-level heterogeneity compared with humans.
  • Conditional cooperation patterns generated by LLMs differ from those observed in humans.
  • Mixing in random agents improves some micro-level statistics but leaves the mismatch in decision rules unresolved.
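The random-agent mixture in the last point can be sketched as follows. This is a hedged illustration only: the summary does not say how random agents choose, so the 50/50 choice rule is an assumption, and `assign_agent_types`, `step`, and the stand-in `llm_policy` callable are hypothetical names rather than the authors' code.

```python
import random


def random_agent(neighbor_actions, rng):
    """Assumption: a random agent ignores context and picks C or D with equal probability."""
    return rng.choice(["C", "D"])


def assign_agent_types(n_agents, random_fraction, rng):
    """Label round(n_agents * random_fraction) agents as 'random' and the rest as 'llm'."""
    n_random = round(n_agents * random_fraction)
    random_ids = set(rng.sample(range(n_agents), n_random))
    return ["random" if i in random_ids else "llm" for i in range(n_agents)]


def step(types, graph, last_actions, llm_policy, rng):
    """One decision round. llm_policy is a stand-in callable that maps the
    neighbors' previous actions to 'C' or 'D'; in a real run it would wrap a model call."""
    new_actions = {}
    for i, agent_type in enumerate(types):
        seen = [last_actions[j] for j in graph[i]]
        if agent_type == "random":
            new_actions[i] = random_agent(seen, rng)
        else:
            new_actions[i] = llm_policy(seen)
    return new_actions
```

Sweeping `random_fraction` over a grid and recomputing the macro and micro statistics at each value gives a simple sensitivity analysis of how much the injected noise, rather than the model itself, accounts for any improved agreement.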

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Simulations relying on LLMs may systematically understate population diversity when forecasting social outcomes.
  • Prompt engineering or post-training on human choice data could be tested to reduce the gap in conditional cooperation rules.
  • The dissociation suggests LLMs might still be useful for coarse-grained trend prediction even if unsuitable for mechanism-level studies of cooperation evolution.
  • Repeating the protocol with different network sizes or payoff ratios would show whether the macro-micro split persists beyond the tested conditions.

Load-bearing premise

The human experiment supplies a direct benchmark because LLM agents are prompted with the identical interaction protocol, payoffs, and network structures.

What would settle it

A run of the same LLM on the original human network and payoffs that produces individual cooperation frequencies and conditional rules statistically indistinguishable from the human data.
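One way to operationalize "statistically indistinguishable" for the individual cooperation frequencies is a two-sample Kolmogorov-Smirnov comparison. The sketch below computes only the KS statistic (the largest gap between the two empirical distribution functions) in plain Python; the function name `ks_statistic` and the sample values are illustrative, and a real analysis would also need a p-value, for example from `scipy.stats.ks_2samp`.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum absolute difference between the
    empirical CDFs of the two samples, evaluated at every observed value."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of sample_a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of sample_b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical per-agent cooperation frequencies, for illustration only.
human_like = [0.0, 0.0, 1.0, 1.0]   # polarized individuals
model_like = [0.5, 0.5, 0.5, 0.5]   # everyone near the mean
gap = ks_statistic(human_like, model_like)  # 0.5
```

A statistic near 0 means the two distributions of individual behavior overlap closely. Here the gap is large even though both samples have the same mean, which is the pattern the dissociation claim describes.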

Original abstract

Large language models (LLMs) are increasingly used as agents in simulations of social systems, yet it remains unclear when their behavior can be interpreted as a faithful proxy for human decision-making. Here we test LLM agents against a direct empirical benchmark: a large-scale networked Prisoner's Dilemma experiment with human participants. Using the same interaction protocol, payoff structure, and network topologies, we compare nine open-weight LLMs with the human data. The selected model reproduces several macro-level features of cooperation dynamics, including the early decline and later stabilization of cooperation. This aggregate agreement, however, does not extend uniformly to finer levels of behavior. LLM populations underestimate individual-level heterogeneity and generate conditional cooperation patterns that differ from those observed in humans. Adding a fraction of random agents improves some aspects of micro-level agreement, but does not remove the mismatch in decision rules. These findings reveal a macro-micro dissociation in LLM-based social agents: collective outcomes can appear human-like even when the underlying behavioral distributions and mechanisms are not. They suggest that validating LLM agents as human surrogates requires comparisons across aggregate dynamics, individual heterogeneity, and context-dependent decision rules, rather than outcome-level agreement alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript compares nine open-weight LLMs as agents against human data from a large-scale networked Prisoner's Dilemma experiment, using identical interaction protocols, payoffs, and topologies. It reports that LLMs reproduce macro-level features such as the early decline and later stabilization of cooperation rates, but underestimate individual heterogeneity and produce different conditional cooperation patterns; adding random agents partially mitigates some micro mismatches but not the divergence in decision rules. The central claim is a macro-micro dissociation: aggregate outcomes can appear human-like even when underlying behavioral distributions and mechanisms do not.

Significance. If the experimental realizations are equivalent, the result is significant for the growing use of LLM agents in social simulations: it demonstrates that outcome-level agreement is insufficient validation and that multi-scale checks (aggregate dynamics, heterogeneity, context-dependent rules) are required. The direct empirical benchmark against human data is a methodological strength.

major comments (3)
  1. [Methods] Methods section: The description of LLM prompting, memory handling, neighbor visibility, and action selection format is insufficient to confirm that the information set and decision context match those given to human participants; without this, the reported micro-level mismatches cannot be unambiguously attributed to model limitations rather than implementation differences.
  2. [Results] Results, conditional cooperation subsection: The quantification of conditional cooperation patterns and the statistical comparison to human data omit details on the exact tests used, per-model sample sizes, and the operational definition of 'conditional cooperation,' making it impossible to evaluate the strength or robustness of the claimed divergence.
  3. [Results] Results on random agents: The addition of a fraction of random agents is introduced to improve micro agreement without a pre-specified criterion or analysis of how this alters the interpretation of the base LLM behavior, weakening the claim that the dissociation is intrinsic to the models.
minor comments (2)
  1. [Figures] Figure captions and legends should explicitly state the number of independent runs or agents per model to allow readers to assess variability.
  2. [Abstract] The abstract states 'nine open-weight LLMs' but the main text should include a table or list of the specific models and their parameter counts for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to improve clarity, transparency, and rigor where appropriate.

Point-by-point responses
  1. Referee: [Methods] Methods section: The description of LLM prompting, memory handling, neighbor visibility, and action selection format is insufficient to confirm that the information set and decision context match those given to human participants; without this, the reported micro-level mismatches cannot be unambiguously attributed to model limitations rather than implementation differences.

    Authors: We agree that the original Methods section lacked sufficient detail for full replicability and equivalence verification. In the revised manuscript we have added a dedicated subsection that specifies the exact prompt templates (including system and user messages), the memory buffer implementation (last k rounds with neighbor actions), the neighbor visibility format (identical to the human interface), and the constrained output parsing for binary cooperation/defection choices. These additions confirm that the information set and decision context match the human protocol, allowing the observed micro-level differences to be attributed to model properties. revision: yes

  2. Referee: [Results] Results, conditional cooperation subsection: The quantification of conditional cooperation patterns and the statistical comparison to human data omit details on the exact tests used, per-model sample sizes, and the operational definition of 'conditional cooperation,' making it impossible to evaluate the strength or robustness of the claimed divergence.

    Authors: We acknowledge the omission of these operational and statistical details. The revised subsection now provides: (i) the operational definition of conditional cooperation as the round-by-round probability of cooperation given the exact number of cooperating neighbors in the previous round; (ii) per-model sample sizes (N = 1000 independent runs per LLM, matching the scale of the human data); and (iii) the statistical procedures (two-sample Kolmogorov-Smirnov tests for distribution comparisons and mixed-effects logistic regressions with neighbor cooperation count as predictor). These additions enable direct assessment of the robustness of the reported divergence. revision: yes

  3. Referee: [Results] Results on random agents: The addition of a fraction of random agents is introduced to improve micro agreement without a pre-specified criterion or analysis of how this alters the interpretation of the base LLM behavior, weakening the claim that the dissociation is intrinsic to the models.

    Authors: The random-agent analysis was exploratory, intended to test whether added behavioral heterogeneity could mitigate micro mismatches. We accept that a pre-specified criterion and fuller sensitivity analysis would have been preferable. In revision we have (a) stated the exploratory rationale explicitly, (b) added a sensitivity analysis across random-agent fractions (0–0.5) reporting effects on both macro cooperation trajectories and micro conditional-cooperation distributions, and (c) clarified that the core macro–micro dissociation is demonstrated for the pure LLM populations while noting the interpretive limits of the mixed-agent results. The central claim is therefore retained but more carefully qualified. revision: partial
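The operational definition given in the second response (probability of cooperating in a round, given the number of neighbors who cooperated in the previous round) can be estimated from an action log roughly as follows. This is a sketch under assumptions: the data layout `history[t][i]`, the adjacency dict `graph`, and the function name `conditional_cooperation` are hypothetical, and the toy log is invented.

```python
from collections import defaultdict


def conditional_cooperation(history, graph):
    """history[t][i] is 'C' or 'D' for agent i in round t.
    Returns {k: P(agent cooperates at t | k of its neighbors cooperated at t-1)}."""
    counts = defaultdict(lambda: [0, 0])  # k -> [cooperations, observations]
    for prev, curr in zip(history, history[1:]):
        for i, neighbors in graph.items():
            k = sum(prev[j] == "C" for j in neighbors)
            counts[k][0] += curr[i] == "C"
            counts[k][1] += 1
    return {k: coop / total for k, (coop, total) in sorted(counts.items())}


# Toy example: three fully connected agents observed over three rounds.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
history = [["C", "C", "D"],
           ["C", "D", "D"],
           ["D", "D", "D"]]
curve = conditional_cooperation(history, graph)  # {0: 0.0, 1: 0.25, 2: 0.0}
```

Computing this curve separately for the human data and for each model gives the per-context comparison that the referee asked to see quantified.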

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark comparison

full rationale

The paper conducts an empirical comparison of LLM agents against human experimental data in a networked Prisoner's Dilemma, using identical interaction protocols, payoffs, and topologies. No mathematical derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps exist. The central claim rests on observed macro-micro dissociation in the data, which is externally falsifiable against the human benchmark and does not reduce to any input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work is empirical and rests on assumptions about experimental equivalence and LLM instruction-following capability rather than mathematical axioms or new entities.

axioms (2)
  • domain assumption The human participants' data serves as a faithful empirical benchmark for comparison with LLM agents.
    Invoked when stating the direct benchmark comparison using the same protocol.
  • domain assumption LLM agents can be instructed via prompts to participate in the game following the same rules as humans.
    Underlying the use of LLMs as agents in the simulation with identical payoffs and networks.
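The second assumption depends on how agents are actually prompted. As a purely hypothetical illustration of the ingredients named in the rebuttal (a memory of the last k rounds and constrained parsing of a binary choice), a prompt pipeline might look like the sketch below. The prompt wording, the class `AgentMemory`, and the functions `build_prompt` and `parse_action` are invented here and are not the authors' templates.

```python
from collections import deque


class AgentMemory:
    """Keeps only the most recent k rounds; older rounds are discarded automatically."""

    def __init__(self, k):
        self.rounds = deque(maxlen=k)

    def record(self, own_action, neighbor_actions, payoff):
        self.rounds.append((own_action, tuple(neighbor_actions), payoff))

    def render(self):
        if not self.rounds:
            return "No previous rounds."
        return "\n".join(
            f"You played {own}; neighbors played {', '.join(neigh)}; payoff {payoff}."
            for own, neigh, payoff in self.rounds
        )


def build_prompt(memory):
    # Hypothetical wording; the real templates are not given in this summary.
    return (
        "You are playing a repeated game with your network neighbors.\n"
        f"{memory.render()}\n"
        "Reply with exactly one word: COOPERATE or DEFECT."
    )


VALID_REPLIES = {"C": "C", "COOPERATE": "C", "D": "D", "DEFECT": "D"}


def parse_action(reply):
    """Strict parser: accepts a single recognized token, otherwise returns None
    so the caller can retry or log the failure instead of guessing."""
    token = reply.strip().rstrip(".").upper()
    return VALID_REPLIES.get(token)
```

Rejecting free-form replies rather than searching them for keywords avoids misreading a sentence such as "I will not cooperate" as a cooperative move. Whether the paper's parser works this way is not stated.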


