pith. machine review for the scientific record.

arxiv: 2604.20658 · v1 · submitted 2026-04-22 · 💻 cs.CL · cs.CY · cs.MA

Recognition: unknown

Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 23:49 UTC · model grok-4.3

classification 💻 cs.CL cs.CY cs.MA
keywords multi-agent LLMs · cooperative profiles · behavioral economics games · AI for science · team performance · collaborative workflows · resource constraints

The pith

Cooperative profiles from behavioral games predict how well LLM teams perform on scientific tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs' tendencies to cooperate, measured in six standard behavioral economics games, forecast how effectively groups of these models collaborate on AI-for-Science workflows. In these workflows, teams analyze data, build models, and produce reports while sharing limited resources such as compute budgets. The authors find that models showing cooperative patterns in the games lead their teams to higher accuracy, quality, and completion in the scientific outputs. These links hold after statistical controls for other model traits, suggesting cooperation functions as a distinct measurable property. If correct, the approach supplies a quick, low-cost screen for choosing models before running expensive multi-agent deployments.

Core claim

Teams composed of LLMs whose game-derived cooperative profiles favor coordination and investment in multiplicative team production produce scientific reports with superior accuracy, quality, and completion rates under shared budget constraints, with these associations persisting after controlling for multiple factors including general model ability.
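The "multiplicative team production" mechanism in this claim can be made concrete with a toy example. The production functions below are a hypothetical sketch, not the paper's actual payoff specification: they only illustrate why multiplicative production punishes greedy under-investment far more than additive production does.

```python
# Toy illustration (not the paper's setup): each agent splits a unit
# budget between private payoff and team investment. Team output is
# either the sum or the product of the investments.
def additive_output(investments):
    return sum(investments)

def multiplicative_output(investments):
    out = 1.0
    for x in investments:
        out *= x
    return out

cooperative = [0.9, 0.9, 0.9]  # all agents invest heavily in the team
one_greedy = [0.9, 0.9, 0.1]   # one agent free-rides

# Under additive production, one free-rider costs the team ~30%.
# Under multiplicative production, the same free-rider costs ~89%,
# because output is bottlenecked by the least cooperative member.
print(round(additive_output(cooperative), 3), round(additive_output(one_greedy), 3))            # 2.7 1.9
print(round(multiplicative_output(cooperative), 3), round(multiplicative_output(one_greedy), 3))  # 0.729 0.081
```

This asymmetry is why a game-derived preference for investing over hoarding could plausibly matter more in shared-budget workflows than raw individual ability.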

What carries the argument

Cooperative profiles, constructed from an LLM's choices across six behavioral economics games that isolate mechanisms such as coordination and contribution to collective production rather than individual gain.
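The review does not specify the six games or the exact metrics extracted from them (the referee flags this gap below). As a purely illustrative assumption, one plausible profile component is an agent's mean contribution rate in a repeated public goods game:

```python
# Hypothetical sketch of one "cooperative profile" component: mean
# contribution rate in a repeated public goods game. The game
# structure and metric are illustrative assumptions, not the paper's
# documented specification.
def public_goods_round(contributions, endowment=10.0, multiplier=1.6):
    """Each agent contributes part of its endowment; the pot is
    multiplied and split equally. Returns each agent's payoff."""
    n = len(contributions)
    share = sum(contributions) * multiplier / n
    return [endowment - c + share for c in contributions]

def contribution_rate(history, endowment=10.0):
    """Profile metric: fraction of the endowment contributed,
    averaged over rounds."""
    return sum(history) / (len(history) * endowment)

# One agent's contributions over five rounds of play:
history = [8.0, 7.0, 9.0, 6.0, 8.0]
print(contribution_rate(history))  # 0.76
```

A full profile would presumably concatenate several such per-game statistics; how the paper aggregates them across games is exactly what the referee asks the authors to specify.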

Load-bearing premise

Behavior observed in the stylized games isolates cooperation mechanisms that transfer to the coordination demands, resource sharing, and output needs of the specific AI-for-Science collaborative workflows.

What would settle it

A new experiment using the same six games and controls but different AI-for-Science tasks or held-out models that shows no reliable correlation between cooperative profiles and downstream team performance.

Figures

Figures reproduced from arXiv: 2604.20658 by Adarsh Bharathwaj, David Jurgens, Shivani Kumar.

Figure 1. Convergence of game metric estimates as a function of the number of simulations.
Figure 2. Behavioral profiles across six games. Each subplot shows one game, with models …
read the original abstract

Multi-agent systems built from teams of large language models (LLMs) are increasingly deployed for collaborative scientific reasoning and problem-solving. These systems require agents to coordinate under shared constraints, such as GPUs or credit balances, where cooperative behavior matters. Behavioral economics provides a rich toolkit of games that isolate distinct cooperation mechanisms, yet it remains unknown whether a model's behavior in these stylized settings predicts its performance in realistic collaborative tasks. Here, we benchmark 35 open-weight LLMs across six behavioral economics games and show that game-derived cooperative profiles robustly predict downstream performance in AI-for-Science tasks, where teams of LLM agents collaboratively analyze data, build models, and produce scientific reports under shared budget constraints. Models that coordinate effectively in games and invest in multiplicative team production (rather than greedy strategies) produce better scientific reports across three outcomes: accuracy, quality, and completion. These associations hold after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs not reducible to general ability. Our behavioral games framework thus offers a fast and inexpensive diagnostic for screening cooperative fitness before costly multi-agent deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper benchmarks 35 open-weight LLMs on six behavioral economics games to derive cooperative profiles and reports that these profiles predict downstream performance in multi-agent AI-for-Science workflows, where LLM teams analyze data, build models, and generate reports under shared budget constraints. Models showing effective coordination and investment in multiplicative team production (vs. greedy strategies) yield better outcomes on accuracy, quality, and completion metrics. The associations are claimed to persist after controlling for multiple factors, supporting the conclusion that cooperative disposition is a distinct, measurable LLM property not reducible to general ability. The framework is positioned as an inexpensive diagnostic for screening LLMs prior to costly multi-agent deployment.

Significance. If the central associations prove robust, the work offers a practical, low-cost screening tool for selecting LLMs in cooperative multi-agent scientific applications, bridging behavioral economics games with realistic AI-for-Science coordination demands. Strengths include the scale of benchmarking across 35 models, the use of multiple outcome metrics, and the attempt to provide falsifiable predictions via game-derived profiles that can be tested in follow-up studies. This could inform more efficient deployment of multi-agent systems where resource sharing and information handoffs are critical.

major comments (3)
  1. [Abstract] The claim that 'these associations hold after controlling for multiple factors' provides no details on the specific controls applied, effect sizes, data exclusions, or model selection criteria. This information is load-bearing for the assertion that cooperative disposition is distinct from general ability, as the central empirical result depends on demonstrating that the game metrics capture something beyond capability proxies.
  2. [Methods] Profile construction: The manuscript does not specify how cooperative profiles are aggregated from the six games, including the exact quantification of 'investment in multiplicative team production' versus greedy strategies or the aggregation rule across games. Without this, it is impossible to evaluate whether the profiles isolate mechanisms that transfer to the coordination, data partitioning, and report-synthesis demands of the AI-for-Science tasks.
  3. [Results] No evidence is presented that the statistical controls include single-agent performance on the scientific tasks themselves (or other direct capability proxies). This omission leaves the distinctness claim vulnerable, as the reported predictive associations could be driven by overall model quality rather than a separable cooperative disposition.
minor comments (2)
  1. [Methods] The paper would benefit from an appendix or table explicitly listing the six behavioral games, their payoff structures, and the precise metrics extracted for each LLM.
  2. [Figures] Figure captions and legends should more clearly indicate how the cooperative profiles are visualized and which statistical tests underlie the reported associations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below, indicating the revisions we will make to improve clarity and strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [Abstract] The claim that 'these associations hold after controlling for multiple factors' provides no details on the specific controls applied, effect sizes, data exclusions, or model selection criteria. This information is load-bearing for the assertion that cooperative disposition is distinct from general ability, as the central empirical result depends on demonstrating that the game metrics capture something beyond capability proxies.

    Authors: We agree that the abstract would be strengthened by greater specificity on this central claim. In the revision, we will expand the abstract to briefly describe the controls applied (model scale, parameter count, and baseline capability proxies), report key effect sizes, and note data exclusion criteria. Full details on model selection and robustness checks will be cross-referenced to the Methods and supplementary materials. This addresses the concern that the distinctness of cooperative disposition requires transparent support. revision: yes

  2. Referee: [Methods] Profile construction: The manuscript does not specify how cooperative profiles are aggregated from the six games, including the exact quantification of 'investment in multiplicative team production' versus greedy strategies or the aggregation rule across games. Without this, it is impossible to evaluate whether the profiles isolate mechanisms that transfer to the coordination, data partitioning, and report-synthesis demands of the AI-for-Science tasks.

    Authors: We acknowledge that the Methods section lacks sufficient detail on profile construction. We will revise this section to explicitly define the quantification of investment in multiplicative team production versus greedy strategies for each game and to specify the aggregation rule across the six games. This addition will allow readers to assess the transferability of the profiles to the coordination and synthesis demands of the AI-for-Science tasks. revision: yes

  3. Referee: [Results] No evidence is presented that the statistical controls include single-agent performance on the scientific tasks themselves (or other direct capability proxies). This omission leaves the distinctness claim vulnerable, as the reported predictive associations could be driven by overall model quality rather than a separable cooperative disposition.

    Authors: The referee correctly identifies that single-agent performance on the scientific tasks was not included among the reported controls. We will add these analyses to the Results section, presenting regressions both with and without single-agent accuracy, quality, and completion metrics as covariates. Updated coefficients and effect sizes will be reported to demonstrate that the cooperative profile remains predictive after accounting for direct task-specific capability. revision: yes
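The with-and-without-covariate comparison the authors promise here is a standard partialling exercise. A minimal sketch on synthetic data (all numbers below are invented; the coefficient values are illustrative, not the paper's results):

```python
import numpy as np

# Minimal sketch of the proposed analysis: regress team performance on
# the cooperative profile score, with and without a single-agent
# capability covariate. All data here are synthetic.
rng = np.random.default_rng(0)
n = 200
capability = rng.normal(size=n)                  # single-agent task skill
coop = 0.5 * capability + rng.normal(size=n)     # profile correlates with skill
team_perf = 0.6 * coop + 0.8 * capability + rng.normal(scale=0.5, size=n)

def ols(y, *covariates):
    """Least-squares coefficients, with an intercept column prepended."""
    X = np.column_stack([np.ones_like(y)] + list(covariates))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_raw = ols(team_perf, coop)               # coop coefficient, no control
b_ctl = ols(team_perf, coop, capability)   # coop coefficient, controlled

# The uncontrolled coefficient absorbs some of capability's effect; the
# distinctness claim needs the controlled coefficient to stay above zero.
print(round(b_raw[1], 2), round(b_ctl[1], 2))
```

If cooperative disposition were merely a capability proxy, the controlled coefficient would collapse toward zero; the rebuttal commits to showing it does not.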

Circularity Check

0 steps flagged

No circularity: empirical correlations from independent benchmarks

full rationale

The paper performs separate benchmarking of 35 LLMs on six behavioral economics games to extract cooperative profiles, then measures performance on distinct AI-for-Science collaborative workflows under budget constraints. Reported associations are statistical correlations after controls for multiple factors, with no equations, derivations, or self-referential definitions that reduce the downstream predictions to the game inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The central claim rests on observed transfer from stylized games to science tasks rather than any tautological redefinition or fitted-input renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract only; no explicit parameters, axioms, or invented entities are stated. The central claim implicitly rests on transferability of game behavior.

axioms (1)
  • domain assumption: Behavior observed in stylized behavioral economics games transfers to coordination demands in realistic multi-agent scientific workflows.
    The paper uses game-derived profiles as predictors for downstream science-task performance, requiring this transfer to hold.

pith-pipeline@v0.9.0 · 5499 in / 1372 out tokens · 59348 ms · 2026-05-09T23:49:47.837845+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

95 extracted references · 14 canonical work pages · 7 internal anchors
