Recognition: no theorem link
More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration
Pith reviewed 2026-05-10 18:14 UTC · model grok-4.3
The pith
Capability does not predict cooperation in LLM agents even when helping costs nothing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a designed environment with no strategic cost to helping, LLM agents still underperform collective optima, and higher-capability models show worse cooperation rates than lower-capability ones under identical prompts to maximize group revenue. A causal decomposition separates these cooperation failures from competence failures.
What carries the argument
The frictionless multi-agent setup that removes strategic complexity from helping decisions, combined with a decomposition that automates one side of agent communication to isolate cooperation failures from competence failures.
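The agent-facing prompts excerpted in the paper define a small JSON action vocabulary: send_message, send_information, broadcast, and submit_task. A minimal sketch of that vocabulary in Python follows; the field names mirror the prompt excerpts, while the helper functions and the example turn are illustrative, not the authors' code.

```python
# Illustrative constructors for the action formats shown in the paper's
# agent prompts. Field names follow the prompt excerpts; the helper
# functions themselves are a sketch, not the authors' implementation.
from typing import Any, Dict, List


def send_message(to: str, content: str) -> Dict[str, Any]:
    # Ask a specific agent for information needed to complete a task.
    return {"action": "send_message", "to": to, "content": content}


def send_information(to: str, information: List[str],
                     values: Dict[str, int]) -> Dict[str, Any]:
    # Fulfil a request; in this environment sharing costs the sender nothing.
    return {"action": "send_information", "to": to,
            "information": information, "values": values}


def broadcast(content: str) -> Dict[str, Any]:
    # Announce a need to all agents at once.
    return {"action": "broadcast", "content": content}


def submit_task(answer: str) -> Dict[str, Any]:
    # Submit a completed task; collective revenue accrues from submissions.
    return {"action": "submit_task", "answer": answer}


# Example turn: request a missing value, then (once it arrives) submit.
request = send_message("agent_2", "I need the Department 3 budget for my task.")
```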
If this is right
- Scaling model size or capability alone will not solve coordination problems in multi-agent systems.
- Explicit protocols can double performance for lower-competence models.
- Tiny sharing incentives can improve cooperation in models that otherwise under-helped (a sketch of one such incentive follows this list).
- Deliberate cooperative design remains necessary even when helping others carries no cost.
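One way to read the "tiny sharing incentive" finding is as a small bonus folded into an agent's individual payoff for each fulfilled information request. The additive form and the epsilon value below are assumptions made for illustration; the paper reports only that a small incentive improved weakly cooperative models.

```python
# Hypothetical payoff tweak illustrating a "tiny sharing incentive".
# The additive form and epsilon are assumptions; the paper states only
# that a small incentive improved models with weak cooperation.
def payoff(own_task_revenue: float, requests_fulfilled: int,
           epsilon: float = 0.01) -> float:
    return own_task_revenue + epsilon * requests_fulfilled
```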
Where Pith is reading between the lines
- Training on individual-task objectives may embed a bias against zero-cost helping that later scaling does not remove.
- Joint reward signals or shared training across agents could be required to produce reliable cooperation.
- The same pattern may appear in other low-friction domains such as documentation sharing or knowledge transfer inside organizations.
Load-bearing premise
The experimental setup truly removes all strategic complexity, and any performance gaps reflect cooperation shortfalls rather than differences in how the models interpret the shared instructions.
What would settle it
Running the same tasks while automating one agent's side of communication to always share helpful information and checking whether the capability gap in collective performance disappears.
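A minimal sketch of that test, assuming a wrapper that intercepts incoming requests and answers them automatically; the class and field names are hypothetical, and only the idea of automating one side of communication comes from the paper.

```python
# Hypothetical wrapper that automates one agent's cooperative side:
# every incoming request it can satisfy is answered immediately, so any
# remaining performance gap reflects competence rather than willingness.
from typing import Dict, List


class AlwaysShareWrapper:
    def __init__(self, inner_agent, holdings: Dict[str, int]):
        self.inner = inner_agent      # the model-driven agent under test
        self.holdings = holdings      # information this agent privately holds

    def act(self, observation: dict) -> dict:
        requests: List[str] = observation.get("incoming_requests", [])
        shareable = [k for k in requests if k in self.holdings]
        if shareable:
            # Forced cooperation: fulfil the request before anything else.
            return {"action": "send_information",
                    "to": observation["requester"],
                    "information": shareable,
                    "values": {k: self.holdings[k] for k in shareable}}
        # Otherwise defer to the model's own policy (task work, requests, submission).
        return self.inner.act(observation)
```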
Original abstract
Large language model (LLM) agents increasingly coordinate in multi-agent systems, yet we lack an understanding of where and why cooperation failures may arise. In many real-world coordination problems, from knowledge sharing in organizations to code documentation, helping others carries negligible personal cost while generating substantial collective benefits. However, whether LLM agents cooperate when helping neither benefits nor harms the helper, while being given explicit instructions to do so, remains unknown. We build a multi-agent setup designed to study cooperative behavior in a frictionless environment, removing all strategic complexity from cooperation. We find that capability does not predict cooperation: OpenAI o3 achieves only 17% of optimal collective performance while OpenAI o3-mini reaches 50%, despite identical instructions to maximize group revenue. Through a causal decomposition that automates one side of agent communication, we separate cooperation failures from competence failures, tracing their origins through agent reasoning analysis. Testing targeted interventions, we find that explicit protocols double performance for low-competence models, and tiny sharing incentives improve models with weak cooperation. Our findings suggest that scaling intelligence alone will not solve coordination problems in multi-agent systems and will require deliberate cooperative design, even when helping others costs nothing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines cooperative behavior in LLM agents within a multi-agent system engineered to be frictionless, where helping incurs zero personal cost yet yields collective gains. The core empirical result is that capability does not predict cooperation: OpenAI o3 attains only 17% of optimal group revenue while o3-mini attains 50%, under identical instructions to maximize collective performance. The authors introduce a causal decomposition that automates one agent's communication side to isolate cooperation failures from competence failures, trace origins via reasoning traces, and evaluate interventions including explicit protocols (which double low-competence performance) and small sharing incentives (which aid weak cooperators).
Significance. If the results are robust, the work demonstrates that scaling model intelligence alone is insufficient to resolve coordination failures in multi-agent LLM systems, even in zero-cost settings, and underscores the value of targeted cooperative mechanisms. The causal decomposition and intervention tests supply a concrete methodological template for dissecting cooperation versus competence in future LLM multi-agent research.
major comments (2)
- [§4] §4 (Experimental Setup): The claim that the environment 'removes all strategic complexity' is load-bearing for interpreting the 17% vs. 50% gap as a cooperation failure rather than differential instruction parsing or over-inference of hidden costs/repeated-game effects. The manuscript does not report explicit checks (e.g., post-hoc prompt paraphrasing or verification questions) confirming that o3 and o3-mini interpret 'maximize group revenue' identically.
- [§5] §5 (Causal Decomposition): While automating one side of communication usefully separates competence from cooperation, the decomposition does not include controls for asymmetric task comprehension; the reported performance difference could still arise from o3's greater tendency to question revenue mechanics or assume verification requirements rather than from lower willingness to cooperate at zero cost.
minor comments (2)
- [Figure 1] Figure 1 and Table 2: Error bars or confidence intervals are not shown for the reported percentages (17%, 50%), preventing assessment of whether the capability-cooperation dissociation is statistically reliable across runs.
- [§3.2] §3.2: The precise formula for 'optimal collective performance' (the denominator in the reported ratios) should be stated explicitly, including how revenue is aggregated across agents and any normalization constants.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of experimental validation. We address each major comment point by point below and describe the planned revisions.
Point-by-point responses
-
Referee: [§4] §4 (Experimental Setup): The claim that the environment 'removes all strategic complexity' is load-bearing for interpreting the 17% vs. 50% gap as a cooperation failure rather than differential instruction parsing or over-inference of hidden costs/repeated-game effects. The manuscript does not report explicit checks (e.g., post-hoc prompt paraphrasing or verification questions) confirming that o3 and o3-mini interpret 'maximize group revenue' identically.
Authors: We agree that explicit verification of instruction interpretation strengthens the attribution of performance differences to cooperation rather than comprehension. The environment was constructed to eliminate strategic elements through zero personal cost for helping, explicit collective objective, and absence of repeated interactions or hidden payoffs. In the revised manuscript we will add post-hoc checks consisting of paraphrased prompt variants and direct verification questions administered to both models, with results confirming equivalent parsing of 'maximize group revenue' and no differential inference of costs or game-theoretic structure. revision: yes
-
Referee: [§5] §5 (Causal Decomposition): While automating one side of communication usefully separates competence from cooperation, the decomposition does not include controls for asymmetric task comprehension; the reported performance difference could still arise from o3's greater tendency to question revenue mechanics or assume verification requirements rather than from lower willingness to cooperate at zero cost.
Authors: We acknowledge that the causal decomposition, while isolating the communication decision, leaves open the possibility of asymmetric task comprehension. To address this, the revision will incorporate explicit controls: both models will be administered identical comprehension questions on revenue mechanics and sharing rules prior to the main trials, and any observed differences will be reported and used to qualify the cooperation-failure analysis. revision: yes
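A minimal sketch of the kind of post-hoc check the rebuttal commits to, assuming a generic ask(model, prompt) completion call; the paraphrases, questions, and comparison step are illustrative rather than the authors' protocol.

```python
# Hypothetical comprehension check: pose the same verification questions
# under paraphrased instructions and compare answers across models before
# attributing performance gaps to cooperation. `ask` stands in for any
# chat-completion call; it is not a specific API.
PARAPHRASES = [
    "Maximize the total revenue earned by the whole group.",
    "Your goal is the combined revenue of all agents, not your own score.",
]
QUESTIONS = [
    "Does sending information to another agent reduce your own revenue? Answer yes or no.",
    "Is the objective individual revenue or group revenue?",
]


def comprehension_check(model, ask):
    answers = {}
    for instruction in PARAPHRASES:
        for question in QUESTIONS:
            answers[(instruction, question)] = ask(model, f"{instruction}\n\n{question}")
    return answers  # compared across models before the main trials
```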
Circularity Check
Empirical comparison with no derivation chain or self-referential reductions
Full rationale
The paper conducts an experimental study of LLM agents in a custom multi-agent revenue-maximization task. Performance is measured directly against an external optimal collective benchmark (17% for o3, 50% for o3-mini). No equations, fitted parameters, or derivations are present that could reduce results to author-defined inputs. The environment is constructed by design to remove strategic elements; this is an experimental choice, not a self-definitional loop. No load-bearing self-citations or uniqueness theorems from prior author work are invoked. The central claim rests on observed differences across distinct models under identical prompts, which are falsifiable against the benchmark and do not collapse by construction.
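The headline percentages read most naturally as achieved group revenue divided by the revenue attainable under full cooperation. The sketch below spells out that presumed normalization; how the optimum is actually computed is exactly what the referee's minor comment asks the authors to state.

```python
# Presumed normalization behind "17% of optimal collective performance":
# group revenue summed over agents, divided by the revenue an always-sharing
# oracle would achieve. The aggregation is an assumption pending the paper's
# explicit formula.
def collective_performance_ratio(agent_revenues, optimal_group_revenue):
    achieved = sum(agent_revenues)
    return achieved / optimal_group_revenue


# e.g. collective_performance_ratio([10.0, 7.0, 5.0], 130.0) is roughly 0.17
```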
Forward citations
Cited by 1 Pith paper
-
Hypothesis generation and updating in large language models
LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.
discussion (0)