pith. machine review for the scientific record.

arxiv: 2604.07821 · v1 · submitted 2026-04-09 · 💻 cs.MA · cs.AI · cs.CL

Recognition: no theorem link

More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:14 UTC · model grok-4.3

classification 💻 cs.MA · cs.AI · cs.CL
keywords LLM agents · multi-agent systems · cooperation · coordination · zero-cost collaboration · collective performance · capability vs cooperation

The pith

Capability does not predict cooperation in LLM agents even when helping costs nothing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a multi-agent environment in which agents can share information or resources at zero personal cost while instructions explicitly direct them to maximize total group revenue. In this frictionless setting, more capable models can achieve markedly lower collective performance than less capable ones (OpenAI o3 reaches 17% of the collective optimum where o3-mini reaches 50%) under identical prompts. A causal decomposition that automates one side of an agent's communication isolates whether shortfalls arise from failure to cooperate or from failure to understand the task. Targeted fixes such as explicit protocols or small sharing incentives raise performance, indicating that coordination failures persist independently of raw intelligence.
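A toy rendering of that payoff structure may help fix intuitions; every name and mechanic below is illustrative, not taken from the paper's code:

```python
# Minimal sketch of a frictionless sharing environment: transferring an
# information piece costs the sender nothing, yet unlocks group revenue.

def share(world, sender, receiver, piece):
    """Give `receiver` a copy of `piece`; the sender keeps it (zero cost)."""
    world[receiver].add(piece)   # only the receiver's state changes
    return world

def group_revenue(world, tasks):
    """A task pays out only when its owner holds every required piece."""
    return sum(1 for owner, needed in tasks if needed <= world[owner])

world = {"agent_1": {"Q4 sales data"}, "agent_3": {"Department 3 budget"}}
tasks = [("agent_1", {"Q4 sales data", "Department 3 budget"})]

before = group_revenue(world, tasks)                      # agent_1 is blocked
share(world, "agent_3", "agent_1", "Department 3 budget")
after = group_revenue(world, tasks)                       # zero-cost help pays
```

Withholding here changes nothing for agent_3 individually, which is exactly the neutrality the paper's instruction-utility gap turns on.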

Core claim

In a designed environment with no strategic costs to helping, LLM agents still underperform collective optima, with higher-capability models showing worse cooperation rates than lower-capability ones under identical prompts to maximize group revenue. A causal decomposition separates these cooperation failures from competence failures.

What carries the argument

The frictionless multi-agent setup that removes strategic complexity from helping decisions, combined with automated communication to isolate cooperation failures from competence failures.

If this is right

  • Scaling model size or capability alone will not solve coordination problems in multi-agent systems.
  • Explicit protocols can double performance for lower-competence models.
  • Tiny sharing incentives can improve cooperation in models that otherwise under-helped.
  • Deliberate cooperative design remains necessary even when helping others carries no cost.
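The incentive point above can be made concrete with a toy payoff tweak; `epsilon` and the whole payoff function are assumptions for illustration, not the paper's parameters:

```python
def sender_payoff(own_revenue, pieces_shared, epsilon=0.0):
    """Sender's individual payoff after sharing `pieces_shared` pieces.

    With the default epsilon=0, sharing is exactly payoff-neutral (the
    paper's baseline); a tiny positive epsilon is the kind of minimal
    sharing incentive the review says helped weakly cooperating models.
    """
    return own_revenue + epsilon * pieces_shared

neutral = sender_payoff(10, 5)              # sharing changes nothing
nudged = sender_payoff(10, 5, epsilon=0.1)  # small reward per piece shared
```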

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Training on individual-task objectives may embed a bias against zero-cost helping that later scaling does not remove.
  • Joint reward signals or shared training across agents could be required to produce reliable cooperation.
  • The same pattern may appear in other low-friction domains such as documentation sharing or knowledge transfer inside organizations.

Load-bearing premise

The experimental setup truly removes all strategic complexity, and any performance gaps reflect cooperation shortfalls rather than differences in how models interpret the shared instructions.

What would settle it

Running the same tasks while automating one agent's side of communication to always share helpful information and checking whether the capability gap in collective performance disappears.
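That check can be sketched as code; the episode loop and policies below are stand-ins under assumed interfaces, not the paper's harness:

```python
# Sketch of the decomposition: script one side of the exchange and see
# whether collective performance recovers.

def run_episode(request_policy, share_policy, holdings, tasks):
    """One pass: owners request missing pieces; holders may fulfill them."""
    revenue = 0
    for owner, needed in tasks:
        for piece in needed - holdings[owner]:
            holder = next(a for a, held in holdings.items() if piece in held)
            if request_policy(owner, piece) and share_policy(holder, piece):
                holdings[owner].add(piece)
        if needed <= holdings[owner]:
            revenue += 1
    return revenue

always_share = lambda agent, piece: True    # scripted oracle side
never_share = lambda agent, piece: False    # stand-in for a withholding model

tasks = [("A", {"p1", "p2"})]
with_oracle = run_episode(always_share, always_share,
                          {"A": {"p1"}, "B": {"p2"}}, tasks)
withheld = run_episode(always_share, never_share,
                       {"A": {"p1"}, "B": {"p2"}}, tasks)
```

If scripting the sharing side closes the gap (`with_oracle` high, `withheld` low), the shortfall was cooperation rather than competence.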

Figures

Figures reproduced from arXiv: 2604.07821 by Advait Yadav, Oliver Sourbut, Sid Black.

Figure 1
Figure 1. The instruction-utility gap. Agent 1 requests information from Agent 3 to complete a task. Agent 3 can cooperate or withhold. While the agents are instructed to maximize overall revenue, sending information has no effect on Agent 3's individual payoff; only Agent 1 benefits from receiving it. This neutrality for the sender creates the instruction-utility gap and drives cooperative failures. Across eight wid… view at source ↗
Figure 2
Figure 2. The two-step pipeline under perfect play. In round T, Agent A requests all missing pieces from holders and fulfills incoming requests from others. Other agents fulfill A's requests during their turns within the same round. By round T+1, Agent A has received the needed pieces, can submit completed tasks, and receives new tasks to maintain its queue. This two-step flow continually repeats for subsequent roun… view at source ↗
Figure 3
Figure 3. Final performance is uncorrelated with general capability. We use Chatbot Arena Elo scores as a proxy for capability. The dashed line shows the linear fit (R² = 0.025, p = 0.71). We evaluate eight widely used LLMs that differ in size, training pipelines, and intended use: Gemini-2.5-Pro (Google DeepMind, 2025b), Gemini-2.5-Flash (Google DeepMind, 2025a), Claude Sonnet 4 (Anthropic, 2025), OpenAI o3 (OpenAI, 2… view at source ↗
Figure 4
Figure 4. Failure mode decomposition. Models mapped by their cooperation rate versus competence rate. The diagonal separates cooperation-limited models from competence-limited models. To separate competence and cooperation failures, we run a causal decomposition experiment that automates one side of the exchange at a time. The two axes correspond to requesting information from other agents and sharing information… view at source ↗
Figure 5
Figure 5. Intervention effects. Performance impact of three interventions relative to baseline. Limited visibility produces the most variable effects. Smaller LLMs (o3-mini, GPT-4.1-mini) improve substantially when peer revenues and error notices are hidden, suggesting their baseline failures stemmed partly from defensive or competitive framing triggered by social comparison. However, Sonnet 4 degrades by 15%, in… view at source ↗
Original abstract

Large language model (LLM) agents increasingly coordinate in multi-agent systems, yet we lack an understanding of where and why cooperation failures may arise. In many real-world coordination problems, from knowledge sharing in organizations to code documentation, helping others carries negligible personal cost while generating substantial collective benefits. However, whether LLM agents cooperate when helping neither benefits nor harms the helper, while being given explicit instructions to do so, remains unknown. We build a multi-agent setup designed to study cooperative behavior in a frictionless environment, removing all strategic complexity from cooperation. We find that capability does not predict cooperation: OpenAI o3 achieves only 17% of optimal collective performance while OpenAI o3-mini reaches 50%, despite identical instructions to maximize group revenue. Through a causal decomposition that automates one side of agent communication, we separate cooperation failures from competence failures, tracing their origins through agent reasoning analysis. Testing targeted interventions, we find that explicit protocols double performance for low-competence models, and tiny sharing incentives improve models with weak cooperation. Our findings suggest that scaling intelligence alone will not solve coordination problems in multi-agent systems and will require deliberate cooperative design, even when helping others costs nothing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript examines cooperative behavior in LLM agents within a multi-agent system engineered to be frictionless, where helping incurs zero personal cost yet yields collective gains. The core empirical result is that capability does not predict cooperation: OpenAI o3 attains only 17% of optimal group revenue while o3-mini attains 50%, under identical instructions to maximize collective performance. The authors introduce a causal decomposition that automates one agent's communication side to isolate cooperation failures from competence failures, trace origins via reasoning traces, and evaluate interventions including explicit protocols (which double low-competence performance) and small sharing incentives (which aid weak cooperators).

Significance. If the results are robust, the work demonstrates that scaling model intelligence alone is insufficient to resolve coordination failures in multi-agent LLM systems, even in zero-cost settings, and underscores the value of targeted cooperative mechanisms. The causal decomposition and intervention tests supply a concrete methodological template for dissecting cooperation versus competence in future LLM multi-agent research.

major comments (2)
  1. [§4] §4 (Experimental Setup): The claim that the environment 'removes all strategic complexity' is load-bearing for interpreting the 17% vs. 50% gap as a cooperation failure rather than differential instruction parsing or over-inference of hidden costs/repeated-game effects. The manuscript does not report explicit checks (e.g., post-hoc prompt paraphrasing or verification questions) confirming that o3 and o3-mini interpret 'maximize group revenue' identically.
  2. [§5] §5 (Causal Decomposition): While automating one side of communication usefully separates competence from cooperation, the decomposition does not include controls for asymmetric task comprehension; the reported performance difference could still arise from o3's greater tendency to question revenue mechanics or assume verification requirements rather than from lower willingness to cooperate at zero cost.
minor comments (2)
  1. [Figure 1] Figure 1 and Table 2: Error bars or confidence intervals are not shown for the reported percentages (17%, 50%), preventing assessment of whether the capability-cooperation dissociation is statistically reliable across runs.
  2. [§3.2] §3.2: The precise formula for 'optimal collective performance' (the denominator in the reported ratios) should be stated explicitly, including how revenue is aggregated across agents and any normalization constants.
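On minor comment 2, one plausible shape for the requested quantity, offered as an assumption rather than the paper's definition, is total earned revenue summed over agents, normalized by revenue under the perfect-play pipeline:

```python
def performance_ratio(agent_revenues, optimal_total):
    """Collective performance as a fraction of the perfect-play optimum.

    agent_revenues: revenue each agent actually earned in an episode.
    optimal_total: group revenue when every needed piece is requested and
    fulfilled on the following round -- the assumed normalization constant,
    computed once per environment instance.
    """
    return sum(agent_revenues) / optimal_total

# a group earning 17 units against a 100-unit optimum scores 0.17, matching
# the scale of the headline 17% / 50% numbers
ratio = performance_ratio([10, 4, 3], 100)
```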

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of experimental validation. We address each major comment point by point below and describe the planned revisions.

Point-by-point responses
  1. Referee: [§4] §4 (Experimental Setup): The claim that the environment 'removes all strategic complexity' is load-bearing for interpreting the 17% vs. 50% gap as a cooperation failure rather than differential instruction parsing or over-inference of hidden costs/repeated-game effects. The manuscript does not report explicit checks (e.g., post-hoc prompt paraphrasing or verification questions) confirming that o3 and o3-mini interpret 'maximize group revenue' identically.

    Authors: We agree that explicit verification of instruction interpretation strengthens the attribution of performance differences to cooperation rather than comprehension. The environment was constructed to eliminate strategic elements through zero personal cost for helping, explicit collective objective, and absence of repeated interactions or hidden payoffs. In the revised manuscript we will add post-hoc checks consisting of paraphrased prompt variants and direct verification questions administered to both models, with results confirming equivalent parsing of 'maximize group revenue' and no differential inference of costs or game-theoretic structure. revision: yes

  2. Referee: [§5] §5 (Causal Decomposition): While automating one side of communication usefully separates competence from cooperation, the decomposition does not include controls for asymmetric task comprehension; the reported performance difference could still arise from o3's greater tendency to question revenue mechanics or assume verification requirements rather than from lower willingness to cooperate at zero cost.

    Authors: We acknowledge that the causal decomposition, while isolating the communication decision, leaves open the possibility of asymmetric task comprehension. To address this, the revision will incorporate explicit controls: both models will be administered identical comprehension questions on revenue mechanics and sharing rules prior to the main trials, and any observed differences will be reported and used to qualify the cooperation-failure analysis. revision: yes

Circularity Check

0 steps flagged

Empirical comparison with no derivation chain or self-referential reductions

Full rationale

The paper conducts an experimental study of LLM agents in a custom multi-agent revenue-maximization task. Performance is measured directly against an external optimal collective benchmark (17% for o3, 50% for o3-mini). No equations, fitted parameters, or derivations are present that could reduce results to author-defined inputs. The environment is constructed by design to remove strategic elements; this is an experimental choice, not a self-definitional loop. No load-bearing self-citations or uniqueness theorems from prior author work are invoked. The central claim rests on observed differences across distinct models under identical prompts, which are falsifiable against the benchmark and do not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the experimental environment is truly frictionless and that collective performance can be unambiguously measured; no free parameters or invented entities are described in the abstract.

pith-pipeline@v0.9.0 · 5513 in / 1053 out tokens · 44665 ms · 2026-05-10T18:14:26.734631+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Hypothesis generation and updating in large language models

    cs.LG 2026-05 unverdicted novelty 6.0

    LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.

Reference graph

Works this paper leans on

5 extracted references · 1 canonical work page · cited by 1 Pith paper

  1. [1]

    conditional

    URL https://arxiv.org/abs/2308.03688. Yat Long Lo, Christian Schroeder de Witt, Samuel Sokota, Jakob Nicolaus Foerster, and Shimon Whiteson. Cheap talk discovery and utilization in multi-agent reinforcement learning, 2023. URL https://arxiv.org/abs/2303.10733. Ryan Lowe, Jakob Foerster, Y-Lan Boureau, Joelle Pineau, and Yann Dauphin. On the pitfalls of mea...

  2. [2]

    Send message to another agent: {"action": "send_message", "to": "agent_2", "content": "I need the Department 3 budget for my task..."}

  3. [3]

    Send information to an agent who requested it: {"action": "send_information", "to": "agent_3", "information": ["Q4 sales data", "Department 3 budget"], "values": {"Q4 sales data": 75, "Department 3 budget": 80}}

  4. [4]

    Broadcast to all agents: {"action": "broadcast", "content": "Looking for Region 4 market data..."}

  5. [5]

    Submit a completed task (you can submit multiple tasks if ready): {"action": "submit_task", "answer": "Combined result of: Department 3 budget, Region 4 market data, Q4 sales data"} INSTRUCTIONS: - You have the information listed in "Your Information" section above. - To get information from others, you must request it and they must send it to you. - When...
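The extracted prompt fragments above share one small JSON action schema. A minimal validator for it might look like this; the required-field table is read off the fragments, while the dispatcher itself is hypothetical, not the paper's harness:

```python
import json

# Required fields per action type, as they appear in the prompt fragments.
REQUIRED = {
    "send_message": {"to", "content"},
    "send_information": {"to", "information", "values"},
    "broadcast": {"content"},
    "submit_task": {"answer"},
}

def parse_action(raw):
    """Parse one agent reply into a validated action dict, or raise ValueError."""
    action = json.loads(raw)
    kind = action.get("action")
    if kind not in REQUIRED:
        raise ValueError(f"unknown action: {kind!r}")
    missing = REQUIRED[kind] - action.keys()
    if missing:
        raise ValueError(f"{kind} is missing fields: {sorted(missing)}")
    return action

reply = ('{"action": "send_information", "to": "agent_3", '
         '"information": ["Q4 sales data"], "values": {"Q4 sales data": 75}}')
act = parse_action(reply)
```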