CRAFT: Grounded Multi-Agent Coordination Under Partial Information
Pith reviewed 2026-05-15 01:03 UTC · model grok-4.3
The pith
Stronger reasoning models do not reliably coordinate better than smaller ones under partial information.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the CRAFT benchmark, agents receive partial 3D views and must coordinate through language to build a target structure; the evaluation reveals that stronger reasoning models do not reliably outperform smaller open-weight models, that improvements in individual communication do not translate into successful collaboration, and that failures cluster into a three-way taxonomy of spatial grounding errors, belief-modeling errors, and pragmatic-communication errors. The work formalizes the setting as a multi-sender Bounded Pragmatic Speaker problem and supplies behavioral failure profiles for both frontier and open-weight systems.
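For orientation, a sketch of the single-sender reduction of this formalization, reconstructed from a lemma in the paper's appendix (notation follows that lemma): the pragmatic speaker $\pi^{\star}$ reweights a base speaker $S^{\mathrm{base}}_{1}$ by a theory-of-mind listener $L^{\mathrm{ToM}}_{1}$, where $\lambda_{1}$ is a rationality weight and $R_{1}$ a reward over utterance $u_{1}$, state $s_{t}$, and target $T$:

$$
L^{\mathrm{ToM}}_{1}(z^{\star}_{1} \mid u_{1}, c_{1}) \;\propto\; \exp\!\big(\lambda_{1}\, R_{1}(u_{1}, s_{t}, T)\big),
\qquad
\pi^{\star}(u_{1} \mid z^{\star}_{1}, c_{1}) \;\propto\; S^{\mathrm{base}}_{1}(u_{1} \mid z^{\star}_{1}, c_{1})\, L^{\mathrm{ToM}}_{1}(z^{\star}_{1} \mid u_{1}, c_{1}),
$$

where $z^{\star}_{1}$ is the sender's intended referent and $c_{1}$ its context; the full multi-sender definition couples a product of such sender factors.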
What carries the argument
The three-way diagnostic framework that decomposes coordination failures into spatial grounding, belief modeling, and pragmatic communication errors.
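As a concrete illustration of how such a decomposition could be operationalized over annotated dialogue traces (the three categories are the paper's; the record fields, names, and tallying below are hypothetical, not the authors' code):

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum, auto

class FailureMode(Enum):
    SPATIAL_GROUNDING = auto()        # misdescribing or misreading spatial relations
    BELIEF_MODELING = auto()          # wrong model of what another agent knows
    PRAGMATIC_COMMUNICATION = auto()  # under-/over-informative or ambiguous utterances

@dataclass
class AnnotatedTurn:
    trace_id: str
    turn: int
    primary_error: FailureMode  # dominant observable failure in the trace

def failure_profile(turns: list[AnnotatedTurn]) -> dict[FailureMode, float]:
    """Normalized distribution of primary-error labels for one model."""
    counts = Counter(t.primary_error for t in turns)
    total = sum(counts.values()) or 1
    return {mode: counts.get(mode, 0) / total for mode in FailureMode}
```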
If this is right
- Multi-agent coordination under partial information remains a fundamentally unsolved challenge for current language models.
- Individual improvements in reasoning or communication do not guarantee better group performance.
- Smaller open-weight models can match or exceed frontier systems on coordination metrics.
- Benchmarks that probe pragmatic communication in multi-agent settings are needed beyond single-agent evaluations.
Where Pith is reading between the lines
- Training regimes that optimize single-agent reasoning or isolated communication may need explicit multi-agent objectives to transfer to collaborative settings.
- The absence of a frontier advantage on this task suggests that coordination ability may not follow the same scaling trends as other capabilities.
- Similar partial-information coordination problems in robotics or human-AI teams could be diagnosed with the same three-way taxonomy.
Load-bearing premise
The chosen 3D construction tasks and the three-way failure taxonomy accurately isolate general coordination skills without introducing artifacts that would not appear in other multi-agent settings.
What would settle it
Running the same models on a different multi-agent task with partial information, such as a text-based navigation or dialogue game, and finding that the relative coordination success rates reverse or that the failure taxonomy no longer accounts for most errors.
Original abstract
We introduce CRAFT, a multi-agent benchmark for evaluating pragmatic communication in large language models under strict partial information. In this setting, multiple agents with complementary but incomplete views must coordinate through natural language to construct a shared 3D structure that no single agent can fully observe. We formalize this problem as a multi-sender Bounded Pragmatic Speaker problem and provide a diagnostic framework that decomposes failures into spatial grounding, belief modeling and pragmatic communication errors, including a taxonomy of behavioral failure profiles in both frontier and open-weight models. Across a diverse set of models, including 8 open-weight and 7 frontier including reasoning models, we find that stronger reasoning ability does not reliably translate to better coordination: smaller open-weight models often match or outperform frontier systems, and improved individual communication does not guarantee successful collaboration. These results suggest that multi-agent coordination remains a fundamentally unsolved challenge for current language models. Our code can be found at https://github.com/csu-signal/CRAFT
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CRAFT, a multi-agent benchmark for pragmatic communication in LLMs under strict partial information. Agents with complementary but incomplete views of a 3D structure must coordinate via natural language to build it. The work formalizes the setting as a multi-sender Bounded Pragmatic Speaker problem, supplies a diagnostic framework that decomposes failures into spatial grounding, belief modeling, and pragmatic communication errors, and reports a taxonomy of behavioral failure profiles. Across 8 open-weight and 7 frontier models (including reasoning models), the central empirical finding is that stronger reasoning ability does not reliably translate to better coordination success; smaller open-weight models often match or outperform frontier systems, and improved individual communication does not guarantee collaboration.
Significance. If the results hold under rigorous controls, the work is significant for establishing that multi-agent coordination under partial observability remains a fundamental unsolved challenge for current LLMs, decoupled from individual reasoning strength. The open-source code at the provided GitHub link is a clear strength for reproducibility. The diagnostic taxonomy offers a useful lens for future work on pragmatic failures, though its validity depends on the independence of the three failure categories.
Major comments (2)
- [Abstract / Experimental Results] The abstract states results across 15 models (8 open-weight, 7 frontier) but supplies no details on exact metrics, number of trials per condition, statistical tests, variance across runs, or controls for prompt sensitivity. This information is load-bearing for the central claim that reasoning strength does not predict coordination success and must be added to allow verification.
- [Diagnostic Framework] The diagnostic framework decomposes failures into spatial grounding, belief modeling, and pragmatic communication. In the 3D construction setting, however, inability to describe or interpret spatial relations directly corrupts belief-state updates for other agents, rendering the three categories non-independent. No ablation that holds the partial-information protocol fixed while varying spatial complexity (e.g., 2D grid vs. 3D with occlusion) is reported, so observed performance gaps could be driven by spatial difficulty rather than the intended pragmatic deficit (see the sketch after this list).
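A sketch of how the proposed ablation arms could be specified, holding the communication protocol fixed while varying only geometry; all names and fields here are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AblationCondition:
    """One arm of the proposed spatial-complexity ablation: the
    partial-information protocol is held fixed; only geometry varies."""
    name: str
    dimensions: int   # 2 for a flat grid, 3 for full construction
    occlusion: bool   # whether an agent's view can hide placed blocks

CONDITIONS = [
    AblationCondition("2d_grid", dimensions=2, occlusion=False),
    AblationCondition("3d_no_occlusion", dimensions=3, occlusion=False),
    AblationCondition("3d_occlusion", dimensions=3, occlusion=True),
]
```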
Minor comments (1)
- [Abstract] The abstract mentions 'a taxonomy of behavioral failure profiles' but does not define how profiles are identified or quantified from agent traces; a brief operational definition or example would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and will incorporate revisions to improve the transparency of our experimental reporting and the robustness of our diagnostic framework.
Point-by-point responses
Referee: [Abstract / Experimental Results] The abstract states results across 15 models (8 open-weight, 7 frontier) but supplies no details on exact metrics, number of trials per condition, statistical tests, variance across runs, or controls for prompt sensitivity. This information is load-bearing for the central claim that reasoning strength does not predict coordination success and must be added to allow verification.
Authors: We agree that the abstract and main text require additional quantitative detail to support verification of the central claim. In the revision we will expand the abstract to report key metrics (task success rate and average turns to completion), state that each model-condition pair was evaluated over 100 independent trials, note the use of paired t-tests with Bonferroni correction for significance, report standard deviations across runs, and describe controls consisting of three distinct prompt templates per model with results averaged across templates. A new table summarizing means, variances, and p-values will be added to the experimental results section. (Revision: yes)
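A minimal sketch of the significance test this response describes — paired per-trial comparisons with a Bonferroni-corrected threshold; the data layout and names are assumptions, not the authors' code:

```python
import numpy as np
from scipy import stats

def compare_models(success_a, success_b, n_comparisons, alpha=0.05):
    """Paired t-test on per-trial success indicators (0/1) for two models
    evaluated on the same trials; threshold Bonferroni-corrected across
    all pairwise model comparisons."""
    t_stat, p_value = stats.ttest_rel(success_a, success_b)
    return p_value, p_value < alpha / n_comparisons

# 15 models -> C(15, 2) = 105 pairwise comparisons; 100 trials per pair
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=100).astype(float)
b = rng.integers(0, 2, size=100).astype(float)
p, significant = compare_models(a, b, n_comparisons=105)
print(f"p = {p:.3f}, significant after correction: {significant}")
```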
Referee: [Diagnostic Framework] The diagnostic framework decomposes failures into spatial grounding, belief modeling, and pragmatic communication. In the 3D construction setting, however, inability to describe or interpret spatial relations directly corrupts belief-state updates for other agents, rendering the three categories non-independent. No ablation that holds the partial-information protocol fixed while varying spatial complexity (e.g., 2D grid vs. 3D with occlusion) is reported, so observed performance gaps could be driven by spatial difficulty rather than the intended pragmatic deficit.
Authors: We acknowledge that spatial grounding errors can propagate into belief-state inaccuracies in the 3D setting, creating partial dependence between categories. Our current annotation protocol assigns the primary error label according to the dominant observable failure in each trace, with inter-annotator agreement of 0.82 (Cohen's kappa). To strengthen the framework we will add a dedicated subsection discussing category interdependence and will include a controlled 2D-grid ablation (identical partial-information protocol, reduced spatial complexity) in the revised experiments to quantify how much of the performance gap is attributable to pragmatic versus spatial factors. (Revision: partial)
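For reference, agreement of the kind reported (Cohen's kappa of 0.82) is a chance-corrected statistic computable directly from the two annotators' label sequences; a sketch using scikit-learn, with illustrative labels:

```python
from sklearn.metrics import cohen_kappa_score

# Primary-error labels assigned independently by two annotators to the same traces
annotator_1 = ["spatial", "belief", "pragmatic", "spatial", "belief", "pragmatic"]
annotator_2 = ["spatial", "belief", "pragmatic", "belief", "belief", "pragmatic"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")  # chance-corrected agreement
```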
Circularity Check
Empirical benchmark evaluation with no derivation chain
Full rationale
The paper presents an empirical benchmark (CRAFT) for multi-agent coordination under partial information, along with a diagnostic taxonomy of failure modes. No mathematical derivation, first-principles prediction, or fitted-parameter result is claimed; the central findings rest on observed performance differences across models rather than any quantity that reduces to its own inputs by construction. The formalization as a multi-sender Bounded Pragmatic Speaker problem is presented as a modeling choice, not a self-referential derivation. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear in the load-bearing claims. This is a standard empirical study whose conclusions are falsifiable via new experiments and therefore carries no circularity.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Standard assumptions of LLM evaluation benchmarks hold (prompts elicit intended behavior; human-designed tasks measure the targeted skills).
Forward citations
Cited by 1 Pith paper
When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation
Stronger reasoning models in LLMs reduce behavioral negotiation by defaulting to authority outcomes in multi-agent settings, unlike structured scaffolds that enable concessions.