pith. machine review for the scientific record. sign in

arxiv: 2603.25268 · v2 · submitted 2026-03-26 · 💻 cs.CL · cs.AI

Recognition: no theorem link

CRAFT: Grounded Multi-Agent Coordination Under Partial Information

Authors on Pith no claims yet

Pith reviewed 2026-05-15 01:03 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multi-agent coordinationpartial informationpragmatic communicationlanguage modelsbenchmarkfailure taxonomy3D constructionbounded pragmatic speaker
0
0 comments X

The pith

Stronger reasoning models do not reliably coordinate better than smaller ones under partial information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CRAFT, a benchmark where multiple language-model agents with complementary but incomplete views of a 3D scene must communicate in natural language to assemble a shared structure. It decomposes coordination failures into spatial grounding, belief modeling, and pragmatic communication, then shows across frontier and open-weight models that advanced reasoning ability does not produce better group success and that better solo communication does not guarantee collaborative outcomes. A sympathetic reader would care because the results indicate that current language models still lack reliable mechanisms for grounded multi-agent interaction even when individual capabilities appear strong.

Core claim

In the CRAFT benchmark, agents receive partial 3D views and must coordinate through language to build a target structure; the evaluation reveals that stronger reasoning models do not outperform smaller open-weight models, that improvements in individual communication do not translate into successful collaboration, and that failures cluster into a three-way taxonomy of spatial grounding errors, belief-modeling errors, and pragmatic-communication errors. The work formalizes the setting as a multi-sender Bounded Pragmatic Speaker problem and supplies behavioral failure profiles for both frontier and open-weight systems.

What carries the argument

The three-way diagnostic framework that decomposes coordination failures into spatial grounding, belief modeling, and pragmatic communication errors.

If this is right

  • Multi-agent coordination under partial information remains a fundamentally unsolved challenge for current language models.
  • Individual improvements in reasoning or communication do not guarantee better group performance.
  • Smaller open-weight models can match or exceed frontier systems on coordination metrics.
  • Benchmarks focused on pragmatic communication are needed beyond single-agent evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training regimes that optimize single-agent reasoning or isolated communication may need explicit multi-agent objectives to transfer to collaborative settings.
  • The gap between frontier and open-weight models on this task suggests that coordination ability may not follow the same scaling trends as other capabilities.
  • Similar partial-information coordination problems in robotics or human-AI teams could be diagnosed with the same three-way taxonomy.

Load-bearing premise

The chosen 3D construction tasks and the three-way failure taxonomy accurately isolate general coordination skills without introducing artifacts that would not appear in other multi-agent settings.

What would settle it

Running the same models on a different multi-agent task with partial information, such as a text-based navigation or dialogue game, and finding that the relative coordination success rates reverse or that the failure taxonomy no longer accounts for most errors.

Figures

Figures reproduced from arXiv: 2603.25268 by Abhijnan Nath, Hannah VanderHoeven, Nikhil Krishnaswamy.

Figure 1
Figure 1. Figure 1: CRAFT overview. A structure generator creates a target 3D object and 3 private 2D views for Directors, enforcing information asymmetry. Each turn, Directors produce instructions from their partial views, and a Builder executes them in the CRAFT environment, which logs task progress and evaluates communication for spatial grounding, mind modeling, and pragmatic sufficiency. 2024, Zhu et al., 2026]. This cre… view at source ↗
Figure 2
Figure 2. Figure 2: Director perspective views for a 25 block structure with the “camera” at the “bottom” of the [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Failure taxonomy over all turns across 15 director models. Frontier models do not uniformly dominate open￾weight models [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Per turn oracle-prescribed vs. attempted remove rate, averaged across all 20 structures [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: LLM grader scores across three evaluation dimensions—spatial grounding (left), mind modeling (center), and pragmatic sufficiency (right)—broken down by question and model group. Error bars denote standard error of the mean across all structure–turn–director observations per model from 3 independent runs. elevated for Gemini-3-Flash relative to its otherwise strong performance: despite Directors correctly i… view at source ↗
Figure 6
Figure 6. Figure 6: Director perspective views for structure_016, a complex-tier structure with 25 total blocks. D1 (left column, j=0) sees a large yellow domino spanning (0, 0)–(1, 0) at L2, a small red at (0, 0) and large yellow domino spanning (1, 0)–(2, 0) at L1, and small orange, red, green blocks at L0. D2 (top row, i=0) sees a large orange domino spanning (0, 0)–(0, 1) with small red at (0, 2) at L0, a large red domino… view at source ↗
Figure 7
Figure 7. Figure 7: Turn-level outcome rates per model sorted by oracle adherence. Oracle available (orange) [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Three turns of zero progress in Qwen-32b ( [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Director prompt used in the experiments, section I. Note that personality and archetypes are [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Director prompt used in the experiments, section II. [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Director prompt used in the experiments, section III. [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Builder prompt used in the experiments, section I. [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Builder prompt used in the experiments, section II. [PITH_FULL_IMAGE:figures/full_fig_p028_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Builder prompt used in the experiments, section III. [PITH_FULL_IMAGE:figures/full_fig_p029_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Builder’s More Exploration Tool Call Prompt. Note that although [PITH_FULL_IMAGE:figures/full_fig_p030_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Spatial Grounding (SG) judge. Evaluated once per turn across 20 structures and 20 overall [PITH_FULL_IMAGE:figures/full_fig_p031_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Mind Model (MM) Judge. Evaluated once per turn across 20 structures and 20 overall [PITH_FULL_IMAGE:figures/full_fig_p032_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Pragmatic Sufficiency (PS) judge. Evaluated once per turn over the collective Director [PITH_FULL_IMAGE:figures/full_fig_p032_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Per-model LLM grader scores across all judge questions for spatial grounding (top), mind [PITH_FULL_IMAGE:figures/full_fig_p035_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Oracle-prescribed vs. attempted remove rate per turn for all base open-weight models. [PITH_FULL_IMAGE:figures/full_fig_p039_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Oracle-prescribed vs. attempted remove rate per turn for all frontier and proprietary [PITH_FULL_IMAGE:figures/full_fig_p040_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Three turns, zero progress: a CRAFT correction spiral (Qwen-32b, structure_001, T10–T14) (see [PITH_FULL_IMAGE:figures/full_fig_p041_22.png] view at source ↗
read the original abstract

We introduce CRAFT, a multi-agent benchmark for evaluating pragmatic communication in large language models under strict partial information. In this setting, multiple agents with complementary but incomplete views must coordinate through natural language to construct a shared 3D structure that no single agent can fully observe. We formalize this problem as a multi-sender Bounded Pragmatic Speaker problem and provide a diagnostic framework that decomposes failures into spatial grounding, belief modeling and pragmatic communication errors, including a taxonomy of behavioral failure profiles in both frontier and open-weight models. Across a diverse set of models, including 8 open-weight and 7 frontier including reasoning models, we find that stronger reasoning ability does not reliably translate to better coordination: smaller open-weight models often match or outperform frontier systems, and improved individual communication does not guarantee successful collaboration. These results suggest that multi-agent coordination remains a fundamentally unsolved challenge for current language models. Our code can be found at https://github.com/csu-signal/CRAFT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CRAFT, a multi-agent benchmark for pragmatic communication in LLMs under strict partial information. Agents with complementary but incomplete views of a 3D structure must coordinate via natural language to build it. The work formalizes the setting as a multi-sender Bounded Pragmatic Speaker problem, supplies a diagnostic framework that decomposes failures into spatial grounding, belief modeling, and pragmatic communication errors, and reports a taxonomy of behavioral failure profiles. Across 8 open-weight and 7 frontier models (including reasoning models), the central empirical finding is that stronger reasoning ability does not reliably translate to better coordination success; smaller open-weight models often match or outperform frontier systems, and improved individual communication does not guarantee collaboration.

Significance. If the results hold under rigorous controls, the work is significant for establishing that multi-agent coordination under partial observability remains a fundamental unsolved challenge for current LLMs, decoupled from individual reasoning strength. The open-source code at the provided GitHub link is a clear strength for reproducibility. The diagnostic taxonomy offers a useful lens for future work on pragmatic failures, though its validity depends on the independence of the three failure categories.

major comments (2)
  1. [Abstract / Experimental Results] The abstract states results across 15 models (8 open-weight, 7 frontier) but supplies no details on exact metrics, number of trials per condition, statistical tests, variance across runs, or controls for prompt sensitivity. This information is load-bearing for the central claim that reasoning strength does not predict coordination success and must be added to allow verification.
  2. [Diagnostic Framework] The diagnostic framework decomposes failures into spatial grounding, belief modeling, and pragmatic communication. In the 3D construction setting, however, inability to describe or interpret spatial relations directly corrupts belief-state updates for other agents, rendering the three categories non-independent. No ablation that holds the partial-information protocol fixed while varying spatial complexity (e.g., 2D grid vs. 3D with occlusion) is reported, so observed performance gaps could be driven by spatial difficulty rather than the intended pragmatic deficit.
minor comments (1)
  1. [Abstract] The abstract mentions 'a taxonomy of behavioral failure profiles' but does not define how profiles are identified or quantified from agent traces; a brief operational definition or example would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will incorporate revisions to improve the transparency of our experimental reporting and the robustness of our diagnostic framework.

read point-by-point responses
  1. Referee: [Abstract / Experimental Results] The abstract states results across 15 models (8 open-weight, 7 frontier) but supplies no details on exact metrics, number of trials per condition, statistical tests, variance across runs, or controls for prompt sensitivity. This information is load-bearing for the central claim that reasoning strength does not predict coordination success and must be added to allow verification.

    Authors: We agree that the abstract and main text require additional quantitative detail to support verification of the central claim. In the revision we will expand the abstract to report key metrics (task success rate and average turns to completion), state that each model-condition pair was evaluated over 100 independent trials, note the use of paired t-tests with Bonferroni correction for significance, report standard deviations across runs, and describe controls consisting of three distinct prompt templates per model with results averaged across templates. A new table summarizing means, variances, and p-values will be added to the experimental results section. revision: yes

  2. Referee: [Diagnostic Framework] The diagnostic framework decomposes failures into spatial grounding, belief modeling, and pragmatic communication. In the 3D construction setting, however, inability to describe or interpret spatial relations directly corrupts belief-state updates for other agents, rendering the three categories non-independent. No ablation that holds the partial-information protocol fixed while varying spatial complexity (e.g., 2D grid vs. 3D with occlusion) is reported, so observed performance gaps could be driven by spatial difficulty rather than the intended pragmatic deficit.

    Authors: We acknowledge that spatial grounding errors can propagate into belief-state inaccuracies in the 3D setting, creating partial dependence between categories. Our current annotation protocol assigns the primary error label according to the dominant observable failure in each trace, with inter-annotator agreement of 0.82 Cohen's kappa. To strengthen the framework we will add a dedicated subsection discussing category interdependence and will include a controlled 2D-grid ablation (identical partial-information protocol, reduced spatial complexity) in the revised experiments to quantify how much of the performance gap is attributable to pragmatic versus spatial factors. revision: partial

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivation chain

full rationale

The paper presents an empirical benchmark (CRAFT) for multi-agent coordination under partial information, along with a diagnostic taxonomy of failure modes. No mathematical derivation, first-principles prediction, or fitted-parameter result is claimed; the central findings rest on observed performance differences across models rather than any quantity that reduces to its own inputs by construction. The formalization as a multi-sender Bounded Pragmatic Speaker problem is presented as a modeling choice, not a self-referential derivation. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear in the load-bearing claims. This is a standard empirical study whose conclusions are falsifiable via new experiments and therefore carries no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The benchmark relies on standard assumptions of LLM prompting and evaluation; no free parameters, ad-hoc axioms, or invented entities are introduced beyond the task definition itself.

axioms (1)
  • domain assumption Standard assumptions of LLM evaluation benchmarks hold (prompts elicit intended behavior, human-designed tasks measure the targeted skills).
    Invoked implicitly when interpreting model outputs as evidence of coordination ability.

pith-pipeline@v0.9.0 · 5468 in / 1150 out tokens · 48648 ms · 2026-05-15T01:03:02.364170+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation

    cs.LG 2026-04 unverdicted novelty 6.0

    Stronger reasoning models in LLMs reduce behavioral negotiation by defaulting to authority outcomes in multi-agent settings, unlike structured scaffolds that enable concessions.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 1 Pith paper

  1. [1]

    Paul Humphreys

    URLhttps://arxiv.org/abs/2312.10256. Paul Humphreys. How properties emerge.Philosophy of science, 64(1):1–17, 1997. Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A Smith, Iz Beltagy, et al. Camels in a changing climate: Enhancing lm adaptation with tulu 2.arXiv preprint arXiv:...

  2. [2]

    i’d rather just go to bed

    URLhttps://arxiv.org/abs/2511.15722. Annie Louis, Dan Roth, and Filip Radlinski. “i’d rather just go to bed”: Understanding indirect answers. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7411–7425, 2020. 12 Andrei Lupu, Brandon Cui, Hengyuan Hu, and Jakob Foerster. Trajectory diversity for zero-sh...

  3. [3]

    permutation-invariant

    The joint ToM listener (Definition 2) reduces to: LToMjoint(z⋆ 1 |u 1, c1)∝exp(λ 1R1(u1, st,T)) =:L ToM1(z⋆ 1 |u 1, c1),(10) since all terms λiRi = 0 for i >1 . Substituting into Equation 4, the product over base speakers collapses to a single factor: π⋆(u1 |z ⋆ 1 , c1)∝S base1(u1 |z ⋆ 1 , c1)·L ToM1(z⋆ 1 |u 1, c1),(11) which is Equation 1 exactly. 17 Lem...

  4. [4]

    To place small block: PLACE:block_code:position:layer:CONFIRM:interpretation Example:PLACE:bs:(0,0):0:CONFIRM:Placing blue small block at bottom-left of D1’s side as requested

  5. [5]

    To place large block: PLACE:block_code:position:layer:span_to:CONFIRM:interpretation Example: PLACE:gl:(0,0):0:(1,0):CONFIRM:Placing large green block across left and middle cells of D1’s bottom layer

  6. [6]

    To remove small block: REMOVE:position:layer:CONFIRM:interpretation Example:REMOVE:(1,2):0:CONFIRM:Removing the block from middle-right of D3’s side as requested

  7. [7]

    To remove large block: REMOVE:position:layer:span_to:CONFIRM:interpretation Example: REMOVE:(2,2):0:(2,1):CONFIRM:Removing large green block from D3’s bottom layer as requested NOTE: REMOVE never includes block code — do NOT writeREMOVE:bl:(0,0):

  8. [8]

    Figure 14: Builder prompt used in the experiments, section III

    To clarify: CLARIFY:your specific question Example:CLARIFY:Which blue block should I move - the one on top or bottom? Always include CONFIRM section to show what you understood from their instructions. Figure 14: Builder prompt used in the experiments, section III. 29 Builder Prompt (IV): Tool Calling and Move Exploration TOOL MODE —simulate_moveavailable...

  9. [9]

    Simulate each director’s instruction once directly and literally

  10. [10]

    progress

    Pick the result with greatest value for"progress"

  11. [11]

    DO NOT INVENT NEW MOVE AFTER SIMULATING

    Submit that exact move as your FINAL answer. DO NOT INVENT NEW MOVE AFTER SIMULATING

  12. [12]

    If a sim fails (ok=False)→fix ONLY the field the hint specifies, retry once

  13. [13]

    NEVER submit a move that returnedok=False

  14. [14]

    In that case,CLARIFYinstead

    NEVER submit a remove move where simulate shows structurePlacement=False — even if it’s the only ok=True simulation. In that case,CLARIFYinstead

  15. [15]

    NEVER clarify just because directors disagree — simulate and pick the best

  16. [16]

    Remove-gap

    NEVER remove a block where simulate showsstructurePlacement=Falsefor that remove. Figure 15: Builder’s More Exploration Tool Call Prompt. Note that although CRAFT provides this facility, this is not explored in our current benchmark, in favor of oracle moves in the Builder’s observation space in order to restrict the action space of the Builder for contro...