
arXiv: 2505.11556 · v4 · pith:PSBDAAIWnew · submitted 2025-05-15 · 💻 cs.CL · cs.AI · cs.MA

Systematic Failures in Collective Reasoning under Distributed Information in Multi-Agent LLMs

Pith reviewed 2026-05-22 14:18 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.MA
keywords multi-agent LLMs · collective reasoning · distributed information · information asymmetry · HiddenBench · Hidden Profile paradigm · multi-agent decision making

The pith

Multi-agent LLMs achieve only 30.1% accuracy under distributed information compared to 80.7% for single agents with complete information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that groups of LLMs struggle to combine pieces of information that are split across members, even though each member could solve the problem if given everything at once. Agents default to what everyone has already said and do not ask about or consider facts that others might hold privately. This pattern holds across many models, prompting styles, and group sizes, and it gets worse with larger groups. The authors trace the problem to a failure to detect latent information asymmetry rather than to weak individual reasoning.

Core claim

Multi-agent LLMs cannot recognize or act under latent information asymmetry; they fail to reason about what others might know but have not yet expressed, leading to premature convergence on shared evidence while critical distributed facts remain unexplored.

What carries the argument

HiddenBench, a 65-task benchmark built on the Hidden Profile paradigm that isolates collective discovery of distributed facts from individual reasoning ability.

If this is right

  • The accuracy gap remains across prompting strategies, communication depths, and group sizes.
  • Collective performance declines as the number of agents grows.
  • A lightweight structured communication protocol raises accuracy across different model families.
  • Neither model scale nor single-agent accuracy on complete information predicts group success.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Systems that add an explicit step for agents to list what they think others might still be holding could reduce the observed failures without retraining.
  • The same pattern may limit multi-agent performance in open-ended tasks such as joint planning or evidence synthesis where facts are also unevenly held.
  • Benchmarks that require agents to maintain and update models of each other's private knowledge would expose whether current architectures can ever handle asymmetry without external scaffolding.

Load-bearing premise

The HiddenBench tasks cleanly separate collective reasoning under distributed information from limits in single-agent skill or from artifacts of prompting and communication format.

What would settle it

Run the same 65 tasks with a protocol that forces each agent to explicitly request unshared facts from the others, then check whether group accuracy approaches the 80.7% single-agent baseline.
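
The forced-request step can be pictured with a toy model in which facts are plain strings and no LLM is involved. This is only a sketch under stated assumptions: the `Agent` class, the `forced_request_round` function, and the fact labels are hypothetical names invented for illustration, not the paper's protocol or code.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical agent: a name plus the facts it holds at the start."""
    name: str
    initial_facts: set[str]
    known_facts: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        # An agent starts out knowing only its own initial facts.
        self.known_facts = set(self.initial_facts)


def forced_request_round(agents: list[Agent]) -> None:
    """Every agent asks every other agent for facts it has not seen yet."""
    # Snapshot first so the order in which agents ask does not matter.
    snapshot = {a.name: set(a.known_facts) for a in agents}
    for asker in agents:
        for other in agents:
            if other is not asker:
                asker.known_facts |= snapshot[other.name]


# Toy setup (invented data): one shared fact, one hidden fact per agent.
agents = [
    Agent("A", {"shared_1", "hidden_a"}),
    Agent("B", {"shared_1", "hidden_b"}),
    Agent("C", {"shared_1", "hidden_c"}),
]
forced_request_round(agents)

all_facts = set().union(*(a.initial_facts for a in agents))
pooled = all(a.known_facts == all_facts for a in agents)
print(pooled)  # True: after one forced round every agent holds every fact
```

In this toy setting one round is enough to pool everything. Whether real LLM agents would then reach the single-agent baseline is exactly the open question the experiment above is meant to answer.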

Original abstract

Multi-agent systems built on large language models (LLMs) are expected to enhance decision-making by pooling distributed information, yet systematically evaluating this capability has remained challenging. We introduce HiddenBench, a 65-task benchmark grounded in the Hidden Profile paradigm, which isolates collective reasoning under distributed information from individual reasoning ability. Evaluating 15 frontier LLMs, we find that multi-agent LLMs achieve only 30.1% accuracy under distributed information, compared to 80.7% accuracy for single agents given complete information. We trace this gap to a systematic failure mode: agents cannot recognize or act under latent information asymmetry -- they fail to reason about what others might know but have not yet expressed, leading to premature convergence on shared evidence while critical distributed facts remain unexplored. These failures persist across prompting strategies, communication depths, and group sizes -- and worsen as groups scale. While some models (e.g., Gemini-2.5-Flash/Pro) outperform others, neither model scale nor individual reasoning accuracy reliably predicts collective performance. We further show that this bottleneck is actionable: a lightweight structured communication protocol substantially improves collective reasoning across model families. Our results identify failures in collective information exploration in decision-making as a key limitation of multi-agent LLMs, and provide a theory-grounded, reproducible framework for diagnosing collective reasoning failures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HiddenBench, a 65-task benchmark grounded in the Hidden Profile paradigm, to evaluate collective reasoning under distributed information in multi-agent LLMs. Experiments across 15 frontier models show multi-agent accuracy of 30.1% under distributed information versus 80.7% for single agents given complete information. The authors trace the gap to agents' failure to recognize or act on latent information asymmetry, causing premature convergence on shared evidence while unexpressed facts remain unexplored. This pattern persists across prompting strategies, communication depths, and group sizes (worsening with scale), is not predicted by model scale or individual accuracy, and is mitigated by a lightweight structured communication protocol.

Significance. If the central results hold, the work identifies a concrete, actionable limitation in multi-agent LLM systems for information-pooling tasks that are common in real-world decision-making. The broad empirical scope (15 models, multiple conditions) and the theory-grounded benchmark provide a reproducible diagnostic framework. The work deserves credit for showing that the failure is not fixed by scale or individual capability and for demonstrating improvement via a simple protocol. This could shape the design of collaborative LLM agents.

major comments (2)
  1. [HiddenBench Task Construction (§3)] HiddenBench construction: The 30.1% vs. 80.7% gap and the claimed failure mode (inability to reason about latent asymmetry) are load-bearing only if tasks satisfy hidden-profile criteria—no agent solves from partial info alone, union of shares yields the correct answer, and shared subset biases toward error. The manuscript must supply explicit validation (e.g., per-task or aggregate single-agent partial-info baselines) to confirm isolation from individual capability or task artifacts.
  2. [Experimental Results and Baselines (§4)] Baseline fairness: The single-agent complete-information accuracy (80.7%) is the key comparator; the manuscript must confirm these baselines used identical prompting and context presentation as the multi-agent agents to avoid confounding the gap with differences in input format or instruction.
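
The three hidden-profile criteria named in the first major comment can be written as a per-task check. The sketch below is illustrative only: `TaskRecord`, its fields, and the example answers are hypothetical and not taken from HiddenBench, and "shared subset biases toward error" is approximated crudely as "the answer from shared facts alone is wrong".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskRecord:
    """Hypothetical per-task log of single-agent answers under each condition."""
    correct: str
    partial_answers: list[str]   # one answer per agent's partial view
    full_info_answer: str        # answer given the union of all facts
    shared_only_answer: str      # answer given only the shared facts


def hidden_profile_checks(task: TaskRecord) -> dict[str, bool]:
    """Evaluate the three criteria for one task."""
    return {
        # 1. No agent solves the task from its partial view alone.
        "partial_views_fail": all(a != task.correct for a in task.partial_answers),
        # 2. The union of all shares yields the correct answer.
        "union_solves": task.full_info_answer == task.correct,
        # 3. Shared facts alone point to a wrong answer (crude proxy for bias).
        "shared_misleads": task.shared_only_answer != task.correct,
    }


# Toy example with invented answers.
example = TaskRecord(
    correct="B",
    partial_answers=["A", "A", "C"],
    full_info_answer="B",
    shared_only_answer="A",
)
checks = hidden_profile_checks(example)
is_valid = all(checks.values())
print(checks, is_valid)
```

Aggregating `is_valid` over all tasks would give the kind of validation table the referee asks for, assuming such per-condition answers are logged.
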
minor comments (2)
  1. [Results] Clarify whether the reported averages are macro- or micro-averaged across the 65 tasks and whether variance or per-model breakdowns are provided in the main text or appendix.
  2. [Mitigation Experiments] The structured protocol is presented as a mitigation; a brief ablation showing which protocol elements (e.g., explicit asymmetry prompts) drive the gains would strengthen the claim that it targets the identified failure mode.
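
The macro/micro distinction in the first minor comment matters whenever tasks contribute different numbers of trials. A minimal sketch with invented counts (not the paper's data) shows how far the two averages can diverge:

```python
# Each tuple is (correct_trials, total_trials) for one task; the numbers are made up.
per_task = [(9, 10), (1, 10), (2, 40)]

# Micro-average: pool all trials, then divide. Tasks with more trials weigh more.
micro = sum(c for c, _ in per_task) / sum(n for _, n in per_task)

# Macro-average: compute accuracy per task, then take the unweighted mean.
macro = sum(c / n for c, n in per_task) / len(per_task)

print(f"micro={micro:.2f} macro={macro:.2f}")  # micro=0.20 macro=0.35
```

If every task is run the same number of times the two coincide, which is one reason the report asks the authors to state which average is used.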

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment of the work's significance and for the constructive major comments. We address each point below and have revised the manuscript accordingly to provide the requested validations and clarifications.

Point-by-point responses
  1. Referee: [HiddenBench Task Construction (§3)] HiddenBench construction: The 30.1% vs. 80.7% gap and the claimed failure mode (inability to reason about latent asymmetry) are load-bearing only if tasks satisfy hidden-profile criteria—no agent solves from partial info alone, union of shares yields the correct answer, and shared subset biases toward error. The manuscript must supply explicit validation (e.g., per-task or aggregate single-agent partial-info baselines) to confirm isolation from individual capability or task artifacts.

    Authors: We agree with the referee that explicit validation of the hidden-profile properties is crucial to the validity of our claims. While the manuscript describes the task construction in §3 and provides some aggregate results, we acknowledge that the per-task baseline comparisons were not reported in sufficient detail. In the revised manuscript, we have added a new table (Table 2) reporting single-agent accuracy under partial information for each task, confirming that individual agents perform at or below chance level on their partial views, that the full union yields the correct answer, and that the shared information leads to systematic errors. This addition directly addresses the concern and strengthens the isolation of the collective failure mode.
    revision: yes

  2. Referee: [Experimental Results and Baselines (§4)] Baseline fairness: The single-agent complete-information accuracy (80.7%) is the key comparator; the manuscript must confirm these baselines used identical prompting and context presentation as the multi-agent agents to avoid confounding the gap with differences in input format or instruction.

    Authors: We appreciate this point about ensuring a fair comparison. The single-agent complete-information baselines were run with exactly the same prompting instructions and context formatting as the multi-agent experiments; the sole modification was that all distributed facts were included in the single agent's input. To eliminate any ambiguity, we have expanded the description in §4.2 to state this equivalence explicitly and have included an example of the input formats for both conditions in the appendix. This clarification confirms that the observed gap is attributable to the distributed-information setting rather than to input differences.
    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results are direct measurements without reduction to inputs

full rationale

The paper is an empirical evaluation that introduces the HiddenBench benchmark and reports measured accuracy gaps (30.1% multi-agent under distributed information vs. 80.7% single-agent with complete information) plus observed failure modes. No equations, fitted parameters, or first-principles derivations appear in the provided text. The benchmark is described as grounded in the established Hidden Profile paradigm from external social-psychology literature rather than defined circularly within the paper. Central claims rest on experimental outcomes across models, prompting strategies, and group sizes, which are falsifiable by replication and do not reduce to self-citation chains or by-construction equivalences. The derivation chain consists of task construction, agent execution, and performance comparison, all independent of the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the Hidden Profile paradigm from social psychology and on standard assumptions that LLM chat interfaces can be treated as agents with private context. No new free parameters or invented entities are introduced.

axioms (1)
  • [domain assumption] The Hidden Profile paradigm from social psychology applies directly to LLM agents without modification.
    The benchmark is explicitly grounded in this paradigm to isolate collective reasoning.

pith-pipeline@v0.9.0 · 5772 in / 1211 out tokens · 41835 ms · 2026-05-22T14:18:16.660945+00:00 · methodology


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TeamBench: Evaluating Agent Coordination under Enforced Role Separation

    cs.AI · 2026-05 · unverdicted · novelty 7.0

    Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.

  2. The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies

    cs.CL · 2025-09 · conditional · novelty 7.0

    A systematic audit of LLM-based AI societies finds that 89.7% of 39 studies violate at least one of six PIMMUR validity principles, with reproductions showing that many claimed collective behaviors disappear when cont...

  3. LLM-Based Agentic Negotiation for 6G: Addressing Uncertainty Neglect and Tail-Event Risk

    cs.NI · 2025-11 · conditional · novelty 6.0

    A CVaR-aware agentic framework for 6G network slicing eliminates URLLC SLA violations by shifting LLM decisions from mean latency to tail-risk distributions predicted by digital twins.

  4. A Tutorial on Cognitive Biases in Agentic AI-Driven 6G Autonomous Networks

    cs.NI · 2025-10 · unverdicted · novelty 5.0

    Randomized Weibull anchors and debiased collective memory with decay and inflection bonuses let agentic AI in 6G cut anchoring, temporal, and confirmation biases, doubling energy savings to 25% and reducing latency by...