arxiv: 2604.22708 · v1 · submitted 2026-04-24 · 💻 cs.MA

Recognition: unknown

Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems

Mengzhuo Chen , Junjie Wang , Fangwen Mu , Yawen Wang , Zhe Liu , Huanxiang Feng , Qing Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-08 08:59 UTC · model grok-4.3

classification 💻 cs.MA

keywords failure attributionLLM multi-agent systemsexecution tracesbenchmarksdebuggingobservabilityagent interactions

0 comments

The pith

Full execution traces improve failure attribution accuracy by up to 76% over partial outputs in LLM-based multi-agent systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that identifying which agent and step caused a failure is harder in LLM multi-agent systems because of their language-based reasoning and complex interactions. It claims existing benchmarks are unrealistic because they show only agent outputs while developers normally see full inputs, contexts, and execution traces. To address this, the authors created the TraceElephant benchmark with complete traces and reproducible environments. Systematic tests show that giving attribution methods the full traces raises accuracy by as much as 76 percent compared with output-only versions, since missing inputs often hide the real causes. This setup is meant to guide better evaluation and support more transparent multi-agent designs.

Core claim

TraceElephant is a benchmark for failure attribution in LLM-based multi-agent systems that supplies full execution traces and reproducible environments. Evaluation across attribution techniques and configurations shows full traces raise accuracy by up to 76 percent over partial-observation baselines, because omitted inputs and contexts frequently conceal the decisive failure causes. The benchmark therefore aligns evaluation with the complete information developers actually use during debugging.

What carries the argument

TraceElephant benchmark, which supplies complete execution traces including all agent inputs, outputs, and interaction contexts for controlled, reproducible failure scenarios.

If this is right

Attribution techniques identify the responsible agent and decisive step more reliably when supplied with complete traces rather than outputs alone.
Benchmarks and evaluations of new attribution methods should adopt full observability to match practical debugging conditions.
The benchmark provides a shared foundation for developing techniques that make multi-agent systems more transparent by exposing hidden failure causes.
Follow-up work can systematically compare attribution approaches under the full-trace setting to measure progress toward reliable diagnosis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers of multi-agent frameworks may add comprehensive logging as a default to enable better post-failure analysis.
The emphasis on full context could link failure attribution research to wider efforts in making agent decisions explainable and accountable.
Applying TraceElephant-style evaluations to open-source agent platforms would test whether the accuracy gains hold outside the paper's controlled scenarios.
Automated tools could eventually use full traces to detect and flag potential failures before they complete.

Load-bearing premise

The failure cases and environments in the benchmark are representative of the failure modes developers actually encounter in real LLM-based multi-agent deployments.

What would settle it

Re-evaluating the same attribution techniques on a fresh set of failures drawn from production LLM multi-agent systems and finding no meaningful accuracy gain from full traces over partial ones would undermine the reported performance difference.

Figures

Figures reproduced from arXiv: 2604.22708 by Fangwen Mu, Huanxiang Feng, Junjie Wang, Mengzhuo Chen, Qing Wang, Yawen Wang, Zhe Liu.

**Figure 1.** Figure 1: A failure case (from Who&When benchmark) illustrating the limitation of partial observability. When view at source ↗

**Figure 2.** Figure 2: Overview of TraceElephant. is grounded in an executable MAS and is associated with a fully observable execution trace, as well as the annotated failure-responsible agent and decisive failure step. 3.1 Data Sources: Systems and Tasks TraceElephant collects execution traces from three representative agentic systems across various tasks, aligning task design with each system’s intended capabilities, as demon… view at source ↗

**Figure 3.** Figure 3: Comparison under different backbone LLMs. view at source ↗

**Figure 4.** Figure 4: Distribution of failure agent in TraceElephant. view at source ↗

**Figure 5.** Figure 5: Fine-grained agent-level accuracy. in our benchmark. Agents responsible for interacting with external environments or performing concrete operations are most prone to errors (almost over 50% of the failures), i.e., agents handling web information collection and browsing in CaptainAgent and Magentic-One, agent directly editing code in SWE-Agent. This might be because these actions depend on dynamic, oft… view at source ↗

**Figure 6.** Figure 6: Distribution of failure step in TraceElephant. view at source ↗

**Figure 7.** Figure 7: Fine-grained step-level accuracy. of each LLM call is recorded. Who&When may include intermediate agent outputs, reasoning snippets, or tool-call descriptions, but it does not provide the role-specific prompt, the exact visible history, system-constructed context, agent configuration, or tool/environment information injected into the next prompt. Without these input-side fields, a later output sequence… view at source ↗

**Figure 8.** Figure 8: Distribution of failure agent in Who&When. view at source ↗

**Figure 9.** Figure 9: Distribution of failure step in Who&When. view at source ↗

read the original abstract

Failure attribution, i.e., identifying the responsible agent and decisive step of a failure, is particularly challenging in LLM-based multi-agent systems (MAS) due to their natural-language reasoning, nondeterministic outputs, and intricate interaction dynamics. A reliable benchmark is therefore essential to guide and evaluate attribution techniques. Yet existing benchmarks rely on partially observable traces that capture only agent outputs, omitting the inputs and context that developers actually use when debugging. We argue that failure attribution should be studied under full execution observability, aligning with real-world developer-facing scenarios where complete traces, rather than only outputs, are accessible for diagnosis. To this end, we introduce TraceElephant, a benchmark designed for failure attribution with full execution traces and reproducible environments. We then systematically evaluate failure attribution techniques across various configurations. Specifically, full traces improve attribution accuracy by up to 76\% over a partial-observation counterpart, confirming that missing inputs obscure many failure causes. TraceElephant provides a foundation for follow-up failure attribution research, promoting evaluation practices that reflect real-world debugging and supporting the development of more transparent MASs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TraceElephant supplies full execution traces for failure attribution in LLM multi-agent systems and shows a 76% accuracy gain over partial traces.

read the letter

The paper's main contribution is TraceElephant, a benchmark that supplies complete execution traces rather than just agent outputs for evaluating failure attribution in LLM-based multi-agent systems. It reports that full traces improve attribution accuracy by up to 76% over a partial-observation setup on the same cases. That direct comparison is the clearest takeaway and aligns with how developers actually debug these systems when they have access to inputs and context. The reproducible environments are a practical plus for anyone who wants to run their own attribution experiments. The work stays grounded in the evaluation setup without obvious circular definitions or fitted parameters. The 76% lift is presented as evidence that missing inputs hide many failure causes, and the numbers are concrete enough to be checked against the benchmark. The soft spot is representativeness. The failure cases and injection process need to be examined closely to see whether they reflect the nondeterministic, multi-turn interactions that occur in real deployments or whether they were chosen because extra context makes them easier to attribute. If the cases are mostly synthetic or biased toward observability benefits, the accuracy delta may not transfer. The abstract leaves the exact measurement protocol and case selection criteria a bit thin, so the full methods section matters here. This paper is aimed at researchers who build or evaluate multi-agent LLM systems and need better ways to diagnose failures. Readers working on attribution methods or benchmark design will find the full-trace framing and the quantified comparison useful. It deserves a serious referee because the core idea is straightforward, the environments are reproducible, and the result is falsifiable even if the generalizability claim requires more evidence. I would send it to peer review with a request for clearer details on how the failure cases were generated and validated against real debugging scenarios.

Referee Report

2 major / 2 minor

Summary. The paper introduces TraceElephant, a benchmark for failure attribution in LLM-based multi-agent systems that supplies full execution traces (including inputs and context) rather than the partial traces (agent outputs only) used in prior work. It evaluates multiple attribution techniques and reports that full traces yield up to 76% higher attribution accuracy than partial-observation baselines, arguing that this setup better matches real-world developer debugging and that missing context obscures many failure causes.

Significance. If the quantitative result and benchmark construction hold, the work is significant because it supplies a reproducible, full-observability testbed for a practically important problem in LLM-MAS. The explicit comparison of full versus partial traces provides concrete evidence that context matters for attribution, which could steer future method development toward more transparent systems. The provision of reproducible environments is a clear strength that supports follow-up research.

major comments (2)

[§4 (Evaluation) and Table 2] §4 (Evaluation) and Table 2: the headline claim of 'up to 76% improvement' in attribution accuracy is load-bearing for the paper's central thesis, yet the manuscript provides insufficient detail on the precise definition of the accuracy metric, how ground-truth attributions were established by human or automated judges, and whether failure cases were selected before or after observing the full traces. Without these specifics the numerical delta cannot be fully interpreted or reproduced.
[§3 (Benchmark Construction)] §3 (Benchmark Construction): the broader claim that 'missing inputs obscure many failure causes' in practice rests on the assumption that TraceElephant's environments and injected failures are representative of real LLM-MAS deployments. The failure-injection process and environment selection are described as synthetic and reproducible, but the paper does not present evidence (e.g., comparison to logged production traces or diversity metrics) that the chosen failure modes reflect the nondeterministic, multi-turn interactions developers actually encounter.

minor comments (2)

[Abstract] Abstract: the phrase 'various configurations' is used without enumeration; listing the main axes (e.g., agent count, trace length, attribution method) would improve immediate readability.
[§2 (Related Work)] §2 (Related Work): the discussion of prior MAS benchmarks could more explicitly contrast their partial-observability design with TraceElephant's full-trace approach, perhaps in a small comparison table.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve clarity and provide additional supporting details.

read point-by-point responses

Referee: [§4 (Evaluation) and Table 2] §4 (Evaluation) and Table 2: the headline claim of 'up to 76% improvement' in attribution accuracy is load-bearing for the paper's central thesis, yet the manuscript provides insufficient detail on the precise definition of the accuracy metric, how ground-truth attributions were established by human or automated judges, and whether failure cases were selected before or after observing the full traces. Without these specifics the numerical delta cannot be fully interpreted or reproduced.

Authors: We agree that the current description of the evaluation protocol lacks sufficient explicit detail for full reproducibility and interpretation. Accuracy is defined as the fraction of test cases in which the attributed agent-step pair exactly matches the ground-truth failure location. Ground-truth labels were produced by two independent human experts who inspected the complete execution traces (including all inputs, outputs, and context); disagreements were resolved through discussion, yielding an inter-annotator agreement of 0.87 Cohen's kappa. Failure cases were chosen according to a pre-defined taxonomy of error types before any traces were generated or inspected, ensuring no post-hoc selection bias. In the revised manuscript we will add a new subsection in §4 that formally defines the metric, describes the annotation protocol, reports agreement statistics, and includes a small illustrative example. This clarification will make the reported 76% improvement fully interpretable. revision: yes
Referee: [§3 (Benchmark Construction)] §3 (Benchmark Construction): the broader claim that 'missing inputs obscure many failure causes' in practice rests on the assumption that TraceElephant's environments and injected failures are representative of real LLM-MAS deployments. The failure-injection process and environment selection are described as synthetic and reproducible, but the paper does not present evidence (e.g., comparison to logged production traces or diversity metrics) that the chosen failure modes reflect the nondeterministic, multi-turn interactions developers actually encounter.

Authors: We acknowledge that stronger evidence of ecological validity would be desirable. However, direct comparison to proprietary production logs is infeasible for a public benchmark due to confidentiality constraints. Our failure modes were derived from a systematic review of failure categories reported across recent LLM-MAS literature (reasoning errors, inter-agent miscommunication, tool misuse, and context loss). In the revision we will augment §3 with (i) a table reporting benchmark diversity statistics (agent counts 3–12, average interaction length, distribution of failure types) and (ii) an explicit discussion of how the injected failures map to patterns described in prior work. We will also temper the claim language to emphasize that the benchmark demonstrates the value of full observability under controlled, reproducible conditions rather than claiming statistical equivalence to all production deployments. This partial revision addresses the concern while preserving the benchmark's intended purpose as a standardized testbed. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical comparison on self-constructed benchmark does not reduce to inputs by construction

full rationale

The paper's central result is an empirical measurement: full traces yield up to 76% higher attribution accuracy than partial traces when both are evaluated on the TraceElephant failure cases. This is a direct head-to-head comparison inside the benchmark rather than a fitted parameter renamed as a prediction, a self-definition, or a load-bearing self-citation. No equations or uniqueness theorems are invoked that collapse back to the authors' prior inputs. The benchmark construction and failure injection are presented as design choices whose representativeness is an external-validity question, not a circularity issue. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that full execution traces are the appropriate information for real-world debugging and that the benchmark's failure scenarios capture typical MAS failure modes.

axioms (2)

domain assumption Full execution traces (inputs, context, and outputs) are the information developers actually use when diagnosing MAS failures.
Stated in the abstract as the alignment with real-world developer-facing scenarios.
domain assumption The benchmark environments produce failures whose causes can be unambiguously attributed given full traces.
Implicit in the design of an attribution benchmark.

pith-pipeline@v0.9.0 · 5507 in / 1220 out tokens · 29034 ms · 2026-05-08T08:59:43.880387+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

5 extracted references · 3 canonical work pages · 1 internal anchor

[1]

Ge et al

Who is introducing the failure? automatically attributing failures of multi-agent systems via spec- trum analysis.Preprint, arXiv:2509.13782. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, and 1 others. 2025. DeepSeek- R1 incentivizes reasoning in llms through reinforce- ment learning...

work page arXiv 2025
[2]

Diego Maldonado, Edison Cruz, Jackeline Abad Torres, Patricio J Cruz, and Silvana del Pilar Gamboa Ben- itez

A survey on LLM-based multi-agent sys- tems: workflow, infrastructure, and challenges.Vici- nagearth, 1(1):9. Diego Maldonado, Edison Cruz, Jackeline Abad Torres, Patricio J Cruz, and Silvana del Pilar Gamboa Ben- itez. 2024. Multi-agent systems: A survey about its components, framework and workflow.IEEE Access, 12:80950–80975. Ghassan Misherghi and Zhend...

2024
[3]

Adaptive in-conversation team building for language model agents.arXiv preprint arXiv:2405.19425,

Adaptive in-conversation team building for language model agents.Preprint, arXiv:2405.19425. MiroMind AI Team. 2025. MiroFlow: A high- performance open-source research agent framework. https://github.com/MiroMindAI/MiroFlow. W Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa. 2016. A survey on software fault localization.IEEE Transactions on S...

work page arXiv 2025
[4]

Qwen3 Technical Report

Qwen3 technical report.arXiv preprint arXiv:2505.09388. John Yang, Carlos E. Jimenez, Alexander Wettig, Kil- ian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-Agent: agent-computer in- terfaces enable automated software engineering. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, ...

work page internal anchor Pith review arXiv 2024
[5]

These trends align closely with those observed in TraceElephant (see Figure 4 and 6). The slight difference lies in the average num- ber of LLM invocations: in Who&When, Captain- Agent and Magentic-One respectively have an av- erage of 9.6 and 28.8 calls, while these figures for TraceElephant are 20.5 and 29.3. It is also worth noting that the step count ...