Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

Jianan Liu; Jing Yang; Mengwei Yuan; Penghao Liang; Weiran Yan; Xianyou Li; Yichao Wu

arXiv: 2606.01365 · v1 · pith:LQ4WHYIK · submitted 2026-05-31 · 💻 cs.AI

Xianyou Li, Weiran Yan, Yichao Wu, Penghao Liang, Mengwei Yuan, Jianan Liu, Jing Yang

Pith reviewed 2026-06-28 17:05 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-agent LLM · failure-aware observability · wasted computation · trace signals · tool reliability · orchestration loops · GAIA benchmark · diagnostic framework

The pith

Failure-aware observability framework diagnoses wasted computation in multi-agent LLM systems by mapping trace signals to failure modes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework for early diagnosis of wasted computation in tool-using multi-agent LLM systems. It maps recurring failure modes to online trace signals such as tool reliability, execution recovery, orchestration loops, evidence availability, information change, and budget pressure. This allows identification of the point where a trajectory stops making recoverable progress, rather than waiting for final-answer evaluation. Evaluation on 165 GAIA validation traces in a three-agent question-answering system reveals common operational failures with different underlying mechanisms and increasing token usage at higher task levels. The framework is positioned as a diagnostic layer between raw execution logs and final-answer accuracy metrics.

Core claim

The framework maps recurring failure modes to online trace signals, including tool reliability, execution recovery, orchestration loops, evidence availability, information change, and budget pressure, in order to diagnose wasted computation in multi-agent LLM traces before final answers are produced.

What carries the argument

The failure-aware observability framework, which maps recurring failure modes to specific online trace signals in execution traces.
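A minimal sketch of what such a signal-to-failure mapping might look like in code. The `Step` fields, the label strings, and every threshold below are illustrative assumptions, not the paper's schema or method; the paper's actual mapping procedure is not described in the material reviewed here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One hypothetical trace event (field names are assumptions)."""
    action: str    # e.g. "search('x')" or "run_code(...)"
    tool_ok: bool  # did the tool call return without error?
    output: str    # raw tool output, possibly empty
    tokens: int    # tokens spent on this step


def diagnose(steps: List[Step], token_budget: int,
             loop_window: int = 3, fail_streak: int = 3) -> Optional[str]:
    """Return the first failure-mode label that fires, or None.

    loop_window and fail_streak are placeholder thresholds.
    """
    spent = 0
    streak = 0
    for i, step in enumerate(steps):
        spent += step.tokens
        streak = 0 if step.tool_ok else streak + 1

        # Tool reliability: consecutive tool errors.
        if streak >= fail_streak:
            return f"tool_failure_streak@{i}"

        # Orchestration loop: the same action repeated loop_window times in a row.
        recent = [s.action for s in steps[max(0, i - loop_window + 1): i + 1]]
        if len(recent) == loop_window and len(set(recent)) == 1:
            return f"repeated_action_loop@{i}"

        # Execution call that succeeds without useful output.
        if step.tool_ok and not step.output.strip():
            return f"empty_success@{i}"

        # Budget pressure: cumulative tokens exceed the cap.
        if spent > token_budget:
            return f"budget_exceeded@{i}"
    return None
```

Each check reads only the steps seen so far, so the same function could in principle run during execution as well as over a completed log. Whether the labels it emits mark a truly non-recoverable point is exactly the premise the paper would need to test.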

If this is right

Operational failures remain common, with 22 of 53 level-1 runs, 33 of 86 level-2 runs, and 12 of 26 level-3 runs failing to produce a usable final answer.
Mean token use increases from 8,152 tokens at level 1 to 16,389 tokens at level 3.
Traces expose mechanisms such as insufficient evidence, repeated-action loops, max-step termination, tool-failure streaks, and execution calls that succeed without useful output.
Evidence availability and sentence-level support diverge across levels.
A cached 10-trace LLM-judge grounding audit shows that cheap online signals and deeper semantic metrics capture complementary layers of failure.
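The counts above can be turned into rates with a few lines of arithmetic. The snippet uses only the figures reported in the abstract; the variable names are arbitrary.

```python
# Runs without a usable final answer, and total runs, per GAIA level (from the abstract).
failed = {1: 22, 2: 33, 3: 12}
total = {1: 53, 2: 86, 3: 26}

# Per-level operational failure rate.
rates = {level: failed[level] / total[level] for level in failed}

# Pooled rate across all 165 traces.
overall = sum(failed.values()) / sum(total.values())

# Growth in mean token use from level 1 to level 3.
token_ratio = 16389 / 8152

for level, rate in rates.items():
    print(f"level {level}: {rate:.1%}")
print(f"overall: {overall:.1%}, token ratio L3/L1: {token_ratio:.2f}")
```

The rates come out to roughly 41.5%, 38.4%, and 46.2%, with a pooled rate near 40.6%. The failure rate is therefore not monotonic in level, while mean token use roughly doubles from level 1 to level 3.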

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Such a framework could enable systems to terminate unproductive trajectories early and conserve computational resources.
The signal mappings might be adapted to other multi-agent LLM applications beyond the three-agent QA system tested.
Integrating these observability signals with existing monitoring tools could improve overall system reliability in agent-based workflows.

Load-bearing premise

The trace signals can be mapped to failure modes in a manner that accurately diagnoses when a trajectory ceases to make recoverable progress.

What would settle it

A controlled experiment that continues execution on trajectories the framework flags as non-recoverable would settle it. If those trajectories reach correct answers at rates no higher than chance, the diagnosis holds; if they recover at rates well above chance, the diagnostic utility is falsified.
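One way to score such an experiment, sketched with an exact binomial tail. All numbers below are hypothetical placeholders, and `chance_rate` in particular is an assumption: for open-ended question answering there is no obvious chance level, so a real study would have to justify its baseline.

```python
from math import comb


def binomial_tail(successes: int, trials: int, base_rate: float) -> float:
    """P(X >= successes) for X ~ Binomial(trials, base_rate)."""
    return sum(
        comb(trials, k) * base_rate**k * (1 - base_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical experiment: these values are NOT from the paper.
flagged_continued = 40   # flagged trajectories that were allowed to keep running
recovered = 3            # of those, how many reached a correct answer
chance_rate = 0.05       # assumed baseline success rate

# Probability of seeing at least `recovered` successes if flagged runs
# only succeed at the baseline rate.
p_value = binomial_tail(recovered, flagged_continued, chance_rate)
print(f"P(X >= {recovered}) under baseline = {p_value:.3f}")
```

A small tail probability would suggest that flagged trajectories recover more often than the baseline allows, which would count against the claim that the flags mark non-recoverable points.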

Figures

Figures reproduced from arXiv: 2606.01365 by Jianan Liu, Jing Yang, Mengwei Yuan, Penghao Liang, Weiran Yan, Xianyou Li, Yichao Wu.

**Figure 1.** Full level-stratified run under identical execution caps. Usable finals are operational outcomes, not correctness claims; grounding quality is analyzed

**Figure 2.** Layered Pearson correlation comparison. The full-run column com

Original abstract

Tool-using multi-agent large language model (LLM) systems spend computation through model tokens, tool calls, retries, and code execution before producing an answer. When a run fails, final-answer evaluation reveals the endpoint but usually not the point at which the trajectory stopped making recoverable progress. This paper introduces a failure-aware observability framework for diagnosing wasted computation in multi-agent LLM traces. The framework maps recurring failure modes to online trace signals, including tool reliability, execution recovery, orchestration loops, evidence availability, information change, and budget pressure. We instantiate the framework in a three-agent question-answering system and evaluate it on 165 GAIA validation traces under identical execution caps. Operational failures remain common: 22/53 level-1 runs, 33/86 level-2 runs, and 12/26 level-3 runs fail to produce a usable final answer. The traces expose different mechanisms behind these outcomes, including insufficient evidence, repeated-action loops, max-step termination, tool-failure streaks, and execution calls that succeed without useful output. Mean token use rises from 8,152 tokens at level 1 to 16,389 tokens at level 3, while evidence availability and sentence-level support diverge. A cached 10-trace LLM-judge grounding audit shows that cheap online signals and deeper semantic metrics capture complementary layers of failure. The results position failure-aware observability as a diagnostic layer between raw execution logs and final-answer accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a concrete breakdown of failure signals in multi-agent LLM traces on GAIA, but its evaluation stays retrospective and does not test online early detection.

The paper introduces a failure-aware observability framework that maps recurring failure modes in multi-agent LLM systems to trace signals including tool reliability, execution recovery, orchestration loops, evidence availability, information change, and budget pressure. The authors instantiate it in a three-agent QA setup and run it across 165 GAIA validation traces under fixed caps, reporting failure rates and token patterns by difficulty level.

It does a straightforward job with the data. Operational failures are common: 22 of 53 level-1 runs, 33 of 86 level-2 runs, and 12 of 26 level-3 runs produce no usable answer. Mean token use rises from 8,152 at level 1 to 16,389 at level 3, with evidence availability and sentence-level support diverging across levels. The cached 10-trace LLM-judge audit shows that cheap online signals and deeper semantic checks capture different failure layers.

The soft spot is the mismatch with the central claim. The framework is sold as a way to diagnose the point where a trajectory stops making recoverable progress, yet the evaluation runs every trace to completion and then looks back at the logs. No results appear on real-time monitoring thresholds, precision of early detection, or whether intervening on the signals would have reduced wasted tokens while preserving answer quality. The mapping procedure itself is not described in enough detail to assess.
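The measurements this paragraph finds missing could be computed roughly as follows. This is a sketch under stated assumptions: `TraceRecord`, its fields, and the idea of stopping a run at the step where a signal first fires are hypothetical, and the sample data is invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TraceRecord:
    """Hypothetical per-trace summary (not the paper's log format)."""
    step_tokens: List[int]     # tokens spent at each step
    flagged_at: Optional[int]  # step index where a signal first fired, or None
    failed: bool               # True if the run produced no usable final answer


def early_detection_report(traces: List[TraceRecord]) -> Dict[str, float]:
    flagged = [t for t in traces if t.flagged_at is not None]
    true_pos = [t for t in flagged if t.failed]
    failures = [t for t in traces if t.failed]

    # Precision: share of flagged runs that really failed.
    precision = len(true_pos) / len(flagged) if flagged else 0.0
    # Recall: share of failed runs that were flagged at all.
    recall = len(true_pos) / len(failures) if failures else 0.0
    # Tokens spent after the flag, i.e. what stopping at the flag would save.
    saved = sum(sum(t.step_tokens[t.flagged_at + 1:]) for t in flagged)
    # Flagged runs that would have produced a usable answer if left alone.
    lost = len(flagged) - len(true_pos)

    return {"precision": precision, "recall": recall,
            "tokens_saved": float(saved), "usable_answers_lost": float(lost)}


# Invented sample data for illustration only.
sample = [
    TraceRecord([100, 100, 100, 100], flagged_at=1, failed=True),
    TraceRecord([50, 50], flagged_at=None, failed=True),
    TraceRecord([10, 10, 10], flagged_at=0, failed=False),
    TraceRecord([5], flagged_at=None, failed=False),
]
report = early_detection_report(sample)
print(report)
```

Reporting tokens saved next to usable answers lost makes the trade-off explicit: an aggressive threshold saves more computation but cuts off more runs that would have finished.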

This is for people tuning or debugging multi-agent LLM systems who need better ways to inspect traces. It has enough empirical grounding and clear numbers to deserve a serious referee, though the online detection piece would need more work.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a failure-aware observability framework for identifying wasted computation in multi-agent LLM systems. It maps recurring failure modes to online trace signals including tool reliability, execution recovery, orchestration loops, evidence availability, information change, and budget pressure. The framework is instantiated in a three-agent QA system and evaluated on 165 GAIA validation traces, reporting operational failure rates of 22/53 for level-1, 33/86 for level-2, and 12/26 for level-3, along with increasing mean token usage from 8,152 to 16,389 tokens across levels, and insights from a 10-trace LLM-judge audit.

Significance. If the results hold and the framework enables early online detection, it would offer a practical diagnostic layer for multi-agent systems, helping to reduce wasted tokens by identifying non-recoverable trajectories. The paper provides concrete empirical data from 165 traces and a grounding audit, which are strengths in grounding the failure mode analysis.

major comments (2)

[Abstract] The central claim requires demonstrating that the listed trace signals can diagnose in an online fashion the point at which a trajectory stops making recoverable progress. However, the evaluation runs all traces to completion under fixed caps and retrospectively identifies failures, without results on real-time monitoring thresholds or precision of early detection.
[Evaluation] No methods are described for how the signal-to-failure mapping is performed or how the evaluation supports the claim that these signals enable early diagnosis, making it impossible to assess the data against the central claim.

minor comments (1)

[Abstract] The abstract mentions 'a cached 10-trace LLM-judge grounding audit' but does not specify the criteria used for the audit or how it complements the cheap online signals.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for identifying key gaps between our claims and the presented evaluation. We address each major comment below and will make revisions to clarify scope and methods.

Referee: [Abstract] The central claim requires demonstrating that the listed trace signals can diagnose in an online fashion the point at which a trajectory stops making recoverable progress. However, the evaluation runs all traces to completion under fixed caps and retrospectively identifies failures, without results on real-time monitoring thresholds or precision of early detection.

Authors: We agree the evaluation is retrospective on completed traces under fixed caps and provides no real-time monitoring thresholds or early-detection precision metrics. The signals are defined to be observable during execution, but the study characterizes their presence in failing traces rather than demonstrating online diagnosis. We will revise the abstract and claims to remove implications of online early detection results and add explicit discussion of this limitation. revision: yes

Referee: [Evaluation] No methods are described for how the signal-to-failure mapping is performed or how the evaluation supports the claim that these signals enable early diagnosis, making it impossible to assess the data against the central claim.

Authors: The mappings were derived from author inspection of the 165 traces, identifying patterns such as repeated-action loops and tool-failure streaks, with grounding from the 10-trace LLM-judge audit. We will add a subsection detailing the mapping process with trace examples. We will also adjust language to state that the data shows associations with eventual failure in completed runs, without claiming support for online early diagnosis. revision: yes

Circularity Check

0 steps flagged

No circularity; framework is descriptive mapping plus retrospective trace analysis

The paper introduces a failure-aware observability framework that maps failure modes to trace signals and evaluates the mapping on 165 completed GAIA traces. No equations, fitted parameters, predictions, or derivations are present. No self-citations are invoked as load-bearing premises. The central claim reduces to empirical observation of logs rather than any self-referential construction, satisfying the criteria for a self-contained non-circular analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.
