CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend
Pith reviewed 2026-05-08 07:43 UTC · model grok-4.3
The pith
A new benchmark shows LLM agents diagnose cross-modal failures at 19.7 percent accuracy because they cannot correctly attribute evidence across browser and backend signals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CUJBench demonstrates that current LLM agents retrieve decisive browser and backend evidence but consistently fail to attribute it to the correct failure, producing 19.7 percent overall accuracy and a 52 percent ceiling across six models. Browser-only agents outperform full-toolset agents because additional observability induces unfocused exploration instead of better cross-modal reasoning. The limitation appears uniform regardless of model scale or tool richness.
What carries the argument
CUJBench: a collection of 87 deterministic multi-modal failure snapshots, each paired with a fixed tool interface and three-layer labels, used to score agents on retrieval, browser-only, and full-toolset diagnosis tasks.
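The text does not show the snapshot schema, but the description (deterministic multi-modal capture, fixed tool interface, three-layer labels) suggests a container along these lines. A minimal sketch in Python; every field name here is an illustrative assumption, not the authors' schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FailureSnapshot:
    # One deterministic CUJBench scenario (all field names hypothetical).
    scenario_id: str                             # e.g. "checkout-017"
    fault_family: str                            # one of the five fault families
    # Three-layer labels, e.g. fault family -> faulty component -> root cause.
    labels: dict[str, str] = field(default_factory=dict)
    # Frozen browser-side evidence: screenshots, DOM dumps, console logs.
    browser_artifacts: dict[str, bytes] = field(default_factory=dict)
    # Frozen backend observability: logs, traces, metric series.
    backend_artifacts: dict[str, bytes] = field(default_factory=dict)
    # Names exposed by the fixed tool interface for this scenario.
    allowed_tools: tuple[str, ...] = ()
```

Freezing the dataclass mirrors the paper's reproducibility claim: the same snapshot yields the same observations on every run, so cross-run agent comparisons stay deterministic.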
If this is right
- Expanded tool access can lower performance by encouraging unfocused exploration rather than focused synthesis.
- Cross-modal attribution remains the dominant error type even when agents locate the right evidence.
- The performance gap between browser-only and full-tool agents indicates that simply adding more signals does not improve diagnosis.
- No model scale tested overcomes the attribution bottleneck in these scenarios.
Where Pith is reading between the lines
- Future agent designs may need explicit mechanisms to track and match evidence across separate observation streams.
- The benchmark could serve as a testbed for training methods that reward correct evidence-to-cause mapping.
- Similar cross-modal gaps may appear in other domains that require linking user-visible symptoms to internal logs.
Load-bearing premise
The automated generation pipeline with multi-agent review produces failure scenarios that are realistic and free of artifacts that would change how agents perform.
What would settle it
Run the same six agents on a fresh set of 20 scenarios whose labels were created entirely by human experts rather than by the pipeline, and check whether accuracy rises above 30 percent.
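For concreteness, the test reduces to a simple exact-match accuracy check against a threshold. A minimal sketch, assuming a hypothetical `run_agent` harness callable and scenario objects carrying ground-truth labels:

```python
from typing import Callable


def diagnosis_accuracy(run_agent: Callable, agent, scenarios) -> float:
    # Exact-match accuracy over a scenario set. `run_agent` is a
    # hypothetical harness call returning the agent's diagnosed root cause.
    correct = sum(
        run_agent(agent, s) == s.labels["root_cause"] for s in scenarios
    )
    return correct / len(scenarios)


def pipeline_artifacts_implicated(run_agent, agents, human_labeled) -> bool:
    # True if any of the six agents clears 30% on the 20 human-expert
    # scenarios, which would point at generation artifacts rather than a
    # structural agent limitation.
    return any(
        diagnosis_accuracy(run_agent, a, human_labeled) > 0.30 for a in agents
    )
```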
Original abstract
Automated failure diagnosis requires correlating browser-visible symptoms with backend observability signals, yet existing benchmarks do not evaluate this cross-modal reasoning task. Constructing one is non-trivial: multi-modal failure scenarios are costly to annotate, and live-environment capture introduces stochasticity that makes cross-run agent comparison unreliable. We present CUJBench, to our knowledge, the first benchmark to combine browser-visible failure evidence with backend observability in a diagnostic framing. CUJBench addresses annotation cost through an LLM-assisted generation pipeline with a multi-agent review loop and a three-layer annotation scheme, producing 87 labeled scenarios across five fault families, and ensures reproducibility by packaging each failure as a deterministic multi-modal snapshot with a fixed tool interface. Evaluating six frontier models under retrieval, browser-only, and full-toolset baselines, the benchmark yields an overall accuracy of 19.7% with a ceiling of 52%, well below saturation. Contrary to expectation, browser-only agents outperform full-toolset agents in aggregate, with expanded evidence access inducing unfocused exploration rather than improved synthesis. Trajectory analysis identifies cross-modal synthesis as the primary bottleneck: agents retrieve the decisive evidence but fail to attribute it correctly - a structural limitation uniform across all six models that model scale and richer tool access alone cannot resolve.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents CUJBench, the first benchmark for LLM-agent cross-modal failure diagnosis that correlates browser-visible symptoms with backend observability. It uses an LLM-assisted generation pipeline with multi-agent review and a three-layer annotation scheme to create 87 deterministic, reproducible scenarios across five fault families. Evaluation of six frontier models under retrieval, browser-only, and full-toolset settings reports 19.7% overall accuracy (ceiling 52%), finds browser-only agents outperform full-toolset agents, and concludes via trajectory analysis that cross-modal synthesis is the primary bottleneck uniform across models.
Significance. If the generated scenarios prove representative of real failures, this benchmark fills a gap in evaluating agentic systems for practical software diagnosis tasks. The reproducible packaging of failures as fixed multi-modal snapshots is a clear strength that enables reliable cross-run comparisons. The counterintuitive result that expanded tool access can degrade performance, together with the uniform attribution failure finding, would usefully inform agent design if the benchmark construction is validated.
major comments (3)
- [§3] Scenario generation section (described in abstract and §3): The LLM-assisted pipeline with multi-agent review loop and three-layer annotation scheme operates entirely within the same model family. This raises the possibility that the scenarios systematically embed attribution artifacts, making the observed cross-modal synthesis bottleneck (agents retrieve evidence but fail to attribute) an artifact of the benchmark rather than a structural property of the agents. The manuscript should add an external validation step, such as comparison against a held-out set of human-annotated real-world failures or inter-rater agreement metrics with domain experts, to confirm the scenarios are free of such circularity.
- [§4] Evaluation section (§4): The headline claims of 19.7% accuracy, 52% ceiling, and browser-only outperforming full-toolset are load-bearing for the argument that richer tools do not resolve the bottleneck. However, the manuscript provides no definitions of the accuracy metric (e.g., exact criteria for correct diagnosis or partial credit), no statistical tests for the performance difference, and no per-model or per-scenario variance. These omissions prevent assessment of whether the results are robust or sensitive to implementation choices in the tool interfaces.
- [§5] Trajectory analysis (§5): The identification of cross-modal synthesis as the primary bottleneck rests on qualitative trajectory inspection showing retrieval without correct attribution. The paper should supply quantitative metrics for attribution failure (e.g., a defined attribution score or count of misattribution instances per scenario) and concrete examples to make this claim falsifiable and proportionate to the central conclusion.
minor comments (3)
- The abstract's claim of being 'to our knowledge, the first benchmark' should be supported by explicit comparisons in the related work section to prior failure diagnosis benchmarks in software engineering.
- Ensure all figures depicting agent trajectories or the benchmark architecture include clear legends and axis labels for readability.
- Expand the acronym CUJBench on first use in the main text.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which identifies key areas where the manuscript can be strengthened. We address each major comment point by point below, with clear indications of the revisions we will make.
Point-by-point responses
Referee: [§3] Scenario generation section (described in abstract and §3): The LLM-assisted pipeline with multi-agent review loop and three-layer annotation scheme operates entirely within the same model family. This raises the possibility that the scenarios systematically embed attribution artifacts, making the observed cross-modal synthesis bottleneck (agents retrieve evidence but fail to attribute) an artifact of the benchmark rather than a structural property of the agents. The manuscript should add an external validation step, such as comparison against a held-out set of human-annotated real-world failures or inter-rater agreement metrics with domain experts, to confirm the scenarios are free of such circularity.
Authors: We acknowledge the referee's concern regarding potential circularity in the LLM-assisted generation pipeline. While the multi-agent review loop and three-layer annotation scheme were intended to promote scenario quality and diversity, the use of models from the same family could introduce subtle biases. The observed bottleneck concerns attribution of already-retrieved evidence rather than retrieval itself, which we argue is less likely to be an artifact. To address this, we will add a limitations subsection to §3 discussing the generation process and its possible biases. We will also report a comparison of a random subset of 20 scenarios against real-world failure reports drawn from public GitHub repositories of web applications to assess representativeness. Full human re-annotation of the entire set is not feasible within this revision cycle. revision: partial
Referee: [§4] Evaluation section (§4): The headline claims of 19.7% accuracy, 52% ceiling, and browser-only outperforming full-toolset are load-bearing for the argument that richer tools do not resolve the bottleneck. However, the manuscript provides no definitions of the accuracy metric (e.g., exact criteria for correct diagnosis or partial credit), no statistical tests for the performance difference, and no per-model or per-scenario variance. These omissions prevent assessment of whether the results are robust or sensitive to implementation choices in the tool interfaces.
Authors: We agree that the evaluation section requires more precise definitions and statistical support to substantiate the headline results. In the revised manuscript we will explicitly define the accuracy metric in §4, stating that a diagnosis is scored correct only when both the fault family and the specific root cause match the ground-truth annotation (no partial credit is awarded). We will add tables reporting per-model and per-scenario accuracies together with standard deviations, and we will include statistical tests (McNemar's test for paired binary outcomes) to evaluate the significance of the browser-only versus full-toolset performance difference. These additions will allow readers to assess robustness directly. revision: yes
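The scoring rule and paired test promised here are mechanical enough to sketch. A minimal illustration, assuming per-scenario correctness flags and hypothetical dictionary keys; the exact form of McNemar's test is the natural choice for 87 paired binary outcomes:

```python
from statsmodels.stats.contingency_tables import mcnemar


def is_correct(prediction: dict, truth: dict) -> bool:
    # Strict rule from the revision plan: both the fault family and the
    # specific root cause must match the ground truth; no partial credit.
    # Dictionary keys are hypothetical.
    return (prediction["fault_family"] == truth["fault_family"]
            and prediction["root_cause"] == truth["root_cause"])


def compare_paired_settings(browser_only: list[bool], full_toolset: list[bool]):
    # Paired per-scenario correctness over the same 87 snapshots.
    b11 = sum(b and f for b, f in zip(browser_only, full_toolset))
    b10 = sum(b and not f for b, f in zip(browser_only, full_toolset))
    b01 = sum((not b) and f for b, f in zip(browser_only, full_toolset))
    b00 = sum((not b) and (not f) for b, f in zip(browser_only, full_toolset))
    table = [[b11, b10], [b01, b00]]
    # exact=True tests the discordant pairs against a binomial(n, 0.5),
    # appropriate when discordant counts are small.
    return mcnemar(table, exact=True)
```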
Referee: [§5] Trajectory analysis (§5): The identification of cross-modal synthesis as the primary bottleneck rests on qualitative trajectory inspection showing retrieval without correct attribution. The paper should supply quantitative metrics for attribution failure (e.g., a defined attribution score or count of misattribution instances per scenario) and concrete examples to make this claim falsifiable and proportionate to the central conclusion.
Authors: We concur that the trajectory analysis would be strengthened by quantitative metrics and concrete examples. We will introduce an 'attribution failure rate' metric in §5, defined as the percentage of scenarios in which the agent retrieves the decisive evidence (confirmed via tool-call logs) yet fails to correctly attribute it in the final diagnosis. This rate will be reported for each model and experimental setting. In addition, we will include an appendix containing three detailed trajectory examples that illustrate the retrieval-without-attribution pattern, making the central claim more falsifiable and proportionate to the evidence. revision: yes
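As defined, the metric is straightforward to compute from trajectory logs. A minimal sketch; the per-trajectory attributes are hypothetical names for flags the authors would derive from tool-call logs and the ground-truth annotations:

```python
def attribution_failure_rate(trajectories) -> float:
    # Percentage of scenarios in which the agent retrieved the decisive
    # evidence (confirmed via tool-call logs) yet misattributed it in the
    # final diagnosis, following the rebuttal's wording. Attribute names
    # are hypothetical.
    if not trajectories:
        return 0.0
    failures = sum(
        1 for t in trajectories
        if t.retrieved_decisive_evidence and not t.attribution_correct
    )
    return 100.0 * failures / len(trajectories)
```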
Circularity Check
No significant circularity: purely empirical benchmark with direct measurements
Full rationale
The paper constructs CUJBench via an LLM-assisted pipeline and evaluates six models on 87 scenarios, reporting accuracy, trajectory patterns, and comparative baselines. No equations, fitted parameters, predictions derived from subsets, or self-citation chains appear in the provided text. All reported results (19.7% accuracy, browser-only outperforming full-toolset, cross-modal attribution failures) are direct empirical observations on the generated test set rather than quantities forced by construction or renamed inputs. The generation process is presented as a practical solution to annotation cost, not as a derivation that loops back to validate itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM-assisted generation plus multi-agent review produces high-quality, unbiased failure scenarios that reflect genuine cross-modal diagnostic challenges.