Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?

2026 · arXiv 2602.09937

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend

cs.SE · 2026-04-25 · unverdicted · novelty 8.0

CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.

Pooled Leaderboards Hide System-Specific Winners: A Reporting-Protocol Audit of Offline Root-Cause Analysis Benchmarks

cs.AI · 2026-06-28 · unverdicted · novelty 7.0

Pooled top-1 accuracy rankings in RCA benchmarks do not reliably identify per-subsystem winners, as pairwise comparisons across 11 subsystems show effects of both signs and leave-one-system-out selection incurs regret up to 24.8 pp.

Auditable Graph-Guided Root Cause Analysis for Kubernetes Incidents

cs.SE · 2026-06-07 · conditional · novelty 5.0

Graph Traversal Agent improves root-cause F1 from 0.6087 to 0.9130 on ITBench snapshots but the gain is benchmark-coupled to cases where the injected fault is already in the evidence graph.

citing papers explorer

Showing 3 of 3 citing papers.

CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend · cs.SE · 2026-04-25 · unverdicted · none · ref 4
Pooled Leaderboards Hide System-Specific Winners: A Reporting-Protocol Audit of Offline Root-Cause Analysis Benchmarks · cs.AI · 2026-06-28 · unverdicted · none · ref 12
Auditable Graph-Guided Root Cause Analysis for Kubernetes Incidents · cs.SE · 2026-06-07 · conditional · none · ref 15
