CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.
Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3representative citing papers
Pooled top-1 accuracy rankings in RCA benchmarks do not reliably identify per-subsystem winners, as pairwise comparisons across 11 subsystems show effects of both signs and leave-one-system-out selection incurs regret up to 24.8 pp.
Graph Traversal Agent improves root-cause F1 from 0.6087 to 0.9130 on ITBench snapshots but the gain is benchmark-coupled to cases where the injected fault is already in the evidence graph.
citing papers explorer
-
CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend
CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.
-
Pooled Leaderboards Hide System-Specific Winners: A Reporting-Protocol Audit of Offline Root-Cause Analysis Benchmarks
Pooled top-1 accuracy rankings in RCA benchmarks do not reliably identify per-subsystem winners, as pairwise comparisons across 11 subsystems show effects of both signs and leave-one-system-out selection incurs regret up to 24.8 pp.
-
Auditable Graph-Guided Root Cause Analysis for Kubernetes Incidents
Graph Traversal Agent improves root-cause F1 from 0.6087 to 0.9130 on ITBench snapshots but the gain is benchmark-coupled to cases where the injected fault is already in the evidence graph.