CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.
Cloud-OpsBench: A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
years
2026 2verdicts
UNVERDICTED 2roles
background 1polarities
background 1representative citing papers
SREGym is a modular, open-source live benchmark with 90 high-fidelity SRE failure scenarios built on real cloud stacks for evaluating AI agents on diagnosis and mitigation tasks.
citing papers explorer
-
CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend
CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.
-
SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
SREGym is a modular, open-source live benchmark with 90 high-fidelity SRE failure scenarios built on real cloud stacks for evaluating AI agents on diagnosis and mitigation tasks.