CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.
arXiv:2601.22881 [cs.SE] https://arxiv.org/abs/2601.22881
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years: 2026 (3)
verdicts: UNVERDICTED (3)
roles: background (1)
polarities: background (1)
representative citing papers
LATS-RCA applies multi-agent Language Agent Tree Search to automate root cause analysis in microservices, reporting high accuracy on a small open-source Java system but lower accuracy in a complex production environment.
No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.
citing papers explorer
- CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend
  CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.
- Multi-Agent Systems for Root Cause Analysis in Microservices
  LATS-RCA applies multi-agent Language Agent Tree Search to automate root cause analysis in microservices, reporting high accuracy on a small open-source Java system but lower accuracy in a complex production environment.
- Security Considerations for Multi-agent Systems
  No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.