CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.
Canonical reference
Trail: Trace reasoning and agentic issue localization,
Canonical reference. 83% of citing Pith papers cite this work as background.
citation-role summary
citation-polarity summary
representative citing papers
Models delayed verification in multi-agent LLMs as graph consensus, derives stability thresholds (inverse golden ratio for delay two) via grounded Laplacian, and gives a supermodular greedy rule for corrector placement; experiments on five models confirm dose-delay oscillations.
Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.
A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
A span-decomposed evaluation framework for AI agents achieves state-of-the-art results on GAIA and SWE-Bench with up to 3.5x gains in localization accuracy by breaking traces into independent per-span judgments.
AgentLens reveals 10.7% of passing SWE-agent trajectories exhibit Lucky Pass behaviors and introduces a process-level evaluation framework with a new annotated dataset of 1,815 trajectories.
AJ-Bench provides 155 tasks in three domains to evaluate environment-interacting agent judges, showing performance gains over LLM-as-a-Judge but exposing remaining verification challenges.
A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.
Refploit repairs code-agent trajectories for Java exploit reproduction via differential validation and focused recovery constraints, achieving 80.2% success on 172 references with 64.3% relative improvement.
Strained coherence flagged by Claude judge on 44 coding trajectories predicts failure (94% vs 46%, p=0.003), with partial replication on second model.
StepFinder turns execution logs into temporal semantic sequences via LLMs then uses temporal modeling plus attention to attribute failures to specific steps more accurately and 79% faster than direct LLM methods on the Who&When benchmark.
TrajAudit diagnoses failures in repository-level agentic coding trajectories by filtering noise and injecting test-failure priors, achieving >24.4 pp higher localization accuracy and 18% lower token use on the new RootSE benchmark of 93 instances.
PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.
SelfHeal uses two ReAct agents and empirical fix patterns to repair bugs in LLM agents, outperforming baselines on a new 37-instance benchmark.
Graphectory turns stochastic agent trajectories into analyzable graphs, showing that stronger models and successful fixes follow coherent localization-validation steps while failures are chaotic, and online detection plus rollback improves resolution rates by 6.9-23.5%.
This survey defines execution provenance as a typed graph of agent execution and evidence tracing as its projection onto evidence-support relations, then reviews methods, taxonomy, benchmarks, and challenges for auditable LLM agents.
Insights Generator is a multi-agent system that produces evidence-backed insights from corpora of LLM agent traces and yields 30.4pp performance gains when humans apply the reports.
ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.
AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.
The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle with equally sophisticated long outputs.
This survey categorizes anomalies in agent systems into intra-agent and inter-agent types and introduces the AgentOps framework with four operational stages.
citing papers explorer
No citing papers match the current filters.