Canonical reference

TRAIL: Trace reasoning and agentic issue localization.arXiv preprint arXiv:2505.08638,

Deshpande, D · 2025 · arXiv 2505.08638

Canonical reference. 83% of citing Pith papers cite this work as background.

20 Pith papers citing it

Background 83% of classified citations

read on arXiv browse 20 citing papers

citation-role summary

background 5 baseline 1

citation-polarity summary

background 5 baseline 1

representative citing papers

CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend

cs.SE · 2026-04-25 · unverdicted · novelty 8.0

CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.

Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement

cs.MA · 2026-06-25 · unverdicted · novelty 7.0

Models delayed verification in multi-agent LLMs as graph consensus, derives stability thresholds (inverse golden ratio for delay two) via grounded Laplacian, and gives a supermodular greedy rule for corrector placement; experiments on five models confirm dose-delay oscillations.

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

cs.AI · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.

Holistic Evaluation and Failure Diagnosis of AI Agents

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

A span-decomposed evaluation framework for AI agents achieves state-of-the-art results on GAIA and SWE-Bench with up to 3.5x gains in localization accuracy by breaking traces into independent per-span judgments.

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

cs.SE · 2026-05-13 · unverdicted · novelty 7.0

AgentLens reveals 10.7% of passing SWE-agent trajectories exhibit Lucky Pass behaviors and introduces a process-level evaluation framework with a new annotated dataset of 1,815 trajectories.

AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

cs.AI · 2026-04-20 · unverdicted · novelty 7.0

AJ-Bench provides 155 tasks in three domains to evaluate environment-interacting agent judges, showing performance gains over LLM-as-a-Judge but exposing remaining verification challenges.

When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling

cs.SE · 2026-01-21 · unverdicted · novelty 7.0

A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.

Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories

cs.LG · 2026-06-05 · unverdicted · novelty 6.0

Strained coherence flagged by Claude judge on 44 coding trajectories predicts failure (94% vs 46%, p=0.003), with partial replication on second model.

StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems

cs.AI · 2026-06-02 · unverdicted · novelty 6.0

StepFinder turns execution logs into temporal semantic sequences via LLMs then uses temporal modeling plus attention to attribute failures to specific steps more accurately and 79% faster than direct LLM methods on the Who&When benchmark.

TrajAudit: Automated Failure Diagnosis for Agentic Coding Systems

cs.SE · 2026-05-26 · unverdicted · novelty 6.0

TrajAudit diagnoses failures in repository-level agentic coding trajectories by filtering noise and injecting test-failure priors, achieving >24.4 pp higher localization accuracy and 18% lower token use on the new RootSE benchmark of 93 instances.

PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.

SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents

cs.SE · 2026-04-20 · unverdicted · novelty 6.0

SelfHeal uses two ReAct agents and empirical fix patterns to repair bugs in LLM agents, outperforming baselines on a new 37-instance benchmark.

Process-Centric Analysis of Agentic Software Systems

cs.SE · 2025-12-02 · unverdicted · novelty 6.0

Graphectory turns stochastic agent trajectories into analyzable graphs, showing that stronger models and successful fixes follow coherent localization-validation steps while failures are chaotic, and online detection plus rollback improves resolution rates by 6.9-23.5%.

From Agent Traces to Trust: A Survey of Evidence Tracing and Execution Provenance in LLM Agents

cs.CR · 2026-06-03 · unverdicted · novelty 5.0 · 2 refs

This survey defines execution provenance as a typed graph of agent execution and evidence tracing as its projection onto evidence-support relations, then reviews methods, taxonomy, benchmarks, and challenges for auditable LLM agents.

Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

cs.AI · 2026-05-20 · unverdicted · novelty 5.0 · 3 refs

Insights Generator is a multi-agent system that produces evidence-backed insights from corpora of LLM agent traces and yields 30.4pp performance gains when humans apply the reports.

Towards Self-Improving Error Diagnosis in Multi-Agent Systems

cs.MA · 2026-04-19 · unverdicted · novelty 5.0

ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.

A pragmatic approach to regulating AI agents

cs.CY · 2026-04-16 · unverdicted · novelty 5.0

AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.

A Survey of Context Engineering for Large Language Models

cs.CL · 2025-07-17 · accept · novelty 4.0

The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle with equally sophisticated long outputs.

Agent System Operations: Categorization, Challenges, and Future Directions

cs.MA · 2026-06-01 · unverdicted · novelty 3.0

This survey categorizes anomalies in agent systems into intra-agent and inter-agent types and introduces the AgentOps framework with four operational stages.

citing papers explorer

Showing 20 of 20 citing papers.

CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend cs.SE · 2026-04-25 · unverdicted · none · ref 38
CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.
Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement cs.MA · 2026-06-25 · unverdicted · none · ref 18
Models delayed verification in multi-agent LLMs as graph consensus, derives stability thresholds (inverse golden ratio for delay two) via grounded Laplacian, and gives a supermodular greedy rule for corrector placement; experiments on five models confirm dose-delay oscillations.
Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents cs.CL · 2026-05-21 · unverdicted · none · ref 23
Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems cs.AI · 2026-05-14 · unverdicted · none · ref 28 · 2 links
A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
Holistic Evaluation and Failure Diagnosis of AI Agents cs.AI · 2026-05-14 · unverdicted · none · ref 7
A span-decomposed evaluation framework for AI agents achieves state-of-the-art results on GAIA and SWE-Bench with up to 3.5x gains in localization accuracy by breaking traces into independent per-span judgments.
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation cs.SE · 2026-05-13 · unverdicted · none · ref 9
AgentLens reveals 10.7% of passing SWE-agent trajectories exhibit Lucky Pass behaviors and introduces a process-level evaluation framework with a new annotated dataset of 1,815 trajectories.
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation cs.AI · 2026-04-20 · unverdicted · none · ref 1
AJ-Bench provides 155 tasks in three domains to evaluate environment-interacting agent judges, showing performance gains over LLM-as-a-Judge but exposing remaining verification challenges.
When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling cs.SE · 2026-01-21 · unverdicted · none · ref 13
A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.
Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories cs.LG · 2026-06-05 · unverdicted · none · ref 3
Strained coherence flagged by Claude judge on 44 coding trajectories predicts failure (94% vs 46%, p=0.003), with partial replication on second model.
StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems cs.AI · 2026-06-02 · unverdicted · none · ref 5
StepFinder turns execution logs into temporal semantic sequences via LLMs then uses temporal modeling plus attention to attribute failures to specific steps more accurately and 79% faster than direct LLM methods on the Who&When benchmark.
TrajAudit: Automated Failure Diagnosis for Agentic Coding Systems cs.SE · 2026-05-26 · unverdicted · none · ref 10
TrajAudit diagnoses failures in repository-level agentic coding trajectories by filtering noise and injecting test-failure priors, achieving >24.4 pp higher localization accuracy and 18% lower token use on the new RootSE benchmark of 93 instances.
PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement cs.AI · 2026-05-11 · unverdicted · none · ref 4
PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.
SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents cs.SE · 2026-04-20 · unverdicted · none · ref 23
SelfHeal uses two ReAct agents and empirical fix patterns to repair bugs in LLM agents, outperforming baselines on a new 37-instance benchmark.
Process-Centric Analysis of Agentic Software Systems cs.SE · 2025-12-02 · unverdicted · none · ref 15
Graphectory turns stochastic agent trajectories into analyzable graphs, showing that stronger models and successful fixes follow coherent localization-validation steps while failures are chaotic, and online detection plus rollback improves resolution rates by 6.9-23.5%.
From Agent Traces to Trust: A Survey of Evidence Tracing and Execution Provenance in LLM Agents cs.CR · 2026-06-03 · unverdicted · none · ref 36 · 2 links
This survey defines execution provenance as a typed graph of agent execution and evidence tracing as its projection onto evidence-support relations, then reviews methods, taxonomy, benchmarks, and challenges for auditable LLM agents.
Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents cs.AI · 2026-05-20 · unverdicted · none · ref 7 · 3 links
Insights Generator is a multi-agent system that produces evidence-backed insights from corpora of LLM agent traces and yields 30.4pp performance gains when humans apply the reports.
Towards Self-Improving Error Diagnosis in Multi-Agent Systems cs.MA · 2026-04-19 · unverdicted · none · ref 49
ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.
A pragmatic approach to regulating AI agents cs.CY · 2026-04-16 · unverdicted · none · ref 6
AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.
A Survey of Context Engineering for Large Language Models cs.CL · 2025-07-17 · accept · none · ref 220
The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle with equally sophisticated long outputs.
Agent System Operations: Categorization, Challenges, and Future Directions cs.MA · 2026-06-01 · unverdicted · none · ref 125
This survey categorizes anomalies in agent systems into intra-agent and inter-agent types and introduces the AgentOps framework with four operational stages.

TRAIL: Trace reasoning and agentic issue localization.arXiv preprint arXiv:2505.08638,

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer