Canonical reference

Trail: Trace reasoning and agentic issue localization · 2025 · arXiv 2505.08638

Canonical reference. 83% of citing Pith papers cite this work as background.

21 Pith papers citing it

citation-role summary

background 5 · baseline 1

citation-polarity summary

background 5 · baseline 1

representative citing papers

CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend

cs.SE · 2026-04-25 · unverdicted · novelty 8.0

CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.

Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement

cs.MA · 2026-06-25 · unverdicted · novelty 7.0

The paper models delayed verification in multi-agent LLMs as graph consensus, derives stability thresholds (inverse golden ratio for delay two) via a grounded Laplacian, and gives a supermodular greedy rule for corrector placement; experiments on five models confirm dose-delay oscillations.

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

cs.AI · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.

Holistic Evaluation and Failure Diagnosis of AI Agents

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

A span-decomposed evaluation framework for AI agents achieves state-of-the-art results on GAIA and SWE-Bench with up to 3.5x gains in localization accuracy by breaking traces into independent per-span judgments.

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

cs.SE · 2026-05-13 · unverdicted · novelty 7.0

AgentLens reveals 10.7% of passing SWE-agent trajectories exhibit Lucky Pass behaviors and introduces a process-level evaluation framework with a new annotated dataset of 1,815 trajectories.

AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

cs.AI · 2026-04-20 · unverdicted · novelty 7.0

AJ-Bench provides 155 tasks in three domains to evaluate environment-interacting agent judges, showing performance gains over LLM-as-a-Judge but exposing remaining verification challenges.

When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling

cs.SE · 2026-01-21 · unverdicted · novelty 7.0

A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.

Refploit: Facilitating Exploit Construction via Code-Agent Trajectory Repair

cs.SE · 2026-07-02 · unverdicted · novelty 6.0

Refploit repairs code-agent trajectories for Java exploit reproduction via differential validation and focused recovery constraints, achieving 80.2% success on 172 references with 64.3% relative improvement.

Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories

cs.LG · 2026-06-05 · unverdicted · novelty 6.0

Strained coherence, as flagged by a Claude judge on 44 coding trajectories, predicts failure (94% vs 46%, p=0.003), with partial replication on a second model.

StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems

cs.AI · 2026-06-02 · unverdicted · novelty 6.0

StepFinder turns execution logs into temporal semantic sequences via LLMs, then uses temporal modeling plus attention to attribute failures to specific steps more accurately and 79% faster than direct LLM methods on the Who&When benchmark.

TrajAudit: Automated Failure Diagnosis for Agentic Coding Systems

cs.SE · 2026-05-26 · unverdicted · novelty 6.0

TrajAudit diagnoses failures in repository-level agentic coding trajectories by filtering noise and injecting test-failure priors, achieving >24.4 pp higher localization accuracy and 18% lower token use on the new RootSE benchmark of 93 instances.

PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.

SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents

cs.SE · 2026-04-20 · unverdicted · novelty 6.0

SelfHeal uses two ReAct agents and empirical fix patterns to repair bugs in LLM agents, outperforming baselines on a new 37-instance benchmark.

Process-Centric Analysis of Agentic Software Systems

cs.SE · 2025-12-02 · unverdicted · novelty 6.0

Graphectory turns stochastic agent trajectories into analyzable graphs, showing that stronger models and successful fixes follow coherent localization-validation steps while failures are chaotic, and online detection plus rollback improves resolution rates by 6.9-23.5%.

From Agent Traces to Trust: A Survey of Evidence Tracing and Execution Provenance in LLM Agents

cs.CR · 2026-06-03 · unverdicted · novelty 5.0 · 2 refs

This survey defines execution provenance as a typed graph of agent execution and evidence tracing as its projection onto evidence-support relations, then reviews methods, taxonomy, benchmarks, and challenges for auditable LLM agents.

Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

cs.AI · 2026-05-20 · unverdicted · novelty 5.0 · 3 refs

Insights Generator is a multi-agent system that produces evidence-backed insights from corpora of LLM agent traces and yields 30.4pp performance gains when humans apply the reports.

Towards Self-Improving Error Diagnosis in Multi-Agent Systems

cs.MA · 2026-04-19 · unverdicted · novelty 5.0

ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.

A pragmatic approach to regulating AI agents

cs.CY · 2026-04-16 · unverdicted · novelty 5.0

AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.

A Survey of Context Engineering for Large Language Models

cs.CL · 2025-07-17 · accept · novelty 4.0

The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle with equally sophisticated long outputs.

Agent System Operations: Categorization, Challenges, and Future Directions

cs.MA · 2026-06-01 · unverdicted · novelty 3.0

This survey categorizes anomalies in agent systems into intra-agent and inter-agent types and introduces the AgentOps framework with four operational stages.
