TrajAudit: Automated Failure Diagnosis for Agentic Coding Systems

Minxing Wang; Xiaofei Xie; Yintong Huo

REVIEW · 4 major objections · 4 minor · cited by 3

TrajAudit finds the earliest decisive failure step in long, noisy coding-agent trajectories more accurately than prior automated diagnosis methods.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.5

2026-07-12 15:55 UTC pith:SEHWXHFR

Load-bearing objection: Clear problem and a usable-looking diagnosis pipeline plus RootSE, but the body we received is unreadable, so the 10.8%/21.6% localization claims stay provisional.

arxiv 2605.26563 v2 pith:SEHWXHFR submitted 2026-05-26 cs.SE


keywords: failure diagnosis, coding agents, agent trajectories, semantic saliency folding, RootSE, agentic systems, software engineering

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Repository-level coding agents produce long execution traces packed with redundant program structure and low-signal observations, which makes it hard for language models to follow the causal chain of a failure. TrajAudit tackles that with an investigator agent that first compresses failure-irrelevant context by semantic saliency folding, then uses test-failure reports as prior guidance about likely error regions, and can reopen folded material with tools when the investigation needs detail. The authors also introduce RootSE: 102 real repository-level coding trajectories, each labeled with the earliest decisive error step and a justification. On RootSE, TrajAudit raises exact failure-localization accuracy by 10.8% with reference material and by 21.6% without it over the strongest baselines. Reliable localization of the first wrong step is presented as essential for refining agents that operate on complex codebases and for operational reliability.

Core claim

The paper establishes that an investigator agent supported by semantic saliency folding of trajectory noise, diagnostic priors derived from test failure reports, and on-demand tools to re-inspect folded content can recover the earliest decisive error step in repository-level coding-agent trajectories substantially more accurately than existing trajectory-based diagnosis approaches, as measured on the new RootSE benchmark of 102 annotated real-world instances.

What carries the argument

TrajAudit: an investigator agent whose two supporting modules perform semantic saliency folding (to drop low-signal trajectory observations) and extract preliminary diagnostic guidance from test failure reports, with tools that re-expand folded content on demand so the full trajectory remains available without flooding the context.
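
The fold-then-reopen idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Step` and `FoldedTrajectory` types, the `fold` and `expand` names, the cue-word list, and the length cutoff are all hypothetical, and a plain keyword filter stands in for the saliency criterion, which the readable material does not specify.

```python
from dataclasses import dataclass, field

# Hypothetical saliency cue words; the paper's real criteria are not known here.
SALIENT_KEYWORDS = ("error", "traceback", "failed", "assert")


@dataclass
class Step:
    index: int
    action: str
    observation: str


@dataclass
class FoldedTrajectory:
    steps: list
    folded: set = field(default_factory=set)  # indices whose observation is hidden

    def view(self):
        """Render the compressed trajectory an investigator would read."""
        lines = []
        for s in self.steps:
            if s.index in self.folded:
                obs = f"[folded: {len(s.observation)} chars]"
            else:
                obs = s.observation
            lines.append(f"{s.index}: {s.action} -> {obs}")
        return lines

    def expand(self, index):
        """Tool call: return the full observation for one step, folded or not."""
        for s in self.steps:
            if s.index == index:
                return s.observation
        raise KeyError(index)


def fold(steps, keywords=SALIENT_KEYWORDS, max_len=80):
    """Hide long observations that contain none of the cue words."""
    folded = set()
    for s in steps:
        text = s.observation.lower()
        if len(s.observation) > max_len and not any(k in text for k in keywords):
            folded.add(s.index)
    return FoldedTrajectory(steps=list(steps), folded=folded)


# Illustrative two-step trajectory (invented data).
steps = [
    Step(0, "ls src/", "file listing " * 20),               # long, low-signal
    Step(1, "pytest", "FAILED test_port: AssertionError"),  # short, salient
]
traj = fold(steps)
for line in traj.view():
    print(line)
```

In this sketch nothing is deleted: `traj.expand(0)` still returns the original observation, which is the property the design relies on to keep the full trajectory available without flooding the context.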

Load-bearing premise

That folding away low-saliency trajectory content still preserves enough of the earliest decisive error for the investigator to recover the same step humans annotated as ground truth.

What would settle it

On RootSE, if TrajAudit’s exact-match localization accuracy stayed the same or improved in both the with-reference and without-reference settings when semantic saliency folding is removed or replaced by naive truncation, the central claim that folding-plus-tools enables better diagnosis would be falsified, because the gain could then not be attributed to folding.


If this is right

Exact localization of the first decisive error becomes more practical for long agent trajectories on real multi-file repositories.
Diagnosis remains useful when reference solutions or golden patches are unavailable, where the paper reports the larger accuracy gain.
RootSE offers a shared yardstick that scores earliest-step recovery rather than only final failure symptoms.
Agent refinement loops can target the true root step instead of late symptoms that only reflect downstream effects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fold-then-reopen pattern may transfer to other long-horizon agent settings—web navigation, multi-tool workflows—where traces are similarly long and noisy.
If saliency folding systematically drops rare but decisive observations, the method could still miss the true earliest error; measuring fold recall against human root steps would be a direct stress test.
Continuous evaluation pipelines for coding agents could insert TrajAudit-style diagnosis without requiring human root-cause labels for every failed run.
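
The fold-recall stress test suggested in the list above could be computed roughly as follows. This is a hedged sketch: the record layout (`gold`, `folded`, `expanded`) and the function name `fold_recall` are assumptions made for illustration, and the four records are invented, not RootSE data.

```python
def fold_recall(records):
    """Summarize how often the human-labeled root step survives folding.

    Each record is a dict with:
      'gold'     - index of the annotated earliest decisive error step
      'folded'   - set of step indices hidden by folding
      'expanded' - set of step indices the investigator reopened with tools
    """
    records = list(records)
    if not records:
        raise ValueError("no records to score")
    hidden = [r for r in records if r["gold"] in r["folded"]]
    recovered = [r for r in hidden if r["gold"] in r["expanded"]]
    total = len(records)
    return {
        "kept_visible": (total - len(hidden)) / total,
        "folded_away": len(hidden) / total,
        # Undefined when folding never hides a gold step.
        "recovered_by_tools": len(recovered) / len(hidden) if hidden else None,
    }


# Invented example: two gold steps stay visible, two are folded, one is reopened.
sample = [
    {"gold": 3, "folded": {1, 2}, "expanded": set()},
    {"gold": 5, "folded": {5, 6}, "expanded": {5}},
    {"gold": 2, "folded": {2}, "expanded": set()},
    {"gold": 7, "folded": set(), "expanded": set()},
]
report = fold_recall(sample)
print(report)
```

A low `kept_visible` paired with a low `recovered_by_tools` would be the warning sign: it would mean the decisive step is often hidden and seldom reopened.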

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note


The one thing to know: this is a practical systems paper on a real pain point, namely finding the earliest decisive failure step in long, noisy trajectories from repository-level coding agents, and it ships both a method sketch (TrajAudit) and a 102-instance annotated benchmark (RootSE). The abstract’s numbers are concrete: +10.8% exact localization with reference and +21.6% without, against the strongest baselines.

What is actually new is the combination, not any single ingredient. Semantic saliency folding to compress low-signal code/context, test-failure reports as prior focus, and an investigator agent that can re-expand folded steps with tools is a sensible engineering answer to length and noise. RootSE’s “earliest decisive error + justification” labels are the more durable contribution if they hold up; the field needs that kind of ground truth more than another agent wrapper.

The soft spots are real but proportional to what we can read. The full manuscript text in the packet is heavily corrupted—garbled tokens, mixed-in unrelated content—so we cannot check ablations, baseline definitions, annotation protocol, inter-annotator agreement, or whether folding ever discards the gold step and only recovers it via tools. That is exactly the stress-test concern: exact-match gains only mean something if RootSE labels are stable and independent of the folding policy, and if “with-reference” does not leak localization cues. Free parameters (backbone, saliency thresholds, annotation criteria) are also underspecified in the readable material. None of that falsifies the abstract; it just means confidence has to stay low until a clean PDF and artifacts exist.

Who it is for: people building or evaluating coding agents who need failure localization, not theory readers. The framing is coherent, the problem is timely, and there is no formal circularity in the metric design. I would send it to peer review rather than desk-reject; a serious referee can demand IAA, preservation rates for folded gold steps, and released RootSE. I would not cite it yet on the strength of this packet alone, and I would only bring it to reading group if we have a clean version and code. Engage if you work on agent reliability; otherwise wait for the camera-ready and artifacts.

Referee Report

4 major / 4 minor

Summary. The paper proposes TrajAudit, an automated failure-diagnosis framework for long, noisy trajectories of repository-level coding agents. It combines an investigator agent with two modules: semantic saliency folding to compress failure-irrelevant context, and prior guidance derived from test-failure reports; the agent may re-expand folded content via tools. The authors introduce RootSE (102 real-world instances annotated with the earliest decisive error step and a justification) and report that TrajAudit improves exact failure-localization accuracy over the strongest baselines by 10.8% (with-reference) and 21.6% (without-reference).

Significance. If the gains hold under a stable annotation protocol and a folding procedure that preserves decisive evidence, the work would supply both a practical diagnosis pipeline for agentic coding systems and a reusable localization benchmark. The combination of saliency folding, test-report priors, and on-demand tool inspection is a concrete systems contribution aimed at a real operational pain point (long, noisy trajectories). The absolute exact-match lifts on n=102 are large enough to matter for iterative agent refinement, provided the evaluation design is shown to be non-circular and statistically reliable.
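
To make the scale of these lifts concrete, the percentages can be converted back into instance counts. The short calculation below assumes the reported deltas are percentage-point differences measured over all 102 RootSE instances, which the abstract does not state explicitly; `implied_instances` is a hypothetical helper name.

```python
N = 102  # RootSE size, as stated in the abstract


def implied_instances(delta_pct, n=N):
    """Nearest whole number of instances matching a percentage-point gain,
    plus the percentage recomputed from that whole number."""
    k = round(delta_pct / 100 * n)
    return k, round(100 * k / n, 1)


for delta in (10.8, 21.6):
    k, recomputed = implied_instances(delta)
    print(f"{delta}% of {N} is about {k} instances ({recomputed}% recomputed)")
```

Under that assumption the headline gains correspond to roughly 11 and 22 additional correctly localized trajectories, which is why a handful of relabeled instances could move the result noticeably.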

major comments (4)

The central claim rests on exact match to RootSE “earliest decisive error” labels (abstract; evaluation). The readable material does not report inter-annotator agreement, multi-annotator protocol, or adjudication rules for that definition. Without IAA (or equivalent reliability evidence), the 10.8%/21.6% absolute gains cannot be interpreted as genuine diagnostic improvement rather than agreement with a single labeling style. This is load-bearing for every reported accuracy number.
Semantic saliency folding is presented as noise reduction that preserves failure-relevant content, with tools recovering folded evidence on demand. No preservation/recovery analysis is given: how often the gold earliest step is folded away, how often tools re-expand it, and whether accuracy drops when re-expansion fails. If folding systematically discards or buries the decisive step, the reported lifts overstate the method’s contribution. An ablation or retention rate on RootSE is required.
The with-reference vs. without-reference split is used for the headline deltas, but the manuscript (as readable) does not define what the reference supplies or rule out leakage of localization cues into the with-reference condition. That definition is necessary to interpret why the without-reference gap (21.6%) is much larger and whether the more modest 10.8% gap is the fairer operational setting.
The provided full-text body is severely garbled (encoding/OCR corruption across sections), so baseline definitions, ablations, statistical tests, and annotation details cannot be verified from the manuscript itself. Until a clean, complete version is available, the experimental claims cannot be audited at the level expected for a systems paper in this venue.
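
The reliability evidence requested in the first major comment could start from something as simple as the following. It is a sketch under stated assumptions: two hypothetical annotators label the same trajectories independently, each label is a step index, and the label lists are invented. It reports raw agreement only; a chance-corrected statistic would require a further decision on how to treat step indices as categories.

```python
def step_agreement(labels_a, labels_b, tolerance=0):
    """Share of trajectories where two annotators' step labels differ
    by at most `tolerance` steps (tolerance=0 means exact agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    hits = sum(abs(a - b) <= tolerance for a, b in zip(labels_a, labels_b))
    return hits / len(labels_a)


# Invented labels for five trajectories.
annotator_a = [3, 7, 12, 4, 9]
annotator_b = [3, 8, 12, 4, 15]

exact = step_agreement(annotator_a, annotator_b)
near = step_agreement(annotator_a, annotator_b, tolerance=1)
print(f"exact agreement: {exact:.2f}, within one step: {near:.2f}")
```

Reporting both figures separates genuine disagreement about the root cause from off-by-one differences in where annotators place the same error.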

minor comments (4)

Abstract and intro should state the LLM backbone(s), decoding settings, and whether they are shared across TrajAudit and baselines, so absolute gains are not confounded by model choice.
Clarify the formal definition of “earliest decisive error step” (decision rule, examples of borderline cases) so RootSE can be reused or re-annotated by others.
Report confidence intervals or significance tests for the 10.8% and 21.6% absolute differences on n=102; raw point estimates alone are hard to weigh.
The arXiv footer in the source dump appears to mis-tag the paper (astro-ph.SR); correct metadata and ensure the camera-ready PDF is free of the encoding corruption seen in the review copy.
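
For the significance test requested in the minor comments, a paired test fits because both systems are scored on the same instances. The sketch below uses an exact McNemar test on the discordant pairs. The counts are made up for illustration and are not taken from the paper; they are chosen only so that the net gain is 11 instances, which is what a 10.8-point lift on 102 instances would imply if the figure is a percentage-point difference.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: instances only the first system localizes correctly
    c: instances only the second system localizes correctly
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up discordant counts (not from the paper).
trajaudit_only_correct = 15
baseline_only_correct = 4

p_value = mcnemar_exact(trajaudit_only_correct, baseline_only_correct)
net_gain = trajaudit_only_correct - baseline_only_correct
print(f"net gain: {net_gain} instances, p = {p_value:.4f}")
```

The same net gain can be significant or not depending on how many instances flip in each direction, so the paper would need to report the discordant counts rather than the difference alone.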

Circularity Check

0 steps flagged

No significant circularity: empirical localization accuracy is scored against human RootSE annotations, not quantities defined by TrajAudit’s free parameters.

full rationale

TrajAudit is an empirical systems paper. Its load-bearing claim is absolute exact-match lifts (10.8% with-reference, 21.6% without-reference) on RootSE (n=102), where each instance is annotated with an earliest decisive error step and justification. That metric is an external match to human labels, not a quantity derived from, or statistically forced by, TrajAudit’s own saliency-folding parameters, test-failure priors, or investigator tool policy. Semantic saliency folding and on-demand re-expansion are method components whose success is measured against those fixed labels; they do not redefine the target. There is no self-definitional loop (X defined via Y then used to predict Y), no fitted parameter renamed as a prediction of a closely related quantity, no uniqueness theorem imported from overlapping authors to forbid alternatives, and no ansatz smuggled in via self-citation that then becomes the result. Self-citation, if any, is ordinary related-work framing and is not load-bearing for the accuracy numbers. Evaluation-validity concerns (inter-annotator agreement, whether folding ever discards the gold step) affect correctness risk, not circularity of the derivation chain. The paper is self-contained against an external benchmark; score 0 is the honest finding.

Axiom & Free-Parameter Ledger

3 free parameters · 4 axioms · 2 invented entities

Empirical SE systems paper. Load-bearing premises are domain assumptions about trajectories, tests, and human labels—not free physical constants or invented particles. No formal derivation; claims rest on experimental protocol and LLM tooling assumptions.

free parameters (3)

LLM backbone / temperature / decoding settings for investigator and folding modules
Diagnosis quality depends on model choice and generation hyperparameters; values are engineering choices that affect reported accuracy.
Saliency folding thresholds / compression policy
How aggressively context is folded is a design knob that trades noise reduction against risk of hiding the decisive step.
RootSE annotation criteria for “earliest decisive error step”
Ground-truth definition is a human labeling policy; different raters or rubrics would change exact-match scores.

axioms (4)

domain assumption: Repository-level coding-agent trajectories are long enough and noisy enough that vanilla LLM diagnosis is systematically impaired.
Stated as the motivating problem in the abstract; underpins need for folding and priors.
domain assumption: Test failure reports contain useful prior signal about likely failure regions in the trajectory.
Used to justify the preliminary diagnostic guidance module.
domain assumption: Human annotators can identify a unique earliest decisive error step with a justification that serves as exact-match ground truth.
Required for RootSE evaluation of exact failure localization accuracy.
ad hoc to paper: On-demand tool inspection of folded content can recover critical evidence without forcing full-context reasoning at every step.
Core TrajAudit design claim enabling focused investigation.

invented entities (2)

TrajAudit (investigator agent + semantic saliency folding + test-report prior module) · no independent evidence
purpose: Automated localization of earliest decisive failures in coding-agent trajectories.
Named system introduced by the paper; evaluated empirically rather than postulated as a physical entity.
RootSE benchmark (102 annotated instances) · no independent evidence
purpose: Provide ground-truth earliest decisive error steps for evaluation.
New dataset/benchmark; value depends on annotation quality and public release.

pith-pipeline@v1.1.0-grok45 · 13922 in / 2679 out tokens · 28909 ms · 2026-07-12T15:55:46.109128+00:00 · methodology


Original abstract

Agentic systems have been widely studied to automate coding tasks such as bug fixing and feature implementation. As these systems increasingly operate on complex codebases, understanding where and why they fail becomes essential for iterative refinement and operational reliability. Existing automated failure diagnosis approaches leverage task execution trajectories, yet they struggle with trajectories produced by repository-level coding agents due to two key properties. First, these trajectories are often long, spanning many execution steps, making it difficult for LLMs to track the causal chain of failure over the execution history. Second, these trajectories are laden with noise, containing substantial low-signal observations such as redundant program structures and verbose code context, which can interfere with LLM reasoning. To address these challenges, we propose TrajAudit, an automated failure diagnosis framework specifically for trajectories produced by repository-level coding agents. TrajAudit employs an investigator agent supported by two modules: one reduces failure-irrelevant noisy context through semantic saliency folding, and the other derives preliminary diagnostic guidance from test failure reports as prior knowledge to help LLMs focus on likely failure regions. The investigator agent can further invoke tools to inspect folded content on demand, enabling a focused investigation without losing access to the full trajectory context. We also introduce RootSE, a benchmark of 102 real-world instances from repository-level coding tasks, each annotated with the earliest decisive error step and a justification. Experiments on RootSE show that TrajAudit outperforms the strongest baselines by 10.8% and 21.6% in exact failure localization accuracy in the with- and without-reference settings, respectively, demonstrating its effectiveness.

Figures

Figures reproduced from arXiv: 2605.26563 by Minxing Wang, Xiaofei Xie, Yintong Huo.

**Figure 1.** Failure diagnosis in agentic systems.

**Figure 2.** The agent workflow and execution trajectory in a coding task.

**Figure 3.** Accuracy of baseline methods under varying tra…

**Figure 4.** RootSE Annotation Guideline.

**Figure 5.** Phase-wise Failure Distribution in RootSE.

**Figure 6.** The overall workflow of TrajAudit.

**Figure 7.** Exact Step-Level Accuracy across Varying Trajectory…

**Figure 8.** A Worked Example of TrajAudit.


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AfterVibe: What Remains When the Conversation Ends
cs.SE · 2026-07 · conditional · novelty 7.5
AfterVibe extracts natural-language specs from vibe-coding trajectories, validates them via blind regeneration scored by a three-tier verifier, and reaches mean scores of 5.06–5.74/6 on 72 industrial tasks.

Failure as a Process: An Anatomy of CLI Coding Agent Trajectories
cs.SE · 2026-07 · conditional · novelty 7.0
Across 1,794 CLI agent trajectories, failures are mostly epistemic, start by median step 7, and often stay silent until after lock-in.

What Resolve Rate Hides: Trajectory Structure Diagnostics for Coding Agents
cs.SE · 2026-07 · conditional · novelty 6.0
TraceProbe normalizes coding agent trajectories into canonical actions and applies rule-based detectors to localize failure patterns and behavioral divergences that resolve rate hides.
