pith. machine review for the scientific record.

arxiv: 2604.11641 · v3 · submitted 2026-04-13 · 💻 cs.SE · cs.AI

Recognition: unknown

CodeTracer: Towards Traceable Agent States

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:18 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords: code agents · agent tracing · failure localization · state transitions · debugging · hierarchical traces · code workflows

The pith

CodeTracer reconstructs hierarchical state histories from code agent runs to localize failure origins and aid recovery.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Code agents produce complex runs with parallel tool calls and multi-stage workflows, where early errors can cascade into hidden chains that are difficult to observe or fix. The paper introduces CodeTracer, which uses evolving extractors to parse artifacts from different agent frameworks into a hierarchical trace tree that records every state transition along with persistent memory of prior context. This structure supports failure onset localization: identifying the first point where the agent went off track and its downstream effects. The authors built CodeTraceBench, a dataset of supervised trajectories across bug fixing, refactoring, and terminal tasks from four frameworks, to test the approach. Experiments indicate that the system outperforms direct prompting and lightweight baselines, and that its diagnostic outputs enable recovery of failed runs when replayed under the same resource limits.

Core claim

CodeTracer parses heterogeneous run artifacts through evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint the failure origin and its downstream chain, with evaluation on CodeTraceBench showing consistent outperformance of baselines and recovery of failed trajectories via the resulting diagnostics.

What carries the argument

The hierarchical trace tree with persistent memory, constructed by evolving extractors that parse run artifacts into complete state-transition histories; this structure is what enables tracing of error propagation and precise failure localization.
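A minimal sketch of what such a trace tree could look like in Python, assuming a conventional parent–child layout; the field names and the way context is inherited here are editorial guesses, not the paper's actual schema:

```python
# Editorial sketch only: field names and context inheritance are assumptions,
# not the schema used in the paper.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TraceNode:
    """One state in an agent run's hierarchical trace."""
    step_id: int
    stage: str                 # e.g. "explore", "edit", "test"
    action: str                # command or tool call issued at this step
    observation: str           # environment feedback for that action
    memory: dict = field(default_factory=dict)   # persistent context carried forward
    children: list = field(default_factory=list)
    parent: Optional["TraceNode"] = None

    def add_child(self, child: "TraceNode") -> "TraceNode":
        # A child state inherits the parent's accumulated context, so any
        # later failure node can be read with everything the agent "knew".
        child.parent = self
        child.memory = {**self.memory, **child.memory}
        self.children.append(child)
        return child

    def chain_to_root(self) -> list:
        """Ancestor path used when walking back from a failure onset."""
        node, path = self, []
        while node is not None:
            path.append(node)
            node = node.parent
        return path
```

The reason to carry `memory` down the tree is that a localized failure node can then be read together with everything the agent had in context at that moment, rather than in isolation.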

If this is right

  • Enables systematic, supervised evaluation of tracing on diverse code tasks through the CodeTraceBench dataset.
  • Diagnostic signals from the trace trees allow recovery of originally failed agent runs when replayed under matched computational budgets.
  • Outperforms direct prompting and lightweight baselines at localizing failures in complex, parallel workflows.
  • Extends tracing beyond simple interaction logs to scalable analysis of real coding agent executions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar extractor-based reconstruction could extend to non-code agents if their artifacts share comparable structure.
  • Persistent memory in the trace tree might support automated repair loops that feed localized failures back into the agent for self-correction (sketched after this list).
  • Widespread adoption would shift debugging of agent systems from manual inspection to automated, framework-agnostic analysis.
  • The approach highlights a general need for standardized state-transition logging in agent frameworks to reduce reliance on post-hoc parsing.
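The second bullet is concrete enough to sketch. This mirrors the paper's reflective replay (Figure 6) in spirit, but `run_agent` and `localize_onset` are hypothetical stand-ins, not the paper's API:

```python
# Editorial sketch of a repair loop: run_agent and localize_onset are
# hypothetical stand-ins, not the paper's API.

def reflective_replay(task, run_agent, localize_onset, max_steps=50):
    """Retry a failed run with localized diagnostics injected as a hint."""
    trace, solved = run_agent(task, hint=None, max_steps=max_steps)
    if solved:
        return trace, True
    onset = localize_onset(trace)  # first off-track step plus downstream chain
    hint = (
        f"A previous attempt went wrong at step {onset['step_id']} "
        f"({onset['stage']}): {onset['summary']}. Avoid repeating this mistake."
    )
    # Matched budget: the replay gets the same step limit as the original run.
    return run_agent(task, hint=hint, max_steps=max_steps)
```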

Load-bearing premise

Evolving extractors can reliably turn run artifacts from varied agent frameworks into accurate, complete hierarchical state histories without systematic omissions or mislabeling of steps.
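To make the premise concrete, here is one hypothetical shape for such an extractor layer; the framework name, artifact fields, and dispatch logic are invented for illustration and imply nothing about how the paper's evolving extractors are implemented:

```python
# Hypothetical extractor layer: framework names, artifact fields, and the
# dispatch logic are invented for illustration.
from abc import ABC, abstractmethod

class TraceExtractor(ABC):
    """Parses one framework's raw run artifacts into uniform steps."""

    @abstractmethod
    def can_parse(self, artifact: dict) -> bool: ...

    @abstractmethod
    def extract_steps(self, artifact: dict) -> list[dict]:
        """Return ordered steps shaped like {'step_id', 'action', 'observation'}."""

class ExampleLogExtractor(TraceExtractor):
    def can_parse(self, artifact: dict) -> bool:
        return artifact.get("framework") == "example-agent"

    def extract_steps(self, artifact: dict) -> list[dict]:
        return [
            {"step_id": i, "action": t["command"], "observation": t["output"]}
            for i, t in enumerate(artifact.get("turns", []), start=1)
        ]

def extract(artifact: dict, extractors: list) -> list[dict]:
    # "Evolving" would mean growing this registry as new log formats appear;
    # an unmatched artifact is the signal to write a new parser.
    for ex in extractors:
        if ex.can_parse(artifact):
            return ex.extract_steps(artifact)
    raise ValueError("no extractor matched this artifact format")
```

The premise fails exactly where this dispatch fails silently: an extractor that matches a format but drops or mislabels steps would corrupt every downstream localization.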

What would settle it

Running CodeTracer on a fresh collection of trajectories from an agent framework not seen during extractor development, then checking whether the reconstructed trees omit key transitions or misidentify failure onsets at rates high enough to prevent recovery of the original failed runs.
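That check reduces to two measurable quantities per trajectory; a minimal sketch, assuming human-annotated gold labels for transitions and onsets:

```python
# Editorial sketch of the held-out check: gold labels are assumed to come
# from human annotation of trajectories from an unseen framework.

def omission_rate(gold_steps: set, extracted_steps: set) -> float:
    """Fraction of annotated transitions missing from the reconstruction."""
    if not gold_steps:
        return 0.0
    return len(gold_steps - extracted_steps) / len(gold_steps)

def onset_localized(gold_onset: int, predicted_onset: int, tolerance: int = 0) -> bool:
    """Onset counts as found if within `tolerance` steps of the annotation."""
    return abs(gold_onset - predicted_onset) <= tolerance
```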

Figures

Figures reproduced from arXiv: 2604.11641 by Han Li, Haoyang Huang, He Ye, Hongyi Ye, Jiaheng Liu, Jiaming Wang, Ken Deng, Lehan Zhang, Letian Zhu, Ming Sun, Pengyu Zou, Rili Feng, Xinping Lei, Yancheng He, Yifan Yao, Zhaoxiang Zhang.

Figure 1
Figure 1. Overview of the CODETRACER pipeline. Raw trajectories are standardized into hierarchical traces, curated into CODETRACEBENCH with step-level supervision, and diagnosed via failure onset localization with optional reflective replay. view at source ↗
Figure 2
Figure 2. Task categories solved per backbone. The central 66 categories are solved by all five models. view at source ↗
Figure 3
Figure 3. Distribution of error-critical steps across stages, contrasting solved and unsolved runs. view at source ↗
Figure 4
Figure 4. Hierarchical trace tree. Exploration steps remain under the current state node, whereas state… view at source ↗
Figure 5
Figure 5. Step budget decomposition per backbone (solved vs. unsolved) on the intersection subset. view at source ↗
Figure 6
Figure 6. Reflective replay. Pass@1 on originally failed runs before and after injecting CODETRACER’s diagnostic signals under matched budgets. view at source ↗
Figure 7
Figure 7. Effective action ratio. (a)–(e) per-model histograms; (f) cross-model violin and box summary. view at source ↗
Figure 9
Figure 9. Annotation tool interface — main panel. view at source ↗
Figure 10
Figure 10. Resolved rate vs. iteration budget for all 15 backbone–agent combinations. Rows correspond to backbones (Claude-sonnet-4, GPT-5, DeepSeek-V3.2, Qwen3-Coder-480B, Kimi-K2-Instruct); columns correspond to agent frameworks (MiniSWE-Agent, OpenHands, Terminus 2). Each panel sweeps max_iterations over {5, …, 300}. The Qwen–Terminus 2 cell is a placeholder pending data availability. view at source ↗
Figure 11
Figure 11. Step-range error distribution for all 15 backbone–agent combinations. Rows correspond to backbones (Claude-sonnet-4, GPT-5, DeepSeek-V3.2, Qwen3-Coder-480B, Kimi-K2-Instruct); columns correspond to agent frameworks (MiniSWE-Agent, OpenHands, Terminus 2). Each panel shows the stacked area ratio of error-critical steps across execution stages, illustrating where in the workflow failures tend to concentrate. view at source ↗
read the original abstract

Code agents are advancing rapidly, but debugging them is becoming increasingly difficult: as frameworks orchestrate parallel tool calls and multi-stage workflows over complex tasks, the agent's state transitions and error propagation become hard to observe. In these runs, an early misstep can trap the agent in unproductive loops or even cascade into fundamental errors, forming hidden error chains that make it hard to tell when the agent goes off track and why. Existing agent tracing analyses either focus on simple interaction logs or rely on small-scale manual inspection, which limits their scalability and usefulness for real coding workflows. We present CodeTracer, a tracing architecture that parses heterogeneous run artifacts through evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint the failure origin and its downstream chain. To enable systematic evaluation, we construct CodeTraceBench from a large collection of executed trajectories generated by four widely used code agent frameworks on diverse code tasks (e.g., bug fixing, refactoring, and terminal interaction), with supervision at both the stage and step levels for failure localization. Experiments show that CodeTracer substantially outperforms direct prompting and lightweight baselines, and that replaying its diagnostic signals consistently recovers originally failed runs under matched budgets. Our code and data are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CodeTracer, a tracing architecture for code agents that parses heterogeneous run artifacts via evolving extractors to reconstruct hierarchical trace trees with persistent memory, localizes failure onsets and their downstream chains, and evaluates the approach on CodeTraceBench (trajectories from four agent frameworks on tasks like bug fixing and refactoring). Experiments claim that CodeTracer substantially outperforms direct prompting and lightweight baselines on failure localization, and that replaying its diagnostic signals recovers originally failed runs under matched budgets. Code and data are released publicly.

Significance. If the extractor fidelity and localization claims hold, the work could meaningfully improve scalability of debugging for complex, multi-stage code agents where early errors cascade. The construction of CodeTraceBench with stage/step supervision and the public artifacts are concrete strengths that support reproducibility and follow-on research.

major comments (2)
  1. [Extractor and benchmark construction sections] The sections describing the evolving extractors and CodeTraceBench construction do not report any quantitative accuracy metrics (precision, recall, or agreement with human annotations) for the reconstructed hierarchical state-transition histories on held-out trajectories. This is load-bearing for the central claims, because both the failure-onset localization results and the recovery experiment (which replays diagnostic signals derived from the same extracted traces) could be inflated by systematic omissions or mislabelings.
  2. [Recovery experiment description] In the recovery experiment (abstract and results), it is not stated whether data-exclusion rules, run-matching criteria, or statistical tests for the recovery rates were pre-specified or applied post-hoc. Without these details, it is difficult to determine whether the reported consistent recovery under matched budgets is robust or sensitive to analysis choices.
minor comments (2)
  1. [Abstract] The abstract refers to 'supervision at both the stage and step levels' without clarifying whether this denotes human annotation, automated labeling, or a combination; a brief clarification would improve readability.
  2. [Architecture overview] The notion of 'persistent memory' within the hierarchical trace tree is introduced without an accompanying diagram or formal notation in the early sections, which would help readers follow the state-transition reconstruction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Extractor and benchmark construction sections] The sections describing the evolving extractors and CodeTraceBench construction do not report any quantitative accuracy metrics (precision, recall, or agreement with human annotations) for the reconstructed hierarchical state-transition histories on held-out trajectories. This is load-bearing for the central claims, because both the failure-onset localization results and the recovery experiment (which replays diagnostic signals derived from the same extracted traces) could be inflated by systematic omissions or mislabelings.

    Authors: We agree that explicit quantitative validation of extractor fidelity on held-out data is necessary to support the central claims. The current manuscript describes the evolving extractors and the stage/step supervision in CodeTraceBench but does not report precision, recall, or human agreement metrics for the full hierarchical state-transition trees. In the revised manuscript we will add a dedicated human evaluation subsection: two annotators will label a held-out set of 150 trajectories, and we will report precision/recall/F1 for state extraction, failure-onset identification, and downstream chain reconstruction, together with inter-annotator agreement (Cohen’s kappa; sketched after these responses). This directly addresses the risk of inflated localization and recovery results. revision: yes

  2. Referee: [Recovery experiment description] In the recovery experiment (abstract and results): it is not stated whether data-exclusion rules, run-matching criteria, or statistical tests for the recovery rates were pre-specified or applied post-hoc. Without these details, it is difficult to determine whether the reported consistent recovery under matched budgets is robust or sensitive to analysis choices.

    Authors: The data-exclusion rules, run-matching criteria, and statistical tests were pre-specified in our experimental protocol before any recovery analyses were performed. We acknowledge that these details were insufficiently documented. In the revised paper we will expand the recovery-experiment section to explicitly list the pre-specified rules (trajectory matching by task ID, framework, and budget; exclusion of runs with missing logs), the exact matching procedure, and the statistical tests (paired McNemar tests with Bonferroni correction; sketched after these responses). We will also report a sensitivity analysis varying the matching threshold to demonstrate robustness. revision: yes
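For reference, minimal textbook forms of the two statistics the simulated authors commit to; these are editorial sketches with invented function names, not the authors' evaluation code:

```python
# Minimal textbook forms of the two pre-specified statistics named above;
# editorial sketches, not the authors' evaluation code.
from collections import Counter
from math import erf, sqrt

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_chance = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    if p_chance == 1.0:
        return 1.0
    return (p_obs - p_chance) / (1.0 - p_chance)

def mcnemar_p(b: int, c: int) -> float:
    """Two-sided McNemar test (normal approximation, continuity-corrected).
    b = runs recovered only with diagnostics, c = only without."""
    if b + c == 0:
        return 1.0
    z = (abs(b - c) - 1) / sqrt(b + c)
    return min(1.0, 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2)))))
```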

Circularity Check

0 steps flagged

No circularity detected in derivation or claims

full rationale

The paper describes an engineering architecture (evolving extractors, hierarchical trace trees, failure localization) evaluated empirically on a constructed benchmark (CodeTraceBench) against direct prompting and lightweight baselines. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. Performance claims rest on experimental comparisons under matched budgets rather than any reduction to inputs by construction. The central assumption about extractor accuracy is an empirical limitation, not a circular definitional step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claims rest on the engineering premise that heterogeneous artifacts can be parsed into faithful state histories and that failure onset can be localized from those histories; no free parameters, standard mathematical axioms, or new physical entities are invoked.

invented entities (2)
  • hierarchical trace tree with persistent memory (no independent evidence)
    purpose: to represent the full state-transition history of an agent run
    Introduced as the core data structure of the tracing architecture.
  • failure onset localization module (no independent evidence)
    purpose: to identify the first point of error and its downstream chain
    New component for pinpointing where the agent went off track.

pith-pipeline@v0.9.0 · 5570 in / 1316 out tokens · 49616 ms · 2026-05-10T15:18:31.143489+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

    cs.SE · 2026-05 · unverdicted · novelty 7.0

    PROBE structures runtime telemetry into diagnoses and evidence-grounded guidance, raising recovery rates by 12.45 points over baselines on 257 unresolved software repair and AIOps cases.

  2. Property-Level Reconstructability of Agent Decisions: An Anchor-Level Pilot Across Vendor SDK Adapter Regimes

    cs.SE · 2026-05 · unverdicted · novelty 6.0

    Pilot study shows agent decision reconstructability varies by vendor SDK regime, with completeness scores from 42.9% to 85.7% and consistent gaps in reasoning traces.
