CODESTRUCT: Code Agents over Structured Action Spaces

Dingmin Wang; Joe Hsu; Murali Krishna Ramanathan; Myeongsoo Kim; Shweta Garg; Varun Kumar

arxiv: 2604.05407 · v3 · submitted 2026-04-07 · 💻 cs.AI · cs.SE

Pith reviewed 2026-05-10 19:05 UTC · model grok-4.3

classification 💻 cs.AI cs.SE

keywords: code agents · abstract syntax trees · structured action spaces · LLM code editing · SWE-Bench · syntax-validated edits · token efficiency

The pith

CODESTRUCT reframes codebases as AST-based action spaces so agents read and edit syntactic units instead of brittle text spans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current LLM code agents edit repositories by matching strings in unstructured text, which breaks when formatting shifts or patterns are ambiguous. CODESTRUCT instead supplies two operations: readCode to fetch complete named AST entities and editCode to apply syntax-checked changes to those entities. Tests across six LLMs on SWE-Bench Verified show Pass@1 accuracy rising 1.2-5.0 percent while token use falls 12-38 percent for most models, with the largest lift for agents that previously produced many empty patches. Parallel gains appear on CodeAssistBench. The work claims that structure-aware interfaces give code agents a more dependable operating foundation.
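The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not code from the paper: `ON_DISK`, `AGENT_VIEW_OF_LINE`, `text_edit`, and `find_function` are hypothetical names, and Python's standard `ast` module stands in for whatever parser CODESTRUCT uses.

```python
import ast

# Hypothetical inputs: the agent's remembered line and the file on disk
# differ only in whitespace (formatting drift).
ON_DISK = "def total(xs):\n    return sum( xs )\n"
AGENT_VIEW_OF_LINE = "    return sum(xs)"


def text_edit(source: str, old: str, new: str) -> str:
    """Text-span edit: fails unless `old` occurs exactly once."""
    if source.count(old) != 1:
        raise ValueError("search string not found exactly once")
    return source.replace(old, new)


def find_function(source: str, name: str) -> ast.FunctionDef:
    """Look a function up by name; whitespace in its body is irrelevant."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            return node
    raise KeyError(name)


try:
    text_edit(ON_DISK, AGENT_VIEW_OF_LINE, "    return sum(xs, 0)")
    text_edit_ok = True
except ValueError:
    text_edit_ok = False  # the drifted whitespace breaks the exact match

node = find_function(ON_DISK, "total")  # still resolves by name
```

The string match fails on a one-character spacing difference, while the lookup by name is unaffected.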

Core claim

By modeling the repository as a structured action space of named AST entities rather than raw text, agents gain reliable readCode and editCode primitives that retrieve full syntactic units and perform validated transformations, yielding higher patch success and lower token costs than text-based baselines.

What carries the argument

The structured action space built from readCode (retrieving complete syntactic units) and editCode (syntax-validated edits on named AST entities).
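To make the two primitives concrete, here is a rough sketch of what a read/edit pair over named entities might look like. It assumes Python sources and the standard `ast` module; `read_code`, `edit_code`, and `_find` are hypothetical stand-ins, since this page does not give the paper's actual signatures or parser.

```python
import ast


def _find(tree: ast.AST, name: str) -> ast.AST:
    """Return the first function or class definition called `name`.

    Simplification: ast.walk is breadth-first, so the shallowest match wins,
    and decorator lines above a definition are not included in its span.
    """
    kinds = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    for node in ast.walk(tree):
        if isinstance(node, kinds) and node.name == name:
            return node
    raise KeyError(f"no entity named {name!r}")


def read_code(source: str, name: str) -> str:
    """Return the complete source lines of one named entity."""
    node = _find(ast.parse(source), name)
    lines = source.splitlines(keepends=True)
    return "".join(lines[node.lineno - 1 : node.end_lineno])


def edit_code(source: str, name: str, new_text: str) -> str:
    """Replace one named entity; reject the edit if the result fails to parse."""
    node = _find(ast.parse(source), name)
    lines = source.splitlines(keepends=True)
    if not new_text.endswith("\n"):
        new_text += "\n"
    before = "".join(lines[: node.lineno - 1])
    after = "".join(lines[node.end_lineno :])
    candidate = before + new_text + after
    ast.parse(candidate)  # raises SyntaxError, so a broken edit is never returned
    return candidate
```

In this sketch the caller addresses code by name only, and `edit_code` returns the new text only after it parses; a rejected edit raises and leaves the original string in the caller's hands.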

If this is right

Models that frequently emit invalid or empty patches under text interfaces show the biggest accuracy increases.
Token consumption drops 12-38 percent for most models while accuracy holds or improves.
Consistent accuracy gains of 0.8-4.4 percent and cost cuts up to 33 percent appear on CodeAssistBench.
Structure-aware interfaces reduce dependence on fragile string matching for code modifications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same AST primitives could be exposed directly in IDEs to let agents collaborate with developers on larger refactors.
If AST reliability holds, the approach may scale to multi-language repositories without per-language prompt engineering.
Combining these structured actions with retrieval or planning layers might compound the observed efficiency gains.

Load-bearing premise

Observed gains come from the AST read/edit interface itself rather than prompting differences or benchmark quirks, and AST parsing remains reliable on real codebases.

What would settle it

Re-running the SWE-Bench experiments on a corpus where AST parsers frequently fail or produce incomplete trees would show whether the accuracy and token reductions disappear.
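A first step toward that check is measuring how often parsing fails on a given corpus. The sketch below is illustrative only: `parse_failure_rate` is a hypothetical helper, and it covers Python files through the standard `ast` module rather than the parser used in the paper.

```python
import ast
from pathlib import Path


def parse_failure_rate(root: str) -> float:
    """Fraction of .py files under `root` that ast.parse cannot handle."""
    files = sorted(Path(root).rglob("*.py"))
    if not files:
        return 0.0
    failures = 0
    for path in files:
        try:
            ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, ValueError):
            # ValueError also covers UnicodeDecodeError for undecodable files.
            failures += 1
    return failures / len(files)
```

A corpus with a high rate from such a measurement would be a natural place to test whether the reported accuracy and token gains persist.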

Figures

Figures reproduced from arXiv: 2604.05407 by Dingmin Wang, Joe Hsu, Murali Krishna Ramanathan, Myeongsoo Kim, Shweta Garg, Varun Kumar.

**Figure 1.** Contrasting action spaces for code agents. Text-based agents (left) read ~300 lines to locate a function and regenerate ~44 lines verbatim for removal, making edits brittle to formatting changes. CODESTRUCT (right) reads only the target symbol (~50 lines) and specifies removal in ~2 lines via a symbol-scoped edit.

**Figure 2.** Overview of CODESTRUCT. Code agents interact with repositories through a structured AST action space. Source code is parsed into an AST, exposing addressable nodes. The readCode and editCode tools operate directly on these nodes, enabling structure-aware code navigation and modification without string matching, line numbers, or brittle edits.

**Figure 3.** Comparison of CODESTRUCT (AST-based) vs. text-based code editing approaches. CODESTRUCT completes in 24 steps vs. 54 steps for text-based, a 55.6% reduction.

Original abstract

LLM-based code agents treat repositories as unstructured text, applying edits through brittle string matching that frequently fails due to formatting drift or ambiguous patterns. We propose reframing the codebase as a structured action space where agents operate on named AST entities rather than text spans. Our framework, CODESTRUCT, provides readCode for retrieving complete syntactic units and editCode for applying syntax-validated transformations to semantic program elements. Evaluated on SWE-Bench Verified across six LLMs, CODESTRUCT improves Pass@1 accuracy by 1.2-5.0% while reducing token consumption by 12-38% for most models. Models that frequently fail to produce valid patches under text-based interfaces benefit most: GPT-5-nano improves by 20.8% as empty-patch failures drop from 46.6% to 7.2%. On CodeAssistBench, we observe consistent accuracy gains (+0.8-4.4%) with cost reductions up to 33%. Our results show that structure-aware interfaces offer a more reliable foundation for code agents.
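The headline numbers are easier to interpret as counts. The arithmetic below assumes SWE-Bench Verified contains 500 instances (the benchmark's standard size, not stated on this page) and reads the GPT-5-nano gain as percentage points, following the "20.8pp" wording in the excerpt further down; the variable names are illustrative.

```python
SWE_BENCH_VERIFIED_SIZE = 500  # assumption: standard benchmark size


def rate_to_count(rate_percent: float, total: int = SWE_BENCH_VERIFIED_SIZE) -> int:
    """Convert a percentage of the benchmark into a whole number of instances."""
    return round(rate_percent / 100 * total)


# GPT-5-nano empty-patch failures, from the abstract.
empty_before = rate_to_count(46.6)
empty_after = rate_to_count(7.2)

# Absolute drop in percentage points vs. relative drop in the failure rate.
drop_pp = 46.6 - 7.2
relative_drop = (46.6 - 7.2) / 46.6

# Step counts from the Figure 3 caption: 54 text-based steps vs. 24 AST-based.
step_reduction = (54 - 24) / 54

print(empty_before, empty_after, round(drop_pp, 1),
      round(relative_drop, 3), round(step_reduction * 100, 1))
```

Under the 500-instance assumption, both rates map to whole instance counts, which is consistent with them being reported per instance.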

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CODESTRUCT shows small but consistent gains from AST-based read/edit actions on code agent benchmarks, yet the setup leaves open whether structure or validation drives the results.

The paper reframes code repositories as structured action spaces built from named AST entities, with readCode to fetch syntactic units and editCode to apply validated changes. That shift produces Pass@1 lifts of 1.2-5.0% and token reductions of 12-38% for most models on SWE-Bench Verified across six LLMs, plus similar patterns on CodeAssistBench. Models that often output empty patches under text interfaces benefit most: GPT-5-nano gains 20.8% as its empty-patch failures fall from 46.6% to 7.2%.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes CODESTRUCT, a framework that reframes code repositories as structured action spaces using AST-derived entities. Agents interact via readCode (retrieving complete syntactic units) and editCode (applying syntax-validated edits to named program elements) rather than brittle text-based string matching. Evaluated on SWE-Bench Verified across six LLMs, the approach yields Pass@1 gains of 1.2-5.0% and token reductions of 12-38% for most models, with larger benefits (e.g., +20.8% for GPT-5-nano via reduced empty-patch failures) for models that struggle under text interfaces; consistent but smaller gains are reported on CodeAssistBench.

Significance. If the gains are attributable to the AST-structured interface, the work could establish a more reliable design pattern for code-agent tooling, reducing patch-generation failures and token costs in software-engineering tasks. The multi-LLM, multi-benchmark evaluation provides a reasonable empirical foundation for the claim that structure-aware interfaces improve reliability over unstructured text manipulation.

major comments (2)

[§4.2] §4.2 (Experimental Setup and Baselines): The manuscript does not demonstrate that text-based baselines were implemented with identical prompting templates, tool descriptions, error-handling logic, and syntax-validation steps except for the replacement of string matching by AST operations. This isolation is load-bearing for the central claim that the structured action space itself drives the observed 1.2-5.0% Pass@1 improvements.
[Table 1] Table 1 and failure-mode analysis for GPT-5-nano: The 20.8% gain is largely explained by the drop in empty-patch failures (46.6% to 7.2%), yet no ablation is presented that retains the validation logic of editCode while removing only the AST naming/selection mechanism. Without this control, the contribution of the structured representation versus built-in validation cannot be separated.

minor comments (2)

[Abstract] The abstract and §4.1 refer to 'GPT-5-nano' without clarifying whether this is a public model, a fine-tuned variant, or a placeholder; a footnote or citation would improve reproducibility.
[Figure 3] Figure 3 (token-consumption plots) would benefit from error bars or per-run variance to allow readers to assess the statistical reliability of the 12-38% reductions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the isolation of our core contribution. We address each major comment below and will revise the manuscript to strengthen the experimental controls and analysis.

Referee: [§4.2] §4.2 (Experimental Setup and Baselines): The manuscript does not demonstrate that text-based baselines were implemented with identical prompting templates, tool descriptions, error-handling logic, and syntax-validation steps except for the replacement of string matching by AST operations. This isolation is load-bearing for the central claim that the structured action space itself drives the observed 1.2-5.0% Pass@1 improvements.

Authors: We agree that explicit isolation of the action-space change is essential. In the original implementation, both interfaces shared the same high-level agent loop, prompting templates, and error-handling logic; the sole difference was the definition of read/edit actions (string spans vs. named AST entities), with validation steps adapted to each. To make this fully transparent, we will expand §4.2 with side-by-side tool schemas, identical prompt excerpts, and a statement confirming that only the action primitives were varied. This revision will directly address the concern without altering the reported results. revision: yes
Referee: [Table 1] Table 1 and failure-mode analysis for GPT-5-nano: The 20.8% gain is largely explained by the drop in empty-patch failures (46.6% to 7.2%), yet no ablation is presented that retains the validation logic of editCode while removing only the AST naming/selection mechanism. Without this control, the contribution of the structured representation versus built-in validation cannot be separated.

Authors: We acknowledge that the large gain for GPT-5-nano is driven primarily by fewer empty patches and that our current analysis does not fully disentangle AST-based naming from the validation logic that editCode enables. Because validation is inherently tied to operating on well-typed AST nodes, a clean ablation that keeps validation but strips naming/selection would require a new interface variant. We will add a targeted discussion in the revised failure-mode section noting this entanglement as a limitation and, where feasible, report an auxiliary run that applies post-hoc syntax checks to the text baseline; we view this as a partial but honest response to the request. revision: partial
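The auxiliary run mentioned in the second response, a text baseline with post-hoc syntax checks, could take roughly the following shape. This is a sketch under assumptions: `text_edit_with_syntax_check` is a hypothetical name, the exactly-one-match rule is a common convention for text-edit tools rather than something the paper specifies, and Python's `ast.parse` stands in for the validator.

```python
import ast


def text_edit_with_syntax_check(source: str, old: str, new: str) -> str:
    """String-span replacement that keeps validation but uses no AST addressing."""
    if source.count(old) != 1:
        raise ValueError("`old` must match exactly one span")
    candidate = source.replace(old, new, 1)
    try:
        ast.parse(candidate)  # validation only; the span was located by text
    except SyntaxError as exc:
        raise ValueError(f"edit rejected: result does not parse ({exc.msg})") from exc
    return candidate
```

Comparing such a variant against both the plain text baseline and the AST interface would help separate the effect of validation from the effect of name-based selection, which is the control the referee asks for.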

Circularity Check

0 steps flagged

No circularity: purely empirical interface comparison

The paper advances CODESTRUCT as an empirical framework that replaces text-based string matching with AST-based readCode/editCode actions. All reported results (Pass@1 gains of 1.2-5.0%, token reductions of 12-38%) are direct measurements on public benchmarks (SWE-Bench Verified, CodeAssistBench) across six LLMs. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the abstract or described evaluation. The central claim is therefore an observable performance delta between two tool interfaces, not a derivation that reduces to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical engineering contribution; the abstract introduces no mathematical axioms, fitted free parameters, or new postulated entities.

pith-pipeline@v0.9.0 · 5492 in / 1070 out tokens · 83303 ms · 2026-05-10T19:05:32.129757+00:00 · methodology

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

[1]

In Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pages 1100–1104

Pyggi 2.0: Language independent genetic improvement framework. In Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pages 1100–1104. Gareth Bennett, Tracy Hall, Emily Winter, and Steve Counsell. 2024. Semgrep*: Improving the limited performance of ...

work page 2024
[2]

duplicated text

Abstract syntax networks for code generation and semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1139–1149. Murali Krishna Ramanathan, Lazaro Clapp, Rajkishore Barik, and Manu Sridharan. 2020. Piranha: Reducing feature flag debt at Uber. In 2020 IEEE/ACM 42nd International Confere...

work page arXiv 2020
[3]

""Delete the records in the current QuerySet

Notably, GPT-5-nano’s reduction correlates directly with its 20.8pp accuracy gain, suggesting these were instances where the model had correct intent but could not express valid edits through the text-based interface. In contrast, Qwen3-8B also reduces empty patches (179 to 138) but without corresponding accuracy improvement, indicating that its failures ...

work page
