Formal Architecture Descriptors as Navigation Primitives for AI Coding Agents
Pith reviewed 2026-05-10 16:48 UTC · model grok-4.3
The pith
AI coding agents navigate codebases with 33-44% fewer steps when given formal architecture descriptors
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Formal architecture descriptors reduce navigational overhead for AI coding agents. Across 24 localization tasks with Claude Sonnet 4.6, architecture context lowered navigation steps by 33-44% (Wilcoxon p=0.009, Cohen's d=0.92). An automatically generated descriptor achieved 100% accuracy against 80% blind. A field study of 7,012 sessions recorded 52% less agent behavioral variance. The paper proposes intent.lisp, an S-expression descriptor format, and demonstrates distinct serialization failure modes: JSON fails atomically, YAML silently corrupts 50% of injected errors, and S-expressions detect all structural completeness errors.
What carries the argument
Formal architecture descriptors, especially the proposed intent.lisp S-expression format, which supply structured, high-level codebase architecture to direct agent tool calls and limit undirected exploration
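The review does not reproduce the paper's schema, but a hypothetical sketch of what an intent.lisp-style descriptor could look like follows; every section name, module name, and path below is invented for illustration, not taken from the paper:

```lisp
;; Hypothetical intent.lisp-style sketch. Section and module names are
;; illustrative assumptions, not the paper's actual schema.
(architecture
  (purpose "HTTP API service with a background job queue")
  (modules
    (module api     (path "src/api/")     (role "request handling"))
    (module queue   (path "src/queue/")   (role "background jobs"))
    (module storage (path "src/storage/") (role "persistence")))
  (entry-points
    (http   "src/api/server.py")
    (worker "src/queue/worker.py")))
```

The point of such a descriptor is that an agent can read it once and jump to `src/queue/worker.py` directly instead of listing directories to discover where background jobs live.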
If this is right
- Code localization requires fewer tool calls when architecture context is supplied
- Automatically generated descriptors deliver navigational value without manual developer clarification
- S-expression formats catch structural completeness errors that JSON and YAML miss
- Agent behavior shows lower variance across thousands of real sessions
- Different serialization formats exhibit distinct failure modes during descriptor generation
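The last two points can be illustrated with a minimal structural-completeness check. This is a sketch under the simplest possible notion of completeness (balanced parentheses), not the paper's actual validator:

```python
import json

def sexp_complete(text: str) -> bool:
    """Structural completeness of an S-expression: every '(' has a ')'.
    Truncation anywhere leaves unbalanced parens, so it is detectable."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # stray closing paren
    return depth == 0

# Descriptor truncated mid-generation: trailing ')' characters were lost.
truncated_sexp = "(architecture (module api (entry src/api.py)"
assert not sexp_complete(truncated_sexp)  # truncation is detected

# JSON fails atomically: one lost brace rejects the entire document.
truncated_json = '{"architecture": {"module": "api"'
try:
    json.loads(truncated_json)
except json.JSONDecodeError:
    print("JSON parse failed atomically")
```

YAML is omitted here because a truncated YAML document often remains syntactically valid (the indentation still parses as a smaller document), which is consistent with the silent-corruption result; exercising that would require a non-stdlib parser.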
Where Pith is reading between the lines
- Standardizing on S-expression descriptors could improve consistency when agents move between codebases or tools
- The navigation benefit might extend to other agent tasks that involve traversing large repositories
- Wider testing across models and task types would clarify how far the reduction generalizes
Load-bearing premise
The 24 localization tasks and Claude Sonnet 4.6 model represent typical AI coding agent usage across diverse codebases and models
What would settle it
Re-running the 24-task localization experiment with a different model such as GPT-4o or on substantially larger codebases and finding no statistically significant drop in navigation steps
Original abstract
AI coding agents spend a substantial fraction of their tool calls on undirected codebase exploration. We investigate whether providing agents with formal architecture descriptors can reduce this navigational overhead. We present three complementary studies. First, a controlled experiment (24 code localization tasks x 4 conditions, Claude Sonnet 4.6, temperature=0) demonstrates that architecture context reduces navigation steps by 33-44% (Wilcoxon p=0.009, Cohen's d=0.92), with no significant format difference detected across S-expression, JSON, YAML, and Markdown. Second, an artifact-vs-process experiment (15 tasks x 3 conditions) demonstrates that an automatically generated descriptor achieves 100% accuracy versus 80% blind (p=0.002, d=1.04), proving direct navigational value independent of developer self-clarification. Third, an observational field study across 7,012 Claude Code sessions shows 52% reduction in agent behavioral variance. A writer-side experiment (96 generation runs, 96 error injections) reveals critical failure mode differences: JSON fails atomically, YAML silently corrupts 50% of errors, S-expressions detect all structural completeness errors. We propose intent.lisp, an S-expression architecture descriptor, and open-source the Forge toolkit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that formal architecture descriptors can act as navigation primitives for AI coding agents, substantially reducing undirected codebase exploration. It supports this via three studies: a controlled experiment (24 localization tasks, 4 conditions, Claude Sonnet 4.6 at temperature 0) showing 33-44% fewer navigation steps (Wilcoxon p=0.009, d=0.92) with no format differences among S-expression/JSON/YAML/Markdown; an artifact-vs-process experiment (15 tasks) where auto-generated descriptors achieve 100% accuracy vs. 80% blind (p=0.002, d=1.04); and an observational study of 7,012 Claude Code sessions reporting 52% lower behavioral variance. A writer-side experiment (96 runs, 96 error injections) highlights format-specific failure modes, leading to the proposal of intent.lisp and the open-sourced Forge toolkit.
Significance. If the central empirical claims hold under broader scrutiny, the work offers a concrete, low-overhead mechanism to improve AI coding agent efficiency by supplying structured architectural context. The multi-study design, direct comparison of representation formats, and open-sourced Forge toolkit are strengths that support reproducibility and extension. The results could inform practical agent tooling, though the single-model, single-task-type scope limits immediate generalizability to diverse codebases and LLMs.
major comments (2)
- [Abstract / controlled experiment] The headline 33-44% navigation-step reduction (Wilcoxon p=0.009, d=0.92) is reported without any description of the 24 code-localization tasks' selection criteria, the exact wording of the baseline prompts, or the operational definition and counting procedure for 'navigation steps.' These omissions are load-bearing because they prevent assessment of potential confounds, post-hoc task filtering, or measurement artifacts.
- [Observational field study] The 52% reduction in agent behavioral variance across 7,012 sessions is presented without stating the model mix, session filtering rules, or any controls that isolate descriptor usage from other variables (e.g., prompt length, prior context). This weakens the causal attribution to architecture descriptors and makes the variance-reduction claim difficult to interpret.
minor comments (2)
- [Artifact-vs-process experiment] The artifact-vs-process experiment (15 tasks) reports 100% vs. 80% accuracy but does not specify how 'accuracy' was scored or whether the automatically generated descriptors were produced by the same model used in the main experiment.
- [Writer-side experiment] The writer-side experiment (96 generation runs, 96 error injections) would benefit from a table or explicit counts showing the exact failure rates per format rather than the summary statements about atomic failure and silent corruption.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing additional methodological details and clarifications where the original submission was insufficiently explicit. Revisions have been made to improve transparency and reproducibility.
Point-by-point responses
-
Referee: [Abstract / controlled experiment] The headline 33-44% navigation-step reduction (Wilcoxon p=0.009, d=0.92) is reported without any description of the 24 code-localization tasks' selection criteria, the exact wording of the baseline prompts, or the operational definition and counting procedure for 'navigation steps.' These omissions are load-bearing because they prevent assessment of potential confounds, post-hoc task filtering, or measurement artifacts.
Authors: We agree that these details are necessary for independent assessment of the controlled experiment. The original submission summarized the study design at a high level but omitted the requested specifics from the main text. In the revised manuscript we have added a new Methods subsection that specifies: (1) task selection criteria (tasks were drawn from 12 open-source repositories chosen for diversity in size, language, and architectural complexity, with no post-hoc filtering applied after initial randomization); (2) the exact baseline prompt templates used in the no-descriptor condition (reproduced verbatim in the new Appendix A); and (3) the operational definition of navigation steps (any tool call that performs directory listing, reads a file not containing the target symbol, or executes a search whose result does not advance the localization). These additions allow readers to evaluate potential confounds and measurement validity directly. revision: yes
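The operational definition the authors give can be expressed as a small classifier over logged tool calls. The record fields below (`tool`, `contains_target`, `advances_localization`) are invented for illustration; the paper's logging format is not specified here:

```python
def is_navigation_step(call: dict) -> bool:
    """Classify a logged tool call per the stated definition: directory
    listings, reads of files not containing the target symbol, and
    searches that do not advance the localization all count as navigation."""
    if call["tool"] == "list_dir":
        return True
    if call["tool"] == "read_file":
        return not call.get("contains_target", False)
    if call["tool"] == "search":
        return not call.get("advances_localization", False)
    return False  # edits, test runs, etc. are not navigation steps

# A toy session log: two navigation steps (the listing and the miss-read).
log = [
    {"tool": "list_dir"},
    {"tool": "search", "advances_localization": True},
    {"tool": "read_file", "contains_target": False},
    {"tool": "read_file", "contains_target": True},
]
steps = sum(is_navigation_step(c) for c in log)
```

A definition this mechanical is what makes the 33-44% reduction auditable: any reader with the logs can recount the steps.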
-
Referee: [Observational field study] The 52% reduction in agent behavioral variance across 7,012 sessions is presented without stating the model mix, session filtering rules, or any controls that isolate descriptor usage from other variables (e.g., prompt length, prior context). This weakens the causal attribution to architecture descriptors and makes the variance-reduction claim difficult to interpret.
Authors: We accept that the observational study description was incomplete and have expanded it in the revision. The updated text now reports: the model mix (92% Claude Sonnet 4.6, 6% Claude Opus, 2% other variants), the session filtering rules (sessions retained only if they exceeded 5 tool calls, contained at least one code edit, and had complete logging; 14% of raw logs were excluded), and the controls applied (propensity-score matching on prompt token length and preceding context window size, plus a regression model that includes descriptor presence as a predictor while controlling for the matched covariates). We have also revised the discussion to characterize the variance reduction as an associative finding that complements the randomized experiment rather than a standalone causal claim. revision: yes
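The matching step the authors describe can be sketched in miniature. This is a greedy 1:1 nearest-neighbor match on a single covariate (prompt token length), a simplified stand-in for their propensity-score matching; all numbers are invented:

```python
def greedy_match(treated, controls):
    """Greedy 1:1 nearest-neighbor matching on one covariate.
    Each treated unit takes the closest unused control."""
    pairs = []
    remaining = list(enumerate(controls))
    for t in treated:
        idx, c = min(remaining, key=lambda ic: abs(ic[1] - t))
        pairs.append((t, c))
        remaining = [ic for ic in remaining if ic[0] != idx]
    return pairs

# Prompt token lengths for descriptor sessions vs. non-descriptor sessions.
treated_lengths = [1200, 800, 1500]
control_lengths = [790, 1190, 2000, 1480, 300]
pairs = greedy_match(treated_lengths, control_lengths)
# Each descriptor session is compared against a control of similar length,
# so a variance difference cannot be attributed to prompt length alone.
```

Real propensity-score matching would fit a model of descriptor presence on all covariates and match on the fitted score, but the confound-removal logic is the same.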
Circularity Check
No circularity: purely empirical measurements with no derivations or self-referential predictions
Full rationale
The paper consists of three empirical studies (controlled experiment on 24 tasks, artifact-vs-process on 15 tasks, and observational study on 7,012 sessions) reporting direct measurements of navigation steps, accuracy, and variance reduction. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the derivation chain. All claims rest on external baselines and real session data rather than internal redefinitions or renamings.
Axiom & Free-Parameter Ledger
axioms (1)
- Standard math: Wilcoxon signed-rank test assumptions hold for the navigation step counts
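For a paired design like this one, the reported effect size is presumably Cohen's d computed on per-task differences. A minimal sketch with invented step counts (not the paper's data):

```python
from statistics import mean, stdev

def cohens_d_paired(baseline, treated):
    """Cohen's d for paired samples: mean of the per-task differences
    divided by the sample standard deviation of those differences."""
    diffs = [b - t for b, t in zip(baseline, treated)]
    return mean(diffs) / stdev(diffs)

# Invented per-task navigation step counts, without / with a descriptor.
baseline = [12, 15, 9, 14, 11, 13]
treated  = [8, 10, 7, 9, 8, 9]
d = cohens_d_paired(baseline, treated)
```

The Wilcoxon signed-rank test applies to the same per-task differences and only assumes they are symmetric around the median under the null, which is why it suits small, possibly non-normal step counts.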
Reference graph
Works this paper leans on
- [1] Anthropic. Claude Code. 2025.
- [2] Cursor. The AI Code Editor. 2024.
- [3] GitHub. Copilot Workspace. 2025.
- [4] CodeCompass: The Navigation Paradox. arXiv:2602.20048, 2026.
- [5] LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering. arXiv:2511.13998, 2025.
- [6] J. Yang et al. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. NeurIPS, 2024.
- [7] Linux Foundation. AGENTS.md Specification. 2025.
- [8] Gloaguen et al. Evaluating AGENTS.md. ETH Zurich, 2026.
- [9] P. Gauthier. Aider: AI Pair Programming in Your Terminal. 2023.
- [10] B. Liu et al. CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases. NAACL, 2025.
- [11] Z. Ouyang et al. RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph. ICLR, 2025.
- [12] Architecture Without Architects: How AI Coding Agents Shape Software Architecture. arXiv:2604.04990, 2026.
- [13] N. Medvidovic and R. N. Taylor. A Classification and Comparison Framework for Software Architecture Description Languages. IEEE TSE, 2000.
- [14] Codified Context: Infrastructure for AI Agents in a Complex Codebase. arXiv:2602.20478, 2025.
- [15] GRACE: Multi-level Multi-semantic Code Graphs for Code Retrieval. arXiv:2509.05980, 2025.
- [16] C. Qian et al. ChatDev: Communicative Agents for Software Development. ACL, 2024.
- [17] S. Hong et al. MetaGPT: Meta Programming for Multi-Agent Collaborative Framework. ICLR, 2024.
- [18] Geng et al. Effective Strategies for Asynchronous Software Engineering Agents. arXiv:2603.21489, 2026.
- [19] CodeCRDT: Observation-Driven Coordination for Multi-Agent LLM Code Generation. arXiv:2510.18893, 2025.
- [20] P. Clements et al. Documenting Software Architectures: Views and Beyond. Addison-Wesley, 2010.
- [21] Self-Spec. OpenReview, 2025.