Specification-Based Code-Text-Code Reengineering for LLM-Mediated Software Evolution

Arsen Dolichnyi; Oleg Grynets; Roman Piznak; Taras Zelenyy; Vasyl Lyashkevych; Volodymyr Morozov

arXiv: 2605.25232 · v1 · pith:MOCLMTNBnew · submitted 2026-05-24 · 💻 cs.SE · cs.AI · cs.LO

Pith reviewed 2026-06-29 23:35 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LO

keywords: Code2Text2Code, specification-based reengineering, LLM-mediated software evolution, neutral textual specification, semantic drift, transformation loss estimation, graph-based verification, multi-language code transformation

The pith

Transforming code into a neutral textual specification before regenerating it produces more controlled LLM-mediated software changes than direct translation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that direct code-to-code transformations with LLMs risk semantic drift, hidden behavioral changes, and loss of traceability. It proposes routing the process through an intermediate neutral text that records behavior, identifiers, flow, conditions, dependencies, and domain intent without carrying over source syntax. Multiple layers of extraction, iterative verification, retrieval grounding, and loss estimation sit between the steps. A sympathetic reader would care because the method reframes an opaque translation task as a traceable reengineering workflow that can incorporate documentation and detect drift. Experiments on a multi-language and SQL dataset, plus graph-based similarity metrics, test whether the added steps deliver measurable control.

Core claim

The central claim is that the Code2Text2Code pipeline functions as a controlled specification-based reengineering process rather than simple transformation. It extracts factual context from ASTs and dependency graphs, generates neutral natural-language specifications, performs iterative verification against source and target, applies retrieval-augmented grounding, and estimates losses through structural preservation, reverse compatibility, interface stability, and total graph similarity. The knowledge layer combines AST metadata, graph structures, neutral specifications, technical documentation, business documentation, and architecture representations to support evolution across languages an

What carries the argument

The neutral textual specification that captures program behavior, identifiers, computational flow, conditions, side effects, data dependencies, and domain-specific intent without transferring source-language syntax.
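
One way to picture such a specification is as a record with one slot per facet named above. The sketch below is an assumption for illustration only: the class `NeutralSpec`, its field names, and the helper `missing_facets` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field, fields


@dataclass
class NeutralSpec:
    """Hypothetical container for the neutral specification of one code unit."""
    behavior: str = ""
    identifiers: list[str] = field(default_factory=list)
    computational_flow: list[str] = field(default_factory=list)
    conditions: list[str] = field(default_factory=list)
    side_effects: list[str] = field(default_factory=list)
    data_dependencies: list[str] = field(default_factory=list)
    domain_intent: str = ""


def missing_facets(spec: NeutralSpec) -> list[str]:
    """Return the names of facets that were left empty."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name)]


spec = NeutralSpec(
    behavior="Sum the amounts of all unpaid invoices for one customer.",
    identifiers=["customer_id", "invoices", "total"],
    computational_flow=["filter by customer", "keep unpaid", "sum amounts"],
    conditions=["invoice.paid is false"],
    data_dependencies=["invoices table"],
    domain_intent="Report outstanding receivables.",
)
print(missing_facets(spec))  # lists the facets the spec never filled in
```

An empty facet is only a prompt for review: a unit may genuinely have no side effects, so a check like this flags candidates rather than proving a loss.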

If this is right

Iterative verification between source code and generated specification can surface and correct semantic drift before target code is produced.
A shared neutral specification enables consistent transformation across multiple programming languages and SQL dialects.
Graph formalization supplies quantitative estimates of losses in structural preservation, interface stability, and overall similarity.
Integration of AST metadata, dependency graphs, and business documentation improves grounding of LLM outputs during regeneration.
Semantic-aware chunking and DSPy prompt tuning become applicable to optimize each stage of the reengineering flow.
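
The iterative verification mentioned in the first point can be sketched as a loop that regenerates the specification until every fact extracted from the source is mentioned. This is a minimal sketch under stated assumptions: `refine_spec`, `uncovered_facts`, and `stub_generate` are hypothetical names, the coverage check is a naive substring match, and `stub_generate` is a deterministic stand-in for an LLM call rather than anything the paper describes.

```python
from typing import Callable


def uncovered_facts(facts: set[str], spec_text: str) -> set[str]:
    """Facts extracted from the source that the spec text never mentions."""
    lowered = spec_text.lower()
    return {fact for fact in facts if fact.lower() not in lowered}


def refine_spec(facts: set[str],
                generate: Callable[[set[str]], str],
                max_rounds: int = 3):
    """Extend the spec until all facts are covered or the rounds run out."""
    spec, missing = "", set(facts)
    for round_no in range(1, max_rounds + 1):
        # Feed the still-missing facts back to the generator (assumed LLM call).
        spec = (spec + " " + generate(missing)).strip()
        missing = uncovered_facts(facts, spec)
        if not missing:
            break
    return spec, round_no, missing


def stub_generate(missing: set[str]) -> str:
    """Stand-in for an LLM: describes at most two requested facts per call."""
    chosen = sorted(missing)[:2]
    return " ".join(f"Uses {name}." for name in chosen)


facts = {"customer_id", "invoices", "total"}
spec, rounds, missing = refine_spec(facts, stub_generate)
print(rounds, missing)
```

A real check would need something stronger than substring matching, since a specification can mention an identifier while misdescribing what it does.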

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The generated specifications could serve as auditable living documentation that remains synchronized with evolving codebases.
The method might extend to large-scale legacy migrations where direct LLM translation currently requires extensive manual review.
Applying the same pipeline to performance-critical code could test whether non-functional properties survive the specification layer.
Teams maintaining microservice boundaries might use the neutral specs to enforce domain stability across language boundaries.

Load-bearing premise

A neutral textual specification can accurately and completely capture all relevant program semantics, behavior, and domain intent without introducing drift or loss that subsequent verification steps cannot detect or correct.

What would settle it

A concrete case in which the final target code passes every verification step yet exhibits observable behavioral differences or missing domain logic compared with the original source.

Original abstract

Direct Code2Code transformation remains challenging to control because it can preserve surface-level syntax while introducing semantic drift, hidden behavioral changes, loss of traceability, non-idiomatic target implementations, or incomplete reconstruction of domain logic. This paper proposes a specification-based Code2Text2Code reengineering framework for LLM-mediated software evolution. The central idea is to transform source code into a neutral textual specification that captures program behavior, identifiers, computational flow, conditions, side effects, data dependencies, and domain-specific intent without directly transferring the source language syntax. The proposed framework combines factual context extraction, Code2Text generation, iterative verification between source code and text specification, Text2Code generation, target code verification, retrieval-augmented grounding, semantic-aware chunking, and transformation loss estimation. The knowledge representation layer integrates metadata derived from ASTs, graph-based dependency structures, neutral natural language specifications, technical documentation, business documentation, and architecture-level representations. The experiments include a Code2Text2Code dataset built from multiple programming languages and SQL dialects, comparison of intermediate representations, retrieval evaluation, documentation transformation evaluation, and prompt tuning using DSPy. A graph formalization using structural preservation, reverse compatibility, interface stability, and total graph similarity is implemented to estimate transformation losses. The results support the interpretation of the Code2Text2Code approach not as a simple code transformation, but as a controlled specification-based reengineering process for LLM-mediated software evolution.
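
The abstract names four graph measures without defining them. The sketch below shows one plausible reading, and every definition in it is an assumption: structural preservation as the share of source dependency edges kept in the target, reverse compatibility as the share of target edges traceable to the source, interface stability as Jaccard overlap of public names, and total graph similarity as their unweighted mean. The function `transformation_scores` and the example graphs are hypothetical.

```python
def ratio(shared: set, total: set) -> float:
    """Fraction of `total` covered by `shared`; an empty total counts as 1.0."""
    return len(shared) / len(total) if total else 1.0


def transformation_scores(src_edges, tgt_edges, src_iface, tgt_iface):
    """Score a transformation from dependency edges and interface names."""
    shared = src_edges & tgt_edges
    scores = {
        # Assumed definition: source edges that survive in the target.
        "structural_preservation": ratio(shared, src_edges),
        # Assumed definition: target edges that map back to the source.
        "reverse_compatibility": ratio(shared, tgt_edges),
        # Assumed definition: Jaccard overlap of exposed names.
        "interface_stability": ratio(src_iface & tgt_iface,
                                     src_iface | tgt_iface),
    }
    # Assumed aggregation: plain mean of the three component scores.
    scores["total_graph_similarity"] = sum(scores.values()) / 3
    return scores


src_edges = {("report", "load"), ("report", "total"), ("total", "round")}
tgt_edges = {("report", "load"), ("report", "total"), ("report", "log")}
scores = transformation_scores(src_edges, tgt_edges, {"report"}, {"report"})
print(scores)
```

In this example the target drops one call edge and adds another, so the two edge-based scores fall while the interface score stays at its maximum; the paper's actual formulas may weight or define these terms differently.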

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper sketches a Code2Text2Code framework with neutral specs and graph loss checks but supplies no metrics or comparisons to support the controlled-reengineering claim.

The central idea is to route code through a neutral textual specification that tries to capture behavior, flow, side effects, and intent separately from syntax, then generate target code from that spec with verification steps in between. The framework adds retrieval-augmented grounding, semantic chunking, DSPy prompt tuning, and a graph formalization that scores structural preservation, reverse compatibility, interface stability, and overall similarity.
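
As a rough illustration of what semantic chunking can mean in practice, the sketch below splits a Python module at top-level definitions instead of at fixed character counts, so each chunk is a complete function or class. It uses the standard `ast` module; the function name `semantic_chunks` and the sample module are hypothetical, and the paper's own chunking strategy is not specified in the text above.

```python
import ast


def semantic_chunks(source: str) -> dict[str, str]:
    """Map each top-level function or class name to its exact source text."""
    tree = ast.parse(source)
    chunks = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the text spanned by the node.
            chunks[node.name] = ast.get_source_segment(source, node)
    return chunks


source = (
    "import math\n"
    "\n"
    "def load(path):\n"
    "    return open(path).read()\n"
    "\n"
    "def total(values):\n"
    "    return math.fsum(values)\n"
)
chunks = semantic_chunks(source)
for name, text in chunks.items():
    print(name, len(text.splitlines()))
```

Chunks cut this way keep a definition intact, but they drop module-level context such as imports, which a retrieval layer would have to supply separately.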

This combination is a reasonable way to organize existing pieces—code summarization, spec-based generation, and loss estimation—into one pipeline aimed at LLM-mediated evolution. The knowledge layer that pulls from ASTs, dependency graphs, docs, and architecture representations shows some care about grounding.

The experiments are described at a high level: a multi-language dataset, retrieval and documentation evaluations, and DSPy tuning. That scope makes sense for the topic.

The problem is that the abstract asserts the results back the controlled-reengineering view without giving any numbers, baselines, or error analysis. The graph-based loss measure is presented as the key control mechanism, yet if it operates only on AST and dependency structures it will miss runtime behavior or domain drift that the spec itself might introduce. That gap directly undercuts the main interpretation.

This is for people already working on LLM code tools who want an architectural sketch. A reader could borrow the verification loop or the loss graph idea, but the paper does not yet deliver evidence that the approach reduces the problems it names.

I would not send it to peer review until the quantitative results and a direct check against the stress-test concern are added.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a specification-based Code2Text2Code reengineering framework for LLM-mediated software evolution. Source code is transformed into a neutral textual specification capturing behavior, identifiers, flow, conditions, side effects, dependencies, and domain intent; the framework then performs iterative verification, Text2Code generation, retrieval-augmented grounding, semantic chunking, and graph-based transformation loss estimation. Experiments are described as including a multi-language/SQL dataset, representation comparisons, retrieval and documentation evaluations, and DSPy prompt tuning, with results claimed to support interpreting the method as controlled reengineering rather than direct transformation.

Significance. If the experimental support and verification claims hold, the work could advance controllable LLM-based code evolution in software engineering by using an intermediate neutral specification to reduce semantic drift and improve traceability of domain logic. The graph formalization for loss estimation (structural preservation, reverse compatibility, interface stability, total similarity) is a potentially useful formal contribution if shown to be robust.

major comments (2)

[Abstract / experiments paragraph] The abstract and experiments description state that 'the results support the interpretation' of Code2Text2Code as controlled reengineering, yet no quantitative metrics, baseline comparisons, error analysis, or specific outcome values are reported. This absence is load-bearing for the central claim.
[Graph formalization for transformation loss estimation] Graph formalization (structural preservation, reverse compatibility, interface stability, total graph similarity): these metrics are derived from ASTs and dependency structures. Such representations cannot encode runtime behavior, side effects, or domain intent asserted to be captured by the neutral textual specification, so the verification steps cannot guarantee detection of drift introduced during Code2Text.
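
The second major comment can be made concrete with a small experiment: two functions whose AST node types match exactly but whose results differ. The sketch uses Python's standard `ast` module and a deliberately coarse comparison that ignores names and literal values; it is an illustration of the referee's concern, not a reproduction of the paper's metrics, and the helper names `node_shape` and `run` are hypothetical.

```python
import ast


def node_shape(source: str) -> list[str]:
    """Sequence of AST node type names, ignoring names and literal values."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def run(source: str, arg):
    """Execute the snippet in a fresh namespace and call its `fee` function."""
    namespace = {}
    exec(source, namespace)
    return namespace["fee"](arg)


original = "def fee(amount):\n    return amount * 2\n"
drifted = "def fee(amount):\n    return amount * 3\n"

same_shape = node_shape(original) == node_shape(drifted)
same_behavior = run(original, 10) == run(drifted, 10)
print(same_shape, same_behavior)
```

A structural score built on node types alone would report a perfect match here even though the computed fee changed, which is why a behavioral check is needed alongside any graph-based estimate.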

minor comments (2)

[Abstract] The abstract is lengthy and contains multiple lists of components; condensing it would improve readability.
[Knowledge representation layer / graph formalization] Notation for the neutral textual specification and the four graph similarity measures should be defined explicitly with symbols or equations rather than descriptive phrases only.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address the two major concerns point by point below, acknowledging where the current manuscript presentation is insufficient and outlining targeted revisions.

Referee: [Abstract / experiments paragraph] The abstract and experiments description state that 'the results support the interpretation' of Code2Text2Code as controlled reengineering, yet no quantitative metrics, baseline comparisons, error analysis, or specific outcome values are reported. This absence is load-bearing for the central claim.

Authors: We agree that the abstract and the high-level experiments paragraph do not report quantitative metrics, baselines, or error analysis, which leaves the central claim under-supported in the summary sections. The full manuscript describes a multi-language/SQL dataset, representation comparisons, retrieval evaluations, documentation transformations, and DSPy prompt tuning, but these results were summarized rather than quantified. In the revised version we will expand both the abstract and the experiments paragraph to include concrete outcome values (e.g., Code2Text accuracy, semantic similarity scores, baseline comparisons against direct Code2Code LLM transformations, and error rates on drift cases).

Revision: yes

Referee: [Graph formalization for transformation loss estimation] Graph formalization (structural preservation, reverse compatibility, interface stability, total graph similarity): these metrics are derived from ASTs and dependency structures. Such representations cannot encode runtime behavior, side effects, or domain intent asserted to be captured by the neutral textual specification, so the verification steps cannot guarantee detection of drift introduced during Code2Text.

Authors: The graph formalization is deliberately limited to structural properties extractable from ASTs and dependency graphs; it does not claim to encode runtime behavior, side effects, or domain intent. Those aspects are addressed by the neutral textual specification and the separate iterative verification steps that align the specification against the source code. We will revise the manuscript to make this scope explicit, stating that graph-based loss estimation quantifies structural fidelity while behavioral and semantic drift detection occurs through the Code2Text verification loop. This clarification removes any implication that the graph alone guarantees full drift detection.

Revision: partial

Circularity Check

0 steps flagged

No circularity: framework proposal supported by described experiments

The paper describes a multi-component Code2Text2Code framework (factual extraction, iterative verification, graph-based loss estimation via structural preservation etc.) and reports supporting experiments on a multi-language dataset, retrieval, documentation transformation, and DSPy prompt tuning. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The central interpretation of the approach as controlled reengineering is presented as following from the experimental outcomes rather than reducing to the framework's own definitions by construction. The derivation chain is therefore self-contained against its stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on assumptions about LLM capabilities for accurate bidirectional translation via neutral text and the utility of the described verification and graph-based loss estimation steps; no free parameters or invented entities with independent evidence are detailed in the abstract.

axioms (1)

Domain assumption: LLMs can reliably produce and consume neutral textual specifications that preserve full program semantics, including side effects and domain intent.
Invoked as the basis for the Code2Text and Text2Code steps in the proposed framework.

invented entities (1)

Neutral textual specification (no independent evidence)
Purpose: intermediate representation that decouples source syntax from target implementation while capturing behavior and intent.
Central new construct introduced to enable the reengineering process; no independent evidence provided.
