Intent-aligned Formal Specification Synthesis via Traceable Refinement

Aidan Z.H. Yang; Dawn Song; Huangyuan Su; Samuel Tenka; Soonho Kong; Udaya Ghai; Zhenyu Liao; Zhe Ye; Zhizhen Qin

arxiv: 2604.10392 · v1 · submitted 2026-04-12 · 💻 cs.LG · cs.AI· cs.LO· cs.PL· cs.SE

Intent-aligned Formal Specification Synthesis via Traceable Refinement

Zhe Ye , Aidan Z.H. Yang , Huangyuan Su , Zhenyu Liao , Samuel Tenka , Zhizhen Qin , Udaya Ghai , Dawn Song

show 1 more author

Soonho Kong

This is my paper

Pith reviewed 2026-05-10 16:42 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.LOcs.PLcs.SE

keywords formal specification synthesistraceable refinementLeanlarge language modelsprogram verificationintent alignmenttraining data generationlocalized repair

0 comments

The pith

VeriSpecGen generates intent-aligned Lean specifications by decomposing natural language into atomic requirements and using traceability maps for localized clause repairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VeriSpecGen as a method to automate the creation of formal specifications that match the original intent expressed in natural language. It breaks down requirements, links them to targeted tests through explicit maps, and repairs only the failing clauses when validation errors occur. This produces 86.6 percent success on the VERINA SpecGen benchmark with one leading model, beating prior approaches by as much as 31.8 points. The same process yields 343,000 refinement trajectories that can be turned into training data, raising specification performance by 62 to 106 percent relative and carrying over to broader reasoning tasks. Readers would care because accurate automated specifications make formal verification feasible for more codebases without constant expert intervention.

Core claim

VeriSpecGen is a traceable refinement framework that synthesizes intent-aligned specifications in Lean through requirement-level attribution and localized repair. Natural language is decomposed into atomic requirements, requirement-targeted tests are generated with explicit traceability maps, and validation failures are attributed to specific requirements so that only the responsible clauses are repaired. On the VERINA SpecGen task this reaches 86.6 percent accuracy with Claude Opus 4.5 and creates 343K training trajectories whose use improves specification synthesis by 62-106 percent while transferring gains to general reasoning.

What carries the argument

The traceable refinement framework VeriSpecGen, which decomposes requirements, attaches traceability maps from tests to individual clauses, and performs localized repairs based on attributed failures.

If this is right

Formal specifications can be produced from ordinary English descriptions at higher accuracy than previous direct-generation baselines.
Refinement trajectories collected during synthesis supply training data that measurably lifts later models on the same task.
The performance lift from the trajectories transfers to general reasoning benchmarks outside specification writing.
The same traceable approach improves results across multiple model families and sizes rather than being tied to one architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be inserted into existing LLM code-generation pipelines to produce code that is more readily verifiable.
If traceability maps prove robust, similar localized-repair loops might apply to other formalisms such as Coq or Dafny.
Large-scale collection of refinement trajectories offers a route to continued improvement even when base model capabilities plateau.
The requirement-decomposition step might also serve as a diagnostic tool for clarifying ambiguous natural-language specifications before any code is written.

Load-bearing premise

That requirement-targeted tests together with their traceability maps can detect every specification error and correctly attribute it to the responsible clause without missing interactions between requirements or producing false attributions.

What would settle it

A collection of natural-language requirements for which the generated specification passes all targeted tests yet still violates the original intent because of an undetected cross-requirement interaction.

Figures

Figures reproduced from arXiv: 2604.10392 by Aidan Z.H. Yang, Dawn Song, Huangyuan Su, Samuel Tenka, Soonho Kong, Udaya Ghai, Zhenyu Liao, Zhe Ye, Zhizhen Qin.

**Figure 1.** Figure 1: VERISPECGEN traceable refinement workflow. Given a natural language problem description (e.g., “Find the most frequent element...”), VERISPECGEN synthesizes formal specifications through three stages: (1) Atomic Requirement Decomposition: An LLM decomposes the description into testable atomic requirements (AR1: non-empty input, AR2: frequency constraint, AR3: tie-breaking rule). (2) Requirement-targeted Te… view at source ↗

**Figure 2.** Figure 2: Temperature vs. Spec Pass@1 for different SFT configurations. Lower temperatures consistently yield better performance across all configurations. SFT Spec-Only drops below the base model at higher temperatures for the 30B model, demonstrating that single-task training provides insufficient learning signal and multi-task supervision from trajectory distillation is essential for robust specification synthesi… view at source ↗

**Figure 3.** Figure 3: Code Pass@1 vs. Spec Pass@1 at T = 0.3. Each point represents a training epoch (e1–e5). SFT No-Test achieves the best specification performance while maintaining competitive code generation capability. SFT Spec-Only clusters in the lower-left quadrant, exhibiting degraded performance on both tasks despite being trained specifically for specification generation. performance while maintaining code generation… view at source ↗

**Figure 4.** Figure 4: Spec Pass@1 (%) across epochs and temperatures for SFT No-Test. Performance improves consistently with additional training epochs, with diminishing returns after epoch 3–4. At T = 0.3, the best configuration achieves 30.3% for the 4B model and 47.6% for the 30B model, representing 15.6 and 18.2 points absolute improvement over the respective base models (14.7% and 29.4%). specification refinement transfer … view at source ↗

read the original abstract

Large language models are increasingly used to generate code from natural language, but ensuring correctness remains challenging. Formal verification offers a principled way to obtain such guarantees by proving that a program satisfies a formal specification. However, specifications are frequently missing in real-world codebases, and writing high-quality specifications remains expensive and expertise-intensive. We present VeriSpecGen, a traceable refinement framework that synthesizes intent-aligned specifications in Lean through requirement-level attribution and localized repair. VeriSpecGen decomposes natural language into atomic requirements and generates requirement-targeted tests with explicit traceability maps to validate generated specifications. When validation fails, traceability maps attribute failures to specific requirements, enabling targeted clause-level repairs. VeriSpecGen achieve 86.6% on VERINA SpecGen task using Claude Opus 4.5, improving over baselines by up to 31.8 points across different model families and scales. Beyond inference-time gains, we generate 343K training examples from VeriSpecGen refinement trajectories and demonstrate that training on these trajectories substantially improves specification synthesis by 62-106% relative and transfers gains to general reasoning abilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VeriSpecGen's traceable refinement loop lifts Lean spec accuracy and yields synthetic training data, but the per-requirement tests may overlook interactions between clauses.

read the letter

The main thing to know is that this paper gives a concrete loop for turning natural-language requirements into Lean specs: break them into atomic pieces, attach targeted tests with traceability maps, and repair only the failing clauses. That produces an 86.6% score on VERINA with Claude and 343K refinement trajectories that, when used for fine-tuning, raise synthesis performance by 62-106% and show some transfer to general reasoning tasks. The traceability step is the clearest addition over earlier LLM spec work; it turns validation failures into localized fixes instead of full rewrites. The numbers are reported across model families, which is useful to see. The approach is straightforward enough that groups already doing LLM-assisted verification could try the inference-time version quickly. The soft spot is the validation itself. The method assumes that requirement-targeted tests plus traceability will catch every mismatch with intent and correctly point to the right clause. If interactions between requirements produce behaviors that no single test surfaces, both the headline accuracy and the quality of the 343K examples could be overstated. The abstract gives no baseline code, no statistical tests, and no breakdown of failure modes, so it is hard to judge how often that happens. The paper is aimed at people who want to reduce the cost of writing machine-checkable specs for real codebases. Readers working on formal methods tooling or on LLM post-training for reasoning will find the empirical results worth examining. It has enough of a new loop and measurable gains to deserve a serious referee, even if the validation questions need more attention in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VeriSpecGen, a traceable refinement framework for synthesizing intent-aligned formal specifications in Lean from natural language. It decomposes requirements into atomic clauses, generates requirement-targeted tests with explicit traceability maps, and performs localized clause-level repairs upon validation failure. The paper reports 86.6% accuracy on the VERINA SpecGen task using Claude Opus 4.5 (up to 31.8 points over baselines across model families), generates 343K training examples from refinement trajectories, and shows that fine-tuning on these trajectories yields 62-106% relative improvements in specification synthesis while transferring gains to general reasoning tasks.

Significance. If the central performance and data-quality claims hold under rigorous validation, this work could meaningfully advance automated formal specification synthesis by reducing reliance on expert-written specs and enabling scalable training data generation. The traceable refinement mechanism and the demonstration of transfer to general reasoning represent potentially high-impact contributions to the intersection of LLMs and formal methods.

major comments (2)

[§3.2–3.3] §3.2–3.3 (Requirement-targeted tests and traceability maps): The evaluation and data-generation pipeline rests on the assumption that atomic decomposition plus per-requirement tests and traceability maps detect all specification errors and correctly attribute failures. This assumption is load-bearing for both the 86.6% VERINA accuracy and the quality of the 343K synthetic trajectories; if inter-requirement interactions or emergent behaviors are missed, the reported correctness and downstream training gains would be overstated.
[§5] §5 (Experimental results on VERINA and training): The headline accuracy (86.6%) and relative gains (62-106%) are presented without baseline implementation details, statistical significance tests, dataset split information, or failure-mode analysis. These omissions make it impossible to assess whether the improvements are robust or reproducible, directly affecting the strength of the primary claims.

minor comments (2)

[Abstract] Abstract: The model reference 'Claude Opus 4.5' is non-standard; specify the exact version and provider configuration used for all reported results.
[Throughout] Throughout: Tables and figures reporting accuracy and relative improvements should include explicit captions or footnotes defining all baselines, metrics, and confidence intervals.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of our evaluation methodology and experimental reporting. We respond to each major comment below, providing clarifications and indicating where revisions will strengthen the paper.

read point-by-point responses

Referee: [§3.2–3.3] §3.2–3.3 (Requirement-targeted tests and traceability maps): The evaluation and data-generation pipeline rests on the assumption that atomic decomposition plus per-requirement tests and traceability maps detect all specification errors and correctly attribute failures. This assumption is load-bearing for both the 86.6% VERINA accuracy and the quality of the 343K synthetic trajectories; if inter-requirement interactions or emergent behaviors are missed, the reported correctness and downstream training gains would be overstated.

Authors: We appreciate the referee's identification of this core assumption. VeriSpecGen's design explicitly targets this issue through atomic decomposition and requirement-specific tests linked by traceability maps, which enable precise failure attribution and localized repairs rather than holistic re-generation. In the VERINA SpecGen task, requirements are predominantly independent, as reflected in the consistent performance across model scales. Nevertheless, we acknowledge that undetected inter-requirement interactions remain a potential limitation. In the revision we will add a dedicated limitations subsection in §3.3 discussing this assumption, including quantitative analysis of requirement interdependence in the VERINA dataset and examples of cases where interactions were successfully or unsuccessfully isolated. revision: partial
Referee: [§5] §5 (Experimental results on VERINA and training): The headline accuracy (86.6%) and relative gains (62-106%) are presented without baseline implementation details, statistical significance tests, dataset split information, or failure-mode analysis. These omissions make it impossible to assess whether the improvements are robust or reproducible, directly affecting the strength of the primary claims.

Authors: We agree that these omissions reduce the ability to fully evaluate robustness and reproducibility. In the revised manuscript we will expand §5 with: complete baseline implementation details and hyperparameters; explicit train/validation/test splits for both VERINA and the 343K trajectories; statistical significance testing (e.g., bootstrap confidence intervals and McNemar's test for paired accuracy differences); and a categorized failure-mode analysis of the remaining errors. These additions will directly address the concerns and allow readers to assess the reliability of the reported gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents VeriSpecGen as a traceable refinement framework that decomposes requirements, generates targeted tests with traceability maps, performs localized repairs on failure, and reports 86.6% accuracy on the held-out VERINA SpecGen benchmark plus relative gains from training on 343K trajectories generated by the same process. These steps constitute a methodological pipeline whose correctness is measured against external benchmarks and transfer tasks rather than reducing to self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The training-data generation is an augmentation technique whose downstream improvements are validated separately; no equations, uniqueness theorems, or ansatzes are shown to collapse back to the inputs by construction. The assumption that per-requirement tests suffice is a validity concern, not a circularity in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach assumes LLMs can reliably decompose requirements and generate useful tests; it introduces traceability maps as a new mechanism whose correctness is validated only internally by the refinement loop itself.

axioms (1)

domain assumption Requirement decomposition into independent atomic clauses preserves intent and enables localized repair.
Invoked in the description of how natural language is broken down and how failures are attributed.

invented entities (1)

traceability maps no independent evidence
purpose: Link generated tests to specific requirements so failures can be attributed and repaired at clause level.
New data structure introduced by the framework; no external independent evidence provided in the abstract.

pith-pipeline@v0.9.0 · 5523 in / 1270 out tokens · 56473 ms · 2026-05-10T16:42:59.512974+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages

[1]

The Lean Language Reference

URL https://lean-lang.org/doc/ reference/latest/The--grind--tactic/ . The Lean Language Reference. Leino, K. R. M. Dafny: An automatic program verifier for functional correctness. InInternational Conference on Logic for Programming Artificial Intelligence and Reasoning (LPAR), 2010. Li, K. Verified taco problems.https://huggingface. co/datasets/likaixin/T...

work page 2010
[2]

arXiv preprint arXiv:2407.10040 , year=

URL https://huggingface.co/ datasets/likaixin/TACO-verified. Lin, H., Sun, Z., Yang, Y ., and Welleck, S. Lean-STaR: Learning to interleave thinking and proving.arXiv preprint arXiv:2407.10040, 2024. Ma, L., Liu, S., Li, Y ., Xie, X., and Bu, L. SpecGen: Automated generation of formal program specifications via large language models. InInternational Confe...

work page arXiv 2024
[3]

‘task_description‘ (str): The specification task

work page
[4]

‘task_template‘ (str): Lean 4 code snippet with placeholders

work page
[5]

‘precond_desc‘ (str): Natural language precondition description

work page
[6]

‘postcond_desc‘ (str): Natural language postcondition description Your output fields are:

work page
[7]

‘imports‘ (str): Required imports (optional)

work page
[8]

‘precond_aux‘ (str): Auxiliary precondition definitions

work page
[9]

‘precond‘ (str): Generated precondition code

work page
[10]

‘postcond_aux‘ (str): Auxiliary postcondition definitions

work page
[11]

‘postcond‘ (str): Generated postcondition code All interactions will be structured in the following way, with the appropriate values filled in. [[ ## task_description ## ]] {task_description} [[ ## task_template ## ]] {task_template} [[ ## precond_desc ## ]] {precond_desc} [[ ## postcond_desc ## ]] {postcond_desc} [[ ## imports ## ]] {imports} [[ ## preco...

work page

[1] [1]

The Lean Language Reference

URL https://lean-lang.org/doc/ reference/latest/The--grind--tactic/ . The Lean Language Reference. Leino, K. R. M. Dafny: An automatic program verifier for functional correctness. InInternational Conference on Logic for Programming Artificial Intelligence and Reasoning (LPAR), 2010. Li, K. Verified taco problems.https://huggingface. co/datasets/likaixin/T...

work page 2010

[2] [2]

arXiv preprint arXiv:2407.10040 , year=

URL https://huggingface.co/ datasets/likaixin/TACO-verified. Lin, H., Sun, Z., Yang, Y ., and Welleck, S. Lean-STaR: Learning to interleave thinking and proving.arXiv preprint arXiv:2407.10040, 2024. Ma, L., Liu, S., Li, Y ., Xie, X., and Bu, L. SpecGen: Automated generation of formal program specifications via large language models. InInternational Confe...

work page arXiv 2024

[3] [3]

‘task_description‘ (str): The specification task

work page

[4] [4]

‘task_template‘ (str): Lean 4 code snippet with placeholders

work page

[5] [5]

‘precond_desc‘ (str): Natural language precondition description

work page

[6] [6]

‘postcond_desc‘ (str): Natural language postcondition description Your output fields are:

work page

[7] [7]

‘imports‘ (str): Required imports (optional)

work page

[8] [8]

‘precond_aux‘ (str): Auxiliary precondition definitions

work page

[9] [9]

‘precond‘ (str): Generated precondition code

work page

[10] [10]

‘postcond_aux‘ (str): Auxiliary postcondition definitions

work page

[11] [11]

‘postcond‘ (str): Generated postcondition code All interactions will be structured in the following way, with the appropriate values filled in. [[ ## task_description ## ]] {task_description} [[ ## task_template ## ]] {task_template} [[ ## precond_desc ## ]] {precond_desc} [[ ## postcond_desc ## ]] {postcond_desc} [[ ## imports ## ]] {imports} [[ ## preco...

work page