arxiv: 2602.04910 · v3 · submitted 2026-02-04 · 💻 cs.SE

Reducing the Costs of Proof Synthesis on Rust Systems by Scaling Up a Seed Training Set

Nongyu Di , Tianyu Chen , Shan Lu , Shuai Lu , Yeyun Gong , Peng Cheng , Jacob R. Lorch , Yuan Yao

show 1 more author

Xiaoxing Ma

This is my paper

Pith reviewed 2026-05-16 08:08 UTC · model grok-4.3

classification 💻 cs.SE

keywords VerusRustformal verificationproof synthesisLLM fine-tuningdata synthesiscode generationchain-of-thought

0 comments p. Extension

The pith

VeruSyn synthesizes 6.9 million verified Rust programs to fine-tune an open model that matches commercial LLMs on proof quality at lower cost

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VeruSyn, a synthesis pipeline that scales up verified Rust programs for the Verus verifier through self-synthesis, tutorial-based synthesis, and agent trajectory synthesis that produces long chain-of-thought data. This produces the largest known collection of 6.9 million programs each carrying both a formal specification and a machine-checked proof. The resulting dataset supports fine-tuning Qwen2.5-Coder-32B-Instruct so that the model delivers an appealing cost versus proof tradeoff compared with commercial systems such as Claude Sonnet 4.5 while also outperforming o4-mini and earlier research models. A sympathetic reader would care because formal verification of systems code has long been limited by scarce training data, and scaling synthetic verified examples could make proof-carrying code generation practical at lower expense.

Core claim

With VeruSyn, the authors synthesize the largest set of Verus verified programs: 6.9 million Rust programs, each with a formal specification and a proof that it meets that specification. This dataset lets them create a fine-tuned Qwen2.5-Coder-32B-Instruct model with an appealing cost-proof tradeoff compared with state-of-the-art commercial models like Claude Sonnet 4.5, while also significantly outperforming models like o4-mini and previously proposed research models.

What carries the argument

VeruSyn data synthesis pipeline that combines self-synthesis and tutorial-based synthesis with agent trajectory synthesis to produce long-chain-of-thought data at large scale for Verus verified programs.

If this is right

The fine-tuned model achieves an appealing cost-proof tradeoff compared with state-of-the-art commercial models like Claude Sonnet 4.5.
It significantly outperforms models like o4-mini and previously proposed research models on Verus tasks.
The pipeline reaches much larger scale and Verus-feature coverage than previous data-synthesis techniques designed for Verus.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The large synthetic dataset could serve as a reusable resource for training additional models or exploring new fine-tuning strategies beyond the 32B model tested.
Similar synthesis pipelines might be applied to other verification tools to address data scarcity in proof generation for different languages.
Lower inference costs for proof generation could encourage wider adoption of formal verification in systems software development.

Load-bearing premise

The synthesized proofs are correct, diverse, and free of systematic errors introduced by the LLM generator, and scaling this synthetic data distribution improves generalization on real-world verification tasks.

What would settle it

A controlled evaluation on a held-out collection of real-world Rust verification problems that shows the fine-tuned model produces invalid proofs or fails to improve over the base model without the new dataset.

read the original abstract

Large Language Models (LLMs) are widely used for code generation. However, the correctness of code generated by LLMs remains a concern. A potential remedy to this concern is to have LLMs generate formal correctness proofs along with such code. However, compared with code generation, code-proof generation requires much higher reasoning capability and has much less existing data to learn from. In this paper, we present VeruSyn, a data synthesis pipeline for Verus, a state-of-the-art verification tool for system software written in Rust. Through self-synthesis and tutorial-based synthesis, VeruSyn achieves much larger scale and Verus-feature coverage than previous data-synthesis techniques designed for Verus; VeruSyn also supplements its dataset with long-chain-of-thought (CoT) data through agent trajectory synthesis. With VeruSyn, we synthesize the largest set of Verus verified programs: 6.9 million Rust programs, each with a formal specification and a proof that it meets that specification. This dataset lets us create a fine-tuned Qwen2.5-Coder-32B-Instruct model with appealing cost-proof tradeoff compared with state-of-the-art commercial models like Claude Sonnet 4.5. It also significantly outperforms models like o4-mini and previously proposed research models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces VeruSyn, a data synthesis pipeline for Verus that combines self-synthesis, tutorial-based synthesis, and agent trajectory synthesis to produce the largest reported set of 6.9 million Rust programs, each equipped with a formal specification and a verified proof. This dataset is then used to fine-tune Qwen2.5-Coder-32B-Instruct, yielding a model that the authors claim delivers an improved cost-proof tradeoff relative to commercial systems such as Claude Sonnet 4.5 while also outperforming o4-mini and prior research models.

Significance. If the synthesized proofs are verifiably correct and the fine-tuned model generalizes beyond the training distribution, the work supplies a large-scale, publicly usable resource for training LLMs on formal verification tasks in Rust. This could materially lower the cost of proof synthesis for systems software and reduce reliance on proprietary models, representing a concrete empirical advance in the intersection of LLMs and deductive verification.

major comments (1)

[Methods] Methods / Pipeline description: the claim that the 6.9M programs are free of systematic generator-induced errors rests on the verification success filter; the manuscript should report the precise fraction of candidates that passed verification, the distribution of proof lengths, and any post-filter diversity statistics (e.g., Verus feature coverage) so that readers can assess whether the scale translates into usable training signal.

minor comments (2)

[Abstract] Abstract: the performance claims would be more persuasive if the abstract included at least one concrete metric (e.g., proof success rate on a held-out benchmark or relative cost reduction) rather than qualitative descriptors such as “appealing” and “significantly outperforms.”
[Evaluation] Evaluation section: tables comparing the fine-tuned model against baselines should report both proof-generation success rate and token cost per successful proof so that the claimed tradeoff can be directly inspected.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the constructive suggestion on the methods description. We address the comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Methods] Methods / Pipeline description: the claim that the 6.9M programs are free of systematic generator-induced errors rests on the verification success filter; the manuscript should report the precise fraction of candidates that passed verification, the distribution of proof lengths, and any post-filter diversity statistics (e.g., Verus feature coverage) so that readers can assess whether the scale translates into usable training signal.

Authors: We agree that these statistics would help readers evaluate the quality and diversity of the synthesized dataset. In the revised manuscript we will add a new subsection (or expanded paragraph) in the Methods section that reports (1) the exact fraction of generated candidates that passed the verification filter, (2) the distribution of proof lengths (including median, quartiles, and range), and (3) post-filter diversity metrics such as coverage of key Verus features (e.g., quantifiers, loops, ownership, and ghost code). These data will be presented in a table accompanied by brief discussion. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper presents an empirical data-synthesis pipeline (self-synthesis, tutorial-based synthesis, agent trajectories) that produces a filtered dataset of 6.9M verified Verus programs, followed by fine-tuning of Qwen2.5-Coder-32B-Instruct and benchmark evaluation against commercial models. No equations, derivations, or fitted parameters are defined in terms of the target performance metrics; all reported gains rest on concrete generation counts, verification filtering, dataset statistics, and external benchmark comparisons that are independently verifiable. No self-citation is load-bearing for the central claims, and the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach assumes that LLM-generated proofs can be mechanically verified by Verus without introducing undetectable errors and that the synthetic distribution is sufficiently representative for downstream fine-tuning gains; these are standard ML assumptions rather than paper-specific inventions.

pith-pipeline@v0.9.0 · 5556 in / 1133 out tokens · 40170 ms · 2026-05-16T08:08:40.036121+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reinforcement Learning with Negative Tests as Completeness Signal for Formal Specification Synthesis
cs.SE 2026-04 unverdicted novelty 5.0

SpecRL uses the fraction of negative tests rejected by candidate specifications as a reward signal in RL training to produce stronger and more verifiable formal specifications than prior methods.