Reducing the Costs of Proof Synthesis on Rust Systems by Scaling Up a Seed Training Set
Pith reviewed 2026-05-16 08:08 UTC · model grok-4.3
The pith
VeruSyn synthesizes 6.9 million verified Rust programs to fine-tune an open model that matches commercial LLMs on proof quality at lower cost
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
With VeruSyn, the authors synthesize the largest set of Verus verified programs: 6.9 million Rust programs, each with a formal specification and a proof that it meets that specification. This dataset lets them create a fine-tuned Qwen2.5-Coder-32B-Instruct model with an appealing cost-proof tradeoff compared with state-of-the-art commercial models like Claude Sonnet 4.5, while also significantly outperforming models like o4-mini and previously proposed research models.
What carries the argument
VeruSyn data synthesis pipeline that combines self-synthesis and tutorial-based synthesis with agent trajectory synthesis to produce long-chain-of-thought data at large scale for Verus verified programs.
If this is right
- The fine-tuned model achieves an appealing cost-proof tradeoff compared with state-of-the-art commercial models like Claude Sonnet 4.5.
- It significantly outperforms models like o4-mini and previously proposed research models on Verus tasks.
- The pipeline reaches much larger scale and Verus-feature coverage than previous data-synthesis techniques designed for Verus.
Where Pith is reading between the lines
- The large synthetic dataset could serve as a reusable resource for training additional models or exploring new fine-tuning strategies beyond the 32B model tested.
- Similar synthesis pipelines might be applied to other verification tools to address data scarcity in proof generation for different languages.
- Lower inference costs for proof generation could encourage wider adoption of formal verification in systems software development.
Load-bearing premise
The synthesized proofs are correct, diverse, and free of systematic errors introduced by the LLM generator, and scaling this synthetic data distribution improves generalization on real-world verification tasks.
What would settle it
A controlled evaluation on a held-out collection of real-world Rust verification problems that shows the fine-tuned model produces invalid proofs or fails to improve over the base model without the new dataset.
read the original abstract
Large Language Models (LLMs) are widely used for code generation. However, the correctness of code generated by LLMs remains a concern. A potential remedy to this concern is to have LLMs generate formal correctness proofs along with such code. However, compared with code generation, code-proof generation requires much higher reasoning capability and has much less existing data to learn from. In this paper, we present VeruSyn, a data synthesis pipeline for Verus, a state-of-the-art verification tool for system software written in Rust. Through self-synthesis and tutorial-based synthesis, VeruSyn achieves much larger scale and Verus-feature coverage than previous data-synthesis techniques designed for Verus; VeruSyn also supplements its dataset with long-chain-of-thought (CoT) data through agent trajectory synthesis. With VeruSyn, we synthesize the largest set of Verus verified programs: 6.9 million Rust programs, each with a formal specification and a proof that it meets that specification. This dataset lets us create a fine-tuned Qwen2.5-Coder-32B-Instruct model with appealing cost-proof tradeoff compared with state-of-the-art commercial models like Claude Sonnet 4.5. It also significantly outperforms models like o4-mini and previously proposed research models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VeruSyn, a data synthesis pipeline for Verus that combines self-synthesis, tutorial-based synthesis, and agent trajectory synthesis to produce the largest reported set of 6.9 million Rust programs, each equipped with a formal specification and a verified proof. This dataset is then used to fine-tune Qwen2.5-Coder-32B-Instruct, yielding a model that the authors claim delivers an improved cost-proof tradeoff relative to commercial systems such as Claude Sonnet 4.5 while also outperforming o4-mini and prior research models.
Significance. If the synthesized proofs are verifiably correct and the fine-tuned model generalizes beyond the training distribution, the work supplies a large-scale, publicly usable resource for training LLMs on formal verification tasks in Rust. This could materially lower the cost of proof synthesis for systems software and reduce reliance on proprietary models, representing a concrete empirical advance in the intersection of LLMs and deductive verification.
major comments (1)
- [Methods] Methods / Pipeline description: the claim that the 6.9M programs are free of systematic generator-induced errors rests on the verification success filter; the manuscript should report the precise fraction of candidates that passed verification, the distribution of proof lengths, and any post-filter diversity statistics (e.g., Verus feature coverage) so that readers can assess whether the scale translates into usable training signal.
minor comments (2)
- [Abstract] Abstract: the performance claims would be more persuasive if the abstract included at least one concrete metric (e.g., proof success rate on a held-out benchmark or relative cost reduction) rather than qualitative descriptors such as “appealing” and “significantly outperforms.”
- [Evaluation] Evaluation section: tables comparing the fine-tuned model against baselines should report both proof-generation success rate and token cost per successful proof so that the claimed tradeoff can be directly inspected.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and the constructive suggestion on the methods description. We address the comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Methods] Methods / Pipeline description: the claim that the 6.9M programs are free of systematic generator-induced errors rests on the verification success filter; the manuscript should report the precise fraction of candidates that passed verification, the distribution of proof lengths, and any post-filter diversity statistics (e.g., Verus feature coverage) so that readers can assess whether the scale translates into usable training signal.
Authors: We agree that these statistics would help readers evaluate the quality and diversity of the synthesized dataset. In the revised manuscript we will add a new subsection (or expanded paragraph) in the Methods section that reports (1) the exact fraction of generated candidates that passed the verification filter, (2) the distribution of proof lengths (including median, quartiles, and range), and (3) post-filter diversity metrics such as coverage of key Verus features (e.g., quantifiers, loops, ownership, and ghost code). These data will be presented in a table accompanied by brief discussion. revision: yes
Circularity Check
No significant circularity identified
full rationale
The paper presents an empirical data-synthesis pipeline (self-synthesis, tutorial-based synthesis, agent trajectories) that produces a filtered dataset of 6.9M verified Verus programs, followed by fine-tuning of Qwen2.5-Coder-32B-Instruct and benchmark evaluation against commercial models. No equations, derivations, or fitted parameters are defined in terms of the target performance metrics; all reported gains rest on concrete generation counts, verification filtering, dataset statistics, and external benchmark comparisons that are independently verifiable. No self-citation is load-bearing for the central claims, and the work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
Reinforcement Learning with Negative Tests as Completeness Signal for Formal Specification Synthesis
SpecRL uses the fraction of negative tests rejected by candidate specifications as a reward signal in RL training to produce stronger and more verifiable formal specifications than prior methods.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.