WybeCoder: Verified Imperative Code Generation
Pith reviewed 2026-05-14 00:29 UTC · model grok-4.3
The pith
WybeCoder generates verified imperative code by co-evolving programs, invariants, and proofs in an agentic Lean framework.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents WybeCoder, which combines automatic verification condition generation and SMT solving with interactive proofs in Lean to support prove-as-you-generate development of imperative code. On the translated Verina and Clever benchmarks the best system reaches 74 percent and 62 percent success rates respectively, overcoming earlier performance plateaus by synthesizing dozens of invariants and dispatching dozens of subgoals per task.
What carries the argument
prove-as-you-generate development, in which code, invariants, and proofs co-evolve using LLM agents backed by Lean and SMT solvers.
If this is right
- Scaling the approach yields consistent gains on complex algorithms such as Heapsort.
- Dozens of valid invariants are found and dozens of subgoals are dispatched per successful task.
- Hundreds of lines of verified imperative code are produced.
- Previous reported performance plateaus are overcome.
- Large-scale datasets of verified imperative code can be constructed automatically.
Where Pith is reading between the lines
- Similar agentic loops could be applied to verification in other theorem provers or languages beyond Lean.
- Success on translated benchmarks indicates the method may work on native imperative verification problems without translation.
- Automatically generated verified code datasets could be used to fine-tune future models for better performance.
- Combining WybeCoder with larger language models might push success rates even higher at the same compute budget.
Load-bearing premise
The manual translation from functional to imperative benchmark specifications keeps semantic equivalence and the same level of difficulty.
What would settle it
Running the system on a collection of imperative verification tasks created independently without manual translation and finding success rates below 40 percent would challenge the claim of substantial improvement.
Figures
read the original abstract
Recent progress in large language models (LLMs) has substantially advanced automatic code generation and formal theorem proving, yet software verification has not seen comparable gains. To address this gap, we propose WybeCoder, an agentic code verification framework that enables prove-as-you-generate development, in which code, invariants, and proofs co-evolve. WybeCoder builds on a recent framework that combines automatic verification condition generation and SMT solving with interactive proofs in Lean. To enable systematic evaluation, we translate two benchmarks for functional verification in Lean, Verina and Clever, into equivalent imperative code specifications. On complex algorithms such as Heapsort, we observe consistent performance improvements as we scale our approach, synthesizing dozens of valid invariants and dispatching dozens of subgoals, ultimately producing hundreds of lines of verified code and overcoming plateaus reported in previous work. Our best system solves 74% of Verina tasks and 62% of Clever tasks at moderate compute budgets, substantially surpassing previous evaluations and paving the way for the automated construction of large-scale datasets of verified imperative code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WybeCoder, an agentic framework for prove-as-you-generate development of verified imperative code. It combines automatic verification condition generation and SMT solving with interactive Lean proofs, translates the Verina and Clever functional benchmarks into imperative specifications, and reports solving 74% of Verina tasks and 62% of Clever tasks at moderate compute budgets, claiming substantial improvements over prior evaluations and enabling large-scale verified imperative datasets.
Significance. If the manual translations preserve semantic equivalence and comparable difficulty, the work would meaningfully advance automated construction of verified imperative code by demonstrating scalable invariant synthesis and subgoal dispatch on complex algorithms such as Heapsort. The empirical scaling results and concrete success rates provide a concrete baseline for future agentic verification systems.
major comments (2)
- [Benchmark Translation] Benchmark translation section: The manuscript states only that Verina and Clever were manually translated into 'equivalent imperative code specifications' but supplies neither the translation rules, any Lean proofs of equivalence, nor spot-checks that a solution to the imperative spec satisfies the original functional spec. This directly undermines the headline comparison (74% Verina, 62% Clever) to prior functional-verification results, as the observed gains could arise from altered task difficulty rather than framework improvements.
- [Evaluation Results] Evaluation and results section: The reported success rates lack accompanying error analysis, breakdown by task category, or statistical significance tests. Without these, it is impossible to assess whether the performance edge over prior work is robust or driven by a small number of easy cases.
minor comments (2)
- [Abstract] Abstract: 'moderate compute budgets' is left undefined; a brief parenthetical on token or iteration limits would improve reproducibility.
- [Figures] Figure captions (e.g., those showing invariant counts for Heapsort): add explicit axis labels and a short note on what constitutes a 'valid' invariant.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional details and analysis.
read point-by-point responses
-
Referee: [Benchmark Translation] Benchmark translation section: The manuscript states only that Verina and Clever were manually translated into 'equivalent imperative code specifications' but supplies neither the translation rules, any Lean proofs of equivalence, nor spot-checks that a solution to the imperative spec satisfies the original functional spec. This directly undermines the headline comparison (74% Verina, 62% Clever) to prior functional-verification results, as the observed gains could arise from altered task difficulty rather than framework improvements.
Authors: We agree that the current manuscript provides insufficient detail on the translation process. In the revised version, we will expand the Benchmark Translation section with explicit translation rules (e.g., mapping of recursive functions to loops with mutable arrays and explicit invariants), include Lean equivalence lemmas for a representative sample of 10 tasks, and add spot-checks verifying that solutions to the imperative specifications satisfy the original functional specs via additional proof obligations. These changes will substantiate that the reported performance reflects framework improvements rather than altered difficulty. revision: yes
-
Referee: [Evaluation Results] Evaluation and results section: The reported success rates lack accompanying error analysis, breakdown by task category, or statistical significance tests. Without these, it is impossible to assess whether the performance edge over prior work is robust or driven by a small number of easy cases.
Authors: We concur that the evaluation would benefit from greater rigor. The revised results section will include a detailed error analysis categorizing failures (invariant synthesis timeouts, unprovable subgoals, and SMT solver limits), success-rate breakdowns by task categories (sorting, searching, graph algorithms, and arithmetic), and statistical significance tests (binomial proportion tests with 95% confidence intervals) comparing against prior baselines. These additions will demonstrate the robustness of the 74% and 62% figures. revision: yes
Circularity Check
No circularity: empirical success rates are direct measurements on translated benchmarks
full rationale
The paper's central results (74% Verina, 62% Clever) are obtained by running the WybeCoder agent on a set of manually translated imperative specifications and counting solved tasks. These percentages are not computed from any fitted parameters, self-referential equations, or ansatzes that loop back to the inputs. The translation step is presented as a methodological choice whose validity is assumed rather than derived; it does not create a definitional or statistical circularity in the reported numbers. Prior frameworks are cited for the underlying verification engine, but the performance figures themselves are new empirical observations and do not reduce to those citations by construction. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Lean theorem prover and SMT solvers are sound for the verification conditions they are asked to discharge
Forward citations
Cited by 1 Pith paper
-
Fidelity Probes for Specification--Code Alignment
Fidelity probes from code raise specification fidelity from 0.63 to 0.94 on a 12k-line COBOL benchmark over eight iterations, with convergence predicted by a two-state Markov fixed point from four iterations of rate data.
Reference graph
Works this paper leans on
-
[1]
Springer Nature Switzerland. ISBN 978-3-031- 90660-2. Breitner, J. Loogle. https://github.com/ nomeata/loogle, 2023. Search tool for Lean/Math- lib. Chen, L., Gu, J., Huang, L., Huang, W., Jiang, Z., Jie, A., Jin, X., Jin, X., Li, C., Ma, K., Ren, C., Shen, J., Shi, W., Sun, T., Sun, H., Wang, J., Wang, S., Wang, Z., Wei, C., Wei, S., Wu, Y ., Wu, Y ., Xi...
-
[2]
Discusses the Loom framework and the Velvet verifier
URL https://ilyasergey.net/assets/ pdf/papers/loom-preprint.pdf. Discusses the Loom framework and the Velvet verifier. Grothendieck, A. Récoltes et semailles. https://ncatlab.org/nlab/show/R%C3% A9coltes+et+semailles, 1985. Manuscript, written 1985–1987. Harmonic. Aristotle learns to code, achieving new state-of-the-art of 96.8% on code verification bench...
work page 1985
- [3]
-
[4]
An Axiomatic Basis for Computer Programming
doi: 10.1145/363235.363259. Hubert, T., Mehta, R., Sartran, L., Horváth, M. Z., Žuži ´c, G., Wieser, E., Huang, A., Schrittwieser, J., Schroecker, Y ., Masoom, H., Bertolli, O., Zahavy, T., Mandhane, A., Yung, J., Beloshapka, I., Ibarz, B., Veeriah, V ., Yu, L., Nash, O., Lezeau, P., Mercuri, S., Sönne, C., Mehta, B., Davies, A., Zheng, D., Pedregosa, F.,...
-
[5]
seL4: formal verification of an OS kernel,
URL https://github.com/ JetBrains-Research/verified-cogen. Klein, G., Elphinstone, K., Heiser, G., Andronick, J., Cock, D., Derrin, P., Elkaduwe, D., Engelhardt, K., Kolanski, R., Norrish, M., Sewell, T., Tuch, H., and Winwood, S. sel4: formal verification of an os ker- nel. InProceedings of the ACM SIGOPS 22nd Sym- posium on Operating Systems Principles,...
-
[6]
doi: 10.1007/978-3-642-17511-4\ _20
Springer, 2010. doi: 10.1007/978-3-642-17511-4\ _20. Leroy, X. Formal verification of a realistic compiler.Com- mun. ACM, 52(7):107–115, July 2009. ISSN 0001-
-
[7]
Formal verification of a realistic compiler,
doi: 10.1145/1538788.1538814. URL https: //doi.org/10.1145/1538788.1538814. Li, Y ., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., Hubert, T., Choy, P., de Masson d’Autume, C., Babuschkin, I., Chen, X., Huang, P.-S., Welbl, J., Gowal, S., Cherepanov, A., Molloy, J., Mankowitz, D. J., S...
-
[8]
URL https://github.com/math-inc/ strongpnt. mathlib Community, T. The lean mathematical library. In Proceedings of the 9th ACM SIGPLAN International Con- ference on Certified Programs and Proofs, POPL ’20, pp. 10 WybeCoder: Verified Imperative Code Generation 367–381. ACM, January 2020. doi: 10.1145/3372885. 3373824. URL http://dx.doi.org/10.1145/ 3372885...
-
[9]
ISSN 0304-3975. doi: 10.1016/j.tcs.2006.12
-
[10]
URL https://doi.org/10.1016/j.tcs. 2006.12.035. OpenAI, :, Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beu- tel, A., Carney, A., Iftimie, A., Karpenko, A., Passos, A. T., Neitz, A., Prokofiev, A., Wei, A., Tam, A., Bennett, A., Kumar, A., Saraiva, A., Vallone, A., Duberstein, A., Kondrich, A., Mishchenko...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1016/j.tcs 2006
-
[11]
Prompt Gemini 2.5 Pro to translate each specification into Velvetrequire/ensuressyntax
-
[12]
Conduct systematic human review to ensure correctness and idiomatic style
-
[13]
As a result, benchmark numbers are not strictly comparable to the original Clever results
Fix unsatisfiable specifications from the original dataset where identified We aim to stay as close as possible to the original Clever specifications, but some deviations are necessary for idiomatic Velvet. As a result, benchmark numbers are not strictly comparable to the original Clever results. However, this process yields a high-quality benchmark for v...
work page 2023
-
[14]
total" set_option loom.semantics.choice
and LeanExplore (Asher, 2025) as local services, two tools for semantic and syntactic search in Lean’s standard library andmathlib(mathlib Community, 2020). Instead of a flat design where the agent decides at each step whether to run Lean code or perform a library search and all interactions are concatenated, we opt for a nested design where outer turns a...
work page 2025
-
[15]
**Automated SMT Solving **: The primary tactic, ‘loom_solve‘, attempts to automatically discharge all proof obligations by translating them into SMT queries (using ‘cvc5‘)
-
[16]
**Interactive Proving **: If ‘loom_solve‘ fails, you can look at the remaining goals, prove theorems and use them to complete the proof interactively using standard Lean tactics (‘simp‘, ‘grind‘, ‘rw‘, ‘intros‘, etc.). --- ## 2. Velvet Language Syntax You must adhere to the following syntax for writing Velvet programs and proofs. ### Method Definition A p...
-
[17]
The necessary imports and options
-
[18]
Any required ‘solverHint‘ attributes
-
[19]
The full ‘method‘ definition with appropriate ‘require‘, ‘ensures‘, and ‘invariant‘ clauses
-
[20]
The final ‘prove_correct. . .by loom_solve‘ block that successfully verifies the program, using additional theorems and proof tactic code as needed
-
[21]
Comments explaining your choice of complex invariants. 66
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.