pith. sign in

arxiv: 2603.29088 · v2 · submitted 2026-03-31 · 💻 cs.SE · cs.AI

WybeCoder: Verified Imperative Code Generation

Pith reviewed 2026-05-14 00:29 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords verified imperative codecode generationformal verificationLeanSMT solversagentic AIinvariant discoverybenchmark translation
0
0 comments X

The pith

WybeCoder generates verified imperative code by co-evolving programs, invariants, and proofs in an agentic Lean framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

WybeCoder is an agentic framework that lets code generation, invariant finding, and formal proof steps happen together for imperative programs. The authors translate two existing functional verification benchmarks into imperative form and test the system on them. At moderate compute the best version solves 74 percent of Verina tasks and 62 percent of Clever tasks while producing hundreds of lines of verified code on hard examples like Heapsort. If this holds, it would let researchers build large collections of machine-verified imperative code automatically instead of by hand. A sympathetic reader cares because most current LLM code output still lacks any formal guarantee of correctness.

Core claim

The paper presents WybeCoder, which combines automatic verification condition generation and SMT solving with interactive proofs in Lean to support prove-as-you-generate development of imperative code. On the translated Verina and Clever benchmarks the best system reaches 74 percent and 62 percent success rates respectively, overcoming earlier performance plateaus by synthesizing dozens of invariants and dispatching dozens of subgoals per task.

What carries the argument

prove-as-you-generate development, in which code, invariants, and proofs co-evolve using LLM agents backed by Lean and SMT solvers.

If this is right

  • Scaling the approach yields consistent gains on complex algorithms such as Heapsort.
  • Dozens of valid invariants are found and dozens of subgoals are dispatched per successful task.
  • Hundreds of lines of verified imperative code are produced.
  • Previous reported performance plateaus are overcome.
  • Large-scale datasets of verified imperative code can be constructed automatically.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar agentic loops could be applied to verification in other theorem provers or languages beyond Lean.
  • Success on translated benchmarks indicates the method may work on native imperative verification problems without translation.
  • Automatically generated verified code datasets could be used to fine-tune future models for better performance.
  • Combining WybeCoder with larger language models might push success rates even higher at the same compute budget.

Load-bearing premise

The manual translation from functional to imperative benchmark specifications keeps semantic equivalence and the same level of difficulty.

What would settle it

Running the system on a collection of imperative verification tasks created independently without manual translation and finding success rates below 40 percent would challenge the claim of substantial improvement.

Figures

Figures reproduced from arXiv: 2603.29088 by Amaury Hayat, Darius Feher, Fabian Gloeckle, Gabriel Synnaeve, Kunhao Zheng, Mantas Baksys, Peter O'Hearn, Sean B. Holden.

Figure 1
Figure 1. Figure 1: WybeCoder: Subgoal decomposition multi-agent system. Starting from a problem specification, the agent generates an imple￾mentation and attempts to discharge verification conditions using CVC5. Remaining goals are tackled interactively in Lean, driving iterative implementation refinement or completing the proof. In software engineering, this progress has resulted in wide adoption of LLMs to speed-up code ge… view at source ↗
Figure 2
Figure 2. Figure 2: Inference scaling for sequential agents and multi￾agent system on Verina. Different pass@k settings entail different compute-performance scalings for sequential agents which we compare to a multi-agent system. The optimal mechanism depends on the specific model in use, with Gemini-3 benefiting from the multi-agent scaffold while Claude 4.5 Opus does not. 5.5. Scaling and Model Comparison Comparing the fron… view at source ↗
Figure 5
Figure 5. Figure 5: Inference scaling comparison for multi-agent system with different models. We evaluate using up to 128 subagents on Verina and plot by the total number of LLM calls spent according to iterative deepening (Section F.1). See [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 4
Figure 4. Figure 4: Multiple attempts with multi-agent systems. for our subgoal-decomposition system (Section 4.2), we ran GPT-5 on Clever with k = 2 copies [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Per-problem cumulative pass rates vs. number of LLM API calls for (a) Clever and (b) Verina. Curves represent the fraction of problems achieving first success on tests, proofs, or both. Statistics are computed per-problem then averaged to avoid overweighting easier problems [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Performance by compute and latency. F.3. Scalability of Multi-Agent Systems One reason for scaling inference with multi-agent systems as opposed to pure independent pass@k scaling of single-agent systems is that by a carefully designed way of sharing state, individual subsystems can mutually inform and benefit each other. For an ideal multi-agent system, additional compute would always optimally be used to… view at source ↗
Figure 8
Figure 8. Figure 8: Inference scaling comparison for multi-agent system with different models. We evaluate using up to 128 subagents on Verina and Clever and plot by the total budget of LLM calls spent and LLM call latency according to iterative deepening (Section F.1). The heapsort_main routine included a selection sort within it! The selection sort code was in fact unnecessary, because heapify does not permute outside the h… view at source ↗
Figure 9
Figure 9. Figure 9: Optimal inference budget allocation for GPT-5. Given a maximum inference budget of up to C language model calls, we compute the optimal breakdown into kT ≤ C with k ≤ 32 the number of independent attempts and T ≤ 32 the maximum number of turns per attempt. Optimal T is larger than 1, meaning that multi-turn interactions are useful. However, T saturates more quickly, suggesting that our bound k ≤ 32 was sub… view at source ↗
read the original abstract

Recent progress in large language models (LLMs) has substantially advanced automatic code generation and formal theorem proving, yet software verification has not seen comparable gains. To address this gap, we propose WybeCoder, an agentic code verification framework that enables prove-as-you-generate development, in which code, invariants, and proofs co-evolve. WybeCoder builds on a recent framework that combines automatic verification condition generation and SMT solving with interactive proofs in Lean. To enable systematic evaluation, we translate two benchmarks for functional verification in Lean, Verina and Clever, into equivalent imperative code specifications. On complex algorithms such as Heapsort, we observe consistent performance improvements as we scale our approach, synthesizing dozens of valid invariants and dispatching dozens of subgoals, ultimately producing hundreds of lines of verified code and overcoming plateaus reported in previous work. Our best system solves 74% of Verina tasks and 62% of Clever tasks at moderate compute budgets, substantially surpassing previous evaluations and paving the way for the automated construction of large-scale datasets of verified imperative code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces WybeCoder, an agentic framework for prove-as-you-generate development of verified imperative code. It combines automatic verification condition generation and SMT solving with interactive Lean proofs, translates the Verina and Clever functional benchmarks into imperative specifications, and reports solving 74% of Verina tasks and 62% of Clever tasks at moderate compute budgets, claiming substantial improvements over prior evaluations and enabling large-scale verified imperative datasets.

Significance. If the manual translations preserve semantic equivalence and comparable difficulty, the work would meaningfully advance automated construction of verified imperative code by demonstrating scalable invariant synthesis and subgoal dispatch on complex algorithms such as Heapsort. The empirical scaling results and concrete success rates provide a concrete baseline for future agentic verification systems.

major comments (2)
  1. [Benchmark Translation] Benchmark translation section: The manuscript states only that Verina and Clever were manually translated into 'equivalent imperative code specifications' but supplies neither the translation rules, any Lean proofs of equivalence, nor spot-checks that a solution to the imperative spec satisfies the original functional spec. This directly undermines the headline comparison (74% Verina, 62% Clever) to prior functional-verification results, as the observed gains could arise from altered task difficulty rather than framework improvements.
  2. [Evaluation Results] Evaluation and results section: The reported success rates lack accompanying error analysis, breakdown by task category, or statistical significance tests. Without these, it is impossible to assess whether the performance edge over prior work is robust or driven by a small number of easy cases.
minor comments (2)
  1. [Abstract] Abstract: 'moderate compute budgets' is left undefined; a brief parenthetical on token or iteration limits would improve reproducibility.
  2. [Figures] Figure captions (e.g., those showing invariant counts for Heapsort): add explicit axis labels and a short note on what constitutes a 'valid' invariant.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional details and analysis.

read point-by-point responses
  1. Referee: [Benchmark Translation] Benchmark translation section: The manuscript states only that Verina and Clever were manually translated into 'equivalent imperative code specifications' but supplies neither the translation rules, any Lean proofs of equivalence, nor spot-checks that a solution to the imperative spec satisfies the original functional spec. This directly undermines the headline comparison (74% Verina, 62% Clever) to prior functional-verification results, as the observed gains could arise from altered task difficulty rather than framework improvements.

    Authors: We agree that the current manuscript provides insufficient detail on the translation process. In the revised version, we will expand the Benchmark Translation section with explicit translation rules (e.g., mapping of recursive functions to loops with mutable arrays and explicit invariants), include Lean equivalence lemmas for a representative sample of 10 tasks, and add spot-checks verifying that solutions to the imperative specifications satisfy the original functional specs via additional proof obligations. These changes will substantiate that the reported performance reflects framework improvements rather than altered difficulty. revision: yes

  2. Referee: [Evaluation Results] Evaluation and results section: The reported success rates lack accompanying error analysis, breakdown by task category, or statistical significance tests. Without these, it is impossible to assess whether the performance edge over prior work is robust or driven by a small number of easy cases.

    Authors: We concur that the evaluation would benefit from greater rigor. The revised results section will include a detailed error analysis categorizing failures (invariant synthesis timeouts, unprovable subgoals, and SMT solver limits), success-rate breakdowns by task categories (sorting, searching, graph algorithms, and arithmetic), and statistical significance tests (binomial proportion tests with 95% confidence intervals) comparing against prior baselines. These additions will demonstrate the robustness of the 74% and 62% figures. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical success rates are direct measurements on translated benchmarks

full rationale

The paper's central results (74% Verina, 62% Clever) are obtained by running the WybeCoder agent on a set of manually translated imperative specifications and counting solved tasks. These percentages are not computed from any fitted parameters, self-referential equations, or ansatzes that loop back to the inputs. The translation step is presented as a methodological choice whose validity is assumed rather than derived; it does not create a definitional or statistical circularity in the reported numbers. Prior frameworks are cited for the underlying verification engine, but the performance figures themselves are new empirical observations and do not reduce to those citations by construction. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework relies on the established soundness of Lean and SMT solvers; no new free parameters, ad-hoc axioms, or invented entities are introduced beyond the agentic orchestration layer.

axioms (1)
  • standard math Lean theorem prover and SMT solvers are sound for the verification conditions they are asked to discharge
    The entire pipeline depends on these external tools correctly validating the generated invariants and proofs.

pith-pipeline@v0.9.0 · 5504 in / 1246 out tokens · 61705 ms · 2026-05-14T00:29:25.189587+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Fidelity Probes for Specification--Code Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    Fidelity probes from code raise specification fidelity from 0.63 to 0.94 on a 12k-line COBOL benchmark over eight iterations, with convergence predicted by a two-state Markov fixed point from four iterations of rate data.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    ISBN 978-3-031- 90660-2

    Springer Nature Switzerland. ISBN 978-3-031- 90660-2. Breitner, J. Loogle. https://github.com/ nomeata/loogle, 2023. Search tool for Lean/Math- lib. Chen, L., Gu, J., Huang, L., Huang, W., Jiang, Z., Jie, A., Jin, X., Jin, X., Li, C., Ma, K., Ren, C., Shen, J., Shi, W., Sun, T., Sun, H., Wang, J., Wang, S., Wang, Z., Wei, C., Wei, S., Wu, Y ., Wu, Y ., Xi...

  2. [2]

    Discusses the Loom framework and the Velvet verifier

    URL https://ilyasergey.net/assets/ pdf/papers/loom-preprint.pdf. Discusses the Loom framework and the Velvet verifier. Grothendieck, A. Récoltes et semailles. https://ncatlab.org/nlab/show/R%C3% A9coltes+et+semailles, 1985. Manuscript, written 1985–1987. Harmonic. Aristotle learns to code, achieving new state-of-the-art of 96.8% on code verification bench...

  3. [3]

    Hoare, C

    Blog post, accessed 2026-01-28. Hoare, C. A. R. An axiomatic basis for computer program- ming.Communications of the ACM, 12(10):576–580,

  4. [4]

    An Axiomatic Basis for Computer Programming

    doi: 10.1145/363235.363259. Hubert, T., Mehta, R., Sartran, L., Horváth, M. Z., Žuži ´c, G., Wieser, E., Huang, A., Schrittwieser, J., Schroecker, Y ., Masoom, H., Bertolli, O., Zahavy, T., Mandhane, A., Yung, J., Beloshapka, I., Ibarz, B., Veeriah, V ., Yu, L., Nash, O., Lezeau, P., Mercuri, S., Sönne, C., Mehta, B., Davies, A., Zheng, D., Pedregosa, F.,...

  5. [5]

    seL4: formal verification of an OS kernel,

    URL https://github.com/ JetBrains-Research/verified-cogen. Klein, G., Elphinstone, K., Heiser, G., Andronick, J., Cock, D., Derrin, P., Elkaduwe, D., Engelhardt, K., Kolanski, R., Norrish, M., Sewell, T., Tuch, H., and Winwood, S. sel4: formal verification of an os ker- nel. InProceedings of the ACM SIGOPS 22nd Sym- posium on Operating Systems Principles,...

  6. [6]

    doi: 10.1007/978-3-642-17511-4\ _20

    Springer, 2010. doi: 10.1007/978-3-642-17511-4\ _20. Leroy, X. Formal verification of a realistic compiler.Com- mun. ACM, 52(7):107–115, July 2009. ISSN 0001-

  7. [7]

    Formal verification of a realistic compiler,

    doi: 10.1145/1538788.1538814. URL https: //doi.org/10.1145/1538788.1538814. Li, Y ., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., Hubert, T., Choy, P., de Masson d’Autume, C., Babuschkin, I., Chen, X., Huang, P.-S., Welbl, J., Gowal, S., Cherepanov, A., Molloy, J., Mankowitz, D. J., S...

  8. [8]

    367–381.doi:10.1145/3372885

    URL https://github.com/math-inc/ strongpnt. mathlib Community, T. The lean mathematical library. In Proceedings of the 9th ACM SIGPLAN International Con- ference on Certified Programs and Proofs, POPL ’20, pp. 10 WybeCoder: Verified Imperative Code Generation 367–381. ACM, January 2020. doi: 10.1145/3372885. 3373824. URL http://dx.doi.org/10.1145/ 3372885...

  9. [9]

    doi: 10.1016/j.tcs.2006.12

    ISSN 0304-3975. doi: 10.1016/j.tcs.2006.12

  10. [10]

    OpenAI o1 System Card

    URL https://doi.org/10.1016/j.tcs. 2006.12.035. OpenAI, :, Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beu- tel, A., Carney, A., Iftimie, A., Karpenko, A., Passos, A. T., Neitz, A., Prokofiev, A., Wei, A., Tam, A., Bennett, A., Kumar, A., Saraiva, A., Vallone, A., Duberstein, A., Kondrich, A., Mishchenko...

  11. [11]

    Prompt Gemini 2.5 Pro to translate each specification into Velvetrequire/ensuressyntax

  12. [12]

    Conduct systematic human review to ensure correctness and idiomatic style

  13. [13]

    As a result, benchmark numbers are not strictly comparable to the original Clever results

    Fix unsatisfiable specifications from the original dataset where identified We aim to stay as close as possible to the original Clever specifications, but some deviations are necessary for idiomatic Velvet. As a result, benchmark numbers are not strictly comparable to the original Clever results. However, this process yields a high-quality benchmark for v...

  14. [14]

    total" set_option loom.semantics.choice

    and LeanExplore (Asher, 2025) as local services, two tools for semantic and syntactic search in Lean’s standard library andmathlib(mathlib Community, 2020). Instead of a flat design where the agent decides at each step whether to run Lean code or perform a library search and all interactions are concatenated, we opt for a nested design where outer turns a...

  15. [15]

    **Automated SMT Solving **: The primary tactic, ‘loom_solve‘, attempts to automatically discharge all proof obligations by translating them into SMT queries (using ‘cvc5‘)

  16. [16]

    unknown constant

    **Interactive Proving **: If ‘loom_solve‘ fails, you can look at the remaining goals, prove theorems and use them to complete the proof interactively using standard Lean tactics (‘simp‘, ‘grind‘, ‘rw‘, ‘intros‘, etc.). --- ## 2. Velvet Language Syntax You must adhere to the following syntax for writing Velvet programs and proofs. ### Method Definition A p...

  17. [17]

    The necessary imports and options

  18. [18]

    Any required ‘solverHint‘ attributes

  19. [19]

    The full ‘method‘ definition with appropriate ‘require‘, ‘ensures‘, and ‘invariant‘ clauses

  20. [20]

    by loom_solve‘ block that successfully verifies the program, using additional theorems and proof tactic code as needed

    The final ‘prove_correct. . .by loom_solve‘ block that successfully verifies the program, using additional theorems and proof tactic code as needed

  21. [21]

    Comments explaining your choice of complex invariants. 66