pith. machine review for the scientific record.

arxiv: 2605.08553 · v1 · submitted 2026-05-08 · 💻 cs.SE · cs.AI · cs.LG

Recognition: no theorem link

VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:17 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG
keywords verifiable code generation · formal specifications · proof generation · benchmark · competitive programming · Rust · specification generation · end-to-end synthesis

The pith

The strongest current model generates code from natural-language descriptions with 92 percent accuracy but succeeds on only 5 percent of problems when also required to produce formal specifications and machine-checkable proofs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a benchmark of 946 competitive-programming problems, each equipped with natural-language descriptions, expert-validated formal specifications, accepted Rust implementations, Verus-checked proofs, and test suites. It then evaluates ten state-of-the-art language models across isolated tasks and the full end-to-end pipeline. The evaluation shows that high performance on plain code generation does not translate to the additional requirements of writing specifications and proofs. A sympathetic reader would care because the work supplies a concrete, scalable testbed for measuring progress toward AI systems that can deliver software accompanied by mathematical correctness guarantees rather than relying solely on testing.
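To make the three artifacts concrete, the sketch below is a toy Verus-style example written for this review, not a problem from the benchmark: the ensures clauses are the formal specification, the function body is the executable Rust implementation, and whatever Verus needs beyond the body (loop invariants, assertions, lemmas) constitutes the proof. Here the proof obligations are simple enough that Verus can discharge them automatically.

```rust
use vstd::prelude::*;

verus! {

fn max_of(a: i32, b: i32) -> (m: i32)
    // Artifact 1: the formal specification.
    ensures
        m == a || m == b,
        m >= a,
        m >= b,
{
    // Artifact 2: the executable Rust implementation.
    if a >= b { a } else { b }
    // Artifact 3: the proof. For a body this small, Verus discharges the
    // postconditions automatically; benchmark problems typically also need
    // explicit loop invariants, assertions, and lemmas.
}

} // verus!
```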

Core claim

VeriContest demonstrates that the strongest evaluated model achieves 92.18 percent on natural-language-to-code generation yet only 48.31 percent on specification generation, 13.95 percent on proof generation, and 5.29 percent on end-to-end verified program synthesis, thereby identifying specification and proof generation as the primary bottlenecks.

What carries the argument

The three-phase construction pipeline that begins with manually verified seed problems, expands semi-automatically with human-in-the-loop review, and applies postcondition testing to validate specification completeness for all 946 problems.
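The postcondition-testing idea can be pictured with the hypothetical sketch below; it is not the paper's actual tooling, and the names post_lower_bound and accepted_lower_bound are invented for illustration (the contract loosely mirrors the lower_bound specification excerpted in Figure 9). A complete postcondition, re-expressed as an executable predicate, should accept every output of a judge-accepted solution and reject deliberately corrupted outputs.

```rust
// Hypothetical illustration of postcondition testing (not the paper's tool):
// re-express a postcondition as an executable predicate and check it against
// a judge-accepted reference solution and against corrupted outputs.
fn post_lower_bound(nums: &[i32], target: i32, idx: usize) -> bool {
    idx <= nums.len()
        && nums[..idx].iter().all(|&x| x < target)
        && nums[idx..].iter().all(|&x| x >= target)
}

// Stand-in for a judge-accepted reference implementation.
fn accepted_lower_bound(nums: &[i32], target: i32) -> usize {
    nums.partition_point(|&x| x < target)
}

fn main() {
    let nums = [1, 3, 3, 5, 8];
    for target in [0, 3, 4, 9] {
        let idx = accepted_lower_bound(&nums, target);
        // A complete postcondition accepts every correct output...
        assert!(post_lower_bound(&nums, target, idx));
        // ...and rejects a deliberately wrong one (a negative test).
        let wrong = if idx + 1 <= nums.len() { idx + 1 } else { idx - 1 };
        assert!(!post_lower_bound(&nums, target, wrong));
    }
    println!("postcondition accepted correct outputs and rejected corrupted ones");
}
```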

If this is right

  • Specification generation must be treated as a distinct training objective rather than an automatic byproduct of code generation.
  • Proof generation remains the most difficult subtask and will require targeted improvements in reasoning over formal invariants.
  • End-to-end verified synthesis currently succeeds on only a small fraction of problems, so incremental gains in any single component will be necessary before reliable full-pipeline performance emerges.
  • The benchmark's separate evaluation modes allow future systems to be scored on partial progress rather than requiring perfect end-to-end results immediately.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training regimes that optimize only for passing tests may systematically under-prepare models for the additional constraints of formal verification.
  • The observed gap suggests that scaling laws observed for code generation may not hold for joint specification-plus-proof tasks without new architectural or data innovations.
  • Because the problems are drawn from standard competitive-programming sources, the benchmark can serve as a drop-in replacement for existing coding benchmarks when correctness guarantees become the evaluation target.

Load-bearing premise

The pipeline and review process produce complete, unbiased formal specifications and correct proofs without introducing systematic errors or missing edge cases across the entire set of problems.

What would settle it

Manual inspection of a random sample of problems that finds either incomplete postconditions or incorrect Verus proofs, or re-running the model evaluations and obtaining end-to-end success rates above 20 percent on the same benchmark.
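A minimal audit harness along these lines might look like the sketch below. It assumes, purely for illustration, that each benchmark problem is stored as a single Verus file under a benchmark/ directory and that the verus binary is on PATH; neither layout is specified by the paper. It only re-checks stored ground-truth proofs; judging postcondition completeness still requires reading each specification against the original problem statement.

```rust
// Hypothetical audit harness (not the paper's tooling): re-verify a random
// sample of stored ground-truth Verus files with the verus CLI.
use std::path::PathBuf;
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Collect candidate problem files (assumed layout: one .rs file per problem).
    let mut files: Vec<PathBuf> = std::fs::read_dir("benchmark")?
        .filter_map(|e| e.ok().map(|e| e.path()))
        .filter(|p| p.extension().map_or(false, |x| x == "rs"))
        .collect();
    files.sort();

    // Deterministic pseudo-random sample of up to 50 problems (simple LCG,
    // so the audit is reproducible without extra dependencies).
    let mut state: u64 = 20260512;
    let mut sample = Vec::new();
    while sample.len() < 50 && !files.is_empty() {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let idx = (state >> 33) as usize % files.len();
        sample.push(files.swap_remove(idx));
    }

    // A failure here would indicate a broken ground-truth proof.
    for path in sample {
        let status = Command::new("verus").arg(&path).status()?;
        println!(
            "{}: {}",
            path.display(),
            if status.success() { "verified" } else { "FAILED" }
        );
    }
    Ok(())
}
```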

Figures

Figures reproduced from arXiv: 2605.08553 by Aaditi Rai, John Berberian Jr., Lize Shao, Mrigank Pawagi, Sicong Che, Wenxi Wang, Yuxin Liu, Zichen Xie.

Figure 1. An example Verus problem illustrating the three artifacts of verifiable code generation.
Figure 2. Three-phase construction pipeline for VeriContest.
Figure 3. Phase III test-case generation pipeline, illustrated with LeetCode 34.
Figure 4. Distribution of benchmark sources and difficulty levels.
Figure 5. ProofGen performance under independent sampling and iterative proof repair.
Figure 6. Error composition for the three CodeGen subtasks.
Figure 7. Error composition for SpecGen (nl2spec), ProofGen (spec_code2proof), and the End2End pipeline (end2end) for the ten evaluated models. Each bar normalizes over failed attempts only. Precondition Mismatch, Postcondition Mismatch, and Both Pre/Post Mismatch apply to nl2spec and end2end; Spec/Code Modified applies only to spec_code2proof; Incorrect Output and Timeout apply only to end2end.
Figure 8. Ground-truth specification of lc34.
Figure 9. Claude Opus 4.7 output for lc34 under end2end (excerpt). Verus verifies the program and all positive tests pass, but the specification is reduced to result.len() == 2; both the precondition equivalence check and the postcondition testing reject the generated specification.
Figure 10. GPT-5.5 output for lc34 under end2end (excerpt). The generated specification is substantially stronger than Claude Opus 4.7's, but Verus rejects two assertions inside the lower-bound loop because the proof does not connect the sortedness property to the loop invariant strongly enough to discharge the inductive step.
Figure 11. Prompt for the nl2code setting.
Figure 12. Prompt for the nl2spec setting.
Figure 13. Prompt for the spec2code setting.
Figure 14. Prompt for the nl_spec2code setting.
Figure 15. Prompt for the spec_code2proof setting.
Figure 16. Prompt for the end2end setting.
Figure 17. Prompt for proof repair.
Original abstract

Large language models can generate useful code from natural language, but their outputs come without correctness guarantees. Verifiable code generation offers a path beyond testing by requiring models to produce not only executable code, but also formal specifications and machine-checkable proofs. Progress in this direction, however, is difficult to measure: existing benchmarks are often small, focus on only one part of the pipeline, lack ground-truth proofs or rigorous specification validation, or target verification settings far from mainstream software development. We present VeriContest, a benchmark of 946 competitive-programming problems from LeetCode and Codeforces for verifiable code generation in Rust with Verus. Each problem pairs a natural language description with expert-validated formal specifications, judge-accepted Rust code, Verus-checked proofs, and positive and negative test suites. VeriContest is constructed through a three-phase pipeline that scales from manually verified seed problems to semi-automated expansion with human-in-the-loop review. To further strengthen benchmark quality, we use testing as an additional quality-assurance layer for validating postcondition completeness. VeriContest supports isolated and compositional evaluation of specification generation, code generation, proof generation, and end-to-end verified program synthesis. Evaluating ten state-of-the-art models reveals a sharp gap between coding ability and verifiable code generation: the strongest model reaches 92.18% on natural-language-to-code generation, but only 48.31% on specification generation, 13.95% on proof generation, and 5.29% end-to-end. These results identify proof and specification generation as the central bottlenecks for models and establish VeriContest as a rigorous platform for measuring and training future systems that generate code with machine-checkable correctness.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VeriContest, a benchmark of 946 competitive-programming problems drawn from LeetCode and Codeforces. Each instance supplies a natural-language description, expert-validated formal specifications and Verus proofs for Rust, judge-accepted code, and positive/negative test suites. The benchmark is built via a three-phase pipeline (seed manual verification, semi-automated expansion, human-in-the-loop review) augmented by postcondition testing. Evaluation of ten state-of-the-art models shows a large gap: up to 92.18% on natural-language-to-code generation versus 48.31% on specification generation, 13.95% on proof generation, and 5.29% end-to-end, identifying specification and proof generation as central bottlenecks.

Significance. If the ground-truth specifications and proofs are demonstrably complete and correct, VeriContest supplies a large-scale, multi-faceted platform for measuring progress toward machine-checkable code synthesis, a capability with clear relevance to reliable software engineering. The scale, the explicit support for isolated versus compositional evaluation, and the concrete identification of bottlenecks constitute a useful contribution. The absence of quantitative validation metrics for specification completeness, however, limits the strength of the central performance-gap claims.

major comments (2)
  1. [§3] §3 (three-phase pipeline and quality-assurance description): the manuscript asserts that human-in-the-loop review plus postcondition testing yields complete, unbiased formal specifications for all 946 problems, yet supplies no inter-annotator agreement statistics, no measured rate of post-review specification edits, and no explicit audit that every edge case implied by the original problem statements (overflow, empty inputs, maximum constraints) is captured by the postconditions. Because the headline gaps (92.18% NL-to-code vs. 5.29% end-to-end) rest on these specifications being sound and complete, the lack of such metrics is load-bearing.
  2. [§4] §4 (model evaluation): the reported proof-generation (13.95%) and end-to-end (5.29%) figures are presented as evidence of model limitations, but the paper does not quantify the fraction of benchmark problems for which the provided Verus proofs were non-trivial or the rate at which human reviewers had to correct proofs during construction. This information is needed to separate benchmark artifacts from genuine model shortcomings.
minor comments (2)
  1. [Abstract / §1] The abstract and §1 could more explicitly state the license and release plan for the benchmark artifacts (specifications, proofs, tests) to facilitate reproducibility.
  2. [§2 / §4] Notation for the four evaluation modes (specification generation, code generation, proof generation, end-to-end) is introduced without a compact summary table; a small table in §2 or §4 would improve readability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of benchmark validation. We address each major comment below and indicate where revisions will strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (three-phase pipeline and quality-assurance description): the manuscript asserts that human-in-the-loop review plus postcondition testing yields complete, unbiased formal specifications for all 946 problems, yet supplies no inter-annotator agreement statistics, no measured rate of post-review specification edits, and no explicit audit that every edge case implied by the original problem statements (overflow, empty inputs, maximum constraints) is captured by the postconditions. Because the headline gaps (92.18% NL-to-code vs. 5.29% end-to-end) rest on these specifications being sound and complete, the lack of such metrics is load-bearing.

    Authors: We agree that additional quantitative metrics would increase confidence in the specifications. The pipeline began with expert manual verification of seed problems, followed by semi-automated expansion and targeted human review; postcondition testing against judge-accepted solutions and test suites served as an empirical check for completeness. Inter-annotator agreement was not formally computed because reviews were largely sequential rather than parallel, and exact edit rates were not logged. We will revise §3 to provide a more detailed description of the review workflow, examples of edge-case handling (including overflow and boundary conditions), and any available qualitative indicators of review effort. An exhaustive per-problem audit of every conceivable edge case remains resource-intensive, but the combination of human review and testing provides substantial support for the reported performance gaps. revision: partial

  2. Referee: [§4] §4 (model evaluation): the reported proof-generation (13.95%) and end-to-end (5.29%) figures are presented as evidence of model limitations, but the paper does not quantify the fraction of benchmark problems for which the provided Verus proofs were non-trivial or the rate at which human reviewers had to correct proofs during construction. This information is needed to separate benchmark artifacts from genuine model shortcomings.

    Authors: The Verus proofs supplied with the benchmark are the minimal proofs sufficient for verification and include non-trivial elements such as loop invariants and assertions required for competitive-programming problems. We did not systematically record the fraction of non-trivial proofs or the precise correction rates during construction, as the primary goal was to produce a usable benchmark rather than to analyze construction effort. The isolated evaluations already demonstrate that proof generation remains difficult even when specifications are given, indicating that the low end-to-end scores reflect genuine model limitations rather than artifacts. In the revision we will add a qualitative discussion of proof complexity together with a categorization of a representative sample of problems by difficulty to help readers contextualize the results. revision: partial
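As an illustration of the proof elements this response refers to, the sketch below shows the kind of loop invariant and witness assertion a verified linear search needs before Verus accepts its postcondition. It is written in Verus-style syntax for this review; it is not one of the benchmark's ground-truth artifacts and has not been machine-checked here.

```rust
use vstd::prelude::*;

verus! {

fn contains(v: &Vec<i32>, target: i32) -> (found: bool)
    ensures
        found <==> exists|k: int| 0 <= k < v@.len() && v@[k] == target,
{
    let mut i: usize = 0;
    while i < v.len()
        // Proof element 1: the loop invariant records what has been ruled out.
        invariant
            i <= v@.len(),
            forall|k: int| 0 <= k < i ==> v@[k] != target,
        // Proof element 2: a termination measure.
        decreases v@.len() - i
    {
        if v[i] == target {
            // Proof element 3: an assertion naming the existential witness.
            assert(v@[i as int] == target);
            return true;
        }
        i += 1;
    }
    false
}

} // verus!
```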

standing simulated objections not resolved
  • Exact inter-annotator agreement statistics, rates of post-review specification edits, and rates of proof corrections during construction, as these were not systematically recorded in the original benchmark development process.

Circularity Check

0 steps flagged

No circularity: benchmark construction and external model evaluation are independent of self-defined inputs

full rationale

The paper introduces an external benchmark of 946 problems with human-validated specifications and proofs, then reports direct empirical performance of third-party models on specification generation, code generation, proof generation, and end-to-end synthesis. No equations, fitted parameters, predictions, or uniqueness theorems appear that reduce by construction to the paper's own inputs or prior self-citations. The three-phase pipeline is presented as a construction method rather than a derivation whose outputs are forced by its definitions, and all reported metrics (e.g., 92.18% NL-to-code vs. 5.29% end-to-end) are measurements of externally developed models rather than tautological renamings or self-referential fits.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that competitive-programming problems can be faithfully formalized with Verus and that the human-reviewed pipeline produces high-quality ground truth without systematic bias.

axioms (1)
  • domain assumption: Verus is a sound verifier that correctly checks Rust code against formal specifications
    The benchmark treats Verus-checked proofs as ground truth; soundness of the verifier is taken from the Verus literature.

pith-pipeline@v0.9.0 · 5633 in / 1352 out tokens · 40988 ms · 2026-05-12T01:17:45.726182+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 7 internal anchors

  1. [1]

    AI IDEs or Autonomous Agents? Measuring the Impact of Coding Agents on Software Development

    Shyam Agarwal, Hao He, and Bogdan Vasilescu. AI IDEs or autonomous agents? Measuring the impact of coding agents on software development. arXiv preprint arXiv:2601.13597, 2026

  2. [2]

    Claude opus 4.7

    Anthropic. Claude opus 4.7. https://www.anthropic.com/news/claude-opus-4-7 , April 2026. [Online; accessed: 28-April-2026]

  3. [3]

    Claude sonnet 4.6

    Anthropic. Claude sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6 , February 2026. [Online; accessed: 28-April-2026]

  4. [4]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

  5. [5]

    A Benchmark for Vericoding: Formally Verified Program Synthesis

    Sergiu Bursuc, Theodore Ehrenborg, Shaowei Lin, Lacramioara Astefanoaei, Ionel Emilian Chiosa, Jure Kukovec, Alok Singh, Oliver Butterley, Adem Bizid, Quinn Dougherty, et al. A benchmark for vericoding: formally verified program synthesis. arXiv preprint arXiv:2509.22908, 2025

  6. [6]

    Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems

    Yuhan Cao, Zian Chen, Kun Quan, Ziliang Zhang, Yu Wang, Xiaoning Dong, Yeqi Feng, Guanzhong He, Jingcheng Huang, Jianhao Li, et al. Can LLMs generate reliable test case generators? A study on competition-level programming problems. arXiv preprint arXiv:2506.06821, 2025

  7. [7]

    Formalspeccpp: A dataset of c++ formal specifications created using llms

    Madhurima Chakraborty, Peter Pirkelbauer, and Qing Yi. Formalspeccpp: A dataset of c++ formal specifications created using llms. In2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR), pages 758–762. IEEE, 2025

  8. [8]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  9. [9]

    Codeforces

    Codeforces. Codeforces. https://codeforces.com/, 2026. [Online; accessed: 28-April- 2026]

  10. [10]

    Structured Programming

    Ole-Johan Dahl, Edsger Wybe Dijkstra, and Charles Antony Richard Hoare. Structured programming. Academic Press Ltd., 1972

  11. [11]

    Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

  12. [12]

    VerifyThisBench: Generating Code, Specifications, and Proofs All at Once

    Xun Deng, Sicheng Zhong, Barış Bayazıt, Andreas Veneris, Fan Long, and Xujie Si. VerifyThisBench: Generating code, specifications, and proofs all at once. arXiv preprint arXiv:2505.19271, 2025

  13. [13]

    Proving the coding interview: A benchmark for formally verified code generation

    Quinn Dougherty and Ronak Mehta. Proving the coding interview: A benchmark for formally verified code generation. In2025 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code), pages 72–79. IEEE, 2025

  14. [14]

    Gemini 3 flash

    Google. Gemini 3 flash. https://deepmind.google/models/gemini/flash/, December

  15. [16]

    Gemini 3.1 pro

    Google. Gemini 3.1 pro. https://deepmind.google/models/gemini/pro/, February

  16. [17]

    [Online; accessed: 28-April-2026]

  17. [18]

    HardTests: Synthesizing High-Quality Test Cases for LLM Coding

    Zhongmou He, Yee Man Choi, Kexun Zhang, Jiabao Ji, Junting Zhou, Dejia Xu, Ivan Bercovich, Aidan Zhang, and Lei Li. HardTests: Synthesizing high-quality test cases for LLM coding. arXiv preprint arXiv:2505.24098, 2025

  18. [19]

    Measuring Coding Challenge Competence With APPS

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. Measuring coding challenge competence with apps.arXiv preprint arXiv:2105.09938, 2021

  19. [20]

    Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications

    Nam Huynh and Beiyu Lin. Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications. arXiv preprint arXiv:2503.01245, 2025

  20. [21]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

  21. [22]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

  22. [23]

    seL4: Formal Verification of an OS Kernel

    Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, et al. seL4: Formal verification of an OS kernel. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pages 207–220, 2009

  23. [24]

    Verus: A practical foundation for systems verification

    Andrea Lattuada, Travis Hance, Jay Bosamiya, Matthias Brun, Chanhee Cho, Hayley LeBlanc, Pranav Srinivasan, Reto Achermann, Tej Chajed, Chris Hawblitzel, et al. Verus: A practical foundation for systems verification. InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 438–454, 2024

  24. [25]

    Can llms reason about program semantics? a comprehensive evaluation of llms on formal specification inference

    Thanh Le-Cong, Bach Le, and Toby Murray. Can llms reason about program semantics? a comprehensive evaluation of llms on formal specification inference. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21991–22014, 2025

  25. [26]

    LeetCode

    LeetCode. https://leetcode.com/, 2026. [Online; accessed: 28-April-2026]

  26. [27]

    Dafny: An automatic program verifier for functional correctness

    K Rustan M Leino. Dafny: An automatic program verifier for functional correctness. In International conference on logic for programming artificial intelligence and reasoning, pages 348–370. Springer, 2010

  27. [28]

    Formal Verification of a Realistic Compiler

    Xavier Leroy. Formal verification of a realistic compiler. Communications of the ACM, 52(7):107–115, 2009

  28. [29]

    minicodeprops: a minimal benchmark for proving code properties

    Evan Lohn and Sean Welleck. minicodeprops: a minimal benchmark for proving code properties. arXiv preprint arXiv:2406.11915, 2024

  29. [30]

    DafnyBench: A Benchmark for Formal Software Verification

    Chloe Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue Sun, Ying Sheng, Anish Mudide, Md Rakib Hossain Misu, Nada Amin, and Max Tegmark. DafnyBench: A benchmark for formal software verification. arXiv preprint arXiv:2406.08467, 2024

  30. [31]

    Veribench: End-to-end formal verification benchmark for AI code generation in lean 4

    Brando Miranda, Zhanke Zhou, Allen Nie, Elyas Obbad, Leni Aniva, Kai Fronsdal, Weston Kirk, Dilara Soylu, Andrea Yu, Ying Li, and Sanmi Koyejo. Veribench: End-to-end formal verification benchmark for AI code generation in lean 4. In 2nd AI for Math Workshop @ ICML 2025, 2025. URL https://openreview.net/forum?id=rWkGFmnSNl

  31. [32]

    Towards AI-Assisted Synthesis of Verified Dafny Methods

    Md Rakib Hossain Misu, Cristina V Lopes, Iris Ma, and James Noble. Towards AI-assisted synthesis of verified Dafny methods. Proceedings of the ACM on Software Engineering, 1(FSE):812–835, 2024

  32. [33]

    The lean 4 theorem prover and programming language

    Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. InInternational Conference on Automated Deduction, pages 625–635. Springer, 2021

  33. [34]

    GPT-5.4 mini

    OpenAI. GPT-5.4 mini. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/, March 2026. [Online; accessed: 28-April-2026]

  34. [35]

    GPT-5.5

    OpenAI. GPT-5.5. https://openai.com/index/introducing-gpt-5-5/, April 2026. [Online; accessed: 28-April-2026]

  35. [36]

    cargo-mutants

    Martin Pool. cargo-mutants. https://github.com/sourcefrog/cargo-mutants, 2026

  36. [37]

    Qwen3.6-27B: Flagship-level coding in a 27B dense model, April 2026

    Qwen Team. Qwen3.6-27B: Flagship-level coding in a 27B dense model, April 2026. URL https://qwen.ai/blog?id=qwen3.6-27b

  37. [38]

    Beyond Postconditions: Can Large Language Models Infer Formal Contracts for Automatic Software Verification?

    Cedric Richter and Heike Wehrheim. Beyond postconditions: Can large language models infer formal contracts for automatic software verification? arXiv preprint arXiv:2510.12702, 2025

  38. [39]

    Security Vulnerabilities in AI-Generated Code: A Large-Scale Analysis of Public GitHub Repositories

    Maximilian Schreiber and Pascal Tippe. Security vulnerabilities in AI-generated code: A large-scale analysis of public GitHub repositories. In International Conference on Information and Communications Security, pages 153–172. Springer, 2025

  39. [40]

    CodeHacker: Automated Test Case Generation for Detecting Vulnerabilities in Competitive Programming Solutions

    Jingwei Shi, Xinxiang Yin, Jing Huang, Jinman Zhao, and Shengyu Tao. CodeHacker: Automated test case generation for detecting vulnerabilities in competitive programming solutions. arXiv preprint arXiv:2602.20213, 2026

  40. [41]

    Clover: Closed-Loop Verifiable Code Generation

    Chuyue Sun, Ying Sheng, Oded Padon, and Clark Barrett. Clover: Closed-loop verifiable code generation. In International Symposium on AI Verification, pages 134–155. Springer, 2024

  41. [42]

    Veristruct: Ai-assisted automated verification of data-structure modules in verus

    Chuyue Sun, Yican Sun, Daneshvar Amrollahi, Ethan Zhang, Shuvendu Lahiri, Shan Lu, David Dill, and Clark Barrett. Veristruct: Ai-assisted automated verification of data-structure modules in verus. InInternational Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 109–128. Springer, 2026

  42. [43]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    GLM Team, Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, Zhengxiao Du, Zihan Wang, Zilin Zhu, Bohan Zhang, Bosi Wen, Bowen Wu, Bo...

  43. [44]

    CLEVER: A Curated Benchmark for Formally Verified Code Generation

    Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, and Swarat Chaudhuri. CLEVER: A curated benchmark for formally verified code generation. arXiv preprint arXiv:2505.13938, 2025

  44. [45]

    Can Large Language Models Write Good Property-Based Tests?

    Vasudev Vikram, Caroline Lemieux, Joshua Sunshine, and Rohan Padhye. Can large language models write good property-based tests? arXiv preprint arXiv:2307.04346, 2023

  45. [46]

    Towards understanding the characteristics of code generation errors made by large language models

    Zhijie Wang, Zijie Zhou, Da Song, Yuheng Huang, Shengmai Chen, Lei Ma, and Tianyi Zhang. Towards understanding the characteristics of code generation errors made by large language models. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 2587–2599. IEEE, 2025

  46. [47]

    CodeContests+: High-Quality Test Case Generation for Competitive Programming

    Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, and Kai Shen. CodeContests+: High-quality test case generation for competitive programming. arXiv preprint arXiv:2506.05817, 2025

  47. [48]

    Beyond Static Pattern Matching? Rethinking Automatic Cryptographic API Misuse Detection in the Era of LLMs

    Yifan Xia, Zichen Xie, Peiyu Liu, Kangjie Lu, Yan Liu, Wenhai Wang, and Shouling Ji. Beyond static pattern matching? Rethinking automatic cryptographic API misuse detection in the era of LLMs. Proceedings of the ACM on Software Engineering, 2(ISSTA):113–136, 2025

  48. [49]

    AutoVerus: Automated Proof Generation for Rust Code

    Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Jianan Yao, Weidong Cui, Yeyun Gong, Chris Hawblitzel, Shuvendu Lahiri, Jacob R Lorch, Shuai Lu, et al. AutoVerus: Automated proof generation for Rust code. Proceedings of the ACM on Programming Languages, 9(OOPSLA2):3454–3482, 2025

  49. [50]

    VeruSAGE: A Study of Agent-Based Verification for Rust Systems

    Chenyuan Yang, Natalie Neamtu, Chris Hawblitzel, Jacob R Lorch, and Shan Lu. Verusage: A study of agent-based verification for rust systems.arXiv preprint arXiv:2512.18436, 2025

  50. [51]

    Knighter: Transforming Static Analysis with LLM-Synthesized Checkers

    Chenyuan Yang, Zijie Zhao, Zichen Xie, Haoyu Li, and Lingming Zhang. Knighter: Transforming static analysis with LLM-synthesized checkers. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 655–669, 2025

  51. [52]

    Verina: Benchmarking Verifiable Code Generation

    Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, and Dawn Song. Verina: Benchmarking verifiable code generation. arXiv preprint arXiv:2505.23135, 2025

  52. [53]

    AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms

    Haoyu Zhao, Ziran Yang, Jiawei Li, Deyuan He, Zenan Li, Chi Jin, Venugopal V Veeravalli, Aarti Gupta, and Sanjeev Arora. AlgoVeri: An aligned benchmark for verified code generation on classical algorithms. arXiv preprint arXiv:2602.09464, 2026

  53. [54]

    RAG-Verus: Repository-Level Program Verification with LLMs Using Retrieval Augmented Generation

    Sicheng Zhong, Jiading Zhu, Yifang Tian, and Xujie Si. RAG-Verus: Repository-level program verification with LLMs using retrieval augmented generation. arXiv preprint arXiv:2502.05344, 2025