Recognition: no theorem link
VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation
Pith reviewed 2026-05-12 01:17 UTC · model grok-4.3
The pith
The strongest current models generate code from natural-language descriptions with roughly 92 percent accuracy but drop to about 5 percent when also required to produce formal specifications and machine-checkable proofs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VeriContest demonstrates that the strongest evaluated model achieves 92.18 percent on natural-language-to-code generation yet only 48.31 percent on specification generation, 13.95 percent on proof generation, and 5.29 percent on end-to-end verified program synthesis, thereby identifying specification and proof generation as the primary bottlenecks.
What carries the argument
The three-phase construction pipeline that begins with manually verified seed problems, expands semi-automatically with human-in-the-loop review, and applies postcondition testing to validate specification completeness for all 946 problems.
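As a concrete illustration of that postcondition-testing layer, the sketch below uses a hypothetical problem (return an index of the maximum element of a non-empty array) and hypothetical names; nothing here is taken from the benchmark. An executable version of the postcondition is checked against judge-accepted outputs, which must satisfy it, and against outputs of deliberately wrong solutions, which must violate it, flagging specifications that are too strong or too weak.

```rust
// Hypothetical sketch of postcondition testing as a quality-assurance layer.
// Example problem (illustrative only): return an index of the maximum element
// of a non-empty array.

/// Executable form of the problem's postcondition.
fn postcondition(input: &[i32], output: usize) -> bool {
    output < input.len() && input.iter().all(|&x| x <= input[output])
}

fn main() {
    // Positive tests: judge-accepted outputs must satisfy the postcondition,
    // otherwise the specification is too strong (or simply wrong).
    let positive = vec![(vec![3, 1, 4, 1, 5], 4usize), (vec![7], 0usize)];
    // Negative tests: outputs of known-wrong solutions must violate it,
    // otherwise the specification is incomplete (too weak).
    let negative = vec![(vec![3, 1, 4, 1, 5], 0usize), (vec![2, 2], 5usize)];

    for (input, output) in &positive {
        assert!(postcondition(input, *output), "postcondition rejects a correct answer");
    }
    for (input, output) in &negative {
        assert!(!postcondition(input, *output), "postcondition accepts a wrong answer");
    }
    println!("postcondition survived positive and negative checks");
}
```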
If this is right
- Specification generation must be treated as a distinct training objective rather than an automatic byproduct of code generation.
- Proof generation remains the most difficult subtask and will require targeted improvements in reasoning over formal invariants.
- End-to-end verified synthesis currently succeeds on only a small fraction of problems, so gains in every component, not just one, will be needed before reliable full-pipeline performance emerges.
- The benchmark's separate evaluation modes allow future systems to be scored on partial progress rather than requiring perfect end-to-end results immediately.
Where Pith is reading between the lines
- Training regimes that optimize only for passing tests may systematically under-prepare models for the additional constraints of formal verification.
- The observed gap suggests that scaling laws observed for code generation may not hold for joint specification-plus-proof tasks without new architectural or data innovations.
- Because the problems are drawn from standard competitive-programming sources, the benchmark can serve as a drop-in replacement for existing coding benchmarks when correctness guarantees become the evaluation target.
Load-bearing premise
The pipeline and review process produce complete, unbiased formal specifications and correct proofs without introducing systematic errors or missing edge cases across the entire set of problems.
What would settle it
Manual inspection of a random sample of problems that finds either incomplete postconditions or incorrect Verus proofs, or re-running the model evaluations and obtaining end-to-end success rates above 20 percent on the same benchmark.
Original abstract
Large language models can generate useful code from natural language, but their outputs come without correctness guarantees. Verifiable code generation offers a path beyond testing by requiring models to produce not only executable code, but also formal specifications and machine-checkable proofs. Progress in this direction, however, is difficult to measure: existing benchmarks are often small, focus on only one part of the pipeline, lack ground-truth proofs or rigorous specification validation, or target verification settings far from mainstream software development. We present VeriContest, a benchmark of 946 competitive-programming problems from LeetCode and Codeforces for verifiable code generation in Rust with Verus. Each problem pairs a natural language description with expert-validated formal specifications, judge-accepted Rust code, Verus-checked proofs, and positive and negative test suites. VeriContest is constructed through a three-phase pipeline that scales from manually verified seed problems to semi-automated expansion with human-in-the-loop review. To further strengthen benchmark quality, we use testing as an additional quality-assurance layer for validating postcondition completeness. VeriContest supports isolated and compositional evaluation of specification generation, code generation, proof generation, and end-to-end verified program synthesis. Evaluating ten state-of-the-art models reveals a sharp gap between coding ability and verifiable code generation: the strongest model reaches 92.18% on natural-language-to-code generation, but only 48.31% on specification generation, 13.95% on proof generation, and 5.29% end-to-end. These results identify proof and specification generation as the central bottlenecks for models and establish VeriContest as a rigorous platform for measuring and training future systems that generate code with machine-checkable correctness.
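To make the evaluated task concrete, the toy sketch below (not a benchmark problem) shows the shape of the three artifacts each VeriContest instance pairs together: executable Rust code, a formal specification written as an ensures clause, and the proof annotations, here a loop invariant, that Verus needs to check the code against the specification. It assumes current Verus syntax with the verus! macro and the vstd prelude; the function and its spec are invented for illustration.

```rust
// Hypothetical toy example (not from the benchmark), assuming current Verus
// syntax: specification (ensures), executable code, and proof annotations
// (loop invariant) in one function.
use vstd::prelude::*;

verus! {

fn count_up_to(n: u32) -> (count: u32)
    // Specification: the postcondition Verus must prove about the return value.
    ensures
        count == n,
{
    // Executable code: the part ordinary coding benchmarks already measure.
    let mut count: u32 = 0;
    let mut i: u32 = 0;
    while i < n
        // Proof annotations: the invariant the verifier needs to discharge the
        // postcondition; producing annotations like this is where the evaluated
        // models score lowest.
        invariant
            i <= n,
            count == i,
        decreases n - i,
    {
        i = i + 1;
        count = count + 1;
    }
    count
}

} // verus!

fn main() {}
```

The benchmark's isolated modes score these layers separately (specification, code, and proof generation), while end-to-end verified synthesis requires producing all three at once.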
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VeriContest, a benchmark of 946 competitive-programming problems drawn from LeetCode and Codeforces. Each instance supplies a natural-language description, expert-validated formal specifications and Verus proofs for Rust, judge-accepted code, and positive/negative test suites. The benchmark is built via a three-phase pipeline (seed manual verification, semi-automated expansion, human-in-the-loop review) augmented by postcondition testing. Evaluation of ten state-of-the-art models shows a large gap: up to 92.18% on natural-language-to-code generation versus 48.31% on specification generation, 13.95% on proof generation, and 5.29% end-to-end, identifying specification and proof generation as central bottlenecks.
Significance. If the ground-truth specifications and proofs are demonstrably complete and correct, VeriContest supplies a large-scale, multi-faceted platform for measuring progress toward machine-checkable code synthesis, a capability with clear relevance to reliable software engineering. The scale, the explicit support for isolated versus compositional evaluation, and the concrete identification of bottlenecks constitute a useful contribution. The absence of quantitative validation metrics for specification completeness, however, limits the strength of the central performance-gap claims.
major comments (2)
- [§3] §3 (three-phase pipeline and quality-assurance description): the manuscript asserts that human-in-the-loop review plus postcondition testing yields complete, unbiased formal specifications for all 946 problems, yet supplies no inter-annotator agreement statistics, no measured rate of post-review specification edits, and no explicit audit that every edge case implied by the original problem statements (overflow, empty inputs, maximum constraints) is captured by the postconditions. Because the headline gaps (92.18% NL-to-code vs. 5.29% end-to-end) rest on these specifications being sound and complete, the lack of such metrics is load-bearing.
- [§4] §4 (model evaluation): the reported proof-generation (13.95%) and end-to-end (5.29%) figures are presented as evidence of model limitations, but the paper does not quantify the fraction of benchmark problems for which the provided Verus proofs were non-trivial or the rate at which human reviewers had to correct proofs during construction. This information is needed to separate benchmark artifacts from genuine model shortcomings.
minor comments (2)
- [Abstract / §1] The abstract and §1 could more explicitly state the license and release plan for the benchmark artifacts (specifications, proofs, tests) to facilitate reproducibility.
- [§2 / §4] Notation for the four evaluation modes (specification generation, code generation, proof generation, end-to-end) is introduced without a compact summary table; a small table in §2 or §4 would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of benchmark validation. We address each major comment below and indicate where revisions will strengthen the manuscript.
Point-by-point responses
Referee: [§3] §3 (three-phase pipeline and quality-assurance description): the manuscript asserts that human-in-the-loop review plus postcondition testing yields complete, unbiased formal specifications for all 946 problems, yet supplies no inter-annotator agreement statistics, no measured rate of post-review specification edits, and no explicit audit that every edge case implied by the original problem statements (overflow, empty inputs, maximum constraints) is captured by the postconditions. Because the headline gaps (92.18% NL-to-code vs. 5.29% end-to-end) rest on these specifications being sound and complete, the lack of such metrics is load-bearing.
Authors: We agree that additional quantitative metrics would increase confidence in the specifications. The pipeline began with expert manual verification of seed problems, followed by semi-automated expansion and targeted human review; postcondition testing against judge-accepted solutions and test suites served as an empirical check for completeness. Inter-annotator agreement was not formally computed because reviews were largely sequential rather than parallel, and exact edit rates were not logged. We will revise §3 to provide a more detailed description of the review workflow, examples of edge-case handling (including overflow and boundary conditions), and any available qualitative indicators of review effort. An exhaustive per-problem audit of every conceivable edge case remains resource-intensive, but the combination of human review and testing provides substantial support for the reported performance gaps. revision: partial
Referee: [§4] §4 (model evaluation): the reported proof-generation (13.95%) and end-to-end (5.29%) figures are presented as evidence of model limitations, but the paper does not quantify the fraction of benchmark problems for which the provided Verus proofs were non-trivial or the rate at which human reviewers had to correct proofs during construction. This information is needed to separate benchmark artifacts from genuine model shortcomings.
Authors: The Verus proofs supplied with the benchmark are the minimal proofs sufficient for verification and include non-trivial elements such as loop invariants and assertions required for competitive-programming problems. We did not systematically record the fraction of non-trivial proofs or the precise correction rates during construction, as the primary goal was to produce a usable benchmark rather than to analyze construction effort. The isolated evaluations already demonstrate that proof generation remains difficult even when specifications are given, indicating that the low end-to-end scores reflect genuine model limitations rather than artifacts. In the revision we will add a qualitative discussion of proof complexity together with a categorization of a representative sample of problems by difficulty to help readers contextualize the results. revision: partial
- Not provided in the rebuttal: exact inter-annotator agreement statistics, rates of post-review specification edits, and rates of proof corrections during construction, as these were not systematically recorded during the original benchmark development process.
Circularity Check
No circularity: benchmark construction and external model evaluation are independent of self-defined inputs
Full rationale
The paper introduces an external benchmark of 946 problems with human-validated specifications and proofs, then reports direct empirical performance of third-party models on specification generation, code generation, proof generation, and end-to-end synthesis. No equations, fitted parameters, predictions, or uniqueness theorems appear that reduce by construction to the paper's own inputs or prior self-citations. The three-phase pipeline is presented as a construction method rather than a derivation whose outputs are forced by its definitions, and all reported metrics (e.g., 92.18% NL-to-code vs. 5.29% end-to-end) are measurements on held-out models rather than tautological renamings or self-referential fits.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Verus is a sound verifier that correctly checks Rust code against formal specifications.