VeruSAGE: A Study of Agent-Based Verification for Rust Systems
Pith reviewed 2026-05-16 21:01 UTC · model grok-4.3
The pith
LLM agents complete over 80 percent of Rust system proof tasks when paired with tailored agent designs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agent systems customized to the capabilities of particular LLMs enable completion of over 80 percent of the 849 system-verification tasks in VeruSAGE-Bench and over 90 percent of a separate collection of system proof tasks that human experts had not yet finished.
What carries the argument
Agent systems that pair specific tools and interaction patterns with individual LLMs to compensate for their respective weaknesses when generating Verus proofs for Rust systems.
If this is right
- Different LLMs require distinct tool combinations and agent configurations to reach high success rates on system proof tasks.
- LLM agents can finish a large fraction of proofs even on tasks that remained open after initial human development.
- Verified system software development in Rust can incorporate automated assistance for proof generation at scale.
- The benchmark itself provides a concrete yardstick for measuring future progress in LLM-based verification.
Where Pith is reading between the lines
- If the same agents succeed on fresh codebases, verification projects could shift from writing proofs from scratch to reviewing and refining agent outputs.
- The approach may transfer to other languages that have similar separation-logic provers, provided comparable task-extraction pipelines are built.
- Further gains are likely if the agents are allowed to iterate on failing proofs using feedback from the verifier itself.
Load-bearing premise
Results measured on proof tasks taken from already-verified systems will transfer to writing proofs for new, unverified Rust code without heavy per-project human adjustment.
What would settle it
Running the best reported agent on a previously unverified Rust system and observing completion rates well below 80 percent without additional tuning would show the reported performance does not generalize.
Figures
read the original abstract
Large language models (LLMs) have shown impressive capability to understand and develop code. However, their capability to rigorously reason about and prove code correctness remains in question. This paper offers a comprehensive study of LLMs' capability to develop correctness proofs for system software written in Rust. We curate a new system-verification benchmark suite, VeruSAGE-Bench, which consists of 849 proof tasks extracted from eight open-source Verus-verified Rust systems. Furthermore, we design different agent systems to match the strengths and weaknesses of different LLMs (o4-mini, GPT-5, Sonnet 4, and Sonnet 4.5). Our study shows that different tools and agent settings are needed to stimulate the system-verification capability of different types of LLMs. The best LLM-agent combination in our study completes over 80% of system-verification tasks in VeruSAGE-Bench. It also completes over 90% of a set of system proof tasks not part of VeruSAGE-Bench because they had not yet been finished by human experts. This result shows the great potential for LLM-assisted development of verified system software.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VeruSAGE-Bench, a suite of 849 proof tasks extracted from eight open-source Verus-verified Rust systems, and evaluates multiple LLM-agent combinations (o4-mini, GPT-5, Sonnet 4, Sonnet 4.5) on completing system-verification proofs. It reports that the strongest agent-LLM pairing finishes over 80% of benchmark tasks and over 90% of additional unfinished human proof tasks, concluding that this demonstrates substantial potential for LLM-assisted development of verified Rust systems software.
Significance. If the reported success rates prove robust and the approach generalizes, the work would represent a meaningful step toward practical AI assistance in formal verification of systems code, particularly by lowering barriers to using Verus for Rust. The new benchmark itself is a concrete contribution that could support future research on LLM reasoning for low-level correctness proofs.
major comments (3)
- [Benchmark construction] Benchmark construction section: the paper supplies no explicit criteria for extracting the 849 proof tasks from the eight systems, no discussion of selection bias, and no breakdown of task difficulty or coverage of typical systems-verification challenges (e.g., concurrency, memory safety, or device drivers). This omission makes it impossible to judge whether the 80%+ headline rates reflect genuine capability or an easier-than-average subset.
- [Experimental evaluation] Experimental evaluation and results sections: success rates are presented without details on agent architectures, exact prompting/tool-use patterns, statistical significance testing, variance across runs, or the precise definition of 'completion' (full verification vs. partial proof). These omissions directly affect reproducibility and the reliability of the 80-90% figures.
- [Discussion and conclusion] Discussion and conclusion: the claim of 'great potential for LLM-assisted development of verified system software' rests on tasks drawn exclusively from already-verified systems. No experiments test transfer to fresh, unverified Rust modules where no proof skeleton exists and the verification strategy must be invented rather than recovered; this gap is load-bearing for the broader significance asserted in the abstract.
minor comments (1)
- [Abstract] Abstract and introduction: the models are referred to as 'o4-mini' and 'GPT-5'; confirm these are the intended names or correct to the standard designations used in the experimental section.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The comments identify key areas where additional detail and nuance will strengthen the manuscript. We respond to each major comment below and indicate the revisions that will be incorporated.
read point-by-point responses
-
Referee: [Benchmark construction] Benchmark construction section: the paper supplies no explicit criteria for extracting the 849 proof tasks from the eight systems, no discussion of selection bias, and no breakdown of task difficulty or coverage of typical systems-verification challenges (e.g., concurrency, memory safety, or device drivers). This omission makes it impossible to judge whether the 80%+ headline rates reflect genuine capability or an easier-than-average subset.
Authors: We agree that the benchmark construction section requires more explicit documentation. In the revised manuscript we will add a dedicated subsection that states the extraction criteria: all tasks were obtained by automatically identifying complete 'proof' and 'lemma' blocks from the eight open-source Verus-verified systems that already pass the Verus verifier. We will discuss selection bias by explaining that verified systems were deliberately chosen to provide reliable ground-truth checking and will supply a breakdown of the 849 tasks by system, by category (memory safety, concurrency, ownership, etc.), and by proxy difficulty metrics such as proof length and number of required invariants. revision: yes
-
Referee: [Experimental evaluation] Experimental evaluation and results sections: success rates are presented without details on agent architectures, exact prompting/tool-use patterns, statistical significance testing, variance across runs, or the precise definition of 'completion' (full verification vs. partial proof). These omissions directly affect reproducibility and the reliability of the 80-90% figures.
Authors: We accept that the experimental description must be expanded for reproducibility. The revised paper will include a full account of each agent architecture, the precise prompting templates and tool-use loops (including how compiler feedback from Verus is incorporated), the definition of completion (a proof is counted as complete only when the Verus verifier reports no errors), and results aggregated over three independent runs with reported variance and appropriate statistical tests. revision: yes
-
Referee: [Discussion and conclusion] Discussion and conclusion: the claim of 'great potential for LLM-assisted development of verified system software' rests on tasks drawn exclusively from already-verified systems. No experiments test transfer to fresh, unverified Rust modules where no proof skeleton exists and the verification strategy must be invented rather than recovered; this gap is load-bearing for the broader significance asserted in the abstract.
Authors: We acknowledge the scope limitation. Our evaluation measures the ability to complete and extend proofs inside already-verified codebases, which is a realistic and high-value use case. The >90 % success rate on unfinished human proofs already demonstrates some capacity to generate new proof steps. Nevertheless, we agree that the absence of experiments on entirely new, unverified modules restricts the strength of the broader claim. In revision we will add an explicit limitations paragraph, moderate the language in the abstract and conclusion to emphasize assistance in proof completion rather than fully autonomous de-novo verification, and note the need for future work on fresh modules. revision: partial
Circularity Check
No significant circularity: purely empirical benchmark evaluation
full rationale
The paper reports experimental results from running LLM-based agents on VeruSAGE-Bench, a benchmark of 849 proof tasks extracted from eight already-Verus-verified open-source Rust systems, plus a small set of unfinished human tasks. No equations, derivations, fitted parameters, or self-referential definitions appear in the abstract or described structure. Claims reduce to direct measurements of completion rates (over 80% on the benchmark, over 90% on unfinished tasks) rather than any prediction that is forced by construction from the inputs. Self-citations, if present, are not load-bearing for a central theorem or uniqueness result. The generalization question (whether performance transfers to fresh unverified code) is an external-validity concern, not an internal circularity in any derivation chain. This matches the default expectation for an empirical systems paper with no mathematical reduction steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Verus verifier is sound for the extracted proof tasks
Forward citations
Cited by 1 Pith paper
-
VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation
VeriContest supplies 946 problems with specs, code, proofs, and tests to benchmark verifiable code generation in Rust/Verus, showing models reach 92% on code but only 5% end-to-end on full verifiable synthesis.
Reference graph
Works this paper leans on
-
[1]
P. Aggarwal, B. Parno, and S. Welleck. Alphaverus: Bootstrapping formally verified code generation through self-improving translation and treefinement.arXiv preprint arXiv:2412.06176, 2024
-
[2]
Introducing claude 4: Claude opus 4 and claude sonnet 4
Anthropic. Introducing claude 4: Claude opus 4 and claude sonnet 4. https://www.anthropic.co m/news/claude-4 , May 2025. Introduces the Claude Sonnet 4 model
work page 2025
-
[3]
Anthropic. Introducing claude sonnet 4.5. https: //www.anthropic.com/news/claude-son net-4-5 , Sept. 2025. Describes the Claude Sonnet 4.5 model
work page 2025
-
[4]
V . Astrauskas, P. Müller, F. Poli, and A. J. Sum- mers. Leveraging rust types for modular specifica- tion and verification.Proc. ACM Program. Lang., 3(OOPSLA):147:1–147:30, 2019
work page 2019
-
[5]
Program Synthesis with Large Language Models
J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V . Le, and C. Sutton. Program synthesis with large language models.CoRR, abs/2108.07732, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[6]
Y . Cai, P. Singh, Z. Lin, J. Bosamiya, J. Gancher, M. Sur- batovich, and B. Parno. Vest: Verified, secure, high- performance parsing and serialization for Rust. InPro- ceedings of the USENIX Security Symposium, August 2025
work page 2025
-
[7]
S. Chakraborty, G. Ebner, S. Bhat, S. Fakhoury, S. Fa- tima, S. K. Lahiri, and N. Swamy. Towards neural syn- thesis for smt-assisted proof-oriented programming. In IEEE/ACM 47th International Conference on Software Engineering (ICSE), 2025
work page 2025
-
[8]
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, M. Krueger, H. Petrov, I. Khattam, C. Hesse, S. Agar- wal, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, B. Power, L. Kaiser, M. Bavarian, C. King, T. Kerr, S. McCandlish, A. Radford, I. Sutskever, and W. Za...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[9]
T. Chen, S. Lu, S. Lu, Y . Gong, C. Yang, X. Li, M. R. H. Misu, H. Yu, N. Duan, P. Cheng, F. Yang, S. K. Lahiri, T. Xie, and L. Zhou. Automated proof generation for rust code via self-evolution. InProceedings of the 13th International Conference on Learning Representations (ICLR), 2025
work page 2025
-
[10]
X. Chen, Z. Li, J. Zhang, V . Narayanan, and A. Burt- sev. Atmosphere: Practical verified kernels with rust and verus. InProceedings of the ACM Symposium on Operating Systems Principles (SOSP). Association for Computing Machinery, 2025
work page 2025
-
[11]
X. Denis, J. Jourdan, and C. Marché. Creusot: A foundry for the deductive verification of rust programs. InFormal Methods and Software Engineering - 23rd International Conference on Formal Engineering Meth- ods, ICFEM 2022, Madrid, Spain, October 24-27, 2022, Proceedings, volume 13478 ofLecture Notes in Com- puter Science, pages 90–105. Springer, 2022
work page 2022
-
[12]
GitHub. Github copilot cli. https://github.com /github/copilot-cli , Sept. 2025. Command- line interface for GitHub Copilot
work page 2025
- [13]
-
[14]
A. Hurst, A. Lerer, and OpenAI. GPT-4o System Card. arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [15]
- [16]
-
[17]
A. Lattuada, T. Hance, J. Bosamiya, M. Brun, C. Cho, H. LeBlanc, P. Srinivasan, R. Achermann, T. Chajed, C. Hawblitzel, et al. Verus: A practical foundation for systems verification. InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Princi- ples, pages 438–454, 2024
work page 2024
-
[18]
A. Lattuada, T. Hance, C. Cho, M. Brun, I. Subasinghe, Y . Zhou, J. Howell, B. Parno, and C. Hawblitzel. Verus: Verifying rust programs using linear ghost types.Proc. ACM Program. Lang., 7(OOPSLA1):286–315, 2023. 13
work page 2023
-
[19]
H. LeBlanc, J. R. Lorch, C. Hawblitzel, C. Huang, Y . Tao, N. Zeldovich, and V . Chidambaram. Power never corrupts: Tool-agnostic verification of crash con- sistency and corruption detection.OSDI (to appear), 2025
work page 2025
- [20]
-
[21]
VeriPlan: Integrating formal verification and LLMs into end-user planning,
Y . Liu and et al. Veriplan: Integrating formal verification and llms into end-user planning.CoRR, abs/2502.17898, 2025
-
[22]
J. R. Lorch. Fix usability issue with unused sub- regions library. https://github.com/microsoft/verified- storage/pull/39
-
[23]
DafnyBench: A benchmark for formal software verification
C. Loughridge, Q. Sun, S. Ahrenbach, F. Cassano, C. Sun, Y . Sheng, A. Mudide, M. R. H. Misu, N. Amin, and M. Tegmark. Dafnybench: A benchmark for formal software verification.CoRR, abs/2406.08467, 2024
-
[24]
H. Ma, A. Goel, J.-B. Jeannin, M. Kapritsos, B. Kasikci, and K. A. Sakallah. I4: incremental inference of induc- tive invariants for verification of distributed protocols. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP), pages 370–384, 2019
work page 2019
-
[25]
M. R. H. Misu, C. V . Lopes, I. Ma, and J. Noble. To- wards ai-assisted synthesis of verified dafny methods. Proc. ACM Softw. Eng., 1(FSE):812–835, 2024
work page 2024
-
[26]
Laurel: Unblocking automated verification with large language models.PACMPL (OOPSLA), 2025
E. Mugnier, E. A. Gonzalez, R. Jhala, N. Polikarpova, and Y . Zhou. Laurel: Generating dafny assertions using large language models.CoRR, abs/2405.16792, 2024
-
[27]
OpenAI. Codex. https://openai.com/codex/,
-
[28]
OpenAI’s coding agent for software development
-
[29]
OpenAI. Gpt-5 system card. Technical report, OpenAI, Aug. 2025. Describes the GPT-5 model series
work page 2025
-
[30]
Openai o3 and o4-mini system card
OpenAI. Openai o3 and o4-mini system card. Technical report, OpenAI, Apr. 2025. Describes the OpenAI o4- mini reasoning model
work page 2025
-
[31]
C. Sun, Y . Sheng, O. Padon, and C. W. Barrett. Clover: Closed-loop verifiable code generation. InAI Verifica- tion - First International Symposium, SAIV 2024, Mon- treal, QC, Canada, July 22-23, 2024, Proceedings, vol- ume 14846 ofLecture Notes in Computer Science, pages 134–155. Springer, 2024
work page 2024
- [32]
-
[33]
X. Sun, W. Ma, J. T. Gu, Z. Ma, T. Chajed, J. How- ell, A. Lattuada, O. Padon, L. Suresh, A. Szekeres, and T. Xu. Anvil: Verifying liveness of cluster management controllers. In18th USENIX Symposium on Operat- ing Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024, pages 649–666. USENIX Association, 2024
work page 2024
-
[34]
T. K. Team. How open source projects are using kani to write better software in rust. https://aws.amaz on.com/blogs/opensource/how-open-sou rce-projects-are-using-kani-to-write -better-software-in-rust/ , 2024. [Online], [Accessed: 2024-09-01]
work page 2024
-
[35]
Ferreira, Sorin Lerner, and Emily First
K. Thompson, N. Saavedra, P. Carrott, K. Fisher, A. Sanchez-Stern, Y . Brun, J. F. Ferreira, S. Lerner, and E. First. Rango: Adaptive retrieval-augmented prov- ing for automated software verification.arXiv preprint arXiv:2412.14063, 2024
-
[36]
R. Tian, Y . Ye, Y . Qin, X. Cong, Y . Lin, Y . Pan, Y . Wu, H. Hui, W. Liu, Z. Liu, and M. Sun. Debugbench: Eval- uating debugging capability of large language models. pages 4173–4198, 2024
work page 2024
-
[37]
Point-biserial correlation coef- ficient — Wikipedia, the free encyclopedia, 2025
Wikipedia contributors. Point-biserial correlation coef- ficient — Wikipedia, the free encyclopedia, 2025. [On- line; accessed 10-December-2025]
work page 2025
- [38]
- [39]
-
[40]
C. Yang, Y . Deng, R. Lu, J. Yao, J. Liu, R. Jabbar- vand, and L. Zhang. Whitefox: White-box compiler fuzzing empowered by large language models.Pro- ceedings of the ACM on Programming Languages, 8(OOPSLA2):709–735, 2024
work page 2024
-
[41]
C. Yang, X. Li, M. R. H. Misu, J. Yao, W. Cui, Y . Gong, C. Hawblitzel, S. K. Lahiri, J. R. Lorch, S. Lu, F. Yang, Z. Zhou, and S. Lu. Autoverus: Automated proof gen- eration for rust code.Proceedings of the ACM on Pro- gramming Languages, 9(OOPSLA2), 2025
work page 2025
-
[42]
C. Yang, Z. Zhao, Z. Xie, H. Li, and L. Zhang. Knighter: Transforming static analysis with llm-synthesized check- ers. InProceedings of the ACM SIGOPS 31st Sympo- sium on Operating Systems Principles, SOSP ’25, New York, NY , USA, 2025. Association for Computing Ma- chinery. 14
work page 2025
-
[43]
C. Yang, Z. Zhao, and L. Zhang. Kernelgpt: Enhanced kernel fuzzing via large language models. InProceed- ings of the 30th ACM International Conference on Ar- chitectural Support for Programming Languages and Operating Systems, Volume 2, pages 560–573, 2025
work page 2025
-
[44]
J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. Swe-agent: Agent- computer interfaces enable automated software engi- neering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024
work page 2024
-
[45]
J. Yao, R. Tao, R. Gu, and J. Nieh. DuoAI: Fast, au- tomated inference of inductive invariants for verifying distributed protocols. InProceedings of the USENIX Symposium on Operating Systems Design and Imple- mentation (OSDI), 2022
work page 2022
-
[46]
J. Yao, R. Tao, R. Gu, J. Nieh, S. S. Jana, and G. Ryan. DistAI: Data-driven automated invariant learning for distributed protocols. InProceedings of the USENIX Symposium on Operating Systems Design and Imple- mentation (OSDI), 2021
work page 2021
- [47]
-
[48]
T. N. Zhang, K. Singh, T. Chajed, M. Kapritsos, and B. Parno. Basilisk: using provenance invariants to auto- mate proofs of undecidable protocols. InProceedings of the 19th USENIX Conference on Operating Systems Design and Implementation (OSDI), 2025
work page 2025
-
[49]
S. C. Zhong and X. Si. Towards repository-level program verification with large language models. In Proceedings of the 1st ACM SIGPLAN International Workshop on Language Models and Programming Lan- guages (LMPL), 2025
work page 2025
-
[50]
Z. Zhou, Anjali, W. Chen, S. Gong, C. Hawblitzel, and W. Cui. Verismo: A verified security module for con- fidential vms. In18th USENIX Symposium on Operat- ing Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024, pages 599–614. USENIX Association, 2024. 15
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.