VeruSAGE: A Study of Agent-Based Verification for Rust Systems

Chenyuan Yang; Chris Hawblitzel; Jacob R. Lorch; Natalie Neamtu; Shan Lu

arxiv: 2512.18436 · v2 · submitted 2025-12-20 · 💻 cs.OS · cs.AI· cs.FL· cs.SE

VeruSAGE: A Study of Agent-Based Verification for Rust Systems

Chenyuan Yang , Natalie Neamtu , Chris Hawblitzel , Jacob R. Lorch , Shan Lu This is my paper

Pith reviewed 2026-05-16 21:01 UTC · model grok-4.3

classification 💻 cs.OS cs.AIcs.FLcs.SE

keywords LLM agentsformal verificationRustVerusproof generationsystem softwarebenchmark

0 comments

The pith

LLM agents complete over 80 percent of Rust system proof tasks when paired with tailored agent designs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether large language models can produce rigorous correctness proofs for system software written in Rust. It introduces VeruSAGE-Bench, a collection of 849 proof tasks drawn from eight already-verified open-source Rust programs that use the Verus verifier. Different agent architectures are constructed to compensate for the individual limitations of models such as o4-mini, GPT-5, and Sonnet variants. The strongest agent-model pair finishes more than 80 percent of the benchmark tasks and more than 90 percent of additional unfinished proof tasks drawn from the same projects. If these rates hold, LLM agents could materially reduce the manual effort required to verify system-level Rust code.

Core claim

Agent systems customized to the capabilities of particular LLMs enable completion of over 80 percent of the 849 system-verification tasks in VeruSAGE-Bench and over 90 percent of a separate collection of system proof tasks that human experts had not yet finished.

What carries the argument

Agent systems that pair specific tools and interaction patterns with individual LLMs to compensate for their respective weaknesses when generating Verus proofs for Rust systems.

If this is right

Different LLMs require distinct tool combinations and agent configurations to reach high success rates on system proof tasks.
LLM agents can finish a large fraction of proofs even on tasks that remained open after initial human development.
Verified system software development in Rust can incorporate automated assistance for proof generation at scale.
The benchmark itself provides a concrete yardstick for measuring future progress in LLM-based verification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the same agents succeed on fresh codebases, verification projects could shift from writing proofs from scratch to reviewing and refining agent outputs.
The approach may transfer to other languages that have similar separation-logic provers, provided comparable task-extraction pipelines are built.
Further gains are likely if the agents are allowed to iterate on failing proofs using feedback from the verifier itself.

Load-bearing premise

Results measured on proof tasks taken from already-verified systems will transfer to writing proofs for new, unverified Rust code without heavy per-project human adjustment.

What would settle it

Running the best reported agent on a previously unverified Rust system and observing completion rates well below 80 percent without additional tuning would show the reported performance does not generalize.

Figures

Figures reproduced from arXiv: 2512.18436 by Chenyuan Yang, Chris Hawblitzel, Jacob R. Lorch, Natalie Neamtu, Shan Lu.

**Figure 2.** Figure 2: Fraction of proof tasks with certain features [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Architecture of VeruSAGE AL AC IR MA NO NR OS ST VE Tot. 30% 6% 24% 32% 34% 15% 13% 19% 23% 20% [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: A specification adjustment suggested by an LLM [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: The monologue of Sonnet 4.5 for seq_skip_lemma. element whose value equals that of the original s[0]. As the log indicates ( [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: The success rate among tasks with different sizes [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 8.** Figure 8: Tradeoff among different models and agent systems [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

read the original abstract

Large language models (LLMs) have shown impressive capability to understand and develop code. However, their capability to rigorously reason about and prove code correctness remains in question. This paper offers a comprehensive study of LLMs' capability to develop correctness proofs for system software written in Rust. We curate a new system-verification benchmark suite, VeruSAGE-Bench, which consists of 849 proof tasks extracted from eight open-source Verus-verified Rust systems. Furthermore, we design different agent systems to match the strengths and weaknesses of different LLMs (o4-mini, GPT-5, Sonnet 4, and Sonnet 4.5). Our study shows that different tools and agent settings are needed to stimulate the system-verification capability of different types of LLMs. The best LLM-agent combination in our study completes over 80% of system-verification tasks in VeruSAGE-Bench. It also completes over 90% of a set of system proof tasks not part of VeruSAGE-Bench because they had not yet been finished by human experts. This result shows the great potential for LLM-assisted development of verified system software.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VeruSAGE gives the first sizable benchmark for LLM agents on real Verus system proofs and reports solid completion rates, but the tasks all come from already-verified code so generalization to fresh unverified modules stays untested.

read the letter

The main thing to know is that this paper builds VeruSAGE-Bench with 849 proof tasks drawn from eight open-source Verus-verified Rust systems and shows that well-tuned agent setups for models like o4-mini, GPT-5, and the Sonnet variants can finish over 80% of them, plus over 90% of some human-unfinished proofs. That is the first quantified data at this scale for system-level verification work in Rust. They also put real effort into matching different agent scaffolds and tools to each model's strengths, which is a practical step beyond generic prompting. Reporting results on the unfinished human tasks adds a useful angle for how this might help ongoing projects. The benchmark itself is new and drawn from actual systems code, so it moves past the smaller or synthetic setups in earlier LLM verification papers. The soft spot is exactly the one the stress-test flags: every task comes from systems whose proofs already exist, so the agents are largely recovering or repairing known structures rather than inventing verification strategies for brand-new, unverified Rust modules. Nothing in the reported experiments checks whether the same scaffolding transfers when no proof skeleton is present. The abstract also gives no details on task extraction rules, selection criteria, or statistical significance, which makes it hard to judge how robust the 80-90% numbers really are. This work is aimed at people building AI tools for formal verification and at systems researchers who want to see how far current agents can push Verus. It has enough new empirical content and a reusable benchmark to deserve a serious referee, even if the methods section will need expansion and a clearer discussion of generalization limits before publication.

Referee Report

3 major / 1 minor

Summary. The paper introduces VeruSAGE-Bench, a suite of 849 proof tasks extracted from eight open-source Verus-verified Rust systems, and evaluates multiple LLM-agent combinations (o4-mini, GPT-5, Sonnet 4, Sonnet 4.5) on completing system-verification proofs. It reports that the strongest agent-LLM pairing finishes over 80% of benchmark tasks and over 90% of additional unfinished human proof tasks, concluding that this demonstrates substantial potential for LLM-assisted development of verified Rust systems software.

Significance. If the reported success rates prove robust and the approach generalizes, the work would represent a meaningful step toward practical AI assistance in formal verification of systems code, particularly by lowering barriers to using Verus for Rust. The new benchmark itself is a concrete contribution that could support future research on LLM reasoning for low-level correctness proofs.

major comments (3)

[Benchmark construction] Benchmark construction section: the paper supplies no explicit criteria for extracting the 849 proof tasks from the eight systems, no discussion of selection bias, and no breakdown of task difficulty or coverage of typical systems-verification challenges (e.g., concurrency, memory safety, or device drivers). This omission makes it impossible to judge whether the 80%+ headline rates reflect genuine capability or an easier-than-average subset.
[Experimental evaluation] Experimental evaluation and results sections: success rates are presented without details on agent architectures, exact prompting/tool-use patterns, statistical significance testing, variance across runs, or the precise definition of 'completion' (full verification vs. partial proof). These omissions directly affect reproducibility and the reliability of the 80-90% figures.
[Discussion and conclusion] Discussion and conclusion: the claim of 'great potential for LLM-assisted development of verified system software' rests on tasks drawn exclusively from already-verified systems. No experiments test transfer to fresh, unverified Rust modules where no proof skeleton exists and the verification strategy must be invented rather than recovered; this gap is load-bearing for the broader significance asserted in the abstract.

minor comments (1)

[Abstract] Abstract and introduction: the models are referred to as 'o4-mini' and 'GPT-5'; confirm these are the intended names or correct to the standard designations used in the experimental section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments identify key areas where additional detail and nuance will strengthen the manuscript. We respond to each major comment below and indicate the revisions that will be incorporated.

read point-by-point responses

Referee: [Benchmark construction] Benchmark construction section: the paper supplies no explicit criteria for extracting the 849 proof tasks from the eight systems, no discussion of selection bias, and no breakdown of task difficulty or coverage of typical systems-verification challenges (e.g., concurrency, memory safety, or device drivers). This omission makes it impossible to judge whether the 80%+ headline rates reflect genuine capability or an easier-than-average subset.

Authors: We agree that the benchmark construction section requires more explicit documentation. In the revised manuscript we will add a dedicated subsection that states the extraction criteria: all tasks were obtained by automatically identifying complete 'proof' and 'lemma' blocks from the eight open-source Verus-verified systems that already pass the Verus verifier. We will discuss selection bias by explaining that verified systems were deliberately chosen to provide reliable ground-truth checking and will supply a breakdown of the 849 tasks by system, by category (memory safety, concurrency, ownership, etc.), and by proxy difficulty metrics such as proof length and number of required invariants. revision: yes
Referee: [Experimental evaluation] Experimental evaluation and results sections: success rates are presented without details on agent architectures, exact prompting/tool-use patterns, statistical significance testing, variance across runs, or the precise definition of 'completion' (full verification vs. partial proof). These omissions directly affect reproducibility and the reliability of the 80-90% figures.

Authors: We accept that the experimental description must be expanded for reproducibility. The revised paper will include a full account of each agent architecture, the precise prompting templates and tool-use loops (including how compiler feedback from Verus is incorporated), the definition of completion (a proof is counted as complete only when the Verus verifier reports no errors), and results aggregated over three independent runs with reported variance and appropriate statistical tests. revision: yes
Referee: [Discussion and conclusion] Discussion and conclusion: the claim of 'great potential for LLM-assisted development of verified system software' rests on tasks drawn exclusively from already-verified systems. No experiments test transfer to fresh, unverified Rust modules where no proof skeleton exists and the verification strategy must be invented rather than recovered; this gap is load-bearing for the broader significance asserted in the abstract.

Authors: We acknowledge the scope limitation. Our evaluation measures the ability to complete and extend proofs inside already-verified codebases, which is a realistic and high-value use case. The >90 % success rate on unfinished human proofs already demonstrates some capacity to generate new proof steps. Nevertheless, we agree that the absence of experiments on entirely new, unverified modules restricts the strength of the broader claim. In revision we will add an explicit limitations paragraph, moderate the language in the abstract and conclusion to emphasize assistance in proof completion rather than fully autonomous de-novo verification, and note the need for future work on fresh modules. revision: partial

Circularity Check

0 steps flagged

No significant circularity: purely empirical benchmark evaluation

full rationale

The paper reports experimental results from running LLM-based agents on VeruSAGE-Bench, a benchmark of 849 proof tasks extracted from eight already-Verus-verified open-source Rust systems, plus a small set of unfinished human tasks. No equations, derivations, fitted parameters, or self-referential definitions appear in the abstract or described structure. Claims reduce to direct measurements of completion rates (over 80% on the benchmark, over 90% on unfinished tasks) rather than any prediction that is forced by construction from the inputs. Self-citations, if present, are not load-bearing for a central theorem or uniqueness result. The generalization question (whether performance transfers to fresh unverified code) is an external-validity concern, not an internal circularity in any derivation chain. This matches the default expectation for an empirical systems paper with no mathematical reduction steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Empirical study with light axiomatic load; relies on the soundness of the existing Verus verifier and the representativeness of the selected open-source projects.

axioms (1)

domain assumption Verus verifier is sound for the extracted proof tasks
The benchmark treats successful Verus checks as ground truth for correctness.

pith-pipeline@v0.9.0 · 5521 in / 1236 out tokens · 29006 ms · 2026-05-16T21:01:41.900352+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation
cs.SE 2026-05 unverdicted novelty 6.0

VeriContest supplies 946 problems with specs, code, proofs, and tests to benchmark verifiable code generation in Rust/Verus, showing models reach 92% on code but only 5% end-to-end on full verifiable synthesis.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

AlphaVerus: Bootstrapping formally verified code generation through self-improving translation and treefinement.ICML, 2025

P. Aggarwal, B. Parno, and S. Welleck. Alphaverus: Bootstrapping formally verified code generation through self-improving translation and treefinement.arXiv preprint arXiv:2412.06176, 2024

work page arXiv 2024
[2]

Introducing claude 4: Claude opus 4 and claude sonnet 4

Anthropic. Introducing claude 4: Claude opus 4 and claude sonnet 4. https://www.anthropic.co m/news/claude-4 , May 2025. Introduces the Claude Sonnet 4 model

work page 2025
[3]

Introducing claude sonnet 4.5

Anthropic. Introducing claude sonnet 4.5. https: //www.anthropic.com/news/claude-son net-4-5 , Sept. 2025. Describes the Claude Sonnet 4.5 model

work page 2025
[4]

Astrauskas, P

V . Astrauskas, P. Müller, F. Poli, and A. J. Sum- mers. Leveraging rust types for modular specifica- tion and verification.Proc. ACM Program. Lang., 3(OOPSLA):147:1–147:30, 2019

work page 2019
[5]

Program Synthesis with Large Language Models

J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V . Le, and C. Sutton. Program synthesis with large language models.CoRR, abs/2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Y . Cai, P. Singh, Z. Lin, J. Bosamiya, J. Gancher, M. Sur- batovich, and B. Parno. Vest: Verified, secure, high- performance parsing and serialization for Rust. InPro- ceedings of the USENIX Security Symposium, August 2025

work page 2025
[7]

Chakraborty, G

S. Chakraborty, G. Ebner, S. Bhat, S. Fakhoury, S. Fa- tima, S. K. Lahiri, and N. Swamy. Towards neural syn- thesis for smt-assisted proof-oriented programming. In IEEE/ACM 47th International Conference on Software Engineering (ICSE), 2025

work page 2025
[8]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, M. Krueger, H. Petrov, I. Khattam, C. Hesse, S. Agar- wal, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, B. Power, L. Kaiser, M. Bavarian, C. King, T. Kerr, S. McCandlish, A. Radford, I. Sutskever, and W. Za...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

T. Chen, S. Lu, S. Lu, Y . Gong, C. Yang, X. Li, M. R. H. Misu, H. Yu, N. Duan, P. Cheng, F. Yang, S. K. Lahiri, T. Xie, and L. Zhou. Automated proof generation for rust code via self-evolution. InProceedings of the 13th International Conference on Learning Representations (ICLR), 2025

work page 2025
[10]

X. Chen, Z. Li, J. Zhang, V . Narayanan, and A. Burt- sev. Atmosphere: Practical verified kernels with rust and verus. InProceedings of the ACM Symposium on Operating Systems Principles (SOSP). Association for Computing Machinery, 2025

work page 2025
[11]

Denis, J

X. Denis, J. Jourdan, and C. Marché. Creusot: A foundry for the deductive verification of rust programs. InFormal Methods and Software Engineering - 23rd International Conference on Formal Engineering Meth- ods, ICFEM 2022, Madrid, Spain, October 24-27, 2022, Proceedings, volume 13478 ofLecture Notes in Com- puter Science, pages 90–105. Springer, 2022

work page 2022
[12]

Github copilot cli

GitHub. Github copilot cli. https://github.com /github/copilot-cli , Sept. 2025. Command- line interface for GitHub Copilot

work page 2025
[13]

Ho and J

S. Ho and J. Protzenko. Aeneas: Rust verification by functional translation.Proc. ACM Program. Lang., 6(ICFP):711–741, 2022

work page 2022
[14]

GPT-4o System Card

A. Hurst, A. Lerer, and OpenAI. GPT-4o System Card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

R. Jain, S. Barke, G. Ebner, M. R. H. Misu, S. Lu, and S. Fakhoury. What’s in a proof? analyzing expert proof-writing processes in f* and verus.arXiv preprint arXiv:2508.02733, 2025

work page arXiv 2025
[16]

Kumar, D

A. Kumar, D. N. Gadde, K. K. Radhakrishna, and D. Let- tnin. Saarthi: The first ai formal verification engineer. CoRR, abs/2502.16662, 2025

work page arXiv 2025
[17]

Lattuada, T

A. Lattuada, T. Hance, J. Bosamiya, M. Brun, C. Cho, H. LeBlanc, P. Srinivasan, R. Achermann, T. Chajed, C. Hawblitzel, et al. Verus: A practical foundation for systems verification. InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Princi- ples, pages 438–454, 2024

work page 2024
[18]

Lattuada, T

A. Lattuada, T. Hance, C. Cho, M. Brun, I. Subasinghe, Y . Zhou, J. Howell, B. Parno, and C. Hawblitzel. Verus: Verifying rust programs using linear ghost types.Proc. ACM Program. Lang., 7(OOPSLA1):286–315, 2023. 13

work page 2023
[19]

LeBlanc, J

H. LeBlanc, J. R. Lorch, C. Hawblitzel, C. Huang, Y . Tao, N. Zeldovich, and V . Chidambaram. Power never corrupts: Tool-agnostic verification of crash con- sistency and corruption detection.OSDI (to appear), 2025

work page 2025
[20]

Z. Li, J. Sun, L. Murphy, Q. Su, Z. Li, X. Zhang, K. Yang, and X. Si. A survey on deep learning for theorem proving.CoRR, abs/2404.09939, 2024

work page arXiv 2024
[21]

VeriPlan: Integrating formal verification and LLMs into end-user planning,

Y . Liu and et al. Veriplan: Integrating formal verification and llms into end-user planning.CoRR, abs/2502.17898, 2025

work page arXiv 2025
[22]

J. R. Lorch. Fix usability issue with unused sub- regions library. https://github.com/microsoft/verified- storage/pull/39

work page
[23]

DafnyBench: A benchmark for formal software verification

C. Loughridge, Q. Sun, S. Ahrenbach, F. Cassano, C. Sun, Y . Sheng, A. Mudide, M. R. H. Misu, N. Amin, and M. Tegmark. Dafnybench: A benchmark for formal software verification.CoRR, abs/2406.08467, 2024

work page arXiv 2024
[24]

H. Ma, A. Goel, J.-B. Jeannin, M. Kapritsos, B. Kasikci, and K. A. Sakallah. I4: incremental inference of induc- tive invariants for verification of distributed protocols. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP), pages 370–384, 2019

work page 2019
[25]

M. R. H. Misu, C. V . Lopes, I. Ma, and J. Noble. To- wards ai-assisted synthesis of verified dafny methods. Proc. ACM Softw. Eng., 1(FSE):812–835, 2024

work page 2024
[26]

Laurel: Unblocking automated verification with large language models.PACMPL (OOPSLA), 2025

E. Mugnier, E. A. Gonzalez, R. Jhala, N. Polikarpova, and Y . Zhou. Laurel: Generating dafny assertions using large language models.CoRR, abs/2405.16792, 2024

work page arXiv 2024
[27]

OpenAI. Codex. https://openai.com/codex/,

work page
[28]

OpenAI’s coding agent for software development

work page
[29]

Gpt-5 system card

OpenAI. Gpt-5 system card. Technical report, OpenAI, Aug. 2025. Describes the GPT-5 model series

work page 2025
[30]

Openai o3 and o4-mini system card

OpenAI. Openai o3 and o4-mini system card. Technical report, OpenAI, Apr. 2025. Describes the OpenAI o4- mini reasoning model

work page 2025
[31]

C. Sun, Y . Sheng, O. Padon, and C. W. Barrett. Clover: Closed-loop verifiable code generation. InAI Verifica- tion - First International Symposium, SAIV 2024, Mon- treal, QC, Canada, July 22-23, 2024, Proceedings, vol- ume 14846 ofLecture Notes in Computer Science, pages 134–155. Springer, 2024

work page 2024
[32]

C. Sun, Y . Sun, D. Amrollahi, E. Zhang, S. Lahiri, S. Lu, D. Dill, and C. Barrett. Veristruct: Ai-assisted auto- mated verification of data-structure modules in verus. arXiv preprint arXiv:2510.25015, 2025

work page arXiv 2025
[33]

X. Sun, W. Ma, J. T. Gu, Z. Ma, T. Chajed, J. How- ell, A. Lattuada, O. Padon, L. Suresh, A. Szekeres, and T. Xu. Anvil: Verifying liveness of cluster management controllers. In18th USENIX Symposium on Operat- ing Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024, pages 649–666. USENIX Association, 2024

work page 2024
[34]

T. K. Team. How open source projects are using kani to write better software in rust. https://aws.amaz on.com/blogs/opensource/how-open-sou rce-projects-are-using-kani-to-write -better-software-in-rust/ , 2024. [Online], [Accessed: 2024-09-01]

work page 2024
[35]

Ferreira, Sorin Lerner, and Emily First

K. Thompson, N. Saavedra, P. Carrott, K. Fisher, A. Sanchez-Stern, Y . Brun, J. F. Ferreira, S. Lerner, and E. First. Rango: Adaptive retrieval-augmented prov- ing for automated software verification.arXiv preprint arXiv:2412.14063, 2024

work page arXiv 2024
[36]

R. Tian, Y . Ye, Y . Qin, X. Cong, Y . Lin, Y . Pan, Y . Wu, H. Hui, W. Liu, Z. Liu, and M. Sun. Debugbench: Eval- uating debugging capability of large language models. pages 4173–4198, 2024

work page 2024
[37]

Point-biserial correlation coef- ficient — Wikipedia, the free encyclopedia, 2025

Wikipedia contributors. Point-biserial correlation coef- ficient — Wikipedia, the free encyclopedia, 2025. [On- line; accessed 10-December-2025]

work page 2025
[38]

C. S. Xia, Z. Wang, Y . Yang, Y . Wei, and L. Zhang. Live- swe-agent: Can software engineering agents self-evolve on the fly?arXiv preprint arXiv:2511.13646, 2025

work page arXiv 2025
[39]

C. S. Xia and L. Zhang. Keep the conversation going: Fixing 162 out of 337 bugs for $0.42 each using chatgpt. CoRR, abs/2304.00385, 2023

work page arXiv 2023
[40]

C. Yang, Y . Deng, R. Lu, J. Yao, J. Liu, R. Jabbar- vand, and L. Zhang. Whitefox: White-box compiler fuzzing empowered by large language models.Pro- ceedings of the ACM on Programming Languages, 8(OOPSLA2):709–735, 2024

work page 2024
[41]

C. Yang, X. Li, M. R. H. Misu, J. Yao, W. Cui, Y . Gong, C. Hawblitzel, S. K. Lahiri, J. R. Lorch, S. Lu, F. Yang, Z. Zhou, and S. Lu. Autoverus: Automated proof gen- eration for rust code.Proceedings of the ACM on Pro- gramming Languages, 9(OOPSLA2), 2025

work page 2025
[42]

C. Yang, Z. Zhao, Z. Xie, H. Li, and L. Zhang. Knighter: Transforming static analysis with llm-synthesized check- ers. InProceedings of the ACM SIGOPS 31st Sympo- sium on Operating Systems Principles, SOSP ’25, New York, NY , USA, 2025. Association for Computing Ma- chinery. 14

work page 2025
[43]

C. Yang, Z. Zhao, and L. Zhang. Kernelgpt: Enhanced kernel fuzzing via large language models. InProceed- ings of the 30th ACM International Conference on Ar- chitectural Support for Programming Languages and Operating Systems, Volume 2, pages 560–573, 2025

work page 2025
[44]

J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. Swe-agent: Agent- computer interfaces enable automated software engi- neering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

work page 2024
[45]

J. Yao, R. Tao, R. Gu, and J. Nieh. DuoAI: Fast, au- tomated inference of inductive invariants for verifying distributed protocols. InProceedings of the USENIX Symposium on Operating Systems Design and Imple- mentation (OSDI), 2022

work page 2022
[46]

J. Yao, R. Tao, R. Gu, J. Nieh, S. S. Jana, and G. Ryan. DistAI: Data-driven automated invariant learning for distributed protocols. InProceedings of the USENIX Symposium on Operating Systems Design and Imple- mentation (OSDI), 2021

work page 2021
[47]

Zhang, X

J. Zhang, X. Xu, Y . Zou, Z. Tang, X. Wan, K. Hu, S. Wang, W. Xu, D. Wang, H. Chen, et al. Cortenmm: Efficient memory management with strong correctness guarantees. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 1082–1098, 2025

work page 2025
[48]

T. N. Zhang, K. Singh, T. Chajed, M. Kapritsos, and B. Parno. Basilisk: using provenance invariants to auto- mate proofs of undecidable protocols. InProceedings of the 19th USENIX Conference on Operating Systems Design and Implementation (OSDI), 2025

work page 2025
[49]

S. C. Zhong and X. Si. Towards repository-level program verification with large language models. In Proceedings of the 1st ACM SIGPLAN International Workshop on Language Models and Programming Lan- guages (LMPL), 2025

work page 2025
[50]

Zhou, Anjali, W

Z. Zhou, Anjali, W. Chen, S. Gong, C. Hawblitzel, and W. Cui. Verismo: A verified security module for con- fidential vms. In18th USENIX Symposium on Operat- ing Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024, pages 599–614. USENIX Association, 2024. 15

work page 2024

[1] [1]

AlphaVerus: Bootstrapping formally verified code generation through self-improving translation and treefinement.ICML, 2025

P. Aggarwal, B. Parno, and S. Welleck. Alphaverus: Bootstrapping formally verified code generation through self-improving translation and treefinement.arXiv preprint arXiv:2412.06176, 2024

work page arXiv 2024

[2] [2]

Introducing claude 4: Claude opus 4 and claude sonnet 4

Anthropic. Introducing claude 4: Claude opus 4 and claude sonnet 4. https://www.anthropic.co m/news/claude-4 , May 2025. Introduces the Claude Sonnet 4 model

work page 2025

[3] [3]

Introducing claude sonnet 4.5

Anthropic. Introducing claude sonnet 4.5. https: //www.anthropic.com/news/claude-son net-4-5 , Sept. 2025. Describes the Claude Sonnet 4.5 model

work page 2025

[4] [4]

Astrauskas, P

V . Astrauskas, P. Müller, F. Poli, and A. J. Sum- mers. Leveraging rust types for modular specifica- tion and verification.Proc. ACM Program. Lang., 3(OOPSLA):147:1–147:30, 2019

work page 2019

[5] [5]

Program Synthesis with Large Language Models

J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V . Le, and C. Sutton. Program synthesis with large language models.CoRR, abs/2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[6] [6]

Y . Cai, P. Singh, Z. Lin, J. Bosamiya, J. Gancher, M. Sur- batovich, and B. Parno. Vest: Verified, secure, high- performance parsing and serialization for Rust. InPro- ceedings of the USENIX Security Symposium, August 2025

work page 2025

[7] [7]

Chakraborty, G

S. Chakraborty, G. Ebner, S. Bhat, S. Fakhoury, S. Fa- tima, S. K. Lahiri, and N. Swamy. Towards neural syn- thesis for smt-assisted proof-oriented programming. In IEEE/ACM 47th International Conference on Software Engineering (ICSE), 2025

work page 2025

[8] [8]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, M. Krueger, H. Petrov, I. Khattam, C. Hesse, S. Agar- wal, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, B. Power, L. Kaiser, M. Bavarian, C. King, T. Kerr, S. McCandlish, A. Radford, I. Sutskever, and W. Za...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[9] [9]

T. Chen, S. Lu, S. Lu, Y . Gong, C. Yang, X. Li, M. R. H. Misu, H. Yu, N. Duan, P. Cheng, F. Yang, S. K. Lahiri, T. Xie, and L. Zhou. Automated proof generation for rust code via self-evolution. InProceedings of the 13th International Conference on Learning Representations (ICLR), 2025

work page 2025

[10] [10]

X. Chen, Z. Li, J. Zhang, V . Narayanan, and A. Burt- sev. Atmosphere: Practical verified kernels with rust and verus. InProceedings of the ACM Symposium on Operating Systems Principles (SOSP). Association for Computing Machinery, 2025

work page 2025

[11] [11]

Denis, J

X. Denis, J. Jourdan, and C. Marché. Creusot: A foundry for the deductive verification of rust programs. InFormal Methods and Software Engineering - 23rd International Conference on Formal Engineering Meth- ods, ICFEM 2022, Madrid, Spain, October 24-27, 2022, Proceedings, volume 13478 ofLecture Notes in Com- puter Science, pages 90–105. Springer, 2022

work page 2022

[12] [12]

Github copilot cli

GitHub. Github copilot cli. https://github.com /github/copilot-cli , Sept. 2025. Command- line interface for GitHub Copilot

work page 2025

[13] [13]

Ho and J

S. Ho and J. Protzenko. Aeneas: Rust verification by functional translation.Proc. ACM Program. Lang., 6(ICFP):711–741, 2022

work page 2022

[14] [14]

GPT-4o System Card

A. Hurst, A. Lerer, and OpenAI. GPT-4o System Card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

R. Jain, S. Barke, G. Ebner, M. R. H. Misu, S. Lu, and S. Fakhoury. What’s in a proof? analyzing expert proof-writing processes in f* and verus.arXiv preprint arXiv:2508.02733, 2025

work page arXiv 2025

[16] [16]

Kumar, D

A. Kumar, D. N. Gadde, K. K. Radhakrishna, and D. Let- tnin. Saarthi: The first ai formal verification engineer. CoRR, abs/2502.16662, 2025

work page arXiv 2025

[17] [17]

Lattuada, T

A. Lattuada, T. Hance, J. Bosamiya, M. Brun, C. Cho, H. LeBlanc, P. Srinivasan, R. Achermann, T. Chajed, C. Hawblitzel, et al. Verus: A practical foundation for systems verification. InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Princi- ples, pages 438–454, 2024

work page 2024

[18] [18]

Lattuada, T

A. Lattuada, T. Hance, C. Cho, M. Brun, I. Subasinghe, Y . Zhou, J. Howell, B. Parno, and C. Hawblitzel. Verus: Verifying rust programs using linear ghost types.Proc. ACM Program. Lang., 7(OOPSLA1):286–315, 2023. 13

work page 2023

[19] [19]

LeBlanc, J

H. LeBlanc, J. R. Lorch, C. Hawblitzel, C. Huang, Y . Tao, N. Zeldovich, and V . Chidambaram. Power never corrupts: Tool-agnostic verification of crash con- sistency and corruption detection.OSDI (to appear), 2025

work page 2025

[20] [20]

Z. Li, J. Sun, L. Murphy, Q. Su, Z. Li, X. Zhang, K. Yang, and X. Si. A survey on deep learning for theorem proving.CoRR, abs/2404.09939, 2024

work page arXiv 2024

[21] [21]

VeriPlan: Integrating formal verification and LLMs into end-user planning,

Y . Liu and et al. Veriplan: Integrating formal verification and llms into end-user planning.CoRR, abs/2502.17898, 2025

work page arXiv 2025

[22] [22]

J. R. Lorch. Fix usability issue with unused sub- regions library. https://github.com/microsoft/verified- storage/pull/39

work page

[23] [23]

DafnyBench: A benchmark for formal software verification

C. Loughridge, Q. Sun, S. Ahrenbach, F. Cassano, C. Sun, Y . Sheng, A. Mudide, M. R. H. Misu, N. Amin, and M. Tegmark. Dafnybench: A benchmark for formal software verification.CoRR, abs/2406.08467, 2024

work page arXiv 2024

[24] [24]

H. Ma, A. Goel, J.-B. Jeannin, M. Kapritsos, B. Kasikci, and K. A. Sakallah. I4: incremental inference of induc- tive invariants for verification of distributed protocols. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP), pages 370–384, 2019

work page 2019

[25] [25]

M. R. H. Misu, C. V . Lopes, I. Ma, and J. Noble. To- wards ai-assisted synthesis of verified dafny methods. Proc. ACM Softw. Eng., 1(FSE):812–835, 2024

work page 2024

[26] [26]

Laurel: Unblocking automated verification with large language models.PACMPL (OOPSLA), 2025

E. Mugnier, E. A. Gonzalez, R. Jhala, N. Polikarpova, and Y . Zhou. Laurel: Generating dafny assertions using large language models.CoRR, abs/2405.16792, 2024

work page arXiv 2024

[27] [27]

OpenAI. Codex. https://openai.com/codex/,

work page

[28] [28]

OpenAI’s coding agent for software development

work page

[29] [29]

Gpt-5 system card

OpenAI. Gpt-5 system card. Technical report, OpenAI, Aug. 2025. Describes the GPT-5 model series

work page 2025

[30] [30]

Openai o3 and o4-mini system card

OpenAI. Openai o3 and o4-mini system card. Technical report, OpenAI, Apr. 2025. Describes the OpenAI o4- mini reasoning model

work page 2025

[31] [31]

C. Sun, Y . Sheng, O. Padon, and C. W. Barrett. Clover: Closed-loop verifiable code generation. InAI Verifica- tion - First International Symposium, SAIV 2024, Mon- treal, QC, Canada, July 22-23, 2024, Proceedings, vol- ume 14846 ofLecture Notes in Computer Science, pages 134–155. Springer, 2024

work page 2024

[32] [32]

C. Sun, Y . Sun, D. Amrollahi, E. Zhang, S. Lahiri, S. Lu, D. Dill, and C. Barrett. Veristruct: Ai-assisted auto- mated verification of data-structure modules in verus. arXiv preprint arXiv:2510.25015, 2025

work page arXiv 2025

[33] [33]

X. Sun, W. Ma, J. T. Gu, Z. Ma, T. Chajed, J. How- ell, A. Lattuada, O. Padon, L. Suresh, A. Szekeres, and T. Xu. Anvil: Verifying liveness of cluster management controllers. In18th USENIX Symposium on Operat- ing Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024, pages 649–666. USENIX Association, 2024

work page 2024

[34] [34]

T. K. Team. How open source projects are using kani to write better software in rust. https://aws.amaz on.com/blogs/opensource/how-open-sou rce-projects-are-using-kani-to-write -better-software-in-rust/ , 2024. [Online], [Accessed: 2024-09-01]

work page 2024

[35] [35]

Ferreira, Sorin Lerner, and Emily First

K. Thompson, N. Saavedra, P. Carrott, K. Fisher, A. Sanchez-Stern, Y . Brun, J. F. Ferreira, S. Lerner, and E. First. Rango: Adaptive retrieval-augmented prov- ing for automated software verification.arXiv preprint arXiv:2412.14063, 2024

work page arXiv 2024

[36] [36]

R. Tian, Y . Ye, Y . Qin, X. Cong, Y . Lin, Y . Pan, Y . Wu, H. Hui, W. Liu, Z. Liu, and M. Sun. Debugbench: Eval- uating debugging capability of large language models. pages 4173–4198, 2024

work page 2024

[37] [37]

Point-biserial correlation coef- ficient — Wikipedia, the free encyclopedia, 2025

Wikipedia contributors. Point-biserial correlation coef- ficient — Wikipedia, the free encyclopedia, 2025. [On- line; accessed 10-December-2025]

work page 2025

[38] [38]

C. S. Xia, Z. Wang, Y . Yang, Y . Wei, and L. Zhang. Live- swe-agent: Can software engineering agents self-evolve on the fly?arXiv preprint arXiv:2511.13646, 2025

work page arXiv 2025

[39] [39]

C. S. Xia and L. Zhang. Keep the conversation going: Fixing 162 out of 337 bugs for $0.42 each using chatgpt. CoRR, abs/2304.00385, 2023

work page arXiv 2023

[40] [40]

C. Yang, Y . Deng, R. Lu, J. Yao, J. Liu, R. Jabbar- vand, and L. Zhang. Whitefox: White-box compiler fuzzing empowered by large language models.Pro- ceedings of the ACM on Programming Languages, 8(OOPSLA2):709–735, 2024

work page 2024

[41] [41]

C. Yang, X. Li, M. R. H. Misu, J. Yao, W. Cui, Y . Gong, C. Hawblitzel, S. K. Lahiri, J. R. Lorch, S. Lu, F. Yang, Z. Zhou, and S. Lu. Autoverus: Automated proof gen- eration for rust code.Proceedings of the ACM on Pro- gramming Languages, 9(OOPSLA2), 2025

work page 2025

[42] [42]

C. Yang, Z. Zhao, Z. Xie, H. Li, and L. Zhang. Knighter: Transforming static analysis with llm-synthesized check- ers. InProceedings of the ACM SIGOPS 31st Sympo- sium on Operating Systems Principles, SOSP ’25, New York, NY , USA, 2025. Association for Computing Ma- chinery. 14

work page 2025

[43] [43]

C. Yang, Z. Zhao, and L. Zhang. Kernelgpt: Enhanced kernel fuzzing via large language models. InProceed- ings of the 30th ACM International Conference on Ar- chitectural Support for Programming Languages and Operating Systems, Volume 2, pages 560–573, 2025

work page 2025

[44] [44]

J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. Swe-agent: Agent- computer interfaces enable automated software engi- neering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

work page 2024

[45] [45]

J. Yao, R. Tao, R. Gu, and J. Nieh. DuoAI: Fast, au- tomated inference of inductive invariants for verifying distributed protocols. InProceedings of the USENIX Symposium on Operating Systems Design and Imple- mentation (OSDI), 2022

work page 2022

[46] [46]

J. Yao, R. Tao, R. Gu, J. Nieh, S. S. Jana, and G. Ryan. DistAI: Data-driven automated invariant learning for distributed protocols. InProceedings of the USENIX Symposium on Operating Systems Design and Imple- mentation (OSDI), 2021

work page 2021

[47] [47]

Zhang, X

J. Zhang, X. Xu, Y . Zou, Z. Tang, X. Wan, K. Hu, S. Wang, W. Xu, D. Wang, H. Chen, et al. Cortenmm: Efficient memory management with strong correctness guarantees. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 1082–1098, 2025

work page 2025

[48] [48]

T. N. Zhang, K. Singh, T. Chajed, M. Kapritsos, and B. Parno. Basilisk: using provenance invariants to auto- mate proofs of undecidable protocols. InProceedings of the 19th USENIX Conference on Operating Systems Design and Implementation (OSDI), 2025

work page 2025

[49] [49]

S. C. Zhong and X. Si. Towards repository-level program verification with large language models. In Proceedings of the 1st ACM SIGPLAN International Workshop on Language Models and Programming Lan- guages (LMPL), 2025

work page 2025

[50] [50]

Zhou, Anjali, W

Z. Zhou, Anjali, W. Chen, S. Gong, C. Hawblitzel, and W. Cui. Verismo: A verified security module for con- fidential vms. In18th USENIX Symposium on Operat- ing Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024, pages 599–614. USENIX Association, 2024. 15

work page 2024