pith. sign in

arxiv: 2512.18436 · v2 · submitted 2025-12-20 · 💻 cs.OS · cs.AI· cs.FL· cs.SE

VeruSAGE: A Study of Agent-Based Verification for Rust Systems

Pith reviewed 2026-05-16 21:01 UTC · model grok-4.3

classification 💻 cs.OS cs.AIcs.FLcs.SE
keywords LLM agentsformal verificationRustVerusproof generationsystem softwarebenchmark
0
0 comments X

The pith

LLM agents complete over 80 percent of Rust system proof tasks when paired with tailored agent designs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether large language models can produce rigorous correctness proofs for system software written in Rust. It introduces VeruSAGE-Bench, a collection of 849 proof tasks drawn from eight already-verified open-source Rust programs that use the Verus verifier. Different agent architectures are constructed to compensate for the individual limitations of models such as o4-mini, GPT-5, and Sonnet variants. The strongest agent-model pair finishes more than 80 percent of the benchmark tasks and more than 90 percent of additional unfinished proof tasks drawn from the same projects. If these rates hold, LLM agents could materially reduce the manual effort required to verify system-level Rust code.

Core claim

Agent systems customized to the capabilities of particular LLMs enable completion of over 80 percent of the 849 system-verification tasks in VeruSAGE-Bench and over 90 percent of a separate collection of system proof tasks that human experts had not yet finished.

What carries the argument

Agent systems that pair specific tools and interaction patterns with individual LLMs to compensate for their respective weaknesses when generating Verus proofs for Rust systems.

If this is right

  • Different LLMs require distinct tool combinations and agent configurations to reach high success rates on system proof tasks.
  • LLM agents can finish a large fraction of proofs even on tasks that remained open after initial human development.
  • Verified system software development in Rust can incorporate automated assistance for proof generation at scale.
  • The benchmark itself provides a concrete yardstick for measuring future progress in LLM-based verification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the same agents succeed on fresh codebases, verification projects could shift from writing proofs from scratch to reviewing and refining agent outputs.
  • The approach may transfer to other languages that have similar separation-logic provers, provided comparable task-extraction pipelines are built.
  • Further gains are likely if the agents are allowed to iterate on failing proofs using feedback from the verifier itself.

Load-bearing premise

Results measured on proof tasks taken from already-verified systems will transfer to writing proofs for new, unverified Rust code without heavy per-project human adjustment.

What would settle it

Running the best reported agent on a previously unverified Rust system and observing completion rates well below 80 percent without additional tuning would show the reported performance does not generalize.

Figures

Figures reproduced from arXiv: 2512.18436 by Chenyuan Yang, Chris Hawblitzel, Jacob R. Lorch, Natalie Neamtu, Shan Lu.

Figure 1
Figure 1. Figure 1: A Verus-verified function simplified from NRKernel [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Fraction of proof tasks with certain features [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Architecture of VeruSAGE AL AC IR MA NO NR OS ST VE Tot. 30% 6% 24% 32% 34% 15% 13% 19% 23% 20% [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: A specification adjustment suggested by an LLM [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The monologue of Sonnet 4.5 for seq_skip_lemma. element whose value equals that of the original s[0]. As the log indicates ( [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The success rate among tasks with different sizes [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Tradeoff among different models and agent systems [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
read the original abstract

Large language models (LLMs) have shown impressive capability to understand and develop code. However, their capability to rigorously reason about and prove code correctness remains in question. This paper offers a comprehensive study of LLMs' capability to develop correctness proofs for system software written in Rust. We curate a new system-verification benchmark suite, VeruSAGE-Bench, which consists of 849 proof tasks extracted from eight open-source Verus-verified Rust systems. Furthermore, we design different agent systems to match the strengths and weaknesses of different LLMs (o4-mini, GPT-5, Sonnet 4, and Sonnet 4.5). Our study shows that different tools and agent settings are needed to stimulate the system-verification capability of different types of LLMs. The best LLM-agent combination in our study completes over 80% of system-verification tasks in VeruSAGE-Bench. It also completes over 90% of a set of system proof tasks not part of VeruSAGE-Bench because they had not yet been finished by human experts. This result shows the great potential for LLM-assisted development of verified system software.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces VeruSAGE-Bench, a suite of 849 proof tasks extracted from eight open-source Verus-verified Rust systems, and evaluates multiple LLM-agent combinations (o4-mini, GPT-5, Sonnet 4, Sonnet 4.5) on completing system-verification proofs. It reports that the strongest agent-LLM pairing finishes over 80% of benchmark tasks and over 90% of additional unfinished human proof tasks, concluding that this demonstrates substantial potential for LLM-assisted development of verified Rust systems software.

Significance. If the reported success rates prove robust and the approach generalizes, the work would represent a meaningful step toward practical AI assistance in formal verification of systems code, particularly by lowering barriers to using Verus for Rust. The new benchmark itself is a concrete contribution that could support future research on LLM reasoning for low-level correctness proofs.

major comments (3)
  1. [Benchmark construction] Benchmark construction section: the paper supplies no explicit criteria for extracting the 849 proof tasks from the eight systems, no discussion of selection bias, and no breakdown of task difficulty or coverage of typical systems-verification challenges (e.g., concurrency, memory safety, or device drivers). This omission makes it impossible to judge whether the 80%+ headline rates reflect genuine capability or an easier-than-average subset.
  2. [Experimental evaluation] Experimental evaluation and results sections: success rates are presented without details on agent architectures, exact prompting/tool-use patterns, statistical significance testing, variance across runs, or the precise definition of 'completion' (full verification vs. partial proof). These omissions directly affect reproducibility and the reliability of the 80-90% figures.
  3. [Discussion and conclusion] Discussion and conclusion: the claim of 'great potential for LLM-assisted development of verified system software' rests on tasks drawn exclusively from already-verified systems. No experiments test transfer to fresh, unverified Rust modules where no proof skeleton exists and the verification strategy must be invented rather than recovered; this gap is load-bearing for the broader significance asserted in the abstract.
minor comments (1)
  1. [Abstract] Abstract and introduction: the models are referred to as 'o4-mini' and 'GPT-5'; confirm these are the intended names or correct to the standard designations used in the experimental section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments identify key areas where additional detail and nuance will strengthen the manuscript. We respond to each major comment below and indicate the revisions that will be incorporated.

read point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction section: the paper supplies no explicit criteria for extracting the 849 proof tasks from the eight systems, no discussion of selection bias, and no breakdown of task difficulty or coverage of typical systems-verification challenges (e.g., concurrency, memory safety, or device drivers). This omission makes it impossible to judge whether the 80%+ headline rates reflect genuine capability or an easier-than-average subset.

    Authors: We agree that the benchmark construction section requires more explicit documentation. In the revised manuscript we will add a dedicated subsection that states the extraction criteria: all tasks were obtained by automatically identifying complete 'proof' and 'lemma' blocks from the eight open-source Verus-verified systems that already pass the Verus verifier. We will discuss selection bias by explaining that verified systems were deliberately chosen to provide reliable ground-truth checking and will supply a breakdown of the 849 tasks by system, by category (memory safety, concurrency, ownership, etc.), and by proxy difficulty metrics such as proof length and number of required invariants. revision: yes

  2. Referee: [Experimental evaluation] Experimental evaluation and results sections: success rates are presented without details on agent architectures, exact prompting/tool-use patterns, statistical significance testing, variance across runs, or the precise definition of 'completion' (full verification vs. partial proof). These omissions directly affect reproducibility and the reliability of the 80-90% figures.

    Authors: We accept that the experimental description must be expanded for reproducibility. The revised paper will include a full account of each agent architecture, the precise prompting templates and tool-use loops (including how compiler feedback from Verus is incorporated), the definition of completion (a proof is counted as complete only when the Verus verifier reports no errors), and results aggregated over three independent runs with reported variance and appropriate statistical tests. revision: yes

  3. Referee: [Discussion and conclusion] Discussion and conclusion: the claim of 'great potential for LLM-assisted development of verified system software' rests on tasks drawn exclusively from already-verified systems. No experiments test transfer to fresh, unverified Rust modules where no proof skeleton exists and the verification strategy must be invented rather than recovered; this gap is load-bearing for the broader significance asserted in the abstract.

    Authors: We acknowledge the scope limitation. Our evaluation measures the ability to complete and extend proofs inside already-verified codebases, which is a realistic and high-value use case. The >90 % success rate on unfinished human proofs already demonstrates some capacity to generate new proof steps. Nevertheless, we agree that the absence of experiments on entirely new, unverified modules restricts the strength of the broader claim. In revision we will add an explicit limitations paragraph, moderate the language in the abstract and conclusion to emphasize assistance in proof completion rather than fully autonomous de-novo verification, and note the need for future work on fresh modules. revision: partial

Circularity Check

0 steps flagged

No significant circularity: purely empirical benchmark evaluation

full rationale

The paper reports experimental results from running LLM-based agents on VeruSAGE-Bench, a benchmark of 849 proof tasks extracted from eight already-Verus-verified open-source Rust systems, plus a small set of unfinished human tasks. No equations, derivations, fitted parameters, or self-referential definitions appear in the abstract or described structure. Claims reduce to direct measurements of completion rates (over 80% on the benchmark, over 90% on unfinished tasks) rather than any prediction that is forced by construction from the inputs. Self-citations, if present, are not load-bearing for a central theorem or uniqueness result. The generalization question (whether performance transfers to fresh unverified code) is an external-validity concern, not an internal circularity in any derivation chain. This matches the default expectation for an empirical systems paper with no mathematical reduction steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Empirical study with light axiomatic load; relies on the soundness of the existing Verus verifier and the representativeness of the selected open-source projects.

axioms (1)
  • domain assumption Verus verifier is sound for the extracted proof tasks
    The benchmark treats successful Verus checks as ground truth for correctness.

pith-pipeline@v0.9.0 · 5521 in / 1236 out tokens · 29006 ms · 2026-05-16T21:01:41.900352+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation

    cs.SE 2026-05 unverdicted novelty 6.0

    VeriContest supplies 946 problems with specs, code, proofs, and tests to benchmark verifiable code generation in Rust/Verus, showing models reach 92% on code but only 5% end-to-end on full verifiable synthesis.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    AlphaVerus: Bootstrapping formally verified code generation through self-improving translation and treefinement.ICML, 2025

    P. Aggarwal, B. Parno, and S. Welleck. Alphaverus: Bootstrapping formally verified code generation through self-improving translation and treefinement.arXiv preprint arXiv:2412.06176, 2024

  2. [2]

    Introducing claude 4: Claude opus 4 and claude sonnet 4

    Anthropic. Introducing claude 4: Claude opus 4 and claude sonnet 4. https://www.anthropic.co m/news/claude-4 , May 2025. Introduces the Claude Sonnet 4 model

  3. [3]

    Introducing claude sonnet 4.5

    Anthropic. Introducing claude sonnet 4.5. https: //www.anthropic.com/news/claude-son net-4-5 , Sept. 2025. Describes the Claude Sonnet 4.5 model

  4. [4]

    Astrauskas, P

    V . Astrauskas, P. Müller, F. Poli, and A. J. Sum- mers. Leveraging rust types for modular specifica- tion and verification.Proc. ACM Program. Lang., 3(OOPSLA):147:1–147:30, 2019

  5. [5]

    Program Synthesis with Large Language Models

    J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V . Le, and C. Sutton. Program synthesis with large language models.CoRR, abs/2108.07732, 2021

  6. [6]

    Y . Cai, P. Singh, Z. Lin, J. Bosamiya, J. Gancher, M. Sur- batovich, and B. Parno. Vest: Verified, secure, high- performance parsing and serialization for Rust. InPro- ceedings of the USENIX Security Symposium, August 2025

  7. [7]

    Chakraborty, G

    S. Chakraborty, G. Ebner, S. Bhat, S. Fakhoury, S. Fa- tima, S. K. Lahiri, and N. Swamy. Towards neural syn- thesis for smt-assisted proof-oriented programming. In IEEE/ACM 47th International Conference on Software Engineering (ICSE), 2025

  8. [8]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, M. Krueger, H. Petrov, I. Khattam, C. Hesse, S. Agar- wal, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, B. Power, L. Kaiser, M. Bavarian, C. King, T. Kerr, S. McCandlish, A. Radford, I. Sutskever, and W. Za...

  9. [9]

    T. Chen, S. Lu, S. Lu, Y . Gong, C. Yang, X. Li, M. R. H. Misu, H. Yu, N. Duan, P. Cheng, F. Yang, S. K. Lahiri, T. Xie, and L. Zhou. Automated proof generation for rust code via self-evolution. InProceedings of the 13th International Conference on Learning Representations (ICLR), 2025

  10. [10]

    X. Chen, Z. Li, J. Zhang, V . Narayanan, and A. Burt- sev. Atmosphere: Practical verified kernels with rust and verus. InProceedings of the ACM Symposium on Operating Systems Principles (SOSP). Association for Computing Machinery, 2025

  11. [11]

    Denis, J

    X. Denis, J. Jourdan, and C. Marché. Creusot: A foundry for the deductive verification of rust programs. InFormal Methods and Software Engineering - 23rd International Conference on Formal Engineering Meth- ods, ICFEM 2022, Madrid, Spain, October 24-27, 2022, Proceedings, volume 13478 ofLecture Notes in Com- puter Science, pages 90–105. Springer, 2022

  12. [12]

    Github copilot cli

    GitHub. Github copilot cli. https://github.com /github/copilot-cli , Sept. 2025. Command- line interface for GitHub Copilot

  13. [13]

    Ho and J

    S. Ho and J. Protzenko. Aeneas: Rust verification by functional translation.Proc. ACM Program. Lang., 6(ICFP):711–741, 2022

  14. [14]

    GPT-4o System Card

    A. Hurst, A. Lerer, and OpenAI. GPT-4o System Card. arXiv preprint arXiv:2410.21276, 2024

  15. [15]

    R. Jain, S. Barke, G. Ebner, M. R. H. Misu, S. Lu, and S. Fakhoury. What’s in a proof? analyzing expert proof-writing processes in f* and verus.arXiv preprint arXiv:2508.02733, 2025

  16. [16]

    Kumar, D

    A. Kumar, D. N. Gadde, K. K. Radhakrishna, and D. Let- tnin. Saarthi: The first ai formal verification engineer. CoRR, abs/2502.16662, 2025

  17. [17]

    Lattuada, T

    A. Lattuada, T. Hance, J. Bosamiya, M. Brun, C. Cho, H. LeBlanc, P. Srinivasan, R. Achermann, T. Chajed, C. Hawblitzel, et al. Verus: A practical foundation for systems verification. InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Princi- ples, pages 438–454, 2024

  18. [18]

    Lattuada, T

    A. Lattuada, T. Hance, C. Cho, M. Brun, I. Subasinghe, Y . Zhou, J. Howell, B. Parno, and C. Hawblitzel. Verus: Verifying rust programs using linear ghost types.Proc. ACM Program. Lang., 7(OOPSLA1):286–315, 2023. 13

  19. [19]

    LeBlanc, J

    H. LeBlanc, J. R. Lorch, C. Hawblitzel, C. Huang, Y . Tao, N. Zeldovich, and V . Chidambaram. Power never corrupts: Tool-agnostic verification of crash con- sistency and corruption detection.OSDI (to appear), 2025

  20. [20]

    Z. Li, J. Sun, L. Murphy, Q. Su, Z. Li, X. Zhang, K. Yang, and X. Si. A survey on deep learning for theorem proving.CoRR, abs/2404.09939, 2024

  21. [21]

    VeriPlan: Integrating formal verification and LLMs into end-user planning,

    Y . Liu and et al. Veriplan: Integrating formal verification and llms into end-user planning.CoRR, abs/2502.17898, 2025

  22. [22]

    J. R. Lorch. Fix usability issue with unused sub- regions library. https://github.com/microsoft/verified- storage/pull/39

  23. [23]

    DafnyBench: A benchmark for formal software verification

    C. Loughridge, Q. Sun, S. Ahrenbach, F. Cassano, C. Sun, Y . Sheng, A. Mudide, M. R. H. Misu, N. Amin, and M. Tegmark. Dafnybench: A benchmark for formal software verification.CoRR, abs/2406.08467, 2024

  24. [24]

    H. Ma, A. Goel, J.-B. Jeannin, M. Kapritsos, B. Kasikci, and K. A. Sakallah. I4: incremental inference of induc- tive invariants for verification of distributed protocols. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP), pages 370–384, 2019

  25. [25]

    M. R. H. Misu, C. V . Lopes, I. Ma, and J. Noble. To- wards ai-assisted synthesis of verified dafny methods. Proc. ACM Softw. Eng., 1(FSE):812–835, 2024

  26. [26]

    Laurel: Unblocking automated verification with large language models.PACMPL (OOPSLA), 2025

    E. Mugnier, E. A. Gonzalez, R. Jhala, N. Polikarpova, and Y . Zhou. Laurel: Generating dafny assertions using large language models.CoRR, abs/2405.16792, 2024

  27. [27]

    OpenAI. Codex. https://openai.com/codex/,

  28. [28]

    OpenAI’s coding agent for software development

  29. [29]

    Gpt-5 system card

    OpenAI. Gpt-5 system card. Technical report, OpenAI, Aug. 2025. Describes the GPT-5 model series

  30. [30]

    Openai o3 and o4-mini system card

    OpenAI. Openai o3 and o4-mini system card. Technical report, OpenAI, Apr. 2025. Describes the OpenAI o4- mini reasoning model

  31. [31]

    C. Sun, Y . Sheng, O. Padon, and C. W. Barrett. Clover: Closed-loop verifiable code generation. InAI Verifica- tion - First International Symposium, SAIV 2024, Mon- treal, QC, Canada, July 22-23, 2024, Proceedings, vol- ume 14846 ofLecture Notes in Computer Science, pages 134–155. Springer, 2024

  32. [32]

    C. Sun, Y . Sun, D. Amrollahi, E. Zhang, S. Lahiri, S. Lu, D. Dill, and C. Barrett. Veristruct: Ai-assisted auto- mated verification of data-structure modules in verus. arXiv preprint arXiv:2510.25015, 2025

  33. [33]

    X. Sun, W. Ma, J. T. Gu, Z. Ma, T. Chajed, J. How- ell, A. Lattuada, O. Padon, L. Suresh, A. Szekeres, and T. Xu. Anvil: Verifying liveness of cluster management controllers. In18th USENIX Symposium on Operat- ing Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024, pages 649–666. USENIX Association, 2024

  34. [34]

    T. K. Team. How open source projects are using kani to write better software in rust. https://aws.amaz on.com/blogs/opensource/how-open-sou rce-projects-are-using-kani-to-write -better-software-in-rust/ , 2024. [Online], [Accessed: 2024-09-01]

  35. [35]

    Ferreira, Sorin Lerner, and Emily First

    K. Thompson, N. Saavedra, P. Carrott, K. Fisher, A. Sanchez-Stern, Y . Brun, J. F. Ferreira, S. Lerner, and E. First. Rango: Adaptive retrieval-augmented prov- ing for automated software verification.arXiv preprint arXiv:2412.14063, 2024

  36. [36]

    R. Tian, Y . Ye, Y . Qin, X. Cong, Y . Lin, Y . Pan, Y . Wu, H. Hui, W. Liu, Z. Liu, and M. Sun. Debugbench: Eval- uating debugging capability of large language models. pages 4173–4198, 2024

  37. [37]

    Point-biserial correlation coef- ficient — Wikipedia, the free encyclopedia, 2025

    Wikipedia contributors. Point-biserial correlation coef- ficient — Wikipedia, the free encyclopedia, 2025. [On- line; accessed 10-December-2025]

  38. [38]

    C. S. Xia, Z. Wang, Y . Yang, Y . Wei, and L. Zhang. Live- swe-agent: Can software engineering agents self-evolve on the fly?arXiv preprint arXiv:2511.13646, 2025

  39. [39]

    C. S. Xia and L. Zhang. Keep the conversation going: Fixing 162 out of 337 bugs for $0.42 each using chatgpt. CoRR, abs/2304.00385, 2023

  40. [40]

    C. Yang, Y . Deng, R. Lu, J. Yao, J. Liu, R. Jabbar- vand, and L. Zhang. Whitefox: White-box compiler fuzzing empowered by large language models.Pro- ceedings of the ACM on Programming Languages, 8(OOPSLA2):709–735, 2024

  41. [41]

    C. Yang, X. Li, M. R. H. Misu, J. Yao, W. Cui, Y . Gong, C. Hawblitzel, S. K. Lahiri, J. R. Lorch, S. Lu, F. Yang, Z. Zhou, and S. Lu. Autoverus: Automated proof gen- eration for rust code.Proceedings of the ACM on Pro- gramming Languages, 9(OOPSLA2), 2025

  42. [42]

    C. Yang, Z. Zhao, Z. Xie, H. Li, and L. Zhang. Knighter: Transforming static analysis with llm-synthesized check- ers. InProceedings of the ACM SIGOPS 31st Sympo- sium on Operating Systems Principles, SOSP ’25, New York, NY , USA, 2025. Association for Computing Ma- chinery. 14

  43. [43]

    C. Yang, Z. Zhao, and L. Zhang. Kernelgpt: Enhanced kernel fuzzing via large language models. InProceed- ings of the 30th ACM International Conference on Ar- chitectural Support for Programming Languages and Operating Systems, Volume 2, pages 560–573, 2025

  44. [44]

    J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. Swe-agent: Agent- computer interfaces enable automated software engi- neering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

  45. [45]

    J. Yao, R. Tao, R. Gu, and J. Nieh. DuoAI: Fast, au- tomated inference of inductive invariants for verifying distributed protocols. InProceedings of the USENIX Symposium on Operating Systems Design and Imple- mentation (OSDI), 2022

  46. [46]

    J. Yao, R. Tao, R. Gu, J. Nieh, S. S. Jana, and G. Ryan. DistAI: Data-driven automated invariant learning for distributed protocols. InProceedings of the USENIX Symposium on Operating Systems Design and Imple- mentation (OSDI), 2021

  47. [47]

    Zhang, X

    J. Zhang, X. Xu, Y . Zou, Z. Tang, X. Wan, K. Hu, S. Wang, W. Xu, D. Wang, H. Chen, et al. Cortenmm: Efficient memory management with strong correctness guarantees. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 1082–1098, 2025

  48. [48]

    T. N. Zhang, K. Singh, T. Chajed, M. Kapritsos, and B. Parno. Basilisk: using provenance invariants to auto- mate proofs of undecidable protocols. InProceedings of the 19th USENIX Conference on Operating Systems Design and Implementation (OSDI), 2025

  49. [49]

    S. C. Zhong and X. Si. Towards repository-level program verification with large language models. In Proceedings of the 1st ACM SIGPLAN International Workshop on Language Models and Programming Lan- guages (LMPL), 2025

  50. [50]

    Zhou, Anjali, W

    Z. Zhou, Anjali, W. Chen, S. Gong, C. Hawblitzel, and W. Cui. Verismo: A verified security module for con- fidential vms. In18th USENIX Symposium on Operat- ing Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024, pages 599–614. USENIX Association, 2024. 15