pith. machine review for the scientific record.

arxiv: 2604.11556 · v1 · submitted 2026-04-13 · 💻 cs.SE · cs.AI


FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning


Pith reviewed 2026-05-10 15:32 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords formal verification · LLM-assisted reasoning · Hoare logic · bug detection · compositional reasoning · software correctness · large-scale systems · natural language specifications

The pith

FM-Agent automates compositional reasoning for large systems by using LLMs to generate natural-language function specifications from caller expectations and applying generalized Hoare-style inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents FM-Agent, a framework that automates correctness reasoning for large software systems. It uses large language models to generate natural-language function specifications top-down from what callers expect, so the specifications capture intended behavior even when the implementation is flawed. The system then applies a generalized form of Hoare-style inference directly to these natural-language specifications rather than to formal formulas, and generates test cases to confirm potential bugs. The approach matters because it found 522 new bugs in systems their developers had already tested, including bugs that cause crashes or incorrect results, pointing to a path toward reliability for complex code without the usual manual specification burden.
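The three-stage loop just described can be sketched in a few lines. This is a hypothetical rendering, not FM-Agent's actual code: the names are ours, and the steps a real system would delegate to an LLM are replaced by trivial string logic.

```python
from dataclasses import dataclass

@dataclass
class Spec:
    function: str
    precondition: str   # natural-language {P}
    postcondition: str  # natural-language {Q}

def derive_spec(function: str, caller_expectations: list[str]) -> Spec:
    # Stage 1: top-down spec generation. A real system would prompt an LLM
    # with the callers' code; here we just join the callers' expectations.
    return Spec(function, "inputs the callers actually pass",
                "; ".join(caller_expectations))

def consistent(spec: Spec, observed_behavior: str) -> bool:
    # Stage 2: generalized Hoare-style check. A real system would ask an LLM
    # whether the implementation satisfies {P} f {Q}; a substring test stands in.
    return observed_behavior in spec.postcondition

def make_test(spec: Spec) -> str:
    # Stage 3: render a test-case stub for a reported violation.
    return f"call {spec.function} with {spec.precondition}; expect: {spec.postcondition}"

spec = derive_spec("from_keyword", ["returns Some(kind) only for known C keywords"])
assert consistent(spec, "returns Some(kind) only for known C keywords")
assert not consistent(spec, "returns Some(kind) for any identifier")  # flagged as a bug
```

The point of the sketch is the data flow, not the checks themselves: specifications stay in natural language end to end, and only the final stage produces executable artifacts.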

Core claim

FM-Agent is the first framework for automated compositional reasoning on large-scale systems. It automatically generates function-level specifications using LLMs in a top-down fashion derived from caller expectations, which reflect developer intent regardless of implementation bugs. It generalizes Hoare-style inference to operate on natural-language specifications and generates test cases to confirm and explain bugs. Evaluation shows it can analyze systems with up to 143k lines of code in two days and uncover 522 previously unknown bugs in already-tested code.

What carries the argument

The top-down LLM-driven generation of natural-language function specifications based on caller expectations, paired with generalized Hoare-style inference that reasons over these specifications to detect inconsistencies.
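The top-down derivation can be pictured as a walk over the call graph in which each function's spec is assembled from the already-derived specs of all its callers (the F_i ⊢ {P} F_j {Q} edges of the paper's Figure 5). A minimal sketch with invented function names and spec strings; a real system would have an LLM do the merging.

```python
from collections import defaultdict

def derive_specs_top_down(calls: dict[str, list[str]],
                          root_specs: dict[str, str]) -> dict[str, str]:
    """calls maps each function to the functions it invokes, with keys listed
    in topological order (callers before callees); root_specs gives specs for
    entry points. Every other function's spec is the merge of what each caller
    expects of it (a stand-in for LLM-based merging)."""
    expectations: dict[str, list[str]] = defaultdict(list)
    specs = dict(root_specs)
    for caller in calls:
        # finalize this function's spec from its accumulated caller expectations
        if caller not in specs:
            specs[caller] = " AND ".join(expectations[caller])
        for callee in calls[caller]:
            expectations[callee].append(f"expected by {caller}: {specs[caller]}")
    # leaves that never appear as callers still need finalizing
    for fn, exp in expectations.items():
        specs.setdefault(fn, " AND ".join(exp))
    return specs

# Invented three-level call chain: main -> tokenize -> from_keyword
specs = derive_specs_top_down(
    {"main": ["tokenize"], "tokenize": ["from_keyword"]},
    {"main": "compile the input program correctly"},
)
assert "expected by tokenize" in specs["from_keyword"]
```

Note how a function invoked by multiple callers accumulates one expectation per caller before they are merged, which is what lets the derived spec reflect intent rather than the (possibly buggy) implementation.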

If this is right

  • Systems as large as 143k lines of code can be fully reasoned about compositionally in a matter of days.
  • New bugs can be discovered even in code that has undergone developer testing.
  • Test cases are produced automatically to demonstrate and diagnose the bugs.
  • Bug reports include explanations of causes based on the reasoning against caller expectations.
  • The method applies to LLM-generated code where full formal specs would be impractical to write manually.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If reliable at scale, this could be integrated into continuous integration pipelines for early detection of issues in large projects.
  • Similar top-down derivation might extend to other verification techniques beyond Hoare logic for broader program analysis.
  • Testing on distributed or safety-critical systems could reveal whether the approach generalizes beyond the evaluated codebases.
  • Success here indicates natural language can practically bridge the gap between informal developer intent and formal checking.

Load-bearing premise

LLMs can produce natural-language specifications that correctly capture intended behavior from caller code even when the implementation is incorrect, and the generalized inference rules applied to those specifications yield accurate bug detections.

What would settle it

Independent manual review of a random sample of the reported bugs to confirm they are genuine defects causing crashes or incorrect results rather than false positives from imprecise specifications.

Figures

Figures reproduced from arXiv: 2604.11556 by Haibo Chen, Haoran Ding, Zhaoguo Wang.

Figure 1
Figure 1. An example of a buggy Rust function simplified from a C compiler [3] developed by Anthropic. Pre-condition: "s" is a valid UTF-8 string; "gnu_extensions" is a boolean indicating whether GNU C extension keywords should be recognized. Post-condition: returns Some(TokenKind::T) if and only if "s" exactly matches a known C/GCC keyword; returns None if "s" does not match any recognized keyword, including…
Figure 2
Figure 2. Specifications generated for the from_keyword function.
Figure 3
Figure 3. A test case generated by Claude Code to trigger the bug.
Figure 4
Figure 4. The workflow of FM-Agent.
Figure 5
Figure 5. An example of the top-down paradigm for specification generation. Each directed edge from function F_i to function F_j indicates that F_i invokes F_j. The specification of F_j is generated based on the expected specifications from all its callers F_i, denoted F_i ⊢ {P} F_j {Q} on the edge.
Figure 6
Figure 6. The reasoning process of the function from_keyword.
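The pre- and post-conditions quoted in Figure 1's caption can be turned into a small executable model. This is a toy Python rendering of the described contract, not the paper's Rust code; the keyword sets are abbreviated and illustrative.

```python
# Abbreviated stand-ins for the real keyword tables.
C_KEYWORDS = {"int", "return", "while", "if"}
GNU_KEYWORDS = {"__typeof__", "__asm__"}

def from_keyword(s: str, gnu_extensions: bool):
    """Pre: s is a valid string; gnu_extensions says whether GNU C extension
    keywords are recognized.
    Post: returns the matched keyword (standing in for Some(TokenKind::T))
    iff s exactly matches a known keyword; otherwise None."""
    if s in C_KEYWORDS:
        return s
    if gnu_extensions and s in GNU_KEYWORDS:
        return s
    return None

# The post-condition, spelled out as checks:
assert from_keyword("while", False) == "while"     # exact match -> Some(...)
assert from_keyword("__asm__", True) == "__asm__"  # GNU keyword, extensions on
assert from_keyword("__asm__", False) is None      # GNU keyword, extensions off
assert from_keyword("whilee", True) is None        # non-keyword -> None
```

The "if and only if" in the post-condition is what the paper's test-generation stage exploits: any input whose observed result violates either direction of the biconditional is a concrete bug witness.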
Original abstract

LLM-assisted software development has become increasingly prevalent, and can generate large-scale systems, such as compilers. It becomes crucial to strengthen the correctness of the generated code. However, automated reasoning for large-scale systems remains challenging due to code complexity. Hoare logic offers an approach to decomposing a large system into smaller components and reasoning about them separately (i.e., compositional reasoning). However, existing works still struggle to scale, because Hoare logic requires writing formal specifications for each function, imposing a heavy human burden. The problem is exacerbated when code is generated by LLMs, as developers lack a deep understanding of each function's expected behavior. This paper presents FM-Agent, the first framework that realizes automated compositional reasoning for large-scale systems. Leveraging LLMs, FM-Agent introduces a top-down paradigm to automatically generate function-level specifications. Specifically, FM-Agent derives the specification of a function from how its callers expect the function to behave, so the generated specifications can reflect the developer's intent of a function even if the implementation is buggy. Developers' intent is usually expressed in natural language, while existing verifiers only support formulas. Therefore, FM-Agent generalizes Hoare-style inference to reason about functions against natural-language specifications. Finally, to confirm bug existence and explain bug causes, FM-Agent automatically generates test cases to trigger potential bugs. In our evaluation, FM-Agent successfully reasons about large-scale systems within 2 days, each of which has up to 143k LoC. These systems have already been tested by their developers, but FM-Agent still finds 522 newly discovered bugs. These bugs can cause serious consequences, including system crashes and incorrect execution results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces FM-Agent, a framework that uses LLMs for automated compositional reasoning on large-scale systems. It employs a top-down approach to derive natural-language function specifications from caller expectations (to capture intent even for buggy implementations), generalizes Hoare-style inference to operate over these NL specs, and generates test cases to confirm and explain bugs. The central empirical claim is that the system scales to codebases of up to 143k LoC, completing reasoning in under 2 days and discovering 522 previously unknown bugs in systems already tested by their developers.

Significance. If the reported bugs can be shown to be reliable (low false-positive rate, validated against ground truth, and not artifacts of LLM imprecision), the top-down NL-spec generation paradigm would meaningfully reduce the human burden of formal specification writing and extend compositional verification to LLM-generated systems. The scaling result to 143k LoC is notable for an empirical tool-building effort, but the absence of soundness arguments or quantitative validation metrics limits the immediate contribution to the field.

major comments (3)
  1. [Abstract and §5 (Evaluation)] The headline claim of discovering 522 new bugs is load-bearing for the paper's contribution, yet the manuscript provides no quantitative false-positive rate, no manual validation protocol or sample size for the 522 reports, and no comparison against baselines such as existing static analyzers or other LLM-based bug-finding tools. This makes it impossible to assess whether the violations are actual defects or LLM artifacts.
  2. [§3.2 (Generalized Hoare-style inference)] The core technical step generalizes Hoare logic to natural-language pre/post-conditions without supplying a formal semantics for the NL assertions or a soundness argument for the inference rules. Because both spec generation and violation detection are LLM-mediated, reported bugs could arise from imprecise language, hallucinated expectations, or incorrect caller-derived specs rather than implementation errors; an empirical soundness check (e.g., on a small verified corpus) is required.
  3. [§4.1 (Top-down spec derivation)] The claim that specifications derived from caller expectations accurately reflect developer intent even when the callee implementation is buggy is central to the method, but the paper does not report an ablation or validation experiment measuring how often the generated NL specs diverge from actual developer intent or how such divergence affects downstream bug reports.
minor comments (2)
  1. [Abstract] The abstract states that reasoning completes 'within 2 days' but does not break down wall-clock time or token usage across spec generation, inference, and test-case synthesis, which would help readers gauge practicality.
  2. [§3] Notation for the generalized Hoare triples (e.g., how NL pre- and post-conditions are represented and passed to the LLM) is introduced informally; a small illustrative example with explicit prompt templates would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and commit to revisions that strengthen the empirical validation and clarity of the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §5 (Evaluation)] The headline claim of discovering 522 new bugs is load-bearing for the paper's contribution, yet the manuscript provides no quantitative false-positive rate, no manual validation protocol or sample size for the 522 reports, and no comparison against baselines such as existing static analyzers or other LLM-based bug-finding tools. This makes it impossible to assess whether the violations are actual defects or LLM artifacts.

    Authors: We agree that additional details on bug validation are necessary to substantiate the 522 reports. In the revised manuscript we will expand §5 with a dedicated subsection describing the manual validation protocol, the number of bugs inspected, the observed false-positive rate, and comparisons against relevant static-analysis and LLM-based baselines. revision: yes

  2. Referee: [§3.2 (Generalized Hoare-style inference)] The core technical step generalizes Hoare logic to natural-language pre/post-conditions without supplying a formal semantics for the NL assertions or a soundness argument for the inference rules. Because both spec generation and violation detection are LLM-mediated, reported bugs could arise from imprecise language, hallucinated expectations, or incorrect caller-derived specs rather than implementation errors; an empirical soundness check (e.g., on a small verified corpus) is required.

    Authors: The natural-language formulation is a deliberate choice to handle large LLM-generated codebases where formal specifications are unavailable. While a full formal semantics is outside the scope of this work, we will add an empirical soundness evaluation on a small verified corpus to demonstrate the practical reliability of the generalized inference rules. revision: yes

  3. Referee: [§4.1 (Top-down spec derivation)] The claim that specifications derived from caller expectations accurately reflect developer intent even when the callee implementation is buggy is central to the method, but the paper does not report an ablation or validation experiment measuring how often the generated NL specs diverge from actual developer intent or how such divergence affects downstream bug reports.

    Authors: We will incorporate an ablation study that quantifies divergence between caller-derived NL specifications and developer intent (via documentation review on a representative subset) and measures the downstream effect on bug-detection precision. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical tool evaluation with no derivations or self-referential reductions

Full rationale

The paper describes an LLM-based framework for generating natural-language function specifications top-down from caller expectations and performing generalized Hoare-style inference over them, followed by test-case generation for bug confirmation. No equations, fitted parameters, uniqueness theorems, or ansatzes are present. The central results (successful reasoning on up to 143k LoC systems and discovery of 522 bugs) are empirical outcomes from running the tool on existing codebases, not predictions or derivations that reduce to the method's own inputs by construction. Self-citations, if any, are not load-bearing for any formal claim. The work is self-contained as an engineering and evaluation effort.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on unstated assumptions about LLM reliability for specification inference and the soundness of the natural-language Hoare generalization; these are domain assumptions rather than derived results.

axioms (2)
  • domain assumption LLMs can derive accurate function specifications from caller expectations even when the implementation is incorrect
    This is the core of the top-down paradigm described in the abstract.
  • domain assumption Hoare-style reasoning can be generalized to operate correctly over natural-language specifications
    Required for the automated inference step.

pith-pipeline@v0.9.0 · 5605 in / 1466 out tokens · 58126 ms · 2026-05-10T15:32:55.295800+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AccelSync: Verifying Synchronization Coverage in Accelerator Pipeline Programs

    cs.AR 2026-05 conditional novelty 7.0

    AccelSync reduces synchronization correctness in accelerator pipelines to a decidable barrier-sufficiency check with O(|E|^2) complexity, proves soundness and completeness, and detects real hazards in thousands of pro...

Reference graph

Works this paper leans on

64 extracted references · 18 canonical work pages · cited by 1 Pith paper

  1. [1]

    Anthropic. 2026. Claude Code. https://claude.com/product/claude-code

  2. [2]

    Anthropic. 2026. Claude Sonnet 4.6. https://www.anthropic.com/claude/sonnet

  3. [3]

    Anthropics. 2026. CCC — Claude's C Compiler. https://github.com/anthropics/claudes-c-compiler/

  4. [4]

    Anysphere. 2026. Cursor. https://cursor.com/

  5. [5]

    Cristian Cadar, Daniel Dunbar, and Dawson Engler. 2008. KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, 209–224.

  6. [6]

    Tej Chajed, M. Frans Kaashoek, Butler Lampson, and Nickolai Zeldovich. 2018. Verifying Concurrent Software Using Movers in CSPEC. In 13th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Carlsbad, CA, 306–322. https://www.usenix.org/conference/osdi18/presentation/chajed

  8. [8]

    Tej Chajed, Joseph Tassarotti, M. Frans Kaashoek, and Nickolai Zeldovich. 2019. Verifying Concurrent, Crash-Safe Systems with Perennial. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. Huntsville, ON, Canada.

  9. [9]

    Tej Chajed, Joseph Tassarotti, Mark Theng, Ralf Jung, M. Frans Kaashoek, and Nickolai Zeldovich. 2021. GoJournal: A Verified, Concurrent, Crash-safe Journaling System. In 15th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, 423–439. https://www.usenix.org/conference/osdi21/presentation/chajed

  10. [10]

    Yun-Sheng Chang, Ralf Jung, Upamanyu Sharma, Joseph Tassarotti, M. Frans Kaashoek, and Nickolai Zeldovich. 2023. Verifying vMVCC, a high-performance transaction library using multi-version concurrency control. In Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI '23).

  11. [11]

    Haogang Chen, Tej Chajed, Alex Konradi, Stephanie Wang, Atalay İleri, Adam Chlipala, M. Frans Kaashoek, and Nickolai Zeldovich. 2017. Verifying a High-Performance Crash-Safe File System Using a Tree Specification. In Proceedings of the 26th Symposium on Operating Systems Principles (Shanghai, China). Association for Computing Machinery, New York...

  12. [12]

    Haogang Chen, Daniel Ziegler, Tej Chajed, Adam Chlipala, M. Frans Kaashoek, and Nickolai Zeldovich. 2015. Using Crash Hoare Logic for Certifying the FSCQ File System. In Proceedings of the 25th Symposium on Operating Systems Principles (Monterey, California). Association for Computing Machinery, New York, NY, USA, 18–37.

  13. [13]

    Tianyu Chen, Shuai Lu, Shan Lu, Yeyun Gong, Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Hao Yu, Nan Duan, Peng Cheng, Fan Yang, Shuvendu Lahiri, Tao Xie, and Lidong Zhou. 2024. Automated Proof Generation for Rust Code via Self-Evolution. In ICLR 2025. https://www.microsoft.com/en-us/research/publication/automated-proof-generation-for-rust-code-via-se...

  14. [14]

    Xiangdong Chen, Zhaofeng Li, Jerry Zhang, Vikram Narayanan, and Anton Burtsev. 2025. Atmosphere: Practical Verified Kernels with Rust and Verus. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (Lotte Hotel World, Seoul, Republic of Korea) (SOSP '25). Association for Computing Machinery, New York, NY, USA, 752–767. doi:10.1145/...

  15. [15]

    cloc Development Team. 2026. cloc - Count Lines of Code. https://github.com/aldanial/cloc

  16. [16]

    Andrew Ferraiuolo, Andrew Baumann, Chris Hawblitzel, and Bryan Parno. 2017. Komodo: Using verification to disentangle secure-enclave hardware from software. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP).

  17. [17]

    GitHub. 2026. GitHub Copilot. https://github.com/features/copilot

  18. [18]

    Ronghui Gu, Zhong Shao, Hao Chen, Xiongnan (Newman) Wu, Jieung Kim, Vilhelm Sjöberg, and David Costanzo. 2016. CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels. In 12th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Savannah, GA, 653–669. https://www.usenix.org/conference/osdi16/te...

  19. [19]

    Travis Hance, Andrea Lattuada, Chris Hawblitzel, Jon Howell, Rob Johnson, and Bryan Parno. 2020. Storage Systems are Distributed Systems (So Verify Them That Way!). In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI).

  20. [20]

    Travis Hance, Yi Zhou, Andrea Lattuada, Reto Achermann, Alex Conway, Ryan Stutsman, Gerd Zellweger, Chris Hawblitzel, Jon Howell, and Bryan Parno. 2023. Sharding the State Machine: Automated Modular Reasoning for Complex Concurrent Systems. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI).

  21. [21]

    Chris Hawblitzel, Jon Howell, Manos Kapritsos, Jacob R. Lorch, Bryan Parno, Michael L. Roberts, Srinath Setty, and Brian Zill. 2015. IronFleet: Proving Practical Distributed Systems Correct. In Proceedings of the 25th Symposium on Operating Systems Principles (Monterey, California). Association for Computing Machinery, New York, NY, USA, 1–17. doi:10.1145/2...

  22. [22]

    C. A. R. Hoare. 1969. An axiomatic basis for computer programming. Commun. ACM 12, 10 (Oct. 1969), 576–580. doi:10.1145/363235.363259

  23. [23]

    Kaanse. 2026. VibeOS. https://github.com/kaansenol5/VibeOS/tree/main

  24. [24]

    Gerwin Klein. 2009. Operating System Verification—An Overview. Sadhana 34 (Feb 2009), 27–69. doi:10.1007/s12046-009-0002-4

  25. [25]

    Shuvendu K. Lahiri. 2024. Evaluating LLM-driven User-Intent Formalization for Verification-Aware Languages. In 2024 Formal Methods in Computer-Aided Design (FMCAD). 142–147. doi:10.34727/2024/isbn.978-3-85448-065-5_19

  26. [26]

    Andrea Lattuada, Travis Hance, Jay Bosamiya, Matthias Brun, Chanhee Cho, Hayley LeBlanc, Pranav Srinivasan, Reto Achermann, Tej Chajed, Chris Hawblitzel, Jon Howell, Jacob R. Lorch, Oded Padon, and Bryan Parno. 2024. Verus: A Practical Foundation for Systems Verification. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles (Aust...

  27. [27]

    Andrea Lattuada, Travis Hance, Chanhee Cho, Matthias Brun, Isitha Subasinghe, Yi Zhou, Jon Howell, Bryan Parno, and Chris Hawblitzel. Verus: Verifying Rust Programs using Linear Ghost Types. In Proceedings of the ACM on Programming Languages (OOPSLA). ACM.

  29. [29]

    K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. In Proceedings of the 16th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning. Springer, 348–370.

  30. [30]

    Xupeng Li, Xuheng Li, Christoffer Dall, Ronghui Gu, Jason Nieh, Yousuf Sait, and Gareth Stockwell. 2022. Design and Verification of the Arm Confidential Compute Architecture. In 16th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Carlsbad, CA, 465–484. https://www.usenix.org/conference/osdi22/presentation/li

  31. [31]

    Xupeng Li, Xuheng Li, Wei Qiang, Ronghui Gu, and Jason Nieh. 2023. Spoq: Scaling Machine-Checkable Systems Verification in Coq. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). USENIX Association, Boston, MA, 851–869. https://www.usenix.org/conference/osdi23/presentation/li-xupeng

  32. [32]

    Hongjin Liang and Xinyu Feng. 2013. Modular Verification of Linearizability with Non-Fixed Linearization Points. SIGPLAN Not. 48, 6 (June 2013), 459–470. doi:10.1145/2499370.2462189

  33. [33]

    Hongjin Liang, Xinyu Feng, and Ming Fu. 2012. A Rely-Guarantee-Based Simulation for Verifying Concurrent Program Transformations. In Proceedings of the 39th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Philadelphia, PA, USA). Association for Computing Machinery, New York, NY, USA, 455–468.

  34. [34]

    Qingyuan Liu, Mo Zou, Hengbin Zhang, Dong Du, Yubin Xia, and Haibo Chen. 2026. Sharpen the Spec, Cut the Code: A Case for Generative File System with SYSSPEC. In 24th USENIX Conference on File and Storage Technologies (FAST 26). 291–311.

  35. [35]

    Jacob R. Lorch, Yixuan Chen, Manos Kapritsos, Bryan Parno, Shaz Qadeer, Upamanyu Sharma, James R. Wilcox, and Xueyuan Zhao. Armada: Low-Effort Verification of High-Performance Concurrent Programs. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (London, UK). Association for Computing Machinery, New York, NY, USA, 197–210.

  37. [37]

    Microsoft. 2026. VerusBench. https://github.com/microsoft/verus-proof-synthesis/tree/main/benchmarks/VerusBench

  38. [38]

    Eric Mugnier, Emmanuel Anaya Gonzalez, Nadia Polikarpova, Ranjit Jhala, and Zhou Yuanyuan. 2025. Laurel: Unblocking Automated Verification with Large Language Models. Proc. ACM Program. Lang. 9, OOPSLA1, Article 134 (April 2025), 27 pages. doi:10.1145/3720499

  39. [39]

    Luke Nelson, James Bornholt, Ronghui Gu, Andrew Baumann, Emina Torlak, and Xi Wang. 2019. Scaling Symbolic Evaluation for Automated Verification of Systems Code with Serval. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada). Association for Computing Machinery, New York, NY, USA, 225–242.

  40. [40]

    Luke Nelson, Helgi Sigurbjarnarson, Kaiyuan Zhang, Dylan Johnson, James Bornholt, Emina Torlak, and Xi Wang. 2017. Hyperkernel: Push-Button Verification of an OS Kernel. In Proceedings of the 26th Symposium on Operating Systems Principles (Shanghai, China). Association for Computing Machinery, New York, NY, USA, 252–269. doi:10.1145/3132747.3132748

  41. [41]

    Luke Nelson, Jacob Van Geffen, Emina Torlak, and Xi Wang. 2020. Specification and Verification in the Field: Applying Formal Methods to BPF Just-in-Time Compilers in the Linux Kernel. USENIX Association, USA.

  42. [42]

    NVIDIA. 2026. VibeTensor. https://github.com/NVlabs/vibetensor

  43. [43]

    Peter W. O'Hearn. 2007. Resources, Concurrency, and Local Reasoning. Theor. Comput. Sci. 375, 1-3 (April 2007), 271–307. doi:10.1016/j.tcs.2006.12.035

  44. [44]

    OpenAI. 2026. OpenAI Codex. https://openai.com/codex/

  45. [45]

    Jianxing Qin, Alexander Du, Danfeng Zhang, Matthew Lentz, and Danyang Zhuo. 2025. Can Large Language Models Verify System Software? A Case Study Using FSCQ as a Benchmark. In Proceedings of the 2025 Workshop on Hot Topics in Operating Systems (Banff, AB, Canada) (HotOS '25). Association for Computing Machinery, New York, NY, USA, 34–41. doi:10.1145/3713082.3730382

  46. [46]

    Ralf Jung, David Swasey, Filip Sieczkowski, Kasper Svendsen, Aaron Turon, Lars Birkedal, and Derek Dreyer. 2015. Iris: Monoids and Invariants as an Orthogonal Basis for Concurrent Reasoning. In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Mumbai, India). Association for Computing Machinery, New York, ...

  47. [47]

    Upamanyu Sharma, Ralf Jung, Joseph Tassarotti, Frans Kaashoek, and Nickolai Zeldovich. 2023. Grove: a Separation-Logic Library for Verifying Distributed Systems. In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP 2023). Koblenz, Germany, 113–129.

  48. [48]

    Helgi Sigurbjarnarson, James Bornholt, Emina Torlak, and Xi Wang. 2016. Push-Button Verification of File Systems via Crash Refinement. In 12th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Savannah, GA, 1–16. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/sigurbjarnarson

  50. [50]

    Helgi Sigurbjarnarson, Luke Nelson, Bruno Castro-Karney, James Bornholt, Emina Torlak, and Xi Wang. 2018. Nickel: A Framework for Design and Verification of Information Flow Control Systems. In 13th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Carlsbad, CA, 287–305. http://www.usenix.org/conference/osdi18/presentat...

  51. [51]

    Xudong Sun, Wenjie Ma, Jiawei Tyler Gu, Zicheng Ma, Tej Chajed, Jon Howell, Andrea Lattuada, Oded Padon, Lalith Suresh, Adriana Szekeres, and Tianyin Xu. 2024. Anvil: Verifying Liveness of Cluster Management Controllers. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 649–666. https://w...

  52. [52]

    Runzhou Tao, Jianan Yao, Xupeng Li, Shih-Wei Li, Jason Nieh, and Ronghui Gu. 2021. Formal Verification of a Multiprocessor Hypervisor on Arm Relaxed Memory Hardware. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (Virtual Event, Germany). Association for Computing Machinery, New York, NY, USA, 866–881. doi:10.1145/3477132.3483560

  53. [53]

    Coq Development Team. 2024. The Coq Proof Assistant. https://coq.inria.fr/

  54. [54]

    OpenCode Development Team. 2026. The open source AI coding agent. https://opencode.ai/

  55. [55]

    OpenRouter Development Team. 2026. OpenRouter. https://openrouter.ai/

  56. [56]

    Johannes Wehrstein, Timo Eckmann, Matthias Jasny, and Carsten Binnig. 2025. Bespoke OLAP: Synthesizing Workload-Specific One-Size-Fits-One Database Engines. arXiv preprint arXiv:2603.02001 (2025). https://arxiv.org/abs/2603.02001

  57. [57]

    Johannes Wehrstein, Timo Eckmann, Matthias Jasny, and Carsten Binnig. 2026. Bespoke-OLAP. https://github.com/DataManagementLab/BespokeOLAP

  58. [58]

    Bing Xu, Terry Chen, Fengzhe Zhou, Tianqi Chen, Yangqing Jia, Vinod Grover, Haicheng Wu, Wei Liu, Craig Wittenbrink, Wen-mei Hwu, Roger Bringmann, Ming-Yu Liu, Luis Ceze, Michael Lightstone, and Humphrey Shi. 2026. VibeTensor: System Software for Deep Learning, Fully Generated by AI Agents. arXiv:2601.16238

  59. [59]

    Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Jianan Yao, Weidong Cui, Yeyun Gong, Chris Hawblitzel, Shuvendu Lahiri, Jacob R. Lorch, Shuai Lu, Fan Yang, Ziqiao Zhou, and Shan Lu. 2025. AutoVerus: Automated Proof Generation for Rust Code. Proc. ACM Program. Lang. 9, OOPSLA2, Article 396 (Oct. 2025), 29 pages. doi:10.1145/3763174

  60. [60]

    Arseniy Zaostrovnykh, Solal Pirelli, Rishabh Iyer, Matteo Rizzo, Luis Pedrosa, Katerina Argyraki, and George Candea. 2019. Verifying software network functions with no verification expertise. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP '19). Association for Computing Machinery, New York, NY, ...

  61. [61]

    Zihao Zhang, Ti Zhou, Christa Jenkins, Omar Chowdhury, and Shuai Mu. 2025. AutoMan: Facilitating Verified Distributed Systems Development Through Automatic Code Generation and Manual Optimizations. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (Lotte Hotel World, Seoul, Republic of Korea) (SOSP '25). Association for Comp...

  62. [62]

    Ziqiao Zhou, Anjali, Weiteng Chen, Sishuai Gong, Chris Hawblitzel, and Weidong Cui. 2024. VeriSMo: A Verified Security Module for Confidential VMs. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 599–614. https://www.usenix.org/conference/osdi24/presentation/zhou

  63. [63]

    Mo Zou, Haoran Ding, Dong Du, Ming Fu, Ronghui Gu, and Haibo Chen. 2019. Using Concurrent Relational Logic with Helpers for Verifying the AtomFS File System. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada). Association for Computing Machinery, New York, NY, USA, 259–274.

  64. [64]

    Mo Zou, Dong Du, Mingkai Dong, and Haibo Chen. 2024. Using Dynamically Layered Definite Releases for Verifying the RefFS File System. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 629–648. https://www.usenix.org/conference/osdi24/presentation/zou