pith. machine review for the scientific record.

arxiv: 2604.11556 · v1 · submitted 2026-04-13 · 💻 cs.SE · cs.AI


FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning


Pith reviewed 2026-05-10 15:32 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords formal verification · LLM-assisted reasoning · Hoare logic · bug detection · compositional reasoning · software correctness · large-scale systems · natural language specifications

The pith

FM-Agent automates compositional reasoning for large systems by using LLMs to generate natural-language function specifications from caller expectations and applying generalized Hoare-style inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents FM-Agent, a framework that automates correctness reasoning for large software systems. It uses large language models to generate natural-language function specifications top-down from what callers expect, so the specifications capture intended behavior even when the implementation is flawed. The system then applies a generalized form of Hoare-style inference directly to these natural-language specifications rather than to formal formulas, and generates test cases to confirm potential bugs. The approach matters because it found 522 new bugs in systems their developers had already tested, including bugs that cause crashes or incorrect results, pointing to a path toward reliability for complex code without the usual manual specification burden.
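The three-stage loop just described can be sketched in a few lines. This is a hypothetical rendering, not FM-Agent's actual code: the names are ours, and the steps a real system would delegate to an LLM are replaced by trivial string logic.

```python
from dataclasses import dataclass

@dataclass
class Spec:
    function: str
    precondition: str   # natural-language {P}
    postcondition: str  # natural-language {Q}

def derive_spec(function: str, caller_expectations: list[str]) -> Spec:
    # Stage 1: top-down spec generation. A real system would prompt an LLM
    # with the callers' code; here we just join the callers' expectations.
    return Spec(function, "inputs the callers actually pass",
                "; ".join(caller_expectations))

def consistent(spec: Spec, observed_behavior: str) -> bool:
    # Stage 2: generalized Hoare-style check. A real system would ask an LLM
    # whether the implementation satisfies {P} f {Q}; a substring test stands in.
    return observed_behavior in spec.postcondition

def make_test(spec: Spec) -> str:
    # Stage 3: render a test-case stub for a reported violation.
    return f"call {spec.function} with {spec.precondition}; expect: {spec.postcondition}"

spec = derive_spec("from_keyword", ["returns Some(kind) only for known C keywords"])
assert consistent(spec, "returns Some(kind) only for known C keywords")
assert not consistent(spec, "returns Some(kind) for any identifier")  # flagged as a bug
```

The point of the sketch is the data flow, not the checks themselves: specifications stay in natural language end to end, and only the final stage produces executable artifacts.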

Core claim

FM-Agent is the first framework for automated compositional reasoning on large-scale systems. It automatically generates function-level specifications using LLMs in a top-down fashion derived from caller expectations, which reflect developer intent regardless of implementation bugs. It generalizes Hoare-style inference to operate on natural-language specifications and generates test cases to confirm and explain bugs. Evaluation shows it can analyze systems with up to 143k lines of code in two days and uncover 522 previously unknown bugs in already-tested code.

What carries the argument

The top-down LLM-driven generation of natural-language function specifications based on caller expectations, paired with generalized Hoare-style inference that reasons over these specifications to detect inconsistencies.
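The top-down derivation can be pictured as a walk over the call graph in which each function's spec is assembled from the already-derived specs of all its callers (the F_i ⊢ {P} F_j {Q} edges of the paper's Figure 5). A minimal sketch with invented function names and spec strings; a real system would have an LLM do the merging.

```python
from collections import defaultdict

def derive_specs_top_down(calls: dict[str, list[str]],
                          root_specs: dict[str, str]) -> dict[str, str]:
    """calls maps each function to the functions it invokes, with keys listed
    in topological order (callers before callees); root_specs gives specs for
    entry points. Every other function's spec is the merge of what each caller
    expects of it (a stand-in for LLM-based merging)."""
    expectations: dict[str, list[str]] = defaultdict(list)
    specs = dict(root_specs)
    for caller in calls:
        # finalize this function's spec from its accumulated caller expectations
        if caller not in specs:
            specs[caller] = " AND ".join(expectations[caller])
        for callee in calls[caller]:
            expectations[callee].append(f"expected by {caller}: {specs[caller]}")
    # leaves that never appear as callers still need finalizing
    for fn, exp in expectations.items():
        specs.setdefault(fn, " AND ".join(exp))
    return specs

# Invented three-level call chain: main -> tokenize -> from_keyword
specs = derive_specs_top_down(
    {"main": ["tokenize"], "tokenize": ["from_keyword"]},
    {"main": "compile the input program correctly"},
)
assert "expected by tokenize" in specs["from_keyword"]
```

Note how a function invoked by multiple callers accumulates one expectation per caller before they are merged, which is what lets the derived spec reflect intent rather than the (possibly buggy) implementation.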

If this is right

  • Systems as large as 143k lines of code can be fully reasoned about compositionally in a matter of days.
  • New bugs can be discovered even in code that has undergone developer testing.
  • Test cases are produced automatically to demonstrate and diagnose the bugs.
  • Bug reports include explanations of causes based on the reasoning against caller expectations.
  • The method applies to LLM-generated code where full formal specs would be impractical to write manually.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If reliable at scale, this could be integrated into continuous integration pipelines for early detection of issues in large projects.
  • Similar top-down derivation might extend to other verification techniques beyond Hoare logic for broader program analysis.
  • Testing on distributed or safety-critical systems could reveal whether the approach generalizes beyond the evaluated codebases.
  • Success here indicates natural language can practically bridge the gap between informal developer intent and formal checking.

Load-bearing premise

LLMs can produce natural-language specifications that correctly capture intended behavior from caller code even when the implementation is incorrect, and the generalized inference rules applied to those specifications yield accurate bug detections.

What would settle it

Independent manual review of a random sample of the reported bugs to confirm they are genuine defects causing crashes or incorrect results rather than false positives from imprecise specifications.

Figures

Figures reproduced from arXiv: 2604.11556 by Haibo Chen, Haoran Ding, Zhaoguo Wang.

Figure 1
Figure 1. An example of a buggy Rust function simplified from a C compiler [3] developed by Anthropic. Pre-condition: "s" is a valid UTF-8 string; "gnu_extensions" is a boolean indicating whether GNU C extension keywords should be recognized. Post-condition: returns Some(TokenKind::T) if and only if "s" exactly matches a known C/GCC keyword; returns None if "s" does not match any recognized keyword, including…
Figure 2
Figure 2. Specifications generated for the from_keyword function.
Figure 3
Figure 3. A test case generated by Claude Code to trigger the bug.
Figure 4
Figure 4. The workflow of FM-Agent.
Figure 5
Figure 5. An example of the top-down paradigm for specification generation. Each directed edge from function F_i to function F_j indicates that F_i invokes F_j. The specification of F_j is generated based on the expected specifications from all its callers F_i, denoted F_i ⊢ {P} F_j {Q} on the edge.
Figure 6
Figure 6. The reasoning process of the function from_keyword.
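The pre- and post-conditions quoted in Figure 1's caption can be turned into a small executable model. This is a toy Python rendering of the described contract, not the paper's Rust code; the keyword sets are abbreviated and illustrative.

```python
# Abbreviated stand-ins for the real keyword tables.
C_KEYWORDS = {"int", "return", "while", "if"}
GNU_KEYWORDS = {"__typeof__", "__asm__"}

def from_keyword(s: str, gnu_extensions: bool):
    """Pre: s is a valid string; gnu_extensions says whether GNU C extension
    keywords are recognized.
    Post: returns the matched keyword (standing in for Some(TokenKind::T))
    iff s exactly matches a known keyword; otherwise None."""
    if s in C_KEYWORDS:
        return s
    if gnu_extensions and s in GNU_KEYWORDS:
        return s
    return None

# The post-condition, spelled out as checks:
assert from_keyword("while", False) == "while"     # exact match -> Some(...)
assert from_keyword("__asm__", True) == "__asm__"  # GNU keyword, extensions on
assert from_keyword("__asm__", False) is None      # GNU keyword, extensions off
assert from_keyword("whilee", True) is None        # non-keyword -> None
```

The "if and only if" in the post-condition is what the paper's test-generation stage exploits: any input whose observed result violates either direction of the biconditional is a concrete bug witness.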
Original abstract

LLM-assisted software development has become increasingly prevalent, and can generate large-scale systems, such as compilers. It becomes crucial to strengthen the correctness of the generated code. However, automated reasoning for large-scale systems remains challenging due to code complexity. Hoare logic offers an approach to decomposing a large system into smaller components and reasoning about them separately (i.e., compositional reasoning). However, existing works still struggle to scale, because Hoare logic requires writing formal specifications for each function, imposing a heavy human burden. The problem is exacerbated when code is generated by LLMs, as developers lack a deep understanding of each function's expected behavior. This paper presents FM-Agent, the first framework that realizes automated compositional reasoning for large-scale systems. Leveraging LLMs, FM-Agent introduces a top-down paradigm to automatically generate function-level specifications. Specifically, FM-Agent derives the specification of a function from how its callers expect the function to behave, so the generated specifications can reflect the developer's intent of a function even if the implementation is buggy. Developers' intent is usually expressed in natural language, while existing verifiers only support formulas. Therefore, FM-Agent generalizes Hoare-style inference to reason about functions against natural-language specifications. Finally, to confirm bug existence and explain bug causes, FM-Agent automatically generates test cases to trigger potential bugs. In our evaluation, FM-Agent successfully reasons about large-scale systems within 2 days, each of which has up to 143k LoC. These systems have already been tested by their developers, but FM-Agent still finds 522 newly discovered bugs. These bugs can cause serious consequences, including system crashes and incorrect execution results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces FM-Agent, a framework that uses LLMs for automated compositional reasoning on large-scale systems. It employs a top-down approach to derive natural-language function specifications from caller expectations (to capture intent even for buggy implementations), generalizes Hoare-style inference to operate over these NL specs, and generates test cases to confirm and explain bugs. The central empirical claim is that the system scales to codebases of up to 143k LoC, completing reasoning in under 2 days and discovering 522 previously unknown bugs in systems already tested by their developers.

Significance. If the reported bugs can be shown to be reliable (low false-positive rate, validated against ground truth, and not artifacts of LLM imprecision), the top-down NL-spec generation paradigm would meaningfully reduce the human burden of formal specification writing and extend compositional verification to LLM-generated systems. The scaling result to 143k LoC is notable for an empirical tool-building effort, but the absence of soundness arguments or quantitative validation metrics limits the immediate contribution to the field.

major comments (3)
  1. [Abstract and §5 (Evaluation)] The headline claim of discovering 522 new bugs is load-bearing for the paper's contribution, yet the manuscript provides no quantitative false-positive rate, no manual validation protocol or sample size for the 522 reports, and no comparison against baselines such as existing static analyzers or other LLM-based bug-finding tools. This makes it impossible to assess whether the violations are actual defects or LLM artifacts.
  2. [§3.2 (Generalized Hoare-style inference)] The core technical step generalizes Hoare logic to natural-language pre/post-conditions without supplying a formal semantics for the NL assertions or a soundness argument for the inference rules. Because both spec generation and violation detection are LLM-mediated, reported bugs could arise from imprecise language, hallucinated expectations, or incorrect caller-derived specs rather than implementation errors; an empirical soundness check (e.g., on a small verified corpus) is required.
  3. [§4.1 (Top-down spec derivation)] The claim that specifications derived from caller expectations accurately reflect developer intent even when the callee implementation is buggy is central to the method, but the paper does not report an ablation or validation experiment measuring how often the generated NL specs diverge from actual developer intent or how such divergence affects downstream bug reports.
minor comments (2)
  1. [Abstract] The abstract states that reasoning completes 'within 2 days' but does not break down wall-clock time or token usage across spec generation, inference, and test-case synthesis, which would help readers gauge practicality.
  2. [§3] Notation for the generalized Hoare triples (e.g., how NL pre- and post-conditions are represented and passed to the LLM) is introduced informally; a small illustrative example with explicit prompt templates would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and commit to revisions that strengthen the empirical validation and clarity of the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §5 (Evaluation)] The headline claim of discovering 522 new bugs is load-bearing for the paper's contribution, yet the manuscript provides no quantitative false-positive rate, no manual validation protocol or sample size for the 522 reports, and no comparison against baselines such as existing static analyzers or other LLM-based bug-finding tools. This makes it impossible to assess whether the violations are actual defects or LLM artifacts.

    Authors: We agree that additional details on bug validation are necessary to substantiate the 522 reports. In the revised manuscript we will expand §5 with a dedicated subsection describing the manual validation protocol, the number of bugs inspected, the observed false-positive rate, and comparisons against relevant static-analysis and LLM-based baselines. revision: yes

  2. Referee: [§3.2 (Generalized Hoare-style inference)] The core technical step generalizes Hoare logic to natural-language pre/post-conditions without supplying a formal semantics for the NL assertions or a soundness argument for the inference rules. Because both spec generation and violation detection are LLM-mediated, reported bugs could arise from imprecise language, hallucinated expectations, or incorrect caller-derived specs rather than implementation errors; an empirical soundness check (e.g., on a small verified corpus) is required.

    Authors: The natural-language formulation is a deliberate choice to handle large LLM-generated codebases where formal specifications are unavailable. While a full formal semantics is outside the scope of this work, we will add an empirical soundness evaluation on a small verified corpus to demonstrate the practical reliability of the generalized inference rules. revision: yes

  3. Referee: [§4.1 (Top-down spec derivation)] The claim that specifications derived from caller expectations accurately reflect developer intent even when the callee implementation is buggy is central to the method, but the paper does not report an ablation or validation experiment measuring how often the generated NL specs diverge from actual developer intent or how such divergence affects downstream bug reports.

    Authors: We will incorporate an ablation study that quantifies divergence between caller-derived NL specifications and developer intent (via documentation review on a representative subset) and measures the downstream effect on bug-detection precision. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical tool evaluation with no derivations or self-referential reductions

Full rationale

The paper describes an LLM-based framework for generating natural-language function specifications top-down from caller expectations and performing generalized Hoare-style inference over them, followed by test-case generation for bug confirmation. No equations, fitted parameters, uniqueness theorems, or ansatzes are present. The central results (successful reasoning on up to 143k LoC systems and discovery of 522 bugs) are empirical outcomes from running the tool on existing codebases, not predictions or derivations that reduce to the method's own inputs by construction. Self-citations, if any, are not load-bearing for any formal claim. The work is self-contained as an engineering and evaluation effort.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on unstated assumptions about LLM reliability for specification inference and the soundness of the natural-language Hoare generalization; these are domain assumptions rather than derived results.

axioms (2)
  • domain assumption LLMs can derive accurate function specifications from caller expectations even when the implementation is incorrect
    This is the core of the top-down paradigm described in the abstract.
  • domain assumption Hoare-style reasoning can be generalized to operate correctly over natural-language specifications
    Required for the automated inference step.

pith-pipeline@v0.9.0 · 5605 in / 1466 out tokens · 58126 ms · 2026-05-10T15:32:55.295800+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AccelSync: Verifying Synchronization Coverage in Accelerator Pipeline Programs

    cs.AR 2026-05 conditional novelty 7.0

    AccelSync reduces synchronization correctness in accelerator pipelines to a decidable barrier-sufficiency check with O(|E|^2) complexity, proves soundness and completeness, and detects real hazards in thousands of pro...

Reference graph

Works this paper leans on

64 extracted references · 18 canonical work pages · cited by 1 Pith paper

  1. [1]

    Anthropic. 2026. Claude Code. https://claude.com/product/claude-code

  2. [2]

    Anthropic. 2026. Claude Sonnet 4.6. https://www.anthropic.com/claude/sonnet

  3. [3]

    Anthropics. 2026. CCC — Claude's C Compiler. https://github.com/anthropics/claudes-c-compiler/

  4. [4]

    Anysphere. 2026. Cursor. https://cursor.com/

  5. [5]

    Cristian Cadar, Daniel Dunbar, and Dawson Engler. 2008. KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, 209–224.

  6. [6]

    Tej Chajed, M. Frans Kaashoek, Butler Lampson, and Nickolai Zeldovich. 2018. Verifying Concurrent Software Using Movers in CSPEC. In 13th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Carlsbad, CA, 306–322. https://www.usenix.org/conference/osdi18/presentation/chajed

  8. [8]

    Tej Chajed, Joseph Tassarotti, M. Frans Kaashoek, and Nickolai Zeldovich. 2019. Verifying Concurrent, Crash-Safe Systems with Perennial. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. Huntsville, ON, Canada.

  9. [9]

    Tej Chajed, Joseph Tassarotti, Mark Theng, Ralf Jung, M. Frans Kaashoek, and Nickolai Zeldovich. 2021. GoJournal: A Verified, Concurrent, Crash-safe Journaling System. In 15th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, 423–439. https://www.usenix.org/conference/osdi21/presentation/chajed

  10. [10]

    Yun-Sheng Chang, Ralf Jung, Upamanyu Sharma, Joseph Tassarotti, M. Frans Kaashoek, and Nickolai Zeldovich. 2023. Verifying vMVCC, a high-performance transaction library using multi-version concurrency control. In Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI '23).

  11. [11]

    Haogang Chen, Tej Chajed, Alex Konradi, Stephanie Wang, Atalay İleri, Adam Chlipala, M. Frans Kaashoek, and Nickolai Zeldovich. 2017. Verifying a High-Performance Crash-Safe File System Using a Tree Specification. In Proceedings of the 26th Symposium on Operating Systems Principles (Shanghai, China). Association for Computing Machinery, New York...

  12. [12]

    Haogang Chen, Daniel Ziegler, Tej Chajed, Adam Chlipala, M. Frans Kaashoek, and Nickolai Zeldovich. 2015. Using Crash Hoare Logic for Certifying the FSCQ File System. In Proceedings of the 25th Symposium on Operating Systems Principles (Monterey, California). Association for Computing Machinery, New York, NY, USA, 18–37.

  13. [13]

    Tianyu Chen, Shuai Lu, Shan Lu, Yeyun Gong, Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Hao Yu, Nan Duan, Peng Cheng, Fan Yang, Shuvendu Lahiri, Tao Xie, and Lidong Zhou. 2024. Automated Proof Generation for Rust Code via Self-Evolution. In ICLR 2025. https://www.microsoft.com/en-us/research/publication/automated-proof-generation-for-rust-code-via-se...

  14. [14]

    Xiangdong Chen, Zhaofeng Li, Jerry Zhang, Vikram Narayanan, and Anton Burtsev. 2025. Atmosphere: Practical Verified Kernels with Rust and Verus. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (Lotte Hotel World, Seoul, Republic of Korea) (SOSP '25). Association for Computing Machinery, New York, NY, USA, 752–767. doi:10.1145/...

  15. [15]

    cloc Development Team. 2026. cloc - Count Lines of Code. https://github.com/aldanial/cloc

  16. [16]

    Andrew Ferraiuolo, Andrew Baumann, Chris Hawblitzel, and Bryan Parno. 2017. Komodo: Using verification to disentangle secure-enclave hardware from software. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP).

  17. [17]

    GitHub. 2026. GitHub Copilot. https://github.com/features/copilot

  18. [18]

    Ronghui Gu, Zhong Shao, Hao Chen, Xiongnan (Newman) Wu, Jieung Kim, Vilhelm Sjöberg, and David Costanzo. 2016. CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels. In 12th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Savannah, GA, 653–669. https://www.usenix.org/conference/osdi16/te...

  19. [19]

    Travis Hance, Andrea Lattuada, Chris Hawblitzel, Jon Howell, Rob Johnson, and Bryan Parno. 2020. Storage Systems are Distributed Systems (So Verify Them That Way!). In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI).

  20. [20]

    Travis Hance, Yi Zhou, Andrea Lattuada, Reto Achermann, Alex Conway, Ryan Stutsman, Gerd Zellweger, Chris Hawblitzel, Jon Howell, and Bryan Parno. 2023. Sharding the State Machine: Automated Modular Reasoning for Complex Concurrent Systems. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI).

  21. [21]

    Chris Hawblitzel, Jon Howell, Manos Kapritsos, Jacob R. Lorch, Bryan Parno, Michael L. Roberts, Srinath Setty, and Brian Zill. 2015. IronFleet: Proving Practical Distributed Systems Correct. In Proceedings of the 25th Symposium on Operating Systems Principles (Monterey, California). Association for Computing Machinery, New York, NY, USA, 1–17. doi:10.1145/2...

  22. [22]

    C. A. R. Hoare. 1969. An axiomatic basis for computer programming. Commun. ACM 12, 10 (Oct. 1969), 576–580. doi:10.1145/363235.363259

  23. [23]

    Kaanse. 2026. VibeOS. https://github.com/kaansenol5/VibeOS/tree/main

  24. [24]

    Gerwin Klein. 2009. Operating System Verification—An Overview. Sadhana 34 (Feb 2009), 27–69. doi:10.1007/s12046-009-0002-4

  25. [25]

    Shuvendu K. Lahiri. 2024. Evaluating LLM-driven User-Intent Formalization for Verification-Aware Languages. In 2024 Formal Methods in Computer-Aided Design (FMCAD). 142–147. doi:10.34727/2024/isbn.978-3-85448-065-5_19

  26. [26]

    Andrea Lattuada, Travis Hance, Jay Bosamiya, Matthias Brun, Chanhee Cho, Hayley LeBlanc, Pranav Srinivasan, Reto Achermann, Tej Chajed, Chris Hawblitzel, Jon Howell, Jacob R. Lorch, Oded Padon, and Bryan Parno. 2024. Verus: A Practical Foundation for Systems Verification. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles (Aust...

  27. [27]

    Andrea Lattuada, Travis Hance, Chanhee Cho, Matthias Brun, Isitha Subasinghe, Yi Zhou, Jon Howell, Bryan Parno, and Chris Hawblitzel. Verus: Verifying Rust Programs using Linear Ghost Types. In Proceedings of the ACM on Programming Languages (OOPSLA). ACM.

  29. [29]

    K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. In Proceedings of the 16th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning. Springer, 348–370.

  30. [30]

    Xupeng Li, Xuheng Li, Christoffer Dall, Ronghui Gu, Jason Nieh, Yousuf Sait, and Gareth Stockwell. 2022. Design and Verification of the Arm Confidential Compute Architecture. In 16th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Carlsbad, CA, 465–484. https://www.usenix.org/conference/osdi22/presentation/li

  31. [31]

    Xupeng Li, Xuheng Li, Wei Qiang, Ronghui Gu, and Jason Nieh. 2023. Spoq: Scaling Machine-Checkable Systems Verification in Coq. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). USENIX Association, Boston, MA, 851–869. https://www.usenix.org/conference/osdi23/presentation/li-xupeng

  32. [32]

    Hongjin Liang and Xinyu Feng. 2013. Modular Verification of Linearizability with Non-Fixed Linearization Points. SIGPLAN Not. 48, 6 (June 2013), 459–470. doi:10.1145/2499370.2462189

  33. [33]

    Hongjin Liang, Xinyu Feng, and Ming Fu. 2012. A Rely-Guarantee-Based Simulation for Verifying Concurrent Program Transformations. In Proceedings of the 39th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Philadelphia, PA, USA). Association for Computing Machinery, New York, NY, USA, 455–468.

  34. [34]

    Qingyuan Liu, Mo Zou, Hengbin Zhang, Dong Du, Yubin Xia, and Haibo Chen. 2026. Sharpen the Spec, Cut the Code: A Case for Generative File System with SYSSPEC. In 24th USENIX Conference on File and Storage Technologies (FAST 26). 291–311.

  35. [35]

    Jacob R. Lorch, Yixuan Chen, Manos Kapritsos, Bryan Parno, Shaz Qadeer, Upamanyu Sharma, James R. Wilcox, and Xueyuan Zhao. Armada: Low-Effort Verification of High-Performance Concurrent Programs. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (London, UK). Association for Computing Machinery, New York, NY, USA, 197–210.

  37. [37]

    Microsoft. 2026. VerusBench. https://github.com/microsoft/verus-proof-synthesis/tree/main/benchmarks/VerusBench

  38. [38]

    Eric Mugnier, Emmanuel Anaya Gonzalez, Nadia Polikarpova, Ranjit Jhala, and Zhou Yuanyuan. 2025. Laurel: Unblocking Automated Verification with Large Language Models. Proc. ACM Program. Lang. 9, OOPSLA1, Article 134 (April 2025), 27 pages. doi:10.1145/3720499

  39. [39]

    Luke Nelson, James Bornholt, Ronghui Gu, Andrew Baumann, Emina Torlak, and Xi Wang. 2019. Scaling Symbolic Evaluation for Automated Verification of Systems Code with Serval. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada). Association for Computing Machinery, New York, NY, USA, 225–242.

  40. [40]

    Luke Nelson, Helgi Sigurbjarnarson, Kaiyuan Zhang, Dylan Johnson, James Bornholt, Emina Torlak, and Xi Wang. 2017. Hyperkernel: Push-Button Verification of an OS Kernel. In Proceedings of the 26th Symposium on Operating Systems Principles (Shanghai, China). Association for Computing Machinery, New York, NY, USA, 252–269. doi:10.1145/3132747.3132748

  41. [41]

    Luke Nelson, Jacob Van Geffen, Emina Torlak, and Xi Wang. 2020. Specification and Verification in the Field: Applying Formal Methods to BPF Just-in-Time Compilers in the Linux Kernel. USENIX Association, USA.

  42. [42]

    NVIDIA. 2026. VibeTensor. https://github.com/NVlabs/vibetensor

  43. [43]

    Peter W. O'Hearn. 2007. Resources, Concurrency, and Local Reasoning. Theor. Comput. Sci. 375, 1-3 (April 2007), 271–307. doi:10.1016/j.tcs.2006.12.035

  44. [44]

    OpenAI. 2026. OpenAI Codex. https://openai.com/codex/

  45. [45]

    Jianxing Qin, Alexander Du, Danfeng Zhang, Matthew Lentz, and Danyang Zhuo. 2025. Can Large Language Models Verify System Software? A Case Study Using FSCQ as a Benchmark. In Proceedings of the 2025 Workshop on Hot Topics in Operating Systems (Banff, AB, Canada) (HotOS '25). Association for Computing Machinery, New York, NY, USA, 34–41. doi:10.1145/3713082.3730382

  46. [46]

    Ralf Jung, David Swasey, Filip Sieczkowski, Kasper Svendsen, Aaron Turon, Lars Birkedal, and Derek Dreyer. 2015. Iris: Monoids and Invariants as an Orthogonal Basis for Concurrent Reasoning. In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Mumbai, India). Association for Computing Machinery, New York, ...

  47. [47]

    Upamanyu Sharma, Ralf Jung, Joseph Tassarotti, Frans Kaashoek, and Nickolai Zeldovich. 2023. Grove: a Separation-Logic Library for Verifying Distributed Systems. In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP 2023). Koblenz, Germany, 113–129.

  48. [48]

    Helgi Sigurbjarnarson, James Bornholt, Emina Torlak, and Xi Wang. 2016. Push-Button Verification of File Systems via Crash Refinement. In 12th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Savannah, GA, 1–16. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/sigurbjarnarson

  50. [50]

    Helgi Sigurbjarnarson, Luke Nelson, Bruno Castro-Karney, James Bornholt, Emina Torlak, and Xi Wang. 2018. Nickel: A Framework for Design and Verification of Information Flow Control Systems. In 13th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Carlsbad, CA, 287–305. http://www.usenix.org/conference/osdi18/presentat...

  51. [51]

    Xudong Sun, Wenjie Ma, Jiawei Tyler Gu, Zicheng Ma, Tej Chajed, Jon Howell, Andrea Lattuada, Oded Padon, Lalith Suresh, Adriana Szekeres, and Tianyin Xu. 2024. Anvil: Verifying Liveness of Cluster Management Controllers. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 649–666. https://w...

  52. [52]

    Runzhou Tao, Jianan Yao, Xupeng Li, Shih-Wei Li, Jason Nieh, and Ronghui Gu. 2021. Formal Verification of a Multiprocessor Hypervisor on Arm Relaxed Memory Hardware. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (Virtual Event, Germany). Association for Computing Machinery, New York, NY, USA, 866–881. doi:10.1145/3477132.3483560

  53. [53]

    Coq Development Team. 2024. The Coq Proof Assistant. https://coq.inria.fr/

  54. [54]

    OpenCode Development Team. 2026. The open source AI coding agent. https://opencode.ai/

  55. [55]

    OpenRouter Development Team. 2026. OpenRouter. https://openrouter.ai/

  56. [56]

    Johannes Wehrstein, Timo Eckmann, Matthias Jasny, and Carsten Binnig. 2025. Bespoke OLAP: Synthesizing Workload-Specific One-Size-Fits-One Database Engines. arXiv preprint arXiv:2603.02001 (2025). https://arxiv.org/abs/2603.02001

  57. [57]

    Johannes Wehrstein, Timo Eckmann, Matthias Jasny, and Carsten Binnig. 2026. Bespoke-OLAP. https://github.com/DataManagementLab/BespokeOLAP

  58. [58]

    Bing Xu, Terry Chen, Fengzhe Zhou, Tianqi Chen, Yangqing Jia, Vinod Grover, Haicheng Wu, Wei Liu, Craig Wittenbrink, Wen-mei Hwu, Roger Bringmann, Ming-Yu Liu, Luis Ceze, Michael Lightstone, and Humphrey Shi. 2026. VibeTensor: System Software for Deep Learning, Fully Generated by AI Agents. arXiv:2601.16238

  59. [59]

    Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Jianan Yao, Weidong Cui, Yeyun Gong, Chris Hawblitzel, Shuvendu Lahiri, Jacob R. Lorch, Shuai Lu, Fan Yang, Ziqiao Zhou, and Shan Lu. 2025. AutoVerus: Automated Proof Generation for Rust Code. Proc. ACM Program. Lang. 9, OOPSLA2, Article 396 (Oct. 2025), 29 pages. doi:10.1145/3763174

  60. [60]

    Arseniy Zaostrovnykh, Solal Pirelli, Rishabh Iyer, Matteo Rizzo, Luis Pedrosa, Katerina Argyraki, and George Candea. 2019. Verifying software network functions with no verification expertise. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP '19). Association for Computing Machinery, New York, NY, ...

  61. [61]

    Zihao Zhang, Ti Zhou, Christa Jenkins, Omar Chowdhury, and Shuai Mu. 2025. AutoMan: Facilitating Verified Distributed Systems Development Through Automatic Code Generation and Manual Optimizations. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (Lotte Hotel World, Seoul, Republic of Korea) (SOSP '25). Association for Comp...

  62. [62]

    Ziqiao Zhou, Anjali, Weiteng Chen, Sishuai Gong, Chris Hawblitzel, and Weidong Cui. 2024. VeriSMo: A Verified Security Module for Confidential VMs. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 599–614. https://www.usenix.org/conference/osdi24/presentation/zhou

  63. [63]

    Mo Zou, Haoran Ding, Dong Du, Ming Fu, Ronghui Gu, and Haibo Chen. 2019. Using Concurrent Relational Logic with Helpers for Verifying the AtomFS File System. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada). Association for Computing Machinery, New York, NY, USA, 259–274.

  64. [64]

    Mo Zou, Dong Du, Mingkai Dong, and Haibo Chen. 2024. Using Dynamically Layered Definite Releases for Verifying the RefFS File System. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 629–648. https://www.usenix.org/conference/osdi24/presentation/zou