Recognition: unknown
FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning
Pith reviewed 2026-05-10 15:32 UTC · model grok-4.3
The pith
FM-Agent automates compositional reasoning for large systems by using LLMs to generate natural-language function specifications from caller expectations and applying generalized Hoare-style inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FM-Agent is the first framework for automated compositional reasoning on large-scale systems. It automatically generates function-level specifications using LLMs in a top-down fashion derived from caller expectations, which reflect developer intent regardless of implementation bugs. It generalizes Hoare-style inference to operate on natural-language specifications and generates test cases to confirm and explain bugs. Evaluation shows it can analyze systems with up to 143k lines of code in two days and uncover 522 previously unknown bugs in already-tested code.
What carries the argument
The top-down LLM-driven generation of natural-language function specifications based on caller expectations, paired with generalized Hoare-style inference that reasons over these specifications to detect inconsistencies.
If this is right
- Systems as large as 143k lines of code can be fully reasoned about compositionally in a matter of days.
- New bugs can be discovered even in code that has undergone developer testing.
- Test cases are produced automatically to demonstrate and diagnose the bugs.
- Bug reports include explanations of causes based on the reasoning against caller expectations.
- The method applies to LLM-generated code where full formal specs would be impractical to write manually.
Where Pith is reading between the lines
- If reliable at scale, this could be integrated into continuous integration pipelines for early detection of issues in large projects.
- Similar top-down derivation might extend to other verification techniques beyond Hoare logic for broader program analysis.
- Testing on distributed or safety-critical systems could reveal whether the approach generalizes beyond the evaluated codebases.
- Success here indicates natural language can practically bridge the gap between informal developer intent and formal checking.
Load-bearing premise
LLMs are able to produce natural-language specifications that correctly capture the intended behavior from caller code, even when the actual implementation is incorrect, and that the generalized inference rules applied to these specs yield accurate bug detections.
What would settle it
Independent manual review of a random sample of the reported bugs to confirm they are genuine defects causing crashes or incorrect results rather than false positives from imprecise specifications.
Figures
read the original abstract
LLM-assisted software development has become increasingly prevalent, and can generate large-scale systems, such as compilers. It becomes crucial to strengthen the correctness of the generated code. However, automated reasoning for large-scale systems remains challenging due to code complexity. Hoare logic offers an approach to decomposing a large system into smaller components and reasoning about them separately (i.e., compositional reasoning). However, existing works still struggle to scale, because Hoare logic requires writing formal specifications for each function, imposing a heavy human burden. The problem is exacerbated when code is generated by LLMs, as developers lack a deep understanding of each function's expected behavior. This paper presents FM-Agent, the first framework that realizes automated compositional reasoning for large-scale systems. Leveraging LLMs, FM-Agent introduces a top-down paradigm to automatically generate function-level specifications. Specifically, FM-Agent derives the specification of a function from how its callers expect the function to behave, so the generated specifications can reflect the developer's intent of a function even if the implementation is buggy. Developers' intent is usually expressed in natural language, while existing verifiers only support formulas. Therefore, FM-Agent generalizes Hoare-style inference to reason about functions against natural-language specifications. Finally, to confirm bug existence and explain bug causes, FM-Agent automatically generates test cases to trigger potential bugs. In our evaluation, FM-Agent successfully reasons about large-scale systems within 2 days, each of which has up to 143k LoC. These systems have already been tested by their developers, but FM-Agent still finds 522 newly discovered bugs. These bugs can cause serious consequences, including system crashes and incorrect execution results.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FM-Agent, a framework that uses LLMs for automated compositional reasoning on large-scale systems. It employs a top-down approach to derive natural-language function specifications from caller expectations (to capture intent even for buggy implementations), generalizes Hoare-style inference to operate over these NL specs, and generates test cases to confirm and explain bugs. The central empirical claim is that the system scales to codebases of up to 143k LoC, completing reasoning in under 2 days and discovering 522 previously unknown bugs in systems already tested by their developers.
Significance. If the reported bugs can be shown to be reliable (low false-positive rate, validated against ground truth, and not artifacts of LLM imprecision), the top-down NL-spec generation paradigm would meaningfully reduce the human burden of formal specification writing and extend compositional verification to LLM-generated systems. The scaling result to 143k LoC is notable for an empirical tool-building effort, but the absence of soundness arguments or quantitative validation metrics limits the immediate contribution to the field.
major comments (3)
- [Abstract and §5] Abstract and §5 (Evaluation): The headline claim of discovering 522 new bugs is load-bearing for the paper's contribution, yet the manuscript provides no quantitative false-positive rate, no manual validation protocol or sample size for the 522 reports, and no comparison against baselines such as existing static analyzers or other LLM-based bug-finding tools. This makes it impossible to assess whether the violations are actual defects or LLM artifacts.
- [§3.2] §3.2 (Generalized Hoare-style inference): The core technical step generalizes Hoare logic to natural-language pre/post-conditions without supplying a formal semantics for the NL assertions or a soundness argument for the inference rules. Because both spec generation and violation detection are LLM-mediated, reported bugs could arise from imprecise language, hallucinated expectations, or incorrect caller-derived specs rather than implementation errors; an empirical soundness check (e.g., on a small verified corpus) is required.
- [§4.1] §4.1 (Top-down spec derivation): The claim that specifications derived from caller expectations accurately reflect developer intent even when the callee implementation is buggy is central to the method, but the paper does not report an ablation or validation experiment measuring how often the generated NL specs diverge from actual developer intent or how such divergence affects downstream bug reports.
minor comments (2)
- [Abstract] The abstract states that reasoning completes 'within 2 days' but does not break down wall-clock time or token usage across spec generation, inference, and test-case synthesis, which would help readers gauge practicality.
- [§3] Notation for the generalized Hoare triples (e.g., how NL pre- and post-conditions are represented and passed to the LLM) is introduced informally; a small illustrative example with explicit prompt templates would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and commit to revisions that strengthen the empirical validation and clarity of the manuscript.
read point-by-point responses
-
Referee: [Abstract and §5] Abstract and §5 (Evaluation): The headline claim of discovering 522 new bugs is load-bearing for the paper's contribution, yet the manuscript provides no quantitative false-positive rate, no manual validation protocol or sample size for the 522 reports, and no comparison against baselines such as existing static analyzers or other LLM-based bug-finding tools. This makes it impossible to assess whether the violations are actual defects or LLM artifacts.
Authors: We agree that additional details on bug validation are necessary to substantiate the 522 reports. In the revised manuscript we will expand §5 with a dedicated subsection describing the manual validation protocol, the number of bugs inspected, the observed false-positive rate, and comparisons against relevant static-analysis and LLM-based baselines. revision: yes
-
Referee: [§3.2] §3.2 (Generalized Hoare-style inference): The core technical step generalizes Hoare logic to natural-language pre/post-conditions without supplying a formal semantics for the NL assertions or a soundness argument for the inference rules. Because both spec generation and violation detection are LLM-mediated, reported bugs could arise from imprecise language, hallucinated expectations, or incorrect caller-derived specs rather than implementation errors; an empirical soundness check (e.g., on a small verified corpus) is required.
Authors: The natural-language formulation is a deliberate choice to handle large LLM-generated codebases where formal specifications are unavailable. While a full formal semantics is outside the scope of this work, we will add an empirical soundness evaluation on a small verified corpus to demonstrate the practical reliability of the generalized inference rules. revision: yes
-
Referee: [§4.1] §4.1 (Top-down spec derivation): The claim that specifications derived from caller expectations accurately reflect developer intent even when the callee implementation is buggy is central to the method, but the paper does not report an ablation or validation experiment measuring how often the generated NL specs diverge from actual developer intent or how such divergence affects downstream bug reports.
Authors: We will incorporate an ablation study that quantifies divergence between caller-derived NL specifications and developer intent (via documentation review on a representative subset) and measures the downstream effect on bug-detection precision. revision: yes
Circularity Check
No circularity: empirical tool evaluation with no derivations or self-referential reductions
full rationale
The paper describes an LLM-based framework for generating natural-language function specifications top-down from caller expectations and performing generalized Hoare-style inference over them, followed by test-case generation for bug confirmation. No equations, fitted parameters, uniqueness theorems, or ansatzes are present. The central results (successful reasoning on up to 143k LoC systems and discovery of 522 bugs) are empirical outcomes from running the tool on existing codebases, not predictions or derivations that reduce to the method's own inputs by construction. Self-citations, if any, are not load-bearing for any formal claim. The work is self-contained as an engineering and evaluation effort.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption LLMs can derive accurate function specifications from caller expectations even when the implementation is incorrect
- domain assumption Hoare-style reasoning can be generalized to operate correctly over natural-language specifications
Forward citations
Cited by 1 Pith paper
-
AccelSync: Verifying Synchronization Coverage in Accelerator Pipeline Programs
AccelSync reduces synchronization correctness in accelerator pipelines to a decidable barrier-sufficiency check with O(|E|^2) complexity, proves soundness and completeness, and detects real hazards in thousands of pro...
Reference graph
Works this paper leans on
-
[1]
Anthropic. 2026. Claude Code.https://claude.com/product/claude- code,
2026
-
[2]
Anthropic. 2026. Claude Sonnet 4.6.https://www.anthropic.com/ claude/sonnet,
2026
-
[3]
Anthropics. 2026. CCC — Claude’s C Compiler.https://github.com/ anthropics/claudes-c-compiler/,
2026
-
[4]
Anysphere. 2026. Cursor.https://cursor.com/,
2026
-
[5]
Cristian Cadar, Daniel Dunbar, and Dawson Engler. 2008. KLEE: Unas- sisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. InProceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, 209–224
2008
-
[6]
Tej Chajed, Frans Kaashoek, Butler Lampson, and Nickolai Zeldovich
-
[7]
In13th USENIX Symposium on Operating Systems Design and Implementation
Verifying Concurrent Software Using Movers in CSPEC. In13th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Carlsbad, CA, 306–322.https://www.usenix.org/ conference/osdi18/presentation/chajed
-
[8]
Frans Kaashoek, and Nickolai Zel- dovich
Tej Chajed, Joseph Tassarotti, M. Frans Kaashoek, and Nickolai Zel- dovich. 2019. Verifying Concurrent, Crash-Safe Systems with Peren- nial. InProceedings of the 27th ACM Symposium on Operating Systems Principles. Hunstville, ON, Canada
2019
-
[9]
Frans Kaashoek, and Nickolai Zeldovich
Tej Chajed, Joseph Tassarotti, Mark Theng, Ralf Jung, M. Frans Kaashoek, and Nickolai Zeldovich. 2021. GoJournal: A Verified, Concur- rent, Crash-safe Journaling System. In15th USENIX Symposium on Op- erating Systems Design and Implementation. USENIX Association, 423– 439.https://www.usenix.org/conference/osdi21/presentation/chajed
2021
-
[10]
Frans Kaashoek, and Nickolai Zeldovich
Yun-Sheng Chang, Ralf Jung, Upamanyu Sharma, Joseph Tassarotti, M. Frans Kaashoek, and Nickolai Zeldovich. 2023. Verifying vMVCC, a high-performance transaction library using multi-version concurrency control. InProceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’23)
2023
-
[11]
Frans Kaashoek, and Nickolai Zel- dovich
Haogang Chen, Tej Chajed, Alex Konradi, Stephanie Wang, Atalay undefinedleri, Adam Chlipala, M. Frans Kaashoek, and Nickolai Zel- dovich. 2017. Verifying a High-Performance Crash-Safe File System Using a Tree Specification. InProceedings of the 26th Symposium on Operating Systems Principles(Shanghai, China). Association for Com- puting Machinery, New York...
2017
-
[12]
Frans Kaashoek, and Nickolai Zeldovich
Haogang Chen, Daniel Ziegler, Tej Chajed, Adam Chlipala, M. Frans Kaashoek, and Nickolai Zeldovich. 2015. Using Crash Hoare Logic for Certifying the FSCQ File System. InProceedings of the 25th Symposium on Operating Systems Principles(Monterey, California). Association for Computing Machinery, New York, NY, USA, 18–37
2015
-
[13]
Tianyu Chen, Shuai Lu, Shan Lu, Yeyun Gong, Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Hao Yu, Nan Duan, Peng Cheng, Fan Yang, Shuvendu Lahiri, Tao Xie, and Lidong Zhou. 2024. Automated Proof Generation for Rust Code via Self-Evolution. In ICLR 2025.https://www.microsoft.com/en-us/research/publication/ automated-proof-generation-for-rust-code-via-se...
2024
-
[14]
Xiangdong Chen, Zhaofeng Li, Jerry Zhang, Vikram Narayanan, and Anton Burtsev. 2025. Atmosphere: Practical Verified Kernels with Rust and Verus. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles(Lotte Hotel World, Seoul, Republic of Korea)(SOSP ’25). Association for Computing Machinery, New York, NY, USA, 752–767. doi:10.1145/...
-
[15]
cloc Development Team. 2026. cloc - Count Lines of Code.https: //github.com/aldanial/cloc,
2026
-
[16]
Andrew Ferraiuolo, Andrew Baumann, Chris Hawblitzel, and Bryan Parno. 2017. Komodo: Using verification to disentangle secure-enclave hardware from software. InProceedings of the ACM Symposium on Operating Systems Principles (SOSP)
2017
-
[17]
GitHub. 2026. GitHub Copilot.https://github.com/features/copilot,
2026
-
[18]
Ronghui Gu, Zhong Shao, Hao Chen, Xiongnan (Newman) Wu, Jieung Kim, Vilhelm Sjöberg, and David Costanzo. 2016. CertiKOS: An Ex- tensible Architecture for Building Certified Concurrent OS Kernels. In 12th USENIX Symposium on Operating Systems Design and Implemen- tation. USENIX Association, Savannah, GA, 653–669.https://www. usenix.org/conference/osdi16/te...
2016
-
[19]
Travis Hance, Andrea Lattuada, Chris Hawblitzel, Jon Howell, Rob Johnson, and Bryan Parno. 2020. Storage Systems are Distributed Systems (So Verify Them That Way!). InProceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI)
2020
-
[20]
Travis Hance, Yi Zhou, Andrea Lattuada, Reto Achermann, Alex Con- way, Ryan Stutsman, Gerd Zellweger, Chris Hawblitzel, Jon Howell, and Bryan Parno. 2023. Sharding the State Machine: Automated Mod- ular Reasoning for Complex Concurrent Systems. InProceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI)
2023
-
[21]
Chris Hawblitzel, Jon Howell, Manos Kapritsos, Jacob R. Lorch, Bryan Parno, Michael L. Roberts, Srinath Setty, and Brian Zill. 2015. IronFleet: Proving Practical Distributed Systems Correct. InProceedings of the 25th Symposium on Operating Systems Principles(Monterey, California). Association for Computing Machinery, New York, NY, USA, 1–17. doi:10.1145/2...
-
[22]
C. A. R. Hoare. 1969. An axiomatic basis for computer programming. Commun. ACM12, 10 (Oct. 1969), 576–580. doi:10.1145/363235.363259
-
[23]
Kaanse. 2026. VibeOS.https://github.com/kaansenol5/VibeOS/tree/ main,
2026
-
[24]
Gerwin Klein. 2009. Operating System Verification—An Overview. Sadhana34 (Feb 2009), 27–69. doi:10.1007/s12046-009-0002-4
-
[25]
Shuvendu K. Lahirie. 2024. Evaluating LLM-driven User-Intent For- malization for Verification-Aware Languages. In2024 Formal Methods in Computer-Aided Design (FMCAD). 142–147. doi:10.34727/2024/isbn. 978-3-85448-065-5_19
-
[26]
Andrea Lattuada, Travis Hance, Jay Bosamiya, Matthias Brun, Chanhee Cho, Hayley LeBlanc, Pranav Srinivasan, Reto Achermann, Tej Chajed, Chris Hawblitzel, Jon Howell, Jacob R. Lorch, Oded Padon, and Bryan Parno. 2024. Verus: A Practical Foundation for Systems Verification. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles(Aust...
-
[27]
Andrea Lattuada, Travis Hance, Chanhee Cho, Matthias Brun, Isitha Subasinghe, Yi Zhou, Jon Howell, Bryan Parno, and Chris Hawblitzel
-
[28]
In Proceedings of the ACM on Programming Languages (OOPSLA)
Verus: Verifying Rust Programs using Linear Ghost Types. In Proceedings of the ACM on Programming Languages (OOPSLA). ACM
-
[29]
Rustan M
K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. InProceedings of the 16th International Con- ference on Logic for Programming, Artificial Intelligence, and Reasoning. Springer, 348–370
2010
-
[30]
Xupeng Li, Xuheng Li, Christoffer Dall, Ronghui Gu, Jason Nieh, Yousuf Sait, and Gareth Stockwell. 2022. Design and Verification of the Arm Confidential Compute Architecture. In16th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Carlsbad, CA, 465–484.https://www.usenix.org/conference/osdi22/ presentation/li
2022
-
[31]
Xupeng Li, Xuheng Li, Wei Qiang, Ronghui Gu, and Jason Nieh. 2023. Spoq: Scaling Machine-Checkable Systems Verification in Coq. In17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). USENIX Association, Boston, MA, 851–869.https://www. usenix.org/conference/osdi23/presentation/li-xupeng
2023
-
[32]
Hongjin Liang and Xinyu Feng. 2013. Modular Verification of Lineariz- ability with Non-Fixed Linearization Points.SIGPLAN Not.48, 6 (jun 2013), 459–470. doi:10.1145/2499370.2462189
-
[33]
Hongjin Liang, Xinyu Feng, and Ming Fu. 2012. A Rely-Guarantee- Based Simulation for Verifying Concurrent Program Transformations. InProceedings of the 39th Annual ACM SIGPLAN-SIGACT Symposium 13 on Principles of Programming Languages(Philadelphia, PA, USA). As- sociation for Computing Machinery, New York, NY, USA, 455–468
2012
-
[34]
Qingyuan Liu, Mo Zou, Hengbin Zhang, Dong Du, Yubin Xia, and Haibo Chen. 2026. Sharpen the Spec, Cut the Code: A Case for Gen- erative File System with SYSSPEC. In24th USENIX Conference on File and Storage Technologies (FAST 26). 291–311
2026
-
[35]
Lorch, Yixuan Chen, Manos Kapritsos, Bryan Parno, Shaz Qadeer, Upamanyu Sharma, James R
Jacob R. Lorch, Yixuan Chen, Manos Kapritsos, Bryan Parno, Shaz Qadeer, Upamanyu Sharma, James R. Wilcox, and Xueyuan Zhao
-
[36]
InProceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation(London, UK)
Armada: Low-Effort Verification of High-Performance Concur- rent Programs. InProceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation(London, UK). Association for Computing Machinery, New York, NY, USA, 197–210
-
[37]
Microsoft. 2026. VerusBench.https://github.com/microsoft/verus- proof-synthesis/tree/main/benchmarks/VerusBench,
2026
-
[38]
Eric Mugnier, Emmanuel Anaya Gonzalez, Nadia Polikarpova, Ranjit Jhala, and Zhou Yuanyuan. 2025. Laurel: Unblocking Automated Verification with Large Language Models.Proc. ACM Program. Lang. 9, OOPSLA1, Article 134 (April 2025), 27 pages. doi:10.1145/3720499
-
[39]
Luke Nelson, James Bornholt, Ronghui Gu, Andrew Baumann, Emina Torlak, and Xi Wang. 2019. Scaling Symbolic Evaluation for Automated Verification of Systems Code with Serval. InProceedings of the 27th ACM Symposium on Operating Systems Principles(Huntsville, Ontario, Canada). Association for Computing Machinery, New York, NY, USA, 225–242
2019
-
[40]
Luke Nelson, Helgi Sigurbjarnarson, Kaiyuan Zhang, Dylan Johnson, James Bornholt, Emina Torlak, and Xi Wang. 2017. Hyperkernel: Push-Button Verification of an OS Kernel. InProceedings of the 26th Symposium on Operating Systems Principles(Shanghai, China). As- sociation for Computing Machinery, New York, NY, USA, 252–269. doi:10.1145/3132747.3132748
-
[41]
Luke Nelson, Jacob Van Geffen, Emina Torlak, and Xi Wang. 2020. Specification and Verification in the Field: Applying Formal Methods to BPF Just-in-Time Compilers in the Linux Kernel. USENIX Association, USA
2020
-
[42]
NVIDIA. 2026. VibeTensor.https://github.com/NVlabs/vibetensor,
2026
-
[43]
Peter W. OHearn. 2007. Resources, Concurrency, and Local Reasoning. Theor. Comput. Sci.375, 1-3 (apr 2007), 271–307. doi:10.1016/j.tcs.2006 .12.035
-
[44]
OpenAI. 2026. OpenAI Codex.https://openai.com/codex/,
2026
-
[45]
Jianxing Qin, Alexander Du, Danfeng Zhang, Matthew Lentz, and Danyang Zhuo. 2025. Can Large Language Models Verify System Software? A Case Study Using FSCQ as a Benchmark. InProceedings of the 2025 Workshop on Hot Topics in Operating Systems(Banff, AB, Canada)(HotOS ’25). Association for Computing Machinery, New York, NY, USA, 34–41. doi:10.1145/3713082.3730382
-
[46]
Jung Ralf, Swasey David, Sieczkowski Filip, Svendsen Kasper, Turon Aaron, Birkedal Lars, and Dreyer Derek. 2015. Iris: Monoids and Invari- ants as an Orthogonal Basis for Concurrent Reasoning. InProceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages(Mumbai, India). Association for Computing Machinery, New York, ...
-
[47]
Upamanyu Sharma, Ralf Jung, Joseph Tassarotti, Frans Kaashoek, and Nickolai Zeldovich. 2023. Grove: a Separation-Logic Library for Verify- ing Distributed Systems. InProceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP 2023). Koblenz, Germany, 113–129
2023
-
[48]
Helgi Sigurbjarnarson, James Bornholt, Emina Torlak, and Xi Wang
-
[49]
In12th USENIX Symposium on Operating Systems De- sign and Implementation
Push-Button Verification of File Systems via Crash Re- finement. In12th USENIX Symposium on Operating Systems De- sign and Implementation. USENIX Association, Savannah, GA, 1– 16.https://www.usenix.org/conference/osdi16/technical-sessions/ presentation/sigurbjarnarson
-
[50]
Helgi Sigurbjarnarson, Luke Nelson, Bruno Castro-Karney, James Bornholt, Emina Torlak, and Xi Wang. 2018. Nickel: A Framework for Design and Verification of Information Flow Control Systems. In13th USENIX Symposium on Operating Systems Design and Im- plementation. USENIX Association, Carlsbad, CA, 287–305.http: //www.usenix.org/conference/osdi18/presentat...
2018
-
[51]
Xudong Sun, Wenjie Ma, Jiawei Tyler Gu, Zicheng Ma, Tej Chajed, Jon Howell, Andrea Lattuada, Oded Padon, Lalith Suresh, Adriana Szekeres, and Tianyin Xu. 2024. Anvil: Verifying Liveness of Cluster Management Controllers. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 649–666.https://w...
2024
-
[52]
Runzhou Tao, Jianan Yao, Xupeng Li, Shih-Wei Li, Jason Nieh, and Ronghui Gu. 2021. Formal Verification of a Multiprocessor Hypervisor on Arm Relaxed Memory Hardware. InProceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles(Virtual Event, Ger- many). Association for Computing Machinery, New York, NY, USA, 866–881. doi:10.1145/3477132.3483560
-
[53]
Coq Development Team. 2024. The Coq Proof Assistant.https://coq. inria.fr/,
2024
-
[54]
OpenCode Development Team. 2026. The open source AI coding agent. https://opencode.ai/,
2026
-
[55]
OpenRouter Development Team. 2026. OpenRouter.https://openrouter. ai/,
2026
- [56]
-
[57]
Johannes Wehrstein, Timo Eckmann, Matthias Jasny, and Carsten Bin- nig. 2026. Bespoke-OLAP.https://github.com/DataManagementLab/ BespokeOLAP,
2026
-
[58]
Bing Xu, Terry Chen, Fengzhe Zhou, Tianqi Chen, Yangqing Jia, Vinod Grover, Haicheng Wu, Wei Liu, Craig Wittenbrink, Wen mei Hwu, Roger Bringmann, Ming-Yu Liu, Luis Ceze, Michael Lightstone, and Humphrey Shi. 2026. VibeTensor: System Software for Deep Learning, Fully Generated by AI Agents. arXiv:arXiv:2601.16238
-
[59]
Lorch, Shuai Lu, Fan Yang, Ziqiao Zhou, and Shan Lu
Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Jianan Yao, Wei- dong Cui, Yeyun Gong, Chris Hawblitzel, Shuvendu Lahiri, Jacob R. Lorch, Shuai Lu, Fan Yang, Ziqiao Zhou, and Shan Lu. 2025. AutoVerus: Automated Proof Generation for Rust Code.Proc. ACM Program. Lang. 9, OOPSLA2, Article 396 (Oct. 2025), 29 pages. doi:10.1145/3763174
-
[60]
Arseniy Zaostrovnykh, Solal Pirelli, Rishabh Iyer, Matteo Rizzo, Luis Pedrosa, Katerina Argyraki, and George Candea. 2019. Verifying soft- ware network functions with no verification expertise. InProceedings of the 27th ACM Symposium on Operating Systems Principles(Huntsville, Ontario, Canada)(SOSP ’19). Association for Computing Machinery, New York, NY, ...
-
[61]
Zihao Zhang, Ti Zhou, Christa Jenkins, Omar Chowdhury, and Shuai Mu. 2025. AutoMan: Facilitating Verified Distributed Systems Develop- ment Through Automatic Code Generation and Manual Optimizations. InProceedings of the ACM SIGOPS 31st Symposium on Operating Sys- tems Principles(Lotte Hotel World, Seoul, Republic of Korea)(SOSP ’25). Association for Comp...
-
[62]
Ziqiao Zhou, Anjali, Weiteng Chen, Sishuai Gong, Chris Hawblitzel, and Weidong Cui. 2024. VeriSMo: A Verified Security Module for Con- fidential VMs. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 599–614.https://www.usenix.org/conference/osdi24/presentation/ zhou
2024
-
[63]
Mo Zou, Haoran Ding, Dong Du, Ming Fu, Ronghui Gu, and Haibo Chen. 2019. Using Concurrent Relational Logic with Helpers for Verifying the AtomFS File System. InProceedings of the 27th ACM Sym- posium on Operating Systems Principles(Huntsville, Ontario, Canada). Association for Computing Machinery, New York, NY, USA, 259–274. 14
2019
-
[64]
Mo Zou, Dong Du, Mingkai Dong, and Haibo Chen. 2024. Using Dynamically Layered Definite Releases for Verifying the RefFS File System. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 629– 648.https://www.usenix.org/conference/osdi24/presentation/zou 15
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.