FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning

· 2026 · cs.SE · arXiv 2604.11556

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

LLM-assisted software development has become increasingly prevalent, and can generate large-scale systems, such as compilers. It becomes crucial to strengthen the correctness of the generated code. However, automated reasoning for large-scale systems remains challenging due to code complexity. Hoare logic offers an approach to decomposing a large system into smaller components and reasoning about them separately (i.e., compositional reasoning). However, existing works still struggle to scale, because Hoare logic requires writing formal specifications for each function, imposing a heavy human burden. The problem is exacerbated when code is generated by LLMs, as developers lack a deep understanding of each function's expected behavior. This paper presents FM-Agent, the first framework that realizes automated compositional reasoning for large-scale systems. Leveraging LLMs, FM-Agent introduces a top-down paradigm to automatically generate function-level specifications. Specifically, FM-Agent derives the specification of a function from how its callers expect the function to behave, so the generated specifications can reflect the developer's intent of a function even if the implementation is buggy. Developers' intent is usually expressed in natural language, while existing verifiers only support formulas. Therefore, FM-Agent generalizes Hoare-style inference to reason about functions against natural-language specifications. Finally, to confirm bug existence and explain bug causes, FM-Agent automatically generates test cases to trigger potential bugs. In our evaluation, FM-Agent successfully reasons about large-scale systems within 2 days, each of which has up to 143k LoC. These systems have already been tested by their developers, but FM-Agent still finds 522 newly discovered bugs. These bugs can cause serious consequences, including system crashes and incorrect execution results.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

AccelSync: Verifying Synchronization Coverage in Accelerator Pipeline Programs

cs.AR · 2026-05-08 · conditional · novelty 7.0

AccelSync reduces synchronization correctness in accelerator pipelines to a decidable barrier-sufficiency check with O(|E|^2) complexity, proves soundness and completeness, and detects real hazards in thousands of production kernels.

Agentic Model Checking

cs.SE · 2026-05-20 · unverdicted · novelty 6.0

Agentic model checking pairs LLM agents for spec inference and counterexample classification with compositional bounded model checking to verify LLM-generated kernel and compiler code.

citing papers explorer

Showing 2 of 2 citing papers.

AccelSync: Verifying Synchronization Coverage in Accelerator Pipeline Programs cs.AR · 2026-05-08 · conditional · none · ref 17 · internal anchor
AccelSync reduces synchronization correctness in accelerator pipelines to a decidable barrier-sufficiency check with O(|E|^2) complexity, proves soundness and completeness, and detects real hazards in thousands of production kernels.
Agentic Model Checking cs.SE · 2026-05-20 · unverdicted · none · ref 11 · internal anchor
Agentic model checking pairs LLM agents for spec inference and counterexample classification with compositional bounded model checking to verify LLM-generated kernel and compiler code.

FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer