pith. sign in

arxiv: 2605.21434 · v1 · pith:KWF6UAZEnew · submitted 2026-05-20 · 💻 cs.SE

Agentic Model Checking

Pith reviewed 2026-05-21 03:06 UTC · model grok-4.3

classification 💻 cs.SE
keywords agentic model checkingLLM-generated codebounded model checkingspecification inferencecompositional verificationcounterexample validationsoftware verificationC and Rust code
0
0 comments X

The pith

LLM agents propose specifications and refinements while a bounded model checker verifies every soundness-critical decision in generated systems code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that coupling LLM agents with bounded model checking enables verification of LLM-generated C and Rust code by letting agents handle semantic tasks such as spec inference from caller context and counterexample classification while the solver enforces all soundness requirements. Specifications are written in a restricted DSL that translates directly into assume/assert primitives and support optional functional-correctness clauses. Verification proceeds compositionally by checking each function against its spec with callees replaced by postcondition stubs so that cost stays local to one function. Counterexamples pass through a validation pipeline of reachability checks, callee feasibility, dynamic replay, and realism audit to separate actual defects from modeling artifacts and drive refinements rather than suppress them.

Core claim

Agentic model checking rests on the principle that agents propose while solvers verify: agents infer top-down specifications in a restricted DSL from caller context, select checks, classify counterexamples, and propose refinements, while the bounded model checker discharges every soundness-relevant step. The approach is compositional, replacing callees with postcondition-constrained stubs so each query examines only one function's state space, and it subjects every reported counterexample to a pipeline that distinguishes active crashes from latent failures and modeling artifacts.

What carries the argument

The agents-propose-solvers-verify principle, in which LLM agents perform semantic judgment tasks and the BMC backend handles all soundness-critical decisions through deterministic DSL translation and a counterexample validation pipeline.

If this is right

  • Each function is checked in isolation against its inferred spec with callees modeled as postcondition-constrained stubs.
  • Refinements from counterexample analysis propagate automatically to callers.
  • Verification lifts from panic-freeness to behavioral faithfulness when optional functional-correctness clauses are added.
  • Real defects are confirmed and bounded clean verifications are produced on heavily-fuzzed surfaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of proposal and verification could be applied to other backends such as unbounded model checkers or theorem provers.
  • The method may allow verification effort to scale to larger codebases by keeping per-query cost proportional to a single function.
  • Integration with existing fuzzers could use the validation pipeline to prioritize counterexamples that survive the realism audit.

Load-bearing premise

The restricted DSL for inferred specifications translates deterministically into assume/assert primitives and the counterexample validation pipeline reliably separates real defects from modeling artifacts.

What would settle it

Running the validation pipeline on a known modeling artifact that produces a false positive or on a documented real bug that the pipeline incorrectly rejects in one of the evaluated LLM-generated kernel or compiler modules.

Figures

Figures reproduced from arXiv: 2605.21434 by Daniel Kroening, Jason Xue, Jiawen Liu, Youcheng Sun.

Figure 1
Figure 1. Figure 1: BMC-Agent architecture. The main pipeline parses the program, builds a domain [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
read the original abstract

Verifying LLM-generated systems code is hard: bugs are prevalent, formal specifications are missing, and safety contracts are encoded implicitly at call sites rather than enforced at function boundaries. We propose agentic model checking, a paradigm that couples LLM agents with a bounded model checking backend under the principle agents propose, solvers verify: agents handle tasks requiring semantic judgment (spec inference, check selection, counterexample classification, refinement proposal) while BMC discharges every soundness-relevant decision. The paradigm rests on three commitments. Specifications are inferred top-down from caller context in a restricted DSL that translates deterministically into the backend's assume/assert primitives, with optional functional-correctness clauses lifting verification from panic-freeness to behavioural faithfulness. Verification is compositional: each function is checked in isolation against its spec with callees replaced by postcondition-constrained stubs, so per-query cost scales with a single function's state space and refinements propagate automatically to callers. Counterexamples are not bug reports: they pass through a validation pipeline (reachability, callee feasibility, dynamic replay, realism audit) that distinguishes active in-tree crashes from latent public-API failures, while modelling artifacts drive a refinement loop rather than being suppressed. We instantiate the approach in BMC-Agent and evaluate it on LLM-generated kernel and compiler code in C and Rust alongside mature OSS-Fuzz-hardened libraries, confirming real defects, producing bounded clean verifications on heavily-fuzzed surfaces, and establishing functional equivalence on selected algorithmic functions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes 'agentic model checking,' a paradigm that pairs LLM agents with a bounded model checking (BMC) backend under the principle that agents propose while solvers verify. Agents perform semantic tasks such as inferring specifications top-down from caller context in a restricted DSL (translating to assume/assert primitives, with optional functional-correctness clauses), selecting checks, classifying counterexamples, and proposing refinements. BMC discharges all soundness-relevant decisions. Verification is compositional: each function is checked in isolation against its spec with postcondition-constrained stubs for callees. Counterexamples pass through a four-stage validation pipeline (reachability, callee feasibility, dynamic replay, realism audit) to separate genuine defects from modeling artifacts, triggering refinement rather than suppression. The approach is instantiated as BMC-Agent and evaluated on LLM-generated kernel and compiler code in C and Rust plus OSS-Fuzz-hardened libraries, claiming to confirm real defects and produce bounded clean verifications.

Significance. If the counterexample validation pipeline proves reliable and the compositional checks scale, the paradigm could meaningfully advance verification of LLM-generated systems code by delegating semantic judgment to agents while retaining formal soundness guarantees from BMC. The top-down spec inference and automatic propagation of refinements address the common absence of explicit contracts in generated code. Evaluation on real kernel/compiler surfaces and hardened libraries provides initial evidence of practical utility for defect discovery and clean verification.

major comments (1)
  1. [Description of the counterexample validation pipeline and evaluation results] The soundness of reported evaluation outcomes rests on the counterexample validation pipeline (reachability, callee feasibility, dynamic replay, realism audit) correctly classifying every BMC-produced counterexample so that only genuine defects are reported and modeling artifacts drive refinement. The manuscript describes the stages but supplies neither a formal argument establishing completeness/soundness of the pipeline nor quantitative false-positive or false-negative rates on the kernel/compiler codebases. This directly affects trustworthiness of the claims that real defects were confirmed and bounded clean verifications achieved, since agents propose specifications and checks while only BMC decisions are treated as sound.
minor comments (2)
  1. [Abstract] The abstract states high-level evaluation outcomes but omits quantitative details such as number of functions analyzed, state-space sizes, refinement iterations, or specific defect counts with failure modes.
  2. [Specification inference section] Notation for the restricted DSL and its deterministic translation to assume/assert primitives could be clarified with a small example showing an inferred spec and the resulting BMC encoding.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the significance of agentic model checking. We address the major comment on the counterexample validation pipeline below and outline planned revisions to improve the manuscript.

read point-by-point responses
  1. Referee: The soundness of reported evaluation outcomes rests on the counterexample validation pipeline (reachability, callee feasibility, dynamic replay, realism audit) correctly classifying every BMC-produced counterexample so that only genuine defects are reported and modeling artifacts drive refinement. The manuscript describes the stages but supplies neither a formal argument establishing completeness/soundness of the pipeline nor quantitative false-positive or false-negative rates on the kernel/compiler codebases. This directly affects trustworthiness of the claims that real defects were confirmed and bounded clean verifications achieved, since agents propose specifications and checks while only BMC decisions are treated as sound.

    Authors: We agree that the current manuscript describes the four-stage pipeline but does not supply a formal soundness or completeness argument, nor quantitative false-positive/false-negative rates on the evaluated kernel and compiler codebases. This is a valid observation that affects the strength of the evaluation claims. The pipeline is intentionally conservative and heuristic: counterexamples must pass reachability, callee feasibility, dynamic replay, and realism audit before being treated as genuine defects, with failures triggering refinement rather than suppression. This design prioritizes avoiding false defect reports over exhaustive coverage. In the revised manuscript we will add (1) an expanded rationale section explaining the design choices for each stage and their collective effect on precision, (2) an empirical breakdown of classification outcomes on the reported counterexamples (including counts of artifacts versus confirmed defects), and (3) an explicit limitations paragraph stating that the pipeline is a practical filter whose overall verification soundness continues to rest on the BMC backend. These additions will be incorporated without altering the core claims. revision: yes

Circularity Check

0 steps flagged

No circularity: paradigm proposal with independent soundness commitments

full rationale

The manuscript presents agentic model checking as a coupling of LLM agents (for semantic tasks like spec inference and counterexample classification) with a BMC backend (for all soundness-relevant decisions). The three commitments—top-down spec inference in a restricted DSL, compositional verification with stubs, and a four-stage counterexample validation pipeline—are described procedurally without equations, fitted parameters, or derivations that reduce claimed outcomes to inputs by construction. No self-citations appear as load-bearing justifications for uniqueness or ansatzes, and the evaluation on kernel/compiler code is framed as empirical confirmation rather than a statistically forced prediction. The approach is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on the assumption that bounded model checking remains sound once specifications are expressed in the restricted DSL and that LLM agents can produce usable semantic judgments without introducing unhandled unsoundness.

axioms (1)
  • domain assumption Bounded model checking can discharge all soundness-relevant decisions once specifications are translated into assume/assert primitives.
    Invoked in the description of the agents-propose-solvers-verify principle and the deterministic translation of the DSL.
invented entities (1)
  • BMC-Agent no independent evidence
    purpose: Concrete implementation of the agentic model checking paradigm for evaluation.
    New system name introduced to instantiate the proposed approach.

pith-pipeline@v0.9.0 · 5783 in / 1527 out tokens · 41414 ms · 2026-05-21T03:06:04.008746+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 1 internal anchor

  1. [1]

    https://github.com/kaansenol5/VibeOS, 2026

    Vibeos. https://github.com/kaansenol5/VibeOS, 2026. GitHub repository, accessed 2026-05-18. 12

  2. [2]

    Automatic predicate abstraction of c programs

    Thomas Ball, Rupak Majumdar, Todd Millstein, and Sriram K Rajamani. Automatic predicate abstraction of c programs. InProceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation, pages 203–213, 2001

  3. [3]

    Thorough static analysis of device drivers

    Thomas Ball, Ella Bounimova, Byron Cook, Vladimir Levin, Jakob Lichtenberg, Con McGarvey, Bohus Ondrusek, Sriram K Rajamani, and Abdullah Ustuner. Thorough static analysis of device drivers. volume 40, pages 73–85. ACM New York, NY, USA, 2006

  4. [4]

    Symbolic model check- ing without bdds

    Armin Biere, Alessandro Cimatti, Edmund Clarke, and Yunshan Zhu. Symbolic model check- ing without bdds. InInternational conference on tools and algorithms for the construction and analysis of systems, pages 193–207. Springer, 1999

  5. [5]

    KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs

    Cristian Cadar, Daniel Dunbar, and Dawson Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. InProceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’08), pages 209–224, San Diego, CA, USA, December 2008. USENIX Association. URL https: //www.usenix.org/legacy/event/...

  6. [6]

    Smatch: Pluggable static analysis for c

    Dan Carpenter. Smatch: Pluggable static analysis for c. https://lwn.net/Articles/ 691882/, 2016

  7. [7]

    Modular verification of software components in c.IEEE Transactions on Software Engineering, 30 (6):388–402, 2004

    Sagar Chaki, Edmund M Clarke, Alex Groce, Somesh Jha, and Helmut Veith. Modular verification of software components in c.IEEE Transactions on Software Engineering, 30 (6):388–402, 2004

  8. [8]

    Counterexample- guided abstraction refinement

    Edmund Clarke, Orna Grumberg, Somesh Jha, Yuan Lu, and Helmut Veith. Counterexample- guided abstraction refinement. InInternational Conference on Computer Aided Verification, pages 154–169. Springer, 2000

  9. [9]

    A tool for checking ansi-c programs

    Edmund Clarke, Daniel Kroening, and Flavio Lerda. A tool for checking ansi-c programs. InInternational Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 168–176. Springer, 2004

  10. [10]

    Frama-c: a software analysis perspective

    Pascal Cuoq, Florent Kirchner, Nikolai Kosmatov, Virgile Prevosto, Julien Signoles, and Boris Yakobowski. Frama-c: a software analysis perspective. InProceedings of the 10th International Conference on Software Engineering and Formal Methods, SEFM’12, page 233–247, Berlin, Heidelberg, 2012. Springer-Verlag. ISBN 9783642338250. doi: 10.1007/ 978-3-642-3382...

  11. [11]

    FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning

    Haoran Ding, Zhaoguo Wang, and Haibo Chen. Fm-agent: Scaling formal methods to large systems via llm-based hoare-style reasoning.arXiv preprint arXiv:2604.11556, 2026

  12. [12]

    Lazy abstrac- tion

    Thomas A Henzinger, Ranjit Jhala, Rupak Majumdar, and Gregoire Sutre. Lazy abstrac- tion. InProceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 58–70, 2002

  13. [13]

    Enhancing automated loop invariant generation for complex programs with large language models.Science of Computer Programming, page 103387, 2025

    Ruibang Liu, Minyu Chen, Ling-I Wu, Jingyu Ke, and Guoqiang Li. Enhancing automated loop invariant generation for complex programs with large language models.Science of Computer Programming, page 103387, 2025

  14. [14]

    Scaling symbolic evaluation for automated verification of systems code with serval

    Luke Nelson, James Bornholt, Ronghui Gu, Andrew Baumann, Emina Torlak, and Xi Wang. Scaling symbolic evaluation for automated verification of systems code with serval. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP ’19, page 225–242, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450368735. doi: 10.1...

  15. [15]

    Veri- fying dynamic trait objects in rust

    Alexa VanHattum, Daniel Schwartz-Narbonne, Nathan Chong, and Adrian Sampson. Veri- fying dynamic trait objects in rust. InProceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice, pages 321–330, 2022

  16. [16]

    Enchanting program specification synthesis by large language models using static analysis and program verification

    Cheng Wen, Jialun Cao, Jie Su, Zhiwu Xu, Shengchao Qin, Mengda He, Haokun Li, Shing- Chi Cheung, and Cong Tian. Enchanting program specification synthesis by large language models using static analysis and program verification. InInternational Conference on Computer Aided Verification, pages 302–328. Springer, 2024

  17. [17]

    Lemur: Integrating large language models in automated program verification

    Haoze Wu, Clark Barrett, and Nina Narodytska. Lemur: Integrating large language models in automated program verification. InInternational Conference on Learning Representations, volume 2024, pages 2968–2978, 2024

  18. [18]

    React: Synergizing reasoning and acting in language models.ICLR, 2023

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models.ICLR, 2023. Appendix A VibeOS Case Studies We evaluate BMC-Agent on VibeOS [ 1], a bare-metal ARM64 hobby operating system of approximately 15,000 lines of C written with substantial LLM assistance. We chos...

  19. [19]

    A refined spec is constructed with the tightened precondition and replaces the original incurrent specs

  20. [20]

    / " , & rest ) ) != NULL ) { 9if ( strcmp ( token ,

    The refined spec is persisted to the artifact store. 3.calloc is added to theself-recheck queue: the CEGAR re-verification queue for functions whose precondition was tightened after a spurious counterexample. Without this step a tool that only filters (never refines) would either report the loop artifact as a false positive or drop it silently — and never...