Evaluating LLM-Generated ACSL Annotations for Formal Verification

2026 · cs.SE · arXiv 2602.13851

1 Pith paper cites this work.

abstract

Formal specifications are crucial for building verifiable and dependable software systems, yet generating accurate and verifiable specifications for real-world C programs remains challenging. This paper presents an empirical evaluation of automated ACSL annotation generation strategies for C programs, comparing a rule-based Python script, Frama-C's RTE plugin, and three large language models (DeepSeek-V3.2, GPT-5.2, and OLMo 3.1 32B Instruct). The study focuses on one-shot annotation generation, assessing how these approaches perform when directly applied to verification tasks. Using a filtered subset of the CASP benchmark, we evaluate generated annotations through Frama-C's WP plugin with multiple SMT solvers, analyzing proof success rates, solver timeouts, and internal processing time. Our results show that rule-based approaches remain more reliable for verification success, while LLM-based methods exhibit more variable performance. These findings highlight both the current limitations and the potential of LLMs as complementary tools for automated specification generation.
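For readers unfamiliar with ACSL, the sketch below shows the kind of annotated C function the evaluation is concerned with. It is an illustrative example only: the function `abs_int` and its contract are hypothetical and are not taken from the paper or from the CASP benchmark. The contract states a precondition that rules out signed overflow, declares that no memory is modified, and gives two postconditions for a prover to discharge.

```c
#include <limits.h>

/* Hypothetical example, not from the paper or the CASP benchmark. */

/*@ requires x > INT_MIN;        // -INT_MIN would overflow a signed int
    assigns \nothing;            // the function has no side effects
    ensures \result >= 0;
    ensures \result == x || \result == -x;
*/
int abs_int(int x) {
    /* Negation is safe here because the precondition excludes INT_MIN. */
    return x < 0 ? -x : x;
}
```

A file like this would typically be checked with an invocation along the lines of `frama-c -wp -wp-rte abs.c`, where `-wp-rte` asks WP to also generate runtime-error guards. The exact flags, solvers, and timeouts used in the study are not given in the abstract, so this command is an assumption about a plausible setup rather than the authors' configuration.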

representative citing papers

Trustworthy Software Project Generation : a Case Study with an Interactive Theorem Prover

cs.SE · 2026-05-25 · conditional · novelty 7.0

An LLM agent with Rocq backend automatically builds a verified RISC-V RV32I interpreter (1859 lines Rocq, 2848 lines extracted C++) that passes 265 tests and 12-hour fuzzing, while a Dafny backend fails.
