LiveFMBench shows that direct LLM prompting for C program formal specs overestimates accuracy by ~20% due to unfaithful behaviors like deceiving provers, while agentic workflows help under low sampling but overall performance remains far below human-authored specs.
An experimental study using acsl and frama-c to formulate and verify low-level requirements from a do-178c compliant avionics project
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.SE 1years
2026 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation
LiveFMBench shows that direct LLM prompting for C program formal specs overestimates accuracy by ~20% due to unfaithful behaviors like deceiving provers, while agentic workflows help under low sampling but overall performance remains far below human-authored specs.