LLM-Based Static Verification of Code Against Natural-Language Requirements: An Industrial Experience Report

Dave Towey; Tsong Yueh Chen; Zhi Quan Zhou

arxiv: 2605.17926 · v1 · pith:VONJ76KJnew · submitted 2026-05-18 · 💻 cs.SE

LLM-Based Static Verification of Code Against Natural-Language Requirements: An Industrial Experience Report

Zhi Quan Zhou , Dave Towey , Tsong Yueh Chen This is my paper

Pith reviewed 2026-05-20 09:37 UTC · model grok-4.3

classification 💻 cs.SE

keywords LLMstatic verificationnatural language requirementssoftware testingtest oracle problemindustrial experience reportcybersecurityintelligent vehicle

0 comments

The pith

A two-stage LLM workflow verifies code against natural-language requirements without compilation or execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a two-stage LLM-based workflow for statically checking whether implemented code satisfies requirements written in natural language. The first stage uses an AI rule miner to extract verifiable rules from the requirements text while flagging ambiguity, self-contradiction, and other non-verifiable parts. The second stage uses an AI code auditor to examine implementation evidence against the extracted rules. This structured approach avoids direct verification over long specifications and reduces hallucination and context loss. It supplies a requirement-aware static analysis technique that works without needing to compile, run, or set up runtime environments for the software.

Core claim

The two-stage LLM-based workflow provides a requirement-aware and semantics-aware form of static analysis that complements software testing by analyzing requirements and source code without requiring compilation, execution, or runtime environments, and offers a new approach to addressing the test oracle problem.

What carries the argument

The two-stage workflow that first has an AI-based rule miner extract verifiable rules from natural-language requirements while identifying non-verifiable statements, then has an AI-based code auditor check implementation evidence against those rules using a structured intermediate representation.

If this is right

Verification and validation steps can move earlier in the development lifecycle without waiting for executable builds or test harnesses.
The method can surface problems in the requirements themselves before any code is written or tested.
It supplies an additional check for business logic correctness that traditional static analyzers focused on coding defects cannot provide.
The intermediate rules create an auditable trail that improves explainability of verification results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique might be combined with conventional static analysis tools to catch both coding defects and specification mismatches in one pass.
In domains with evolving requirements, the rule extraction step could be rerun periodically to keep verification aligned with updated specifications.
Scaling the approach to very large codebases would require testing how the context length and rule complexity affect auditor accuracy.

Load-bearing premise

That the AI rule miner can reliably pull out accurate verifiable rules and flag non-verifiable statements from natural language, and that the AI code auditor can then check the code against those rules without substantial hallucination or inconsistent results.

What would settle it

Apply the workflow to a collection of requirements containing deliberate ambiguities and contradictions together with code containing known mismatches to the intended rules, then measure whether the system correctly identifies the ambiguities and detects the mismatches.

read the original abstract

Large language models (LLMs) are increasingly used to generate requirements specifications, design documents, code, and test cases. In contrast, much less attention has been given to a more difficult assurance problem: statically verifying whether implemented code satisfies requirements written in natural language. Conventional static analysis tools are effective at detecting coding defects and known vulnerability patterns, but they cannot determine whether program behavior matches intended business logic. Detecting such defects requires reasoning over the specification rather than the code alone. Software testing can expose some of these mismatches, but its effectiveness depends heavily on test design, executable artifacts, and runtime environments. This article presents a two-stage LLM-based workflow for addressing this challenge in an intelligent-vehicle cybersecurity case study. In the first stage, an AI-based rule miner extracts verifiable rules from natural-language requirements while explicitly identifying ambiguity, self-contradiction, and other non-verifiable statements. In the second stage, an AI-based code auditor checks implementation evidence against the extracted rules. Instead of asking a single LLM to directly verify code against lengthy natural-language specifications, the workflow introduces a structured intermediate representation to reduce hallucination, output variability, limited explainability, and context loss. The resulting approach is a requirement-aware and semantics-aware form of static analysis that complements software testing. By analyzing requirements and source code without requiring compilation, execution, or runtime environments, the method shifts verification and validation activities left in the development lifecycle. This LLM-based static analysis is also a new approach to addressing the test oracle problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper outlines a two-stage LLM workflow for verifying code against natural-language requirements in an industrial case but provides no quantitative metrics on reliability or accuracy.

read the letter

The key takeaway is that this is an industrial experience report outlining a two-stage LLM workflow for static verification of code against natural-language requirements, but it doesn't include quantitative results on its effectiveness. The first stage uses an AI rule miner to extract verifiable rules from the requirements, noting ambiguities and contradictions. The second stage has an AI auditor check the code against those rules. They tested this on a cybersecurity case for intelligent vehicles. What the paper does well is identify a gap: conventional static analysis catches code defects but not logic mismatches with business requirements. Using an intermediate rule representation to break down the task for the LLM is a practical way to reduce hallucination and improve explainability. The approach could complement testing by allowing earlier checks without running the code. That's a useful angle for safety-critical systems. The main soft spot is the missing evidence. No numbers on how often the rules were correctly extracted, how the audits matched human judgment, or how consistent the outputs were across runs. Without that, the claims about reduced hallucination and effective verification rest on the workflow description alone. This paper is for readers interested in applying LLMs to software assurance and verification tasks in industry. It shows honest engagement with the challenges of using LLMs for this purpose. I'd recommend sending it for peer review. Referees can point out where more evaluation would strengthen it.

Referee Report

1 major / 1 minor

Summary. The manuscript presents an industrial experience report on a two-stage LLM-based workflow for static verification of code against natural-language requirements, demonstrated in an intelligent-vehicle cybersecurity case study. Stage 1 uses an AI-based rule miner to extract verifiable rules from requirements while flagging ambiguity, self-contradiction, and non-verifiable statements. Stage 2 uses an AI-based code auditor to check implementation evidence against the extracted rules. The approach is claimed to reduce hallucination, output variability, and context loss compared to direct LLM verification, yielding a requirement-aware and semantics-aware static analysis that complements testing, shifts V&V left in the lifecycle, and offers a new approach to the test oracle problem without requiring compilation, execution, or runtime environments.

Significance. If the workflow's reliability claims hold, the work would represent a meaningful step toward integrating natural-language specifications into early-stage static analysis for safety-critical systems. It could complement existing tools by addressing semantic mismatches that conventional static analyzers and testing miss. The industrial context and structured intermediate representation are strengths, but the current lack of quantitative validation limits the assessed impact to potential rather than demonstrated.

major comments (1)

[abstract and case study section] The central claim that the two-stage workflow reliably extracts rules and audits code with reduced hallucination (abstract and §3 workflow description) rests on an untested assumption. The manuscript provides selected successful traces from the cybersecurity case study but reports no quantitative metrics such as precision/recall against human-labeled ground truth, inter-run consistency (e.g., variance over repeated LLM calls with fixed prompts), or agreement rates between the auditor and independent reviewers. This is load-bearing because the value proposition of requirement-aware static analysis collapses without evidence that the intermediate rules and verdicts are stable and accurate at non-trivial rates.

minor comments (1)

[rule miner stage] The description of how the rule miner explicitly identifies non-verifiable statements could be clarified with a concrete example from the requirements set.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thoughtful review and for highlighting the importance of quantitative validation to support the reliability claims. We address this point directly below and commit to strengthening the manuscript accordingly.

read point-by-point responses

Referee: [abstract and case study section] The central claim that the two-stage workflow reliably extracts rules and audits code with reduced hallucination (abstract and §3 workflow description) rests on an untested assumption. The manuscript provides selected successful traces from the cybersecurity case study but reports no quantitative metrics such as precision/recall against human-labeled ground truth, inter-run consistency (e.g., variance over repeated LLM calls with fixed prompts), or agreement rates between the auditor and independent reviewers. This is load-bearing because the value proposition of requirement-aware static analysis collapses without evidence that the intermediate rules and verdicts are stable and accurate at non-trivial rates.

Authors: We agree that the absence of quantitative metrics is a limitation in the current version. As an industrial experience report, the manuscript prioritizes describing the practical two-stage workflow and illustrating it with real traces from the cybersecurity case study rather than a controlled benchmark evaluation. However, we recognize that metrics are needed to substantiate the claims of reduced hallucination and improved stability. In the revised manuscript we will add a dedicated evaluation subsection that reports: (1) precision and recall of the rule miner against a human-labeled subset of requirements, (2) inter-run consistency measured by repeating LLM calls with fixed prompts and reporting variance in extracted rules and verdicts, and (3) agreement rates between the code auditor outputs and independent human reviewers. These additions will directly address the load-bearing concern while preserving the experience-report character of the work. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical experience report with no derivations or self-referential claims

full rationale

The paper is an industrial experience report presenting a two-stage LLM workflow for requirement-aware static verification in a cybersecurity case study. It describes the rule miner and code auditor stages, their benefits for reducing hallucination via intermediate representations, and shifts verification left in the lifecycle, supported by the case study observations. No mathematical derivations, equations, fitted parameters, predictions, or uniqueness theorems appear. Claims rest on the workflow description and empirical application rather than any reduction to inputs by construction or self-citation load-bearing chains. The approach is introduced directly as a structured alternative to direct LLM verification, with no ansatz smuggling, renaming of known results, or self-definitional loops.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on assumptions about LLM capabilities for accurate rule extraction and auditing; these are domain assumptions without independent evidence or quantitative support in the abstract.

axioms (2)

domain assumption LLMs can extract verifiable rules from natural-language requirements while explicitly identifying ambiguity, self-contradiction, and other non-verifiable statements
This capability is required for the first stage of the workflow as stated in the abstract.
domain assumption LLMs can check implementation evidence against extracted rules without substantial hallucination or output variability when using a structured intermediate representation
This underpins the second stage and the claimed reductions in hallucination and context loss.

pith-pipeline@v0.9.0 · 5802 in / 1436 out tokens · 65055 ms · 2026-05-20T09:37:30.112961+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

two-stage LLM-based workflow … AI-based rule miner extracts verifiable rules … AI-based code auditor checks implementation evidence
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

metamorphic relation MR1 … pooling strategy … temperature fixed at 0

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

[1]

The Oracle Problem in Software Testing: A Survey,

E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The Oracle Problem in Software Testing: A Survey,”IEEE Transactions on Software Engineering, vol. 41, no. 5, pp. 507–525, 2015. https://doi.org/ 10.1109/TSE.2014.2372785

work page doi:10.1109/tse.2014.2372785 2015
[2]

Analyzing Regulatory Rules for Privacy and Security Requirements,

T. Breaux and A. Anton, “Analyzing Regulatory Rules for Privacy and Security Requirements,”IEEE Transactions on Software Engineering, vol. 34, no. 1, pp. 5–20, 2008. https://doi.org/10.1109/TSE.2007.70746

work page doi:10.1109/tse.2007.70746 2008
[3]

Available: https://doi.org/10.1145/3143561

T. Y . Chen, F.-C. Kuo, H. Liu, P.-L. Poon, D. Towey, T. H. Tse, and Z. Q. Zhou, “Metamorphic Testing: A Review of Challenges and Opportunities,”ACM Computing Surveys, vol. 51, no. 1, Article 4, pp. 4:1–4:27, 2018. https://doi.org/10.1145/3143561

work page doi:10.1145/3143561 2018
[4]

Natural Language Processing for Requirements Engineering: The Best Is Yet to Come,

F. Dalpiaz, A. Ferrari, X. Franch, and C. Palomares, “Natural Language Processing for Requirements Engineering: The Best Is Yet to Come,” IEEE Software, vol. 35, no. 5, pp. 115–119, 2018. https://doi.org/10. 1109/MS.2018.3571242

work page arXiv 2018
[5]

Modeling and discovering vulnerabilities with code property graphs

F. Yamaguchi, N. Golde, D. Arp, and K. Rieck, “Modeling and Dis- covering Vulnerabilities with Code Property Graphs,” in2014 IEEE Symposium on Security and Privacy, 2014, pp. 590–604. https://doi. org/10.1109/SP.2014.44

work page doi:10.1109/sp.2014.44 2014
[6]

Natural Language Processing for Requirements Engineering: A Systematic Mapping Study,

L. Zhao, W. Alhoshan, A. Ferrari, K. J. Letsholo, M. A. Ajagbe, E.- V . Chioasca, and R. T. Batista-Navarro, “Natural Language Processing for Requirements Engineering: A Systematic Mapping Study,”ACM Computing Surveys, vol. 54, no. 3, Article 55, pp. 55:1–55:41, 2021. https://doi.org/10.1145/3444689

work page doi:10.1145/3444689 2021

[1] [1]

The Oracle Problem in Software Testing: A Survey,

E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The Oracle Problem in Software Testing: A Survey,”IEEE Transactions on Software Engineering, vol. 41, no. 5, pp. 507–525, 2015. https://doi.org/ 10.1109/TSE.2014.2372785

work page doi:10.1109/tse.2014.2372785 2015

[2] [2]

Analyzing Regulatory Rules for Privacy and Security Requirements,

T. Breaux and A. Anton, “Analyzing Regulatory Rules for Privacy and Security Requirements,”IEEE Transactions on Software Engineering, vol. 34, no. 1, pp. 5–20, 2008. https://doi.org/10.1109/TSE.2007.70746

work page doi:10.1109/tse.2007.70746 2008

[3] [3]

Available: https://doi.org/10.1145/3143561

T. Y . Chen, F.-C. Kuo, H. Liu, P.-L. Poon, D. Towey, T. H. Tse, and Z. Q. Zhou, “Metamorphic Testing: A Review of Challenges and Opportunities,”ACM Computing Surveys, vol. 51, no. 1, Article 4, pp. 4:1–4:27, 2018. https://doi.org/10.1145/3143561

work page doi:10.1145/3143561 2018

[4] [4]

Natural Language Processing for Requirements Engineering: The Best Is Yet to Come,

F. Dalpiaz, A. Ferrari, X. Franch, and C. Palomares, “Natural Language Processing for Requirements Engineering: The Best Is Yet to Come,” IEEE Software, vol. 35, no. 5, pp. 115–119, 2018. https://doi.org/10. 1109/MS.2018.3571242

work page arXiv 2018

[5] [5]

Modeling and discovering vulnerabilities with code property graphs

F. Yamaguchi, N. Golde, D. Arp, and K. Rieck, “Modeling and Dis- covering Vulnerabilities with Code Property Graphs,” in2014 IEEE Symposium on Security and Privacy, 2014, pp. 590–604. https://doi. org/10.1109/SP.2014.44

work page doi:10.1109/sp.2014.44 2014

[6] [6]

Natural Language Processing for Requirements Engineering: A Systematic Mapping Study,

L. Zhao, W. Alhoshan, A. Ferrari, K. J. Letsholo, M. A. Ajagbe, E.- V . Chioasca, and R. T. Batista-Navarro, “Natural Language Processing for Requirements Engineering: A Systematic Mapping Study,”ACM Computing Surveys, vol. 54, no. 3, Article 55, pp. 55:1–55:41, 2021. https://doi.org/10.1145/3444689

work page doi:10.1145/3444689 2021