pith. sign in

arxiv: 2605.17926 · v1 · pith:VONJ76KJnew · submitted 2026-05-18 · 💻 cs.SE

LLM-Based Static Verification of Code Against Natural-Language Requirements: An Industrial Experience Report

Pith reviewed 2026-05-20 09:37 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLMstatic verificationnatural language requirementssoftware testingtest oracle problemindustrial experience reportcybersecurityintelligent vehicle
0
0 comments X

The pith

A two-stage LLM workflow verifies code against natural-language requirements without compilation or execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a two-stage LLM-based workflow for statically checking whether implemented code satisfies requirements written in natural language. The first stage uses an AI rule miner to extract verifiable rules from the requirements text while flagging ambiguity, self-contradiction, and other non-verifiable parts. The second stage uses an AI code auditor to examine implementation evidence against the extracted rules. This structured approach avoids direct verification over long specifications and reduces hallucination and context loss. It supplies a requirement-aware static analysis technique that works without needing to compile, run, or set up runtime environments for the software.

Core claim

The two-stage LLM-based workflow provides a requirement-aware and semantics-aware form of static analysis that complements software testing by analyzing requirements and source code without requiring compilation, execution, or runtime environments, and offers a new approach to addressing the test oracle problem.

What carries the argument

The two-stage workflow that first has an AI-based rule miner extract verifiable rules from natural-language requirements while identifying non-verifiable statements, then has an AI-based code auditor check implementation evidence against those rules using a structured intermediate representation.

If this is right

  • Verification and validation steps can move earlier in the development lifecycle without waiting for executable builds or test harnesses.
  • The method can surface problems in the requirements themselves before any code is written or tested.
  • It supplies an additional check for business logic correctness that traditional static analyzers focused on coding defects cannot provide.
  • The intermediate rules create an auditable trail that improves explainability of verification results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique might be combined with conventional static analysis tools to catch both coding defects and specification mismatches in one pass.
  • In domains with evolving requirements, the rule extraction step could be rerun periodically to keep verification aligned with updated specifications.
  • Scaling the approach to very large codebases would require testing how the context length and rule complexity affect auditor accuracy.

Load-bearing premise

That the AI rule miner can reliably pull out accurate verifiable rules and flag non-verifiable statements from natural language, and that the AI code auditor can then check the code against those rules without substantial hallucination or inconsistent results.

What would settle it

Apply the workflow to a collection of requirements containing deliberate ambiguities and contradictions together with code containing known mismatches to the intended rules, then measure whether the system correctly identifies the ambiguities and detects the mismatches.

read the original abstract

Large language models (LLMs) are increasingly used to generate requirements specifications, design documents, code, and test cases. In contrast, much less attention has been given to a more difficult assurance problem: statically verifying whether implemented code satisfies requirements written in natural language. Conventional static analysis tools are effective at detecting coding defects and known vulnerability patterns, but they cannot determine whether program behavior matches intended business logic. Detecting such defects requires reasoning over the specification rather than the code alone. Software testing can expose some of these mismatches, but its effectiveness depends heavily on test design, executable artifacts, and runtime environments. This article presents a two-stage LLM-based workflow for addressing this challenge in an intelligent-vehicle cybersecurity case study. In the first stage, an AI-based rule miner extracts verifiable rules from natural-language requirements while explicitly identifying ambiguity, self-contradiction, and other non-verifiable statements. In the second stage, an AI-based code auditor checks implementation evidence against the extracted rules. Instead of asking a single LLM to directly verify code against lengthy natural-language specifications, the workflow introduces a structured intermediate representation to reduce hallucination, output variability, limited explainability, and context loss. The resulting approach is a requirement-aware and semantics-aware form of static analysis that complements software testing. By analyzing requirements and source code without requiring compilation, execution, or runtime environments, the method shifts verification and validation activities left in the development lifecycle. This LLM-based static analysis is also a new approach to addressing the test oracle problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents an industrial experience report on a two-stage LLM-based workflow for static verification of code against natural-language requirements, demonstrated in an intelligent-vehicle cybersecurity case study. Stage 1 uses an AI-based rule miner to extract verifiable rules from requirements while flagging ambiguity, self-contradiction, and non-verifiable statements. Stage 2 uses an AI-based code auditor to check implementation evidence against the extracted rules. The approach is claimed to reduce hallucination, output variability, and context loss compared to direct LLM verification, yielding a requirement-aware and semantics-aware static analysis that complements testing, shifts V&V left in the lifecycle, and offers a new approach to the test oracle problem without requiring compilation, execution, or runtime environments.

Significance. If the workflow's reliability claims hold, the work would represent a meaningful step toward integrating natural-language specifications into early-stage static analysis for safety-critical systems. It could complement existing tools by addressing semantic mismatches that conventional static analyzers and testing miss. The industrial context and structured intermediate representation are strengths, but the current lack of quantitative validation limits the assessed impact to potential rather than demonstrated.

major comments (1)
  1. [abstract and case study section] The central claim that the two-stage workflow reliably extracts rules and audits code with reduced hallucination (abstract and §3 workflow description) rests on an untested assumption. The manuscript provides selected successful traces from the cybersecurity case study but reports no quantitative metrics such as precision/recall against human-labeled ground truth, inter-run consistency (e.g., variance over repeated LLM calls with fixed prompts), or agreement rates between the auditor and independent reviewers. This is load-bearing because the value proposition of requirement-aware static analysis collapses without evidence that the intermediate rules and verdicts are stable and accurate at non-trivial rates.
minor comments (1)
  1. [rule miner stage] The description of how the rule miner explicitly identifies non-verifiable statements could be clarified with a concrete example from the requirements set.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thoughtful review and for highlighting the importance of quantitative validation to support the reliability claims. We address this point directly below and commit to strengthening the manuscript accordingly.

read point-by-point responses
  1. Referee: [abstract and case study section] The central claim that the two-stage workflow reliably extracts rules and audits code with reduced hallucination (abstract and §3 workflow description) rests on an untested assumption. The manuscript provides selected successful traces from the cybersecurity case study but reports no quantitative metrics such as precision/recall against human-labeled ground truth, inter-run consistency (e.g., variance over repeated LLM calls with fixed prompts), or agreement rates between the auditor and independent reviewers. This is load-bearing because the value proposition of requirement-aware static analysis collapses without evidence that the intermediate rules and verdicts are stable and accurate at non-trivial rates.

    Authors: We agree that the absence of quantitative metrics is a limitation in the current version. As an industrial experience report, the manuscript prioritizes describing the practical two-stage workflow and illustrating it with real traces from the cybersecurity case study rather than a controlled benchmark evaluation. However, we recognize that metrics are needed to substantiate the claims of reduced hallucination and improved stability. In the revised manuscript we will add a dedicated evaluation subsection that reports: (1) precision and recall of the rule miner against a human-labeled subset of requirements, (2) inter-run consistency measured by repeating LLM calls with fixed prompts and reporting variance in extracted rules and verdicts, and (3) agreement rates between the code auditor outputs and independent human reviewers. These additions will directly address the load-bearing concern while preserving the experience-report character of the work. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical experience report with no derivations or self-referential claims

full rationale

The paper is an industrial experience report presenting a two-stage LLM workflow for requirement-aware static verification in a cybersecurity case study. It describes the rule miner and code auditor stages, their benefits for reducing hallucination via intermediate representations, and shifts verification left in the lifecycle, supported by the case study observations. No mathematical derivations, equations, fitted parameters, predictions, or uniqueness theorems appear. Claims rest on the workflow description and empirical application rather than any reduction to inputs by construction or self-citation load-bearing chains. The approach is introduced directly as a structured alternative to direct LLM verification, with no ansatz smuggling, renaming of known results, or self-definitional loops.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on assumptions about LLM capabilities for accurate rule extraction and auditing; these are domain assumptions without independent evidence or quantitative support in the abstract.

axioms (2)
  • domain assumption LLMs can extract verifiable rules from natural-language requirements while explicitly identifying ambiguity, self-contradiction, and other non-verifiable statements
    This capability is required for the first stage of the workflow as stated in the abstract.
  • domain assumption LLMs can check implementation evidence against extracted rules without substantial hallucination or output variability when using a structured intermediate representation
    This underpins the second stage and the claimed reductions in hallucination and context loss.

pith-pipeline@v0.9.0 · 5802 in / 1436 out tokens · 65055 ms · 2026-05-20T09:37:30.112961+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

  1. [1]

    The Oracle Problem in Software Testing: A Survey,

    E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The Oracle Problem in Software Testing: A Survey,”IEEE Transactions on Software Engineering, vol. 41, no. 5, pp. 507–525, 2015. https://doi.org/ 10.1109/TSE.2014.2372785

  2. [2]

    Analyzing Regulatory Rules for Privacy and Security Requirements,

    T. Breaux and A. Anton, “Analyzing Regulatory Rules for Privacy and Security Requirements,”IEEE Transactions on Software Engineering, vol. 34, no. 1, pp. 5–20, 2008. https://doi.org/10.1109/TSE.2007.70746

  3. [3]

    Available: https://doi.org/10.1145/3143561

    T. Y . Chen, F.-C. Kuo, H. Liu, P.-L. Poon, D. Towey, T. H. Tse, and Z. Q. Zhou, “Metamorphic Testing: A Review of Challenges and Opportunities,”ACM Computing Surveys, vol. 51, no. 1, Article 4, pp. 4:1–4:27, 2018. https://doi.org/10.1145/3143561

  4. [4]

    Natural Language Processing for Requirements Engineering: The Best Is Yet to Come,

    F. Dalpiaz, A. Ferrari, X. Franch, and C. Palomares, “Natural Language Processing for Requirements Engineering: The Best Is Yet to Come,” IEEE Software, vol. 35, no. 5, pp. 115–119, 2018. https://doi.org/10. 1109/MS.2018.3571242

  5. [5]

    Modeling and discovering vulnerabilities with code property graphs

    F. Yamaguchi, N. Golde, D. Arp, and K. Rieck, “Modeling and Dis- covering Vulnerabilities with Code Property Graphs,” in2014 IEEE Symposium on Security and Privacy, 2014, pp. 590–604. https://doi. org/10.1109/SP.2014.44

  6. [6]

    Natural Language Processing for Requirements Engineering: A Systematic Mapping Study,

    L. Zhao, W. Alhoshan, A. Ferrari, K. J. Letsholo, M. A. Ajagbe, E.- V . Chioasca, and R. T. Batista-Navarro, “Natural Language Processing for Requirements Engineering: A Systematic Mapping Study,”ACM Computing Surveys, vol. 54, no. 3, Article 55, pp. 55:1–55:41, 2021. https://doi.org/10.1145/3444689