pith. machine review for the scientific record.

arxiv: 2605.13817 · v1 · submitted 2026-05-13 · 💻 cs.SE · cs.AI

Recognition: 1 theorem link · Lean Theorem

Neurosymbolic Auditing of Natural-Language Software Requirements

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 17:57 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords neurosymbolic auditing · natural language requirements · ambiguity detection · SMT solver · formal verification · medical device software · LLM translation · software safety

The pith

Large language models with an SMT solver can translate and audit natural-language software requirements for ambiguity and defects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Natural-language software requirements often hide ambiguities and inconsistencies that cause problems in safety-critical systems like medical devices. This paper shows that large language models can convert these requirements into formal logical statements, where disagreement among multiple independent translations points to ambiguous requirements. An SMT solver then examines these statements to find inconsistencies, unnecessary conditions, and potential safety violations. The VERIMED system implements this for hemodialysis requirements and demonstrates that feeding the solver's concrete counterexamples back into the repair loop raises verified accuracy from 55.4 percent to 98.5 percent.

Core claim

Large language models equipped with an SMT solver can audit natural-language software requirements by translating them into formal logic, detecting ambiguity through stochastic variation in the generated formalizations, and exposing inconsistency, vacuousness, and safety violations through solver queries on the resulting specification.

What carries the argument

VERIMED, the neurosymbolic pipeline that uses an LLM for translating requirements to SMT formulas and an SMT solver for equivalence checking and counterexample-guided repair.

If this is right

  • Requirements with multiple plausible interpretations produce SMT-inequivalent formalizations that can be detected by bidirectional equivalence checking.
  • Concrete SMT counterexamples raise verified accuracy in counterexample-guided repair from 55.4% to 98.5% on a hemodialysis benchmark.
  • The approach successfully reduces ambiguity-sensitive requirements in open-source hemodialysis safety specifications.
  • Solver queries can expose inconsistency, vacuousness, and safety violations in the formalized requirements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This auditing method could extend to other safety-critical domains such as automotive or aerospace software requirements.
  • Early integration of this technique in the requirements phase might prevent costly errors in later implementation and verification stages.
  • Combining outputs from several different large language models could strengthen the detection of genuine ambiguity versus model noise.

Load-bearing premise

Stochastic variation in the formalizations generated by the LLM reliably indicates genuine ambiguity in the natural-language requirement instead of just the model's own inconsistencies or hallucinations.

What would settle it

Run the formalization process multiple times on a set of requirements that are independently known to be unambiguous and check if the rate of SMT-inequivalent outputs is significantly lower than for requirements known to be ambiguous.

Figures

Figures reproduced from arXiv: 2605.13817 by Bethel Hall, William Eiers.

Figure 1. Architecture of VERIMED.
Figure 2. Ambiguity-resolution loop with a dialysate-temperature example.
Figure 3. Example variable schema and SMT encoding generated from a requirement.
Figure 4. Effect of verifier feedback on question answering. (Left) accepted on the first attempt, fixed …
Figure 5. Mutation detection. (Left) detection outcomes by fault type: circles denote a detected …
Figure 6. Ambiguity detection and resolution via repeated autoformalization. (A) Share of require…
Original abstract

Natural-language software requirements are often ambiguous, inconsistent, and underspecified; in safety-critical domains, these defects propagate into formal models that verify the wrong specification and into implementations that ship unsafe behavior. We show that large language models, equipped with an SMT solver, can audit such requirements: translating them into formal logic, detecting ambiguity through stochastic variation in the generated formalization, and exposing inconsistency, vacuousness, and safety violations through solver queries on the resulting specification. We present VERIMED, a neurosymbolic pipeline that operationalizes this idea for medical-device software requirements, and report two findings. First, stochastic variation across independent formalizations is a signal of ambiguity: requirements that admit multiple plausible interpretations produce SMT-inequivalent formalizations, and bidirectional SMT equivalence checking turns this disagreement into a solver-checkable test. Second, the usefulness of symbolic feedback depends on its granularity: in counterexample-guided repair on a hemodialysis question-answering benchmark, concrete SMT counterexamples raise verified accuracy from 55.4% to 98.5%. Over an extensive experimental evaluation on open-source hemodialysis safety requirements, we show that the LLM-based approach in VERIMED successfully reduces ambiguity-sensitive requirements and enables rigorous auditing of software requirements through SMT-based queries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces VERIMED, a neurosymbolic pipeline that translates natural-language software requirements (focused on medical-device safety) into formal logic using LLMs, detects ambiguity by checking whether multiple stochastic formalizations are bidirectionally SMT-equivalent, and uses SMT queries to expose inconsistency, vacuousness, and safety violations. It reports that this approach reduces ambiguity-sensitive requirements on open-source hemodialysis specs and raises verified accuracy on a hemodialysis QA benchmark from 55.4% to 98.5% via counterexample-guided repair.

Significance. If the central claims hold, the work provides a concrete, solver-backed method for auditing natural-language requirements in safety-critical domains, bridging LLM flexibility with formal verification guarantees. The reported accuracy lift and the use of SMT equivalence as an ambiguity oracle would be notable contributions to neurosymbolic requirements engineering.

major comments (3)
  1. [Abstract / experimental evaluation] The claim that 'requirements that admit multiple plausible interpretations produce SMT-inequivalent formalizations' is load-bearing for the ambiguity-detection component, yet no control experiment on unambiguously worded requirements is described to show that LLM sampling variance alone does not produce inequivalent outputs at a comparable rate.
  2. [Abstract] The accuracy improvement from 55.4% to 98.5% is presented without baseline comparisons, error bars, or details on how post-hoc exclusions and prompt choices were handled, so the magnitude and robustness of the counterexample-guided repair result cannot be assessed from the given information.
  3. [Pipeline description] The ambiguity test treats any SMT-inequivalent pair as evidence of genuine ambiguity, but the manuscript supplies no quantitative bound on the LLM's own inconsistency rate for single-interpretation inputs, leaving open the possibility that the reported reduction in ambiguity-sensitive requirements is inflated by false positives.
minor comments (1)
  1. [Method] Notation for the bidirectional SMT equivalence check should be defined more explicitly (e.g., as a single predicate rather than described procedurally) to aid reproducibility.
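
The single-predicate form the referee asks for might read as follows, with $\varphi$ and $\psi$ the two formalizations over a shared variable schema; this is our reconstruction, not notation from the manuscript.

```latex
\mathrm{Equiv}(\varphi, \psi) \;:\Longleftrightarrow\;
  \mathrm{UNSAT}(\varphi \land \lnot\psi) \,\wedge\,
  \mathrm{UNSAT}(\psi \land \lnot\varphi)
```

Stated this way, the check is symmetric by construction and each conjunct is a single solver call.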

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on VERIMED. We address each major comment below with clarifications and planned revisions to strengthen the experimental validation and reporting.

Point-by-point responses
  1. Referee: [Abstract / experimental evaluation] The claim that 'requirements that admit multiple plausible interpretations produce SMT-inequivalent formalizations' is load-bearing for the ambiguity-detection component, yet no control experiment on unambiguously worded requirements is described to show that LLM sampling variance alone does not produce inequivalent outputs at a comparable rate.

    Authors: We agree that a control experiment is necessary to isolate the contribution of genuine ambiguity from LLM sampling variance. In the revised manuscript we will add a dedicated control using a set of expert-verified unambiguous hemodialysis requirements. For each such requirement we will generate multiple independent formalizations and measure the rate of bidirectional SMT inequivalence; this will provide an empirical upper bound on the false-positive rate of the ambiguity oracle and allow readers to assess the specificity of the reported reductions. revision: yes

  2. Referee: [Abstract] The accuracy improvement from 55.4% to 98.5% is presented without baseline comparisons, error bars, or details on how post-hoc exclusions and prompt choices were handled, so the magnitude and robustness of the counterexample-guided repair result cannot be assessed from the given information.

    Authors: We acknowledge that the abstract and evaluation section require additional context for proper assessment. In the revision we will (i) add comparisons against standard few-shot and chain-of-thought LLM baselines on the same hemodialysis QA benchmark, (ii) report mean accuracy and standard deviation across five independent runs to supply error bars, and (iii) include an appendix with the exact prompt templates, temperature settings, and any filtering criteria applied to the benchmark instances. revision: yes

  3. Referee: [Pipeline description] The ambiguity test treats any SMT-inequivalent pair as evidence of genuine ambiguity, but the manuscript supplies no quantitative bound on the LLM's own inconsistency rate for single-interpretation inputs, leaving open the possibility that the reported reduction in ambiguity-sensitive requirements is inflated by false positives.

    Authors: This observation correctly identifies a missing quantitative calibration. We will augment the pipeline description with a new subsection that measures the LLM inconsistency rate on a curated collection of single-interpretation requirements. The subsection will report the fraction of requirement instances for which multiple stochastic formalizations are SMT-inequivalent, thereby supplying the requested bound and enabling direct evaluation of whether the observed reductions in ambiguity-sensitive requirements remain significant after accounting for model noise. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the neurosymbolic auditing pipeline

Full rationale

The paper's derivation chain translates NL requirements to formal logic via LLM sampling and then applies an independent external SMT solver for bidirectional equivalence and satisfiability queries. The ambiguity signal is presented as an empirical observation tested on hemodialysis requirements rather than defined by construction. The accuracy lift from 55.4% to 98.5% is obtained by feeding concrete SMT counterexamples into a repair loop on an external benchmark; no equation or step reduces this metric to a parameter fitted inside the paper. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear. The pipeline remains self-contained against the external solver and benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The approach rests on the assumption that LLM translations preserve enough meaning for solver queries to be meaningful and that variation across samples tracks real ambiguity; no free parameters are explicitly fitted in the abstract, and no new physical or mathematical entities are introduced.

axioms (2)
  • domain assumption Large language models can produce logically coherent translations of natural-language requirements that are stable enough for equivalence checking to be informative.
    Invoked when stochastic variation is treated as a signal of ambiguity rather than model noise.
  • standard math SMT solvers can decide the relevant properties (equivalence, satisfiability, safety violations) of the generated formulas.
    Standard background for the symbolic component.
invented entities (1)
  • VERIMED no independent evidence
    purpose: Named neurosymbolic pipeline that operationalizes LLM-plus-SMT auditing for medical-device requirements.
    Introduced as the concrete system realizing the described method.

pith-pipeline@v0.9.0 · 5511 in / 1416 out tokens · 35966 ms · 2026-05-14T17:57:24.470155+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 1 internal anchor

  1. [1]

    Verafi: Verified Agentic Financial Intelligence through Neurosymbolic Policy Generation

    A. Akinfaderin and S. Subramanian. Verafi: Verified agentic financial intelligence through neurosymbolic policy generation. arXiv preprint arXiv:2512.14744, 2025.

  2. [2]

    Unsupervised Evaluation of Code LLMs with Round-Trip Correctness

    M. Allamanis, S. Panthaplackel, and P. Yin. Unsupervised evaluation of code LLMs with round-trip correctness. arXiv preprint arXiv:2402.08699, 2024.

  3. [3]

    Faithful Autoformalization via Roundtrip Verification and Repair

    D. Amrollahi, J. Lopez, and C. Barrett. Faithful autoformalization via roundtrip verification and repair. arXiv preprint arXiv:2604.25031, 2026.

  4. [4]

    How to Assure Correctness and Safety of Medical Software: The Hemodialysis Machine Case Study

    P. Arcaini, S. Bonfanti, A. Gargantini, and E. Riccobene. How to assure correctness and safety of medical software: the hemodialysis machine case study. In International Conference on Abstract State Machines, Alloy, B, TLA, VDM, and Z, pages 344–359. Springer, 2016.

  5. [5]

    The Satisfiability Modulo Theories Library (SMT-LIB)

    C. Barrett, P. Fontaine, and C. Tinelli. The satisfiability modulo theories library (SMT-LIB). www.SMT-LIB.org, 2:68, 2016.

  6. [6]

    Requirements Ambiguity Detection and Explanation with LLMs: An Industrial Study

    S. Bashir, A. Ferrari, A. Khan, P. E. Strandberg, Z. Haider, M. Saadatmand, and M. Bohlin. Requirements ambiguity detection and explanation with LLMs: An industrial study. In 2025 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 620–631. IEEE, 2025.

  7. [7]

    A Neurosymbolic Approach to Natural Language Formalization and Verification

    S. Bayless, S. Buliani, D. Cassel, B. Cook, D. Clough, R. Delmas, N. Diallo, F. Erata, N. Feng, D. Giannakopoulou, et al. A neurosymbolic approach to natural language formalization and verification. arXiv preprint arXiv:2511.09008, 2025.

  8. [8]

    A. Beg, D. O’Donoghue, and R. Monahan. Formalising software requirements with large language models. In ADAPT Annual Conference, 2025.

  9. [9]

    J. Cao, Y. Lu, M. Li, H. Ma, H. Li, M. He, C. Wen, L. Sun, H. Zhang, S. Qin, et al. From informal to formal – incorporating and evaluating LLMs on natural language requirements to verifiable formal proofs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26984–27003, 2025.

  10. [10]

    G. Chen, J. Wu, X. Chen, W. X. Zhao, R. Song, C. Li, K. Fan, D. Liu, and M. Liao. Reform: Reflective autoformalization with prospective bounded sequence optimization. arXiv preprint arXiv:2510.24592, 2025.

  11. [11]

    Z3: An Efficient SMT Solver

    L. De Moura and N. Bjørner. Z3: An efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer, 2008.

  12. [12]

    A. O. Gomes and A. Butterfield. Modelling the haemodialysis machine with Circus. In International Conference on Abstract State Machines, Alloy, B, TLA, VDM, and Z, pages 409–424. Springer, 2016.

  13. [13]

    T. S. Hoang, C. Snook, L. Ladenberger, and M. Butler. Validating the requirements and design of a hemodialysis machine using iUML-B, BMotion Studio, and co-simulation. In International Conference on Abstract State Machines, Alloy, B, TLA, VDM, and Z, pages 360–375. Springer, 2016.

  14. [14]

    A. Q. Jiang, S. Welleck, J. P. Zhou, W. Li, J. Liu, M. Jamnik, T. Lacroix, Y. Wu, and G. Lample. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. arXiv preprint arXiv:2210.12283, 2022.

  15. [15]

    B. Kim, A. Ayoub, O. Sokolsky, I. Lee, P. Jones, Y. Zhang, and R. Jetley. Safety-assured development of the GPCA infusion pump software. In Proceedings of the Ninth ACM International Conference on Embedded Software, pages 155–164, 2011.

  16. [16]

    Z. Li, Y. Wu, Z. Li, X. Wei, F. Yang, X. Zhang, and X. Ma. Autoformalize mathematical statements by symbolic equivalence and semantic consistency. Advances in Neural Information Processing Systems, 37:53598–53625, 2024.

  17. [17]

    Y. Liu, S. Yu, H. Jin, J. Wen, A. Qian, T. Lee, M. Ramsis, G. W. Choi, L. Qin, X. Liu, et al. A multi-agent framework combining large language models with medical flowcharts for self-triage. Nature Health, pages 1–10, 2026.

  18. [18]

    Large Language Models Based Automatic Synthesis of Software Specifications

    S. Mandal, A. Chethan, V. Janfaza, S. Mahmud, T. A. Anderson, J. Turek, J. J. Tithi, and A. Muzahid. Large language models based automatic synthesis of software specifications. arXiv preprint arXiv:2304.09181, 2023.

  19. [19]

    The Hemodialysis Machine Case Study

    A. Mashkoor. The hemodialysis machine case study. In International Conference on Abstract State Machines, Alloy, B, TLA, VDM, and Z, pages 329–343. Springer, 2016.

  20. [20]

    F. Mu, L. Shi, S. Wang, Z. Yu, B. Zhang, C. Wang, S. Liu, and Q. Wang. ClarifyGPT: A framework for enhancing LLM-based code generation via requirements clarification. Proceedings of the ACM on Software Engineering, 1(FSE):2332–2354, 2024.

  21. [21]

    Open PCA pump requirements

    Ovasicek. Open PCA pump requirements. https://ovasicek.github.io/pca-pump-ec-artifacts/Open-PCA-Pump-Requirements.pdf, 2016. Accessed: 2026-05-07.

  22. [22]

    M. Q. Riaz, W. H. Butt, and S. Rehman. Automatic detection of ambiguous software requirements: An insight. In 2019 5th International Conference on Information Management (ICIM), pages 1–6. IEEE, 2019.

  23. [23]

    C. Sun, Y. Sheng, O. Padon, and C. Barrett. Clover: Closed-loop verifiable code generation. In International Symposium on AI Verification, pages 134–155. Springer, 2024.

  24. [24]

    MAUDE: Manufacturer and User Facility Device Experience

    U.S. Food and Drug Administration. MAUDE — manufacturer and user facility device experience. https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfmaude/search.cfm, 2026.

  25. [25]

    W. Wang, M. Farrell, L. C. Cordeiro, and L. Zhao. Supporting software formal verification with large language models: An experimental study. In 2025 IEEE 33rd International Requirements Engineering Conference (RE), pages 423–431. IEEE, 2025.

  26. [26]

    J. J. Wu and F. H. Fard. HumanEvalComm: Benchmarking the communication competence of code generation for LLMs and LLM agents. ACM Transactions on Software Engineering and Methodology, 34(7):1–42, 2025.

  27. [27]

    J. J. Wu, M. Chaudhary, D. Abrahamyan, A. Khaku, A. Wei, and F. H. Fard. Can code language models learn clarification-seeking behaviors? arXiv preprint arXiv:2504.16331, 2025.

  28. [28]

    Y. Wu, A. Q. Jiang, W. Li, M. Rabe, C. Staats, M. Jamnik, and C. Szegedy. Autoformalization with large language models. Advances in Neural Information Processing Systems, 35:32353–32368, 2022.

  29. [29]

    LeanDojo: Theorem Proving with Retrieval-Augmented Language Models

    K. Yang, A. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil, R. J. Prenger, and A. Anandkumar. LeanDojo: Theorem proving with retrieval-augmented language models. Advances in Neural Information Processing Systems, 36:21573–21612, 2023.