pith. machine review for the scientific record.

arxiv: 2605.13817 · v1 · submitted 2026-05-13 · 💻 cs.SE · cs.AI

Recognition: 1 theorem link · Lean Theorem

Neurosymbolic Auditing of Natural-Language Software Requirements

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 17:57 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords neurosymbolic auditing · natural language requirements · ambiguity detection · SMT solver · formal verification · medical device software · LLM translation · software safety

The pith

Large language models with an SMT solver can translate and audit natural-language software requirements for ambiguity and defects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Natural-language software requirements often hide ambiguities and inconsistencies that cause problems in safety-critical systems like medical devices. This paper shows that large language models can convert these requirements into formal logical statements, where disagreement among multiple independent translations points to ambiguous requirements. An SMT solver then examines these statements to find inconsistencies, unnecessary conditions, and potential safety violations. The VERIMED system implements this for hemodialysis requirements and demonstrates that feeding the solver's concrete counterexamples back into the repair loop raises verified accuracy from 55.4 percent to 98.5 percent.

Core claim

Large language models equipped with an SMT solver can audit natural-language software requirements by translating them into formal logic, detecting ambiguity through stochastic variation in the generated formalizations, and exposing inconsistency, vacuousness, and safety violations through solver queries on the resulting specification.

What carries the argument

VERIMED, the neurosymbolic pipeline that uses an LLM for translating requirements to SMT formulas and an SMT solver for equivalence checking and counterexample-guided repair.

If this is right

  • Requirements with multiple plausible interpretations produce SMT-inequivalent formalizations that can be detected by bidirectional equivalence checking.
  • Concrete SMT counterexamples raise verified accuracy in counterexample-guided repair from 55.4% to 98.5% on a hemodialysis benchmark.
  • The approach successfully reduces ambiguity-sensitive requirements in open-source hemodialysis safety specifications.
  • Solver queries can expose inconsistency, vacuousness, and safety violations in the formalized requirements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This auditing method could extend to other safety-critical domains such as automotive or aerospace software requirements.
  • Early integration of this technique in the requirements phase might prevent costly errors in later implementation and verification stages.
  • Combining outputs from several different large language models could strengthen the detection of genuine ambiguity versus model noise.

Load-bearing premise

Stochastic variation in the formalizations generated by the LLM reliably indicates genuine ambiguity in the natural-language requirement instead of just the model's own inconsistencies or hallucinations.

What would settle it

Run the formalization process multiple times on a set of requirements that are independently known to be unambiguous and check if the rate of SMT-inequivalent outputs is significantly lower than for requirements known to be ambiguous.

Figures

Figures reproduced from arXiv: 2605.13817 by Bethel Hall, William Eiers.

Figure 1. Architecture of VERIMED.
Figure 2. Ambiguity-resolution loop with a dialysate-temperature example.
Figure 3. Example variable schema and SMT encoding generated from a requirement.
Figure 4. Effect of verifier feedback on question answering. (Left) accepted on the first attempt, fixed …
Figure 5. Mutation detection. (Left) detection outcomes by fault type: circles denote a detected …
Figure 6. Ambiguity detection and resolution via repeated autoformalization. (A) Share of require…
Original abstract

Natural-language software requirements are often ambiguous, inconsistent, and underspecified; in safety-critical domains, these defects propagate into formal models that verify the wrong specification and into implementations that ship unsafe behavior. We show that large language models, equipped with an SMT solver, can audit such requirements: translating them into formal logic, detecting ambiguity through stochastic variation in the generated formalization, and exposing inconsistency, vacuousness, and safety violations through solver queries on the resulting specification. We present VERIMED, a neurosymbolic pipeline that operationalizes this idea for medical-device software requirements, and report two findings. First, stochastic variation across independent formalizations is a signal of ambiguity: requirements that admit multiple plausible interpretations produce SMT-inequivalent formalizations, and bidirectional SMT equivalence checking turns this disagreement into a solver-checkable test. Second, the usefulness of symbolic feedback depends on its granularity: in counterexample-guided repair on a hemodialysis question-answering benchmark, concrete SMT counterexamples raise verified accuracy from 55.4% to 98.5%. Over an extensive experimental evaluation on open-source hemodialysis safety requirements, we show that the LLM-based approach in VERIMED successfully reduces ambiguity-sensitive requirements and enables rigorous auditing of software requirements through SMT-based queries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces VERIMED, a neurosymbolic pipeline that translates natural-language software requirements (focused on medical-device safety) into formal logic using LLMs, detects ambiguity by checking whether multiple stochastic formalizations are bidirectionally SMT-equivalent, and uses SMT queries to expose inconsistency, vacuousness, and safety violations. It reports that this approach reduces ambiguity-sensitive requirements on open-source hemodialysis specs and raises verified accuracy on a hemodialysis QA benchmark from 55.4% to 98.5% via counterexample-guided repair.

Significance. If the central claims hold, the work provides a concrete, solver-backed method for auditing natural-language requirements in safety-critical domains, bridging LLM flexibility with formal verification guarantees. The reported accuracy lift and the use of SMT equivalence as an ambiguity oracle would be notable contributions to neurosymbolic requirements engineering.

major comments (3)
  1. [Abstract / experimental evaluation] The claim that 'requirements that admit multiple plausible interpretations produce SMT-inequivalent formalizations' is load-bearing for the ambiguity-detection component, yet no control experiment on unambiguously worded requirements is described to show that LLM sampling variance alone does not produce inequivalent outputs at a comparable rate.
  2. [Abstract] The accuracy improvement from 55.4% to 98.5% is presented without baseline comparisons, error bars, or details on how post-hoc exclusions and prompt choices were handled, so the magnitude and robustness of the counterexample-guided repair result cannot be assessed from the given information.
  3. [Pipeline description] The ambiguity test treats any SMT-inequivalent pair as evidence of genuine ambiguity, but the manuscript supplies no quantitative bound on the LLM's own inconsistency rate for single-interpretation inputs, leaving open the possibility that the reported reduction in ambiguity-sensitive requirements is inflated by false positives.
minor comments (1)
  1. [Method] Notation for the bidirectional SMT equivalence check should be defined more explicitly (e.g., as a single predicate rather than described procedurally) to aid reproducibility.
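
The single-predicate form the referee asks for might read as follows, with $\varphi$ and $\psi$ the two formalizations over a shared variable schema; this is our reconstruction, not notation from the manuscript.

```latex
\mathrm{Equiv}(\varphi, \psi) \;:\Longleftrightarrow\;
  \mathrm{UNSAT}(\varphi \land \lnot\psi) \,\wedge\,
  \mathrm{UNSAT}(\psi \land \lnot\varphi)
```

Stated this way, the check is symmetric by construction and each conjunct is a single solver call.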

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on VERIMED. We address each major comment below with clarifications and planned revisions to strengthen the experimental validation and reporting.

Point-by-point responses
  1. Referee: [Abstract / experimental evaluation] The claim that 'requirements that admit multiple plausible interpretations produce SMT-inequivalent formalizations' is load-bearing for the ambiguity-detection component, yet no control experiment on unambiguously worded requirements is described to show that LLM sampling variance alone does not produce inequivalent outputs at a comparable rate.

    Authors: We agree that a control experiment is necessary to isolate the contribution of genuine ambiguity from LLM sampling variance. In the revised manuscript we will add a dedicated control using a set of expert-verified unambiguous hemodialysis requirements. For each such requirement we will generate multiple independent formalizations and measure the rate of bidirectional SMT inequivalence; this will provide an empirical upper bound on the false-positive rate of the ambiguity oracle and allow readers to assess the specificity of the reported reductions. revision: yes

  2. Referee: [Abstract] The accuracy improvement from 55.4% to 98.5% is presented without baseline comparisons, error bars, or details on how post-hoc exclusions and prompt choices were handled, so the magnitude and robustness of the counterexample-guided repair result cannot be assessed from the given information.

    Authors: We acknowledge that the abstract and evaluation section require additional context for proper assessment. In the revision we will (i) add comparisons against standard few-shot and chain-of-thought LLM baselines on the same hemodialysis QA benchmark, (ii) report mean accuracy and standard deviation across five independent runs to supply error bars, and (iii) include an appendix with the exact prompt templates, temperature settings, and any filtering criteria applied to the benchmark instances. revision: yes

  3. Referee: [Pipeline description] The ambiguity test treats any SMT-inequivalent pair as evidence of genuine ambiguity, but the manuscript supplies no quantitative bound on the LLM's own inconsistency rate for single-interpretation inputs, leaving open the possibility that the reported reduction in ambiguity-sensitive requirements is inflated by false positives.

    Authors: This observation correctly identifies a missing quantitative calibration. We will augment the pipeline description with a new subsection that measures the LLM inconsistency rate on a curated collection of single-interpretation requirements. The subsection will report the fraction of requirement instances for which multiple stochastic formalizations are SMT-inequivalent, thereby supplying the requested bound and enabling direct evaluation of whether the observed reductions in ambiguity-sensitive requirements remain significant after accounting for model noise. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the neurosymbolic auditing pipeline

Full rationale

The paper's derivation chain translates NL requirements to formal logic via LLM sampling and then applies an independent external SMT solver for bidirectional equivalence and satisfiability queries. The ambiguity signal is presented as an empirical observation tested on hemodialysis requirements rather than defined by construction. The accuracy lift from 55.4% to 98.5% is obtained by feeding concrete SMT counterexamples into a repair loop on an external benchmark; no equation or step reduces this metric to a parameter fitted inside the paper. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear. The pipeline remains self-contained against the external solver and benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The approach rests on the assumption that LLM translations preserve enough meaning for solver queries to be meaningful and that variation across samples tracks real ambiguity; no free parameters are explicitly fitted in the abstract, and no new physical or mathematical entities are introduced.

axioms (2)
  • domain assumption Large language models can produce logically coherent translations of natural-language requirements that are stable enough for equivalence checking to be informative.
    Invoked when stochastic variation is treated as a signal of ambiguity rather than model noise.
  • standard math SMT solvers can decide the relevant properties (equivalence, satisfiability, safety violations) of the generated formulas.
    Standard background for the symbolic component.
invented entities (1)
  • VERIMED no independent evidence
    purpose: Named neurosymbolic pipeline that operationalizes LLM-plus-SMT auditing for medical-device requirements.
    Introduced as the concrete system realizing the described method.

pith-pipeline@v0.9.0 · 5511 in / 1416 out tokens · 35966 ms · 2026-05-14T17:57:24.470155+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 1 internal anchor

  1. [1]

    Verafi: Verified Agentic Financial Intelligence through Neurosymbolic Policy Generation

    A. Akinfaderin and S. Subramanian. Verafi: Verified agentic financial intelligence through neurosymbolic policy generation. arXiv preprint arXiv:2512.14744, 2025.

  2. [2]

    Unsupervised Evaluation of Code LLMs with Round-Trip Correctness

    M. Allamanis, S. Panthaplackel, and P. Yin. Unsupervised evaluation of code LLMs with round-trip correctness. arXiv preprint arXiv:2402.08699, 2024.

  3. [3]

    Faithful Autoformalization via Roundtrip Verification and Repair

    D. Amrollahi, J. Lopez, and C. Barrett. Faithful autoformalization via roundtrip verification and repair. arXiv preprint arXiv:2604.25031, 2026.

  4. [4]

    How to Assure Correctness and Safety of Medical Software: The Hemodialysis Machine Case Study

    P. Arcaini, S. Bonfanti, A. Gargantini, and E. Riccobene. How to assure correctness and safety of medical software: the hemodialysis machine case study. In International Conference on Abstract State Machines, Alloy, B, TLA, VDM, and Z, pages 344–359. Springer, 2016.

  5. [5]

    The Satisfiability Modulo Theories Library (SMT-LIB)

    C. Barrett, P. Fontaine, and C. Tinelli. The satisfiability modulo theories library (SMT-LIB). www.SMT-LIB.org, 2:68, 2016.

  6. [6]

    Requirements Ambiguity Detection and Explanation with LLMs: An Industrial Study

    S. Bashir, A. Ferrari, A. Khan, P. E. Strandberg, Z. Haider, M. Saadatmand, and M. Bohlin. Requirements ambiguity detection and explanation with LLMs: An industrial study. In 2025 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 620–631. IEEE, 2025.

  7. [7]

    A Neurosymbolic Approach to Natural Language Formalization and Verification

    S. Bayless, S. Buliani, D. Cassel, B. Cook, D. Clough, R. Delmas, N. Diallo, F. Erata, N. Feng, D. Giannakopoulou, et al. A neurosymbolic approach to natural language formalization and verification. arXiv preprint arXiv:2511.09008, 2025.

  8. [8]

    A. Beg, D. O’Donoghue, and R. Monahan. Formalising software requirements with large language models. In ADAPT Annual Conference, 2025.

  9. [9]

    J. Cao, Y. Lu, M. Li, H. Ma, H. Li, M. He, C. Wen, L. Sun, H. Zhang, S. Qin, et al. From informal to formal – incorporating and evaluating LLMs on natural language requirements to verifiable formal proofs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26984–27003, 2025.

  10. [10]

    G. Chen, J. Wu, X. Chen, W. X. Zhao, R. Song, C. Li, K. Fan, D. Liu, and M. Liao. Reform: Reflective autoformalization with prospective bounded sequence optimization. arXiv preprint arXiv:2510.24592, 2025.

  11. [11]

    Z3: An Efficient SMT Solver

    L. De Moura and N. Bjørner. Z3: An efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer, 2008.

  12. [12]

    A. O. Gomes and A. Butterfield. Modelling the haemodialysis machine with Circus. In International Conference on Abstract State Machines, Alloy, B, TLA, VDM, and Z, pages 409–424. Springer, 2016.

  13. [13]

    T. S. Hoang, C. Snook, L. Ladenberger, and M. Butler. Validating the requirements and design of a hemodialysis machine using iUML-B, BMotion Studio, and co-simulation. In International Conference on Abstract State Machines, Alloy, B, TLA, VDM, and Z, pages 360–375. Springer, 2016.

  14. [14]

    A. Q. Jiang, S. Welleck, J. P. Zhou, W. Li, J. Liu, M. Jamnik, T. Lacroix, Y. Wu, and G. Lample. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. arXiv preprint arXiv:2210.12283, 2022.

  15. [15]

    B. Kim, A. Ayoub, O. Sokolsky, I. Lee, P. Jones, Y. Zhang, and R. Jetley. Safety-assured development of the GPCA infusion pump software. In Proceedings of the Ninth ACM International Conference on Embedded Software, pages 155–164, 2011.

  16. [16]

    Z. Li, Y. Wu, Z. Li, X. Wei, F. Yang, X. Zhang, and X. Ma. Autoformalize mathematical statements by symbolic equivalence and semantic consistency. Advances in Neural Information Processing Systems, 37:53598–53625, 2024.

  17. [17]

    Y. Liu, S. Yu, H. Jin, J. Wen, A. Qian, T. Lee, M. Ramsis, G. W. Choi, L. Qin, X. Liu, et al. A multi-agent framework combining large language models with medical flowcharts for self-triage. Nature Health, pages 1–10, 2026.

  18. [18]

    Large Language Models Based Automatic Synthesis of Software Specifications

    S. Mandal, A. Chethan, V. Janfaza, S. Mahmud, T. A. Anderson, J. Turek, J. J. Tithi, and A. Muzahid. Large language models based automatic synthesis of software specifications. arXiv preprint arXiv:2304.09181, 2023.

  19. [19]

    The Hemodialysis Machine Case Study

    A. Mashkoor. The hemodialysis machine case study. In International Conference on Abstract State Machines, Alloy, B, TLA, VDM, and Z, pages 329–343. Springer, 2016.

  20. [20]

    F. Mu, L. Shi, S. Wang, Z. Yu, B. Zhang, C. Wang, S. Liu, and Q. Wang. ClarifyGPT: A framework for enhancing LLM-based code generation via requirements clarification. Proceedings of the ACM on Software Engineering, 1(FSE):2332–2354, 2024.

  21. [21]

    Open PCA pump requirements

    Ovasicek. Open PCA pump requirements. https://ovasicek.github.io/pca-pump-ec-artifacts/Open-PCA-Pump-Requirements.pdf, 2016. Accessed: 2026-05-07.

  22. [22]

    M. Q. Riaz, W. H. Butt, and S. Rehman. Automatic detection of ambiguous software requirements: An insight. In 2019 5th International Conference on Information Management (ICIM), pages 1–6. IEEE, 2019.

  23. [23]

    C. Sun, Y. Sheng, O. Padon, and C. Barrett. Clover: Closed-loop verifiable code generation. In International Symposium on AI Verification, pages 134–155. Springer, 2024.

  24. [24]

    MAUDE: Manufacturer and User Facility Device Experience

    U.S. Food and Drug Administration. MAUDE — manufacturer and user facility device experience. https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfmaude/search.cfm, 2026.

  25. [25]

    W. Wang, M. Farrell, L. C. Cordeiro, and L. Zhao. Supporting software formal verification with large language models: An experimental study. In 2025 IEEE 33rd International Requirements Engineering Conference (RE), pages 423–431. IEEE, 2025.

  26. [26]

    J. J. Wu and F. H. Fard. HumanEvalComm: Benchmarking the communication competence of code generation for LLMs and LLM agents. ACM Transactions on Software Engineering and Methodology, 34(7):1–42, 2025.

  27. [27]

    J. J. Wu, M. Chaudhary, D. Abrahamyan, A. Khaku, A. Wei, and F. H. Fard. Can code language models learn clarification-seeking behaviors? arXiv preprint arXiv:2504.16331, 2025.

  28. [28]

    Y. Wu, A. Q. Jiang, W. Li, M. Rabe, C. Staats, M. Jamnik, and C. Szegedy. Autoformalization with large language models. Advances in Neural Information Processing Systems, 35:32353–32368, 2022.

  29. [29]

    LeanDojo: Theorem Proving with Retrieval-Augmented Language Models

    K. Yang, A. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil, R. J. Prenger, and A. Anandkumar. LeanDojo: Theorem proving with retrieval-augmented language models. Advances in Neural Information Processing Systems, 36:21573–21612, 2023.