Neurosymbolic Auditing of Natural-Language Software Requirements
Recognition: 1 Lean theorem link
Pith reviewed 2026-05-14 17:57 UTC · model grok-4.3
The pith
Large language models paired with an SMT solver can translate natural-language software requirements into formal logic and audit them for ambiguity and defects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large language models equipped with an SMT solver can audit natural-language software requirements by translating them into formal logic, detecting ambiguity through stochastic variation in the generated formalizations, and exposing inconsistency, vacuousness, and safety violations through solver queries on the resulting specification.
What carries the argument
VERIMED, the neurosymbolic pipeline that uses an LLM for translating requirements to SMT formulas and an SMT solver for equivalence checking and counterexample-guided repair.
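The disagreement test at the heart of the pipeline can be sketched in miniature. In the sketch below, two stochastic formalizations of one requirement are rendered as Python predicates over boolean variables, and brute-force enumeration stands in for the solver's two UNSAT queries; the variable names and the two readings are hypothetical illustrations, not taken from the paper.

```python
from itertools import product

VARS = ("alarm", "flow_low", "pump_on")

def reading_a(alarm, flow_low, pump_on):
    # One reading: "If flow is low, the alarm sounds and the pump stops."
    return (not flow_low) or (alarm and not pump_on)

def reading_b(alarm, flow_low, pump_on):
    # Alternative reading: "If flow is low, the alarm sounds" (pump unconstrained).
    return (not flow_low) or alarm

def counterexample(f, g):
    """Return an assignment where f and g disagree, or None if equivalent.

    Bidirectional equivalence amounts to UNSAT(f & ~g) and UNSAT(g & ~f);
    enumeration over all assignments plays the solver's role here.
    """
    for values in product((False, True), repeat=len(VARS)):
        env = dict(zip(VARS, values))
        if f(**env) != g(**env):
            return env
    return None

witness = counterexample(reading_a, reading_b)
print("inequivalent" if witness else "equivalent", witness)
```

Any witness assignment is a concrete state on which the two readings prescribe different behavior, which is exactly the solver-checkable evidence of ambiguity the pipeline relies on.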
If this is right
- Requirements with multiple plausible interpretations produce SMT-inequivalent formalizations that can be detected by bidirectional equivalence checking.
- Concrete SMT counterexamples raise verified accuracy in counterexample-guided repair from 55.4% to 98.5% on a hemodialysis benchmark.
- The approach reduces the number of ambiguity-sensitive requirements in open-source hemodialysis safety specifications.
- Solver queries can expose inconsistency, vacuousness, and safety violations in the formalized requirements.
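The solver queries named above can be illustrated under the same simplification: propositional stand-ins for requirements, with exhaustive enumeration in place of an SMT solver. The requirement names and formulas below are invented for illustration.

```python
from itertools import product

VARS = ("alarm", "flow_low")

# Hypothetical formalized requirements over boolean state variables.
req_alarm_on_low = lambda alarm, flow_low: (not flow_low) or alarm
req_never_low = lambda alarm, flow_low: not flow_low  # environment assumption

def sat(formulas):
    """True iff some assignment satisfies every formula simultaneously."""
    for values in product((False, True), repeat=len(VARS)):
        env = dict(zip(VARS, values))
        if all(f(**env) for f in formulas):
            return True
    return False

# Inconsistency: the conjunction of all requirements is unsatisfiable.
inconsistent = not sat([req_alarm_on_low, req_never_low])

# Vacuousness of an implication: its antecedent is unreachable given the
# other requirements, so the requirement constrains nothing.
antecedent = lambda alarm, flow_low: flow_low
vacuous = not sat([req_never_low, antecedent])

print(f"inconsistent={inconsistent}, vacuous={vacuous}")
```

A safety-violation query follows the same pattern: conjoin the requirements with the negation of a safety property and ask for a satisfying assignment, which, if found, is a concrete unsafe state.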
Where Pith is reading between the lines
- This auditing method could extend to other safety-critical domains such as automotive or aerospace software requirements.
- Early integration of this technique in the requirements phase might prevent costly errors in later implementation and verification stages.
- Combining outputs from several different large language models could strengthen the detection of genuine ambiguity versus model noise.
Load-bearing premise
Stochastic variation in the formalizations generated by the LLM reliably indicates genuine ambiguity in the natural-language requirement instead of just the model's own inconsistencies or hallucinations.
What would settle it
Run the formalization process multiple times on a set of requirements that are independently known to be unambiguous and check if the rate of SMT-inequivalent outputs is significantly lower than for requirements known to be ambiguous.
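A minimal sketch of that protocol, with a stubbed sampler standing in for the LLM; the flip rate and sample counts are illustrative assumptions, not measured values.

```python
import random

random.seed(0)

def sample_formalization(ambiguous, flip_rate):
    """Stub LLM: an ambiguous requirement yields one of two readings at random;
    an unambiguous one yields a single reading, perturbed only by model noise."""
    if ambiguous:
        return random.choice(("reading_a", "reading_b"))
    return "reading_a" if random.random() > flip_rate else "reading_noise"

def inequivalence_rate(ambiguous, n_reqs=200, n_samples=5, flip_rate=0.02):
    """Fraction of requirements flagged ambiguous: any SMT-inequivalent pair
    among the samples (here, any two distinct readings) flags the requirement."""
    flagged = 0
    for _ in range(n_reqs):
        samples = {sample_formalization(ambiguous, flip_rate) for _ in range(n_samples)}
        if len(samples) > 1:
            flagged += 1
    return flagged / n_reqs

control = inequivalence_rate(ambiguous=False)   # empirical false-positive rate
treatment = inequivalence_rate(ambiguous=True)
print(f"unambiguous: {control:.2f}, ambiguous: {treatment:.2f}")
```

The control rate bounds how often model noise alone triggers the ambiguity oracle; the claim survives only if the treatment rate clearly exceeds it.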
Original abstract
Natural-language software requirements are often ambiguous, inconsistent, and underspecified; in safety-critical domains, these defects propagate into formal models that verify the wrong specification and into implementations that ship unsafe behavior. We show that large language models, equipped with an SMT solver, can audit such requirements: translating them into formal logic, detecting ambiguity through stochastic variation in the generated formalization, and exposing inconsistency, vacuousness, and safety violations through solver queries on the resulting specification. We present VERIMED, a neurosymbolic pipeline that operationalizes this idea for medical-device software requirements, and report two findings. First, stochastic variation across independent formalizations is a signal of ambiguity: requirements that admit multiple plausible interpretations produce SMT-inequivalent formalizations, and bidirectional SMT equivalence checking turns this disagreement into a solver-checkable test. Second, the usefulness of symbolic feedback depends on its granularity: in counterexample-guided repair on a hemodialysis question-answering benchmark, concrete SMT counterexamples raise verified accuracy from 55.4% to 98.5%. Over an extensive experimental evaluation on open-source hemodialysis safety requirements, we show that the LLM-based approach in VERIMED successfully reduces ambiguity-sensitive requirements and enables rigorous auditing of software requirements through SMT-based queries.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VERIMED, a neurosymbolic pipeline that translates natural-language software requirements (focused on medical-device safety) into formal logic using LLMs, detects ambiguity by checking whether multiple stochastic formalizations are bidirectionally SMT-equivalent, and uses SMT queries to expose inconsistency, vacuousness, and safety violations. It reports that this approach reduces ambiguity-sensitive requirements on open-source hemodialysis specs and raises verified accuracy on a hemodialysis QA benchmark from 55.4% to 98.5% via counterexample-guided repair.
Significance. If the central claims hold, the work provides a concrete, solver-backed method for auditing natural-language requirements in safety-critical domains, bridging LLM flexibility with formal verification guarantees. The reported accuracy lift and the use of SMT equivalence as an ambiguity oracle would be notable contributions to neurosymbolic requirements engineering.
major comments (3)
- [Abstract / experimental evaluation] Abstract and experimental evaluation: the claim that 'requirements that admit multiple plausible interpretations produce SMT-inequivalent formalizations' is load-bearing for the ambiguity-detection component, yet no control experiment on unambiguously worded requirements is described to show that LLM sampling variance alone does not produce inequivalent outputs at a comparable rate.
- [Abstract] Abstract: the accuracy improvement from 55.4% to 98.5% is presented without baseline comparisons, error bars, or details on how post-hoc exclusions and prompt choices were handled, so the magnitude and robustness of the counterexample-guided repair result cannot be assessed from the given information.
- [Pipeline description] Pipeline description: the ambiguity test treats any SMT-inequivalent pair as evidence of genuine ambiguity, but the manuscript supplies no quantitative bound on the LLM's own inconsistency rate for single-interpretation inputs, leaving open the possibility that the reported reduction in ambiguity-sensitive requirements is inflated by false positives.
minor comments (1)
- [Method] Notation for the bidirectional SMT equivalence check should be defined more explicitly (e.g., as a single predicate rather than described procedurally) to aid reproducibility.
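One way to meet that request, assuming formalizations \(\varphi\) and \(\psi\) over a shared signature, is to package the two solver calls into a single predicate:

```latex
\mathrm{Equiv}(\varphi, \psi) \;:\Longleftrightarrow\;
\mathrm{UNSAT}(\varphi \land \lnot\psi) \;\land\; \mathrm{UNSAT}(\psi \land \lnot\varphi)
```

Each conjunct certifies one direction of implication, so the predicate holds exactly when the two formalizations agree on every model.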
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on VERIMED. We address each major comment below with clarifications and planned revisions to strengthen the experimental validation and reporting.
Point-by-point responses
Referee: [Abstract / experimental evaluation] Abstract and experimental evaluation: the claim that 'requirements that admit multiple plausible interpretations produce SMT-inequivalent formalizations' is load-bearing for the ambiguity-detection component, yet no control experiment on unambiguously worded requirements is described to show that LLM sampling variance alone does not produce inequivalent outputs at a comparable rate.
Authors: We agree that a control experiment is necessary to isolate the contribution of genuine ambiguity from LLM sampling variance. In the revised manuscript we will add a dedicated control using a set of expert-verified unambiguous hemodialysis requirements. For each such requirement we will generate multiple independent formalizations and measure the rate of bidirectional SMT inequivalence; this will provide an empirical upper bound on the false-positive rate of the ambiguity oracle and allow readers to assess the specificity of the reported reductions. revision: yes
Referee: [Abstract] Abstract: the accuracy improvement from 55.4% to 98.5% is presented without baseline comparisons, error bars, or details on how post-hoc exclusions and prompt choices were handled, so the magnitude and robustness of the counterexample-guided repair result cannot be assessed from the given information.
Authors: We acknowledge that the abstract and evaluation section require additional context for proper assessment. In the revision we will (i) add comparisons against standard few-shot and chain-of-thought LLM baselines on the same hemodialysis QA benchmark, (ii) report mean accuracy and standard deviation across five independent runs to supply error bars, and (iii) include an appendix with the exact prompt templates, temperature settings, and any filtering criteria applied to the benchmark instances. revision: yes
Referee: [Pipeline description] Pipeline description: the ambiguity test treats any SMT-inequivalent pair as evidence of genuine ambiguity, but the manuscript supplies no quantitative bound on the LLM's own inconsistency rate for single-interpretation inputs, leaving open the possibility that the reported reduction in ambiguity-sensitive requirements is inflated by false positives.
Authors: This observation correctly identifies a missing quantitative calibration. We will augment the pipeline description with a new subsection that measures the LLM inconsistency rate on a curated collection of single-interpretation requirements. The subsection will report the fraction of requirement instances for which multiple stochastic formalizations are SMT-inequivalent, thereby supplying the requested bound and enabling direct evaluation of whether the observed reductions in ambiguity-sensitive requirements remain significant after accounting for model noise. revision: yes
Circularity Check
No significant circularity in the neurosymbolic auditing pipeline
Full rationale
The paper's derivation chain translates NL requirements to formal logic via LLM sampling and then applies an independent external SMT solver for bidirectional equivalence and satisfiability queries. The ambiguity signal is presented as an empirical observation tested on hemodialysis requirements rather than defined by construction. The accuracy lift from 55.4% to 98.5% is obtained by feeding concrete SMT counterexamples into a repair loop on an external benchmark; no equation or step reduces this metric to a parameter fitted inside the paper. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear. The pipeline remains self-contained against the external solver and benchmark.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Large language models can produce logically coherent translations of natural-language requirements that are stable enough for equivalence checking to be informative.
- standard math SMT solvers can decide the relevant properties (equivalence, satisfiability, safety violations) of the generated formulas.
invented entities (1)
- VERIMED (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean (LogicNat recovery and embed_injective) tagged: unclear
Relation between the paper passage and the cited Recognition theorem is unclear. Paper passage: "stochastic variation across independent formalizations is a signal of ambiguity: requirements that admit multiple plausible interpretations produce SMT-inequivalent formalizations, and bidirectional SMT equivalence checking turns this disagreement into a solver-checkable test"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] A. Akinfaderin and S. Subramanian. Verafi: Verified agentic financial intelligence through neurosymbolic policy generation. arXiv preprint arXiv:2512.14744, 2025.
- [2] M. Allamanis, S. Panthaplackel, and P. Yin. Unsupervised evaluation of code LLMs with round-trip correctness. arXiv preprint arXiv:2402.08699, 2024.
- [3] D. Amrollahi, J. Lopez, and C. Barrett. Faithful autoformalization via roundtrip verification and repair. arXiv preprint arXiv:2604.25031, 2026.
- [4] P. Arcaini, S. Bonfanti, A. Gargantini, and E. Riccobene. How to assure correctness and safety of medical software: the hemodialysis machine case study. In International Conference on Abstract State Machines, Alloy, B, TLA, VDM, and Z, pages 344–359. Springer, 2016.
- [5] C. Barrett, P. Fontaine, and C. Tinelli. The Satisfiability Modulo Theories Library (SMT-LIB). www.SMT-LIB.org, 2:68, 2016.
- [6] S. Bashir, A. Ferrari, A. Khan, P. E. Strandberg, Z. Haider, M. Saadatmand, and M. Bohlin. Requirements ambiguity detection and explanation with LLMs: An industrial study. In 2025 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 620–631. IEEE, 2025.
- [7] S. Bayless, S. Buliani, D. Cassel, B. Cook, D. Clough, R. Delmas, N. Diallo, F. Erata, N. Feng, D. Giannakopoulou, et al. A neurosymbolic approach to natural language formalization and verification. arXiv preprint arXiv:2511.09008, 2025.
- [8] A. Beg, D. O'Donoghue, and R. Monahan. Formalising software requirements with large language models. In ADAPT Annual Conference, 2025.
- [9] J. Cao, Y. Lu, M. Li, H. Ma, H. Li, M. He, C. Wen, L. Sun, H. Zhang, S. Qin, et al. From informal to formal: incorporating and evaluating LLMs on natural language requirements to verifiable formal proofs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26984–27003, 2025.
- [10]
- [11] L. De Moura and N. Bjørner. Z3: An efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer, 2008.
- [12] A. O. Gomes and A. Butterfield. Modelling the haemodialysis machine with Circus. In International Conference on Abstract State Machines, Alloy, B, TLA, VDM, and Z, pages 409–424. Springer, 2016.
- [13] T. S. Hoang, C. Snook, L. Ladenberger, and M. Butler. Validating the requirements and design of a hemodialysis machine using iUML-B, BMotion Studio, and co-simulation. In International Conference on Abstract State Machines, Alloy, B, TLA, VDM, and Z, pages 360–375. Springer, 2016.
- [14]
- [15] B. Kim, A. Ayoub, O. Sokolsky, I. Lee, P. Jones, Y. Zhang, and R. Jetley. Safety-assured development of the GPCA infusion pump software. In Proceedings of the Ninth ACM International Conference on Embedded Software, pages 155–164, 2011.
- [16] Z. Li, Y. Wu, Z. Li, X. Wei, F. Yang, X. Zhang, and X. Ma. Autoformalize mathematical statements by symbolic equivalence and semantic consistency. Advances in Neural Information Processing Systems, 37:53598–53625, 2024.
- [17] Y. Liu, S. Yu, H. Jin, J. Wen, A. Qian, T. Lee, M. Ramsis, G. W. Choi, L. Qin, X. Liu, et al. A multi-agent framework combining large language models with medical flowcharts for self-triage. Nature Health, pages 1–10, 2026.
- [18]
- [19]
- [20] F. Mu, L. Shi, S. Wang, Z. Yu, B. Zhang, C. Wang, S. Liu, and Q. Wang. ClarifyGPT: A framework for enhancing LLM-based code generation via requirements clarification. Proceedings of the ACM on Software Engineering, 1(FSE):2332–2354, 2024.
- [21] Ovasicek. Open PCA pump requirements. https://ovasicek.github.io/pca-pump-ec-artifacts/Open-PCA-Pump-Requirements.pdf, 2016. Accessed: 2026-05-07.
- [22] M. Q. Riaz, W. H. Butt, and S. Rehman. Automatic detection of ambiguous software requirements: An insight. In 2019 5th International Conference on Information Management (ICIM), pages 1–6. IEEE, 2019.
- [23] C. Sun, Y. Sheng, O. Padon, and C. Barrett. Clover: Closed-loop verifiable code generation. In International Symposium on AI Verification, pages 134–155. Springer, 2024.
- [24] U.S. Food and Drug Administration. MAUDE: Manufacturer and User Facility Device Experience. https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfmaude/search.cfm, 2026.
- [25] W. Wang, M. Farrell, L. C. Cordeiro, and L. Zhao. Supporting software formal verification with large language models: An experimental study. In 2025 IEEE 33rd International Requirements Engineering Conference (RE), pages 423–431. IEEE, 2025.
- [26] J. J. Wu and F. H. Fard. HumanEvalComm: Benchmarking the communication competence of code generation for LLMs and LLM agents. ACM Transactions on Software Engineering and Methodology, 34(7):1–42, 2025.
- [27]
- [28] Y. Wu, A. Q. Jiang, W. Li, M. Rabe, C. Staats, M. Jamnik, and C. Szegedy. Autoformalization with large language models. Advances in Neural Information Processing Systems, 35:32353–32368, 2022.
- [29] K. Yang, A. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil, R. J. Prenger, and A. Anandkumar. LeanDojo: Theorem proving with retrieval-augmented language models. Advances in Neural Information Processing Systems, 36:21573–21612, 2023.
discussion (0)