Towards an Agentic LLM-based Approach to Requirement Formalization from Unstructured Specifications
Pith reviewed 2026-05-10 04:32 UTC · model grok-4.3
The pith
A modular agentic LLM pipeline extracts formal verification properties from unstructured natural language specifications with 77.8% accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that their agentic LLM-based pipeline, which integrates requirement extraction, compatibility filtering with respect to a target formalism, and translation into formal properties, generates syntactically and semantically aligned formal properties from unstructured specifications, as demonstrated by 77.8% accuracy in three experimental scenarios. By explicitly incorporating modeling and verification constraints, the approach helps connect informal system descriptions to semantically meaningful formal verification.
What carries the argument
The modular pipeline performing requirement extraction, compatibility filtering, and LLM-based translation into formal properties.
If this is right
- Enables derivation of verification-ready properties directly from natural language at the start of safety-critical system development.
- Reduces the manual effort and potential errors in translating informal requirements to formal artifacts.
- Produces properties that account for both modeling and verification constraints of the target formalism.
- Supports bridging the gap between unstructured specifications and formal verification through AI assistance.
Where Pith is reading between the lines
- The pipeline could be combined with automated model checkers to provide immediate feedback on potential safety issues in initial designs.
- It may generalize to additional formalisms or application domains not tested in the three scenarios.
- Future work could add targeted human review at the filtering step to further improve reliability in high-stakes contexts.
Load-bearing premise
The steps of compatibility filtering and LLM translation together maintain the semantic alignment between the informal requirements and the generated formal properties without requiring further human validation.
What would settle it
A case where the generated formal property allows execution traces or states that contradict the safety or functional intent stated in the original natural language specification.
Figures
read the original abstract
Early-stage specifications of safety-critical systems are typically expressed in natural language, making it difficult to derive formal properties suitable for verification and needed to guarantee safety. While recent Large Language Model (LLM)-based approaches can generate formal artifacts from text, they mainly focus on syntactic correctness and do not ensure semantic alignment between informal requirements and formally verifiable properties. We propose an agentic methodology that automatically extracts verification-ready properties from unstructured specifications. The modular pipeline combines requirement extraction, compatibility filtering with respect to a target formalism, and translation into formal properties. Experimental results across three scenarios show that the pipeline generates syntactically and semantically aligned formal properties with a 77.8% accuracy. By explicitly accounting for modeling and verification constraints, the approach is a paving step towards exploiting Artificial Intelligence (AI) to bridge the gap between informal descriptions and semantically meaningful formal verification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an agentic LLM-based pipeline to extract verification-ready formal properties from unstructured natural-language specifications of safety-critical systems. The modular approach consists of requirement extraction, compatibility filtering against a target formalism, and LLM translation to formal properties. Experimental results on three scenarios are reported to produce syntactically and semantically aligned formal properties at 77.8% accuracy, with the goal of bridging informal descriptions to semantically meaningful formal verification.
Significance. If the accuracy claim can be substantiated with a reproducible evaluation protocol, the work would offer a practical step toward automating requirement formalization in safety-critical domains, where manual translation from natural language to verifiable properties remains a major bottleneck.
major comments (2)
- [Abstract] Abstract: The headline result of 77.8% accuracy for both syntactic and semantic alignment is stated without any description of the evaluation metric for semantic alignment, the ground-truth reference (expert annotation, formal equivalence check, or otherwise), baseline comparisons, scenario selection criteria, or statistical significance. This directly undermines assessment of the central empirical claim.
- [Pipeline description] Pipeline description (requirement extraction + compatibility filtering + LLM translation): No quantitative breakdown is provided on the contribution of the compatibility filtering step to semantic preservation, nor is there an independent validation method (e.g., blinded expert review or automated equivalence checking) that would distinguish true semantic fidelity from plausible but drifted output.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major comment below and will revise the manuscript to improve the clarity and substantiation of our empirical claims where feasible.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline result of 77.8% accuracy for both syntactic and semantic alignment is stated without any description of the evaluation metric for semantic alignment, the ground-truth reference (expert annotation, formal equivalence check, or otherwise), baseline comparisons, scenario selection criteria, or statistical significance. This directly undermines assessment of the central empirical claim.
Authors: We agree that the abstract is too concise on evaluation details. The full manuscript (Section 4) specifies that syntactic correctness is checked via automated parsing against the target formalism's grammar, while semantic alignment is evaluated by direct expert comparison of intent between the original natural-language requirement and the generated formal property. Ground truth is provided by manually formalized properties created by the authors with domain knowledge. The three scenarios were selected as representative cases from safety-critical domains with publicly available unstructured specs. We performed no baseline comparisons because the work focuses on the novel agentic pipeline rather than a comparative benchmark, and statistical significance is not reported given the small number of scenarios. We will revise the abstract to briefly describe the evaluation approach, ground truth, and scenario criteria while noting these scope limitations. revision: yes
-
Referee: [Pipeline description] Pipeline description (requirement extraction + compatibility filtering + LLM translation): No quantitative breakdown is provided on the contribution of the compatibility filtering step to semantic preservation, nor is there an independent validation method (e.g., blinded expert review or automated equivalence checking) that would distinguish true semantic fidelity from plausible but drifted output.
Authors: The compatibility filtering step explicitly checks extracted requirements against the syntactic and semantic constraints of the target formalism before translation, which reduces the risk of semantic drift by discarding incompatible cases early. The manuscript describes this qualitatively in the pipeline section. We acknowledge that a quantitative breakdown would strengthen the presentation and will add an ablation study in the revision showing end-to-end accuracy with and without the filtering stage. Semantic validation in the current work relies on post-hoc expert review rather than blinded protocols or automated equivalence checkers; the latter remains an open research problem for natural-language-to-formal translations. We will expand the text to clarify the validation procedure and explicitly discuss it as a limitation. revision: partial
Circularity Check
No circularity: accuracy is empirical outcome, not derived by construction
full rationale
The paper describes an agentic LLM pipeline (requirement extraction + compatibility filtering + translation) and reports 77.8% syntactic/semantic alignment accuracy as the result of running the pipeline on three scenarios. No equations, parameters, or derivations are present. The accuracy figure is presented as an independent experimental measurement rather than a quantity obtained by fitting, self-definition, or reduction to the method's own inputs. No load-bearing self-citations or uniqueness claims appear in the abstract or description that would create circularity.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
A survey of statistical model checking
Gul Agha and Karl Palmskog. “A survey of statistical model checking”. In:ACM Trans- actions on Modeling and Computer Simulation (TOMACS)28.1 (2018), pp. 1–39
work page 2018
-
[2]
The benefits of relaxing punctuality
Rajeev Alur et al. “The benefits of relaxing punctuality”. In:Journal of the ACM (JACM) 43.1 (1996), pp. 116–146
work page 1996
-
[3]
Results of Field Trials with a Mobile Service Robot for Older Adults in 16 Private Households
Markus Bajones et al. “Results of Field Trials with a Mobile Service Robot for Older Adults in 16 Private Households”. In:ACM Transactions on Human-Robot Interaction (THRI)9.2 (2019), pp. 1–27
work page 2019
-
[4]
Leveraging LLMs for Formal Software Requirements–Challenges and Prospects
Arshad Beg et al. “Leveraging LLMs for Formal Software Requirements–Challenges and Prospects”. In:International Workshop on Artificial Intelligence and fOrmal VERifica- tion, Logic, Automata, and sYnthesis. ceur-ws.org. 2025, pp. 95–105
work page 2025
-
[5]
Vincent Bertram et al. “Leveraging natural language processing for a consistency check- ing toolchain of automotive requirements”. In:International Requirements Engineering Conference. IEEE. 2023, pp. 212–222
work page 2023
-
[6]
Language Models are Few-Shot Learners
Tom B. Brown et al. “Language Models are Few-Shot Learners”. In:Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. 2020. [7]Case studies - Service robots. 2018.url:https://ifr.org/case- studies/service- robots-case-studies
work page 2020
-
[7]
LLM-basedSatisfiabilityCheckingofStringRequirementsbyConsistent Data and Checker Generation
BoqiChenetal.“LLM-basedSatisfiabilityCheckingofStringRequirementsbyConsistent Data and Checker Generation”. In:International Requirements Engineering Conference. 2025, pp. 231–243. 11 Tagliaferro, Guindani, Lestingi, and Rossi
work page 2025
-
[8]
Matthias Cosler et al. “nl2spec: Interactively translating unstructured natural language to temporal logics with large language models”. In:International Conference on Computer Aided Verification. Springer. 2023, pp. 383–396
work page 2023
-
[9]
Alexandre David et al. “Uppaal SMC tutorial”. In:International journal on software tools for technology transfer17.4 (2015), pp. 397–415
work page 2015
-
[10]
Joao Pascoal Faria et al. “Automatic Generation of Formal Specification and Verification Annotations Using LLMs and Test Oracles”. In:arXiv preprint arXiv:2601.12845(2026)
-
[11]
Lessons from the use of natural language inference (nli) in requirements engineering tasks
Mohamad Fazelnia et al. “Lessons from the use of natural language inference (nli) in requirements engineering tasks”. In:International Requirements Engineering Conference. IEEE. 2024, pp. 103–115
work page 2024
-
[12]
Automated requirement contradiction detection through formal logic and LLMs
Alexander Elenga Gärtner and Dietmar Göhlich. “Automated requirement contradiction detection through formal logic and LLMs”. In:Automated Software Engineering31.2 (2024), p. 49
work page 2024
-
[13]
Translating Requirements in Property Specification Patterns using LLMs
Dario Guidotti et al. “Translating Requirements in Property Specification Patterns using LLMs”. In:31st RCRA Workshop. ceur-ws.org. 2024, pp. 68–80
work page 2024
-
[14]
State of Practice: LLMs in Software Engineering and Software Architecture
Jasmin Jahić and Ashkan Sami. “State of Practice: LLMs in Software Engineering and Software Architecture”. In:International Conference on Software Architecture Compan- ion. 2024, pp. 311–318
work page 2024
-
[15]
Extracting Formal Specifications From Documents Using LLMS for Test Au- tomation
Hui Li et al. “Extracting Formal Specifications From Documents Using LLMS for Test Au- tomation”. In:International Conference on Program Comprehension. IEEE. 2025, pp. 1– 12
work page 2025
-
[16]
Data-driven requirements elicitation: A systematic literature review
Sachiko Lim et al. “Data-driven requirements elicitation: A systematic literature review”. In:SN Computer Science2.1 (2021), p. 16
work page 2021
-
[17]
Improving requirements completeness: Automated assistance through large language models
Dipeeka Luitel et al. “Improving requirements completeness: Automated assistance through large language models”. In:Requirements Engineering29.1 (2024), pp. 73–95
work page 2024
-
[18]
URL https://arxiv.org/abs/2512
Zhi Ma et al. “Bridging Natural Language and Formal Specification–Automated Trans- lation of Software Requirements to LTL via Hierarchical Semantics Decomposition Using LLMs”. In:arXiv preprint arXiv:2512.17334(2025)
-
[19]
Taslim Mahbub et al. “Can GPT-4 aid in detecting ambiguities, inconsistencies, and incompleteness in requirements analysis? A comprehensive case study.” In:IEEE Access (2024)
work page 2024
-
[20]
Specification patterns for robotic missions
Claudio Menghi et al. “Specification patterns for robotic missions”. In:IEEE Transactions on Software Engineering(2019)
work page 2019
-
[21]
The state of the art in automated requirements elicitation
Hendrik Meth et al. “The state of the art in automated requirements elicitation”. In: Information and Software Technology55.10 (2013), pp. 1695–1709
work page 2013
-
[22]
Towards verifiable multi-agent interaction pattern specifica- tion
Alberto Tagliaferro et al. “Towards verifiable multi-agent interaction pattern specifica- tion”. In:Proceedings of the 2024 IEEE/ACM 12th International Conference on Formal Methods in Software Engineering (FormaliSE). 2024, pp. 122–126
work page 2024
-
[23]
Verification-Oriented Specification of Multi-agent Interaction Patterns
Alberto Tagliaferro et al. “Verification-Oriented Specification of Multi-agent Interaction Patterns”.In:Workshop on Agents and Robots for reliable Engineered Autonomy.Springer. 2024, pp. 38–53
work page 2024
-
[24]
Ashish Vaswani et al. “Attention is All you Need”. In:Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. 2017, pp. 5998–6008. 12 Tagliaferro, Guindani, Lestingi, and Rossi
work page 2017
-
[25]
Cheng Wen et al. “Enchanting program specification synthesis by large language models using static analysis and program verification”. In:International Conference on Computer Aided Verification. Springer. 2024, pp. 302–328
work page 2024
-
[26]
Autoformalization with large language models
Yuhuai Wu et al. “Autoformalization with large language models”. In:Advances in neural information processing systems35 (2022), pp. 32353–32368
work page 2022
-
[27]
Formal consistency checking over specifications in natural languages
Rongjie Yan et al. “Formal consistency checking over specifications in natural languages”. In:Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE. 2015, pp. 1677–1682
work page 2015
-
[28]
A., D˛ abrowski, J., Alhoshan, W., Zhao, L., and Ferrari, A., 2025
Mohammad Amin Zadenoori et al. “Large language models (LLMs) for requirements engineering (RE): A systematic literature review”. In:arXiv preprint arXiv:2509.11446 (2025)
-
[29]
NL2CTL: automatic generation of formal requirements specifica- tions via large language models
Mengyan Zhao et al. “NL2CTL: automatic generation of formal requirements specifica- tions via large language models”. In:International Conference on Formal Engineering Methods. Springer. 2024, pp. 1–17. 13
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.