Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation
Pith reviewed 2026-06-28 02:05 UTC · model grok-4.3
The pith
Current LLMs generate semantically correct TLA+ specifications from natural language at most 8.6 percent of the time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 30 LLMs and multiple prompting strategies, the highest syntactic correctness rate is 26.6 percent and the highest semantic correctness rate is 8.6 percent, with all semantic successes limited to progressive prompting. Model size does not predict quality and some smaller models outperform larger ones. Code-specialized models underperform due to negative transfer. Five recurring hallucination categories appear, each traceable to training-data biases. The results indicate that LLMs do not yet produce reliable TLA+ specifications without expert oversight.
What carries the argument
Benchmark evaluation of 30 LLMs on a curated dataset of 205 TLA+ specifications, validated for syntax by the SANY parser and for semantics by the TLC model checker under four prompting strategies.
If this is right
- Model size alone does not determine success at generating correct TLA+ specifications.
- Code-specialized models suffer negative transfer and perform worse than general-purpose models on this task.
- Progressive prompting is the only strategy that produces any semantically correct outputs.
- Training-data biases produce five identifiable, recurring hallucination patterns in generated specifications.
Where Pith is reading between the lines
- Hybrid workflows that combine LLM drafts with targeted human review could lower the expertise barrier for TLA+ adoption.
- Targeted fine-tuning on formal-specification examples might raise semantic correctness beyond the levels measured here.
- The same evaluation approach could be repeated for other formal languages to test whether the low reliability pattern is general.
Load-bearing premise
The 205-spec dataset and the SANY-plus-TLC validation procedure give a faithful measure of real-world semantic correctness for natural-language-to-TLA+ tasks.
What would settle it
An LLM or prompting technique that produces TLC-verified semantically correct TLA+ specifications on more than half of a comparable set of natural-language descriptions.
read the original abstract
TLA+ has supported industrial verification at companies such as Amazon and Microsoft, yet writing correct TLA+ specifications from natural language still requires time and expertise, which limits adoption. LLMs show promise, but no prior study measures whether they produce semantically correct TLA+ specifications from natural language. This paper presents the first systematic evaluation of LLM-based TLA+ specification synthesis from natural language. Our study evaluates 30 LLMs across eight families on a curated dataset of 205 TLA+ specifications: 25 open-weight models across four prompting strategies (2,600 runs) and 5 proprietary models under few-shot prompting (130 runs), all validated by the SANY parser and TLC model checker. LLMs achieve up to 26.6% syntactic correctness but only 8.6% semantic correctness, with successes exclusive to progressive prompting. Results show that model size does not predict quality, e.g., DeepSeek r1:8b outperforms its 70B variant across all strategies, which suggests the importance of reasoning alignment for formal languages. Code-specialized models consistently underperform due to negative transfer from mainstream language training. We identify five recurring hallucination categories, all traceable to specific training data biases. These results suggest that current LLMs do not generate reliable TLA+ specifications without expert oversight. We release the evaluation framework, code, and dataset to support reproducibility and future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the first systematic evaluation of LLM-based natural-language-to-TLA+ specification generation. It evaluates 30 models (25 open-weight across four prompting strategies for 2,600 runs; 5 proprietary under few-shot for 130 runs) on a curated dataset of 205 TLA+ specifications, using SANY for syntax and TLC for semantic validation. Key findings include maximum 26.6% syntactic correctness and 8.6% semantic correctness (successes only with progressive prompting), no correlation with model size (e.g., DeepSeek r1:8b outperforming 70B), underperformance of code-specialized models, five hallucination categories, and the conclusion that current LLMs require expert oversight. The evaluation framework, code, and dataset are released.
Significance. If the validation procedure faithfully measures semantic fidelity to the supplied natural-language intent, the work supplies the first large-scale empirical baseline on LLM limitations for TLA+ synthesis, an industrially relevant formal language. The release of the full dataset, code, and 2,730-run results is a clear strength for reproducibility and future benchmarking. The observation that reasoning-aligned smaller models can outperform larger ones and that code-specialized training produces negative transfer are actionable for the formal-methods and LLM communities.
major comments (1)
- [Dataset and Evaluation sections] Dataset and Evaluation sections: The paper provides insufficient detail on (a) how the 205 natural-language descriptions were obtained or independently validated against the reference TLA+ specifications, (b) the precise TLC configurations, invariants, or properties checked to establish semantic correctness, and (c) whether TLC acceptance was treated as evidence of equivalence to the original NL intent rather than merely absence of runtime errors. Because the headline 8.6% semantic-correctness figure rests on this proxy, the lack of these specifics makes it impossible to determine whether the reported rate under- or over-states real capability.
minor comments (1)
- [Abstract and §1] Abstract and §1: The claim that the study covers "30 LLMs across eight families" should be cross-checked against the exact model list and family definitions in the experimental setup for consistency.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for greater transparency in our dataset curation and evaluation procedures. We agree that additional detail is required to allow readers to fully assess the validity of the 8.6% semantic-correctness figure and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Dataset and Evaluation sections] Dataset and Evaluation sections: The paper provides insufficient detail on (a) how the 205 natural-language descriptions were obtained or independently validated against the reference TLA+ specifications, (b) the precise TLC configurations, invariants, or properties checked to establish semantic correctness, and (c) whether TLC acceptance was treated as evidence of equivalence to the original NL intent rather than merely absence of runtime errors. Because the headline 8.6% semantic-correctness figure rests on this proxy, the lack of these specifics makes it impossible to determine whether the reported rate under- or over-states real capability.
Authors: We accept the criticism that the manuscript currently provides insufficient detail on these points. (a) The 205 natural-language descriptions were obtained by the authors through manual reverse translation of the reference TLA+ specifications into concise English statements that capture the core intent; each pair was cross-checked by at least two authors for fidelity. We will add a new subsection in the Dataset section that documents the curation process, the source repositories from which the reference specifications were drawn, and the validation steps performed. (b) and (c) Semantic correctness was defined as the generated specification passing the same TLC checks (including the same invariants, temporal properties, and state-space exploration parameters) that the reference specification satisfies, with no runtime errors or violations reported. TLC acceptance therefore indicates that the generated spec is consistent with the checked properties of the reference, but we acknowledge it is only a proxy and does not guarantee complete semantic equivalence to every nuance of the original natural-language intent. We will expand the Evaluation section with explicit descriptions of the TLC command-line configurations, the exact properties checked for each task, and a dedicated limitations paragraph discussing the proxy nature of the metric. These revisions will be made in the next version of the manuscript. revision: yes
Circularity Check
No circularity: direct empirical counts from external validators
full rationale
The paper performs an empirical measurement: it runs 30 LLMs on a fixed curated dataset of 205 TLA+ specifications under defined prompting strategies, then counts syntactic acceptance by SANY and semantic acceptance by TLC. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the reported results. The 26.6% / 8.6% figures are raw pass rates from external tools; the dataset and validation procedure are treated as given inputs rather than derived quantities. This matches the default non-circular case for measurement studies.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption SANY parser and TLC model checker provide reliable ground truth for syntactic and semantic correctness of TLA+ specifications
Reference graph
Works this paper leans on
-
[1]
Specifying Systems: The Language and Tools for Hardware and Software Engineers , year =
Lamport, Leslie , isbn =. Specifying Systems: The Language and Tools for Hardware and Software Engineers , year =
-
[2]
15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21) , title =
Jianan Yao and others , authorall =. 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21) , title =
-
[3]
Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs , year =
Abouelenin, Abdelrahman and others , authorall =. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs , year =
-
[4]
Introducing Claude: A Next-Generation AI Assistant , year =
Anthropic , howpublished =. Introducing Claude: A Next-Generation AI Assistant , year =
-
[5]
Program Synthesis with Large Language Models , year =
Austin, Jacob and others , authorall =. Program Synthesis with Large Language Models , year =
-
[6]
arXiv , author =:2507.14330 , note =
Leveraging LLMs for Formal Software Requirements---Challenges and Prospects , year =. arXiv , author =:2507.14330 , note =
-
[7]
arXiv , author =:2509.23130 , note =
SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems , year =. arXiv , author =:2509.23130 , note =
-
[8]
arXiv , author =:2404.16075 , note =
Validating Traces of Distributed Programs Against Specifications , year =. arXiv , author =:2404.16075 , note =
-
[9]
Formal requirements engineering and large language models: A two-way roadmap , volume =
Alessio Ferrari and Paola Spoletini , doi =. Formal requirements engineering and large language models: A two-way roadmap , volume =. Information and Software Technology , pages =
-
[10]
arXiv , author =:2403.08295 , note =
Gemma: Open Models Based on Gemini Research and Technology , year =. arXiv , author =:2403.08295 , note =
-
[11]
arXiv , author =:2407.21783 , note =
The Llama 3 Herd of Models , year =. arXiv , author =:2407.21783 , note =
-
[12]
Nature645(8081), 633–638 (Sep 2025)
Guo, Daya and others , authorall =. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning , volume =. doi:10.1038/s41586-025-09422-z , journal =
-
[13]
arXiv , author =:2206.01962 , note =
Formal Specifications from Natural Language , year =. arXiv , author =:2206.01962 , note =
-
[14]
Second Place, GenAI-Accelerated Challenge , year =
Helwer, Andrew , howpublished =. Second Place, GenAI-Accelerated Challenge , year =
-
[15]
Reliable Generation of Formal Specifications using Large Language Models , year =
Kogler, Philipp and Falkner, Andreas and Sperl, Simon , booktitle =. Reliable Generation of Formal Specifications using Large Language Models , year =. doi:10.18420/sw2024-ws_10 , pages =
-
[16]
Examples: A Collection of Specifications and Models , year =
Community , howpublished =. Examples: A Collection of Specifications and Models , year =
-
[17]
ROUGE: A Package for Automatic Evaluation of Summaries , year =
Lin, Chin-Yew , booktitle =. ROUGE: A Package for Automatic Evaluation of Summaries , year =
-
[18]
arXiv , author =:2410.05229 , note =
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models , year =. arXiv , author =:2410.05229 , note =
-
[19]
arXiv , author =:2310.06825 , note =
Mistral 7B , year =. arXiv , author =:2310.06825 , note =
-
[20]
How Amazon Web Services uses formal methods , volume =
Chris Newcombe and Tim Rath and Fan Zhang and Bogdan Munteanu and Marc Brooker and Michael Deardeuff , journal =. How Amazon Web Services uses formal methods , volume =
-
[21]
Ollama: Open Source Runtime for Local Large Language Models , year =
Ollama Contributors , note =. Ollama: Open Source Runtime for Local Large Language Models , year =
-
[22]
arXiv , author =:2303.08774 , note =
GPT-4 Technical Report , url =. arXiv , author =:2303.08774 , note =
-
[23]
Bleu: a method for automatic evaluation of machine translation
Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing , booktitle =. Bleu: a Method for Automatic Evaluation of Machine Translation , year =. doi:10.3115/1073083.1073135 , pages =
-
[24]
arXiv , author =:2601.03267 , note =
OpenAI GPT-5 System Card , year =. arXiv , author =:2601.03267 , note =
-
[25]
Introducing Claude Sonnet\, 4.5 , year =
Anthropic , howpublished =. Introducing Claude Sonnet\, 4.5 , year =
-
[26]
arXiv , author =:2401.08807 , note =
SpecGen: Automated Generation of Formal Program Specifications via Large Language Models , year =. arXiv , author =:2401.08807 , note =
-
[27]
Specula: First Place, GenAI-Accelerated Challenge (Code → Spec) , year =
Qian Cheng and Tianyin Xu and Yu Huang , howpublished =. Specula: First Place, GenAI-Accelerated Challenge (Code → Spec) , year =
-
[28]
Spracklen, Joseph and others , authorall =. Proc. 34th USENIX Security Symposium , title =
-
[29]
Third Place, GenAI-Accelerated Challenge , year =
Gregory Terzian , howpublished =. Third Place, GenAI-Accelerated Challenge , year =
-
[30]
GenAI-Accelerated Challenge Announcement , year =
Foundation , howpublished =. GenAI-Accelerated Challenge Announcement , year =
-
[31]
arXiv , author =:2201.11903 , note =
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , year =. arXiv , author =:2201.11903 , note =
-
[32]
arXiv , author =:2505.09388 , note =
Qwen3 Technical Report , year =. arXiv , author =:2505.09388 , note =
-
[33]
arXiv , author =:2309.06794 , title =
-
[34]
Hallucination Mitigation for Retrieval-Augmented Large Language Models: A Review , volume =
Zhang, Wan and Zhang, Jing , doi =. Hallucination Mitigation for Retrieval-Augmented Large Language Models: A Review , volume =. Mathematics , number =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.