Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

Arslan Bisharat; Brian Ortiz; Eric Spencer; George K. Thiruvathukal; Khushboo Bhadauria; Konstantin Laufer; Mohammed Abuhamad; TaiNing Wang

arxiv: 2606.05792 · v1 · pith:ZWDVYUFUnew · submitted 2026-06-04 · 💻 cs.AI · cs.LG· cs.LO· cs.SE

Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

Arslan Bisharat , Brian Ortiz , Eric Spencer , Khushboo Bhadauria , TaiNing Wang , George K. Thiruvathukal , Konstantin Laufer , Mohammed Abuhamad This is my paper

Pith reviewed 2026-06-28 02:05 UTC · model grok-4.3

classification 💻 cs.AI cs.LGcs.LOcs.SE

keywords large language modelsTLA+formal specificationsnatural language to formalmodel checkingprompting strategiessyntactic correctnesssemantic correctness

0 comments

The pith

Current LLMs generate semantically correct TLA+ specifications from natural language at most 8.6 percent of the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can turn everyday descriptions into correct TLA+ specifications, the formal language used for industrial verification at companies like Amazon. It runs 30 models on a set of 205 examples and checks outputs with both a parser and a model checker. Syntactic correctness reaches 26.6 percent at best, but semantic correctness tops out at 8.6 percent and only appears with progressive prompting. Model size and code-specialization do not help and sometimes hurt. The work concludes that expert oversight remains necessary for reliable results.

Core claim

Across 30 LLMs and multiple prompting strategies, the highest syntactic correctness rate is 26.6 percent and the highest semantic correctness rate is 8.6 percent, with all semantic successes limited to progressive prompting. Model size does not predict quality and some smaller models outperform larger ones. Code-specialized models underperform due to negative transfer. Five recurring hallucination categories appear, each traceable to training-data biases. The results indicate that LLMs do not yet produce reliable TLA+ specifications without expert oversight.

What carries the argument

Benchmark evaluation of 30 LLMs on a curated dataset of 205 TLA+ specifications, validated for syntax by the SANY parser and for semantics by the TLC model checker under four prompting strategies.

If this is right

Model size alone does not determine success at generating correct TLA+ specifications.
Code-specialized models suffer negative transfer and perform worse than general-purpose models on this task.
Progressive prompting is the only strategy that produces any semantically correct outputs.
Training-data biases produce five identifiable, recurring hallucination patterns in generated specifications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid workflows that combine LLM drafts with targeted human review could lower the expertise barrier for TLA+ adoption.
Targeted fine-tuning on formal-specification examples might raise semantic correctness beyond the levels measured here.
The same evaluation approach could be repeated for other formal languages to test whether the low reliability pattern is general.

Load-bearing premise

The 205-spec dataset and the SANY-plus-TLC validation procedure give a faithful measure of real-world semantic correctness for natural-language-to-TLA+ tasks.

What would settle it

An LLM or prompting technique that produces TLC-verified semantically correct TLA+ specifications on more than half of a comparable set of natural-language descriptions.

read the original abstract

TLA+ has supported industrial verification at companies such as Amazon and Microsoft, yet writing correct TLA+ specifications from natural language still requires time and expertise, which limits adoption. LLMs show promise, but no prior study measures whether they produce semantically correct TLA+ specifications from natural language. This paper presents the first systematic evaluation of LLM-based TLA+ specification synthesis from natural language. Our study evaluates 30 LLMs across eight families on a curated dataset of 205 TLA+ specifications: 25 open-weight models across four prompting strategies (2,600 runs) and 5 proprietary models under few-shot prompting (130 runs), all validated by the SANY parser and TLC model checker. LLMs achieve up to 26.6% syntactic correctness but only 8.6% semantic correctness, with successes exclusive to progressive prompting. Results show that model size does not predict quality, e.g., DeepSeek r1:8b outperforms its 70B variant across all strategies, which suggests the importance of reasoning alignment for formal languages. Code-specialized models consistently underperform due to negative transfer from mainstream language training. We identify five recurring hallucination categories, all traceable to specific training data biases. These results suggest that current LLMs do not generate reliable TLA+ specifications without expert oversight. We release the evaluation framework, code, and dataset to support reproducibility and future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

First broad benchmark finds LLMs reach only 8.6% semantic correctness on TLA+ from natural language, with progressive prompting as the sole working strategy.

read the letter

This paper's main result is that LLMs hit at most 26.6% syntactic correctness and 8.6% semantic correctness when turning natural language into TLA+ specs, and the semantic wins appear only under progressive prompting.

The work is new as the first systematic multi-model, multi-strategy test that uses SANY and TLC to separate syntax from semantics across 2730 runs on 205 specs. It covers 25 open-weight models plus five proprietary ones, shows model size does not predict success, finds reasoning-aligned models like DeepSeek r1:8b beating larger variants, and notes that code-specialized models underperform due to negative transfer. The five hallucination categories tied to training biases give concrete targets. Releasing the dataset, code, and framework supports direct follow-up.

The soft spot is the limited information on dataset construction. The abstract does not detail how the natural-language prompts were obtained or checked for independence from the reference specs, nor which TLC configurations or properties were used. If the prompts were reverse-engineered or the checks were shallow, the 8.6% semantic figure could shift. That concern from the stress-test note holds based on what is visible here.

The paper is for researchers working at the intersection of AI and formal methods or anyone building benchmarks for specification generation. A reader who needs a reproducible baseline on this task gets clear value from the numbers and artifacts. The design is straightforward empirical work with no circularity or unsupported leaps.

Send it for peer review. The contribution as an initial benchmark is solid enough to justify referee time on the methods details.

Referee Report

1 major / 1 minor

Summary. The paper presents the first systematic evaluation of LLM-based natural-language-to-TLA+ specification generation. It evaluates 30 models (25 open-weight across four prompting strategies for 2,600 runs; 5 proprietary under few-shot for 130 runs) on a curated dataset of 205 TLA+ specifications, using SANY for syntax and TLC for semantic validation. Key findings include maximum 26.6% syntactic correctness and 8.6% semantic correctness (successes only with progressive prompting), no correlation with model size (e.g., DeepSeek r1:8b outperforming 70B), underperformance of code-specialized models, five hallucination categories, and the conclusion that current LLMs require expert oversight. The evaluation framework, code, and dataset are released.

Significance. If the validation procedure faithfully measures semantic fidelity to the supplied natural-language intent, the work supplies the first large-scale empirical baseline on LLM limitations for TLA+ synthesis, an industrially relevant formal language. The release of the full dataset, code, and 2,730-run results is a clear strength for reproducibility and future benchmarking. The observation that reasoning-aligned smaller models can outperform larger ones and that code-specialized training produces negative transfer are actionable for the formal-methods and LLM communities.

major comments (1)

[Dataset and Evaluation sections] Dataset and Evaluation sections: The paper provides insufficient detail on (a) how the 205 natural-language descriptions were obtained or independently validated against the reference TLA+ specifications, (b) the precise TLC configurations, invariants, or properties checked to establish semantic correctness, and (c) whether TLC acceptance was treated as evidence of equivalence to the original NL intent rather than merely absence of runtime errors. Because the headline 8.6% semantic-correctness figure rests on this proxy, the lack of these specifics makes it impossible to determine whether the reported rate under- or over-states real capability.

minor comments (1)

[Abstract and §1] Abstract and §1: The claim that the study covers "30 LLMs across eight families" should be cross-checked against the exact model list and family definitions in the experimental setup for consistency.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in our dataset curation and evaluation procedures. We agree that additional detail is required to allow readers to fully assess the validity of the 8.6% semantic-correctness figure and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Dataset and Evaluation sections] Dataset and Evaluation sections: The paper provides insufficient detail on (a) how the 205 natural-language descriptions were obtained or independently validated against the reference TLA+ specifications, (b) the precise TLC configurations, invariants, or properties checked to establish semantic correctness, and (c) whether TLC acceptance was treated as evidence of equivalence to the original NL intent rather than merely absence of runtime errors. Because the headline 8.6% semantic-correctness figure rests on this proxy, the lack of these specifics makes it impossible to determine whether the reported rate under- or over-states real capability.

Authors: We accept the criticism that the manuscript currently provides insufficient detail on these points. (a) The 205 natural-language descriptions were obtained by the authors through manual reverse translation of the reference TLA+ specifications into concise English statements that capture the core intent; each pair was cross-checked by at least two authors for fidelity. We will add a new subsection in the Dataset section that documents the curation process, the source repositories from which the reference specifications were drawn, and the validation steps performed. (b) and (c) Semantic correctness was defined as the generated specification passing the same TLC checks (including the same invariants, temporal properties, and state-space exploration parameters) that the reference specification satisfies, with no runtime errors or violations reported. TLC acceptance therefore indicates that the generated spec is consistent with the checked properties of the reference, but we acknowledge it is only a proxy and does not guarantee complete semantic equivalence to every nuance of the original natural-language intent. We will expand the Evaluation section with explicit descriptions of the TLC command-line configurations, the exact properties checked for each task, and a dedicated limitations paragraph discussing the proxy nature of the metric. These revisions will be made in the next version of the manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical counts from external validators

full rationale

The paper performs an empirical measurement: it runs 30 LLMs on a fixed curated dataset of 205 TLA+ specifications under defined prompting strategies, then counts syntactic acceptance by SANY and semantic acceptance by TLC. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the reported results. The 26.6% / 8.6% figures are raw pass rates from external tools; the dataset and validation procedure are treated as given inputs rather than derived quantities. This matches the default non-circular case for measurement studies.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical benchmarking study. It depends on the external correctness of the SANY parser and TLC model checker but introduces no free parameters, ad-hoc axioms, or new entities.

axioms (1)

domain assumption SANY parser and TLC model checker provide reliable ground truth for syntactic and semantic correctness of TLA+ specifications
All reported correctness percentages rest on acceptance or rejection by these two tools.

pith-pipeline@v0.9.1-grok · 5821 in / 1247 out tokens · 27529 ms · 2026-06-28T02:05:25.041585+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 3 canonical work pages

[1]

Specifying Systems: The Language and Tools for Hardware and Software Engineers , year =

Lamport, Leslie , isbn =. Specifying Systems: The Language and Tools for Hardware and Software Engineers , year =
[2]

15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21) , title =

Jianan Yao and others , authorall =. 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21) , title =
[3]

Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs , year =

Abouelenin, Abdelrahman and others , authorall =. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs , year =
[4]

Introducing Claude: A Next-Generation AI Assistant , year =

Anthropic , howpublished =. Introducing Claude: A Next-Generation AI Assistant , year =
[5]

Program Synthesis with Large Language Models , year =

Austin, Jacob and others , authorall =. Program Synthesis with Large Language Models , year =
[6]

arXiv , author =:2507.14330 , note =

Leveraging LLMs for Formal Software Requirements---Challenges and Prospects , year =. arXiv , author =:2507.14330 , note =

arXiv
[7]

arXiv , author =:2509.23130 , note =

SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems , year =. arXiv , author =:2509.23130 , note =

arXiv
[8]

arXiv , author =:2404.16075 , note =

Validating Traces of Distributed Programs Against Specifications , year =. arXiv , author =:2404.16075 , note =

arXiv
[9]

Formal requirements engineering and large language models: A two-way roadmap , volume =

Alessio Ferrari and Paola Spoletini , doi =. Formal requirements engineering and large language models: A two-way roadmap , volume =. Information and Software Technology , pages =
[10]

arXiv , author =:2403.08295 , note =

Gemma: Open Models Based on Gemini Research and Technology , year =. arXiv , author =:2403.08295 , note =

Pith/arXiv arXiv
[11]

arXiv , author =:2407.21783 , note =

The Llama 3 Herd of Models , year =. arXiv , author =:2407.21783 , note =

Pith/arXiv arXiv
[12]

Nature645(8081), 633–638 (Sep 2025)

Guo, Daya and others , authorall =. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning , volume =. doi:10.1038/s41586-025-09422-z , journal =

work page doi:10.1038/s41586-025-09422-z
[13]

arXiv , author =:2206.01962 , note =

Formal Specifications from Natural Language , year =. arXiv , author =:2206.01962 , note =

arXiv
[14]

Second Place, GenAI-Accelerated Challenge , year =

Helwer, Andrew , howpublished =. Second Place, GenAI-Accelerated Challenge , year =
[15]

Reliable Generation of Formal Specifications using Large Language Models , year =

Kogler, Philipp and Falkner, Andreas and Sperl, Simon , booktitle =. Reliable Generation of Formal Specifications using Large Language Models , year =. doi:10.18420/sw2024-ws_10 , pages =

work page doi:10.18420/sw2024-ws_10
[16]

Examples: A Collection of Specifications and Models , year =

Community , howpublished =. Examples: A Collection of Specifications and Models , year =
[17]

ROUGE: A Package for Automatic Evaluation of Summaries , year =

Lin, Chin-Yew , booktitle =. ROUGE: A Package for Automatic Evaluation of Summaries , year =
[18]

arXiv , author =:2410.05229 , note =

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models , year =. arXiv , author =:2410.05229 , note =

Pith/arXiv arXiv
[19]

arXiv , author =:2310.06825 , note =

Mistral 7B , year =. arXiv , author =:2310.06825 , note =

Pith/arXiv arXiv
[20]

How Amazon Web Services uses formal methods , volume =

Chris Newcombe and Tim Rath and Fan Zhang and Bogdan Munteanu and Marc Brooker and Michael Deardeuff , journal =. How Amazon Web Services uses formal methods , volume =
[21]

Ollama: Open Source Runtime for Local Large Language Models , year =

Ollama Contributors , note =. Ollama: Open Source Runtime for Local Large Language Models , year =
[22]

arXiv , author =:2303.08774 , note =

GPT-4 Technical Report , url =. arXiv , author =:2303.08774 , note =

Pith/arXiv arXiv
[23]

Bleu: a method for automatic evaluation of machine translation

Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing , booktitle =. Bleu: a Method for Automatic Evaluation of Machine Translation , year =. doi:10.3115/1073083.1073135 , pages =

work page doi:10.3115/1073083.1073135
[24]

arXiv , author =:2601.03267 , note =

OpenAI GPT-5 System Card , year =. arXiv , author =:2601.03267 , note =

Pith/arXiv arXiv
[25]

Introducing Claude Sonnet\, 4.5 , year =

Anthropic , howpublished =. Introducing Claude Sonnet\, 4.5 , year =
[26]

arXiv , author =:2401.08807 , note =

SpecGen: Automated Generation of Formal Program Specifications via Large Language Models , year =. arXiv , author =:2401.08807 , note =

arXiv
[27]

Specula: First Place, GenAI-Accelerated Challenge (Code → Spec) , year =

Qian Cheng and Tianyin Xu and Yu Huang , howpublished =. Specula: First Place, GenAI-Accelerated Challenge (Code → Spec) , year =
[28]

Spracklen, Joseph and others , authorall =. Proc. 34th USENIX Security Symposium , title =
[29]

Third Place, GenAI-Accelerated Challenge , year =

Gregory Terzian , howpublished =. Third Place, GenAI-Accelerated Challenge , year =
[30]

GenAI-Accelerated Challenge Announcement , year =

Foundation , howpublished =. GenAI-Accelerated Challenge Announcement , year =
[31]

arXiv , author =:2201.11903 , note =

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , year =. arXiv , author =:2201.11903 , note =

Pith/arXiv arXiv
[32]

arXiv , author =:2505.09388 , note =

Qwen3 Technical Report , year =. arXiv , author =:2505.09388 , note =

Pith/arXiv arXiv
[33]

arXiv , author =:2309.06794 , title =

arXiv
[34]

Hallucination Mitigation for Retrieval-Augmented Large Language Models: A Review , volume =

Zhang, Wan and Zhang, Jing , doi =. Hallucination Mitigation for Retrieval-Augmented Large Language Models: A Review , volume =. Mathematics , number =

[1] [1]

Specifying Systems: The Language and Tools for Hardware and Software Engineers , year =

Lamport, Leslie , isbn =. Specifying Systems: The Language and Tools for Hardware and Software Engineers , year =

[2] [2]

15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21) , title =

Jianan Yao and others , authorall =. 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21) , title =

[3] [3]

Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs , year =

Abouelenin, Abdelrahman and others , authorall =. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs , year =

[4] [4]

Introducing Claude: A Next-Generation AI Assistant , year =

Anthropic , howpublished =. Introducing Claude: A Next-Generation AI Assistant , year =

[5] [5]

Program Synthesis with Large Language Models , year =

Austin, Jacob and others , authorall =. Program Synthesis with Large Language Models , year =

[6] [6]

arXiv , author =:2507.14330 , note =

Leveraging LLMs for Formal Software Requirements---Challenges and Prospects , year =. arXiv , author =:2507.14330 , note =

arXiv

[7] [7]

arXiv , author =:2509.23130 , note =

SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems , year =. arXiv , author =:2509.23130 , note =

arXiv

[8] [8]

arXiv , author =:2404.16075 , note =

Validating Traces of Distributed Programs Against Specifications , year =. arXiv , author =:2404.16075 , note =

arXiv

[9] [9]

Formal requirements engineering and large language models: A two-way roadmap , volume =

Alessio Ferrari and Paola Spoletini , doi =. Formal requirements engineering and large language models: A two-way roadmap , volume =. Information and Software Technology , pages =

[10] [10]

arXiv , author =:2403.08295 , note =

Gemma: Open Models Based on Gemini Research and Technology , year =. arXiv , author =:2403.08295 , note =

Pith/arXiv arXiv

[11] [11]

arXiv , author =:2407.21783 , note =

The Llama 3 Herd of Models , year =. arXiv , author =:2407.21783 , note =

Pith/arXiv arXiv

[12] [12]

Nature645(8081), 633–638 (Sep 2025)

Guo, Daya and others , authorall =. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning , volume =. doi:10.1038/s41586-025-09422-z , journal =

work page doi:10.1038/s41586-025-09422-z

[13] [13]

arXiv , author =:2206.01962 , note =

Formal Specifications from Natural Language , year =. arXiv , author =:2206.01962 , note =

arXiv

[14] [14]

Second Place, GenAI-Accelerated Challenge , year =

Helwer, Andrew , howpublished =. Second Place, GenAI-Accelerated Challenge , year =

[15] [15]

Reliable Generation of Formal Specifications using Large Language Models , year =

Kogler, Philipp and Falkner, Andreas and Sperl, Simon , booktitle =. Reliable Generation of Formal Specifications using Large Language Models , year =. doi:10.18420/sw2024-ws_10 , pages =

work page doi:10.18420/sw2024-ws_10

[16] [16]

Examples: A Collection of Specifications and Models , year =

Community , howpublished =. Examples: A Collection of Specifications and Models , year =

[17] [17]

ROUGE: A Package for Automatic Evaluation of Summaries , year =

Lin, Chin-Yew , booktitle =. ROUGE: A Package for Automatic Evaluation of Summaries , year =

[18] [18]

arXiv , author =:2410.05229 , note =

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models , year =. arXiv , author =:2410.05229 , note =

Pith/arXiv arXiv

[19] [19]

arXiv , author =:2310.06825 , note =

Mistral 7B , year =. arXiv , author =:2310.06825 , note =

Pith/arXiv arXiv

[20] [20]

How Amazon Web Services uses formal methods , volume =

Chris Newcombe and Tim Rath and Fan Zhang and Bogdan Munteanu and Marc Brooker and Michael Deardeuff , journal =. How Amazon Web Services uses formal methods , volume =

[21] [21]

Ollama: Open Source Runtime for Local Large Language Models , year =

Ollama Contributors , note =. Ollama: Open Source Runtime for Local Large Language Models , year =

[22] [22]

arXiv , author =:2303.08774 , note =

GPT-4 Technical Report , url =. arXiv , author =:2303.08774 , note =

Pith/arXiv arXiv

[23] [23]

Bleu: a method for automatic evaluation of machine translation

Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing , booktitle =. Bleu: a Method for Automatic Evaluation of Machine Translation , year =. doi:10.3115/1073083.1073135 , pages =

work page doi:10.3115/1073083.1073135

[24] [24]

arXiv , author =:2601.03267 , note =

OpenAI GPT-5 System Card , year =. arXiv , author =:2601.03267 , note =

Pith/arXiv arXiv

[25] [25]

Introducing Claude Sonnet\, 4.5 , year =

Anthropic , howpublished =. Introducing Claude Sonnet\, 4.5 , year =

[26] [26]

arXiv , author =:2401.08807 , note =

SpecGen: Automated Generation of Formal Program Specifications via Large Language Models , year =. arXiv , author =:2401.08807 , note =

arXiv

[27] [27]

Specula: First Place, GenAI-Accelerated Challenge (Code → Spec) , year =

Qian Cheng and Tianyin Xu and Yu Huang , howpublished =. Specula: First Place, GenAI-Accelerated Challenge (Code → Spec) , year =

[28] [28]

Spracklen, Joseph and others , authorall =. Proc. 34th USENIX Security Symposium , title =

[29] [29]

Third Place, GenAI-Accelerated Challenge , year =

Gregory Terzian , howpublished =. Third Place, GenAI-Accelerated Challenge , year =

[30] [30]

GenAI-Accelerated Challenge Announcement , year =

Foundation , howpublished =. GenAI-Accelerated Challenge Announcement , year =

[31] [31]

arXiv , author =:2201.11903 , note =

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , year =. arXiv , author =:2201.11903 , note =

Pith/arXiv arXiv

[32] [32]

arXiv , author =:2505.09388 , note =

Qwen3 Technical Report , year =. arXiv , author =:2505.09388 , note =

Pith/arXiv arXiv

[33] [33]

arXiv , author =:2309.06794 , title =

arXiv

[34] [34]

Hallucination Mitigation for Retrieval-Augmented Large Language Models: A Review , volume =

Zhang, Wan and Zhang, Jing , doi =. Hallucination Mitigation for Retrieval-Augmented Large Language Models: A Review , volume =. Mathematics , number =