Formal Specifications from Natural Language

Bernd Finkbeiner; Christopher Hahn; Frederik Schmitt; Julia J. Tillman; Julian Siber; Niklas Metzger

arxiv: 2206.01962 · v2 · pith:DDRJVLMInew · submitted 2022-06-04 · 💻 cs.SE · cs.LG· cs.PL

Formal Specifications from Natural Language

Christopher Hahn , Frederik Schmitt , Julia J. Tillman , Niklas Metzger , Julian Siber , Bernd Finkbeiner This is my paper

classification 💻 cs.SE cs.LGcs.PL

keywords languageformalmodelsnaturalexpressionsfine-tunegeneralizationlogic

0 comments

read the original abstract

We study the generalization abilities of language models when translating natural language into formal specifications with complex semantics. In particular, we fine-tune language models on three datasets consisting of English sentences and their corresponding formal representation: 1) regular expressions (regex), frequently used in programming and search; 2) First-order logic (FOL), commonly used in software verification and theorem proving; and 3) linear-time temporal logic (LTL), which forms the basis for industrial hardware specification languages. Our experiments show that, in these diverse domains, the language models maintain their generalization capabilities from pre-trained knowledge of natural language to generalize, e.g., to new variable names or operator descriptions. Additionally, they achieve competitive performance, and even outperform the state-of-the-art for translating into regular expressions, with the benefits of being easy to access, efficient to fine-tune, and without a particular need for domain-specific reasoning.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation
cs.AI 2026-06 accept novelty 8.0

Across 30 LLMs and 205 TLA+ tasks, syntactic correctness reaches at most 26.6% and semantic correctness 8.6%, with all successes limited to progressive prompting and no advantage from larger models.
Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling
cs.CL 2026-06 unverdicted novelty 7.0

Audit finds 36-39% incorrect FOL labels in FOLIO and MALLS; corrections raise LLM accuracy 9-22 points and an LLM-guided review framework achieves 90% dataset quality after checking fewer than 24% of examples.
POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference
cs.SE 2026-05 unverdicted novelty 7.0

POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
Coherency through formalisations of Structured Natural Language, A case study on FRETish
cs.CL 2026-05 unverdicted novelty 5.0

A coherency guideline for multi-level requirement formalization is introduced, along with an equivalent but statistically improved FRETish-to-MTL translation for the FRET tool.