FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agentic Code Generation in Lean
Pith reviewed 2026-05-08 11:52 UTC · model grok-4.3
The pith
FormalScience lets one domain expert turn informal physics into valid Lean4 proofs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FormalScience is a domain-agnostic human-in-the-loop agentic pipeline that enables a single domain expert without deep formal language experience to produce syntactically correct and semantically aligned formal proofs of informal physics reasoning in Lean4, yielding the FormalPhysics dataset of 200 problems that exhibits perfect formal validity and greater statement complexity than prior formal math benchmarks.
What carries the argument
The human-in-the-loop agentic pipeline with multi-stage refinement and error feedback that iteratively generates and corrects Lean4 code from LaTeX physics problems and solutions.
If this is right
- LLMs can be systematically tested on physics autoformalisation tasks using zero-shot, self-refinement, and multi-stage agentic prompting.
- Semantic drift in formalisation can be described in terms of notational collapse and abstraction elevation when full semantic preservation is not achieved.
- The released codebase and UI extend autoformalisation and theorem proving support to scientific domains beyond physics.
Where Pith is reading between the lines
- The pipeline may reduce the cost of building large formal scientific knowledge bases that AI systems can verify.
- Similar human-guided agentic methods could be tested on chemistry or biology problems to check domain-agnostic scalability.
- Insights on semantic drift could guide training of future models to better handle domain-specific notations like Dirac notation.
Load-bearing premise
A single domain expert without deep formal language experience can reliably guide the agentic pipeline to produce semantically aligned formal proofs.
What would settle it
Independent physics experts review a sample of the FormalPhysics formalizations and find that many fail to preserve the original informal meaning, or the pipeline produces invalid code when applied to new problems outside the initial 200.
Figures
read the original abstract
Formalising informal mathematical reasoning into formally verifiable code is a significant challenge for large language models. In scientific fields such as physics, domain-specific machinery (\textit{e.g.} Dirac notation, vector calculus) imposes additional formalisation challenges that modern LLMs and agentic approaches have yet to tackle. To aid autoformalisation in scientific domains, we present FormalScience; a domain-agnostic human-in-the-loop agentic pipeline that enables a single domain expert (without deep formal language experience) to produce \textit{syntactically correct} and \textit{semantically aligned} formal proofs of informal reasoning for low economic cost. Applying FormalScience to physics, we construct FormalPhysics, a dataset of 200 university-level (LaTeX) physics problems and solutions (primarily quantum mechanics and electromagnetism), along with their Lean4 formal representations. Compared to existing formal math benchmarks, FormalPhysics achieves perfect formal validity and exhibits greater statement complexity. We evaluate open-source models and proprietary systems on a statement autoformalisation task on our dataset via zero-shot prompting, self-refinement with error feedback, and a novel multi-stage agentic approach, and explore autoformalisation limitations in modern LLM-based approaches. We provide the first systematic characterisation of semantic drift in physics autoformalisation in terms of concepts such as notational collapse and abstraction elevation which reveals what formal language verifies when full semantic preservation is unattainable. We release the codebase together with an interactive UI-based FormalScience system which facilitates autoformalisation and theorem proving in scientific domains beyond physics.https://github.com/jmeadows17/formal-science
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FormalScience, a domain-agnostic human-in-the-loop agentic pipeline that combines LLMs with code generation in Lean4 to autoformalize informal scientific reasoning. Applied to physics, the authors construct the FormalPhysics dataset of 200 university-level problems (primarily quantum mechanics and electromagnetism) together with their Lean4 formalizations, claiming perfect formal validity and greater statement complexity than existing formal math benchmarks. The work evaluates open-source and proprietary models on a statement autoformalisation task using zero-shot prompting, self-refinement, and a multi-stage agentic approach, while providing the first systematic characterisation of semantic drift phenomena such as notational collapse and abstraction elevation.
Significance. If the central claims hold, the pipeline offers a practical, low-cost route for a single domain expert to produce formally verifiable scientific content, with the released codebase and interactive UI providing immediate value for reproducibility and extension to other domains. The FormalPhysics dataset and the explicit treatment of semantic drift limitations supply concrete material for studying where LLM-based formalisation succeeds or fails in notation-heavy fields.
major comments (3)
- [§4] §4 (FormalPhysics construction): semantic alignment of the 200 entries is asserted after a single domain expert iterates with the agent until Lean checks pass and the expert is satisfied; no inter-rater agreement, second-expert review, or quantified semantic-drift score is reported. This single-judge process is the sole safeguard for the headline claim of 'semantically aligned' formal proofs and therefore load-bearing for the dataset's claimed utility.
- [§5] §5 (evaluation and semantic-drift analysis): the abstract states 'perfect formal validity' across prompting strategies, yet the text provides no per-strategy success rates, error typology, or ablation that isolates the contribution of the multi-stage agentic refinement versus simpler baselines. Without these numbers the scalability and limitation-characterisation claims cannot be assessed for robustness.
- [§5.3] §5.3 (characterisation of semantic drift): the notions of notational collapse and abstraction elevation are introduced as novel, but the section supplies no concrete examples tied to specific FormalPhysics statements, no count of affected entries, and no metric showing how often full semantic preservation is unattainable. This weakens the claim that the analysis 'reveals what formal language verifies'.
minor comments (2)
- [Abstract] The abstract claims 'low economic cost' without any reported token counts, human time, or monetary figures in the main text or appendix; adding these numbers would strengthen the scalability narrative.
- [Figures] Pipeline diagrams (presumably Figure 1 or 2) should explicitly annotate the human intervention points and the exact Lean4 feedback loop to avoid ambiguity for readers unfamiliar with agentic workflows.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the strengths and limitations of our work. We address each major point below, providing clarifications where the manuscript may have been unclear and outlining specific revisions to strengthen the presentation of our claims.
read point-by-point responses
-
Referee: [§4] §4 (FormalPhysics construction): semantic alignment of the 200 entries is asserted after a single domain expert iterates with the agent until Lean checks pass and the expert is satisfied; no inter-rater agreement, second-expert review, or quantified semantic-drift score is reported. This single-judge process is the sole safeguard for the headline claim of 'semantically aligned' formal proofs and therefore load-bearing for the dataset's claimed utility.
Authors: We agree that semantic alignment rests on the judgment of a single domain expert, which is a limitation of the current dataset construction process. This design choice aligns with the human-in-the-loop philosophy of FormalScience, enabling a single expert without formal methods expertise to produce verified formalizations at low cost. Lean provides syntactic guarantees, while semantic fidelity is ensured through iterative expert review until the expert is satisfied. We acknowledge that inter-rater agreement or a quantified drift score would provide additional robustness. In the revised manuscript, we will expand §4 to explicitly discuss this limitation, include a small-scale inter-rater agreement study on a 20-problem subset, and report a basic semantic-drift score based on expert pairwise comparisons. revision: partial
-
Referee: [§5] §5 (evaluation and semantic-drift analysis): the abstract states 'perfect formal validity' across prompting strategies, yet the text provides no per-strategy success rates, error typology, or ablation that isolates the contribution of the multi-stage agentic refinement versus simpler baselines. Without these numbers the scalability and limitation-characterisation claims cannot be assessed for robustness.
Authors: We clarify that the claim of 'perfect formal validity' in the abstract and §4 refers exclusively to the constructed FormalPhysics dataset (all 200 entries pass Lean type-checking and expert semantic review), not to the autoformalisation success rates across prompting strategies. The evaluation in §5 compares zero-shot, self-refinement, and multi-stage agentic approaches on statement formalisation, but we accept that the current text lacks sufficient granularity. In revision, we will add a new table with per-strategy success rates (including exact numbers for each model and method), an error typology breakdown, and an ablation study isolating the contribution of the agentic stages versus baselines. revision: yes
-
Referee: [§5.3] §5.3 (characterisation of semantic drift): the notions of notational collapse and abstraction elevation are introduced as novel, but the section supplies no concrete examples tied to specific FormalPhysics statements, no count of affected entries, and no metric showing how often full semantic preservation is unattainable. This weakens the claim that the analysis 'reveals what formal language verifies'.
Authors: We agree that §5.3 would benefit from greater concreteness. The section currently introduces notational collapse and abstraction elevation as observed phenomena in physics formalisation, but lacks the requested specifics. In the revised version, we will add 3-4 concrete examples drawn directly from FormalPhysics entries (with Lean code and original LaTeX), report the approximate frequency of each phenomenon across the dataset, and introduce a simple expert-rated metric (0-5 scale) for semantic preservation to quantify how often full alignment is unattainable. revision: yes
Circularity Check
No circularity: engineering pipeline and dataset construction are self-contained.
full rationale
The paper describes an agentic human-in-the-loop pipeline for autoformalising scientific statements into Lean4, applies it to create the FormalPhysics dataset of 200 physics problems, and evaluates LLMs on statement autoformalisation tasks. No mathematical derivations, fitted parameters, or predictions are claimed; the central contributions are the pipeline description, dataset release, and empirical evaluation results. Semantic alignment is asserted via expert iteration until Lean checks pass, but this is an engineering process rather than a derivation that reduces to its own inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in the provided text. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
InInternational Conference on Intelligent Computer Mathematics, pages 288–
Formalizing physics: automation, presentation and foundation issues. InInternational Conference on Intelligent Computer Mathematics, pages 288–
-
[2]
Springer. Sirui Li, Wangyue Lu, Xiaorui Shi, Ke Weng, Haozhe Sun, Minghe Yu, Tiancheng Zhang, Ge Yu, Hengyu Liu, and Lun Du. 2025. Msc-180: A bench- mark for automated formal theorem proving from mathematical subject classification.arXiv preprint arXiv:2512.18256. Zenan Li, Yifan Wu, Zhaoyu Li, Xinming Wei, Xian Zhang, Fan Yang, and Xiaoxing Ma. 2024. Aut...
-
[3]
InProceed- ings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24858–24866
Controlling equational reasoning in large lan- guage models with prompt interventions. InProceed- ings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24858–24866. Agnieszka Mensfelt, David Tena Cucala, Santiago Franco, Angeliki Koutsoukou-Argyraki, Vince Trenc- senyi, and Kostas Stathis. 2025. Towards a com- mon framework for autoform...
-
[4]
OpenAI, :, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K
Mathgap: Out-of-distribution evaluation on problems with arbitrarily complex proofs.arXiv preprint arXiv:2410.13502. OpenAI, :, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bu...
-
[5]
minif2f: a cross-system benchmark for for- mal olympiad-level mathematics. InInternational Conference on Learning Representations. A Implementation of FormalScience pipeline to construct FormalPhysics As described in Alg. 1, the FormalScience pipeline requires in-context examples of informal state- ments and proofs, and a larger collection of infor- mal p...
work page 2024
-
[8]
You should wrap the formal code in a way illustrated as the following: %%%%%%%%%% Your Formal Code %%%%%%%%%% Strictly follow the instructions that have been claimed. Natural language statement: {nl_statement} Give me the Lean4 formal code of the statement: Self- Refinement with Error Feedback You are an expert in formal language Lean4. You will be given ...
-
[9]
You should give the formal code directly without any additional comments or explanations. If the given physics statement is a theorem or lemma, omit the formal proof and use the default ’sorry’ mode in the formal code
-
[10]
In case that you need to import any necessary preambles, you should not import any fake (non-exist) preambles
-
[11]
effectively unchanged with self-refinement
You should wrap the formal code in a way illustrated as the following: %%%%%%%%%% Your Formal Code %%%%%%%%%% Strictly follow the instructions that have been claimed. Natural language statement: {nl_statement} There are some Lean4 formal codes describing the given physics statement: {formal} You should refine the formal code for your task to make it corre...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.