FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agentic Code Generation in Lean

Andre Freitas; Jordan Meadows; Lan Zhang

arxiv: 2604.23002 · v1 · submitted 2026-04-24 · 💻 cs.AI · cs.CL

FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agentic Code Generation in Lean

Jordan Meadows , Lan Zhang , Andre Freitas This is my paper

Pith reviewed 2026-05-08 11:52 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords autoformalisationLean4physicsagentic pipelinehuman-in-the-loopformal verificationsemantic drift

0 comments

The pith

FormalScience lets one domain expert turn informal physics into valid Lean4 proofs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FormalScience as a human-in-the-loop agentic pipeline that converts informal scientific reasoning into syntactically correct and semantically aligned Lean4 code. The authors apply it to physics to build FormalPhysics, a dataset of 200 university-level problems and solutions from quantum mechanics and electromagnetism, each paired with its formal representation. This dataset achieves perfect formal validity and shows greater statement complexity than existing formal math benchmarks. The work tests open-source and proprietary models on autoformalisation using zero-shot prompting, self-refinement, and a multi-stage agentic method, while characterizing semantic drift through ideas like notational collapse and abstraction elevation. The codebase and an interactive UI are released to support similar work in other scientific domains.

Core claim

FormalScience is a domain-agnostic human-in-the-loop agentic pipeline that enables a single domain expert without deep formal language experience to produce syntactically correct and semantically aligned formal proofs of informal physics reasoning in Lean4, yielding the FormalPhysics dataset of 200 problems that exhibits perfect formal validity and greater statement complexity than prior formal math benchmarks.

What carries the argument

The human-in-the-loop agentic pipeline with multi-stage refinement and error feedback that iteratively generates and corrects Lean4 code from LaTeX physics problems and solutions.

If this is right

LLMs can be systematically tested on physics autoformalisation tasks using zero-shot, self-refinement, and multi-stage agentic prompting.
Semantic drift in formalisation can be described in terms of notational collapse and abstraction elevation when full semantic preservation is not achieved.
The released codebase and UI extend autoformalisation and theorem proving support to scientific domains beyond physics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The pipeline may reduce the cost of building large formal scientific knowledge bases that AI systems can verify.
Similar human-guided agentic methods could be tested on chemistry or biology problems to check domain-agnostic scalability.
Insights on semantic drift could guide training of future models to better handle domain-specific notations like Dirac notation.

Load-bearing premise

A single domain expert without deep formal language experience can reliably guide the agentic pipeline to produce semantically aligned formal proofs.

What would settle it

Independent physics experts review a sample of the FormalPhysics formalizations and find that many fail to preserve the original informal meaning, or the pipeline produces invalid code when applied to new problems outside the initial 200.

Figures

Figures reproduced from arXiv: 2604.23002 by Andre Freitas, Jordan Meadows, Lan Zhang.

**Figure 1.** Figure 1: An overview of the FormalScience approach. code with a 100% formal guarantee. The pipeline is domain-agnostic and scalable, enabling the generation of high-quality formal datasets in scientific domains for fine-tuning or evaluating AI systems. In this work, we evaluate FormalScience exclusively on physics. (ii) Present FormalPhysics: a corpus generated from the FormalScience pipeline comprising of 200 un… view at source ↗

**Figure 2.** Figure 2: Examples sourced from the FormalPhysics corpus generated via the FormalScience approach applied view at source ↗

**Figure 3.** Figure 3: Proportion of alignment drift categories view at source ↗

**Figure 4.** Figure 4: A (human-written) in-context example used view at source ↗

read the original abstract

Formalising informal mathematical reasoning into formally verifiable code is a significant challenge for large language models. In scientific fields such as physics, domain-specific machinery (\textit{e.g.} Dirac notation, vector calculus) imposes additional formalisation challenges that modern LLMs and agentic approaches have yet to tackle. To aid autoformalisation in scientific domains, we present FormalScience; a domain-agnostic human-in-the-loop agentic pipeline that enables a single domain expert (without deep formal language experience) to produce \textit{syntactically correct} and \textit{semantically aligned} formal proofs of informal reasoning for low economic cost. Applying FormalScience to physics, we construct FormalPhysics, a dataset of 200 university-level (LaTeX) physics problems and solutions (primarily quantum mechanics and electromagnetism), along with their Lean4 formal representations. Compared to existing formal math benchmarks, FormalPhysics achieves perfect formal validity and exhibits greater statement complexity. We evaluate open-source models and proprietary systems on a statement autoformalisation task on our dataset via zero-shot prompting, self-refinement with error feedback, and a novel multi-stage agentic approach, and explore autoformalisation limitations in modern LLM-based approaches. We provide the first systematic characterisation of semantic drift in physics autoformalisation in terms of concepts such as notational collapse and abstraction elevation which reveals what formal language verifies when full semantic preservation is unattainable. We release the codebase together with an interactive UI-based FormalScience system which facilitates autoformalisation and theorem proving in scientific domains beyond physics.https://github.com/jmeadows17/formal-science

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a practical human-in-the-loop pipeline for turning physics problems into Lean code plus a new 200-example dataset, but semantic alignment rests only on one expert's judgment.

read the letter

The main takeaway is that FormalScience is a workable agentic system that lets one domain expert without deep Lean experience produce syntactically valid formalizations of university physics problems. They applied it to build FormalPhysics, a dataset of 200 LaTeX problems and solutions, mostly quantum mechanics and electromagnetism, that all pass Lean checks and show higher complexity than standard math benchmarks. They also release the full codebase and an interactive UI, which is the kind of thing that lets others actually use the work.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces FormalScience, a domain-agnostic human-in-the-loop agentic pipeline that combines LLMs with code generation in Lean4 to autoformalize informal scientific reasoning. Applied to physics, the authors construct the FormalPhysics dataset of 200 university-level problems (primarily quantum mechanics and electromagnetism) together with their Lean4 formalizations, claiming perfect formal validity and greater statement complexity than existing formal math benchmarks. The work evaluates open-source and proprietary models on a statement autoformalisation task using zero-shot prompting, self-refinement, and a multi-stage agentic approach, while providing the first systematic characterisation of semantic drift phenomena such as notational collapse and abstraction elevation.

Significance. If the central claims hold, the pipeline offers a practical, low-cost route for a single domain expert to produce formally verifiable scientific content, with the released codebase and interactive UI providing immediate value for reproducibility and extension to other domains. The FormalPhysics dataset and the explicit treatment of semantic drift limitations supply concrete material for studying where LLM-based formalisation succeeds or fails in notation-heavy fields.

major comments (3)

[§4] §4 (FormalPhysics construction): semantic alignment of the 200 entries is asserted after a single domain expert iterates with the agent until Lean checks pass and the expert is satisfied; no inter-rater agreement, second-expert review, or quantified semantic-drift score is reported. This single-judge process is the sole safeguard for the headline claim of 'semantically aligned' formal proofs and therefore load-bearing for the dataset's claimed utility.
[§5] §5 (evaluation and semantic-drift analysis): the abstract states 'perfect formal validity' across prompting strategies, yet the text provides no per-strategy success rates, error typology, or ablation that isolates the contribution of the multi-stage agentic refinement versus simpler baselines. Without these numbers the scalability and limitation-characterisation claims cannot be assessed for robustness.
[§5.3] §5.3 (characterisation of semantic drift): the notions of notational collapse and abstraction elevation are introduced as novel, but the section supplies no concrete examples tied to specific FormalPhysics statements, no count of affected entries, and no metric showing how often full semantic preservation is unattainable. This weakens the claim that the analysis 'reveals what formal language verifies'.

minor comments (2)

[Abstract] The abstract claims 'low economic cost' without any reported token counts, human time, or monetary figures in the main text or appendix; adding these numbers would strengthen the scalability narrative.
[Figures] Pipeline diagrams (presumably Figure 1 or 2) should explicitly annotate the human intervention points and the exact Lean4 feedback loop to avoid ambiguity for readers unfamiliar with agentic workflows.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the strengths and limitations of our work. We address each major point below, providing clarifications where the manuscript may have been unclear and outlining specific revisions to strengthen the presentation of our claims.

read point-by-point responses

Referee: [§4] §4 (FormalPhysics construction): semantic alignment of the 200 entries is asserted after a single domain expert iterates with the agent until Lean checks pass and the expert is satisfied; no inter-rater agreement, second-expert review, or quantified semantic-drift score is reported. This single-judge process is the sole safeguard for the headline claim of 'semantically aligned' formal proofs and therefore load-bearing for the dataset's claimed utility.

Authors: We agree that semantic alignment rests on the judgment of a single domain expert, which is a limitation of the current dataset construction process. This design choice aligns with the human-in-the-loop philosophy of FormalScience, enabling a single expert without formal methods expertise to produce verified formalizations at low cost. Lean provides syntactic guarantees, while semantic fidelity is ensured through iterative expert review until the expert is satisfied. We acknowledge that inter-rater agreement or a quantified drift score would provide additional robustness. In the revised manuscript, we will expand §4 to explicitly discuss this limitation, include a small-scale inter-rater agreement study on a 20-problem subset, and report a basic semantic-drift score based on expert pairwise comparisons. revision: partial
Referee: [§5] §5 (evaluation and semantic-drift analysis): the abstract states 'perfect formal validity' across prompting strategies, yet the text provides no per-strategy success rates, error typology, or ablation that isolates the contribution of the multi-stage agentic refinement versus simpler baselines. Without these numbers the scalability and limitation-characterisation claims cannot be assessed for robustness.

Authors: We clarify that the claim of 'perfect formal validity' in the abstract and §4 refers exclusively to the constructed FormalPhysics dataset (all 200 entries pass Lean type-checking and expert semantic review), not to the autoformalisation success rates across prompting strategies. The evaluation in §5 compares zero-shot, self-refinement, and multi-stage agentic approaches on statement formalisation, but we accept that the current text lacks sufficient granularity. In revision, we will add a new table with per-strategy success rates (including exact numbers for each model and method), an error typology breakdown, and an ablation study isolating the contribution of the agentic stages versus baselines. revision: yes
Referee: [§5.3] §5.3 (characterisation of semantic drift): the notions of notational collapse and abstraction elevation are introduced as novel, but the section supplies no concrete examples tied to specific FormalPhysics statements, no count of affected entries, and no metric showing how often full semantic preservation is unattainable. This weakens the claim that the analysis 'reveals what formal language verifies'.

Authors: We agree that §5.3 would benefit from greater concreteness. The section currently introduces notational collapse and abstraction elevation as observed phenomena in physics formalisation, but lacks the requested specifics. In the revised version, we will add 3-4 concrete examples drawn directly from FormalPhysics entries (with Lean code and original LaTeX), report the approximate frequency of each phenomenon across the dataset, and introduce a simple expert-rated metric (0-5 scale) for semantic preservation to quantify how often full alignment is unattainable. revision: yes

Circularity Check

0 steps flagged

No circularity: engineering pipeline and dataset construction are self-contained.

full rationale

The paper describes an agentic human-in-the-loop pipeline for autoformalising scientific statements into Lean4, applies it to create the FormalPhysics dataset of 200 physics problems, and evaluates LLMs on statement autoformalisation tasks. No mathematical derivations, fitted parameters, or predictions are claimed; the central contributions are the pipeline description, dataset release, and empirical evaluation results. Semantic alignment is asserted via expert iteration until Lean checks pass, but this is an engineering process rather than a derivation that reduces to its own inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in the provided text. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on standard assumptions about LLM capabilities for code generation and human oversight for semantic alignment; no new physical axioms or invented entities are introduced.

pith-pipeline@v0.9.0 · 5592 in / 1220 out tokens · 32774 ms · 2026-05-08T11:52:36.155094+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages

[1]

InInternational Conference on Intelligent Computer Mathematics, pages 288–

Formalizing physics: automation, presentation and foundation issues. InInternational Conference on Intelligent Computer Mathematics, pages 288–

work page
[2]

Sirui Li, Wangyue Lu, Xiaorui Shi, Ke Weng, Haozhe Sun, Minghe Yu, Tiancheng Zhang, Ge Yu, Hengyu Liu, and Lun Du

Springer. Sirui Li, Wangyue Lu, Xiaorui Shi, Ke Weng, Haozhe Sun, Minghe Yu, Tiancheng Zhang, Ge Yu, Hengyu Liu, and Lun Du. 2025. Msc-180: A bench- mark for automated formal theorem proving from mathematical subject classification.arXiv preprint arXiv:2512.18256. Zenan Li, Yifan Wu, Zhaoyu Li, Xinming Wei, Xian Zhang, Fan Yang, and Xiaoxing Ma. 2024. Aut...

work page arXiv 2025
[3]

InProceed- ings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24858–24866

Controlling equational reasoning in large lan- guage models with prompt interventions. InProceed- ings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24858–24866. Agnieszka Mensfelt, David Tena Cucala, Santiago Franco, Angeliki Koutsoukou-Argyraki, Vince Trenc- senyi, and Kostas Stathis. 2025. Towards a com- mon framework for autoform...

work page arXiv 2025
[4]

OpenAI, :, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K

Mathgap: Out-of-distribution evaluation on problems with arbitrarily complex proofs.arXiv preprint arXiv:2410.13502. OpenAI, :, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bu...

work page arXiv 2025
[5]

question

minif2f: a cross-system benchmark for for- mal olympiad-level mathematics. InInternational Conference on Learning Representations. A Implementation of FormalScience pipeline to construct FormalPhysics As described in Alg. 1, the FormalScience pipeline requires in-context examples of informal state- ments and proofs, and a larger collection of infor- mal p...

work page 2024
[8]

Natural language statement: {nl_statement} Give me the Lean4 formal code of the statement: Self- Refinement with Error Feedback You are an expert in formal language Lean4

You should wrap the formal code in a way illustrated as the following: %%%%%%%%%% Your Formal Code %%%%%%%%%% Strictly follow the instructions that have been claimed. Natural language statement: {nl_statement} Give me the Lean4 formal code of the statement: Self- Refinement with Error Feedback You are an expert in formal language Lean4. You will be given ...

work page
[9]

If the given physics statement is a theorem or lemma, omit the formal proof and use the default ’sorry’ mode in the formal code

You should give the formal code directly without any additional comments or explanations. If the given physics statement is a theorem or lemma, omit the formal proof and use the default ’sorry’ mode in the formal code

work page
[10]

In case that you need to import any necessary preambles, you should not import any fake (non-exist) preambles

work page
[11]

effectively unchanged with self-refinement

You should wrap the formal code in a way illustrated as the following: %%%%%%%%%% Your Formal Code %%%%%%%%%% Strictly follow the instructions that have been claimed. Natural language statement: {nl_statement} There are some Lean4 formal codes describing the given physics statement: {formal} You should refine the formal code for your task to make it corre...

work page

[1] [1]

InInternational Conference on Intelligent Computer Mathematics, pages 288–

Formalizing physics: automation, presentation and foundation issues. InInternational Conference on Intelligent Computer Mathematics, pages 288–

work page

[2] [2]

Sirui Li, Wangyue Lu, Xiaorui Shi, Ke Weng, Haozhe Sun, Minghe Yu, Tiancheng Zhang, Ge Yu, Hengyu Liu, and Lun Du

Springer. Sirui Li, Wangyue Lu, Xiaorui Shi, Ke Weng, Haozhe Sun, Minghe Yu, Tiancheng Zhang, Ge Yu, Hengyu Liu, and Lun Du. 2025. Msc-180: A bench- mark for automated formal theorem proving from mathematical subject classification.arXiv preprint arXiv:2512.18256. Zenan Li, Yifan Wu, Zhaoyu Li, Xinming Wei, Xian Zhang, Fan Yang, and Xiaoxing Ma. 2024. Aut...

work page arXiv 2025

[3] [3]

InProceed- ings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24858–24866

Controlling equational reasoning in large lan- guage models with prompt interventions. InProceed- ings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24858–24866. Agnieszka Mensfelt, David Tena Cucala, Santiago Franco, Angeliki Koutsoukou-Argyraki, Vince Trenc- senyi, and Kostas Stathis. 2025. Towards a com- mon framework for autoform...

work page arXiv 2025

[4] [4]

OpenAI, :, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K

Mathgap: Out-of-distribution evaluation on problems with arbitrarily complex proofs.arXiv preprint arXiv:2410.13502. OpenAI, :, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bu...

work page arXiv 2025

[5] [5]

question

minif2f: a cross-system benchmark for for- mal olympiad-level mathematics. InInternational Conference on Learning Representations. A Implementation of FormalScience pipeline to construct FormalPhysics As described in Alg. 1, the FormalScience pipeline requires in-context examples of informal state- ments and proofs, and a larger collection of infor- mal p...

work page 2024

[6] [8]

Natural language statement: {nl_statement} Give me the Lean4 formal code of the statement: Self- Refinement with Error Feedback You are an expert in formal language Lean4

You should wrap the formal code in a way illustrated as the following: %%%%%%%%%% Your Formal Code %%%%%%%%%% Strictly follow the instructions that have been claimed. Natural language statement: {nl_statement} Give me the Lean4 formal code of the statement: Self- Refinement with Error Feedback You are an expert in formal language Lean4. You will be given ...

work page

[7] [9]

If the given physics statement is a theorem or lemma, omit the formal proof and use the default ’sorry’ mode in the formal code

You should give the formal code directly without any additional comments or explanations. If the given physics statement is a theorem or lemma, omit the formal proof and use the default ’sorry’ mode in the formal code

work page

[8] [10]

In case that you need to import any necessary preambles, you should not import any fake (non-exist) preambles

work page

[9] [11]

effectively unchanged with self-refinement

You should wrap the formal code in a way illustrated as the following: %%%%%%%%%% Your Formal Code %%%%%%%%%% Strictly follow the instructions that have been claimed. Natural language statement: {nl_statement} There are some Lean4 formal codes describing the given physics statement: {formal} You should refine the formal code for your task to make it corre...

work page