pith. machine review for the scientific record.

arxiv: 2604.23135 · v1 · submitted 2026-04-25 · 💻 cs.LG

Recognition: unknown

Surface Sensitivity in Lean 4 Autoformalization

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:30 UTC · model grok-4.3

classification 💻 cs.LG
keywords autoformalization · Lean 4 · paraphrase sensitivity · theorem proving · formal mathematics · machine learning · compilation boundary

The pith

Paraphrase sensitivity in Lean autoformalization stems from compilation-boundary failures rather than semantic divergence among successful formalizations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why semantically equivalent paraphrases of natural-language theorem statements lead to divergent formal outputs from Lean 4 autoformalizers. It applies 60 deterministic paraphrase rules to ProofNet# and miniF2F across four GPT-family models and three open-weight 7B autoformalizers. The central observation is that when both the baseline and perturbed formalizations compile successfully, they prove to be semantically equivalent under the BEq+ metric and structurally near-identical under the GTED metric. This indicates that the sensitivity arises mainly because paraphrases change whether an output crosses the compilation threshold, not because they produce genuinely different mathematical meanings when they succeed. The results point toward targeting training interventions at the compilation step and designing benchmarks that isolate compile-conditional equivalence from surface-level consistency.
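Read operationally, the design reduces to a compile-conditional comparison loop. A minimal sketch in Python, assuming the model call, the Lean 4 compilation check, and the BEq+/GTED scorers are supplied as callables (the names below are placeholders, not the paper's code):

```python
from dataclasses import dataclass, field

@dataclass
class PairStats:
    pairs_total: int = 0
    both_compile: int = 0
    beq_equivalent: int = 0
    gted_distances: list = field(default_factory=list)

def evaluate_compile_conditional(theorems, rules, formalize, compiles, beq_plus, gted):
    """Sketch of the compile-conditional protocol.

    theorems: natural-language statements; rules: deterministic paraphrase
    functions (60 in the paper); formalize / compiles / beq_plus / gted are
    hypothetical callables standing in for the model, the Lean 4 type-checker,
    and the BEq+ / GTED metrics.
    """
    stats = PairStats()
    for nl in theorems:
        baseline = formalize(nl)
        for rule in rules:
            perturbed = formalize(rule(nl))
            stats.pairs_total += 1
            # Semantic and structural comparison is only defined once both
            # candidates cross the compilation boundary (type-check in Lean 4).
            if compiles(baseline) and compiles(perturbed):
                stats.both_compile += 1
                if beq_plus(baseline, perturbed):
                    stats.beq_equivalent += 1
                stats.gted_distances.append(gted(baseline, perturbed))
    return stats
```

The compile-conditional equivalence claim is then the ratio `beq_equivalent / both_compile` over all paired predictions, while the compile-boundary effect shows up in how `both_compile` moves with the paraphrase rule.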

Core claim

When both baseline and perturbed outputs compile, paired predictions are semantically equivalent under BEq+ and structurally near-identical under GTED. By contrast, paraphrasing substantially affects whether outputs compile, with failure modes varying across datasets and perturbation classes.
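To make the claim concrete, here is a hypothetical illustration, not an example taken from the paper's data, of what a compile-both pair can look like: two Lean 4 statements that differ in surface form, say because a paraphrase changed how positivity of `b` is phrased, yet are provably equivalent, which is exactly the regime BEq+ is meant to score as equivalent.

```lean
import Mathlib

-- Hypothetical pair, for illustration only: both statements type-check,
-- differ syntactically (hypothesis order, `b ≠ 0` vs `0 < b`), and are
-- interprovable, so BEq+ would treat them as equivalent.
theorem baseline_output (a b : ℕ) (h : a ∣ b) (hb : b ≠ 0) : a ≤ b :=
  Nat.le_of_dvd (Nat.pos_of_ne_zero hb) h

theorem perturbed_output (a b : ℕ) (hb : 0 < b) (h : a ∣ b) : a ≤ b :=
  Nat.le_of_dvd hb h
```

The failure mode the paper reports is the other one: the paraphrase pushes one side of such a pair below the compilation boundary, so the pair never reaches this comparison at all.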

What carries the argument

Sixty deterministic paraphrase rules applied to ProofNet# and miniF2F, evaluated with the BEq+ semantic equivalence metric and GTED structural similarity metric on compiling formalizations.

If this is right

  • Training-time interventions should target the compile boundary rather than the semantic layer.
  • Benchmarks should separate compile-conditional equivalence from surface consistency.
  • Paraphrasing affects compilation success rates, with failure modes that differ by dataset and perturbation class.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same compile-boundary pattern may appear in autoformalization for other proof assistants beyond Lean 4.
  • Increasing robustness to surface variation could raise overall success rates by improving the fraction of outputs that reach compilation.
  • Datasets enriched with controlled paraphrases could serve as better probes for where models lose the ability to produce valid formal statements.

Load-bearing premise

The 60 deterministic paraphrase rules preserve semantic equivalence of the original theorems, and the BEq+ and GTED metrics fully capture any semantic or structural differences in the formal outputs.

What would settle it

Finding even one pair of formalizations from semantically equivalent paraphrases that both compile successfully yet differ under the BEq+ metric would disprove the claim that successful outputs show no semantic divergence.
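A minimal sketch of that falsification test, assuming paired predictions and the compilation and BEq+ checks are available as callables (hypothetical names, same conventions as the sketch above):

```python
def find_semantic_counterexamples(pairs, compiles, beq_plus):
    """Return compile-both pairs whose formalizations BEq+ judges non-equivalent.

    pairs: iterable of (baseline_formalization, perturbed_formalization).
    A single returned pair would contradict the compile-conditional
    equivalence claim.
    """
    return [
        (base, pert)
        for base, pert in pairs
        if compiles(base) and compiles(pert) and not beq_plus(base, pert)
    ]
```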

Figures

Figures reproduced from arXiv: 2604.23135 by Aryan Sharma, Ethan Lou, William Feng.

Figure 1
Figure 1. (a) Per-model surface consistency under paraphrased inputs on ProofNet# (teal) and miniF2F (orange). Pair agreement ranges from 19–56% across the GPT panel. Error bars are 95% bootstrap CIs. (b) Per-category BEq+ surface consistency, pooled across the GPT panel, with ProofNet# categories (teal) and miniF2F categories (orange) shown together.
Figure 2
Figure 2. ProofNet# surface consistency on the 10 rules shared across both panels (95% bootstrap CIs, paired by source, 5,000 iterations).
Figure 3
Figure 3. Per-(rule, model) BEq+ equivalence on ProofNet# (GPT panel). Rows sorted by panel-mean ascending. Dots denote cells with no paired predictions.
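The captions above lean on 95% bootstrap confidence intervals, paired by source, over 5,000 iterations. A generic percentile-bootstrap sketch of that kind of interval (not the authors' code; resampling is over paired per-source outcomes) might look like:

```python
import random

def paired_bootstrap_ci(pair_outcomes, iters=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a proportion over paired outcomes.

    pair_outcomes: list of 0/1 values, one per (baseline, perturbed) pair,
    e.g. 1 if the pair agrees under the chosen surface-consistency metric.
    Pairs are resampled as units, mirroring 'paired by source' resampling.
    """
    rng = random.Random(seed)
    n = len(pair_outcomes)
    estimates = []
    for _ in range(iters):
        resample = [pair_outcomes[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(resample) / n)
    estimates.sort()
    lower = estimates[int((alpha / 2) * iters)]
    upper = estimates[int((1 - alpha / 2) * iters) - 1]
    return lower, upper
```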
read the original abstract

Natural-language variation poses a key challenge in Lean autoformalization: semantically equivalent paraphrases of the same theorem statements can induce divergent formal outputs, yet it remains unclear whether this variation reflects semantic disagreements or shallower failures. We investigate this question in Lean 4 using 60 deterministic paraphrase rules applied to ProofNet# and miniF2F. Across four GPT-family models and three open-weight 7B autoformalizers, we find that the observed paraphrase sensitivity reflects compilation-boundary failures rather than semantic divergence among successful formalizations. In particular, when both baseline and perturbed outputs compile, paired predictions are semantically equivalent under BEq+ and structurally near-identical under GTED. By contrast, paraphrasing substantially affects whether outputs compile, with failure modes varying across datasets and perturbation classes. Our results suggest that future training-time interventions should target the compile boundary rather than the semantic layer, and that benchmarks should separate compile-conditional equivalence from surface consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript investigates paraphrase sensitivity in Lean 4 autoformalization by applying 60 deterministic, meaning-preserving paraphrase rules to theorems drawn from ProofNet# and miniF2F. Across four GPT-family models and three open-weight 7B autoformalizers, it reports that paraphrase-induced variation primarily manifests as compilation failures rather than semantic divergence: when both baseline and perturbed outputs compile successfully, the formalizations are equivalent under the BEq+ metric and structurally near-identical under GTED. The work concludes that future interventions should target the compile boundary and that benchmarks should separate compile-conditional equivalence from surface consistency.

Significance. If the central empirical distinction holds, the result usefully separates compilation-boundary effects from deeper semantic disagreements in autoformalization. The deterministic construction of the 60 paraphrase rules and the conditional analysis (equivalence only among successful compilations) provide a clean operationalization that can guide both model training and benchmark design. The explicit use of BEq+ for semantic equivalence and GTED for structural similarity adds precision to the claims.

minor comments (3)
  1. The experimental section should specify the exact number of outputs sampled or evaluated per theorem (including any filtering for successful compilations) and report the raw counts of cases where both baseline and perturbed outputs compile, to allow readers to assess the statistical power of the BEq+ and GTED comparisons.
  2. The prompt templates used for each model family are referenced but not reproduced; including them (or a representative example) would improve reproducibility of the reported compilation rates and failure-mode distributions.
  3. Table or figure captions for the per-dataset and per-perturbation-class results should explicitly state the total number of theorem-paraphrase pairs evaluated so that the reported percentages of compilation success can be interpreted in absolute terms.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work and for recommending minor revision. We are pleased that the central empirical finding—that paraphrase sensitivity primarily manifests at the compilation boundary rather than through semantic divergence among successful formalizations—was viewed as useful, along with the clean operationalization via deterministic rules, BEq+, and GTED.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper conducts an empirical measurement of model outputs on paraphrased theorem statements from ProofNet# and miniF2F, using deterministic meaning-preserving rules and external verification via compilation success plus BEq+ semantic equivalence and GTED structural metrics. The central claim is scoped to the compile-conditional regime and does not derive any quantity from fitted parameters, self-definitions, or load-bearing self-citations; all reported quantities are direct observations against independent checks. No step reduces by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the chosen paraphrase rules are semantically neutral and that the two equivalence metrics are adequate proxies for meaning and structure.

axioms (1)
  • domain assumption The 60 deterministic paraphrase rules preserve the semantic meaning of the theorems.
    Invoked when interpreting that successful formalizations are equivalent rather than divergent.

pith-pipeline@v0.9.0 · 5452 in / 1175 out tokens · 43824 ms · 2026-05-08T08:30:06.087700+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

12 extracted references · 10 canonical work pages · 1 internal anchor

  1. [1] Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E. W., Radev, D., and Avigad, J. ProofNet: Autoformalizing and formally proving undergraduate-level mathematics. CoRR, abs/2302.12433.

  2. [2] Chen, J., Li, Z., Hu, X., and Xia, X. NLPerturbator: Studying the robustness of code LLMs to natural language variations. arXiv preprint arXiv:2406.19783.

  3. [3] Hao, Y., Wan, X., and Zhai, C. An investigation of robustness of LLMs in mathematical reasoning: Benchmarking with mathematically-equivalent transformation of advanced mathematical problems. CoRR, abs/2508.08833.

  4. [4] Huang, K., Guo, J., Li, Z., Ji, X., Ge, J., Li, W., Guo, Y., Cai, T., Yuan, H., Wang, R., Wu, Y., Yin, M., Tang, S., Huang, Y., Jin, C., Chen, X., Zhang, C., and Wang, M. MATH-Perturb: Benchmarking LLMs' math reasoning abilities against hard perturbations. CoRR, abs/2502.06453.

  5. [5] Liu, Q., Zheng, X., Lu, X., Cao, Q., and Yan, J. Rethinking and improving autoformalization: Towards a faithful metric and a dependency retrieval-based approach. In The Thirteenth International Conference on Learning Representations, 2025a.

  6. [6] Liu, X., Bao, K., Zhang, J., Liu, Y., Chen, Y., Liu, Y., Jiao, Y., and Luo, T. ATLAS: Autoformalizing theo… CoRR, abs/2406.01940.

  7. [7] Moore, H. and Shah, A. Evaluating autoformalization robustness via semantically similar paraphrasing. CoRR, abs/2511.12784.

  8. [8] Poiroux, A., Weiss, G., Kunčak, V., and Bosselut, A. Reliable evaluation and benchmarks for statement autoformalization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17958–17980.

  9. [9] Ren, Z. Z., Shao, Z., Song, J., Xin, H., Wang, H., Zhao, W., Zhang, L., Fu, Z., Zhu, Q., Yang, D., Wu, Z. F., Gou, Z., Ma, S., Tang, H., Liu, Y., Gao, W., Guo, D., and Ruan, C. DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. CoRR, abs/2504.21801.

  10. [10] Wang, H., Unsal, M., Lin, X., Baksys, M., Liu, J., Dos Santos, M., Sung, F., Vinyes, M., Ying, Z., Zhu, Z., Lu, J., de Saxcé, H., et al. Kimina-Prover Preview: Towards large formal reasoning models with reinforcement learning. arXiv preprint arXiv:2504.11354.

  11. [11] Zhu, K., Wang, J., Zhou, J., Wang, Z., Chen, H., Wang, Y., Yang, L., Ye, W., Gong, N. Z., Zhang, Y., and Xie, X. PromptBench: Towards evaluating the robustness of large language models on adversarial prompts. CoRR, abs/2306.04528.

  12. [12] Internal anchor (ifPthenQ), pointing into the paper's Appendix A (Additional Results), A.1 Compile Rates: Table 2 reports per-cell compile rates underlying the N=602 compile-both claim in §4.1. Compile rates range from 11.4% to 24.2% per prediction direction; the both-compile column gives the number of pairs where both baseline and perturbed predictions type-check. BEq+ …