pith. machine review for the scientific record.

arxiv: 2604.23135 · v1 · submitted 2026-04-25 · 💻 cs.LG

Recognition: unknown

Surface Sensitivity in Lean 4 Autoformalization

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:30 UTC · model grok-4.3

classification 💻 cs.LG
keywords autoformalization · Lean 4 · paraphrase sensitivity · theorem proving · formal mathematics · machine learning · compilation boundary

The pith

Paraphrase sensitivity in Lean autoformalization stems from compilation-boundary failures rather than semantic divergence among successful formalizations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why semantically equivalent paraphrases of natural-language theorem statements lead to divergent formal outputs from Lean 4 autoformalizers. It applies 60 deterministic paraphrase rules to ProofNet# and miniF2F across four GPT-family models and three open-weight 7B autoformalizers. The central observation is that when both the baseline and perturbed formalizations compile successfully, they prove to be semantically equivalent under the BEq+ metric and structurally near-identical under the GTED metric. This indicates that the sensitivity arises mainly because paraphrases change whether an output crosses the compilation threshold, not because they produce genuinely different mathematical meanings when they succeed. The results point toward targeting training interventions at the compilation step and designing benchmarks that isolate compile-conditional equivalence from surface-level consistency.
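Read operationally, the design reduces to a compile-conditional comparison loop. A minimal sketch in Python, assuming the model call, the Lean 4 compilation check, and the BEq+/GTED scorers are supplied as callables (the names below are placeholders, not the paper's code):

```python
from dataclasses import dataclass, field

@dataclass
class PairStats:
    pairs_total: int = 0
    both_compile: int = 0
    beq_equivalent: int = 0
    gted_distances: list = field(default_factory=list)

def evaluate_compile_conditional(theorems, rules, formalize, compiles, beq_plus, gted):
    """Sketch of the compile-conditional protocol.

    theorems: natural-language statements; rules: deterministic paraphrase
    functions (60 in the paper); formalize / compiles / beq_plus / gted are
    hypothetical callables standing in for the model, the Lean 4 type-checker,
    and the BEq+ / GTED metrics.
    """
    stats = PairStats()
    for nl in theorems:
        baseline = formalize(nl)
        for rule in rules:
            perturbed = formalize(rule(nl))
            stats.pairs_total += 1
            # Semantic and structural comparison is only defined once both
            # candidates cross the compilation boundary (type-check in Lean 4).
            if compiles(baseline) and compiles(perturbed):
                stats.both_compile += 1
                if beq_plus(baseline, perturbed):
                    stats.beq_equivalent += 1
                stats.gted_distances.append(gted(baseline, perturbed))
    return stats
```

The compile-conditional equivalence claim is then the ratio `beq_equivalent / both_compile` over all paired predictions, while the compile-boundary effect shows up in how `both_compile` moves with the paraphrase rule.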

Core claim

When both baseline and perturbed outputs compile, paired predictions are semantically equivalent under BEq+ and structurally near-identical under GTED. By contrast, paraphrasing substantially affects whether outputs compile, with failure modes varying across datasets and perturbation classes.
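To make the claim concrete, here is a hypothetical illustration, not an example taken from the paper's data, of what a compile-both pair can look like: two Lean 4 statements that differ in surface form, say because a paraphrase changed how positivity of `b` is phrased, yet are provably equivalent, which is exactly the regime BEq+ is meant to score as equivalent.

```lean
import Mathlib

-- Hypothetical pair, for illustration only: both statements type-check,
-- differ syntactically (hypothesis order, `b ≠ 0` vs `0 < b`), and are
-- interprovable, so BEq+ would treat them as equivalent.
theorem baseline_output (a b : ℕ) (h : a ∣ b) (hb : b ≠ 0) : a ≤ b :=
  Nat.le_of_dvd (Nat.pos_of_ne_zero hb) h

theorem perturbed_output (a b : ℕ) (hb : 0 < b) (h : a ∣ b) : a ≤ b :=
  Nat.le_of_dvd hb h
```

The failure mode the paper reports is the other one: the paraphrase pushes one side of such a pair below the compilation boundary, so the pair never reaches this comparison at all.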

What carries the argument

Sixty deterministic paraphrase rules applied to ProofNet# and miniF2F, evaluated with the BEq+ semantic equivalence metric and GTED structural similarity metric on compiling formalizations.

If this is right

  • Training-time interventions should target the compile boundary rather than the semantic layer.
  • Benchmarks should separate compile-conditional equivalence from surface consistency.
  • Paraphrasing affects compilation success rates, with failure modes that differ by dataset and perturbation class.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same compile-boundary pattern may appear in autoformalization for other proof assistants beyond Lean 4.
  • Increasing robustness to surface variation could raise overall success rates by improving the fraction of outputs that reach compilation.
  • Datasets enriched with controlled paraphrases could serve as better probes for where models lose the ability to produce valid formal statements.

Load-bearing premise

The 60 deterministic paraphrase rules preserve semantic equivalence of the original theorems, and the BEq+ and GTED metrics fully capture any semantic or structural differences in the formal outputs.

What would settle it

Finding even one pair of formalizations from semantically equivalent paraphrases that both compile successfully yet differ under the BEq+ metric would disprove the claim that successful outputs show no semantic divergence.
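A minimal sketch of that falsification test, assuming paired predictions and the compilation and BEq+ checks are available as callables (hypothetical names, same conventions as the sketch above):

```python
def find_semantic_counterexamples(pairs, compiles, beq_plus):
    """Return compile-both pairs whose formalizations BEq+ judges non-equivalent.

    pairs: iterable of (baseline_formalization, perturbed_formalization).
    A single returned pair would contradict the compile-conditional
    equivalence claim.
    """
    return [
        (base, pert)
        for base, pert in pairs
        if compiles(base) and compiles(pert) and not beq_plus(base, pert)
    ]
```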

Figures

Figures reproduced from arXiv: 2604.23135 by Aryan Sharma, Ethan Lou, William Feng.

Figure 1
Figure 1. (a) Per-model surface consistency under paraphrased inputs on ProofNet# (teal) and miniF2F (orange). Pair agreement ranges from 19–56% across the GPT panel. Error bars are 95% bootstrap CIs. (b) Per-category BEq+ surface consistency, pooled across the GPT panel, with ProofNet# categories (teal) and miniF2F categories (orange) shown together.
Figure 2
Figure 2. ProofNet# surface consistency on the 10 rules shared across both panels (95% bootstrap CIs, paired by source, 5,000 iterations).
Figure 3
Figure 3. Per-(rule, model) BEq+ equivalence on ProofNet# (GPT panel). Rows sorted by panel-mean ascending. Dots denote cells with no paired predictions.
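The captions above lean on 95% bootstrap confidence intervals, paired by source, over 5,000 iterations. A generic percentile-bootstrap sketch of that kind of interval (not the authors' code; resampling is over paired per-source outcomes) might look like:

```python
import random

def paired_bootstrap_ci(pair_outcomes, iters=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a proportion over paired outcomes.

    pair_outcomes: list of 0/1 values, one per (baseline, perturbed) pair,
    e.g. 1 if the pair agrees under the chosen surface-consistency metric.
    Pairs are resampled as units, mirroring 'paired by source' resampling.
    """
    rng = random.Random(seed)
    n = len(pair_outcomes)
    estimates = []
    for _ in range(iters):
        resample = [pair_outcomes[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(resample) / n)
    estimates.sort()
    lower = estimates[int((alpha / 2) * iters)]
    upper = estimates[int((1 - alpha / 2) * iters) - 1]
    return lower, upper
```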
read the original abstract

Natural-language variation poses a key challenge in Lean autoformalization: semantically equivalent paraphrases of the same theorem statements can induce divergent formal outputs, yet it remains unclear whether this variation reflects semantic disagreements or shallower failures. We investigate this question in Lean 4 using 60 deterministic paraphrase rules applied to ProofNet# and miniF2F. Across four GPT-family models and three open-weight 7B autoformalizers, we find that the observed paraphrase sensitivity reflects compilation-boundary failures rather than semantic divergence among successful formalizations. In particular, when both baseline and perturbed outputs compile, paired predictions are semantically equivalent under BEq+ and structurally near-identical under GTED. By contrast, paraphrasing substantially affects whether outputs compile, with failure modes varying across datasets and perturbation classes. Our results suggest that future training-time interventions should target the compile boundary rather than the semantic layer, and that benchmarks should separate compile-conditional equivalence from surface consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript investigates paraphrase sensitivity in Lean 4 autoformalization by applying 60 deterministic, meaning-preserving paraphrase rules to theorems drawn from ProofNet# and miniF2F. Across four GPT-family models and three open-weight 7B autoformalizers, it reports that paraphrase-induced variation primarily manifests as compilation failures rather than semantic divergence: when both baseline and perturbed outputs compile successfully, the formalizations are equivalent under the BEq+ metric and structurally near-identical under GTED. The work concludes that future interventions should target the compile boundary and that benchmarks should separate compile-conditional equivalence from surface consistency.

Significance. If the central empirical distinction holds, the result usefully separates compilation-boundary effects from deeper semantic disagreements in autoformalization. The deterministic construction of the 60 paraphrase rules and the conditional analysis (equivalence only among successful compilations) provide a clean operationalization that can guide both model training and benchmark design. The explicit use of BEq+ for semantic equivalence and GTED for structural similarity adds precision to the claims.

minor comments (3)
  1. The experimental section should specify the exact number of outputs sampled or evaluated per theorem (including any filtering for successful compilations) and report the raw counts of cases where both baseline and perturbed outputs compile, to allow readers to assess the statistical power of the BEq+ and GTED comparisons.
  2. The prompt templates used for each model family are referenced but not reproduced; including them (or a representative example) would improve reproducibility of the reported compilation rates and failure-mode distributions.
  3. Table or figure captions for the per-dataset and per-perturbation-class results should explicitly state the total number of theorem-paraphrase pairs evaluated so that the reported percentages of compilation success can be interpreted in absolute terms.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work and for recommending minor revision. We are pleased that the central empirical finding—that paraphrase sensitivity primarily manifests at the compilation boundary rather than through semantic divergence among successful formalizations—was viewed as useful, along with the clean operationalization via deterministic rules, BEq+, and GTED.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper conducts an empirical measurement of model outputs on paraphrased theorem statements from ProofNet# and miniF2F, using deterministic meaning-preserving rules and external verification via compilation success plus BEq+ semantic equivalence and GTED structural metrics. The central claim is scoped to the compile-conditional regime and does not derive any quantity from fitted parameters, self-definitions, or load-bearing self-citations; all reported quantities are direct observations against independent checks. No step reduces by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the chosen paraphrase rules are semantically neutral and that the two equivalence metrics are adequate proxies for meaning and structure.

axioms (1)
  • domain assumption The 60 deterministic paraphrase rules preserve the semantic meaning of the theorems.
    Invoked when interpreting that successful formalizations are equivalent rather than divergent.

pith-pipeline@v0.9.0 · 5452 in / 1175 out tokens · 43824 ms · 2026-05-08T08:30:06.087700+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

12 extracted references · 10 canonical work pages · 1 internal anchor

  1. [1] Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E. W., Radev, D., and Avigad, J. ProofNet: Autoformalizing and formally proving undergraduate-level mathematics. CoRR, abs/2302.12433.

  2. [2] Chen, J., Li, Z., Hu, X., and Xia, X. NLPerturbator: Studying the robustness of code LLMs to natural language variations. arXiv preprint arXiv:2406.19783.

  3. [3] Hao, Y., Wan, X., and Zhai, C. An investigation of robustness of LLMs in mathematical reasoning: Benchmarking with mathematically-equivalent transformation of advanced mathematical problems. CoRR, abs/2508.08833.

  4. [4] Huang, K., Guo, J., Li, Z., Ji, X., Ge, J., Li, W., Guo, Y., Cai, T., Yuan, H., Wang, R., Wu, Y., Yin, M., Tang, S., Huang, Y., Jin, C., Chen, X., Zhang, C., and Wang, M. MATH-Perturb: Benchmarking LLMs' math reasoning abilities against hard perturbations. CoRR, abs/2502.06453.

  5. [5] Liu, Q., Zheng, X., Lu, X., Cao, Q., and Yan, J. Rethinking and improving autoformalization: Towards a faithful metric and a dependency retrieval-based approach. In The Thirteenth International Conference on Learning Representations, 2025a.

  6. [6] Liu, X., Bao, K., Zhang, J., Liu, Y., Chen, Y., Liu, Y., Jiao, Y., and Luo, T. ATLAS: Autoformalizing theo… CoRR, abs/2406.01940.

  7. [7] Moore, H. and Shah, A. Evaluating autoformalization robustness via semantically similar paraphrasing. CoRR, abs/2511.12784.

  8. [8] Poiroux, A., Weiss, G., Kunčak, V., and Bosselut, A. Reliable evaluation and benchmarks for statement autoformalization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17958–17980.

  9. [9] Ren, Z. Z., Shao, Z., Song, J., Xin, H., Wang, H., Zhao, W., Zhang, L., Fu, Z., Zhu, Q., Yang, D., Wu, Z. F., Gou, Z., Ma, S., Tang, H., Liu, Y., Gao, W., Guo, D., and Ruan, C. DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. CoRR, abs/2504.21801.

  10. [10] Wang, H., Unsal, M., Lin, X., Baksys, M., Liu, J., Dos Santos, M., Sung, F., Vinyes, M., Ying, Z., Zhu, Z., Lu, J., de Saxcé, H., et al. Kimina-Prover Preview: Towards large formal reasoning models with reinforcement learning. arXiv preprint arXiv:2504.11354.

  11. [11] Zhu, K., Wang, J., Zhou, J., Wang, Z., Chen, H., Wang, Y., Yang, L., Ye, W., Gong, N. Z., Zhang, Y., and Xie, X. PromptBench: Towards evaluating the robustness of large language models on adversarial prompts. CoRR, abs/2306.04528.

  12. [12] Internal anchor (ifPthenQ), pointing into the paper's Appendix A (Additional Results), A.1 Compile Rates: Table 2 reports per-cell compile rates underlying the N=602 compile-both claim in §4.1. Compile rates range from 11.4% to 24.2% per prediction direction; the both-compile column gives the number of pairs where both baseline and perturbed predictions type-check. BEq+ …