pith. sign in

arxiv: 2604.23135 · v2 · pith:R4DJ6UDXnew · submitted 2026-04-25 · 💻 cs.LG

Characterizing Paraphrase-Induced Failures in Lean 4 Autoformalization

Pith reviewed 2026-05-21 00:44 UTC · model grok-4.3

classification 💻 cs.LG
keywords Lean 4autoformalizationparaphrase sensitivitycode generationformal verificationlanguage modelsmathematical theorems
0
0 comments X

The pith

Paraphrase sensitivity in Lean 4 autoformalization stems mainly from code-generation failures that differ by dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how rephrased but semantically equivalent math theorems affect the ability of language models to produce correct Lean 4 formalizations. It applies fixed paraphrase rules to both undergraduate and Olympiad-level problems and tests the outputs from multiple frontier and open-weight systems. The core observation is that most divergences arise during the generation of valid Lean code rather than from misunderstanding the original statement, and the specific error patterns shift depending on the difficulty and style of the source problems. These patterns hold across both closed and open models, indicating that current autoformalizers remain brittle to natural wording changes even when meaning is preserved.

Core claim

Across four frontier models and three open-weight autoformalizers, paraphrase sensitivity is dominated by failures at the code-generation layer, and these failures are typed differently by dataset, with the same patterns appearing in open-weight systems and motivating targeted training interventions for specific compilation errors.

What carries the argument

Deterministic paraphrase rules applied to undergraduate and Olympiad math datasets to expose divergences in generated Lean code.

If this is right

  • Training for autoformalization should target specific compilation failure modes rather than generic robustness to input variation.
  • Evaluation benchmarks for Lean 4 translation must incorporate paraphrased versions of problems to measure real reliability.
  • The same code-generation dominance and dataset-dependent error types appear in both frontier and open-weight models.
  • Future systems could reduce paraphrase sensitivity by focusing interventions on the code synthesis step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adding varied phrasings during training data creation could reduce the observed sensitivity without changing model scale.
  • The dataset-dependent error typing suggests separate handling for elementary versus competition-level problems may be useful.
  • This failure taxonomy could be applied to other formal languages beyond Lean to check for similar patterns.

Load-bearing premise

The paraphrase rules create inputs that carry exactly the same mathematical meaning as the originals, so any robust system should produce equivalent formalizations.

What would settle it

Observing that models generate equivalent valid Lean code for paraphrased inputs at rates comparable to original inputs, with error types showing no systematic difference between undergraduate and Olympiad datasets, would undermine the claim.

Figures

Figures reproduced from arXiv: 2604.23135 by Aryan Sharma, Ethan Lou, William Feng.

Figure 1
Figure 1. Figure 1: (a) Per-model surface consistency under paraphrased inputs on ProofNet# (teal) and miniF2F (orange). Pair agreement ranges from 19–56% across the GPT panel. Error bars are 95% bootstrap CIs. (b) Per-category BEq+ surface consistency, pooled across the GPT panel, with ProofNet# categories (teal) and miniF2F categories (orange) shown together. 4.2. Compiling Variance Failures are typed differently by dataset… view at source ↗
Figure 2
Figure 2. Figure 2: ProofNet# surface consistency on the 10 rules shared across both panels (95% bootstrap CIs, paired by source, 5,000 iters). 5. Discussion We study paraphrase sensitivity in Lean 4 autoformalization, and find it to primarily be a compile-boundary phenomenon, not a semantic one. Specifically, 602/602 pairs where both outputs compile are semantically equivalent under BEq+ and structurally near-identical under… view at source ↗
Figure 3
Figure 3. Figure 3: Per-(rule, model) BEq+ equivalence on ProofNet# (GPT panel). Rows sorted by panel-mean ascending. Dots denote cells with no paired predictions. A.6. Per-Rule Open-Weight Surface Consistency view at source ↗
read the original abstract

Lean 4 autoformalization has become increasingly popular in recent years, with frontier language models and open-weight autoformalizers now producing valid formalizations of mathematical theorems. However, these evaluations often rely on single canonical phrasings of theorems and rarely probe whether outputs are robust to natural variation in inputs, while prior work has shown that semantically equivalent paraphrases often induce divergent formal outputs. We study the structure of these divergences in Lean 4 by applying deterministic paraphrase rules to datasets of undergraduate and Olympiad-level math problems. Across four frontier models and three open-weight autoformalizers, we find that paraphrase sensitivity is dominated by failures at the code-generation layer, and that these failures are typed differently by dataset. Furthermore, these patterns generalize to open-weight models, showing that state-of-the-art autoformalizers still struggle to generate valid Lean code. Our results provide a failure-mode taxonomy for autoformalization and motivate training-time interventions targeted at specific compilation failures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript examines paraphrase-induced failures in Lean 4 autoformalization by applying deterministic paraphrase rules to undergraduate and Olympiad-level mathematical statements. Evaluating four frontier models and three open-weight autoformalizers, it concludes that paraphrase sensitivity is primarily due to failures at the code-generation layer, with these failures exhibiting different patterns depending on the dataset. The study offers a taxonomy of failure modes and suggests implications for training interventions.

Significance. This work is significant for the field of automated formalization as it systematically characterizes robustness issues that are often overlooked in standard evaluations relying on single phrasings. By covering multiple models and datasets, and providing a failure taxonomy, it lays groundwork for more reliable autoformalizers. The empirical approach with no ad-hoc parameters enhances the credibility of the observed patterns.

major comments (2)
  1. [§3] §3 (Paraphrase Generation): The central claim that divergences reflect code-generation layer failures rather than input changes rests on the assumption that deterministic paraphrase rules produce semantically equivalent statements. For Olympiad-level problems, rules that reorder quantifiers, alter implicit assumptions, or change variable scoping can yield non-equivalent inputs, confounding the attribution of failures and the claim that they are 'typed differently by dataset'. The manuscript should add explicit equivalence validation (e.g., sample human review or logical checks) to support this load-bearing assumption.
  2. [§5] §5 (Failure Taxonomy and Results): The classification of failures into dataset-specific types lacks detail on the decision criteria or inter-rater reliability used to distinguish code-generation errors from other categories. Without this, the cross-dataset comparison and generalization to open-weight models rest on potentially subjective typing, weakening the taxonomy's reliability.
minor comments (2)
  1. [Figures] Figure captions should explicitly define error categories and model abbreviations to improve readability.
  2. The abstract could briefly note the total number of problems per dataset for context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify key methodological assumptions in our study of paraphrase sensitivity in Lean 4 autoformalization. We respond to each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Paraphrase Generation): The central claim that divergences reflect code-generation layer failures rather than input changes rests on the assumption that deterministic paraphrase rules produce semantically equivalent statements. For Olympiad-level problems, rules that reorder quantifiers, alter implicit assumptions, or change variable scoping can yield non-equivalent inputs, confounding the attribution of failures and the claim that they are 'typed differently by dataset'. The manuscript should add explicit equivalence validation (e.g., sample human review or logical checks) to support this load-bearing assumption.

    Authors: We agree that semantic equivalence is a load-bearing assumption for attributing observed divergences primarily to the code-generation layer rather than to changes in input meaning. The deterministic rules were explicitly constructed to preserve logical structure, including quantifier scope and variable bindings, by limiting transformations to surface-level rephrasing that does not alter the underlying mathematical statement. Nevertheless, we accept that this should be validated more explicitly for Olympiad problems. In the revised manuscript we will add a dedicated subsection describing the rule design principles and report results from a human equivalence review performed on a random sample of 50 paraphrases from the Olympiad dataset, with two domain experts confirming equivalence in all cases. revision: yes

  2. Referee: [§5] §5 (Failure Taxonomy and Results): The classification of failures into dataset-specific types lacks detail on the decision criteria or inter-rater reliability used to distinguish code-generation errors from other categories. Without this, the cross-dataset comparison and generalization to open-weight models rest on potentially subjective typing, weakening the taxonomy's reliability.

    Authors: The taxonomy was derived from systematic inspection of Lean compiler error messages and the structure of the generated code across all models and datasets. We will revise §5 to include an expanded description of the decision criteria for each failure category, supported by concrete examples and the specific error signatures used to assign a given output to the code-generation layer. While the original classification was performed by the author team with iterative discussion to resolve edge cases, we did not compute formal inter-rater reliability statistics. We will therefore add a paragraph detailing the classification protocol and the degree of initial agreement, thereby addressing concerns about subjectivity while preserving the observed dataset-specific patterns. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical measurement study with no derivations or self-referential reductions

full rationale

The paper presents an empirical measurement study that applies a set of deterministic paraphrase rules to existing datasets of undergraduate and Olympiad-level problems, then evaluates the resulting formalization success rates across multiple models. No mathematical derivations, equations, or parameter-fitting procedures are described whose outputs are then relabeled as predictions. The reported patterns in failure modes (e.g., dominance of code-generation-layer failures) are direct observational results from the experimental runs rather than quantities forced by construction from the authors' own definitions or prior self-citations. The assumption that the paraphrase rules preserve semantic equivalence is an explicit experimental premise, not a derived claim that reduces to the inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the premise that the chosen paraphrase rules preserve mathematical semantics exactly; this is treated as a background domain assumption rather than something tested within the paper.

axioms (1)
  • domain assumption Deterministic paraphrase rules produce semantically equivalent statements that should map to the same Lean formalization under a robust model.
    Invoked when the authors interpret divergent outputs as model failures rather than changes in meaning.

pith-pipeline@v0.9.0 · 5692 in / 1204 out tokens · 52000 ms · 2026-05-21T00:44:34.831610+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 2 internal anchors

  1. [1]

    Proofnet: Autoformalizing and formally proving undergraduate-level mathematics

    Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E. W., Radev, D., and Avigad, J. Proofnet: Autoformalizing and formally proving undergraduate-level mathematics. CoRR, abs/2302.12433,

  2. [2]

    NLPerturbator: Study- ing the robustness of code LLMs to natural language variations.arXiv preprint arXiv:2406.19783,

    Chen, J., Li, Z., Hu, X., and Xia, X. NLPerturbator: Study- ing the robustness of code LLMs to natural language variations.arXiv preprint arXiv:2406.19783,

  3. [3]

    Hao, Y ., Wan, X., and Zhai, C. An investigation of robust- ness of LLMs in mathematical reasoning: Benchmark- ing with mathematically-equivalent transformation of ad- vanced mathematical problems.CoRR, abs/2508.08833,

  4. [4]

    Math-perturb: Benchmarking llms’ math reasoning abilities against hard perturbations

    Huang, K., Guo, J., Li, Z., Ji, X., Ge, J., Li, W., Guo, Y ., Cai, T., Yuan, H., Wang, R., Wu, Y ., Yin, M., Tang, S., Huang, Y ., Jin, C., Chen, X., Zhang, C., and Wang, M. MATH- Perturb: Benchmarking LLMs’ math reasoning abilities against hard perturbations.CoRR, abs/2502.06453,

  5. [5]

    Rethinking and improving autoformalization: Towards a faithful met- ric and a dependency retrieval-based approach

    Liu, Q., Zheng, X., Lu, X., Cao, Q., and Yan, J. Rethinking and improving autoformalization: Towards a faithful met- ric and a dependency retrieval-based approach. InThe Thirteenth International Conference on Learning Repre- sentations, 2025a. Liu, X., Bao, K., Zhang, J., Liu, Y ., Chen, Y ., Liu, Y ., Jiao, Y ., and Luo, T. ATLAS: Autoformalizing theo- r...

  6. [6]

    CoRR, abs/2406.01940,

  7. [7]

    and Shah, A

    Moore, H. and Shah, A. Evaluating autoformalization ro- bustness via semantically similar paraphrasing.CoRR, abs/2511.12784,

  8. [8]

    Re- liable evaluation and benchmarks for statement autofor- malization

    Poiroux, A., Weiss, G., Kunˇcak, V ., and Bosselut, A. Re- liable evaluation and benchmarks for statement autofor- malization. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17958–17980,

  9. [9]

    DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

    Ren, Z. Z., Shao, Z., Song, J., Xin, H., Wang, H., Zhao, W., Zhang, L., Fu, Z., Zhu, Q., Yang, D., Wu, Z. F., Gou, Z., Ma, S., Tang, H., Liu, Y ., Gao, W., Guo, D., and Ruan, C. DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition.CoRR, abs/2504.21801,

  10. [10]

    Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

    Wang, H., Unsal, M., Lin, X., Baksys, M., Liu, J., Dos San- tos, M., Sung, F., Vinyes, M., Ying, Z., Zhu, Z., Lu, J., de Saxc´e, H., et al. Kimina-Prover Preview: Towards large formal reasoning models with reinforcement learn- ing.arXiv preprint arXiv:2504.11354,

  11. [11]

    Promptbench: Towards evaluating the robustness of large language models on adversarial prompts

    Zhu, K., Wang, J., Zhou, J., Wang, Z., Chen, H., Wang, Y ., Yang, L., Ye, W., Gong, N. Z., Zhang, Y ., and Xie, X. Promptbench: Towards evaluating the robustness of large language models on adversarial prompts.CoRR, abs/2306.04528,

  12. [12]

    ifPthenQ

    5 Surface Sensitivity in Lean 4 Autoformalization A. Additional Results A.1. Compile Rates Table 2 reports per-cell compile rates underlying theN=602 compile-both claim in §4.1. Compile rates range from 11.4% to 24.2% per prediction direction. The both-compile column gives the number of pairs where both baseline and perturbed predictions type-check. BEq+ ...