pith. machine review for the scientific record.

arxiv: 2604.13515 · v1 · submitted 2026-04-15 · 💻 cs.LG · cs.AI · cs.LO


SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization


Pith reviewed 2026-05-10 13:18 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.LO
keywords autoformalization · SFT · GRPO · data overlap · post-training · Lean 4 · semantic evaluation · formal mathematics

The pith

Keeping SFT and GRPO data completely disjoint improves Lean 4 autoformalization accuracy at zero extra cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the amount of data shared between supervised fine-tuning and a subsequent Group Relative Policy Optimization stage matters when training a language model to translate natural-language math problems into formal Lean 4 statements. It runs controlled experiments on Qwen3-8B at zero, thirty, and one hundred percent overlap, plus SFT-only and GRPO-only baselines, and measures both whether the output compiles and whether it matches the intended meaning. Lower overlap produces steadily higher success rates on Gaokao-Formal and PutnamBench, with zero overlap adding more than ten percentage points of semantic accuracy over SFT alone, while full overlap leaves GRPO with no measurable effect. The result identifies data overlap as a free hyperparameter that can be tuned to make the second training stage useful rather than redundant.
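For orientation, GRPO's defining step is to baseline each sampled output against its own group of samples for the same prompt. Below is the standard group-relative advantage from the DeepSeekMath line of work, not necessarily the paper's exact variant:

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantages: normalize each reward against
    the mean and std of its own sampled group (one prompt, G outputs)."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 sampled formalizations for one prompt, dual-stage rewards.
print(group_relative_advantages([0.0, 0.0, 0.7, 1.0]))
```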

Core claim

The percentage of overlap between the SFT corpus and the GRPO prompts functions as a post-training hyperparameter that directly controls whether the GRPO stage improves autoformalization performance. When the stages use disjoint data, GRPO raises both compilation rates and LLM-judged semantic correctness in a monotonic fashion; when the stages share all prompts, GRPO adds nothing measurable to either metric and the second stage becomes redundant. Dual-metric evaluation also shows compile-semantic gaps above thirty points that remain hidden when only compilation is checked.

What carries the argument

SFT-GRPO data overlap percentage, the controllable fraction of GRPO prompts that already appeared in the SFT corpus, which determines whether the reinforcement-learning step adds value beyond the initial fine-tuning.
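A minimal sketch of how such overlap conditions could be constructed, assuming a single prompt pool and uniform sampling; the paper's actual corpus construction lives in its Methods and may differ:

```python
import random

def make_overlap_split(prompts, sft_size, grpo_size, overlap, seed=0):
    """Split one prompt pool into an SFT corpus and a GRPO prompt set
    whose overlap with the SFT corpus is a controlled fraction.

    overlap=0.0 keeps the two stages disjoint; overlap=1.0 reuses
    SFT prompts for every GRPO example. Hypothetical construction."""
    rng = random.Random(seed)
    pool = list(prompts)
    rng.shuffle(pool)

    sft_corpus = pool[:sft_size]
    held_out = pool[sft_size:]

    n_shared = round(overlap * grpo_size)
    n_fresh = grpo_size - n_shared
    assert n_shared <= len(sft_corpus), "not enough SFT prompts to share"
    assert n_fresh <= len(held_out), "not enough held-out prompts"

    grpo_prompts = rng.sample(sft_corpus, n_shared) + held_out[:n_fresh]
    rng.shuffle(grpo_prompts)
    return sft_corpus, grpo_prompts

# The paper's three conditions, rebuilt from one pool of placeholder IDs.
pool = [f"problem_{i}" for i in range(10_000)]
for ov in (0.0, 0.3, 1.0):
    sft, grpo = make_overlap_split(pool, sft_size=6_000, grpo_size=2_000, overlap=ov)
    shared = len(set(sft) & set(grpo))
    print(f"overlap={ov:.0%}: {shared}/{len(grpo)} GRPO prompts seen during SFT")
```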

If this is right

  • Zero-percent overlap between SFT and GRPO data yields the highest compilation and semantic accuracy on both Gaokao-Formal and PutnamBench.
  • Full overlap eliminates any measurable gain from the GRPO stage, leaving the model no better than SFT alone.
  • Partial overlap produces intermediate results, confirming a monotonic relationship between overlap and final performance.
  • Compile-only metrics miss semantic gaps larger than thirty percentage points that appear once an LLM judge is added.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same overlap principle could be tested in other sequential post-training pipelines that combine supervised and reinforcement stages.
  • Training data schedules might be designed from the start to enforce separation between stages rather than reusing prompts by default.
  • Semantic evaluation should be treated as a required companion metric whenever the goal is formal correctness rather than mere syntactic validity.

Load-bearing premise

That the observed differences in accuracy are caused by the overlap level itself rather than by any uncontrolled variation in training schedule, prompt selection, or the reliability of the LLM semantic judge.

What would settle it

Repeating the six training recipes while randomizing or swapping the overlap assignments and still obtaining the same strict ordering of performance by overlap percentage.
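One way to score that replication, sketched with placeholder numbers rather than the paper's results: rerun each overlap condition under multiple seeds, then require both a strongly negative rank correlation between overlap and accuracy and a strict per-seed ordering.

```python
from scipy.stats import spearmanr

# Hypothetical per-seed semantic pass@1 by overlap condition;
# placeholder values, not the paper's reported numbers.
runs = {
    0.0: [0.41, 0.40, 0.42],
    0.3: [0.36, 0.37, 0.35],
    1.0: [0.31, 0.30, 0.32],
}

xs = [ov for ov, accs in runs.items() for _ in accs]
ys = [acc for accs in runs.values() for acc in accs]
rho, p = spearmanr(xs, ys)
print(f"Spearman rho = {rho:.2f}, p = {p:.4f}")  # monotone decrease -> rho near -1

# The strict ordering must also hold within every rerun.
for seed in range(3):
    assert runs[0.0][seed] > runs[0.3][seed] > runs[1.0][seed]
```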

Figures

Figures reproduced from arXiv: 2604.13515 by Andy Lyu, Kasey Zhang, Xiaole Su.

Figure 1: The autoformalization task: translate a natural-language mathematical statement into …
Figure 2: Answer injection eliminates the need for the model to solve the underlying math …
Figure 3: Training overview. Phase 1 produces one SFT checkpoint. Phase 2 applies GRPO to …
Figure 4: Dual-stage reward function. Compilable outputs are scored by Gemini Flash 3 for …
Figure 5: Semantic judge example. The model output compiles and appears plausible, but omits …
Figure 6: Compile and semantic pass@k as a function of SFT-GRPO data overlap. Top: pass@1. Bottom: pass@8. Dashed lines indicate SFT baselines.
Figure 7: Compile pass@k as a function of the number of attempts (k) on Gaokao-Formal (left) and PutnamBench (right). Semantic pass@k values at k=1 and k=8 are reported in …
Figure 8: Smoothed training dynamics (EMA) across GRPO variants. Higher training reward …
Figure 9: Raw (unsmoothed) training dynamics across GRPO variants. Entropy, KL loss, and …
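Figure 4's dual-stage reward gates semantic scoring behind compilation. A minimal sketch of that gating, where `compile_lean` and `judge_semantics` stand in for the paper's Lean 4 compiler check and LLM-judge call; the exact reward scale is illustrative:

```python
from typing import Callable

def dual_stage_reward(
    formalization: str,
    compile_lean: Callable[[str], bool],      # stand-in for the Lean 4 compiler check
    judge_semantics: Callable[[str], float],  # stand-in for the LLM judge, score in [0, 1]
) -> float:
    """Gate the semantic reward behind compilation, as Figure 4 describes.

    Non-compiling outputs earn zero; compiling outputs earn the judge's
    semantic score."""
    if not compile_lean(formalization):
        return 0.0
    return judge_semantics(formalization)
```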
Original abstract

Supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) is a common post-training recipe. We conduct a controlled ablation over SFT-GRPO data overlap, evaluating Qwen3-8B (thinking disabled) post-trained for Lean 4 autoformalization under six conditions that differ solely in training recipe: a base model, SFT-only, GRPO-only, and three SFT+GRPO configurations where 0 percent, 30 percent, or 100 percent of the GRPO prompts coincide with the SFT corpus. Keeping SFT and GRPO data disjoint consistently outperforms full overlap at zero additional compute cost. Evaluating on Gaokao-Formal and PutnamBench under both compile pass@k and semantic pass@k assessed by an LLM judge, we find that lower overlap is monotonically associated with higher compilation and semantic accuracy. At 0 percent overlap, GRPO yields a 10.4 percentage point semantic gain over SFT alone on Gaokao, while at 100 percent overlap both metrics remain flat, rendering the GRPO stage effectively redundant. We further show that dual-metric evaluation reveals compile-semantic gaps exceeding 30 percentage points for the highest compiling models, a disparity invisible under compile-only benchmarking. To our knowledge, this is the first controlled investigation of SFT-GRPO data overlap as a post-training hyperparameter, demonstrating how model behavior varies based on the degree of data sharing between training stages.
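For readers new to the metric, this is the standard unbiased pass@k estimator from the code-generation evaluation literature, applied per problem over n sampled formalizations; that the paper uses exactly this estimator is an assumption. A sample counts as a success under compilation for compile pass@k and under judge approval for semantic pass@k.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k draws
    (without replacement) from n samples lands on one of c successes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 samples for one problem, 3 of which compile.
print(pass_at_k(n=8, c=3, k=1))  # 0.375
print(pass_at_k(n=8, c=3, k=8))  # 1.0
```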

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper conducts a controlled ablation on SFT-GRPO data overlap for post-training Qwen3-8B (thinking disabled) on Lean 4 autoformalization. It evaluates six conditions (base, SFT-only, GRPO-only, and SFT+GRPO at 0/30/100% overlap) on Gaokao-Formal and PutnamBench, claiming that lower overlap monotonically improves compile pass@k and semantic pass@k (via LLM judge), with GRPO yielding a 10.4pp semantic gain over SFT alone only at 0% overlap while full overlap renders GRPO redundant at zero extra compute cost. It also notes compile-semantic gaps exceeding 30pp.

Significance. If the results hold after validation, the work establishes SFT-GRPO overlap as a zero-cost post-training hyperparameter that can improve autoformalization performance. The controlled ablation design and emphasis on dual-metric evaluation (revealing limitations of compile-only benchmarks) are strengths; this could inform more efficient training recipes in LLM post-training for formal reasoning without additional compute.

major comments (3)
  1. [Abstract] Abstract and evaluation methodology: semantic accuracy is assessed by an LLM judge, yet no details are provided on the judge prompt, few-shot examples, calibration on ground-truth subsets, or agreement with human Lean experts. This is load-bearing for the headline 10.4pp semantic gain and the monotonic overlap effect, since compile pass@k is objective but the paper emphasizes the semantic metric and the >30pp compile-semantic gap.
  2. [Results] Results section: the reported performance differences and monotonic trends lack error bars, statistical significance tests, or details on the number of evaluation runs, undermining confidence that the 10.4pp gain and overlap effects are reliable rather than noise or uncontrolled variables.
  3. [Methods] Methods and experimental design: while the ablation controls for overlap percentage, the manuscript does not report checks for other potential confounders in training dynamics or evaluation that could explain the performance differences, as required to support the causal claim that overlap (rather than other factors) drives the gains.
minor comments (2)
  1. [Methods] Clarify the exact definition and construction of the 0/30/100% overlap conditions, including how prompts are sampled to ensure the ablation is truly isolated to overlap.
  2. [Abstract] The abstract mentions 'thinking disabled' for Qwen3-8B; add a brief note on whether this is a standard setting or affects reproducibility of the post-training recipe.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of evaluation rigor that we have addressed in the revision. We provide point-by-point responses below and have updated the manuscript accordingly to strengthen the presentation of our controlled ablation on SFT-GRPO data overlap.

Point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation methodology: semantic accuracy is assessed by an LLM judge, yet no details are provided on the judge prompt, few-shot examples, calibration on ground-truth subsets, or agreement with human Lean experts. This is load-bearing for the headline 10.4pp semantic gain and the monotonic overlap effect, since compile pass@k is objective but the paper emphasizes the semantic metric and the >30pp compile-semantic gap.

    Authors: We agree that full transparency on the LLM judge is essential for the semantic metric claims. In the revised manuscript, we have added a new subsection in Methods detailing the exact judge prompt, few-shot examples, calibration performance on held-out ground-truth subsets, and agreement rates with human Lean experts on a sampled subset of outputs (one way to compute such agreement is sketched after these responses). These additions directly support the reliability of the 10.4pp gain and the observed compile-semantic gaps. revision: yes

  2. Referee: [Results] Results section: the reported performance differences and monotonic trends lack error bars, statistical significance tests, or details on the number of evaluation runs, undermining confidence that the 10.4pp gain and overlap effects are reliable rather than noise or uncontrolled variables.

    Authors: We acknowledge the value of statistical reporting. The original experiments used single evaluation runs per condition due to the high cost of inference on PutnamBench and Gaokao-Formal. In the revision, we have added bootstrap-derived error bars to all tables and figures (a sketch of such a bootstrap follows these responses), explicitly stated the number of runs, and included a note on the single-run design. While we cannot retroactively perform additional independent runs without new compute, the consistent monotonic trends across two independent benchmarks provide supporting evidence for the overlap effect. revision: partial

  3. Referee: [Methods] Methods and experimental design: while the ablation controls for overlap percentage, the manuscript does not report checks for other potential confounders in training dynamics or evaluation that could explain the performance differences, as required to support the causal claim that overlap (rather than other factors) drives the gains.

    Authors: We thank the referee for this observation. To better isolate the effect of overlap, the revised Methods section now includes explicit checks for potential confounders: we report comparable training loss curves, gradient statistics, and data quality metrics across all six conditions, as well as balanced evaluation set statistics. These controls, combined with the fixed model, hyperparameters, and data sources, support our claim that overlap percentage is the primary driver of the observed performance differences. revision: yes
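On response 1, a minimal sketch of how judge-human agreement could be computed on a labeled subset. Binary faithfulness labels and Cohen's kappa are assumptions here, not the paper's documented protocol:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary labels on a sampled subset (1 = semantically faithful).
human = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
judge = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

raw = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohen_kappa_score(human, judge)  # chance-corrected agreement
print(f"raw agreement = {raw:.2f}, Cohen's kappa = {kappa:.2f}")
```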
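On response 2, a minimal sketch of the percentile bootstrap behind the error bars the authors describe; resampling problems with replacement is an assumed implementation detail:

```python
import random

def bootstrap_ci(successes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a pass rate over per-problem 0/1
    outcomes, resampling problems with replacement."""
    rng = random.Random(seed)
    n = len(successes)
    means = sorted(sum(rng.choices(successes, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(successes) / n, (lo, hi)

# Hypothetical per-problem pass@1 indicators for one condition.
outcomes = [1] * 42 + [0] * 58
mean, (lo, hi) = bootstrap_ci(outcomes)
print(f"pass@1 = {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```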

Circularity Check

0 steps flagged

No circularity: purely empirical ablation with no derivations or fitted predictions

Full rationale

The paper reports results from a controlled experimental ablation varying only the percentage overlap (0%, 30%, 100%) between SFT and GRPO training corpora for Qwen3-8B on Lean autoformalization. All claims rest on direct measurements of compile pass@k and LLM-judge semantic pass@k on Gaokao-Formal and PutnamBench. No equations, parameters, or derivations are present; the monotonic relationship and 10.4pp gain are observed outcomes, not quantities forced by construction or self-citation. The work is self-contained against external benchmarks and contains no load-bearing self-citations, ansatzes, or renamings of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical ablation study with no mathematical derivations, fitted parameters in the ledger sense, axioms, or invented entities; all claims rest on experimental measurements.

pith-pipeline@v0.9.0 · 5575 in / 1191 out tokens · 39156 ms · 2026-05-10T13:18:04.054399+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 12 canonical work pages · 5 internal anchors

  1. [1]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948, 2025

  2. [2]

    DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

    Huajian Xin, Daya Guo, Zhihong Shao, et al. DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. arXiv:2504.21801, 2025

  3. [3]

    Autoformalization with Large Language Models

    Yuhuai Wu, Albert Q. Jiang, Wenda Li, et al. Autoformalization with large language models. In NeurIPS, 2022

  4. [4]

    FormaRL: Enhancing autoformalization with no labeled data

    Yige Huang, Xiaohan Jin, Shengyuan Liang, Peilin Li, and Yuanzhi Liu. FormaRL: Enhancing autoformalization with no labeled data. In COLM, 2025. arXiv:2508.18914

  5. [5]

    HERALD: A Natural Language Annotated Lean 4 Dataset

    Guoxiong Gao et al. HERALD: A natural language annotated Lean 4 dataset. In ICLR, 2025. arXiv:2410.10878

  6. [6]

    Lean Workbook: A large-scale Lean problem set formalized from natural language math problems

    Yifan He et al. Lean Workbook: A large-scale Lean problem set formalized from natural language math problems. In NeurIPS Datasets and Benchmarks, 2024

  7. [7]

    Leanabell-Prover-Formal-Statement

    stoney0062. Leanabell-Prover-Formal-Statement. HuggingFace dataset, 2025

  8. [8]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report. arXiv:2505.09388, 2025

  9. [9]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300, 2024

  10. [10]

    The Lean 4 theorem prover and programming language

    Leonardo de Moura, Sebastian Ullrich, et al. The Lean 4 theorem prover and programming language. In CADE, 2021

  11. [11]

    Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

    Haiming Wang, Huajian Xin, Zhengying Liu, et al. Kimina-Prover: Automated theorem proving via large language models. arXiv:2504.11354, 2025

  12. [12]

    Evaluating the Performance of Large Language Models via GAOKAO Benchmark

    Xiaotian Zhang, Chunyang Li, Yi Zong, et al. Evaluating the performance of large language models via GAOKAO benchmark. arXiv:2305.12474, 2024

  13. [13]

    PutnamBench: Evaluating neural theorem-provers on the Putnam mathematical competition

    George Tsoukalas, Jasper Lee, John Jennings, et al. PutnamBench: Evaluating neural theorem-provers on the Putnam mathematical competition. In NeurIPS Datasets and Benchmarks, 2024

  14. [14]

    StepFun-Formalizer: Unlocking the autoformalization potential of LLMs through knowledge-reasoning fusion

    Yutong Wu, Di Huang, Ruosi Wan, et al. StepFun-Formalizer: Unlocking the autoformalization potential of LLMs through knowledge-reasoning fusion. arXiv:2508.04440, 2025

  15. [15]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, and Sergey Levine. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. arXiv:2501.17161, 2025

  16. [16]

    A data-centric perspective on RLHF for code generation

    Wangyue Lu et al. A data-centric perspective on RLHF for code generation. arXiv preprint, 2026

  17. [17]

    Mathesis: Towards formal theorem proving from natural language

    Xiaohan Zhang et al. Mathesis: Towards formal theorem proving from natural language. In ICLR, 2025

  18. [18]

    Autoformalization of mathematical statements: A comprehensive survey

    Jianqiao Weng, Yihan Geng, Jinan Xu, Jian Yang, and Yue Zhang. Autoformalization of mathematical statements: A comprehensive survey. arXiv:2505.23486, 2025

  19. [19]

    Goedel-Prover: A frontier model for open-source automated theorem proving

    Yong Lin et al. Goedel-Prover: A frontier model for open-source automated theorem proving. arXiv:2502.07640, 2025

  20. [20]

    NuminaMath: The largest public dataset in AI4Maths with 860K pairs of competition math problems and solutions

    Jia Li, Edward Beeching, Lewis Tunstall, et al. NuminaMath: The largest public dataset in AI4Maths with 860K pairs of competition math problems and solutions. HuggingFace repository, 2024