pith. machine review for the scientific record.

arxiv: 2604.13515 · v1 · submitted 2026-04-15 · 💻 cs.LG · cs.AI · cs.LO


SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization


Pith reviewed 2026-05-10 13:18 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.LO
keywords autoformalization · SFT · GRPO · data overlap · post-training · Lean 4 · semantic evaluation · formal mathematics

The pith

Keeping SFT and GRPO data completely disjoint improves Lean 4 autoformalization accuracy at zero extra cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the amount of data shared between supervised fine-tuning and a subsequent Group Relative Policy Optimization stage matters when training a language model to translate natural-language math problems into formal Lean 4 statements. It runs controlled experiments on Qwen3-8B at zero, thirty, and one hundred percent overlap, plus SFT-only and GRPO-only baselines, and measures both whether the output compiles and whether it matches the intended meaning. Lower overlap produces steadily higher success rates on Gaokao-Formal and PutnamBench, with zero overlap adding more than ten percentage points of semantic accuracy over SFT alone, while full overlap leaves GRPO with no measurable effect. The result identifies data overlap as a free hyperparameter that can be tuned to make the second training stage useful rather than redundant.
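For orientation, GRPO's defining step is to baseline each sampled output against its own group of samples for the same prompt. Below is the standard group-relative advantage from the DeepSeekMath line of work, not necessarily the paper's exact variant:

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantages: normalize each reward against
    the mean and std of its own sampled group (one prompt, G outputs)."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 sampled formalizations for one prompt, dual-stage rewards.
print(group_relative_advantages([0.0, 0.0, 0.7, 1.0]))
```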

Core claim

The percentage of overlap between the SFT corpus and the GRPO prompts functions as a post-training hyperparameter that directly controls whether the GRPO stage improves autoformalization performance. When the stages use disjoint data, GRPO raises both compilation rates and LLM-judged semantic correctness in a monotonic fashion; when the stages share all prompts, GRPO adds nothing measurable to either metric and the second stage becomes redundant. Dual-metric evaluation also shows compile-semantic gaps above thirty points that remain hidden when only compilation is checked.

What carries the argument

SFT-GRPO data overlap percentage, the controllable fraction of GRPO prompts that already appeared in the SFT corpus, which determines whether the reinforcement-learning step adds value beyond the initial fine-tuning.
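A minimal sketch of how such overlap conditions could be constructed, assuming a single prompt pool and uniform sampling; the paper's actual corpus construction lives in its Methods and may differ:

```python
import random

def make_overlap_split(prompts, sft_size, grpo_size, overlap, seed=0):
    """Split one prompt pool into an SFT corpus and a GRPO prompt set
    whose overlap with the SFT corpus is a controlled fraction.

    overlap=0.0 keeps the two stages disjoint; overlap=1.0 reuses
    SFT prompts for every GRPO example. Hypothetical construction."""
    rng = random.Random(seed)
    pool = list(prompts)
    rng.shuffle(pool)

    sft_corpus = pool[:sft_size]
    held_out = pool[sft_size:]

    n_shared = round(overlap * grpo_size)
    n_fresh = grpo_size - n_shared
    assert n_shared <= len(sft_corpus), "not enough SFT prompts to share"
    assert n_fresh <= len(held_out), "not enough held-out prompts"

    grpo_prompts = rng.sample(sft_corpus, n_shared) + held_out[:n_fresh]
    rng.shuffle(grpo_prompts)
    return sft_corpus, grpo_prompts

# The paper's three conditions, rebuilt from one pool of placeholder IDs.
pool = [f"problem_{i}" for i in range(10_000)]
for ov in (0.0, 0.3, 1.0):
    sft, grpo = make_overlap_split(pool, sft_size=6_000, grpo_size=2_000, overlap=ov)
    shared = len(set(sft) & set(grpo))
    print(f"overlap={ov:.0%}: {shared}/{len(grpo)} GRPO prompts seen during SFT")
```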

If this is right

  • Zero-percent overlap between SFT and GRPO data yields the highest compilation and semantic accuracy on both Gaokao-Formal and PutnamBench.
  • Full overlap eliminates any measurable gain from the GRPO stage, leaving the model no better than SFT alone.
  • Partial overlap produces intermediate results, confirming a monotonic relationship between overlap and final performance.
  • Compile-only metrics miss semantic gaps larger than thirty percentage points that appear once an LLM judge is added.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same overlap principle could be tested in other sequential post-training pipelines that combine supervised and reinforcement stages.
  • Training data schedules might be designed from the start to enforce separation between stages rather than reusing prompts by default.
  • Semantic evaluation should be treated as a required companion metric whenever the goal is formal correctness rather than mere syntactic validity.

Load-bearing premise

That the observed differences in accuracy are caused by the overlap level itself rather than by any uncontrolled variation in training schedule, prompt selection, or the reliability of the LLM semantic judge.

What would settle it

Repeating the six training recipes while randomizing or swapping the overlap assignments and still obtaining the same strict ordering of performance by overlap percentage.
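One way to score that replication, sketched with placeholder numbers rather than the paper's results: rerun each overlap condition under multiple seeds, then require both a strongly negative rank correlation between overlap and accuracy and a strict per-seed ordering.

```python
from scipy.stats import spearmanr

# Hypothetical per-seed semantic pass@1 by overlap condition;
# placeholder values, not the paper's reported numbers.
runs = {
    0.0: [0.41, 0.40, 0.42],
    0.3: [0.36, 0.37, 0.35],
    1.0: [0.31, 0.30, 0.32],
}

xs = [ov for ov, accs in runs.items() for _ in accs]
ys = [acc for accs in runs.values() for acc in accs]
rho, p = spearmanr(xs, ys)
print(f"Spearman rho = {rho:.2f}, p = {p:.4f}")  # monotone decrease -> rho near -1

# The strict ordering must also hold within every rerun.
for seed in range(3):
    assert runs[0.0][seed] > runs[0.3][seed] > runs[1.0][seed]
```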

Figures

Figures reproduced from arXiv: 2604.13515 by Andy Lyu, Kasey Zhang, Xiaole Su.

Figure 1: The autoformalization task: translate a natural-language mathematical statement into …
Figure 2: Answer injection eliminates the need for the model to solve the underlying math …
Figure 3: Training overview. Phase 1 produces one SFT checkpoint. Phase 2 applies GRPO to …
Figure 4: Dual-stage reward function. Compilable outputs are scored by Gemini Flash 3 for …
Figure 5: Semantic judge example. The model output compiles and appears plausible, but omits …
Figure 6: Compile and semantic pass@k as a function of SFT-GRPO data overlap. Top: pass@1. Bottom: pass@8. Dashed lines indicate SFT baselines.
Figure 7: Compile pass@k as a function of the number of attempts (k) on Gaokao-Formal (left) and PutnamBench (right). Semantic pass@k values at k=1 and k=8 are reported in …
Figure 8: Smoothed training dynamics (EMA) across GRPO variants. Higher training reward …
Figure 9: Raw (unsmoothed) training dynamics across GRPO variants. Entropy, KL loss, and …
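Figure 4's dual-stage reward gates semantic scoring behind compilation. A minimal sketch of that gating, where `compile_lean` and `judge_semantics` stand in for the paper's Lean 4 compiler check and LLM-judge call; the exact reward scale is illustrative:

```python
from typing import Callable

def dual_stage_reward(
    formalization: str,
    compile_lean: Callable[[str], bool],      # stand-in for the Lean 4 compiler check
    judge_semantics: Callable[[str], float],  # stand-in for the LLM judge, score in [0, 1]
) -> float:
    """Gate the semantic reward behind compilation, as Figure 4 describes.

    Non-compiling outputs earn zero; compiling outputs earn the judge's
    semantic score."""
    if not compile_lean(formalization):
        return 0.0
    return judge_semantics(formalization)
```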
Original abstract

Supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) is a common post-training recipe. We conduct a controlled ablation over SFT-GRPO data overlap, evaluating Qwen3-8B (thinking disabled) post-trained for Lean 4 autoformalization under six conditions that differ solely in training recipe: a base model, SFT-only, GRPO-only, and three SFT+GRPO configurations where 0 percent, 30 percent, or 100 percent of the GRPO prompts coincide with the SFT corpus. Keeping SFT and GRPO data disjoint consistently outperforms full overlap at zero additional compute cost. Evaluating on Gaokao-Formal and PutnamBench under both compile pass@k and semantic pass@k assessed by an LLM judge, we find that lower overlap is monotonically associated with higher compilation and semantic accuracy. At 0 percent overlap, GRPO yields a 10.4 percentage point semantic gain over SFT alone on Gaokao, while at 100 percent overlap both metrics remain flat, rendering the GRPO stage effectively redundant. We further show that dual-metric evaluation reveals compile-semantic gaps exceeding 30 percentage points for the highest compiling models, a disparity invisible under compile-only benchmarking. To our knowledge, this is the first controlled investigation of SFT-GRPO data overlap as a post-training hyperparameter, demonstrating how model behavior varies based on the degree of data sharing between training stages.
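For readers new to the metric, this is the standard unbiased pass@k estimator from the code-generation evaluation literature, applied per problem over n sampled formalizations; that the paper uses exactly this estimator is an assumption. A sample counts as a success under compilation for compile pass@k and under judge approval for semantic pass@k.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k draws
    (without replacement) from n samples lands on one of c successes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 samples for one problem, 3 of which compile.
print(pass_at_k(n=8, c=3, k=1))  # 0.375
print(pass_at_k(n=8, c=3, k=8))  # 1.0
```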

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper conducts a controlled ablation on SFT-GRPO data overlap for post-training Qwen3-8B (thinking disabled) on Lean 4 autoformalization. It evaluates six conditions (base, SFT-only, GRPO-only, and SFT+GRPO at 0/30/100% overlap) on Gaokao-Formal and PutnamBench, claiming that lower overlap monotonically improves compile pass@k and semantic pass@k (via LLM judge), with GRPO yielding a 10.4pp semantic gain over SFT alone only at 0% overlap while full overlap renders GRPO redundant at zero extra compute cost. It also notes compile-semantic gaps exceeding 30pp.

Significance. If the results hold after validation, the work establishes SFT-GRPO overlap as a zero-cost post-training hyperparameter that can improve autoformalization performance. The controlled ablation design and emphasis on dual-metric evaluation (revealing limitations of compile-only benchmarks) are strengths; this could inform more efficient training recipes in LLM post-training for formal reasoning without additional compute.

major comments (3)
  1. [Abstract] Abstract and evaluation methodology: semantic accuracy is assessed by an LLM judge, yet no details are provided on the judge prompt, few-shot examples, calibration on ground-truth subsets, or agreement with human Lean experts. This is load-bearing for the headline 10.4pp semantic gain and the monotonic overlap effect, since compile pass@k is objective but the paper emphasizes the semantic metric and the >30pp compile-semantic gap.
  2. [Results] Results section: the reported performance differences and monotonic trends lack error bars, statistical significance tests, or details on the number of evaluation runs, undermining confidence that the 10.4pp gain and overlap effects are reliable rather than noise or uncontrolled variables.
  3. [Methods] Methods and experimental design: while the ablation controls for overlap percentage, the manuscript does not report checks for other potential confounders in training dynamics or evaluation that could explain the performance differences, as required to support the causal claim that overlap (rather than other factors) drives the gains.
minor comments (2)
  1. [Methods] Clarify the exact definition and construction of the 0/30/100% overlap conditions, including how prompts are sampled to ensure the ablation is truly isolated to overlap.
  2. [Abstract] The abstract mentions 'thinking disabled' for Qwen3-8B; add a brief note on whether this is a standard setting or affects reproducibility of the post-training recipe.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of evaluation rigor that we have addressed in the revision. We provide point-by-point responses below and have updated the manuscript accordingly to strengthen the presentation of our controlled ablation on SFT-GRPO data overlap.

Point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation methodology: semantic accuracy is assessed by an LLM judge, yet no details are provided on the judge prompt, few-shot examples, calibration on ground-truth subsets, or agreement with human Lean experts. This is load-bearing for the headline 10.4pp semantic gain and the monotonic overlap effect, since compile pass@k is objective but the paper emphasizes the semantic metric and the >30pp compile-semantic gap.

    Authors: We agree that full transparency on the LLM judge is essential for the semantic metric claims. In the revised manuscript, we have added a new subsection in Methods detailing the exact judge prompt, few-shot examples, calibration performance on held-out ground-truth subsets, and agreement rates with human Lean experts on a sampled subset of outputs (one way to compute such agreement is sketched after these responses). These additions directly support the reliability of the 10.4pp gain and the observed compile-semantic gaps. revision: yes

  2. Referee: [Results] Results section: the reported performance differences and monotonic trends lack error bars, statistical significance tests, or details on the number of evaluation runs, undermining confidence that the 10.4pp gain and overlap effects are reliable rather than noise or uncontrolled variables.

    Authors: We acknowledge the value of statistical reporting. The original experiments used single evaluation runs per condition due to the high cost of inference on PutnamBench and Gaokao-Formal. In the revision, we have added bootstrap-derived error bars to all tables and figures (a sketch of such a bootstrap follows these responses), explicitly stated the number of runs, and included a note on the single-run design. While we cannot retroactively perform additional independent runs without new compute, the consistent monotonic trends across two independent benchmarks provide supporting evidence for the overlap effect. revision: partial

  3. Referee: [Methods] Methods and experimental design: while the ablation controls for overlap percentage, the manuscript does not report checks for other potential confounders in training dynamics or evaluation that could explain the performance differences, as required to support the causal claim that overlap (rather than other factors) drives the gains.

    Authors: We thank the referee for this observation. To better isolate the effect of overlap, the revised Methods section now includes explicit checks for potential confounders: we report comparable training loss curves, gradient statistics, and data quality metrics across all six conditions, as well as balanced evaluation set statistics. These controls, combined with the fixed model, hyperparameters, and data sources, support our claim that overlap percentage is the primary driver of the observed performance differences. revision: yes
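On response 1, a minimal sketch of how judge-human agreement could be computed on a labeled subset. Binary faithfulness labels and Cohen's kappa are assumptions here, not the paper's documented protocol:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary labels on a sampled subset (1 = semantically faithful).
human = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
judge = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

raw = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohen_kappa_score(human, judge)  # chance-corrected agreement
print(f"raw agreement = {raw:.2f}, Cohen's kappa = {kappa:.2f}")
```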
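On response 2, a minimal sketch of the percentile bootstrap behind the error bars the authors describe; resampling problems with replacement is an assumed implementation detail:

```python
import random

def bootstrap_ci(successes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a pass rate over per-problem 0/1
    outcomes, resampling problems with replacement."""
    rng = random.Random(seed)
    n = len(successes)
    means = sorted(sum(rng.choices(successes, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(successes) / n, (lo, hi)

# Hypothetical per-problem pass@1 indicators for one condition.
outcomes = [1] * 42 + [0] * 58
mean, (lo, hi) = bootstrap_ci(outcomes)
print(f"pass@1 = {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```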

Circularity Check

0 steps flagged

No circularity: purely empirical ablation with no derivations or fitted predictions

Full rationale

The paper reports results from a controlled experimental ablation varying only the percentage overlap (0%, 30%, 100%) between SFT and GRPO training corpora for Qwen3-8B on Lean autoformalization. All claims rest on direct measurements of compile pass@k and LLM-judge semantic pass@k on Gaokao-Formal and PutnamBench. No equations, parameters, or derivations are present; the monotonic relationship and 10.4pp gain are observed outcomes, not quantities forced by construction or self-citation. The work is self-contained against external benchmarks and contains no load-bearing self-citations, ansatzes, or renamings of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical ablation study with no mathematical derivations, fitted parameters in the ledger sense, axioms, or invented entities; all claims rest on experimental measurements.

pith-pipeline@v0.9.0 · 5575 in / 1191 out tokens · 39156 ms · 2026-05-10T13:18:04.054399+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 12 canonical work pages · 5 internal anchors

  1. [1]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948, 2025

  2. [2]

    DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

    Huajian Xin, Daya Guo, Zhihong Shao, et al. DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. arXiv:2504.21801, 2025

  3. [3]

    Autoformalization with Large Language Models

    Yuhuai Wu, Albert Q. Jiang, Wenda Li, et al. Autoformalization with large language models. In NeurIPS, 2022

  4. [4]

    FormaRL: Enhancing autoformalization with no labeled data

    Yige Huang, Xiaohan Jin, Shengyuan Liang, Peilin Li, and Yuanzhi Liu. FormaRL: Enhancing autoformalization with no labeled data. In COLM, 2025. arXiv:2508.18914

  5. [5]

    HERALD: A Natural Language Annotated Lean 4 Dataset

    Guoxiong Gao et al. HERALD: A natural language annotated Lean 4 dataset. In ICLR, 2025. arXiv:2410.10878

  6. [6]

    Lean Workbook: A large-scale Lean problem set formalized from natural language math problems

    Yifan He et al. Lean Workbook: A large-scale Lean problem set formalized from natural language math problems. In NeurIPS Datasets and Benchmarks, 2024

  7. [7]

    Leanabell-Prover-Formal-Statement

    stoney0062. Leanabell-Prover-Formal-Statement. HuggingFace dataset, 2025

  8. [8]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report. arXiv:2505.09388, 2025

  9. [9]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300, 2024

  10. [10]

    The Lean 4 theorem prover and programming language

    Leonardo de Moura, Sebastian Ullrich, et al. The Lean 4 theorem prover and programming language. In CADE, 2021

  11. [11]

    Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

    Haiming Wang, Huajian Xin, Zhengying Liu, et al. Kimina-Prover: Automated theorem proving via large language models. arXiv:2504.11354, 2025

  12. [12]

    Evaluating the Performance of Large Language Models via GAOKAO Benchmark

    Xiaotian Zhang, Chunyang Li, Yi Zong, et al. Evaluating the performance of large language models via GAOKAO benchmark. arXiv:2305.12474, 2024

  13. [13]

    PutnamBench: Evaluating neural theorem-provers on the Putnam mathematical competition

    George Tsoukalas, Jasper Lee, John Jennings, et al. PutnamBench: Evaluating neural theorem-provers on the Putnam mathematical competition. In NeurIPS Datasets and Benchmarks, 2024

  14. [14]

    StepFun-Formalizer: Unlocking the autoformalization potential of LLMs through knowledge-reasoning fusion

    Yutong Wu, Di Huang, Ruosi Wan, et al. StepFun-Formalizer: Unlocking the autoformalization potential of LLMs through knowledge-reasoning fusion. arXiv:2508.04440, 2025

  15. [15]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, and Sergey Levine. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. arXiv:2501.17161, 2025

  16. [16]

    A data-centric perspective on RLHF for code generation

    Wangyue Lu et al. A data-centric perspective on RLHF for code generation. arXiv preprint, 2026

  17. [17]

    Mathesis: Towards formal theorem proving from natural language

    Xiaohan Zhang et al. Mathesis: Towards formal theorem proving from natural language. In ICLR, 2025

  18. [18]

    Autoformalization of mathematical statements: A comprehensive survey

    Jianqiao Weng, Yihan Geng, Jinan Xu, Jian Yang, and Yue Zhang. Autoformalization of mathematical statements: A comprehensive survey. arXiv:2505.23486, 2025

  19. [19]

    Goedel-Prover: A frontier model for open-source automated theorem proving

    Yong Lin et al. Goedel-Prover: A frontier model for open-source automated theorem proving. arXiv:2502.07640, 2025

  20. [20]

    NuminaMath: The largest public dataset in AI4Maths with 860K pairs of competition math problems and solutions

    Jia Li, Edward Beeching, Lewis Tunstall, et al. NuminaMath: The largest public dataset in AI4Maths with 860K pairs of competition math problems and solutions. HuggingFace repository, 2024