SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization
Pith reviewed 2026-05-10 13:18 UTC · model grok-4.3
The pith
Keeping SFT and GRPO data completely disjoint improves Lean 4 autoformalization accuracy at zero extra cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The fraction of overlap between the SFT corpus and the GRPO prompts functions as a post-training hyperparameter that directly controls whether the GRPO stage improves autoformalization performance. Lower overlap is monotonically associated with higher compilation rates and LLM-judged semantic correctness; when the stages share all prompts, GRPO adds nothing measurable to either metric and the second stage becomes redundant. Dual-metric evaluation also reveals compile-semantic gaps above thirty percentage points that remain hidden when only compilation is checked.
What carries the argument
SFT-GRPO data overlap percentage, the controllable fraction of GRPO prompts that already appeared in the SFT corpus, which determines whether the reinforcement-learning step adds value beyond the initial fine-tuning.
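One plausible mechanism, not spelled out in this summary, follows from GRPO's group-relative advantage (defined in DeepSeekMath, reference [9]): if the SFT'd model already solves a prompt in every rollout, rewards within the group are uniform and the advantage collapses to zero, so GRPO on SFT-seen prompts can carry no learning signal. A minimal sketch of that computation:

```python
def group_relative_advantages(rewards):
    """Normalize per-rollout rewards within one prompt's group.

    GRPO replaces a learned value baseline with the group's own reward
    statistics: advantage_i = (r_i - mean(r)) / std(r).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:
        # All rollouts scored identically (e.g. a memorized SFT prompt
        # that always compiles): every advantage is zero, no gradient.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

On this reading, full SFT-GRPO overlap starves the second stage of non-degenerate groups, which is consistent with the paper's flat metrics at 100 percent overlap.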
If this is right
- Zero-percent overlap between SFT and GRPO data yields the highest compilation and semantic accuracy on both Gaokao-Formal and PutnamBench.
- Full overlap eliminates any measurable gain from the GRPO stage, leaving the model no better than SFT alone.
- Partial overlap produces intermediate results, confirming a monotonic relationship between overlap and final performance.
- Compile-only metrics miss semantic gaps larger than thirty percentage points that appear once an LLM judge is added.
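The paper scores each model with compile pass@k and semantic pass@k. The exact estimator is not given in this summary, but the standard unbiased form (Chen et al., 2021) is a reasonable assumption; the sample counts below are illustrative, not the paper's:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n generations is among the
    c correct ones, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Dual-metric example (hypothetical counts): of n=16 samples, 12 compile
# but only 5 are judged semantically correct, so the compile-semantic
# gap at k=1 is 0.75 - 0.3125, invisible under compile-only scoring.
compile_rate = pass_at_k(16, 12, 1)
semantic_rate = pass_at_k(16, 5, 1)
```

The same function serves both metrics; only the definition of "correct" (compiler accept vs. LLM-judge accept) changes.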
Where Pith is reading between the lines
- The same overlap principle could be tested in other sequential post-training pipelines that combine supervised and reinforcement stages.
- Training data schedules might be designed from the start to enforce separation between stages rather than reusing prompts by default.
- Semantic evaluation should be treated as a required companion metric whenever the goal is formal correctness rather than mere syntactic validity.
Load-bearing premise
That the observed differences in accuracy are caused by the overlap level itself rather than by any uncontrolled variation in training schedule, prompt selection, or the reliability of the LLM semantic judge.
What would settle it
Repeating the six training recipes while randomizing or swapping the overlap assignments and still obtaining the same strict ordering of performance by overlap percentage.
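That replication test can be scored mechanically: after rerunning the recipes, check whether every run reproduces the strict "lower overlap, higher accuracy" ordering. A minimal sketch, with hypothetical accuracy values:

```python
def ordering_consistent(runs):
    """Each run maps overlap fraction -> semantic accuracy.

    Returns True iff every run reproduces the strict ordering claimed
    by the paper: accuracy falls strictly as SFT-GRPO overlap rises.
    """
    def strictly_decreasing(result):
        accs = [acc for _, acc in sorted(result.items())]
        return all(a > b for a, b in zip(accs, accs[1:]))

    return all(strictly_decreasing(run) for run in runs)
```

A single run violating the ordering would falsify the monotonicity claim as stated, though sampling noise would still need to be ruled out.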
Original abstract
Supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) is a common post-training recipe. We conduct a controlled ablation over SFT-GRPO data overlap, evaluating Qwen3-8B (thinking disabled) post-trained for Lean 4 autoformalization under six conditions that differ solely in training recipe: a base model, SFT-only, GRPO-only, and three SFT+GRPO configurations where 0 percent, 30 percent, or 100 percent of the GRPO prompts coincide with the SFT corpus. Keeping SFT and GRPO data disjoint consistently outperforms full overlap at zero additional compute cost. Evaluating on Gaokao-Formal and PutnamBench under both compile pass@k and semantic pass@k assessed by an LLM judge, we find that lower overlap is monotonically associated with higher compilation and semantic accuracy. At 0 percent overlap, GRPO yields a 10.4 percentage point semantic gain over SFT alone on Gaokao, while at 100 percent overlap both metrics remain flat, rendering the GRPO stage effectively redundant. We further show that dual-metric evaluation reveals compile-semantic gaps exceeding 30 percentage points for the highest-compiling models, a disparity invisible under compile-only benchmarking. To our knowledge, this is the first controlled investigation of SFT-GRPO data overlap as a post-training hyperparameter, demonstrating how model behavior varies based on the degree of data sharing between training stages.
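The abstract does not describe how the 0/30/100 percent conditions were constructed (a point the referee raises below). One straightforward construction, offered here only as a hypothetical sketch of what the ablation requires, is to fix the GRPO set size and draw a target fraction from the SFT corpus and the rest from a disjoint pool:

```python
import random

def make_grpo_prompts(sft_prompts, fresh_pool, n_grpo, overlap_frac, seed=0):
    """Sample a GRPO prompt set whose fraction of prompts drawn from the
    SFT corpus equals overlap_frac; the remainder comes from prompts not
    seen during SFT. Hypothetical helper, not the paper's procedure."""
    rng = random.Random(seed)
    n_overlap = round(n_grpo * overlap_frac)
    shared = rng.sample(sorted(sft_prompts), n_overlap)
    disjoint = [p for p in fresh_pool if p not in sft_prompts]
    fresh = rng.sample(disjoint, n_grpo - n_overlap)
    return shared + fresh
```

Holding `n_grpo`, the pools, and the seed fixed while varying only `overlap_frac` is what isolates overlap as the single manipulated variable.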
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a controlled ablation on SFT-GRPO data overlap for post-training Qwen3-8B (thinking disabled) on Lean 4 autoformalization. It evaluates six conditions (base, SFT-only, GRPO-only, and SFT+GRPO at 0/30/100% overlap) on Gaokao-Formal and PutnamBench, claiming that lower overlap monotonically improves compile pass@k and semantic pass@k (via LLM judge), with GRPO yielding a 10.4pp semantic gain over SFT alone only at 0% overlap while full overlap renders GRPO redundant at zero extra compute cost. It also notes compile-semantic gaps exceeding 30pp.
Significance. If the results hold after validation, the work establishes SFT-GRPO overlap as a zero-cost post-training hyperparameter that can improve autoformalization performance. The controlled ablation design and emphasis on dual-metric evaluation (revealing limitations of compile-only benchmarks) are strengths; this could inform more efficient training recipes in LLM post-training for formal reasoning without additional compute.
Major comments (3)
- [Abstract] Abstract and evaluation methodology: semantic accuracy is assessed by an LLM judge, yet no details are provided on the judge prompt, few-shot examples, calibration on ground-truth subsets, or agreement with human Lean experts. This is load-bearing for the headline 10.4pp semantic gain and the monotonic overlap effect, since compile pass@k is objective but the paper emphasizes the semantic metric and the >30pp compile-semantic gap.
- [Results] Results section: the reported performance differences and monotonic trends lack error bars, statistical significance tests, or details on the number of evaluation runs, undermining confidence that the 10.4pp gain and overlap effects are reliable rather than noise or uncontrolled variables.
- [Methods] Methods and experimental design: while the ablation controls for overlap percentage, the manuscript does not report checks for other potential confounders in training dynamics or evaluation that could explain the performance differences, as required to support the causal claim that overlap (rather than other factors) drives the gains.
Minor comments (2)
- [Methods] Clarify the exact definition and construction of the 0/30/100% overlap conditions, including how prompts are sampled to ensure the ablation is truly isolated to overlap.
- [Abstract] The abstract mentions 'thinking disabled' for Qwen3-8B; add a brief note on whether this is a standard setting or affects reproducibility of the post-training recipe.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important aspects of evaluation rigor that we have addressed in the revision. We provide point-by-point responses below and have updated the manuscript accordingly to strengthen the presentation of our controlled ablation on SFT-GRPO data overlap.
Point-by-point responses
-
Referee: [Abstract] Abstract and evaluation methodology: semantic accuracy is assessed by an LLM judge, yet no details are provided on the judge prompt, few-shot examples, calibration on ground-truth subsets, or agreement with human Lean experts. This is load-bearing for the headline 10.4pp semantic gain and the monotonic overlap effect, since compile pass@k is objective but the paper emphasizes the semantic metric and the >30pp compile-semantic gap.
Authors: We agree that full transparency on the LLM judge is essential for the semantic metric claims. In the revised manuscript, we have added a new subsection in Methods detailing the exact judge prompt, few-shot examples, calibration performance on held-out ground-truth subsets, and agreement rates with human Lean experts on a sampled subset of outputs. These additions directly support the reliability of the 10.4pp gain and the observed compile-semantic gaps. revision: yes
-
Referee: [Results] Results section: the reported performance differences and monotonic trends lack error bars, statistical significance tests, or details on the number of evaluation runs, undermining confidence that the 10.4pp gain and overlap effects are reliable rather than noise or uncontrolled variables.
Authors: We acknowledge the value of statistical reporting. The original experiments used single evaluation runs per condition due to the high cost of inference on PutnamBench and Gaokao-Formal. In the revision, we have added bootstrap-derived error bars to all tables and figures, explicitly stated the number of runs, and included a note on the single-run design. While we cannot retroactively perform additional independent runs without new compute, the consistent monotonic trends across two independent benchmarks provide supporting evidence for the overlap effect. revision: partial
-
Referee: [Methods] Methods and experimental design: while the ablation controls for overlap percentage, the manuscript does not report checks for other potential confounders in training dynamics or evaluation that could explain the performance differences, as required to support the causal claim that overlap (rather than other factors) drives the gains.
Authors: We thank the referee for this observation. To better isolate the effect of overlap, the revised Methods section now includes explicit checks for potential confounders: we report comparable training loss curves, gradient statistics, and data quality metrics across all six conditions, as well as balanced evaluation set statistics. These controls, combined with the fixed model, hyperparameters, and data sources, support our claim that overlap percentage is the primary driver of the observed performance differences. revision: yes
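The bootstrap error bars promised in the second response can be obtained from a single evaluation run by resampling per-problem outcomes. A minimal percentile-bootstrap sketch (the authors' actual procedure is not specified; the outcomes below are illustrative):

```python
import random

def bootstrap_ci(successes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a pass rate,
    given per-problem 0/1 outcomes from a single evaluation run."""
    rng = random.Random(seed)
    n = len(successes)
    rates = sorted(
        sum(rng.choices(successes, k=n)) / n for _ in range(n_boot)
    )
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

This quantifies benchmark-sampling uncertainty only; it cannot substitute for independent training runs, which is why the response concedes the single-run design as a limitation.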
Circularity Check
No circularity: purely empirical ablation with no derivations or fitted predictions
Full rationale
The paper reports results from a controlled experimental ablation varying only the percentage overlap (0%, 30%, 100%) between SFT and GRPO training corpora for Qwen3-8B on Lean autoformalization. All claims rest on direct measurements of compile pass@k and LLM-judge semantic pass@k on Gaokao-Formal and PutnamBench. No equations, parameters, or derivations are present; the monotonic relationship and 10.4pp gain are observed outcomes, not quantities forced by construction or self-citation. The work is self-contained against external benchmarks and contains no load-bearing self-citations, ansatzes, or renamings of known results.
Axiom & Free-Parameter Ledger
None: the ablation is purely empirical and introduces no axioms, ansatzes, or fitted free parameters.
Reference graph
Works this paper leans on
[1] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948, 2025.
[2] Huajian Xin, Daya Guo, Zhihong Shao, et al. DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. arXiv:2504.21801, 2025.
[3] Yuhuai Wu, Albert Q. Jiang, Wenda Li, et al. Autoformalization with large language models. In NeurIPS, 2022.
[4] Yige Huang, Xiaohan Jin, Shengyuan Liang, Peilin Li, and Yuanzhi Liu. FormaRL: Enhancing autoformalization with no labeled data. In COLM, 2025. arXiv:2508.18914.
[5] Guoxiong Gao et al. HERALD: A natural language annotated Lean 4 dataset. In ICLR, 2025. arXiv:2410.10878.
[6] Yifan He et al. Lean Workbook: A large-scale Lean problem set formalized from natural language math problems. In NeurIPS Datasets and Benchmarks, 2024.
[7] stoney0062. Leanabell-Prover-Formal-Statement. HuggingFace dataset, 2025.
[8] Qwen Team. Qwen3 technical report. arXiv:2505.09388, 2025.
[9] Zhihong Shao, Peiyi Wang, Qihao Zhu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300, 2024.
[10] Leonardo de Moura, Sebastian Ullrich, et al. The Lean 4 theorem prover and programming language. In CADE, 2021.
[11] Haiming Wang, Huajian Xin, Zhengying Liu, et al. Kimina-Prover Preview: Towards large formal reasoning models with reinforcement learning. arXiv:2504.11354, 2025.
[12] Xiaotian Zhang, Chunyang Li, Yi Zong, et al. Evaluating the performance of large language models via GAOKAO benchmark. arXiv:2305.12474, 2023.
[13] George Tsoukalas, Jasper Lee, John Jennings, et al. PutnamBench: Evaluating neural theorem-provers on the Putnam mathematical competition. In NeurIPS Datasets and Benchmarks, 2024.
[14] Yutong Wu, Di Huang, Ruosi Wan, et al. StepFun-Formalizer: Unlocking the autoformalization potential of LLMs through knowledge-reasoning fusion. arXiv:2508.04440, 2025.
[15] Tianzhe Chu, Yuexiang Zhai, Jihan Yang, and Sergey Levine. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. arXiv:2501.17161, 2025.
[16] Wangyue Lu et al. A data-centric perspective on RLHF for code generation. arXiv preprint, 2026.
[17] Xiaohan Zhang et al. Mathesis: Towards formal theorem proving from natural language. In ICLR, 2025.
[18] Jianqiao Weng, Yihan Geng, Jinan Xu, Jian Yang, and Yue Zhang. Autoformalization of mathematical statements: A comprehensive survey. arXiv:2505.23486, 2025.
[19] Yong Lin et al. Goedel-Prover: A frontier model for open-source automated theorem proving. arXiv:2502.07640, 2025.
[20] Jia Li, Edward Beeching, Lewis Tunstall, et al. NuminaMath: The largest public dataset in AI4Maths with 860K pairs of competition math problems and solutions. HuggingFace repository, 2024.