Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
Pith reviewed 2026-05-15 05:20 UTC · model grok-4.3
The pith
With scarce labeled data, apply sparse RL like GRPO to a strong teacher first, then distill densely to the smaller student.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sparse sequence-level reward is most useful for strong models that can explore and discover better behavior, while dense token-level teacher supervision is better suited for compressing that behavior into a smaller student. This suggests a simple allocation rule: use scarce labeled data upstream to improve the strongest available teacher, then transfer the improved behavior downstream through dense supervision. GRPO-style sparse RL and OPD-style distillation are not competing methods but two reward-density regimes used at different stages.
What carries the argument
Sparse-to-dense reward allocation that assigns sequence-level RL to the teacher for discovery and token-level distillation for student compression.
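Read operationally, the allocation rule is a three-stage pipeline. The sketch below is pseudocode-level Python: the stage functions are placeholders named for the role each plays, not APIs from the paper or any library, and the model names and data sizes are illustrative.

```python
# Pseudocode-level sketch of the sparse-to-dense allocation rule.
# The stage functions are placeholders that only describe each step.

def train_grpo(model, labeled_problems):
    # Sparse, sequence-level reward from a verifier; group-relative advantages.
    print(f"GRPO on {model} with {len(labeled_problems)} verified problems")
    return model

def distill_forward_kl(student, teacher, prompts):
    # Bridge warmup: forward KL on teacher-generated rollouts (dense per token).
    print(f"forward-KL warmup: {teacher} -> {student} on {len(prompts)} prompts")
    return student

def distill_opd(student, teacher, prompts):
    # OPD: student samples, teacher scores every token (on-policy, dense per token).
    print(f"on-policy distillation: {teacher} -> {student} on {len(prompts)} prompts")
    return student

def post_train(teacher, student, labeled_problems, prompts):
    teacher = train_grpo(teacher, labeled_problems)          # Stage 1: teacher discovery
    student = distill_forward_kl(student, teacher, prompts)  # Stage 2a: bridge warmup
    student = distill_opd(student, teacher, prompts)         # Stage 2b: bridge OPD
    student = train_grpo(student, labeled_problems)          # Stage 3: student RL, now warm
    return student

post_train("Qwen3-8B (teacher)", "Qwen3-1.7B (student)",
           labeled_problems=["..."] * 1000, prompts=["..."] * 5000)
```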
If this is right
- Distilling from an RL-improved 8B teacher outperforms applying GRPO directly to the 1.7B student with the same data.
- Distilling from the same teacher before RL gives weaker results than after RL.
- A forward-KL warmup on teacher rollouts followed by OPD on student rollouts performs best on MATH (both bridge losses are sketched in code after this list).
- After the bridge, student-side GRPO raises MATH accuracy from 75.4% to 78.5% and outperforms a matched replay control.
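Assuming the bridge follows the common convention (full-distribution forward KL on teacher-generated sequences for the warmup, per-token reverse KL on student-generated sequences for OPD), the two dense signals differ only in which distribution weights the log-ratio. A minimal sketch; the library choice and tensor shapes are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def token_kl(student_logits, teacher_logits, direction="reverse"):
    """Per-token KL between student and teacher next-token distributions.

    Logits have shape [seq_len, vocab], one row per position of some rollout
    (teacher-generated for the warmup stage, student-generated for OPD).
    direction="forward" -> KL(teacher || student), the warmup signal.
    direction="reverse" -> KL(student || teacher), the usual OPD objective.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    if direction == "forward":
        p, log_ratio = log_p_t.exp(), log_p_t - log_p_s
    else:
        p, log_ratio = log_p_s.exp(), log_p_s - log_p_t
    return (p * log_ratio).sum(dim=-1)  # [seq_len]: one dense signal per token

# Toy usage with random logits standing in for real model outputs.
s = torch.randn(8, 32)  # 8 tokens, vocabulary of 32 (illustrative sizes)
t = torch.randn(8, 32)
warmup_loss = token_kl(s, t, "forward").mean()  # computed on teacher rollouts
opd_loss = token_kl(s, t, "reverse").mean()     # computed on student rollouts
```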
Where Pith is reading between the lines
- The staging could apply to any task where exploration is harder for smaller models than for larger ones.
- Training pipelines might routinely keep a larger model available only for the discovery phase rather than for deployment.
- The same principle may suggest testing intermediate-size models as stepping stones between teacher and student.
Load-bearing premise
A stronger teacher model can discover better behaviors through sparse RL that a weaker student cannot discover when given the same data.
What would settle it
If direct GRPO on the 1.7B student with the labeled data produces higher MATH accuracy than distillation from an RL-improved 8B teacher, the allocation rule would not hold.
Original abstract
When labeled verifiable training data is scarce, each checked example should be used where it has the most value. A common approach is to train the deployment student model directly with sparse RL methods such as GRPO. We argue that this is often inefficient. Sparse sequence-level reward is most useful for strong models that can explore and discover better behavior, while dense token-level teacher supervision is better suited for compressing that behavior into a smaller student. This suggests a simple allocation rule: use scarce labeled data upstream to improve the strongest available teacher, then transfer the improved behavior downstream through dense supervision. In this view, GRPO-style sparse RL and OPD-style distillation are not competing methods, but two reward-density regimes used at different stages. We evaluate this rule on verifiable math tasks with Qwen3 and Llama models. For a fixed Qwen3-1.7B deployment student, distilling from an RL-improved 8B teacher outperforms applying GRPO directly to the student with the same labeled data. In contrast, distilling from the same teacher before RL gives weaker results. The transfer bridge is also important: a forward-KL warmup on teacher rollouts followed by OPD on student rollouts performs best on MATH before any later student-side sparse RL, and gives the strongest pre-Stage 3 AIME results for the canonical 8B and 14B teachers. Finally, the bridge makes later student-side RL more effective. GRPO is weak when applied to a cold student, but after the bridge it raises MATH accuracy from 75.4% to 78.5%, outperforming a matched replay control by 2.8 points. Overall, the lesson is to avoid spending scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the student has been bridged.
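To make the "sparse sequence-level reward" concrete: GRPO scores whole rollouts with a verifier, normalizes the scores within a group drawn for the same problem, and applies that single scalar to every token of its rollout. A minimal sketch of the group-relative advantage follows, with an illustrative group size; normalization details vary across implementations.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for one problem, GRPO-style.

    rewards: [G] verifier scores (e.g. 0/1 correctness) for G rollouts of the
    same prompt. Each rollout gets one scalar advantage, normalized within the
    group, which is then applied uniformly to all of its tokens; that uniform
    broadcast is what makes the signal sparse at the token level.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy group of 8 rollouts for one math problem, 3 verified correct.
r = torch.tensor([1., 0., 0., 1., 0., 0., 1., 0.])
print(grpo_advantages(r))  # correct rollouts positive, incorrect negative
```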
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study on reward allocation in language model post-training for verifiable math tasks. It argues that sparse sequence-level rewards like GRPO are best used to improve a strong teacher model, while dense token-level supervision via on-policy distillation (OPD) is better for transferring to a smaller student model. The key claim is that, for a Qwen3-1.7B student, distilling from an RL-improved 8B teacher outperforms direct application of GRPO to the student using the same labeled data. The paper also introduces a 'transfer bridge' consisting of forward-KL warmup on teacher rollouts followed by OPD on student rollouts, which improves performance and makes subsequent student-side RL more effective, as evidenced by MATH accuracy increasing from 75.4% to 78.5%.
Significance. If the comparisons hold under matched verifier budgets, the work offers a practical heuristic for allocating scarce labeled data and verification resources: apply sparse RL upstream for exploration by strong teachers and dense distillation downstream for student compression. Concrete gains on MATH and AIME with Qwen3 and Llama models, plus the observation that the bridge enhances later student RL, provide actionable guidance for post-training pipelines. The staged sparse-to-dense framing distinguishes the contribution from pure GRPO or pure distillation baselines.
major comments (2)
- Abstract: The central claim states that distilling from an RL-improved 8B teacher outperforms GRPO applied directly to the 1.7B student 'with the same labeled data.' GRPO relies on multiple rollouts per problem to estimate group-relative advantages, so the total number of verifier calls is not necessarily equalized between the teacher RL phase and the student GRPO baseline. Without explicit matching of rollout count or total reward evaluations, the observed gap may reflect unequal effective data volume rather than the proposed sparse-to-dense allocation rule. A toy accounting of verifier calls after this list illustrates the possible mismatch.
- Abstract: The reported MATH accuracy lift from 75.4% to 78.5% after the forward-KL + OPD bridge plus student-side GRPO (outperforming a matched replay control by 2.8 points) is presented without error bars, number of independent runs, or statistical significance tests. This makes it difficult to assess whether the gain is robust or sensitive to random seeds and hyperparameter choices.
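A toy accounting of the budget concern above. Every quantity here (problem count, group size, pass counts) is a made-up assumption for illustration; the point is only that the two arms consume verifier calls at different rates unless explicitly matched.

```python
# Illustrative verifier-call accounting; no number below comes from the paper.
n_problems = 1_000   # labeled, verifiable training problems
group_size = 8       # GRPO rollouts per problem per optimization pass

# Arm A: direct student GRPO. Every rollout of every pass hits the verifier.
student_passes = 4
student_grpo_calls = n_problems * group_size * student_passes

# Arm B: teacher-side GRPO, then distillation. Only the teacher stage uses
# the verifier; the forward-KL warmup and OPD stages use teacher log-probs.
teacher_passes = 6
teacher_grpo_calls = n_problems * group_size * teacher_passes
distillation_calls = 0

print(student_grpo_calls, teacher_grpo_calls + distillation_calls)
# 32000 vs 48000 under these assumptions: the arms see equal reward
# evaluations only if group size and pass counts are explicitly matched.
```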
minor comments (1)
- Abstract: Acronyms GRPO and OPD are used without initial expansion or brief definition on first appearance, which may hinder readers unfamiliar with the specific variants of group-relative policy optimization and on-policy distillation.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our work. We address each major comment below and will make the necessary revisions to strengthen the manuscript.
Point-by-point responses
- Referee: Abstract: The central claim states that distilling from an RL-improved 8B teacher outperforms GRPO applied directly to the 1.7B student 'with the same labeled data.' GRPO relies on multiple rollouts per problem to estimate group-relative advantages, so the total number of verifier calls is not necessarily equalized between the teacher RL phase and the student GRPO baseline. Without explicit matching of rollout count or total reward evaluations, the observed gap may reflect unequal effective data volume rather than the proposed sparse-to-dense allocation rule.
Authors: We appreciate the referee pointing out the potential mismatch in total verifier calls. The phrase 'with the same labeled data' in the abstract refers to using the same set of training problems (i.e., the same number of labeled examples). However, as noted, GRPO requires multiple rollouts per problem, leading to more verifier evaluations in the direct student GRPO setting. To address this, we will revise the manuscript to explicitly state the total number of reward evaluations used in each experiment and add a discussion on how the sparse-to-dense principle holds under this allocation. If space permits, we will include a controlled comparison with matched total verifier budgets. revision: yes
- Referee: Abstract: The reported MATH accuracy lift from 75.4% to 78.5% after the forward-KL + OPD bridge plus student-side GRPO (outperforming a matched replay control by 2.8 points) is presented without error bars, number of independent runs, or statistical significance tests. This makes it difficult to assess whether the gain is robust or sensitive to random seeds and hyperparameter choices.
Authors: We agree that the lack of error bars and statistical analysis limits the assessment of robustness. The reported numbers are from single runs, which is common in large-scale LM training due to high computational costs, but we acknowledge this as a limitation. In the revised manuscript, we will report results from multiple independent runs with error bars and include statistical significance tests for the key improvements, such as the 2.8-point gain over the replay control. revision: yes
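One standard way to attach uncertainty to a single-run accuracy gap of this kind is a paired bootstrap over per-problem correctness. The sketch below uses synthetic, independently drawn outcomes purely to show the mechanics; actual use would plug in the real paired per-problem results, and the problem count and accuracies are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def paired_bootstrap_pvalue(correct_a, correct_b, n_resamples=2000):
    """One-sided paired bootstrap over per-problem correctness.

    correct_a / correct_b: boolean arrays over the same evaluation problems,
    e.g. the bridged-then-GRPO student vs. the matched replay control.
    Returns the fraction of resamples in which system A fails to beat system B.
    """
    diff = correct_a.astype(float) - correct_b.astype(float)
    n = len(diff)
    gaps = np.empty(n_resamples)
    for i in range(n_resamples):
        gaps[i] = diff[rng.integers(0, n, size=n)].mean()  # resample problems with replacement
    return float((gaps <= 0).mean())

# Synthetic stand-in outcomes (independent draws, illustrative accuracies only);
# a real test would use the actual paired per-problem results.
n = 5000
a = rng.random(n) < 0.785
b = rng.random(n) < 0.757
print(paired_bootstrap_pvalue(a, b))
```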
Circularity Check
No circularity: purely empirical comparison of training strategies
full rationale
The paper advances an empirical allocation rule for sparse vs. dense rewards based on model strength, then validates it through direct experimental comparisons (RL-improved teacher distillation vs. direct GRPO on student) on Qwen3/Llama models with MATH/AIME. No derivations, equations, fitted parameters presented as predictions, or self-citation chains exist. The central claim reduces to observed accuracy differences under stated conditions, with no self-definitional or load-bearing reductions to inputs. This is a standard non-circular empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Sparse sequence-level reward is most useful for strong models that can explore and discover better behavior.