Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
Pith reviewed 2026-05-15 05:20 UTC · model grok-4.3
The pith
With scarce labeled data, apply sparse RL like GRPO to a strong teacher first, then distill densely to the smaller student.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sparse sequence-level reward is most useful for strong models that can explore and discover better behavior, while dense token-level teacher supervision is better suited for compressing that behavior into a smaller student. This suggests a simple allocation rule: use scarce labeled data upstream to improve the strongest available teacher, then transfer the improved behavior downstream through dense supervision. GRPO-style sparse RL and OPD-style distillation are not competing methods but two reward-density regimes used at different stages.
What carries the argument
Sparse-to-dense reward allocation that assigns sequence-level RL to the teacher for discovery and token-level distillation for student compression.
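Read operationally, the allocation rule is a three-stage pipeline. The sketch below is pseudocode-level Python: the stage functions are placeholders named for the role each plays, not APIs from the paper or any library, and the model names and data sizes are illustrative.

```python
# Pseudocode-level sketch of the sparse-to-dense allocation rule.
# The stage functions are placeholders that only describe each step.

def train_grpo(model, labeled_problems):
    # Sparse, sequence-level reward from a verifier; group-relative advantages.
    print(f"GRPO on {model} with {len(labeled_problems)} verified problems")
    return model

def distill_forward_kl(student, teacher, prompts):
    # Bridge warmup: forward KL on teacher-generated rollouts (dense per token).
    print(f"forward-KL warmup: {teacher} -> {student} on {len(prompts)} prompts")
    return student

def distill_opd(student, teacher, prompts):
    # OPD: student samples, teacher scores every token (on-policy, dense per token).
    print(f"on-policy distillation: {teacher} -> {student} on {len(prompts)} prompts")
    return student

def post_train(teacher, student, labeled_problems, prompts):
    teacher = train_grpo(teacher, labeled_problems)          # Stage 1: teacher discovery
    student = distill_forward_kl(student, teacher, prompts)  # Stage 2a: bridge warmup
    student = distill_opd(student, teacher, prompts)         # Stage 2b: bridge OPD
    student = train_grpo(student, labeled_problems)          # Stage 3: student RL, now warm
    return student

post_train("Qwen3-8B (teacher)", "Qwen3-1.7B (student)",
           labeled_problems=["..."] * 1000, prompts=["..."] * 5000)
```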
If this is right
- Distilling from an RL-improved 8B teacher outperforms applying GRPO directly to the 1.7B student with the same data.
- Distilling from the same teacher before RL gives weaker results than after RL.
- A forward-KL warmup on teacher rollouts followed by OPD on student rollouts performs best on MATH (both bridge losses are sketched in code after this list).
- After the bridge, student-side GRPO raises MATH accuracy from 75.4% to 78.5% and outperforms a matched replay control.
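Assuming the bridge follows the common convention (full-distribution forward KL on teacher-generated sequences for the warmup, per-token reverse KL on student-generated sequences for OPD), the two dense signals differ only in which distribution weights the log-ratio. A minimal sketch; the library choice and tensor shapes are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def token_kl(student_logits, teacher_logits, direction="reverse"):
    """Per-token KL between student and teacher next-token distributions.

    Logits have shape [seq_len, vocab], one row per position of some rollout
    (teacher-generated for the warmup stage, student-generated for OPD).
    direction="forward" -> KL(teacher || student), the warmup signal.
    direction="reverse" -> KL(student || teacher), the usual OPD objective.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    if direction == "forward":
        p, log_ratio = log_p_t.exp(), log_p_t - log_p_s
    else:
        p, log_ratio = log_p_s.exp(), log_p_s - log_p_t
    return (p * log_ratio).sum(dim=-1)  # [seq_len]: one dense signal per token

# Toy usage with random logits standing in for real model outputs.
s = torch.randn(8, 32)  # 8 tokens, vocabulary of 32 (illustrative sizes)
t = torch.randn(8, 32)
warmup_loss = token_kl(s, t, "forward").mean()  # computed on teacher rollouts
opd_loss = token_kl(s, t, "reverse").mean()     # computed on student rollouts
```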
Where Pith is reading between the lines
- The staging could apply to any task where exploration is harder for smaller models than for larger ones.
- Training pipelines might routinely keep a larger model available only for the discovery phase rather than for deployment.
- The same principle may suggest testing intermediate-size models as stepping stones between teacher and student.
Load-bearing premise
A stronger teacher model can discover better behaviors through sparse RL that a weaker student cannot discover when given the same data.
What would settle it
If direct GRPO on the 1.7B student with the labeled data produces higher MATH accuracy than distillation from an RL-improved 8B teacher, the allocation rule would not hold.
Original abstract
When labeled verifiable training data is scarce, each checked example should be used where it has the most value. A common approach is to train the deployment student model directly with sparse RL methods such as GRPO. We argue that this is often inefficient. Sparse sequence-level reward is most useful for strong models that can explore and discover better behavior, while dense token-level teacher supervision is better suited for compressing that behavior into a smaller student. This suggests a simple allocation rule: use scarce labeled data upstream to improve the strongest available teacher, then transfer the improved behavior downstream through dense supervision. In this view, GRPO-style sparse RL and OPD-style distillation are not competing methods, but two reward-density regimes used at different stages. We evaluate this rule on verifiable math tasks with Qwen3 and Llama models. For a fixed Qwen3-1.7B deployment student, distilling from an RL-improved 8B teacher outperforms applying GRPO directly to the student with the same labeled data. In contrast, distilling from the same teacher before RL gives weaker results. The transfer bridge is also important: a forward-KL warmup on teacher rollouts followed by OPD on student rollouts performs best on MATH before any later student-side sparse RL, and gives the strongest pre-Stage 3 AIME results for the canonical 8B and 14B teachers. Finally, the bridge makes later student-side RL more effective. GRPO is weak when applied to a cold student, but after the bridge it raises MATH accuracy from 75.4% to 78.5%, outperforming a matched replay control by 2.8 points. Overall, the lesson is to avoid spending scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the student has been bridged.
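To make the "sparse sequence-level reward" concrete: GRPO scores whole rollouts with a verifier, normalizes the scores within a group drawn for the same problem, and applies that single scalar to every token of its rollout. A minimal sketch of the group-relative advantage follows, with an illustrative group size; normalization details vary across implementations.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for one problem, GRPO-style.

    rewards: [G] verifier scores (e.g. 0/1 correctness) for G rollouts of the
    same prompt. Each rollout gets one scalar advantage, normalized within the
    group, which is then applied uniformly to all of its tokens; that uniform
    broadcast is what makes the signal sparse at the token level.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy group of 8 rollouts for one math problem, 3 verified correct.
r = torch.tensor([1., 0., 0., 1., 0., 0., 1., 0.])
print(grpo_advantages(r))  # correct rollouts positive, incorrect negative
```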
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study on reward allocation in language model post-training for verifiable math tasks. It argues that sparse sequence-level rewards like GRPO are best used to improve a strong teacher model, while dense token-level supervision via on-policy distillation (OPD) is better for transferring to a smaller student model. The key claim is that, for a Qwen3-1.7B student, distilling from an RL-improved 8B teacher outperforms direct application of GRPO to the student using the same labeled data. The paper also introduces a 'transfer bridge' consisting of forward-KL warmup on teacher rollouts followed by OPD on student rollouts, which improves performance and makes subsequent student-side RL more effective, as evidenced by MATH accuracy increasing from 75.4% to 78.5%.
Significance. If the comparisons hold under matched verifier budgets, the work offers a practical heuristic for allocating scarce labeled data and verification resources: apply sparse RL upstream for exploration by strong teachers and dense distillation downstream for student compression. Concrete gains on MATH and AIME with Qwen3 and Llama models, plus the observation that the bridge enhances later student RL, provide actionable guidance for post-training pipelines. The staged sparse-to-dense framing distinguishes the contribution from pure GRPO or pure distillation baselines.
major comments (2)
- Abstract: The central claim states that distilling from an RL-improved 8B teacher outperforms GRPO applied directly to the 1.7B student 'with the same labeled data.' GRPO relies on multiple rollouts per problem to estimate group-relative advantages, so the total number of verifier calls is not necessarily equalized between the teacher RL phase and the student GRPO baseline. Without explicit matching of rollout count or total reward evaluations, the observed gap may reflect unequal effective data volume rather than the proposed sparse-to-dense allocation rule. A toy accounting of verifier calls after this list illustrates the possible mismatch.
- Abstract: The reported MATH accuracy lift from 75.4% to 78.5% after the forward-KL + OPD bridge plus student-side GRPO (outperforming a matched replay control by 2.8 points) is presented without error bars, number of independent runs, or statistical significance tests. This makes it difficult to assess whether the gain is robust or sensitive to random seeds and hyperparameter choices.
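A toy accounting of the budget concern above. Every quantity here (problem count, group size, pass counts) is a made-up assumption for illustration; the point is only that the two arms consume verifier calls at different rates unless explicitly matched.

```python
# Illustrative verifier-call accounting; no number below comes from the paper.
n_problems = 1_000   # labeled, verifiable training problems
group_size = 8       # GRPO rollouts per problem per optimization pass

# Arm A: direct student GRPO. Every rollout of every pass hits the verifier.
student_passes = 4
student_grpo_calls = n_problems * group_size * student_passes

# Arm B: teacher-side GRPO, then distillation. Only the teacher stage uses
# the verifier; the forward-KL warmup and OPD stages use teacher log-probs.
teacher_passes = 6
teacher_grpo_calls = n_problems * group_size * teacher_passes
distillation_calls = 0

print(student_grpo_calls, teacher_grpo_calls + distillation_calls)
# 32000 vs 48000 under these assumptions: the arms see equal reward
# evaluations only if group size and pass counts are explicitly matched.
```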
minor comments (1)
- Abstract: Acronyms GRPO and OPD are used without initial expansion or brief definition on first appearance, which may hinder readers unfamiliar with the specific variants of group-relative policy optimization and on-policy distillation.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our work. We address each major comment below and will make the necessary revisions to strengthen the manuscript.
Point-by-point responses
- Referee: Abstract: The central claim states that distilling from an RL-improved 8B teacher outperforms GRPO applied directly to the 1.7B student 'with the same labeled data.' GRPO relies on multiple rollouts per problem to estimate group-relative advantages, so the total number of verifier calls is not necessarily equalized between the teacher RL phase and the student GRPO baseline. Without explicit matching of rollout count or total reward evaluations, the observed gap may reflect unequal effective data volume rather than the proposed sparse-to-dense allocation rule.
Authors: We appreciate the referee pointing out the potential mismatch in total verifier calls. The phrase 'with the same labeled data' in the abstract refers to using the same set of training problems (i.e., the same number of labeled examples). However, as noted, GRPO requires multiple rollouts per problem, leading to more verifier evaluations in the direct student GRPO setting. To address this, we will revise the manuscript to explicitly state the total number of reward evaluations used in each experiment and add a discussion on how the sparse-to-dense principle holds under this allocation. If space permits, we will include a controlled comparison with matched total verifier budgets. revision: yes
- Referee: Abstract: The reported MATH accuracy lift from 75.4% to 78.5% after the forward-KL + OPD bridge plus student-side GRPO (outperforming a matched replay control by 2.8 points) is presented without error bars, number of independent runs, or statistical significance tests. This makes it difficult to assess whether the gain is robust or sensitive to random seeds and hyperparameter choices.
Authors: We agree that the lack of error bars and statistical analysis limits the assessment of robustness. The reported numbers are from single runs, which is common in large-scale LM training due to high computational costs, but we acknowledge this as a limitation. In the revised manuscript, we will report results from multiple independent runs with error bars and include statistical significance tests for the key improvements, such as the 2.8-point gain over the replay control. revision: yes
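One standard way to attach uncertainty to a single-run accuracy gap of this kind is a paired bootstrap over per-problem correctness. The sketch below uses synthetic, independently drawn outcomes purely to show the mechanics; actual use would plug in the real paired per-problem results, and the problem count and accuracies are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def paired_bootstrap_pvalue(correct_a, correct_b, n_resamples=2000):
    """One-sided paired bootstrap over per-problem correctness.

    correct_a / correct_b: boolean arrays over the same evaluation problems,
    e.g. the bridged-then-GRPO student vs. the matched replay control.
    Returns the fraction of resamples in which system A fails to beat system B.
    """
    diff = correct_a.astype(float) - correct_b.astype(float)
    n = len(diff)
    gaps = np.empty(n_resamples)
    for i in range(n_resamples):
        gaps[i] = diff[rng.integers(0, n, size=n)].mean()  # resample problems with replacement
    return float((gaps <= 0).mean())

# Synthetic stand-in outcomes (independent draws, illustrative accuracies only);
# a real test would use the actual paired per-problem results.
n = 5000
a = rng.random(n) < 0.785
b = rng.random(n) < 0.757
print(paired_bootstrap_pvalue(a, b))
```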
Circularity Check
No circularity: purely empirical comparison of training strategies
full rationale
The paper advances an empirical allocation rule for sparse vs. dense rewards based on model strength, then validates it through direct experimental comparisons (RL-improved teacher distillation vs. direct GRPO on student) on Qwen3/Llama models with MATH/AIME. No derivations, equations, fitted parameters presented as predictions, or self-citation chains exist. The central claim reduces to observed accuracy differences under stated conditions, with no self-definitional or load-bearing reductions to inputs. This is a standard non-circular empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Sparse sequence-level reward is most useful for strong models that can explore and discover better behavior.