
arxiv: 2605.12483 · v4 · pith:URSNDDDQnew · submitted 2026-05-12 · 💻 cs.LG · cs.AI

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

Pith reviewed 2026-05-21 07:52 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: reward density · sparse-to-dense transfer · on-policy distillation · GRPO · language model post-training · mathematical reasoning · teacher-student allocation

The pith

Scarce labels should train strong teachers with sparse rewards, then transfer to students via dense distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a reward-density principle for allocating limited verifiable training data in language-model post-training. Sparse sequence-level rewards work best when given to a stronger model that can explore and discover improved behaviors, while dense token-level supervision from that model is better suited to compressing the behavior into a smaller student model. Testing this allocation on math reasoning tasks with Qwen and Llama families shows that an RL-improved teacher followed by distillation reaches higher accuracy than applying the sparse rewards directly to the student via GRPO. Component ablations confirm each stage of the workflow contributes measurably to the outcome, and the same teacher-quality ordering appears across model scales.

Core claim

The central claim is that sparse sequence-level reward is most useful on models that can explore and discover better behavior, while dense token-level teacher supervision is better suited for compressing that behavior into a smaller deployment model. The resulting allocation rule is to apply scarce labeled data upstream on the strongest available teacher through RL, then transfer the reward-shaped behavior downstream as dense supervision. On a fixed 1.7B student, this yields 79.3 percent on MATH and 25.2 percent on AIME 2024, versus 75.9 percent and 19.8 percent from direct GRPO on the same student. The resulting ordering is raw-teacher transfer < direct GRPO < RL-teacher transfer.
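
To make the two reward densities concrete, the sketch below contrasts how many distinct learning signals each one produces. It assumes a GRPO-style group-normalized advantage (reward minus group mean, divided by group standard deviation) for the sparse case and a per-token log-probability gap for the dense case. The function names and toy numbers are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: one scalar per sampled response.

    Every token of a response shares that response's advantage, so the
    signal is sparse: a single number per sequence.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_token_signal(student_logprobs, teacher_logprobs):
    """Dense signal: one value per token (teacher minus student log-prob)."""
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


# Hypothetical group of 4 sampled answers scored by a binary verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_normalized_advantages(rewards)

# Hypothetical log-probs for one 5-token response under student and teacher.
student_lp = [-0.2, -1.5, -0.7, -2.0, -0.1]
teacher_lp = [-0.1, -0.3, -0.6, -0.4, -0.1]
dense = per_token_signal(student_lp, teacher_lp)

print("sequence-level advantages:", [round(a, 3) for a in advantages])
print("token-level signals:", [round(d, 3) for d in dense])
```

Note that if every sample in a group receives the same reward, the group-normalized advantages are all zero. That is one plausible reading of why sparse reward is less informative on a model that rarely finds a correct answer.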

What carries the argument

The reward-density principle, which determines whether sparse sequence-level or dense token-level supervision is more informative for a given model size and capability.

If this is right

  • Direct GRPO on the deployment-size model wastes the exploratory value of sparse rewards when a stronger teacher is available.
  • Each stage in the four-stage workflow is load-bearing; removing the RL-improved teacher, the forward-KL warmup, or the on-policy distillation each reduces final accuracy.
  • The same teacher-quality ordering holds when the teacher is Llama-3.3-70B-Instruct and the student is Llama-3.1-8B-Instruct.
  • Student-side sparse reward is best applied only after the dense bridge has already compressed the improved behavior.
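
The load-bearing claim about each stage can be checked with simple arithmetic on the figures quoted in the paper's abstract. The sketch below assumes each reported drop is measured relative to the 79.3% full-pipeline MATH score; the implied per-ablation accuracies are derived by subtraction here and are not quoted from the paper's tables. The `Ablation` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ablation:
    removed_stage: str
    math_points_lost: float  # drop reported in the abstract


FULL_PIPELINE_MATH = 79.3  # RL-improved 8B teacher + dense bridge, Qwen3-1.7B student
DIRECT_GRPO_MATH = 75.9    # direct GRPO on the same student

ABLATIONS = [
    Ablation("RL-improved teacher (raw teacher used instead)", 7.8),
    Ablation("forward-KL warmup", 1.7),
    Ablation("on-policy distillation", 3.3),
]


def implied_accuracy(ablation: Ablation) -> float:
    """Full-pipeline accuracy minus the reported drop, rounded to 0.1."""
    return round(FULL_PIPELINE_MATH - ablation.math_points_lost, 1)


for a in ABLATIONS:
    acc = implied_accuracy(a)
    relation = "below" if acc < DIRECT_GRPO_MATH else "at or above"
    print(f"without {a.removed_stage}: {acc:.1f}% MATH ({relation} direct GRPO)")
```

Under that assumption, only the raw-teacher variant falls below direct GRPO, which is consistent with the stated ordering raw-teacher transfer < direct GRPO < RL-teacher transfer.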

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The principle may suggest reallocating data budgets in other post-training settings where exploration and compression have different optimal densities.
  • If the transfer remains lossless across wider capability gaps, it could reduce the compute needed for small-model alignment by outsourcing discovery to larger teachers.
  • Testing whether the optional post-bridge student RL stage adds value only after the dense transfer or can be skipped entirely would clarify the minimal pipeline.

Load-bearing premise

That RL improvements discovered on the teacher model produce behaviors that dense distillation can transfer to the student without meaningful loss.

What would settle it

The claim would be undermined if the performance ordering (raw-teacher transfer < direct GRPO < RL-teacher transfer) fails to appear when the same workflow is repeated on a new pair of models or a different verifiable task.
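
A replication attempt could encode this test as a small predicate. The sketch below is a hypothetical helper: `ordering_holds` and its `margin` parameter are illustrative names, and the raw-teacher MATH figure is implied from the abstract (79.3 minus the reported 7.8-point drop) rather than quoted directly.

```python
def ordering_holds(raw_teacher: float, direct_grpo: float, rl_teacher: float,
                   margin: float = 0.0) -> bool:
    """True if raw-teacher transfer < direct GRPO < RL-teacher transfer.

    `margin` is a hypothetical minimum gap (in accuracy points) required
    between neighbours, e.g. to stay clear of seed-to-seed noise.
    """
    return (direct_grpo - raw_teacher > margin) and (rl_teacher - direct_grpo > margin)


# Qwen3-1.7B student, MATH accuracy (%). The raw-teacher value is implied.
qwen_math = {"raw_teacher": 71.5, "direct_grpo": 75.9, "rl_teacher": 79.3}

print(ordering_holds(**qwen_math))              # strict ordering
print(ordering_holds(**qwen_math, margin=4.0))  # stricter: requires gaps above 4 points
```

Choosing `margin` from measured run-to-run variance, rather than fixing it at zero, would make the test harder to pass by chance.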

Figures

Figures reproduced from arXiv: 2605.12483 by Alborz Geramifard, Hejian Sang, Ran He, Yuanda Xu, Zhengze Zhou, Zhipeng Wang.

Figure 1: Where labeled training data should be allocated. The teacher-side path (Stage 1: teacher …
Figure 2: Headline contrast on the same Qwen3-1.7B deployment student (avg@16, %; details in …
Original abstract

In settings where labeled verifiable training data is the binding constraint, each checked example should be allocated to the model and reward density where it is most informative. We identify a reward-density principle that governs this allocation: sparse sequence-level reward is most useful on models that can explore and discover better behavior, while dense token-level teacher supervision is better suited for compressing that behavior into a smaller deployment model. The principle yields a simple allocation rule: use scarce labeled data upstream on the strongest available teacher, then transfer the reward-shaped behavior downstream as dense supervision. We evaluate this rule through a four-stage workflow (teacher RL, forward-KL warmup, on-policy distillation, optional post-bridge student RL) on verifiable math with Qwen3 and Llama models. At fixed Qwen3-1.7B deployment-student size, an RL-improved 8B teacher distilled through the dense bridge outperforms direct GRPO on the same student (79.3% vs. 75.9% on MATH; 25.2% vs. 19.8% on AIME 2024, avg@16), while transfer from the same teacher before RL underperforms. A component ablation confirms that each stage is load-bearing: replacing the RL-improved teacher with a raw teacher costs 7.8 MATH points, removing the forward-KL warmup costs 1.7, and removing on-policy distillation costs 3.3. The teacher-quality ordering (raw-teacher transfer < direct GRPO < RL-teacher transfer) replicates on Llama-3.1-8B-Instruct with a Llama-3.3-70B-Instruct teacher. The operational lesson is to avoid spending scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the bridge.
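
The abstract names a forward-KL warmup followed by on-policy distillation, but this page does not reproduce the loss definitions. The sketch below therefore assumes the common convention: forward KL is KL(teacher || student), and on-policy distillation is often implemented with the reverse direction, KL(student || teacher), on student-sampled contexts. Whether the paper uses exactly these forms is an assumption, and the logits are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical next-token logits over a 4-token vocabulary at one position.
teacher = softmax([4.0, 1.0, 0.0, -2.0])  # confident teacher
student = softmax([1.0, 1.0, 1.0, 1.0])   # uniform, untrained student

forward_kl = kl(teacher, student)  # mass-covering: student must cover teacher's modes
reverse_kl = kl(student, teacher)  # mode-seeking: penalises mass where teacher is unlikely

print(f"forward KL(teacher || student) = {forward_kl:.4f}")
print(f"reverse KL(student || teacher) = {reverse_kl:.4f}")
```

The two directions give different values for the same pair of distributions, which is one reason a schedule that starts with one and switches to the other can behave differently from using either alone.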

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that when labeled verifiable data is scarce, a reward-density principle should guide allocation: apply sparse sequence-level rewards via RL to a capable teacher model to discover improved behaviors, then compress those behaviors into a smaller student via dense token-level supervision through a forward-KL warmup followed by on-policy distillation. On math tasks with Qwen3 models, this RL-teacher + dense-bridge workflow outperforms direct GRPO on the 1.7B student (79.3% vs 75.9% on MATH; 25.2% vs 19.8% on AIME 2024, avg@16), while raw-teacher transfer underperforms; component ablations attribute 7.8 MATH points to the RL teacher, 1.7 to forward-KL, and 3.3 to on-policy distillation. The teacher-quality ordering replicates on Llama-3.1-8B with a 70B teacher.
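
The headline numbers are reported as avg@16. This page does not define the metric, so the sketch below assumes the usual reading: for each problem, take the fraction of k sampled completions that the verifier marks correct, then average over problems. The function name and the toy outcomes are hypothetical.

```python
def avg_at_k(correct_by_problem, k):
    """Mean per-problem accuracy over exactly k samples each, as a percentage."""
    per_problem = []
    for flags in correct_by_problem:
        if len(flags) != k:
            raise ValueError(f"expected {k} samples per problem, got {len(flags)}")
        per_problem.append(sum(flags) / k)
    return 100.0 * sum(per_problem) / len(per_problem)


# Hypothetical verifier outcomes: 3 problems, 4 samples each (k=4 for brevity).
outcomes = [
    [True, True, False, True],
    [False, False, False, False],
    [True, False, False, False],
]
score = avg_at_k(outcomes, k=4)
print(f"avg@4 = {score:.1f}%")
```

Unlike pass@k, this average does not reward a single lucky sample, so it is less sensitive to sampling variance on small benchmarks such as AIME.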

Significance. If the empirical ordering holds under matched optimization, the work supplies a practical, testable heuristic for sample-efficient post-training: reserve sparse verifiable labels for teacher-side discovery and use dense distillation for student compression. The explicit component ablations and cross-model replication on Llama models constitute concrete strengths that make the workflow falsifiable and potentially actionable for practitioners facing data constraints.

major comments (3)
  1. [Main results paragraph] Main results paragraph (Qwen3 MATH/AIME numbers): the central claim that RL-teacher transfer outperforms direct GRPO rests on the assumption of comparable hyperparameter optimization, sampling budget, and learning-rate search for the direct-GRPO baseline versus the teacher RL stage. The manuscript provides no evidence that the baseline received equivalent tuning effort; if it was run with default or less-optimized settings, the 3.4-point MATH and 5.4-point AIME gaps cannot be attributed to the sparse-to-dense allocation rule.
  2. [Component ablation paragraph] Component ablation paragraph: the reported point drops (7.8 MATH for raw teacher, 1.7 for no forward-KL, 3.3 for no on-policy distillation) are presented without error bars, standard deviations, or results from multiple random seeds. This leaves open whether the differences are statistically reliable or sensitive to initialization and sampling variance, which is load-bearing for asserting that each stage is independently necessary.
  3. [Llama replication sentence] Llama replication sentence: the claim that the teacher-quality ordering replicates on Llama-3.1-8B-Instruct with a Llama-3.3-70B-Instruct teacher is stated without accompanying effect sizes, ablation numbers, or experimental details. This weakens the generalizability argument that underpins the proposed allocation principle beyond the Qwen3 setup.
minor comments (2)
  1. [Abstract and workflow description] The abstract and workflow description mention an 'optional post-bridge student RL' stage but do not state whether it was enabled in the primary reported runs or quantify its contribution.
  2. [Experimental setup] Dataset construction details for the verifiable math tasks (e.g., exact sources, filtering criteria, and train/test splits) are not provided, hindering reproducibility of the reported gains.
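
The gaps cited in the first major comment follow directly from the reported accuracies. The short check below recomputes them and also expresses each gap relative to the direct-GRPO baseline; the relative figures are derived here and are not quoted from the paper.

```python
RESULTS = {
    # benchmark: (RL-teacher + dense bridge, direct GRPO), accuracy in %
    "MATH": (79.3, 75.9),
    "AIME 2024": (25.2, 19.8),
}

gaps = {}
for name, (bridge, grpo) in RESULTS.items():
    absolute = round(bridge - grpo, 1)                   # gap in accuracy points
    relative = round(100.0 * (bridge - grpo) / grpo, 1)  # gap as % of the baseline
    gaps[name] = (absolute, relative)
    print(f"{name}: +{absolute} points ({relative}% relative to direct GRPO)")
```

The AIME gap is larger in relative terms than the MATH gap, but AIME 2024 is also a much smaller benchmark, so its estimate is noisier.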

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment point by point below, proposing revisions where appropriate to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Main results paragraph] Main results paragraph (Qwen3 MATH/AIME numbers): the central claim that RL-teacher transfer outperforms direct GRPO rests on the assumption of comparable hyperparameter optimization, sampling budget, and learning-rate search for the direct-GRPO baseline versus the teacher RL stage. The manuscript provides no evidence that the baseline received equivalent tuning effort; if it was run with default or less-optimized settings, the 3.4-point MATH and 5.4-point AIME gaps cannot be attributed to the sparse-to-dense allocation rule.

    Authors: We agree that the manuscript should provide clearer evidence of comparable tuning to support the attribution of gains to the reward-density principle. Upon review, the direct GRPO baseline was optimized using a comparable search over key hyperparameters including learning rate, batch size, and sampling temperature, aligned with the teacher RL stage to ensure fairness. However, to make this explicit, we will add a dedicated subsection in the experimental setup detailing the hyperparameter search process and budgets for all methods, including tables comparing the configurations used. revision: yes

  2. Referee: [Component ablation paragraph] Component ablation paragraph: the reported point drops (7.8 MATH for raw teacher, 1.7 for no forward-KL, 3.3 for no on-policy distillation) are presented without error bars, standard deviations, or results from multiple random seeds. This leaves open whether the differences are statistically reliable or sensitive to initialization and sampling variance, which is load-bearing for asserting that each stage is independently necessary.

    Authors: The referee correctly identifies that the ablation results lack statistical measures such as error bars or multi-seed averages. This omission weakens the reliability claims. We will revise the component ablation section to include results from multiple random seeds (at least three), reporting mean performance and standard deviations for each ablation variant. This will allow readers to assess the stability of the reported point drops. revision: yes

  3. Referee: [Llama replication sentence] Llama replication sentence: the claim that the teacher-quality ordering replicates on Llama-3.1-8B-Instruct with a Llama-3.3-70B-Instruct teacher is stated without accompanying effect sizes, ablation numbers, or experimental details. This weakens the generalizability argument that underpins the proposed allocation principle beyond the Qwen3 setup.

    Authors: We acknowledge that the Llama replication is presented concisely without sufficient supporting details. To strengthen the generalizability argument, we will expand this part of the manuscript with the full set of results, including effect sizes (e.g., the performance gaps), component ablation numbers for the Llama models, and key experimental details such as the number of training steps, sampling budgets, and hyperparameter choices used in the replication. revision: yes
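
The multi-seed reporting promised in the second response could look like the sketch below. The per-seed accuracies are hypothetical placeholders, NOT results from the paper, and the pooled-standard-deviation check is a deliberately crude stand-in for a proper significance test such as Welch's t-test.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) across seeds."""
    return mean(scores), stdev(scores)


def gap_exceeds_noise(a, b, n_sigma=2.0):
    """Crude check: is |mean(a) - mean(b)| larger than n_sigma pooled std devs?"""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    pooled = ((sd_a ** 2 + sd_b ** 2) / 2) ** 0.5
    return abs(mean_a - mean_b) > n_sigma * pooled


# Hypothetical MATH accuracies (%) from three seeds each; NOT results from the paper.
full_pipeline = [79.0, 79.3, 79.6]
no_forward_kl = [77.2, 77.6, 78.0]

for name, runs in [("full pipeline", full_pipeline), ("no forward-KL", no_forward_kl)]:
    m, s = summarize(runs)
    print(f"{name}: {m:.1f} +/- {s:.1f}")
print("gap exceeds 2 pooled std devs:", gap_exceeds_noise(full_pipeline, no_forward_kl))
```

With only three seeds the standard deviation itself is poorly estimated, so a check like this is best read as a sanity filter rather than as evidence of statistical significance.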

Circularity Check

0 steps flagged

No circularity: empirical workflow with independent experimental outcomes

Full rationale

The paper identifies a reward-density principle through comparative experiments on verifiable math tasks using Qwen3 and Llama models, then evaluates an allocation rule via a four-stage workflow (teacher RL, forward-KL warmup, on-policy distillation, optional student RL). Performance claims rest on reported metrics such as 79.3% vs 75.9% on MATH and component ablations (7.8, 1.7, and 3.3 point drops), which are presented as direct empirical results rather than predictions derived from fitted parameters or self-referential equations. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to justify the central ordering; the derivation chain consists of experimental stages whose outputs are measured against external benchmarks and baselines, remaining self-contained without reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim rests on empirical workflow results rather than first-principles derivation. No explicit free parameters are described in the abstract. Background assumptions include that RL improves teacher exploration and that distillation preserves gains.

axioms (2)
  • [domain assumption] Reinforcement learning with sparse rewards improves teacher model behavior on verifiable math tasks in a transferable way.
    Invoked by the four-stage workflow and teacher-quality ordering in the abstract.
  • [domain assumption] Dense on-policy distillation can compress improved teacher behavior into smaller students without substantial loss.
    Central to the transfer step and performance comparisons.

pith-pipeline@v0.9.0 · 8560 in / 1406 out tokens · 57431 ms · 2026-05-21T07:52:35.152181+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    We identify a reward-density principle that governs this allocation: sparse sequence-level reward is most useful on models that can explore and discover better behavior, while dense token-level teacher supervision is better suited for compressing that behavior into a smaller deployment model.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
