Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation
Pith reviewed 2026-05-19 20:45 UTC · model grok-4.3
The pith
Decoupling prefix source from token-level KL direction reveals four distinct LLM distillation objectives that unify SFT, DAgger, offline RL, and OPD.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Decomposing sequence-level KL divergence over autoregressive response distributions shows that off-policy distillation pairs teacher prefixes with token-level forward KL while on-policy distillation pairs student prefixes with token-level reverse KL. This implicit coupling is unnecessary; separating the two axes yields four valid objectives whose gradients recover SFT-style cross-entropy matching, DAgger-style on-policy correction, offline-RL-style dense reward signals, and on-policy distillation. Controlled experiments indicate that KL direction controls an accuracy-entropy tradeoff, prefix source controls a quality-compute tradeoff, and training horizon controls an accuracy-stability tradeoff.
What carries the argument
Decomposition of sequence-level KL divergence into independent prefix-source and token-level KL-direction axes over autoregressive teacher and student distributions.
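In symbols, the decomposition can be sketched as follows. The notation is an assumption pieced together from the abstract and from the passage quoted in the Lean section of this page: $p_T$ is the teacher, $q_\theta$ the student, $s_t$ the prefix at step $t$, and $d_t^{T}$, $d_t^{\theta}$ the prefix distributions induced by teacher and student rollouts on prompt $x$. The labels in the four cells follow the order in which the abstract lists the methods; they are a reading of the abstract, not a quotation from the paper.

```latex
% Matched pairings: the chain rule for KL over autoregressive distributions
\begin{align*}
\mathrm{KL}(p_T \,\|\, q_\theta)(x)
  &= \sum_{t} \mathbb{E}_{s_t \sim d_t^{T}}
     \big[\mathrm{KL}\big(p_T(\cdot \mid s_t) \,\|\, q_\theta(\cdot \mid s_t)\big)\big] \\
\mathrm{KL}(q_\theta \,\|\, p_T)(x)
  &= \sum_{t} \mathbb{E}_{s_t \sim d_t^{\theta}}
     \big[\mathrm{KL}\big(q_\theta(\cdot \mid s_t) \,\|\, p_T(\cdot \mid s_t)\big)\big]
\end{align*}

% Decoupled family: choose the prefix source \mu and the token-level direction independently
\[
\begin{array}{l|ll}
                              & \text{forward token KL}       & \text{reverse token KL} \\ \hline
\text{teacher prefixes } (\mu = T)      & \text{off-policy SFT}         & \text{offline-RL-style distillation} \\
\text{student prefixes } (\mu = \theta) & \text{DAgger-style on-policy SFT} & \text{OPD}
\end{array}
\]
```

Only the diagonal cells equal a sequence-level KL; the off-diagonal cells are token-level objectives in their own right.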
If this is right
- Forward KL with teacher prefixes recovers SFT-style cross-entropy matching to teacher soft targets.
- Reverse KL with student prefixes recovers an RL-style policy-gradient update using the teacher-student log-ratio as a dense reward.
- KL direction induces a measurable accuracy-entropy tradeoff in reasoning tasks.
- Prefix source induces a measurable quality-compute tradeoff during data generation.
- Training length induces an accuracy-stability tradeoff that can be mitigated by curriculum design.
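The first two bullets above rest on two closed-form gradients for a single prefix. The sketch below, which assumes a softmax student parameterized by logits and uses made-up distributions, checks both against finite differences: the forward-KL gradient is `q - p` (cross-entropy against teacher soft targets), and the reverse-KL gradient is a policy-gradient-like term in which the teacher-student log-ratio acts as a reward with a mean baseline. All function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(a, b):
    """KL(a || b) for two strictly positive categorical distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

def forward_kl_grad(teacher_probs, student_logits):
    """Gradient of KL(teacher || student) w.r.t. student logits: q - p."""
    q = softmax(student_logits)
    return [qi - pi for pi, qi in zip(teacher_probs, q)]

def reverse_kl_grad(teacher_probs, student_logits):
    """Gradient of KL(student || teacher) w.r.t. student logits.

    Equals q_j * (c_j - mean_c) with c_j = log q_j - log p_j. The negated
    cost -c_j is the dense log-ratio reward; mean_c acts as a baseline.
    """
    q = softmax(student_logits)
    cost = [math.log(qi / pi) for pi, qi in zip(teacher_probs, q)]
    baseline = sum(qi * ci for qi, ci in zip(q, cost))
    return [qi * (ci - baseline) for qi, ci in zip(q, cost)]

def numeric_grad(loss_fn, logits, eps=1e-6):
    """Central finite differences, used only to sanity-check the closed forms."""
    grads = []
    for j in range(len(logits)):
        up = list(logits)
        up[j] += eps
        down = list(logits)
        down[j] -= eps
        grads.append((loss_fn(up) - loss_fn(down)) / (2 * eps))
    return grads

# Hypothetical teacher distribution and student logits for one prefix.
teacher = softmax([2.0, 0.5, -1.0, 0.0])
student_logits = [0.3, 1.2, -0.4, 0.1]

fwd_closed = forward_kl_grad(teacher, student_logits)
fwd_numeric = numeric_grad(lambda z: kl(teacher, softmax(z)), student_logits)
rev_closed = reverse_kl_grad(teacher, student_logits)
rev_numeric = numeric_grad(lambda z: kl(softmax(z), teacher), student_logits)
```

Both gradients sum to zero across the vocabulary, as expected for a softmax parameterization; they differ in whether the weighting comes from the teacher (`p`) or from the student's own samples (`q`).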
Where Pith is reading between the lines
- The same decomposition could be applied to other sequence-level divergences to generate additional distillation variants beyond the four objectives.
- KL mixing weights could be scheduled dynamically based on current response entropy rather than fixed ratios.
- The entropy-gated curriculum may generalize to non-reasoning generation tasks where length inflation harms latency.
- Initializations from the four objectives could be tested for their effect on sample efficiency in subsequent reinforcement learning stages.
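The entropy-driven mixing idea in the list above can be sketched as follows. Everything here is hypothetical: the page does not describe how the paper's KL mixing is implemented, and the linear schedule, its thresholds, and the function names are invented for illustration. The only grounded design choice is the direction of the schedule, which follows the abstract's observation that forward-KL weight guards against entropy collapse, so the forward weight rises as student entropy falls.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def kl(a, b):
    """KL(a || b) for two strictly positive categorical distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

def forward_weight(student_entropy, low=0.5, high=2.0, w_min=0.2, w_max=0.9):
    """Hypothetical schedule: linear ramp from w_max (low entropy) to w_min (high entropy)."""
    if student_entropy <= low:
        return w_max
    if student_entropy >= high:
        return w_min
    frac = (student_entropy - low) / (high - low)
    return w_max + frac * (w_min - w_max)

def mixed_token_kl(teacher_probs, student_logits):
    """Return (loss, weight) for w * forward KL + (1 - w) * reverse KL at one prefix."""
    q = softmax(student_logits)
    w = forward_weight(entropy(q))
    loss = w * kl(teacher_probs, q) + (1.0 - w) * kl(q, teacher_probs)
    return loss, w
```

In a real training loop the weight would normally be treated as a constant for the backward pass, so that the schedule itself does not receive gradients.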
Load-bearing premise
The decomposition of sequence-level KL divergence into independent prefix-source and token-level KL-direction axes is valid and produces four distinct, usable objectives without hidden inconsistencies or additional constraints.
What would settle it
A controlled run in which one of the four decoupled objectives produces training dynamics or final performance that cannot be recovered from the claimed gradient identities for forward or reverse KL.
Original abstract
Knowledge distillation is central to LLM post-training, yet its design space remains poorly understood, especially alongside reinforcement learning (RL). We show that the prevailing paradigms, off-policy distillation and on-policy distillation (OPD), implicitly couple two orthogonal choices: prefix source and token-level KL direction. This follows from decomposing sequence-level KL over autoregressive response distributions: forward KL pairs teacher prefixes with token-level forward KL, and reverse KL pairs student prefixes with token-level reverse KL. We argue this coupling is not intrinsic: decoupling the two axes yields four valid objectives. We establish gradient-level identities showing forward KL gives SFT-style cross-entropy matching with teacher soft targets, whereas reverse KL gives an RL-style policy-gradient objective with a dense teacher-student log-ratio reward, connecting them to off-policy SFT, DAgger-style on-policy SFT, offline-RL-style distillation, and OPD. We conduct an extensive controlled study on math reasoning, evaluating the four objectives both as standalone methods and as initializations for subsequent RL. The results reveal three tradeoffs: KL direction induces an accuracy-entropy tradeoff, prefix source a quality-compute tradeoff, and training length an accuracy-stability tradeoff. Motivated by these findings, we propose KL mixing and an entropy-gated length curriculum. KL mixing shows long-sequence distillation requires substantial forward-KL weight to prevent entropy collapse and length inflation without sacrificing accuracy. The entropy-gated length curriculum improves Avg@k and Pass@k by 3.6 and up to 5.8 points, and cuts average response length by roughly 3x versus fixed long-horizon training. Our results provide a framework and practical methods for designing reasoning distillation objectives that balance accuracy, diversity, compute, and RL behavior.
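The abstract reports gains in Avg@k and Pass@k without defining them on this page. A minimal sketch, assuming the common definitions (Avg@k as mean correctness over sampled responses, Pass@k as the probability that at least one of k samples is correct, estimated without bias from n samples of which c are correct); the paper's exact protocol may differ.

```python
import math

def avg_at_k(correct_flags):
    """Mean correctness over the sampled responses for one problem."""
    return sum(correct_flags) / len(correct_flags)

def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k samples is correct).

    n: total samples drawn for the problem, c: number correct, k: budget (k <= n).
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

Benchmark-level scores are then the average of these per-problem values.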
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that sequence-level KL divergence over autoregressive distributions decomposes into orthogonal choices of prefix source (teacher vs. student) and token-level KL direction (forward vs. reverse). This decomposition reveals that standard off-policy distillation and on-policy distillation (OPD) implicitly couple these axes, while decoupling them produces four distinct objectives. Gradient-level identities are derived showing that forward KL recovers SFT-style cross-entropy with soft targets and reverse KL recovers an RL-style policy gradient with dense log-ratio rewards, thereby unifying SFT, DAgger, offline RL-style distillation, and OPD. A controlled empirical study on math reasoning evaluates the four objectives standalone and as RL initializations, identifies accuracy-entropy, quality-compute, and accuracy-stability tradeoffs, and proposes KL mixing plus an entropy-gated length curriculum that improves Avg@k and Pass@k while reducing response length.
Significance. If the gradient identities and validity of the four decoupled objectives hold without hidden inconsistencies, the work supplies a principled taxonomy that connects distillation to RL and supplies practical levers (KL mixing, entropy-gated curriculum) for balancing accuracy, diversity, and compute in reasoning models. The controlled study and proposed methods are concrete strengths that could guide future post-training design.
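To make the second lever concrete, here is a toy sketch of what an entropy-gated length curriculum could look like. The page does not describe the paper's gating rule; the rule below (grow the maximum response length only while mean token entropy stays above a floor) and every parameter value are invented for illustration.

```python
def next_max_length(current_len, mean_token_entropy,
                    entropy_floor=0.3, growth=2, cap=16384):
    """Hypothetical gate: lengthen the horizon only while entropy stays above a floor."""
    if mean_token_entropy < entropy_floor:
        return current_len  # hold the horizon: low entropy suggests collapse risk
    return min(current_len * growth, cap)

def run_schedule(entropies, start_len=2048):
    """Apply the gate once per training stage and return the length at each stage."""
    lengths = [start_len]
    for h in entropies:
        lengths.append(next_max_length(lengths[-1], h))
    return lengths
```

With stage entropies `[0.8, 0.2, 0.5, 0.9, 0.9]`, the horizon doubles on the first stage, holds on the second, then doubles twice more before hitting the cap.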
major comments (2)
- [§3] §3 (decomposition and gradient identities): The matched cases (teacher prefixes + forward token KL; student prefixes + reverse token KL) correctly recover sequence-level forward and reverse KL. However, the decoupled case of teacher prefixes with reverse token KL yields E[prefix ~ teacher] KL(student token || teacher token), which equals neither sequence-level KL nor its reverse. The manuscript must explicitly derive whether this objective's gradients are unbiased estimators of any target or whether importance-sampling corrections are required; without this, the claim that all four combinations are 'valid objectives' rests on an unverified assumption.
- [Experimental section] Experimental section and appendix (controlled study): The abstract and results claim three tradeoffs and improvements from KL mixing and the entropy-gated curriculum (3.6–5.8 points on Avg@k/Pass@k, ~3× length reduction). Full details on data exclusion rules, exact hyperparameter grids, and statistical significance tests are needed to verify that the observed differences are attributable to the decoupled objectives rather than confounding factors in the math-reasoning setup.
minor comments (2)
- [§3] Notation for the four objectives should be introduced with a single summary table or diagram early in §3 to make the prefix/KL-direction axes immediately legible.
- [§3] The abstract states 'we establish gradient-level identities'; the main text should include the precise gradient expressions (with expectations and baselines) rather than only high-level descriptions.
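As a sketch of what such expressions could look like for a single prefix $s$ (derived here from the standard score-function identity, not copied from the paper; $p_T$, $q_\theta$, the reward $r$, and the baseline $b$ are assumed notation):

```latex
\begin{align*}
\nabla_\theta\, \mathrm{KL}\big(p_T(\cdot \mid s) \,\|\, q_\theta(\cdot \mid s)\big)
  &= -\,\mathbb{E}_{a \sim p_T(\cdot \mid s)}
     \big[\nabla_\theta \log q_\theta(a \mid s)\big] \\
\nabla_\theta\, \mathrm{KL}\big(q_\theta(\cdot \mid s) \,\|\, p_T(\cdot \mid s)\big)
  &= -\,\mathbb{E}_{a \sim q_\theta(\cdot \mid s)}
     \big[\big(r(s,a) - b(s)\big)\, \nabla_\theta \log q_\theta(a \mid s)\big],
  \qquad r(s,a) = \log \frac{p_T(a \mid s)}{q_\theta(a \mid s)}
\end{align*}
```

The first line is the gradient of cross-entropy against teacher soft targets. The second is a policy-gradient form in which any baseline $b(s)$ is admissible because $\mathbb{E}_{a \sim q_\theta}[\nabla_\theta \log q_\theta(a \mid s)] = 0$. When prefixes come from the teacher, the prefix distribution does not depend on $\theta$, so the outer expectation over $s$ passes through the gradient unchanged. When prefixes come from the student, the prefix distribution does depend on $\theta$; whether the paper differentiates through it or treats sampled prefixes as fixed is not stated on this page.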
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments help strengthen the theoretical clarity around the decomposition and improve the reproducibility of the experimental results. We address each major comment point by point below.
Point-by-point responses
-
Referee: [§3] §3 (decomposition and gradient identities): The matched cases (teacher prefixes + forward token KL; student prefixes + reverse token KL) correctly recover sequence-level forward and reverse KL. However, the decoupled case of teacher prefixes with reverse token KL yields E[prefix ~ teacher] KL(student token || teacher token), which equals neither sequence-level KL nor its reverse. The manuscript must explicitly derive whether this objective's gradients are unbiased estimators of any target or whether importance-sampling corrections are required; without this, the claim that all four combinations are 'valid objectives' rests on an unverified assumption.
Authors: We acknowledge that the teacher-prefix + reverse-token-KL combination does not recover the full sequence-level reverse KL. This objective is nevertheless well-defined and distinct, corresponding to an off-policy reverse-KL variant that uses teacher-generated prefixes while applying token-level reverse KL. In the revised manuscript we will add an explicit gradient derivation in §3 for all four combinations. For the teacher-prefix + reverse case the gradient reduces to an unbiased estimator of the expected dense log-ratio reward under the fixed teacher prefix distribution; no additional importance-sampling correction is required beyond the standard on-policy token sampling. This derivation confirms that each of the four objectives possesses a valid, optimizable gradient and supports their status as distinct targets. revision: yes
-
Referee: [Experimental section] Experimental section and appendix (controlled study): The abstract and results claim three tradeoffs and improvements from KL mixing and the entropy-gated curriculum (3.6–5.8 points on Avg@k/Pass@k, ~3× length reduction). Full details on data exclusion rules, exact hyperparameter grids, and statistical significance tests are needed to verify that the observed differences are attributable to the decoupled objectives rather than confounding factors in the math-reasoning setup.
Authors: We agree that expanded experimental details are required for full reproducibility and to rule out confounds. In the revised version we will augment both the main experimental section and the appendix with: (i) the precise data exclusion rules applied to the math-reasoning benchmarks, (ii) the complete hyperparameter grid (learning rates, batch sizes, KL coefficients, entropy thresholds, and curriculum schedules) together with the final selected values, and (iii) statistical significance results including standard errors over multiple random seeds and paired tests (e.g., bootstrap confidence intervals or Wilcoxon signed-rank) on the reported Avg@k and Pass@k gains. These additions will allow readers to attribute the observed accuracy, length, and stability improvements directly to the proposed objectives and methods. revision: yes
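One of the tests named in the response, a paired bootstrap confidence interval, can be sketched with the standard library alone. The function name, the percentile method, and the defaults are assumptions for illustration; the page does not say which procedure the authors will use.

```python
import random

def paired_bootstrap_ci(baseline_scores, method_scores,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-problem score difference.

    Scores must be paired: index i refers to the same problem in both lists.
    Returns (observed_mean_difference, ci_low, ci_high).
    """
    if len(baseline_scores) != len(method_scores):
        raise ValueError("score lists must be paired and equal in length")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    diffs = [m - b for b, m in zip(baseline_scores, method_scores)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping each pair intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, low, high
```

Resampling whole problems rather than individual responses keeps the pairing intact, which is what makes the interval a statement about the difference between methods rather than about either method alone. An interval that excludes zero is evidence that the gain is not a resampling artifact; it does not rule out confounds in the setup.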
Circularity Check
KL decomposition and gradient identities follow from standard autoregressive properties
The paper derives its four objectives by applying the chain rule to sequence-level KL divergence over autoregressive token distributions, yielding the known pairings (teacher prefixes with forward token KL; student prefixes with reverse token KL) plus the two decoupled combinations. These identities are standard consequences of the definition of KL for product distributions and do not rely on fitted parameters, self-referential definitions, or load-bearing self-citations. The connections to SFT-style cross-entropy and RL-style policy gradients are obtained by direct expansion of the resulting objectives, which remain mathematically independent of the target results. The empirical study on math reasoning evaluates the objectives as standalone methods and RL initializations without circular reduction to inputs. No step reduces by construction to its own outputs or to unverified prior work by the same authors.
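The chain-rule step can be checked by brute force on a toy model. The sketch below uses two hypothetical two-step autoregressive models over a binary vocabulary (all probabilities are made up), enumerates the four possible sequences to get the sequence-level KL, and compares it with the prefix-weighted sum of token-level KLs.

```python
import math
from itertools import product

def kl(a, b):
    """KL(a || b) for two strictly positive categorical distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

# Hypothetical models: first[v] = P(y1 = v); second[y1][v] = P(y2 = v | y1).
teacher = {"first": [0.7, 0.3], "second": [[0.9, 0.1], [0.2, 0.8]]}
student = {"first": [0.4, 0.6], "second": [[0.6, 0.4], [0.5, 0.5]]}

def seq_probs(model):
    """Probabilities of the four sequences (y1, y2), in a fixed enumeration order."""
    return [model["first"][a] * model["second"][a][b]
            for a, b in product(range(2), repeat=2)]

def token_objective(prefix_model, direction):
    """Sum over steps of E_{prefix ~ prefix_model}[token-level KL in `direction`]."""
    def token_kl(p, q):
        return kl(p, q) if direction == "forward" else kl(q, p)
    step1 = token_kl(teacher["first"], student["first"])  # empty prefix
    step2 = sum(prefix_model["first"][a]
                * token_kl(teacher["second"][a], student["second"][a])
                for a in range(2))
    return step1 + step2

seq_forward = kl(seq_probs(teacher), seq_probs(student))
seq_reverse = kl(seq_probs(student), seq_probs(teacher))
matched_forward = token_objective(teacher, "forward")    # teacher prefixes + forward KL
matched_reverse = token_objective(student, "reverse")    # student prefixes + reverse KL
decoupled_reverse = token_objective(teacher, "reverse")  # teacher prefixes + reverse KL
```

The matched pairings reproduce the sequence-level values, while the teacher-prefix reverse-KL objective lands on a third value. That is consistent with the referee's point that the decoupled objectives are well-defined but are not sequence-level KLs.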
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Sequence-level KL over autoregressive response distributions decomposes into a choice of prefix source paired with either forward or reverse token-level KL.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "forward KL pairs teacher prefixes with token-level forward KL, and reverse KL pairs student prefixes with token-level reverse KL... Their Cartesian product yields four token-level distillation objectives"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: absolute_floor_iff_bare_distinguishability (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: $\mathrm{KL}(p_T \,\|\, q_\theta)(x) = \sum_{t} \mathbb{E}_{s_t \sim d_t^{T}}\big[\mathrm{KL}\big(p_T(\cdot \mid s_t) \,\|\, q_\theta(\cdot \mid s_t)\big)\big]$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations (ICLR), 2024.
- [2] AMC2023. American mathematics competitions, 2023. URL https://artofproblemsolving.com/wiki/index.php/AMC_Problems_and_Solutions
- [3] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2015.
- [4] Howard Chen, Noam Razin, Karthik Narasimhan, and Danqi Chen. Retaining by doing: The role of on-policy data in mitigating forgetting, 2025. URL https://arxiv.org/abs/2510.18874
- [5] Xinghao Chen, Zhijing Sun, Guo Wenjin, Miaoran Zhang, Yanjun Chen, Yirong Sun, Hui Su, Yijie Pan, Dietrich Klakow, Wenjie Li, and Xiaoyu Shen. Unveiling the key factors for distilling chain-of-thought reasoning. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics...
- [6] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [7] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [8] DeepSeek-AI. Deepseek-v4 technical report, 2026. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf. Technical report.
- [9] Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes. arXiv preprint arXiv:2603.25562, 2026.
- [10] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...
- [11] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In International Conference on Learning Representations (ICLR), 2024.
- [12] Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, ...
- [13] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [14] Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahismail, Haowen Dong, Anirudh Patel, and Bryan Roth. Liger Kernel: Efficient triton kernels for LLM training. arXiv preprint arXiv:2410.10989, 2024.
- [15] Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, and Taesup Kim. Stable on-policy distillation through adaptive target reformulation. arXiv preprint arXiv:2601.07155, 2026.
- [16] Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079, 2026.
- [17] Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation, 2016. URL https://arxiv.org/abs/1606.07947
- [18] Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, and Pashmina Cameron. Scaling reasoning efficiently via relaxed on-policy distillation. arXiv preprint arXiv:2603.11137, 2026.
- [19] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning, 2021. URL https://arxiv.org/abs/2110.06169
- [20] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning, 2020. URL https://arxiv.org/abs/2006.04779
- [21] KwaiKAT Team. Kat-coder-v2 technical report, 2026. URL https://arxiv.org/abs/2603.27703
- [22] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
- [23] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
- [24] Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016, 2026.
- [25] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
- [26] LLM-Core Xiaomi. MiMo-V2-Flash technical report. arXiv preprint arXiv:2601.02780, 2026.
- [27] Kevin Lu and Thinking Machines Lab. On-policy distillation. https://thinkingmachines.ai/blog/on-policy-distillation/, 2025.
- [28] Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han, Tianyi Zhang, and Vladimir Braverman. Demystifying OPD: Length inflation and stabilization strategies for large language models. arXiv preprint arXiv:2604.08527, 2026.
- [29] Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025. Notion Blog.
- [30] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
- [31] Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [32] Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
- [33] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [34] NovaSky Team. Sky-t1: Train your own o1 preview model within $450. https://novasky-ai.github.io/posts/sky-t1, 2025. Accessed: 2025-01-09.
- [35] Qwen Team. Qwen3.5-omni technical report, 2026. URL https://arxiv.org/abs/2604.15804
- [36] Tencent Hunyuan Team. Hy-mt1.5 technical report, 2025. URL https://arxiv.org/abs/2512.24092
- [37] Tencent Robotics X, HY Vision Team, Xumin Yu, Zuyan Liu, Ziyi Wang, He Zhang, Yongming Rao, Fangfu Liu, Yani Zhang, Ruowen Zhao, Oran Wang, Yves Liang, Haitao Lin, Minghui Wang, Yubo Dong, Kevin Cheng, Bolin Ni, Rui Huang, Han Hu, Zhengyou Zhang, Linus, and Shunyu Yao. Hy-embodied-0.5: Embodied foundation models for real-world agents,
- [38] URL https://arxiv.org/abs/2604.07430
- [39] Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, Renjie Pi, Grace Lam, Nayeon Lee, Alexander Bukharin, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nemotron-cascade 2: Post-training LLMs with cascade RL and multi-domain on-policy distillation. arXiv preprint arXiv:2603.1...
- [40] Jiaxin Zhang, Xiangyu Peng, Qinglin Chen, Qinyuan Ye, Caiming Xiong, and Chien-Sheng Wu. The illusion of certainty: Decoupling capability and calibration in on-policy distillation. arXiv preprint arXiv:2604.16830, 2026.
- [41] Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024, 2024.
- [42] Zhipu AI and Tsinghua University. GLM-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026.
Appendix A Related Work
Off-policy distillation for LLMs. Knowledge distillation was originally developed for model compression [13]. In modern LLM post-training, Off-policy distillation has increasingly become ...