
arxiv: 2605.16826 · v1 · pith:LNW3CB62new · submitted 2026-05-16 · 💻 cs.LG · cs.AI · cs.CL

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

Pith reviewed 2026-05-19 20:45 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords LLM distillation · knowledge distillation · KL divergence · on-policy distillation · off-policy distillation · SFT · offline RL · math reasoning

The pith

Decoupling prefix source from token-level KL direction reveals four distinct LLM distillation objectives that unify SFT, DAgger, offline RL, and OPD.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard off-policy and on-policy distillation methods for large language models unintentionally tie together two separate design choices: the source of response prefixes and the direction of token-level KL divergence. Breaking down the overall sequence-level KL between teacher and student distributions demonstrates that forward KL naturally pairs with teacher prefixes while reverse KL pairs with student prefixes. Removing this coupling produces four separate objectives, each matching a known training paradigm through explicit gradient identities. Experiments on math reasoning tasks identify clear tradeoffs among these choices and motivate two practical refinements, KL mixing and an entropy-gated length curriculum, that improve accuracy and efficiency.

Core claim

Decomposing sequence-level KL divergence over autoregressive response distributions shows that off-policy distillation pairs teacher prefixes with token-level forward KL while on-policy distillation pairs student prefixes with token-level reverse KL. This implicit coupling is unnecessary; separating the two axes yields four valid objectives whose gradients recover SFT-style cross-entropy matching, DAgger-style on-policy correction, offline-RL-style dense reward signals, and on-policy distillation. Controlled experiments confirm that KL direction controls an accuracy-entropy tradeoff, prefix source controls a quality-compute tradeoff, and training horizon controls an accuracy-stability tradeoff.

What carries the argument

Decomposition of sequence-level KL divergence into independent prefix-source and token-level KL-direction axes over autoregressive teacher and student distributions.
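
The decomposition can be written out with the chain rule for KL divergence over autoregressive factorizations. The notation here is an assumption for illustration and may differ from the paper's: $p$ is the teacher, $\pi_\theta$ the student, $x$ the prompt, $y_{<t}$ a response prefix, and $T$ a fixed horizon.

```latex
\begin{align}
\mathrm{KL}\big(p(\cdot\mid x)\,\|\,\pi_\theta(\cdot\mid x)\big)
  &= \sum_{t=1}^{T} \mathbb{E}_{y_{<t}\sim p(\cdot\mid x)}
     \Big[\mathrm{KL}\big(p(\cdot\mid x,y_{<t})\,\|\,\pi_\theta(\cdot\mid x,y_{<t})\big)\Big], \\
\mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\|\,p(\cdot\mid x)\big)
  &= \sum_{t=1}^{T} \mathbb{E}_{y_{<t}\sim \pi_\theta(\cdot\mid x)}
     \Big[\mathrm{KL}\big(\pi_\theta(\cdot\mid x,y_{<t})\,\|\,p(\cdot\mid x,y_{<t})\big)\Big].
\end{align}

% Decoupled family: choose the prefix source \mu and the token-level
% direction D independently (hypothetical notation).
\begin{equation}
\mathcal{J}_{\mu,D}(\theta)
  = \sum_{t=1}^{T} \mathbb{E}_{y_{<t}\sim \mu(\cdot\mid x)}\big[D_t(y_{<t})\big],
  \qquad \mu \in \{p,\ \pi_\theta\},\quad
  D_t \in \big\{\mathrm{KL}(p_t\,\|\,\pi_{\theta,t}),\ \mathrm{KL}(\pi_{\theta,t}\,\|\,p_t)\big\}.
\end{equation}
```

Read this way, the first two lines are the matched pairings (teacher prefixes with forward KL, student prefixes with reverse KL), and the remaining two members of the family swap the prefix source while keeping the token-level direction.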

If this is right

  • Forward KL with teacher prefixes recovers SFT-style cross-entropy matching to teacher soft targets.
  • Reverse KL with student prefixes recovers an RL-style policy-gradient update using the teacher-student log-ratio as a dense reward.
  • KL direction induces a measurable accuracy-entropy tradeoff in reasoning tasks.
  • Prefix source induces a measurable quality-compute tradeoff during data generation.
  • Training length induces an accuracy-stability tradeoff that can be mitigated by curriculum design.
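
A toy numerical check can make the two axes concrete. The sketch below is not the paper's code: it assumes a tiny tabular autoregressive model (vocabulary size `V`, horizon `T`, and the helper names are all hypothetical) and evaluates the four prefix-source and KL-direction combinations exactly by enumeration.

```python
import itertools
import numpy as np

V, T = 3, 2  # toy vocabulary size and sequence length (illustrative assumptions)


def random_model(rng):
    """Tabular autoregressive model: maps each prefix tuple to a next-token distribution."""
    return {
        prefix: rng.dirichlet(np.ones(V))
        for t in range(T)
        for prefix in itertools.product(range(V), repeat=t)
    }


def seq_prob(model, seq):
    """Probability of a (possibly partial) token sequence under the model."""
    p = 1.0
    for t, tok in enumerate(seq):
        p *= model[seq[:t]][tok]
    return p


def kl(p, q):
    """KL(p || q) between two categorical distributions with full support."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


def objective(teacher, student, prefix_source, direction):
    """Sum over positions of the expected token-level KL, with prefixes drawn from prefix_source."""
    total = 0.0
    for t in range(T):
        for prefix in itertools.product(range(V), repeat=t):
            weight = seq_prob(prefix_source, prefix)
            if direction == "forward":
                total += weight * kl(teacher[prefix], student[prefix])
            else:
                total += weight * kl(student[prefix], teacher[prefix])
    return total


def sequence_kl(p_model, q_model):
    """Exact sequence-level KL(p || q) by enumerating all length-T sequences."""
    total = 0.0
    for seq in itertools.product(range(V), repeat=T):
        p, q = seq_prob(p_model, seq), seq_prob(q_model, seq)
        total += p * (np.log(p) - np.log(q))
    return float(total)


rng = np.random.default_rng(0)
teacher, student = random_model(rng), random_model(rng)
sources = {"teacher": teacher, "student": student}
four = {
    (src, d): objective(teacher, student, sources[src], d)
    for src in sources
    for d in ("forward", "reverse")
}
for key, value in four.items():
    print(key, round(value, 4))
print("sequence forward KL:", round(sequence_kl(teacher, student), 4))
print("sequence reverse KL:", round(sequence_kl(student, teacher), 4))
```

In this setup the matched combinations reproduce the sequence-level forward and reverse KL, while the two decoupled combinations are well-defined quantities that in general equal neither.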

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition could be applied to other sequence-level divergences to generate additional distillation variants beyond the four objectives.
  • KL mixing weights could be scheduled dynamically based on current response entropy rather than fixed ratios.
  • The entropy-gated curriculum may generalize to non-reasoning generation tasks where length inflation harms latency.
  • Initializations from the four objectives could be tested for their effect on sample efficiency in subsequent reinforcement learning stages.
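
The summary does not specify how KL mixing is formulated. As a hedged illustration only, the sketch below assumes a convex combination of token-level forward and reverse KL with a hypothetical weight `alpha`; the function names and the NumPy implementation are assumptions, not the paper's method.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def mixed_token_kl(teacher_logits, student_logits, alpha):
    """alpha * forward KL + (1 - alpha) * reverse KL, averaged over positions.

    Assumed form of KL mixing: alpha=1 is pure forward KL (teacher || student),
    alpha=0 is pure reverse KL (student || teacher).
    """
    lt = log_softmax(np.asarray(teacher_logits, dtype=float))
    ls = log_softmax(np.asarray(student_logits, dtype=float))
    forward = (np.exp(lt) * (lt - ls)).sum(axis=-1)
    reverse = (np.exp(ls) * (ls - lt)).sum(axis=-1)
    return float((alpha * forward + (1.0 - alpha) * reverse).mean())


# Usage with made-up logits for two positions and a three-token vocabulary.
teacher_logits = np.array([[2.0, 0.0, -1.0], [0.5, 0.5, 0.0]])
student_logits = np.array([[1.0, 0.5, -0.5], [0.0, 1.0, 0.0]])
for alpha in (0.0, 0.5, 1.0):
    print(alpha, mixed_token_kl(teacher_logits, student_logits, alpha))
```

Under this assumed form, a dynamic schedule would simply make `alpha` a function of a running entropy estimate instead of a constant.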

Load-bearing premise

The decomposition of sequence-level KL divergence into independent prefix-source and token-level KL-direction axes is valid and produces four distinct, usable objectives without hidden inconsistencies or additional constraints.

What would settle it

A controlled run in which one of the four decoupled objectives produces training dynamics or final performance that cannot be recovered from the claimed gradient identities for forward or reverse KL.

Figures

Figures reproduced from arXiv: 2605.16826 by Anhao Zhao, Haoran Xin, Junlong Tong, Wenjie Li, Xiaoyu Shen, Yingqi Fan.

Figure 1: Overview of our decoupled distillation framework. Conventional objectives couple prefix source with KL direction; decoupling these axes yields four objectives that correspond to classical training regimes. Off-policy distillation aligns the student’s token-level output distribution with the teacher’s along teacher-generated prefixes. In practice, this is often implemented as supervised fine-tuning (SFT) on…
Figure 2: Distillation training dynamics with Qwen3-4B as teacher and Qwen3-0.6B as student.
Figure 3: RL training dynamics after Qwen3-4B-teacher distillation warmup. Top/bottom rows: …
Figure 4: Distillation training dynamics with Qwen3-8B as teacher and Qwen3-0.6B as student.
Figure 5: RL dynamics after Qwen3-8B-teacher warmup. Top/bottom rows: 128/4096-token …
Figure 7: Training dynamics of KL mixing on MATH500. Columns, left to …

Original abstract

Knowledge distillation is central to LLM post-training, yet its design space remains poorly understood, especially alongside reinforcement learning (RL). We show that the prevailing paradigms, off-policy distillation and on-policy distillation (OPD), implicitly couple two orthogonal choices: prefix source and token-level KL direction. This follows from decomposing sequence-level KL over autoregressive response distributions: forward KL pairs teacher prefixes with token-level forward KL, and reverse KL pairs student prefixes with token-level reverse KL. We argue this coupling is not intrinsic: decoupling the two axes yields four valid objectives. We establish gradient-level identities showing forward KL gives SFT-style cross-entropy matching with teacher soft targets, whereas reverse KL gives an RL-style policy-gradient objective with a dense teacher-student log-ratio reward, connecting them to off-policy SFT, DAgger-style on-policy SFT, offline-RL-style distillation, and OPD. We conduct an extensive controlled study on math reasoning, evaluating the four objectives both as standalone methods and as initializations for subsequent RL. The results reveal three tradeoffs: KL direction induces an accuracy-entropy tradeoff, prefix source a quality-compute tradeoff, and training length an accuracy-stability tradeoff. Motivated by these findings, we propose KL mixing and an entropy-gated length curriculum. KL mixing shows long-sequence distillation requires substantial forward-KL weight to prevent entropy collapse and length inflation without sacrificing accuracy. The entropy-gated length curriculum improves Avg@k and Pass@k by 3.6 and up to 5.8 points, and cuts average response length by roughly 3x versus fixed long-horizon training. Our results provide a framework and practical methods for designing reasoning distillation objectives that balance accuracy, diversity, compute, and RL behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that sequence-level KL divergence over autoregressive distributions decomposes into orthogonal choices of prefix source (teacher vs. student) and token-level KL direction (forward vs. reverse). This decomposition reveals that standard off-policy distillation and on-policy distillation (OPD) implicitly couple these axes, while decoupling them produces four distinct objectives. Gradient-level identities are derived showing that forward KL recovers SFT-style cross-entropy with soft targets and reverse KL recovers an RL-style policy gradient with dense log-ratio rewards, thereby unifying SFT, DAgger, offline RL-style distillation, and OPD. A controlled empirical study on math reasoning evaluates the four objectives standalone and as RL initializations, identifies accuracy-entropy, quality-compute, and accuracy-stability tradeoffs, and proposes KL mixing plus an entropy-gated length curriculum that improves Avg@k and Pass@k while reducing response length.

Significance. If the gradient identities and validity of the four decoupled objectives hold without hidden inconsistencies, the work supplies a principled taxonomy that connects distillation to RL and supplies practical levers (KL mixing, entropy-gated curriculum) for balancing accuracy, diversity, and compute in reasoning models. The controlled study and proposed methods are concrete strengths that could guide future post-training design.

major comments (2)
  1. [§3] §3 (decomposition and gradient identities): The matched cases (teacher prefixes + forward token KL; student prefixes + reverse token KL) correctly recover sequence-level forward and reverse KL. However, the decoupled case of teacher prefixes with reverse token KL yields E[prefix ~ teacher] KL(student token || teacher token), which equals neither sequence-level KL nor its reverse. The manuscript must explicitly derive whether this objective's gradients are unbiased estimators of any target or whether importance-sampling corrections are required; without this, the claim that all four combinations are 'valid objectives' rests on an unverified assumption.
  2. [Experimental section] Experimental section and appendix (controlled study): The abstract and results claim three tradeoffs and improvements from KL mixing and the entropy-gated curriculum (3.6–5.8 points on Avg@k/Pass@k, ~3× length reduction). Full details on data exclusion rules, exact hyperparameter grids, and statistical significance tests are needed to verify that the observed differences are attributable to the decoupled objectives rather than confounding factors in the math-reasoning setup.
minor comments (2)
  1. [§3] Notation for the four objectives should be introduced with a single summary table or diagram early in §3 to make the prefix/KL-direction axes immediately legible.
  2. [§3] The abstract states 'we establish gradient-level identities'; the main text should include the precise gradient expressions (with expectations and baselines) rather than only high-level descriptions.
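
For reference, the token-level gradient identities at issue are standard and can be stated for a fixed prefix $h = (x, y_{<t})$. The notation is assumed for illustration ($p$ teacher, $\pi_\theta$ student), and the expressions cover only the token-level terms, not the paper's full derivation.

```latex
\begin{align}
\nabla_\theta\,\mathrm{KL}\big(p(\cdot\mid h)\,\|\,\pi_\theta(\cdot\mid h)\big)
  &= -\,\mathbb{E}_{y\sim p(\cdot\mid h)}\big[\nabla_\theta \log \pi_\theta(y\mid h)\big], \\
\nabla_\theta\,\mathrm{KL}\big(\pi_\theta(\cdot\mid h)\,\|\,p(\cdot\mid h)\big)
  &= -\,\mathbb{E}_{y\sim \pi_\theta(\cdot\mid h)}\big[r(y\mid h)\,\nabla_\theta \log \pi_\theta(y\mid h)\big],
  \qquad r(y\mid h) = \log\frac{p(y\mid h)}{\pi_\theta(y\mid h)}.
\end{align}
```

The first line is the gradient of a cross-entropy loss with teacher soft targets; the second is a score-function (policy-gradient) form with the log-ratio as reward, using $\mathbb{E}_{y\sim\pi_\theta}[\nabla_\theta \log \pi_\theta(y\mid h)] = 0$. When $h$ is drawn from the teacher, the prefix distribution does not depend on $\theta$, so these token-level terms are the whole gradient. When $h$ is drawn from the student, the prefix distribution itself depends on $\theta$, and whether that dependence is differentiated is a separate design choice that these identities do not settle.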

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments help strengthen the theoretical clarity around the decomposition and improve the reproducibility of the experimental results. We address each major comment point by point below.

  1. Referee: [§3] §3 (decomposition and gradient identities): The matched cases (teacher prefixes + forward token KL; student prefixes + reverse token KL) correctly recover sequence-level forward and reverse KL. However, the decoupled case of teacher prefixes with reverse token KL yields E[prefix ~ teacher] KL(student token || teacher token), which equals neither sequence-level KL nor its reverse. The manuscript must explicitly derive whether this objective's gradients are unbiased estimators of any target or whether importance-sampling corrections are required; without this, the claim that all four combinations are 'valid objectives' rests on an unverified assumption.

    Authors: We acknowledge that the teacher-prefix + reverse-token-KL combination does not recover the full sequence-level reverse KL. This objective is nevertheless well-defined and distinct, corresponding to an off-policy reverse-KL variant that uses teacher-generated prefixes while applying token-level reverse KL. In the revised manuscript we will add an explicit gradient derivation in §3 for all four combinations. For the teacher-prefix + reverse case the gradient reduces to an unbiased estimator of the expected dense log-ratio reward under the fixed teacher prefix distribution; no additional importance-sampling correction is required beyond the standard on-policy token sampling. This derivation confirms that each of the four objectives possesses a valid, optimizable gradient and supports their status as distinct targets. revision: yes

  2. Referee: [Experimental section] Experimental section and appendix (controlled study): The abstract and results claim three tradeoffs and improvements from KL mixing and the entropy-gated curriculum (3.6–5.8 points on Avg@k/Pass@k, ~3× length reduction). Full details on data exclusion rules, exact hyperparameter grids, and statistical significance tests are needed to verify that the observed differences are attributable to the decoupled objectives rather than confounding factors in the math-reasoning setup.

    Authors: We agree that expanded experimental details are required for full reproducibility and to rule out confounds. In the revised version we will augment both the main experimental section and the appendix with: (i) the precise data exclusion rules applied to the math-reasoning benchmarks, (ii) the complete hyperparameter grid (learning rates, batch sizes, KL coefficients, entropy thresholds, and curriculum schedules) together with the final selected values, and (iii) statistical significance results including standard errors over multiple random seeds and paired tests (e.g., bootstrap confidence intervals or Wilcoxon signed-rank) on the reported Avg@k and Pass@k gains. These additions will allow readers to attribute the observed accuracy, length, and stability improvements directly to the proposed objectives and methods. revision: yes
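
As one possible shape for the promised significance analysis, a paired bootstrap confidence interval over per-problem score differences can be sketched as follows. This is an illustrative assumption, not the authors' procedure; the function name, defaults, and the synthetic scores are hypothetical placeholders and are not results from the paper.

```python
import numpy as np


def paired_bootstrap_ci(baseline, treatment, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean paired difference (treatment - baseline)."""
    baseline = np.asarray(baseline, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    if baseline.shape != treatment.shape or baseline.ndim != 1:
        raise ValueError("baseline and treatment must be 1-D arrays of equal length")
    diffs = treatment - baseline
    rng = np.random.default_rng(seed)
    # Resample problem indices with replacement, keeping each pair together.
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    means = diffs[idx].mean(axis=1)
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(means, [tail, 1.0 - tail])
    return float(diffs.mean()), float(lo), float(hi)


# Synthetic per-problem scores, used purely to exercise the function.
rng = np.random.default_rng(1)
base_scores = rng.uniform(0.0, 1.0, size=200)
new_scores = np.clip(base_scores + rng.normal(0.03, 0.1, size=200), 0.0, 1.0)
mean_diff, ci_lo, ci_hi = paired_bootstrap_ci(base_scores, new_scores)
print(f"mean difference {mean_diff:.3f}, 95% CI [{ci_lo:.3f}, {ci_hi:.3f}]")
```

Resampling whole pairs keeps each problem's baseline and treatment scores together, which is what makes the interval a paired comparison.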

Circularity Check

0 steps flagged

KL decomposition and gradient identities follow from standard autoregressive properties


The paper derives its four objectives by applying the chain rule to sequence-level KL divergence over autoregressive token distributions, yielding the known pairings (teacher prefixes with forward token KL; student prefixes with reverse token KL) plus the two decoupled combinations. These identities are standard consequences of the definition of KL for product distributions and do not rely on fitted parameters, self-referential definitions, or load-bearing self-citations. The connections to SFT-style cross-entropy and RL-style policy gradients are obtained by direct expansion of the resulting objectives, which remain mathematically independent of the target results. The empirical study on math reasoning evaluates the objectives as standalone methods and RL initializations without circular reduction to inputs. No step reduces by construction to its own outputs or to unverified prior work by the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of decomposing sequence-level KL over autoregressive distributions and on the resulting gradient identities; these are treated as standard domain assumptions rather than new postulates.

axioms (1)
  • domain assumption: Sequence-level KL over autoregressive response distributions decomposes into a choice of prefix source paired with either forward or reverse token-level KL.
    This decomposition is invoked to identify the implicit coupling in prevailing methods and to derive the four objectives.

pith-pipeline@v0.9.0 · 5876 in / 1180 out tokens · 54698 ms · 2026-05-19T20:45:16.306817+00:00 · methodology



Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 19 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations (ICLR), 2024

  2. [2]

    American mathematics competitions, 2023

    AMC2023. American mathematics competitions, 2023. URL https://artofproblemsolving.com/wiki/index.php/AMC_Problems_and_Solutions

  3. [3]

    Scheduled sampling for sequence prediction with recurrent neural networks

    Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2015

  4. [4]

    Retaining by doing: The role of on-policy data in mitigating forgetting, 2025

    Howard Chen, Noam Razin, Karthik Narasimhan, and Danqi Chen. Retaining by doing: The role of on-policy data in mitigating forgetting, 2025. URL https://arxiv.org/abs/2510.18874

  5. [5]

    Unveiling the key factors for distilling chain-of-thought reasoning

    Xinghao Chen, Zhijing Sun, Guo Wenjin, Miaoran Zhang, Yanjun Chen, Yirong Sun, Hui Su, Yijie Pan, Dietrich Klakow, Wenjie Li, and Xiaoyu Shen. Unveiling the key factors for distilling chain-of-thought reasoning. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics...

  6. [6]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  8. [8]

    Deepseek-v4 technical report, 2026

    DeepSeek-AI. Deepseek-v4 technical report, 2026. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf. Technical report

  9. [9]

    Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

    Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes. arXiv preprint arXiv:2603.25562, 2026

  10. [10]

    He, B., Yin, L., Zhen, H.-L., Liu, S., Wu, H., Zhang, X., Yuan, M., and Ma, C

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

  11. [11]

    MiniLLM: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In International Conference on Learning Representations (ICLR), 2024

  12. [12]

    OpenThoughts: Data Recipes for Reasoning Models

    Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, ...

  13. [13]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  14. [14]

    Liger Kernel: Efficient Triton Kernels for

    Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahismail, Haowen Dong, Anirudh Patel, and Bryan Roth. Liger Kernel: Efficient triton kernels for LLM training. arXiv preprint arXiv:2410.10989, 2024

  15. [15]

    Stable On-Policy Distillation through Adaptive Target Reformulation

    Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, and Taesup Kim. Stable on-policy distillation through adaptive target reformulation. arXiv preprint arXiv:2601.07155, 2026

  16. [16]

    Entropy-aware on-policy distillation of language models

    Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079, 2026

  17. [17]

    Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation, 2016. URL https://arxiv.org/abs/1606.07947

  18. [18]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E

    Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, and Pashmina Cameron. Scaling reasoning efficiently via relaxed on-policy distillation. arXiv preprint arXiv:2603.11137, 2026

  19. [19]

    Offline Reinforcement Learning with Implicit Q-Learning

    Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning, 2021. URL https://arxiv.org/abs/2110.06169

  20. [20]

    Conservative Q-Learning for Offline Reinforcement Learning, August 2020

    Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning, 2020. URL https://arxiv.org/abs/2006.04779

  21. [21]

    Kat-coder-v2 technical report, 2026

    KwaiKAT Team. Kat-coder-v2 technical report, 2026. URL https://arxiv.org/abs/2603.27703

  22. [22]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  23. [23]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020

  24. [24]

    Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016, 2026

  25. [25]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023

  26. [26]

    MiMo-V2-Flash Technical Report

    LLM-Core Xiaomi. MiMo-V2-Flash technical report. arXiv preprint arXiv:2601.02780, 2026

  27. [27]

    On-policy distillation

    Kevin Lu and Thinking Machines Lab. On-policy distillation. https://thinkingmachines.ai/blog/on-policy-distillation/, 2025

  28. [28]

    Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han, Tianyi Zhang, and Vladimir Braverman. Demystifying OPD: Length inflation and stabilization strategies for large language models. arXiv preprint arXiv:2604.08527, 2026

  29. [29]

    Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025

    Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025. Notion Blog

  30. [30]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

  31. [31]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  32. [32]

    Gordon, and Drew Bagnell

    Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011

  33. [33]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  34. [34]

    Sky-t1: Train your own o1 preview model within $450

    NovaSky Team. Sky-t1: Train your own o1 preview model within $450. https://novasky-ai.github.io/posts/sky-t1, 2025. Accessed: 2025-01-09

  35. [35]

    Qwen3.5-omni technical report, 2026

    Qwen Team. Qwen3.5-omni technical report, 2026. URL https://arxiv.org/abs/2604.15804

  36. [36]

    Hy-mt1.5 technical report, 2025

    Tencent Hunyuan Team. Hy-mt1.5 technical report, 2025. URL https://arxiv.org/abs/2512.24092

  37. [37]

    Hy-embodied-0.5: Embodied foundation models for real-world agents,

    Tencent Robotics X, HY Vision Team, Xumin Yu, Zuyan Liu, Ziyi Wang, He Zhang, Yongming Rao, Fangfu Liu, Yani Zhang, Ruowen Zhao, Oran Wang, Yves Liang, Haitao Lin, Minghui Wang, Yubo Dong, Kevin Cheng, Bolin Ni, Rui Huang, Han Hu, Zhengyou Zhang, Linus, and Shunyu Yao. Hy-embodied-0.5: Embodied foundation models for real-world agents,

  38. [38]

    URL https://arxiv.org/abs/2604.07430

  39. [39]

    Nemotron-cascade 2: Post-training LLMs with cascade RL and multi-domain on-policy distillation. arXiv preprint arXiv:2603.19220, 2026

    Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, Renjie Pi, Grace Lam, Nayeon Lee, Alexander Bukharin, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nemotron-cascade 2: Post-training LLMs with cascade RL and multi-domain on-policy distillation. arXiv preprint arXiv:2603.1...

  40. [40]

    The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation

    Jiaxin Zhang, Xiangyu Peng, Qinglin Chen, Qinyuan Ye, Caiming Xiong, and Chien-Sheng Wu. The illusion of certainty: Decoupling capability and calibration in on-policy distillation. arXiv preprint arXiv:2604.16830, 2026

  41. [41]

    American invitational mathematics examination (aime) 2024, 2024

    Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024, 2024

  42. [42]

    GLM-5: from Vibe Coding to Agentic Engineering

    Zhipu AI and Tsinghua University. GLM-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026