arxiv: 2606.23104 · v1 · pith:WWBSSZ5Vnew · submitted 2026-06-22 · 💻 cs.LG · cs.AI

ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation

Pith reviewed 2026-06-26 09:03 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords on-policy distillation · LLM reasoning · negative trajectories · probability ratio · reweighting · mathematical reasoning · code generation · self-distillation

The pith

ReNIO reweights negative student trajectories using student-to-teacher probability ratios to improve on-policy LLM distillation without final-answer labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that on-policy distillation benefits more from incorrect student-generated outputs than correct ones, because the former better preserve exploratory reasoning near the model's capability boundary. It introduces ReNIO to capture this signal by computing normalized sample weights from probability ratios at pivotal tokens, without ever checking whether the final answer is right or wrong. This preserves the prefix-training advantage of on-policy methods over full-rollout reinforcement learning while delivering measurable gains on mathematical reasoning and code generation benchmarks. A sympathetic reader would care because the method turns an observed asymmetry in training signals into a practical weighting rule that focuses learning on the most informative mistakes.

Core claim

ReNIO identifies pivotal tokens that steer reasoning traces toward errors by comparing student and teacher token probabilities. It then aggregates the resulting ratios into a normalized per-sample weight that automatically assigns higher importance to likely negative trajectories. The resulting weighted training improves both standard on-policy distillation (OPD) and on-policy self-distillation (OPSD) across mathematical reasoning and code generation tasks.

What carries the argument

The student-to-teacher probability ratio at selected tokens, aggregated into a normalized sample weight that upweights negative trajectories.
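The ratio-to-weight pipeline can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the source does not give the exact selection rule, aggregation, or normalization, so the threshold `tau`, the "keep tokens with log ratio above `tau`" rule, the sum aggregation, the softmax normalization, and every function name below are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def token_log_ratios(student_logps: Sequence[float],
                     teacher_logps: Sequence[float]) -> List[float]:
    """Per-token log ratio: log p_S(y_t | prefix) - log p_T(y_t | prefix)."""
    if len(student_logps) != len(teacher_logps):
        raise ValueError("student and teacher must score the same tokens")
    return [s - t for s, t in zip(student_logps, teacher_logps)]


def pivotal_score(log_ratios: Sequence[float], tau: float) -> float:
    """ASSUMPTION: a token is 'pivotal' when its log ratio exceeds tau,
    and the selected log ratios are aggregated by summation."""
    return sum(r for r in log_ratios if r > tau)


def normalized_weights(scores: Sequence[float],
                       temperature: float = 1.0) -> List[float]:
    """ASSUMPTION: softmax over the batch, rescaled so the mean weight is 1."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [len(scores) * e / total for e in exps]


def renio_like_weights(batch_student: Sequence[Sequence[float]],
                       batch_teacher: Sequence[Sequence[float]],
                       tau: float = 1.0,
                       temperature: float = 1.0) -> List[float]:
    """Hypothetical end-to-end helper: log ratios, selection, aggregation,
    then batch normalization. No final-answer label is used anywhere."""
    scores = [
        pivotal_score(token_log_ratios(s, t), tau)
        for s, t in zip(batch_student, batch_teacher)
    ]
    return normalized_weights(scores, temperature)


# Toy batch with made-up token log-probabilities for two student outputs.
student = [[-0.1, -0.2, -0.1], [-0.1, -0.3, -0.2]]
teacher = [[-0.2, -0.3, -0.1], [-0.1, -3.3, -0.2]]
weights = renio_like_weights(student, teacher, tau=1.0)
print(weights)
```

In this toy batch the second output contains one token where the student is far more confident than the teacher, so it receives the larger weight. Only token probabilities conditioned on the generated prefix are needed, which is the property the summary above relies on.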

If this is right

  • Incorrect-only training yields longer traces and stronger reflection than correct-only training.
  • The weighting works for both ordinary on-policy distillation and on-policy self-distillation.
  • Relative gains reach 8.90 percent on Qwen3-1.7B and 10.00 percent on R1-Distill-Qwen-7B for mathematical reasoning.
  • Prefix-conditioned probability ratios suffice, so the method keeps the training efficiency of on-policy distillation over full-rollout methods.
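
How a per-sample weight could enter the training objective is sketched below. The code assumes a sampled-token estimate of the reverse KL as the per-sample loss and a weighted average as the batch loss; both choices and all function names are illustrative assumptions, not details confirmed by the source.

```python
from typing import Sequence


def sample_reverse_kl(student_logps: Sequence[float],
                      teacher_logps: Sequence[float]) -> float:
    """Monte Carlo estimate of the mean per-token reverse KL, computed on
    tokens sampled from the student (hence on-policy)."""
    if not student_logps or len(student_logps) != len(teacher_logps):
        raise ValueError("need equally long, non-empty log-prob sequences")
    ratios = [s - t for s, t in zip(student_logps, teacher_logps)]
    return sum(ratios) / len(ratios)


def weighted_batch_loss(per_sample_losses: Sequence[float],
                        weights: Sequence[float]) -> float:
    """Weighted average of per-sample losses (ASSUMED form of the objective)."""
    if len(per_sample_losses) != len(weights):
        raise ValueError("one weight per sample is required")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(w * l for w, l in zip(weights, per_sample_losses))
    return weighted_sum / total_weight


# Made-up per-sample losses: uniform weights reproduce the plain mean,
# while non-uniform weights shift the objective toward the second sample.
losses = [0.2, 1.0]
print(weighted_batch_loss(losses, [1.0, 1.0]))
print(weighted_batch_loss(losses, [0.5, 1.5]))
```

With uniform weights the function reduces to the unweighted baseline, so the same code path could serve both the standard and the reweighted variant of OPD or OPSD.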

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ratio signal might surface useful negative examples in non-reasoning generation tasks where final-answer verification is expensive.
  • If the ratio correlates with specific error categories, it could be used to diagnose recurring failure modes without manual inspection.
  • Extending the weighting to multi-turn dialogues would test whether negative-trajectory emphasis remains helpful when context grows longer.

Load-bearing premise

The student-to-teacher probability ratio at early tokens can reliably flag trajectories that will end in error without ever observing the final answer.

What would settle it

Run the same on-policy training loop with and without the ReNIO weights on the same set of student rollouts and measure whether accuracy on held-out math benchmarks stays flat or drops.

Figures

Figures reproduced from arXiv: 2606.23104 by Chen Lin, Kedi Chen, Wei Zhang.

Figure 1: Motivation analysis for correct-only versus incorrect-only on-policy distillation. Left two panels: Avg@12 …
Figure 2: Overview of the proposed ReNIO. Phase I computes the student–teacher log ratio along the SGO; Phase II selects pivotal tokens with a fixed threshold; Phase III aggregates selected log ratios into a normalized sample-level weight.
… and useful role in OPD, but directly weighting trajectories by correctness would require complete answer-containing rollouts. ReNIO therefore looks for a prefix-computable signal …
Figure 3: Distribution of token-level log ratios ℓ_t on SGOs in the Qwen3 mathematical reasoning setting. Both OPD and OPSD exhibit long-tailed distributions: most tokens have small log ratios, while a small number of tokens exhibit large student–teacher disagreement.
The log ratio also has an optimization interpretation. Under a prefix-level reverse-KL objective KL(p_S^t ∥ p_T^t), the effective gradient weight of …
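The two quantities named in the Figure 3 excerpt can be written out explicitly. The conditioning notation (prompt $x$, student-generated prefix $y_{<t}$) is an assumption, since the excerpt does not show it; the second identity is simply the definition of KL divergence applied to the student and teacher next-token distributions at step $t$.

```latex
\ell_t = \log p_S(y_t \mid x, y_{<t}) - \log p_T(y_t \mid x, y_{<t}),
\qquad
\mathrm{KL}\!\left(p_S^t \,\|\, p_T^t\right)
  = \mathbb{E}_{y_t \sim p_S^t}\!\left[\ell_t\right]
```

Read this way, the log ratio of a token sampled from the student is a single-sample estimate of the prefix-level reverse KL, which is consistent with the caption's remark that a few tokens with large $\ell_t$ carry most of the student–teacher disagreement.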
Figure 4: Training-time comparison between GRPO and prefix-based on-policy distillation on Qwen3-1.7B mathematical reasoning.
4.2 Main Results
Tables 1 and 2 report the main results on mathematical reasoning and code generation. ReNIO consistently improves the corresponding OPD or OPSD baseline across all model blocks in terms of average performance: for OPSD, DS-R1-Qwen-7B obtains a 10.00% relative improvement on …
Figure 5: Relationship between ReNIO sample weight …
Figure 6: Per-benchmark hyperparameter ablation results for ReNIO under OPD on Qwen3-1.7B mathematical …
Figure 7: Average response length of correct versus incorrect on-policy trajectories during training on AIME24 …
Figure 8: Average count of Epistemic Markers of correct versus incorrect on-policy trajectories during training on …
Figure 9: Qualitative comparison on one mathematical counting problem. GRPO finds a valid family but misses …
Original abstract

On-policy distillation (OPD) improves LLM reasoning by training a student model on its own generated outputs, but standard OPD treats all student-generated outputs (SGOs) equally regardless of their informativeness. We observe a consistent asymmetry in controlled filtering experiments: in both OPD and on-policy self-distillation (OPSD), training only on incorrect SGOs outperforms training only on correct ones. Our further analysis suggests that models trained on correct-only SGOs tend to generate shorter reasoning traces and show weaker reflection behavior, while incorrect SGOs better preserve exploratory reasoning near the model's capability boundary. To exploit this signal without requiring full answer-containing rollouts, we introduce ReNIO, which Reweights Negative trajectory Importance for LLM On-policy distillation. By using the student-to-teacher probability ratio, ReNIO identifies pivotal tokens leading to wrong reasoning traces and aggregates their information into a normalized sample weight, inherently assigning larger weights to likely negative trajectories without observing the correctness of the final answer. Since ReNIO only uses prefix-conditioned token probabilities, it preserves OPD's prefix-training advantage over full-rollout reinforcement learning. Across both mathematical reasoning and code generation tasks, ReNIO improves both OPD and OPSD, with representative relative gains of up to 8.90% for Qwen3-1.7B and 10.00% for R1-Distill-Qwen-7B on mathematical reasoning benchmarks. Code repo: https://github.com/BDML-lab/ReNIO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes ReNIO, a reweighting scheme for on-policy distillation (OPD) and on-policy self-distillation (OPSD) of LLMs. It first reports an asymmetry in filtering experiments where training exclusively on incorrect student-generated outputs (SGOs) outperforms training on correct SGOs, attributing this to better preservation of exploratory reasoning. ReNIO then uses aggregated student-to-teacher token probability ratios over prefixes to assign higher normalized weights to likely negative trajectories without observing final-answer correctness, claiming this preserves OPD's prefix-training advantage. Experiments on mathematical reasoning and code generation report relative gains of up to 8.90% (Qwen3-1.7B) and 10.00% (R1-Distill-Qwen-7B) on math benchmarks.

Significance. If the central claims hold after validation, ReNIO would provide a label-free mechanism to emphasize informative negative trajectories in on-policy settings, potentially improving distillation efficiency while retaining the computational advantages of prefix-conditioned training over full-rollout RL. The public code repository is a clear strength for reproducibility.

major comments (3)
  1. [Method (ReNIO definition)] The core weighting construction (student-to-teacher log-probability ratio aggregated over the prefix and normalized into a sample weight) is presented as identifying 'pivotal tokens leading to wrong reasoning traces,' yet no correlation study, ablation against ground-truth outcome labels, or control for confounders (length bias, token rarity, teacher calibration) is reported; this assumption is load-bearing for the claim that ReNIO assigns larger weights to negative trajectories without final-answer observation.
  2. [Experiments] The reported relative gains (8.90% for Qwen3-1.7B, 10.00% for R1-Distill-Qwen-7B) are stated without dataset sizes, number of evaluation seeds, error bars, or statistical significance tests, preventing assessment of whether the improvements are reliable or could be explained by variance.
  3. [Analysis and Experiments] The asymmetry observation (incorrect SGOs outperform correct SGOs) is used to motivate the method, but the paper does not show that the probability-ratio weights recover this asymmetry or outperform a simple length- or entropy-based baseline that would also favor longer exploratory traces.
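
One way to run the correlation study requested in comment 1 is a rank correlation between the sample weights and ground-truth outcome labels. The sketch below implements Spearman correlation in plain Python with average ranks for ties. The choice of Spearman, the function names, and the toy numbers are assumptions for illustration, not results from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical weights and labels (1 = final answer incorrect).
toy_weights = [0.2, 1.9, 0.4, 1.5]
toy_incorrect = [0, 1, 0, 1]
print(spearman(toy_weights, toy_incorrect))
```

A clearly positive coefficient on held-out trajectories would support the premise that the weights track incorrectness. Running the same function with trajectory length in place of the weight gives a quick check on the length-bias confounder named in the comment.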
minor comments (2)
  1. [Method] Notation for the normalized sample weight (e.g., how the aggregation and normalization are exactly defined) should be presented with an explicit equation rather than prose description.
  2. [Experiments] The abstract and results section should clarify whether the same hyper-parameters and training budgets were used for all compared methods (standard OPD, OPSD, ReNIO variants).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below. Where the concerns identify gaps in validation or reporting, we will revise the manuscript accordingly to strengthen the presentation of ReNIO.

point-by-point responses
  1. Referee: [Method (ReNIO definition)] The core weighting construction (student-to-teacher log-probability ratio aggregated over the prefix and normalized into a sample weight) is presented as identifying 'pivotal tokens leading to wrong reasoning traces,' yet no correlation study, ablation against ground-truth outcome labels, or control for confounders (length bias, token rarity, teacher calibration) is reported; this assumption is load-bearing for the claim that ReNIO assigns larger weights to negative trajectories without final-answer observation.

    Authors: The probability-ratio construction is motivated by the fact that divergence from the teacher occurs precisely where the student begins to depart from higher-quality reasoning; because the teacher is the stronger model, such divergence is expected to mark the onset of error. We did not include post-hoc correlation or confounder controls in the original submission. We will add a dedicated analysis subsection that (i) reports Spearman correlation between ReNIO weights and ground-truth outcome labels on held-out trajectories, (ii) ablates length and entropy as alternative weightings, and (iii) checks sensitivity to teacher calibration. These additions will directly test the load-bearing assumption. revision: yes

  2. Referee: [Experiments] The reported relative gains (8.90% for Qwen3-1.7B, 10.00% for R1-Distill-Qwen-7B) are stated without dataset sizes, number of evaluation seeds, error bars, or statistical significance tests, preventing assessment of whether the improvements are reliable or could be explained by variance.

    Authors: We agree that these statistical details are necessary. All experiments used three independent random seeds; the underlying test sets are the standard GSM8K (1,319 examples) and MATH (5,000 examples) splits. We will revise the experimental section to report mean and standard deviation across seeds, include error bars in all tables and figures, and add paired t-test p-values for the reported relative gains. revision: yes

  3. Referee: [Analysis and Experiments] The asymmetry observation (incorrect SGOs outperform correct SGOs) is used to motivate the method, but the paper does not show that the probability-ratio weights recover this asymmetry or outperform a simple length- or entropy-based baseline that would also favor longer exploratory traces.

    Authors: The asymmetry is presented as empirical motivation rather than a direct empirical claim about ReNIO weights. To close this gap we will add two new figures: one showing the mean ReNIO weight for correct versus incorrect SGOs (thereby recovering the asymmetry), and a second comparing ReNIO against length-based and entropy-based reweighting baselines on the same OPD/OPSD setups. These comparisons will demonstrate that the probability-ratio weighting yields larger gains than the simpler alternatives. revision: yes
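
The seed-level reporting promised in response 2 (mean, standard deviation, relative gain, paired test) can be sketched with the Python standard library. All scores below are made-up placeholders, and the function names are hypothetical. The sketch stops at the paired t statistic, because converting it to a p-value needs a Student-t CDF that the standard library does not provide; `scipy.stats.ttest_rel` is one existing routine that returns both.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def relative_gain_percent(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (improved - baseline) / baseline


def paired_t_statistic(baseline: Sequence[float],
                       improved: Sequence[float]) -> float:
    """t statistic of the per-seed differences (improved - baseline)."""
    if len(baseline) != len(improved) or len(baseline) < 2:
        raise ValueError("need matched scores from at least two seeds")
    diffs = [b - a for a, b in zip(baseline, improved)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("t statistic is undefined for identical differences")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Placeholder accuracies for three seeds; NOT numbers from the paper.
baseline_scores = [40.0, 41.0, 42.0]
reweighted_scores = [43.0, 45.0, 44.0]

base_mean, base_std = mean_std(baseline_scores)
new_mean, new_std = mean_std(reweighted_scores)
print(base_mean, base_std, new_mean, new_std)
print(relative_gain_percent(base_mean, new_mean))
print(paired_t_statistic(baseline_scores, reweighted_scores))
```

Pairing by seed matters here: with only three seeds, the per-seed differences are usually far less noisy than the two sets of raw scores, so a paired statistic is the more sensitive choice for the comparison the referee asks about.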

Circularity Check

0 steps flagged

No significant circularity; method is a heuristic weighting defined from ratios with empirical gains

full rationale

The paper defines ReNIO's sample weights directly via aggregation of student-to-teacher token probability ratios over prefixes, then reports empirical gains on benchmarks. This construction does not reduce any claimed prediction or result to its inputs by definition, nor does it invoke self-citations, uniqueness theorems, or fitted parameters renamed as predictions. The observation that incorrect SGOs outperform correct ones is presented as motivation from filtering experiments, and the proxy is offered as an operational choice without requiring final-answer labels. No load-bearing step equates the output performance to the input definition. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described beyond the standard use of probability ratios in distillation.

pith-pipeline@v0.9.1-grok · 5805 in / 1025 out tokens · 41211 ms · 2026-06-26T09:03:21.738739+00:00 · methodology
