arxiv: 2606.22600 · v2 · pith:B32IYG73new · submitted 2026-06-21 · 💻 cs.LG · cs.AI

On the Position Bias of On-Policy Distillation

Pith reviewed 2026-06-26 10:42 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords on-policy distillation · position bias · importance weighting · reinforcement learning · token-level supervision · KL divergence · sequence generation

The pith

Weighting tokens by accumulated distribution discrepancy improves on-policy distillation convergence and performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper observes that standard on-policy distillation (OPD) averages token-level KL losses uniformly, even though later tokens in long student rollouts drift farther from the teacher's distribution and receive degraded supervision. The authors call this position bias, and it shows up in their ablation: training on only the first 30 percent of tokens performs comparably to training on all tokens, while training on only the last 30 percent yields almost no learning. From a constrained-optimization view, the authors derive per-token importance weights that depend on the running discrepancy between the student and teacher distributions, so that earlier, more reliable tokens are upweighted and later tokens with larger accumulated deviation are downweighted. The resulting IW-OPD method converges faster and reaches higher final scores than uniform OPD in both same-size and cross-scale settings, with gains of up to 6.9 points on AIME-2025.
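The contrast between the two objectives can be written compactly. The notation below is an assumption made for illustration, since this page does not reproduce the paper's equations: $\ell_t$ is the token-level KL loss at position $t$, $T$ is the rollout length, $d_s \ge 0$ is a per-token student-teacher discrepancy, and $f$ is some non-increasing function. Normalizing by the weight sum is also an assumption.

```latex
\mathcal{L}_{\text{OPD}} = \frac{1}{T}\sum_{t=1}^{T} \ell_t,
\qquad
\mathcal{L}_{\text{IW-OPD}} = \frac{\sum_{t=1}^{T} w_t\,\ell_t}{\sum_{t=1}^{T} w_t},
\qquad
w_t = f(D_t),\quad D_t = \sum_{s \le t} d_s .
```

Setting $w_t = 1$ for every $t$ recovers the uniform average, so in this form the uniform objective is a special case of the weighted one.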

Core claim

In the standard KL objective of OPD, token-level losses are uniformly averaged, implying equal weights for all tokens. As student rollouts grow longer they deviate further from the teacher's distribution, leading to degraded supervision quality at later positions. IW-OPD assigns each token a weight based on the accumulated discrepancy between the student's and teacher's distributions, upweighting earlier tokens and downweighting later ones, which yields faster convergence and better final performance than standard OPD.
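A minimal Python sketch of the two reductions may make the difference concrete. It assumes the token-level loss is a reverse KL, KL(student || teacher), as the Figure 3 caption suggests; the function names, the toy distributions, and the placeholder weights are all hypothetical and are not taken from the paper.

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at one position, given two probability vectors."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )

def uniform_loss(token_losses):
    """Standard OPD reduction: every position gets the same weight."""
    return sum(token_losses) / len(token_losses)

def weighted_loss(token_losses, weights):
    """Weighted reduction: positions contribute in proportion to their weight."""
    return sum(w * l for w, l in zip(weights, token_losses)) / sum(weights)

# Toy rollout (synthetic numbers): the student drifts from the teacher over time.
student = [[0.5, 0.5], [0.6, 0.4], [0.9, 0.1]]
teacher = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
losses = [reverse_kl(s, t) for s, t in zip(student, teacher)]

print(uniform_loss(losses))
# Placeholder weights that favor early tokens (not the paper's weights).
print(weighted_loss(losses, [1.0, 0.5, 0.25]))
```

With weights that decay along the rollout, the late, high-divergence positions contribute less to the reduced loss than they do under the uniform average.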

What carries the argument

Importance-Weighted On-Policy Distillation (IW-OPD), which sets token weights according to the accumulated discrepancy between student and teacher output distributions.
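One way such weights could be computed is sketched below. The exponential form, the use of absolute log-probability differences on the sampled tokens, the inclusion of the current token in the running sum, and the scale `alpha` are all assumptions of this sketch; the page does not give the paper's exact formula.

```python
import math

def accumulated_discrepancy_weights(student_logps, teacher_logps, alpha=1.0):
    """Hypothetical rule: w_t = exp(-alpha * sum_{s<=t} |log pi_S(y_s) - log pi_T(y_s)|).

    student_logps / teacher_logps hold the log-probabilities that each model
    assigns to the tokens the student actually sampled.
    """
    weights, running = [], 0.0
    for s_lp, t_lp in zip(student_logps, teacher_logps):
        running += abs(s_lp - t_lp)          # discrepancy can only accumulate
        weights.append(math.exp(-alpha * running))
    return weights

# Synthetic example: agreement on the first token, growing disagreement after.
print(accumulated_discrepancy_weights([-0.1, -0.2, -0.3], [-0.1, -0.5, -1.3]))
```

Because the running sum never decreases, the weights are non-increasing in position, which matches the described behavior of upweighting earlier tokens relative to later ones.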

If this is right

  • IW-OPD converges significantly faster than standard OPD with better learning efficiency.
  • IW-OPD achieves higher final performance than standard OPD in both same-size and cross-scale teacher-student settings.
  • Standard OPD using only the first 30 percent of tokens performs comparably to using all tokens, while using only the last 30 percent barely learns.
  • The importance weighting improves performance by up to 6.9 points on AIME-2025.
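The prefix-versus-suffix ablation amounts to masking the token-level loss. The sketch below shows one plausible way to build such masks; the helper names and the rounding rule for the 30 percent budget are hypothetical, and rollouts are assumed to have at least one token.

```python
def budget_mask(length, fraction=0.3, part="prefix"):
    """Return a 0/1 mask selecting the first or last `fraction` of positions."""
    k = min(length, max(1, round(length * fraction)))
    if part == "prefix":
        return [1] * k + [0] * (length - k)
    if part == "suffix":
        return [0] * (length - k) + [1] * k
    raise ValueError("part must be 'prefix' or 'suffix'")

def masked_mean(token_losses, mask):
    """Average the loss over the selected positions only."""
    return sum(l * m for l, m in zip(token_losses, mask)) / sum(mask)

# Synthetic per-token losses that grow with position.
losses = [float(i) for i in range(1, 11)]
print(masked_mean(losses, budget_mask(10, 0.3, "prefix")))
print(masked_mean(losses, budget_mask(10, 0.3, "suffix")))
```

Both variants train on the same number of tokens per response, so any difference between them is attributable to where in the rollout those tokens sit.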

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same discrepancy-based weighting could be tested in other autoregressive generation settings where supervision quality is known to degrade with sequence length.
  • IW-OPD might allow shorter effective rollouts during distillation without loss of signal, reducing compute per update.
  • Similar accumulated-discrepancy reweighting could be applied inside other token-level RL objectives that currently assume uniform token importance.

Load-bearing premise

The assumption that weighting each token by its accumulated discrepancy with the teacher distribution correctly identifies and mitigates degraded supervision quality without introducing new optimization biases.

What would settle it

A controlled experiment on a short-horizon task where all tokens remain close to the teacher distribution throughout the rollout, in which IW-OPD shows no convergence or performance advantage over uniform OPD.
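A short argument shows why this outcome is the expected one, under assumptions this page does not confirm: each weight is a continuous function $f$ of the accumulated discrepancy $D_t$ alone, with $f(0) > 0$, and the weighted loss is normalized by the weight sum. Here $\ell_t$ denotes the token-level loss at position $t$ and $T$ the rollout length.

```latex
D_t \approx 0 \ \ \forall t
\;\Longrightarrow\;
w_t = f(D_t) \approx f(0)
\;\Longrightarrow\;
\frac{\sum_{t=1}^{T} w_t\,\ell_t}{\sum_{t=1}^{T} w_t}
\approx \frac{f(0)\sum_{t=1}^{T}\ell_t}{T\,f(0)}
= \frac{1}{T}\sum_{t=1}^{T}\ell_t .
```

In that regime the weighted objective collapses to the uniform one, so any remaining advantage for IW-OPD would have to come from something other than the position-bias correction.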

Figures

Figures reproduced from arXiv: 2606.22600 by Bo Chen, Sijie Zhu, Tiansheng Wen, Yan Xie, Yifei Wang.

Figure 1: Position Bias in OPD training. (a) With the same 30% token budget, training on the prefix part of each response matches or exceeds full-token Standard OPD, whereas training on the suffix part fails to learn effectively. Student: Qwen3-0.6B, Teacher: Qwen3-4B-Instruct-2507. (b) Teacher and student accuracy are measured by the probability of reaching a correct answer from a given student-generated prefix. St…
Figure 2: IW-OPD improves both sample efficiency and final performance. (a) AIME 2025 accuracy during training: IW-OPD converges faster and achieves better final performance than Standard OPD. (b) Final accuracy across student scales distilled from the same teacher; the IW-OPD advantage grows from +4.0% at 1.0× compression to +14.9% at 6.7×.
Figure 3: Position Bias phenomena in OPD. (a) The mean token-level KL decreases during OPD training but plateaus at a non-zero residual. (b) Token-level reverse KL before and after OPD training. (c) Sequence-level log-probabilities of student-sampled prefixes under the student and teacher. Student: Qwen3-0.6B; Teacher: Qwen3-4B-Instruct.
Figure 4: From signed prefix ratio to unsigned prefix discrepancy. (a) Directly using the ideal prefix ratio is sensitive to α. (b) Token-wise visualization shows the desired overall downward trend, but also local rebounds caused by signed cancellation. (c) Replacing signed accumulation with the unsigned weight gives a more stable weighting signal.
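The signed-cancellation effect mentioned in the Figure 4 caption can be illustrated with a small Python sketch. It assumes the signed quantity is a running sum of teacher-minus-student log-probabilities on the sampled tokens (the log of a prefix probability ratio) and that the unsigned variant sums absolute differences; both definitions and all numbers are hypothetical.

```python
def signed_accumulation(student_logps, teacher_logps):
    """Running sum of (log pi_T - log pi_S): positive and negative terms can cancel."""
    out, running = [], 0.0
    for s_lp, t_lp in zip(student_logps, teacher_logps):
        running += t_lp - s_lp
        out.append(running)
    return out

def unsigned_accumulation(student_logps, teacher_logps):
    """Running sum of |log pi_T - log pi_S|: monotone by construction."""
    out, running = [], 0.0
    for s_lp, t_lp in zip(student_logps, teacher_logps):
        running += abs(t_lp - s_lp)
        out.append(running)
    return out

# Synthetic log-probabilities whose differences alternate in sign.
student_logps = [-1.0, -0.5, -2.0, -0.7]
teacher_logps = [-1.5, -0.2, -2.6, -0.1]
print(signed_accumulation(student_logps, teacher_logps))
print(unsigned_accumulation(student_logps, teacher_logps))
```

The signed sum moves up and down as individual terms cancel, while the unsigned sum can only grow, which is one reading of why the unsigned version gives a steadier weighting signal.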
Original abstract

On-Policy Distillation (OPD) improves the learning efficiency of standard reinforcement learning through dense, token-level supervision from teachers. In the standard KL objective of OPD, token-level losses are uniformly averaged, implying equal weights for all tokens. However, we discover that not all tokens are created equal: as student rollouts grow longer, they deviate further from the teacher's distribution, leading to degraded supervision quality at later positions. As a result, OPD using only the first 30% of tokens can perform comparably to using all tokens, whereas OPD using only the last 30% of tokens barely learns anything. In this work, we provide a principled understanding of this issue through the lens of constrained optimization. Based on these insights, we derive Importance-Weighted On-Policy Distillation (IW-OPD), in which the weight assigned to each token depends on the accumulated discrepancy between the student's and teacher's distributions, naturally upweighting earlier tokens and downweighting later ones with larger deviations. We show that IW-OPD converges significantly faster than OPD, with better learning efficiency, and achieves better final performance than standard OPD in both same-size and cross-scale settings, improving performance up to 6.9 points on AIME-2025.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that standard On-Policy Distillation (OPD) suffers from position bias: as student rollouts lengthen, tokens deviate further from the teacher distribution, degrading supervision quality at later positions. This is supported by an ablation showing that OPD on only the first 30% of tokens performs comparably to using all tokens, while the last 30% barely learns. From a constrained-optimization perspective, the authors derive Importance-Weighted OPD (IW-OPD), which assigns each token a weight based on its accumulated student-teacher discrepancy (naturally upweighting early tokens). They report that IW-OPD converges faster with better learning efficiency than OPD and achieves superior final performance in both same-size and cross-scale settings, with gains up to 6.9 points on AIME-2025.

Significance. If the weighting mechanism isolates the claimed position bias (rather than serving as a proxy for token difficulty or rarity), the result would strengthen the case for constrained-optimization-derived corrections in on-policy distillation and could improve sample efficiency in RL fine-tuning of language models. The work supplies a parameter-free derivation and reports concrete, falsifiable performance deltas on a challenging benchmark.

major comments (2)
  1. [Abstract (first-30% vs last-30% token ablation)] The ablation demonstrates position-dependent degradation but does not test whether accumulated discrepancy correlates with intrinsic token properties (low teacher probability, complexity, or rarity). If it does, IW-OPD gains could arise from implicit curriculum or regularization effects rather than the intended correction for rollout-length bias, weakening the causal claim for the 6.9-point AIME-2025 improvement.
  2. [Empirical results (same-size and cross-scale settings)] The abstract reports performance gains without specifying experimental controls, statistical significance tests, exact baseline implementations, or confirmation that the weighting rule was derived without post-hoc hyperparameter fitting. These details are load-bearing for verifying that the reported efficiency and final-performance advantages are reproducible and attributable to the proposed mechanism.
minor comments (1)
  1. Clarify in the methods whether the accumulated-discrepancy weighting introduces any tunable scaling factors or is strictly determined by the constrained-optimization derivation.
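The correlation analysis requested in the first major comment could start from something as simple as the sketch below: a Pearson correlation between each token's accumulated discrepancy and an intrinsic property such as its teacher log-probability. The function is a generic textbook implementation, not the authors' analysis, and the choice of property is an assumption.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Usage idea (inputs are hypothetical per-token arrays collected from rollouts):
# r = pearson(accumulated_discrepancy, teacher_logprob)
```

A correlation near zero would support the position-bias reading, while a strong correlation would leave the curriculum or regularization explanation open. Since accumulated discrepancy also grows with position, a fuller analysis would likely need to control for token index.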

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address the two major comments point by point below, clarifying the intended scope of the ablation and committing to expanded experimental details in revision.

Point-by-point responses
  1. Referee: [Abstract (first-30% vs last-30% token ablation)] The ablation demonstrates position-dependent degradation but does not test whether accumulated discrepancy correlates with intrinsic token properties (low teacher probability, complexity, or rarity). If it does, IW-OPD gains could arise from implicit curriculum or regularization effects rather than the intended correction for rollout-length bias, weakening the causal claim for the 6.9-point AIME-2025 improvement.

    Authors: The ablation isolates position in the rollout as the variable: the first 30% of tokens are generated while the student remains close to the teacher, whereas the last 30% occur after substantial deviation has accumulated. This directly illustrates the length-dependent degradation described in the constrained-optimization derivation. The importance weights are computed from the running discrepancy itself, not from any token-level difficulty or rarity proxy. We agree that an explicit correlation analysis between discrepancy and intrinsic token properties would further rule out alternative explanations, and we will add this discussion plus supporting plots in the revision.

    Revision: partial

  2. Referee: [Empirical results (same-size and cross-scale settings)] The abstract reports performance gains without specifying experimental controls, statistical significance tests, exact baseline implementations, or confirmation that the weighting rule was derived without post-hoc hyperparameter fitting. These details are load-bearing for verifying that the reported efficiency and final-performance advantages are reproducible and attributable to the proposed mechanism.

    Authors: We agree these details are necessary for reproducibility. The weighting rule follows directly from the constrained-optimization derivation with no additional hyperparameters. In the revised manuscript we will expand the experimental section to include: (i) precise baseline implementations and hyperparameter matching, (ii) statistical significance tests across runs, (iii) a full description of rollout and evaluation protocols, and (iv) explicit confirmation that the importance-weight formula contains no post-hoc tuning.

    Revision: yes
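For item (ii), one simple option when only a handful of seeds are available is an exact paired sign-flip permutation test on per-seed score differences. The sketch below is a generic implementation offered as an illustration; it is not the test the authors commit to, and the function name is hypothetical.

```python
from itertools import product

def paired_permutation_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip test on paired per-seed score differences.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:   # small tolerance for float round-off
            hits += 1
    return hits / total

# Usage idea: pass matched-seed accuracies for IW-OPD and standard OPD.
```

With n paired runs the smallest attainable two-sided p-value is 2 / 2**n, so at least six matched seeds are needed before the test can fall below 0.05.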

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents a constrained-optimization analysis of position bias in OPD, then derives the IW-OPD weighting rule directly from that analysis (accumulated discrepancy as the natural weight). No equations reduce a claimed prediction or result to a fitted parameter or self-referential definition by construction. Performance improvements are reported as empirical outcomes on benchmarks, not as mathematical identities. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the abstract or described derivation. The method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides insufficient detail to enumerate free parameters, axioms, or invented entities; the weighting rule itself is treated as a domain assumption whose justification is not visible.

axioms (1)
  • domain assumption: Token weights can be set from the accumulated student-teacher distribution discrepancy without introducing new biases into the constrained optimization.
    Core modeling choice underlying IW-OPD; location implied in the derivation paragraph of the abstract.

pith-pipeline@v0.9.1-grok · 5764 in / 1237 out tokens · 27546 ms · 2026-06-26T10:42:33.804356+00:00 · methodology

