pith. machine review for the scientific record.

arxiv: 2605.11182 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: no theorem link

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 02:13 UTC · model grok-4.3

classification 💻 cs.AI
keywords on-policy distillation · on-policy self-distillation · large language models · distribution mismatch · privileged information · reverse KL · mathematical reasoning · model alignment

The pith

On-policy distillation in LLMs fails through distribution mismatch, biased gradients, and privileged-information aggregation, but targeted fixes restore its effectiveness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies on-policy distillation and self-distillation, methods that supervise large language models on trajectories sampled from the model itself. It shows that these approaches do not work reliably across tasks; their mixed results trace to three concrete failure mechanisms: a teacher-student distribution mismatch that arises when the teacher is conditioned on student-generated prefixes, unstable optimization from biased gradient estimates, and the student's inability to retain instance-specific privileged information during self-distillation. The authors demonstrate that simple changes to the loss, teacher adaptation, and student initialization address these problems in their tested settings. The findings matter because they give practical rules for deciding when and how to apply distillation without external data.
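To make the training loop concrete, here is a minimal sketch of one OPD step in PyTorch, assuming Hugging Face-style causal language models with a `generate` method and `.logits` outputs; `opd_step` and its signature are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def opd_step(student, teacher, prompt_ids, optimizer, max_new_tokens=256):
    """One on-policy distillation step (hypothetical helper): sample from
    the student's own policy, then supervise every sampled token with the
    frozen teacher's next-token distribution via reverse KL."""
    # 1. On-policy rollout: trajectories come from the student itself.
    with torch.no_grad():
        seq = student.generate(prompt_ids, max_new_tokens=max_new_tokens)

    # 2. Score the same sequences under both models.
    #    (Prompt masking and the usual next-token shift are omitted here.)
    student_logits = student(seq).logits            # [B, T, V], has grad
    with torch.no_grad():
        teacher_logits = teacher(seq).logits        # teacher stays frozen

    # 3. Dense token-level reverse KL(student || teacher).
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    loss = (log_q.exp() * (log_q - log_p)).sum(-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```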

Core claim

On-policy distillation on mathematical reasoning is highly sensitive to teacher choice and loss formulation, whereas on-policy self-distillation fails due to the test-time absence of instance-specific privileged information. The three failure mechanisms are distribution mismatch between teacher and student caused by conditioning on student-generated prefixes, optimization instability from biased TopK reverse-KL gradients, and an OPSD-specific limitation where the student learns a PI-free policy that aggregates PI-conditioned teachers. In contrast, OPSD succeeds when PI represents a shared latent rule such as a system prompt. Stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students mitigate these failures.

What carries the argument

The three failure mechanisms in on-policy distillation—distribution mismatch from student-generated prefixes, biased TopK reverse-KL gradients, and PI-free policy aggregation in OPSD—together with the mitigations of stop-gradient TopK, RLVR teachers, and SFT stabilization.
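A hedged sketch of the gradient fix named above: restrict the reverse KL to the teacher's top-k tokens and put a stop-gradient on the student's slice normalizer, so the parameter-dependent renormalization of the truncated distribution no longer contributes a gradient term. The function name and the exact stop-gradient placement are assumptions; the paper's objective may differ in detail.

```python
import torch
import torch.nn.functional as F

def stable_topk_reverse_kl(student_logits, teacher_logits, k=20):
    """Reverse KL on the teacher's top-k vocabulary slice, with a
    stop-gradient on the student's slice normalizer (a sketch of a
    'stable Top-K loss', not the paper's exact objective)."""
    idx = teacher_logits.topk(k, dim=-1).indices                  # [B, T, k]
    s = student_logits.gather(-1, idx)
    t = teacher_logits.gather(-1, idx)

    # Renormalize both models over the top-k slice. Detaching the student's
    # normalizer removes the gradient contribution of the slice's
    # renormalization constant -- one reading of the bias behind the
    # collapse under unnormalized Top-20 reverse KL (Figure 4).
    log_q = s - torch.logsumexp(s, dim=-1, keepdim=True).detach()
    log_p = F.log_softmax(t, dim=-1)

    q = log_q.exp()                     # numerically still sums to one
    return (q * (log_q - log_p)).sum(-1).mean()
```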

If this is right

  • OPD performance varies sharply with the choice of teacher and the exact loss formulation in reasoning tasks.
  • OPSD succeeds for shared latent rules like system prompts or alignment preferences but cannot capture instance-specific PI.
  • Stop-gradient applied to TopK objectives removes the source of optimization instability.
  • RLVR-adapted teachers and SFT-stabilized students prevent the identified failure modes from appearing.
  • The methods internalize shared information reliably but require additional handling when PI varies per instance (see the sketch after this list).
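A minimal formal sketch of that per-instance caveat, with notation assumed rather than taken from the paper: because the student cannot condition on the privileged information I, on-policy self-distillation drives it toward a PI-marginalized aggregate of the PI-conditioned teachers.

```latex
% Assumed notation: x prompt, y response, I privileged information with
% conditional distribution rho(I | x); p_T is the PI-conditioned teacher.
\[
q_\theta(y \mid x) \;\longrightarrow\;
\mathbb{E}_{I \sim \rho(\cdot \mid x)}\!\left[\, p_T(y \mid x, I) \,\right].
\]
% If I is one shared latent rule (a fixed system prompt or alignment
% preference), the mixture collapses to the teacher itself and OPSD can
% succeed; if I is instance-specific and absent at test time, the mixture
% averages over incompatible behaviors and degrades.
```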

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mismatch and gradient issues may appear in other on-policy training loops that mix teacher and student outputs.
  • Combining the fixes with existing post-training pipelines could reduce reliance on large supervised datasets for model improvement.
  • Repeating the experiments at larger model scales would test whether the three mechanisms remain dominant or new interactions emerge.
  • Training pipelines could adopt SFT stabilization as a default first step before attempting on-policy distillation steps.

Load-bearing premise

The tested settings of mathematical reasoning trajectories and system-prompt or alignment privileged information are representative enough that the three failure mechanisms and fixes will apply to other LLM tasks, model scales, and data distributions.

What would settle it

Apply the proposed fixes to a new task requiring instance-specific privileged information, such as personalized multi-turn dialogue, and measure whether performance still degrades relative to a teacher baseline or improves as predicted.

Figures

Figures reproduced from arXiv: 2605.11182 by Ge Liu, Hongyu Lu, Siqi Zhu, Weiye Shi, Xuyan Ye.

Figure 1
Figure 1. Overview. We map the OP(S)D design space (left, top) and its task-dependent success/failure behavior (left, bottom), identify three failure mechanisms—prefix-distorted teacher state, biased Top-K reverse-KL, and PI-marginalized OPSD policy (middle), and propose practical fixes: stable Top-K losses, SFT stabilization, and RLVR-adapted teachers (right). view at source ↗
Figure 2
Figure 2. (Left) On-Policy (Self-)Distillation. In OPSD, the teacher is constructed from the student itself and privileged information (PI) is necessary. In OPD, the teacher is a stronger model and PI is optional. (Right) p: teacher distribution, q: student distribution. Reverse KL is mode-seeking, whereas forward KL is mode-covering. view at source ↗
Figure 3
Figure 3. Qwen3-1.7B, trained on OpenThoughts. OPSD fails to improve the student. view at source ↗
Figure 4
Figure 4. Collapse under unnormalized Top-20 reverse KL. The model first becomes verbose, then degenerates into repetitive “maybe” outputs as response length reaches the limit and evaluation accuracy drops. Token statistics show that repetitive tokens dominate as the repeat ratio approaches one. view at source ↗
Figure 5
Figure 5. Training reward (left) and evaluation score (right) curves for OPSD, GRPO, and PPO on … view at source ↗
Figure 6
Figure 6. Comparison of GRPO and OPSD on Qwen3-8B (thinking mode) trained with DAPO-Math … view at source ↗
Figure 7
Figure 7. Train and evaluate Qwen3-1.7B (nothink) on Wildguardmix using their original train and … view at source ↗
Figure 9
Figure 9. Effectiveness of OPSD depends on the structure of privileged information I. view at source ↗
Figure 10
Figure 10. PI does not improve OPD on math reasoning with a stronger teacher. Using a Qwen3-8B teacher and a Qwen3-1.7B student on OpenThoughts, both final-answer PI and full-response PI underperform vanilla OPD. PI-conditioned OPD leads to higher KL loss. view at source ↗
Figure 11
Figure 11. Teacher: Qwen3-1.7B-GRPO (nothink), Student: Qwen3-1.7B (nothink), DAPO, TopK=5. view at source ↗
Figure 12
Figure 12. Whether to put the distillation loss in the policy gradient? Sampled-token KL in the policy gradient … view at source ↗
Figure 13
Figure 13. Dataset: OpenThoughts. Left: Qwen3-8B and Qwen3-1.7B-GRPO have similar math reasoning performance. Middle: In OPD, Qwen3-1.7B-GRPO is a more effective teacher. Right: Qwen3-1.7B-GRPO’s Top20 vocabulary distribution is more aligned with the Qwen3-1.7B student. view at source ↗
Figure 14
Figure 14. Qwen3-4B teacher, Qwen3-1.7B-Base student, OpenThoughts. view at source ↗
Figure 15
Figure 15. Teacher: Qwen3-1.7B-GRPO (nothink), Student: Qwen3-1.7B (nothink), training data: … view at source ↗
Figure 16
Figure 16. Comparison of teacher signal on responses generated by different student models. view at source ↗
Figure 17
Figure 17. Comparison of token-level KL supervision distributions for correct and incorrect student … view at source ↗
Figure 18
Figure 18. We show a token-level heatmap of ∆logprob on the last 128 tokens. The experiment is based on OpenThoughts [22]; we show an example question. PI strengthens supervision for the same teacher, yet the sampled-token supervision distribution depends more on teacher capability (as shown in the figure, three experiments using a Qwen3-8B teacher show a similar distribution, while two experiments using a Qwen3-1.7B teacher show an… view at source ↗
Figure 19
Figure 19. General reasoning results of OPD training. The experiment uses the Science subset of … view at source ↗
Figure 20
Figure 20. Comparison of teacher signals on general reasoning trajectories. view at source ↗
Figure 21
Figure 21. Next-token log probs (left), truncated ratio (middle) and evaluation results (right) curves … view at source ↗
Figure 22
Figure 22. An example of thinking mode hacking during OPSD. The student is trained with thinking mode disabled, while the teacher is queried with reasoning enabled. During training, the student gradually learns to emit explicit thinking-mode control tokens in its response, even though such tokens are not intended to appear at inference time. view at source ↗
Figure 23
Figure 23. ∆logprob vs. token entropy. Teacher: Qwen3-8B w/ PI. view at source ↗
read the original abstract

On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language models, offering dense token-level supervision on trajectories sampled from the model's own policy. However, existing results on their effectiveness remain mixed: while OP(S)D has shown promise in system prompt and knowledge internalization, recent studies also report instability and degradation. In this work, we present a comprehensive empirical study of when OPD and OPSD work, when they fail, and why. We find that OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation, whereas OPSD fails in our tested settings due to test-time absence of instance-specific privileged information (PI). In contrast, OPSD is effective when PI represents a shared latent rule, such as a system prompt or alignment preference. We identify three failure mechanisms: (1) distribution mismatch between teacher and student caused by conditioning on student-generated prefixes, (2) optimization instability from biased TopK reverse-KL gradients, and (3) an OPSD-specific limitation where the student learns a PI-free policy that aggregates PI-conditioned teachers, which is insufficient when PI is instance-specific. We further show that stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students mitigate these failures.
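The abstract's reliance on reverse KL matters because reverse KL is mode-seeking while forward KL is mode-covering (Figure 2). A toy numerical illustration with made-up probabilities:

```python
import torch

# Toy three-token vocabulary: teacher p is bimodal, student q is unimodal.
p = torch.tensor([0.49, 0.49, 0.02])   # teacher: two strong modes
q = torch.tensor([0.96, 0.02, 0.02])   # student: collapsed onto one mode

forward_kl = (p * (p / q).log()).sum()   # KL(p || q): mode-covering
reverse_kl = (q * (q / p).log()).sum()   # KL(q || p): mode-seeking

# Forward KL heavily penalizes q for ignoring the teacher's second mode;
# reverse KL is small as long as q's mass sits where p is already high.
print(f"KL(p||q) = {forward_kl:.3f}, KL(q||p) = {reverse_kl:.3f}")
# -> KL(p||q) ≈ 1.238, KL(q||p) ≈ 0.582
```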

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a comprehensive empirical study of on-policy distillation (OPD) and on-policy self-distillation (OPSD) for LLMs. It identifies three failure mechanisms—distribution mismatch from student-generated prefixes, optimization instability from biased TopK reverse-KL gradients, and OPSD-specific aggregation of PI-conditioned teachers into a PI-free policy when PI is instance-specific—and shows that these explain mixed prior results. The work focuses on mathematical reasoning trajectories and shared-latent PI (e.g., system prompts or alignment preferences), proposing and validating fixes via stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students, with ablations on teacher choice, loss formulation, and PI type.

Significance. If the mechanisms and fixes hold, this provides mechanistic insight into why OPD/OPSD results have been inconsistent, offering practical guidance for LLM post-training. The structured ablations and identification of specific pitfalls represent a useful contribution to understanding dense token-level supervision on self-generated trajectories. However, the restriction to math reasoning and shared PI settings means the work's broader impact depends on whether these failure modes generalize.

major comments (2)
  1. [Abstract and experimental results] The central claim that the three identified failure mechanisms explain mixed prior results on OPD/OPSD rests on the tested regimes (mathematical reasoning trajectories and system-prompt/alignment PI) being representative. No experiments are reported on other domains (e.g., general language modeling, code generation, or larger-scale models), leaving open the possibility that different token distributions or optimization landscapes produce distinct dominant failure modes.
  2. [Abstract] The assertion that OPSD fails due to learning a PI-free policy that aggregates PI-conditioned teachers is load-bearing for the OPSD-specific limitation. However, the paper provides no quantitative measure (e.g., policy divergence or per-instance performance breakdown) of this aggregation effect, making it difficult to confirm that this is the primary cause rather than a symptom of other factors like data scale or conditioning.
minor comments (2)
  1. [Abstract] The abstract introduces OPD, OPSD, and PI without initial expansions or a brief definition, which reduces accessibility for readers outside the immediate subfield.
  2. [Abstract] The description of the fixes (stop-gradient TopK, RLVR teachers, SFT stabilization) would benefit from a short summary table comparing their effects across the ablations to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review of our manuscript. We address each major comment point by point below, indicating the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract and experimental results] The central claim that the three identified failure mechanisms explain mixed prior results on OPD/OPSD rests on the tested regimes (mathematical reasoning trajectories and system-prompt/alignment PI) being representative. No experiments are reported on other domains (e.g., general language modeling, code generation, or larger-scale models), leaving open the possibility that different token distributions or optimization landscapes produce distinct dominant failure modes.

    Authors: We agree that the representativeness of our tested regimes is central to the broader claims. Mathematical reasoning was selected as the primary domain because it permits clean isolation of instance-specific versus shared privileged information, enabling precise diagnosis of the three failure mechanisms. We acknowledge that the absence of experiments on domains such as code generation or general language modeling leaves open the possibility of different dominant failure modes. In the revision we will expand the Limitations and Future Work section to explicitly discuss this scope limitation, qualify the central claim accordingly, and outline why the identified mechanisms (prefix mismatch, biased TopK gradients, and PI aggregation) are expected to be relevant beyond math while calling for targeted follow-up studies. revision: partial

  2. Referee: [Abstract] The assertion that OPSD fails due to learning a PI-free policy that aggregates PI-conditioned teachers is load-bearing for the OPSD-specific limitation. However, the paper provides no quantitative measure (e.g., policy divergence or per-instance performance breakdown) of this aggregation effect, making it difficult to confirm that this is the primary cause rather than a symptom of other factors like data scale or conditioning.

    Authors: We thank the referee for this observation. The current manuscript supports the aggregation claim through comparative performance results and qualitative policy analysis in Section 4.3, but we agree that direct quantitative evidence would strengthen the argument. In the revised version we will add explicit metrics, including estimates of policy divergence (e.g., token-level KL between the student policy and each PI-conditioned teacher) and per-instance performance breakdowns that contrast shared-PI versus instance-specific-PI settings. These additions will help isolate the aggregation effect from confounding factors such as data scale. revision: yes
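The divergence estimates promised in the rebuttal above could look like the following hypothetical diagnostic (names and shapes assumed): token-level KL between the PI-free student and each PI-conditioned teacher, reported per instance.

```python
import torch
import torch.nn.functional as F

def per_teacher_token_kl(student_logits, teacher_logits_by_pi):
    """Hypothetical metric sketch: mean token-level KL(student || teacher_i)
    for each PI-conditioned teacher i on one instance's response tokens.
    Assumed shapes: [T, V] logits per model."""
    log_q = F.log_softmax(student_logits, dim=-1)
    q = log_q.exp()
    kls = []
    for t_logits in teacher_logits_by_pi:       # one entry per PI value
        log_p = F.log_softmax(t_logits, dim=-1)
        kls.append((q * (log_q - log_p)).sum(-1).mean().item())
    # Aggregation signature: a student that has marginalized over PI keeps a
    # high, roughly uniform KL to every individual teacher; under a shared
    # rule, all per-teacher KLs can be driven down together.
    return kls
```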

Circularity Check

0 steps flagged

No circularity: purely empirical identification of failure modes

full rationale

The paper presents a comprehensive empirical study of on-policy distillation and self-distillation, identifying three failure mechanisms and mitigation strategies through direct experiments on mathematical reasoning trajectories and system-prompt/alignment settings. No derivation chain, first-principles prediction, or mathematical reduction is claimed; all central claims rest on observed experimental comparisons (e.g., sensitivity to teacher choice, loss formulation, and presence/absence of instance-specific PI). No self-citations, fitted parameters renamed as predictions, or ansatzes are load-bearing. The analysis is self-contained against the reported benchmarks and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is a purely empirical study; no new mathematical objects, fitted constants, or unverified theoretical entities are introduced. All claims rest on experimental observations under standard LLM training assumptions.

axioms (1)
  • domain assumption: Standard assumptions in supervised fine-tuning, reinforcement learning with verifiable rewards, and KL-regularized distillation hold for the loss formulations and sampling procedures used.
    The study applies these common loss functions and sampling procedures without deriving or validating them from first principles.

pith-pipeline@v0.9.0 · 5542 in / 1400 out tokens · 50894 ms · 2026-05-13T02:13:55.930380+00:00 · methodology


Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes, 2024

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes, 2024

  2. [2]

    Minillm: On-policy distillation of large language models, 2026

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: On-policy distillation of large language models, 2026

  3. [3]

    Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

  4. [4]

    Mimo-v2-flash technical report, 2026

    Core Team, Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, Gang Xie, Hailin Zhang, Hanglong Lv, Hanyu Li, Heyu Chen, Hongshen Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jianyu Wei, Jiebao Xiao, Jinhao Dong, Jun Shi, Junhao Hu, Kainan Bao, Kang Zhou, Lei Li, Liang Zhao, Linghao Zhang,...

  5. [5]

    On-policy distillation. Thinking Machines Lab: Connectionism, 2025

    Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. https://thinkingmachines.ai/blog/on-policy-distillation

  6. [6]

    Self-distillation enables continual learning, 2026

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning, 2026

  7. [7]

    Self-distilled reasoner: On-policy self-distillation for large language models, 2026

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models, 2026

  8. [8]

    On-policy self-distillation for reasoning compression, 2026

    Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. On-policy self-distillation for reasoning compression, 2026

  9. [9]

    On-policy context distillation for language models, 2026

    Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models, 2026

  10. [10]

    Revisiting on-policy distillation: Empirical failure modes and simple fixes, 2026

    Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes, 2026

  11. [11]

    Why does self-distillation (sometimes) degrade the reasoning capability of llms?, 2026

    Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of llms?, 2026

  12. [12]

    Learning by distilling context, 2022

    Charlie Snell, Dan Klein, and Ruiqi Zhong. Learning by distilling context, 2022

  13. [13]

    Expanding the capabilities of reinforcement learning via text feedback, 2026

    Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J. Andrew Bagnell, Aarti Singh, and Andrea Zanette. Expanding the capabilities of reinforcement learning via text feedback, 2026

  14. [14]

    Pope: Learning to reason on hard problems via privileged on-policy exploration, 2026

    Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, and Aviral Kumar. Pope: Learning to reason on hard problems via privileged on-policy exploration, 2026

  15. [15]

    Greedification operators for policy optimization: Investigating forward and reverse KL divergences, 2022

    Alan Chan, Hugo Silva, Sungsu Lim, Tadashi Kozuno, A. Rupam Mahmood, and Martha White. Greedification operators for policy optimization: Investigating forward and reverse kl divergences, 2022

  16. [16]

    Characterbench: Benchmarking character customization of large language models, 2024

    Jinfeng Zhou, Yongkang Huang, Bosi Wen, Guanqun Bi, Yuxuan Chen, Pei Ke, Zhuang Chen, Xiyao Xiao, Libiao Peng, Kuntian Tang, Rongsheng Zhang, Le Zhang, Tangjie Lv, Zhipeng Hu, Hongning Wang, and Minlie Huang. Characterbench: Benchmarking character customization of large language models, 2024

  17. [17]

    Emotionally numb or empathetic? Evaluating how LLMs feel using EmotionBench, 2024

    Jen-tse Huang, Man Ho Lam, Eric John Li, Shujie Ren, Wenxuan Wang, Wenxiang Jiao, Zhaopeng Tu, and Michael R. Lyu. Emotionally numb or empathetic? evaluating how llms feel using emotionbench, 2024

  18. [18]

    Crisp: Compressed reasoning via iterative self-policy distillation, 2026

    Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. Crisp: Compressed reasoning via iterative self-policy distillation, 2026

  19. [19]

    Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms, 2024

    Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms, 2024

  20. [20]

    Dapo: An open-source llm reinforcement learning system at scale, 2025

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  21. [21]

    Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe, 2026

    Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe, 2026

  22. [22]

    OpenThoughts: Data recipes for reasoning models

    Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, ...

  23. [23]

    Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han, Tianyi Zhang, and Vladimir Braverman. Demystifying opd: Length inflation and stabilization strategies for large language models.arXiv preprint arXiv:2604.08527, 2026

  24. [24]

    Reinforcement learning via self-distillation, 2026

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation, 2026

  25. [25]

    Openclaw-rl: Train any agent simply by talking, 2026

    Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. Openclaw-rl: Train any agent simply by talking, 2026

  26. [26]

    Proximal policy optimization algorithms, 2017

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

  27. [27]

    Persuasion for good: Towards a personalized persuasive dialogue system for social good

    Xuewei Wang, Weiyan Shi, Richard Kim, Yoojung Oh, Sijia Yang, Jingwen Zhang, and Zhou Yu. Persuasion for good: Towards a personalized persuasive dialogue system for social good. arXiv preprint arXiv:1906.06725, 2019

  28. [28]

    Entropy-aware on-policy distillation of language models, 2026

    Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models, 2026