pith. machine review for the scientific record.

arxiv: 2604.00626 · v2 · submitted 2026-04-01 · 💻 cs.LG · cs.CL

Recognition: 2 Lean theorem links

A Survey of On-Policy Distillation for Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 22:59 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords: on-policy distillation · large language models · exposure bias · knowledge distillation · f-divergence · RLHF · imitation learning

The pith

On-policy distillation reframes LLM knowledge transfer as correction on student-generated sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Student models distilled from static teacher text suffer from exposure bias: they are trained on teacher-written prefixes but must generate their own at inference, and the resulting compounding errors scale roughly quadratically with sequence length. This survey unifies the growing literature on on-policy distillation by treating it as the minimization of an f-divergence between the teacher and the distribution induced by the student's own trajectories. It organizes existing work along three axes: the choice of objective, the origin of the feedback signal, and practical stabilization strategies. The result links these methods to KL-regularized reinforcement learning and highlights conditions for success versus common failure modes.
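To make the loop concrete, here is a minimal sketch (not from the paper) of one on-policy distillation update under a reverse-KL objective: the student samples its own continuations, the teacher scores exactly those tokens, and the per-token divergence becomes the loss. `student`, `teacher`, `prompt_ids`, and `optimizer` are hypothetical stand-ins for any causal-LM pair with a shared tokenizer and an HF-style `generate` API; padding and variable-length prompts are ignored for brevity.

```python
# Minimal sketch of one on-policy distillation step under a reverse-KL
# objective. Hypothetical, not the paper's reference implementation.
import torch
import torch.nn.functional as F

def opd_step(student, teacher, prompt_ids, optimizer, max_new_tokens=128):
    # 1. Sample trajectories from the *student* (on-policy rollouts).
    with torch.no_grad():
        rollouts = student.generate(prompt_ids, max_new_tokens=max_new_tokens,
                                    do_sample=True)

    # 2. The teacher gives feedback only on what the student actually produced.
    with torch.no_grad():
        teacher_logits = teacher(rollouts).logits

    student_logits = student(rollouts).logits

    # 3. Per-token reverse KL: KL(student || teacher) at every position.
    #    (Other f-divergences slot in here; reverse KL is one common choice.)
    s_logp = F.log_softmax(student_logits[:, :-1], dim=-1)
    t_logp = F.log_softmax(teacher_logits[:, :-1], dim=-1)
    reverse_kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1)   # [batch, T-1]

    # Mask out prompt positions so only generated tokens carry loss.
    gen_mask = torch.zeros_like(reverse_kl)
    gen_mask[:, prompt_ids.shape[1] - 1:] = 1.0
    loss = (reverse_kl * gen_mask).sum() / gen_mask.sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```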

Core claim

On-policy distillation is formalized as f-divergence minimization over trajectories sampled from the student model, where the teacher supplies feedback only on what the student actually produces rather than on oracle sequences. This shifts the training distribution closer to the inference distribution and reduces the exposure bias term from quadratic to linear in sequence length.
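Read as a worked equation, the claim amounts to a trajectory-level objective of roughly the following form; this is an editorial sketch whose notation may differ from the paper's, and the quadratic-to-linear comparison is stated only at the level of the familiar behavior-cloning versus on-policy imitation bounds.

```latex
% Editorial sketch of the OPD objective; notation may differ from the paper.
% \pi_\theta: student, p_T: teacher, x: prompt, y_{<t}: student-generated
% prefix, D_f: an f-divergence over next-token distributions.
\mathcal{L}_{\mathrm{OPD}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D}}\,
    \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t=1}^{|y|}
      D_f\!\left( p_T(\cdot \mid x, y_{<t}) \,\middle\|\,
                  \pi_\theta(\cdot \mid x, y_{<t}) \right)
    \right]

% Exposure-bias comparison in the spirit of behavior cloning vs. on-policy
% imitation: with per-step error \varepsilon over horizon T,
%   off-policy imitation:  \text{excess cost} = O(\varepsilon T^2)
%   on-policy feedback:    \text{excess cost} = O(\varepsilon T)
```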

What carries the argument

f-divergence minimization over student-sampled trajectories, with design axes for optimization target, signal source, and stabilization.

If this is right

  • Training loops that sample from the student policy before applying teacher corrections will show improved robustness on long reasoning tasks.
  • Many techniques from RLHF can be reinterpreted as on-policy distillation when viewed through the f-divergence lens.
  • Stabilization methods become essential to prevent divergence in the iterative student-teacher loop.
  • Distillation performance will depend on how closely the chosen divergence matches the desired correction behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Distillation scaling laws may emerge once methods are standardized under this framework, similar to pretraining laws.
  • Uncertainty estimation could be integrated to provide more targeted teacher feedback on uncertain student outputs (a toy sketch follows this list).
  • Agentic setups with multi-turn interactions stand to gain the most from on-policy approaches due to higher compounding risk.
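The second bullet above can be made concrete with a toy illustration (an editorial extension, not anything the paper proposes): weight the per-token distillation loss by the student's predictive entropy, so teacher feedback concentrates where the student is least sure. All names and the normalization scheme are hypothetical.

```python
# Toy sketch of entropy-weighted on-policy distillation loss (editorial
# extension, not the paper's method). Inputs are raw next-token logits.
import torch
import torch.nn.functional as F

def entropy_weighted_kl(student_logits, teacher_logits, eps=1e-8):
    s_logp = F.log_softmax(student_logits, dim=-1)   # [batch, T, vocab]
    t_logp = F.log_softmax(teacher_logits, dim=-1)

    # Per-token reverse KL between student and teacher distributions.
    per_token_kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1)   # [batch, T]

    # Student entropy as a crude uncertainty proxy, scaled to [0, 1].
    entropy = -(s_logp.exp() * s_logp).sum(-1)                  # [batch, T]
    weights = entropy / (entropy.max() + eps)

    # Uncertain tokens receive more teacher correction.
    return (weights * per_token_kl).sum() / (weights.sum() + eps)
```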

Load-bearing premise

The scattered methods across distillation, RLHF, and imitation learning can be comprehensively organized using the three proposed design axes.

What would settle it

Discovery of a high-performing on-policy distillation variant whose key design choices fall outside the categories of optimization objective, feedback source, and stabilization technique.

Figures

Figures reproduced from arXiv: 2604.00626 by Mao Zheng, Mingyang Song.

Figure 1: Forward KL vs. reverse KL divergence for fitting a student distribution.
Figure 2: Taxonomy of on-policy distillation for large language models.
read the original abstract

As Large Language Models (LLMs) continue to grow in both capability and cost, transferring frontier capabilities into smaller, deployable students has become a central engineering problem, and knowledge distillation remains the dominant technique for this transfer. The prevailing recipe in industrial pipelines, static imitation of teacher-generated text, carries a structural weakness that grows more severe as tasks become longer and more reasoning-intensive. Because the student is trained on flawless teacher prefixes but must generate its own at inference, small errors tend to accumulate into trajectories it has rarely been trained to recover from, and the resulting exposure bias has been shown to scale roughly with the square of sequence length. On-Policy Distillation (OPD) reorganizes the training loop around this observation by having the teacher provide feedback on what the student actually produces, with the goal of reducing the compounding term toward linear and reframing distillation as an iterative correction process rather than single-pass imitation. The resulting literature has expanded along divergence design, reward-guided optimization, and self-play, yet contributions remain scattered across the knowledge distillation, RLHF, and imitation learning communities without a unified treatment. This survey provides such a treatment. We formalize OPD as $f$-divergence minimization over student-sampled trajectories, organize the field along three design axes (what to optimize, where the signal comes from, and how to stabilize training in practice), and consolidate success conditions, recurring failure modes, and the connection between OPD and KL-constrained RL. We close with open problems that emerge from this synthesis, including distillation scaling laws, uncertainty-aware feedback, agentic distillation, and the growing overlap between knowledge distillation and RL.
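One way to unpack the abstract's stated connection between OPD and KL-constrained RL, for the reverse-KL special case (an editorial gloss; the paper's own derivation may differ): minimizing the trajectory-level reverse KL to the teacher is the same as entropy-regularized RL in which the reward is the teacher log-likelihood of the student's own rollout.

```latex
% Editorial sketch of the OPD <-> KL-regularized RL link (reverse-KL case);
% the paper's own statement may use different notation or regularizers.
\min_\theta \;
  \mathrm{KL}\bigl(\pi_\theta(\cdot \mid x) \,\|\, p_T(\cdot \mid x)\bigr)
= \min_\theta \;
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
  \bigl[\log \pi_\theta(y \mid x) - \log p_T(y \mid x)\bigr]
= \max_\theta \;
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
  \bigl[\underbrace{\log p_T(y \mid x)}_{\text{reward}}\bigr]
  + \mathcal{H}\bigl(\pi_\theta(\cdot \mid x)\bigr)
```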

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper surveys on-policy distillation (OPD) for large language models. It formalizes OPD as f-divergence minimization over student-sampled trajectories, organizes the literature along three design axes (what to optimize, where the signal comes from, and how to stabilize training), consolidates success conditions and failure modes, links OPD to KL-constrained RL, and identifies open problems including distillation scaling laws, uncertainty-aware feedback, and agentic distillation.

Significance. If the synthesis is comprehensive, the survey would supply a useful unifying lens for a rapidly growing area that spans knowledge distillation, RLHF, and imitation learning. The explicit design axes and formalization could help practitioners select methods and researchers identify gaps, particularly as exposure bias becomes more acute for long-horizon reasoning tasks.

major comments (2)
  1. [Formalization] Formalization section: the claim that OPD reframes distillation as iterative correction and reduces compounding error from quadratic to linear scaling is load-bearing for the motivation; the manuscript should cite the specific empirical studies that quantify this scaling and state the precise conditions under which the reduction holds.
  2. [Design axes] Design axes section: the three axes (what to optimize, signal source, stabilization) must demonstrably partition the cited literature without significant omissions; the manuscript should include an explicit mapping table or checklist showing how representative works from each community fall into the taxonomy.
minor comments (2)
  1. [Introduction] The abstract states that contributions remain scattered across communities; the introduction should quantify this scattering (e.g., number of papers per community) to justify the need for unification.
  2. [Open problems] Open problems section: the discussion of distillation scaling laws would benefit from a short paragraph contrasting them with existing scaling laws for pre-training and RLHF.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. The comments help strengthen the formalization and taxonomy. We address each point below and will incorporate the suggested changes.

read point-by-point responses
  1. Referee: [Formalization] Formalization section: the claim that OPD reframes distillation as iterative correction and reduces compounding error from quadratic to linear scaling is load-bearing for the motivation; the manuscript should cite the specific empirical studies that quantify this scaling and state the precise conditions under which the reduction holds.

    Authors: We agree the scaling claim is central and requires explicit grounding. In revision we will expand the formalization section to cite the specific empirical studies that quantify quadratic exposure-bias growth (e.g., the seq2seq and LLM analyses referenced in the current text) and will state the precise conditions under which on-policy feedback reduces the compounding term to linear scaling: namely, when corrective signals are supplied on student-sampled trajectories at each decoding step rather than on teacher prefixes alone. revision: yes

  2. Referee: [Design axes] Design axes section: the three axes (what to optimize, signal source, stabilization) must demonstrably partition the cited literature without significant omissions; the manuscript should include an explicit mapping table or checklist showing how representative works from each community fall into the taxonomy.

    Authors: We agree that an explicit mapping table will make the taxonomy more verifiable. We will add a table (or checklist) in the design-axes section that classifies representative works from the knowledge-distillation, RLHF, and imitation-learning communities according to the three axes, confirming that the partition covers the cited literature without major omissions. revision: yes

Circularity Check

0 steps flagged

No significant circularity in survey formalization

full rationale

This manuscript is a literature survey that synthesizes prior work on on-policy distillation without presenting original derivations, fitted parameters, or first-principles predictions. The central formalization of OPD as f-divergence minimization over student-sampled trajectories is explicitly described as a unifying lens drawn from existing contributions across knowledge distillation, RLHF, and imitation learning; it does not reduce to a self-referential definition or a fitted input renamed as a prediction. The three design axes are presented as an organizational framework rather than a uniqueness theorem or ansatz smuggled via self-citation. No load-bearing step relies on self-citation chains or renames known empirical patterns as novel results. The work is therefore self-contained against external benchmarks and receives the default non-circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey with no original mathematical models, free parameters, axioms, or invented entities; all technical content is drawn from the referenced prior work.

pith-pipeline@v0.9.0 · 5593 in / 1049 out tokens · 45707 ms · 2026-05-13T22:59:28.745867+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.

  2. Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning

    cs.CL 2026-05 unverdicted novelty 7.0

    OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.

  3. The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

    cs.LG 2026-05 unverdicted novelty 7.0

    On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.

  4. Rubric-based On-policy Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...

  5. Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

    cs.LG 2026-05 unverdicted novelty 7.0

    PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.

  6. MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

    cs.CL 2026-05 unverdicted novelty 7.0

    MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.

  7. Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

    cs.AI 2026-05 accept novelty 7.0

    GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.

  8. Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

    cs.AI 2026-05 unverdicted novelty 7.0

    GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.

  9. Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.

  10. Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

    cs.LG 2026-05 unverdicted novelty 6.0

    Sparse RL on capable teachers followed by dense distillation to students beats direct GRPO on students for verifiable math reasoning.

  11. Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.

  12. Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.

  13. GRAFT: Graph-Tokenized LLMs for Tool Planning

    cs.LG 2026-05 unverdicted novelty 6.0

    GRAFT internalizes tool dependency graphs via dedicated special tokens in LLMs and applies on-policy context distillation to achieve higher exact sequence matching and dependency legality than prior external-graph methods.

  14. SOD: Step-wise On-policy Distillation for Small Language Model Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

  15. SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...

  16. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  17. Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

    cs.LG 2026-05 unverdicted novelty 6.0

    Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.

  18. Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

    cs.LG 2026-05 unverdicted novelty 5.0

    Sparse RL on a strong teacher followed by dense distillation to the student outperforms direct GRPO on the student for math tasks, with a forward-KL + OPD bridge enabling further gains.

  19. Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

    cs.AI 2026-05 unverdicted novelty 5.0

    Reshaping outcome rewards, process signals, and rollout comparability in GRPO raises strict compile-and-semantic accuracy in agentic code repair from 0.385 to 0.535 under weak feedback.

  20. UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

    cs.CL 2026-05 unverdicted novelty 5.0

    UniSD unifies complementary self-distillation mechanisms for autoregressive LLMs and achieves up to +5.4 point gains over base models and +2.8 over baselines across six benchmarks and six models.

  21. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 4.0

    Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.

  22. Knowledge Distillation Must Account for What It Loses

    cs.LG 2026-04 unverdicted novelty 4.0

    Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.

  23. Knowledge Distillation Must Account for What It Loses

    cs.LG 2026-04 unverdicted novelty 4.0

    Knowledge distillation should be reframed as a lossy projection and evaluated with a taxonomy of off-metric losses plus a Distillation Loss Statement reporting preserved and lost capabilities.

  24. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 3.0

    Safactory combines parallel simulation, trustworthy data management, and asynchronous evolution platforms into a single pipeline claimed to be the first unified framework for trustworthy autonomous agents.

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages · cited by 19 Pith papers · 2 internal anchors

  1. [1] X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs. arXiv:2603.24596, 2026. https://arxiv.org/abs/2603.24596

  2. [2] Reinforcement Learning via Self-Distillation. https://arxiv.org/abs/2601.20802

  3. [3] Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? arXiv:2603.24472, 2026. https://arxiv.org/abs/2603.24472

  4. [4] arXiv:2603.11137. https://arxiv.org/abs/2603.11137

  5. [5] arXiv:2509.14257. https://arxiv.org/abs/2509.14257

  6. [6] MiMo-V2-Flash Technical Report. https://arxiv.org/abs/2601.02780

  7. [7] arXiv:2504.14945. https://arxiv.org/abs/2504.14945

  8. [8] DistillSpec: Improving Speculative Decoding via Knowledge Distillation. arXiv:2310.08461. https://arxiv.org/abs/2310.08461