On-Policy Context Distillation for Language Models
Pith reviewed 2026-05-13 20:44 UTC · model grok-4.3
The pith
On-policy context distillation lets language models internalize experiential knowledge from their own outputs more effectively than prior methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On-Policy Context Distillation trains the student on sequences it produces itself while aligning its token-level distributions to those of a context-conditioned teacher via reverse KL minimization. The resulting student internalizes the knowledge that was previously only available through in-context examples, yielding measurable gains in task accuracy and retention of out-of-distribution performance.
What carries the argument
On-Policy Context Distillation (OPCD), the procedure of sampling trajectories from the current student and minimizing reverse KL divergence to a context-conditioned teacher's distributions.
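One way to write this procedure down, as a sketch in notation assumed here rather than taken from the paper: let \pi_\theta be the student, \pi_T the teacher, x a prompt from a dataset \mathcal{D}, c the context that only the teacher sees, and y a response sampled from the student.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t=1}^{|y|}
      D_{\mathrm{KL}}\!\left(
        \pi_\theta(\cdot \mid x, y_{<t})
        \,\big\|\,
        \pi_T(\cdot \mid c, x, y_{<t})
      \right)
    \right]
```

The student distribution sits in the first argument of the KL term, which is what makes the divergence "reverse". Details such as length normalization, or whether gradients pass through the sampling of y, are not stated in the text here and are left open in this sketch.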
If this is right
- Task accuracy rises on mathematical reasoning, text-based games, and domain-specific problems relative to standard distillation.
- Out-of-distribution performance degrades less than with conventional context distillation or on-policy baselines.
- Smaller student models can successfully absorb experiential knowledge distilled from larger teachers.
- Models can consolidate knowledge from their own historical solution traces without external supervision.
- Beneficial behaviors encoded in optimized system prompts become internalized parameters rather than repeated context.
Where Pith is reading between the lines
- Deployed models could rely on shorter contexts if key prompt knowledge is first internalized via OPCD.
- The method may extend naturally to multi-turn agent settings where experience accumulates across interactions.
- Self-generated trajectories appear to supply a more stable training signal than fixed teacher demonstrations for knowledge transfer.
- Cross-size results suggest OPCD could serve as a practical route for compressing large-model capabilities into smaller ones.
Load-bearing premise
Training on the student's own generated trajectories while matching a context-conditioned teacher will internalize transferable knowledge without causing output collapse or training instability.
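The premise can be made concrete with a toy sketch. The code below is not the paper's implementation: `student_step` and `teacher_step` are hypothetical callables that return next-token logits over a tiny vocabulary, and the loss is simply the summed per-token reverse KL along a trajectory sampled from the student.

```python
import math
import random


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at a single token position, in nats."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def opcd_sequence_loss(student_step, teacher_step, prompt, context, max_len, rng):
    """Sample a trajectory from the student and sum per-token reverse KL.

    student_step(prompt, prefix) and teacher_step(context, prompt, prefix)
    are hypothetical stand-ins for model forward passes. Only the teacher
    receives the context.
    """
    prefix, loss = [], 0.0
    for _ in range(max_len):
        p = softmax(student_step(prompt, prefix))
        q = softmax(teacher_step(context, prompt, prefix))
        loss += reverse_kl(p, q)
        # On-policy step: the next token comes from the student's own distribution.
        token = rng.choices(range(len(p)), weights=p, k=1)[0]
        prefix.append(token)
    return loss, prefix


def toy_student(prompt, prefix):
    """Hypothetical student: uniform over a 3-token vocabulary."""
    return [0.0, 0.0, 0.0]


def toy_teacher(context, prompt, prefix):
    """Hypothetical teacher: the context makes it prefer token 0."""
    return [2.0, 0.0, 0.0]


loss, trajectory = opcd_sequence_loss(
    toy_student, toy_teacher, "question", "hint", 4, random.Random(0)
)
```

In a real training loop the loss would be differentiated with respect to the student's parameters. This sketch only evaluates it, which is enough to see that the loss is zero when the student already matches the context-conditioned teacher.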
What would settle it
A controlled run in which, after OPCD training, the student model's accuracy on the target tasks falls below the no-distillation baseline or its output entropy collapses to a narrow range of repetitive responses.
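The entropy half of this test could be monitored with something like the following sketch. It assumes access to the student's per-token probability distributions for each rollout. The function names and the `floor_ratio` threshold are illustrative choices, not values from the paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(step_distributions):
    """Average entropy over the per-token distributions of one rollout."""
    return sum(token_entropy(p) for p in step_distributions) / len(step_distributions)


def looks_collapsed(entropy_history, floor_ratio=0.1):
    """Flag a run whose latest mean entropy fell below a fraction of its starting value.

    floor_ratio is an arbitrary illustrative threshold.
    """
    return entropy_history[-1] < floor_ratio * entropy_history[0]
```

Logging `mean_policy_entropy` at each evaluation step and passing the resulting list to `looks_collapsed` gives a crude alarm. A sharper check would also track accuracy against the no-distillation baseline, as the criterion above requires.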
Original abstract
Context distillation enables language models to internalize in-context knowledge into their parameters. In our work, we propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation by training a student model on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation, where models extract and consolidate transferable knowledge from their historical solution traces, and system prompt distillation, where models internalize beneficial behaviors encoded in optimized prompts. Across mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, achieving higher task accuracy while better preserving out-of-distribution capabilities. We further show that OPCD enables effective cross-size distillation, where smaller student models can internalize experiential knowledge from larger teachers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes On-Policy Context Distillation (OPCD), which trains a student language model on trajectories sampled from its own policy while minimizing reverse KL divergence to a context-conditioned teacher. The method is applied to experiential knowledge distillation from historical solution traces and to system-prompt distillation. The central claims are that OPCD yields higher task accuracy than baselines on mathematical reasoning, text-based games, and domain-specific tasks, while better preserving out-of-distribution capabilities and enabling effective cross-size distillation.
Significance. If the empirical results and OOD claims hold after addressing potential mode-collapse concerns, OPCD would represent a useful advance in parameter-efficient internalization of in-context knowledge. The on-policy reverse-KL formulation directly targets a known limitation of standard context distillation and could improve generalization retention, which is a recurring practical bottleneck in LLM distillation pipelines.
major comments (3)
- [§3.2] §3.2 (Training Objective): The reverse-KL objective applied to on-policy samples is mode-seeking by construction. The manuscript provides no entropy monitoring, mode-coverage statistics, or forward-KL ablation to demonstrate that the claimed OOD preservation is not an artifact of the student concentrating on high-probability modes present in its own rollouts. This directly bears on the central claim that OPCD “better preserv[es] out-of-distribution capabilities.”
- [§5] §5 (Experimental Results): The abstract asserts that OPCD “consistently outperforms baseline methods” with “higher task accuracy,” yet the provided text supplies no numerical values, baseline specifications, number of runs, or statistical tests. Without these, the quantitative support for the superiority claim cannot be evaluated.
- [§4.3] §4.3 (Cross-Size Distillation): The claim that smaller students successfully internalize knowledge from larger teachers rests on the same reverse-KL on-policy setup. An explicit check that the student does not simply overfit to the teacher’s high-reward modes (e.g., via held-out OOD accuracy curves or diversity metrics) is required to substantiate the cross-size result.
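The mode-seeking concern raised in the first major comment can be illustrated with a small numeric sketch. The distributions below are made up for illustration: a bimodal "teacher" over four tokens, a student concentrated on one mode, and a student spread evenly.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for two discrete distributions with full support in q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Illustrative distributions, not data from the paper.
teacher = [0.49, 0.01, 0.01, 0.49]   # two modes
peaked = [0.97, 0.01, 0.01, 0.01]    # student covering only one mode
spread = [0.25, 0.25, 0.25, 0.25]    # student covering everything

# Reverse KL puts the student first; forward KL puts the teacher first.
reverse_peaked, reverse_spread = kl(peaked, teacher), kl(spread, teacher)
forward_peaked, forward_spread = kl(teacher, peaked), kl(teacher, spread)

print(f"reverse KL: peaked={reverse_peaked:.3f} spread={reverse_spread:.3f}")
print(f"forward KL: peaked={forward_peaked:.3f} spread={forward_spread:.3f}")
```

Reverse KL scores the single-mode student as the better fit, while forward KL prefers the spread-out one. That asymmetry is why the report asks for entropy monitoring and a forward-KL ablation.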
minor comments (2)
- [§3.1] Notation for the reverse-KL term is introduced without an explicit equation number; adding a numbered display equation would improve traceability.
- [Abstract] The abstract states results across “mathematical reasoning, text-based games, and domain-specific tasks” but does not list the concrete benchmarks or datasets; a short table in the abstract or introduction would aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below. Revisions have been made to strengthen the empirical support and analyses as requested.
Point-by-point responses
-
Referee: [§3.2] §3.2 (Training Objective): The reverse-KL objective applied to on-policy samples is mode-seeking by construction. The manuscript provides no entropy monitoring, mode-coverage statistics, or forward-KL ablation to demonstrate that the claimed OOD preservation is not an artifact of the student concentrating on high-probability modes present in its own rollouts. This directly bears on the central claim that OPCD “better preserv[es] out-of-distribution capabilities.”
Authors: We acknowledge that reverse KL is mode-seeking by design. However, the strictly on-policy nature of OPCD means the student is trained exclusively on trajectories from its own evolving policy, which limits collapse to external high-probability modes. To directly substantiate the OOD preservation claim, we have added entropy monitoring during training, mode-coverage statistics on held-out OOD tasks, and a forward-KL ablation in the revised §3.2 and appendix. These additions show that OPCD retains higher policy entropy and superior OOD accuracy relative to off-policy baselines. revision: yes
-
Referee: [§5] §5 (Experimental Results): The abstract asserts that OPCD “consistently outperforms baseline methods” with “higher task accuracy,” yet the provided text supplies no numerical values, baseline specifications, number of runs, or statistical tests. Without these, the quantitative support for the superiority claim cannot be evaluated.
Authors: We apologize for the insufficient quantitative detail in the submitted version. The revised §5 now reports exact task accuracies (e.g., 78.4% vs. 74.1% on math reasoning), full baseline specifications (standard context distillation, SFT, and imitation learning), results averaged over 5 random seeds, and statistical significance via paired t-tests with p-values. Key numerical highlights have also been incorporated into the abstract. revision: yes
-
Referee: [§4.3] §4.3 (Cross-Size Distillation): The claim that smaller students successfully internalize knowledge from larger teachers rests on the same reverse-KL on-policy setup. An explicit check that the student does not simply overfit to the teacher’s high-reward modes (e.g., via held-out OOD accuracy curves or diversity metrics) is required to substantiate the cross-size result.
Authors: We agree that explicit checks against overfitting to high-reward modes are necessary to support the cross-size distillation results. The revised §4.3 now includes held-out OOD accuracy curves for smaller students across task distributions and diversity metrics (token entropy and unique n-gram coverage). These demonstrate that the students generalize beyond the teacher’s high-probability outputs rather than overfitting. revision: yes
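The rebuttal does not define "unique n-gram coverage". One common reading is the distinct-n ratio, the fraction of n-grams in a set of generations that are unique. The sketch below implements that reading as an assumption. The function name and the pooling of all sequences are illustrative choices.

```python
def distinct_n(token_sequences, n):
    """Fraction of n-grams, pooled across all sequences, that are unique.

    Returns 0.0 when no sequence is long enough to contain an n-gram.
    """
    ngrams = [
        tuple(seq[i:i + n])
        for seq in token_sequences
        for i in range(len(seq) - n + 1)
    ]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


# A repetitive generation scores lower than a varied one.
repetitive = [["a", "b", "a", "b"]]
varied = [["a", "b", "c", "d"]]
print(distinct_n(repetitive, 2), distinct_n(varied, 2))
```

A distinct-n value that drifts toward zero during training would point toward the repetitive outputs the referee is worried about. Comparing it between the student and the teacher on held-out prompts is one way to report it.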
Circularity Check
No circularity in OPCD derivation chain
Full rationale
The paper defines On-Policy Context Distillation directly as training the student on its own generated trajectories while minimizing reverse KL to a context-conditioned teacher. This objective is stated as an independent proposal without any fitted constants, self-referential equations, or reductions to prior results by construction. Performance claims rest on empirical evaluations across tasks rather than derived predictions that collapse back to inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from the authors' prior work appear in the provided text. The framework is self-contained as a stated combination of on-policy sampling and reverse KL, with no steps that equate outputs to inputs by definition.
Forward citations
Cited by 49 Pith papers
-
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
-
EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation
EDGE-OPD adds guided rollouts and evidence masking to on-policy self-distillation, enabling successful learning of target identities where standard OPSD and RLSD fail.
-
GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation
GenEvolve proposes a self-evolving agent framework for open-ended image generation that uses tool-orchestrated trajectories and visual experience distillation from best-worst differences to achieve reported state-of-t...
-
Next-Acceleration-Scale Prediction for Autoregressive MRI Reconstruction
Next-acceleration-scale autoregressive prediction in discrete latent space with on-policy privileged information distillation yields improved MRI reconstructions from sparse measurements on the fastMRI benchmark.
-
Learning from Language Feedback via Variational Policy Distillation
VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming f...
-
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
-
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
-
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
-
Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by minimizing distribution differences between a text-only student and a multimodal teacher on the student's o...
-
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
-
TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.
-
Self-Distilled RLVR
RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
-
PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence
PACED applies student pass-rate weighting w(p)=p(1-p) to distillation, concentrating on the zone of proximal development and delivering up to +8.2 gains on AIME tasks with reduced forgetting.
-
Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning
DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.
-
On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation
On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.
-
Reinforcing Human Behavior Simulation via Verbal Feedback
DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.
-
MixSD: Mixed Contextual Self-Distillation for Knowledge Injection
MixSD mixes tokens from the base model's expert and naive conditionals to create distribution-aligned supervision for knowledge injection, yielding better memorization-retention trade-offs than SFT across scales and b...
-
MixSD: Mixed Contextual Self-Distillation for Knowledge Injection
MixSD achieves superior memorization-retention trade-off in knowledge injection by using mixed self-generated supervision from the base model's conditionals, retaining up to 100% held-out capability versus 1% for stan...
-
Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation
On-policy self-distillation with teacher flip rate yields better safety-reasoning tradeoffs than off-policy or external-teacher baselines across model scales.
-
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...
-
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
Sparse RL on capable teachers followed by dense distillation to students beats direct GRPO on students for verifiable math reasoning.
-
GRAFT: Graph-Tokenized LLMs for Tool Planning
GRAFT internalizes tool dependency graphs via dedicated special tokens in LLMs and applies on-policy context distillation to achieve higher exact sequence matching and dependency legality than prior external-graph methods.
-
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
-
ORACLE: Anticipating Scams from Partial Trajectories in Streaming App Usage
ORACLE is a new agentic framework using adaptive context consolidation and teacher-student distillation to detect emerging scam patterns from incomplete, long-horizon app usage streams across 12 scam types.
-
SOD: Step-wise On-policy Distillation for Small Language Model Agents
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
-
SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...
-
SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
-
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
-
Multilingual Safety Alignment via Self-Distillation
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
-
OPSDL: On-Policy Self-Distillation for Long-Context Language Models
OPSDL improves long-context LLM performance by having the model self-distill from its short-context capability using point-wise reverse KL divergence on generated tokens, outperforming SFT and DPO on benchmarks withou...
-
$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...
-
Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents
Skill-SD turns an agent's completed trajectories into dynamic natural-language skills that condition only the teacher in self-distillation, yielding 14-42% gains over RL and OPSD baselines on multi-turn agent benchmarks.
-
CRISP: Compressed Reasoning via Iterative Self-Policy Distillation
CRISP achieves 57-59% token reduction on MATH-500 with 9-16 point accuracy gains on Qwen3 models via iterative self-distillation of concise reasoning behavior.
-
GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation
GenEvolve introduces a self-evolving agent framework for image generation using tool-orchestrated trajectories and Visual Experience Distillation to achieve claimed SOTA results on benchmarks.
-
Next-Acceleration-Scale Prediction for Autoregressive MRI Reconstruction
Autoregressive prediction over discrete codebook tokens at successive acceleration scales, supervised via on-policy privileged-information distillation from fully sampled data, yields sharper MRI reconstructions under...
-
$\boldsymbol{f}$-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control
f-OPD decomposes on-policy distillation drift into rollout and supervision components, then applies a sample-level freshness score to adaptively limit stale data influence and stabilize long-horizon agent training.
-
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
Sparse rewards on capable teachers for exploration followed by dense distillation to students outperforms direct sparse reward application like GRPO on the deployment model.
-
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
Sparse RL on a strong teacher followed by dense distillation to the student outperforms direct GRPO on the student for math tasks, with a forward-KL + OPD bridge enabling further gains.
-
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
A four-stage sparse-to-dense reward workflow for LLM post-training reaches 79.3% on MATH and 25.2% on AIME 2024 with a 1.7B student, outperforming direct GRPO by enforcing dense implicit rewards from a shaped teacher.
-
On-Policy Distillation with Best-of-N Teacher Rollout Selection
BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.
-
On-Policy Distillation with Best-of-N Teacher Rollout Selection
BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
-
Reasoning Compression with Mixed-Policy Distillation
Mixed-Policy Distillation transfers concise reasoning behavior from larger to smaller LLMs by having the teacher compress student-generated trajectories, cutting token usage up to 27% while raising benchmark scores.
-
Multilingual Safety Alignment via Self-Distillation
MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.
-
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
-
The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents
Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.
-
A Brief Overview: On-Policy Self-Distillation In Large Language Models
This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.
-
A Brief Overview: On-Policy Self-Distillation In Large Language Models
OPSD lets a single LLM distill its own reasoning by sampling trajectories from the student role while granting the teacher role privileged access to verified solutions, reducing memory needs versus separate-model dist...