Noisy expert imitation learning requires exponential samples for offline methods but polynomial for a variant of on-policy distillation under a noise condition.
super hub Mixed citations
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
Mixed citation behavior. Most common role is background (58%).
abstract
Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self, we introduce On-Policy Self-Distillation (OPSD), a learning algorithm where a single LLM acts as both teacher and student with different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving superior token efficiency compared to reinforcement learning methods and better performance over off-policy distillation methods. Code repo: https://github.com/siyan-zhao/OPSD.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuitio
authors
co-cited works
years
2026 145representative citing papers
L-layer transformers under Log-ICoT curriculum provably learn k-parity with poly(n) samples and log k stages, matching explicit CoT efficiency without inference overhead.
AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
Purified OPSD subtracts a reference-only teacher's signal from standard OPSD supervision and applies PMI to create a cleaner distillation target, yielding gains on long-CoT models while preserving epistemic behavior.
Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.
SAGE-OPD improves multi-turn OPD via turn-level selective intervention, teacher-confidence weighting, and loss normalization, reporting up to 13.3% relative gain in ALFWorld unseen success rate over standard OPD.
TAPO constructs learnable micro-reflective trajectories from contrastive model rollouts during RL training to provide explicit error diagnoses and corrections, reporting consistent gains over GRPO on AIME and HMMT math benchmarks.
SC-GRPO improves RL with verifiable rewards by multiplying GRPO gradients with self-induced per-token KL divergence, outperforming GRPO by 8.1% and DAPO by 5.9% on math, code, and agent benchmarks.
ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.
d-OPSD reframes on-policy self-distillation for dLLMs via suffix conditioning from self-generated answers and step-level supervision, outperforming RLVR and SFT on reasoning benchmarks with ~10% of the optimization steps.
RGSD distills rubric-conditioned teacher distributions into base policies token-by-token, matching GRPO rubric satisfaction on Qwen models with one rollout and zero verifier calls.
A stop-gradient consistency regularizer mitigates context-induced degradation in on-policy distillation, improving robustness across 12 configurations.
AR-OPD disentangles privileged supervision via anchored residual guidance to reduce hindsight leakage in on-policy distillation, reporting gains of 2.3 points over full privileged OPD and 7.9 over SFT on reasoning tasks.
OPRD performs distillation in hidden-state space on on-policy data for deterministic gradients and better math benchmark performance, plus OPRD-Bridge for cross-architecture transfer via low-rank projectors.
DistIL applies distributional DAgger with forward cross-entropy to achieve monotonic policy improvement and better Pass@N from rich feedback in RL for reasoning tasks.
OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.
OPD+ removes the bias from stop-gradient in on-policy distillation by deriving correct gradients for f-divergences, outperforming standard KL-based methods on math reasoning and tool-use tasks.
Token teachability, based on local compatibility of teacher and student distributions, predicts on-policy distillation gains better than raw KL disagreement and enables TA-OPD to match or exceed full-token performance with 5% tokens across Qwen models.
Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.
EDGE-OPD adds guided rollouts and evidence masking to on-policy self-distillation, enabling successful learning of target identities where standard OPSD and RLSD fail.
Introduces a Cognitive User Simulator modeling stratified personas with hidden concerns and Simulator-Induced Asymmetric-View Policy Optimization to unlock proactive behavior in task-oriented dialogue agents.
Life-Harness evolves reusable interventions from training trajectories to enhance frozen LLM agents on unseen tasks across seven deterministic environments, yielding 88.5% average relative improvement in 116 of 126 model-environment settings.
CEPO sharpens token credit in RLVR by requiring tokens to be favored by the correct answer and disfavored by wrong answers drawn from rejected rollouts, delivering accuracy gains on five multimodal math benchmarks.
Next-acceleration-scale autoregressive prediction in discrete latent space with on-policy privileged information distillation yields improved MRI reconstructions from sparse measurements on the fastMRI benchmark.
citing papers explorer
-
Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback
Noisy expert imitation learning requires exponential samples for offline methods but polynomial for a variant of on-policy distillation under a noise condition.
-
Transformers Provably Learn to Internalize Chain-of-Thought
L-layer transformers under Log-ICoT curriculum provably learn k-parity with poly(n) samples and log k stages, matching explicit CoT efficiency without inference overhead.
-
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
-
Purified OPSD: On-Policy Self-Distillation Without Losing How to Think
Purified OPSD subtracts a reference-only teacher's signal from standard OPSD supervision and applies PMI to create a cleaner distillation target, yielding gains on long-CoT models while preserving epistemic behavior.
-
When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon
Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.
-
SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation
SAGE-OPD improves multi-turn OPD via turn-level selective intervention, teacher-confidence weighting, and loss normalization, reporting up to 13.3% relative gain in ALFWorld unseen success rate over standard OPD.
-
Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation
TAPO constructs learnable micro-reflective trajectories from contrastive model rollouts during RL training to provide explicit error diagnoses and corrections, reporting consistent gains over GRPO on AIME and HMMT math benchmarks.
-
Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards
SC-GRPO improves RL with verifiable rewards by multiplying GRPO gradients with self-induced per-token KL divergence, outperforming GRPO by 8.1% and DAPO by 5.9% on math, code, and agent benchmarks.
-
Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.
-
Learning from the Self-future: On-policy Self-distillation for dLLMs
d-OPSD reframes on-policy self-distillation for dLLMs via suffix conditioning from self-generated answers and step-level supervision, outperforming RLVR and SFT on reasoning benchmarks with ~10% of the optimization steps.
-
Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers
RGSD distills rubric-conditioned teacher distributions into base policies token-by-token, matching GRPO rubric satisfaction on Qwen models with one rollout and zero verifier calls.
-
When Context Returns: Toward Robust Internalization in On-Policy Distillation
A stop-gradient consistency regularizer mitigates context-induced degradation in on-policy distillation, improving robustness across 12 configurations.
-
Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation
AR-OPD disentangles privileged supervision via anchored residual guidance to reduce hindsight leakage in on-policy distillation, reporting gains of 2.3 points over full privileged OPD and 7.9 over SFT on reasoning tasks.
-
OPRD: On-Policy Representation Distillation
OPRD performs distillation in hidden-state space on on-policy data for deterministic gradients and better math benchmark performance, plus OPRD-Bridge for cross-architecture transfer via low-rank projectors.
-
Reinforcement Learning from Rich Feedback with Distributional DAgger
DistIL applies distributional DAgger with forward cross-entropy to achieve monotonic policy improvement and better Pass@N from rich feedback in RL for reasoning tasks.
-
OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification
OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.
-
OPD+: Rethinking the Advantage Design for On-Policy Distillation
OPD+ removes the bias from stop-gradient in on-policy distillation by deriving correct gradients for f-divergences, outperforming standard KL-based methods on math reasoning and tool-use tasks.
-
Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation
Token teachability, based on local compatibility of teacher and student distributions, predicts on-policy distillation gains better than raw KL disagreement and enables TA-OPD to match or exceed full-token performance with 5% tokens across Qwen models.
-
Not only where, But when: Temporal Scheduling for RLVR
Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.
-
EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation
EDGE-OPD adds guided rollouts and evidence masking to on-policy self-distillation, enabling successful learning of target identities where standard OPSD and RLSD fail.
-
Unlocking Proactivity in Task-Oriented Dialogue
Introduces a Cognitive User Simulator modeling stratified personas with hidden concerns and Simulator-Induced Asymmetric-View Policy Optimization to unlock proactive behavior in task-oriented dialogue agents.
-
Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
Life-Harness evolves reusable interventions from training trajectories to enhance frozen LLM agents on unseen tasks across seven deterministic environments, yielding 88.5% average relative improvement in 116 of 126 model-environment settings.
-
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
CEPO sharpens token credit in RLVR by requiring tokens to be favored by the correct answer and disfavored by wrong answers drawn from rejected rollouts, delivering accuracy gains on five multimodal math benchmarks.
-
Next-Acceleration-Scale Prediction for Autoregressive MRI Reconstruction
Next-acceleration-scale autoregressive prediction in discrete latent space with on-policy privileged information distillation yields improved MRI reconstructions from sparse measurements on the fastMRI benchmark.
-
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
-
Learning from Language Feedback via Variational Policy Distillation
VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.
-
Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning
EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.
-
Learning Agentic Policy from Action Guidance
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
-
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
-
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
-
Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.
-
TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preserving OOD performance.
-
TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
-
LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification
LaTER reduces LLM token usage 16-33% on reasoning benchmarks by exploring in latent space then switching to explicit CoT verification, with gains like 70% to 73.3% on AIME 2025 in the training-free version.
-
Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.
-
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
-
TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.
-
Near-Future Policy Optimization
NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating convergence.
-
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
-
Self-Distilled RLVR
RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
-
PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence
PACED applies student pass-rate weighting w(p)=p(1-p) to distillation, concentrating on the zone of proximal development and delivering up to +8.2 gains on AIME tasks with reduced forgetting.
-
SEAD: Competence-Aware On-Policy Distillation via Entropy-Guided Supervision
SEAD applies entropy-guided token selection, KL annealing, and easy-to-hard curriculum to on-policy distillation and reports +4.8 average accuracy gain over vanilla OPD on six math benchmarks with OLMo-3 models.
-
ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents
ATOD anneals from on-policy distillation to RL with turn-level reweighting to improve multi-turn agent success rates on ALFWorld, WebShop, and Search-QA.
-
Vision-driven Preference Synthesis for Mitigating Hallucinations in VLMs
ViPSy constructs policy-aligned and visually grounded preference pairs for VLMs via visual cues from image variants, yielding SOTA hallucination reductions of 35.7% on AMBER and 24.5% on Object HalBench.
-
On the Position Bias of On-Policy Distillation
Position bias in on-policy distillation degrades later-token supervision; IW-OPD weights tokens by accumulated discrepancy, yielding faster convergence and up to 6.9 point gains on AIME-2025.
-
Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation
ViGOS decouples perception from reasoning in on-policy self-distillation for MLLMs by supervising visual descriptions with an image-only teacher and reasoning with a privileged teacher.
-
Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning
Visual-OPSD distills reasoning from a privileged visual-thought teacher to a text-only student using on-policy JSD, delivering +3.40pp accuracy gain and 14.3x speedup over the generative teacher on nine benchmarks.
-
Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation
On-policy distillation produces coordinate-sparse, FFN-heavy updates that are full-rank but spectrally concentrated away from principal singular subspaces and near-zero source weights.
-
RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation
RLCSD contrasts teacher-student distributional gaps under correct versus wrong hints to suppress privilege-induced style drift and concentrate supervision on task tokens, outperforming GRPO and prior OPSD on Qwen3 and Olmo models.
-
ParaBridge: Bridging Paralinguistic Perception and Dialogue Behavior in Speech Language Models
ParaBridge applies on-policy self-distillation with a scaffold as privileged view to convert brittle inference-time paralinguistic guidance into stable model behavior, raising VoxSafeBench SAR from 14.6% to 40.3% on Qwen3-Omni-thinking while preserving general capabilities.