Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines without external signals.
hub Canonical reference
On-policy distillation of language models: Learning from self-generated mistakes
Canonical reference. 71% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
years
2026 17representative citing papers
VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.
VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.
MA-BC partitions divergent expert data and pools non-conflicting pairs to achieve faster convergence to Pareto-optimal policies in MOMDPs, with a matching minimax lower bound.
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
OPPO derives token-level advantages for LLM RL via Bayesian recursion on oracle signals, recovering prior distillation methods as a special case and showing gains on math and code benchmarks.
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout per prompt.
Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
ProteinOPD uses token-level on-policy distillation from multiple preference-specific teacher models into a shared student to balance competing objectives in protein design, delivering gains on targets without losing designability and an 8x speedup over RL baselines.
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
FINCH is a loss-adaptive learning-rate schedule that reduces forgetting by 93% on average during LLM fine-tuning while matching standard task performance across several benchmarks.
VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.
citing papers explorer
-
Split the Differences, Pool the Rest: Provably Efficient Multi-Objective Imitation
MA-BC partitions divergent expert data and pools non-conflicting pairs to achieve faster convergence to Pareto-optimal policies in MOMDPs, with a matching minimax lower bound.
-
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
-
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
-
ProteinOPD: Towards Effective and Efficient Preference Alignment for Protein Design
ProteinOPD uses token-level on-policy distillation from multiple preference-specific teacher models into a shared student to balance competing objectives in protein design, delivering gains on targets without losing designability and an 8x speedup over RL baselines.
- Flow-OPD: On-Policy Distillation for Flow Matching Models