Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

2026 · cs.LG · arXiv 2603.25562

Canonical reference. 89% of citing Pith papers cite this work as background.

18 Pith papers citing it

Background 89% of classified citations

abstract

On-policy distillation (OPD) is increasingly used in LLM post-training because it can leverage a teacher model to provide dense supervision on student rollouts. The standard implementation, however, usually reduces distribution matching to a sampled-token log-ratio, which can make the learning signal fragile on long rollouts whose prefixes drift away from the teacher's typical support. We revisit this formulation from both theoretical and implementation perspectives. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL minimization, but admits a substantially tighter worst-case variance bound; a controlled synthetic study further shows that stronger future-reward coupling increases gradient variance and destabilizes training. Empirically, we identify three failure modes of sampled-token OPD: imbalanced token-level supervision, unreliable teacher guidance on student-generated prefixes, and tokenizer or special-token mismatch. These findings motivate teacher top-K local support matching, a truncated reverse-KL objective that compares teacher and student distributions over a teacher-supported token set at each prefix, together with top-p rollout sampling and special-token masking. Across single-task reasoning and multi-task benchmarks spanning agentic and reasoning settings, this objective improves optimization stability and yields a +19.8% performance gain over standard sampled-token OPD baselines, providing a practical recipe for more stable on-policy distillation.
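
The contrast the abstract draws between the sampled-token log-ratio and teacher top-K local support matching can be sketched for a single prefix as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and renormalizing both distributions over the teacher's top-K set and averaging the loss over non-special positions are assumptions of this sketch, since the abstract does not specify those details.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def sampled_token_log_ratio(student_logits, teacher_logits, token):
    """Baseline signal: log p_student(token) - log p_teacher(token)
    evaluated only at the token the student actually sampled."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return s[token] - t[token]

def topk_reverse_kl(student_logits, teacher_logits, k):
    """Reverse KL(student || teacher) restricted to the teacher's top-k tokens.
    ASSUMPTION: both distributions are renormalized over that token set."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    # Teacher-supported token set at this prefix.
    support = sorted(range(len(t)), key=lambda i: t[i], reverse=True)[:k]
    s_sub = log_softmax([s[i] for i in support])
    t_sub = log_softmax([t[i] for i in support])
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s_sub, t_sub))

def masked_mean_loss(per_position_losses, token_ids, special_ids):
    """Average per-position losses, skipping positions whose sampled token
    is a special token (ASSUMPTION: masking is applied per position)."""
    kept = [loss for loss, tok in zip(per_position_losses, token_ids)
            if tok not in special_ids]
    return sum(kept) / len(kept) if kept else 0.0

if __name__ == "__main__":
    student = [2.0, 0.5, -1.0, -3.0]
    teacher = [1.5, 1.0, -2.0, -4.0]
    print(sampled_token_log_ratio(student, teacher, token=0))
    print(topk_reverse_kl(student, teacher, k=2))
```

The sampled-token signal depends on a single drawn token, whereas the top-K objective compares the two distributions over every teacher-supported token at the prefix, which is the kind of denser local comparison the abstract describes.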

citation-role summary

background 7 · baseline 1 · method 1

citation-polarity summary

background 8 · baseline 1

representative citing papers

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.

Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning

cs.CL · 2026-05-09 · unverdicted · novelty 7.0

OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.

The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.

Rubric-based On-policy Distillation

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.

Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.

MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.

Self-Distilled RLVR

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.

Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.

On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation

cs.LG · 2026-05-20 · conditional · novelty 6.0

On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.

SOD: Step-wise On-policy Distillation for Small Language Model Agents

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.

Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation

cs.CL · 2026-05-19 · unverdicted · novelty 5.0

MOTAB is a new distillation pipeline that monitors on-policy student trajectories and backtracks with teacher intervention to mitigate dual exposure biases, improving reasoning performance by about 3%.

$\boldsymbol{f}$-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control

cs.LG · 2026-05-18 · unverdicted · novelty 5.0

f-OPD decomposes on-policy distillation drift into rollout and supervision components, then applies a sample-level freshness score to adaptively limit stale data influence and stabilize long-horizon agent training.

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

cs.CL · 2026-05-12 · unverdicted · novelty 5.0 · 3 refs

On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.

On-Policy Distillation with Best-of-N Teacher Rollout Selection

cs.CV · 2026-05-10 · unverdicted · novelty 5.0 · 2 refs

BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.

Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

cs.AI · 2026-05-08 · unverdicted · novelty 5.0

Reshaping outcome rewards, process signals, and rollout comparability in GRPO raises strict compile-and-semantic accuracy in agentic code repair from 0.385 to 0.535 under weak feedback.

Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

cs.CL · 2026-05-13

citing papers explorer

Showing 18 of 18 citing papers.

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation · cs.LG · 2026-05-16 · unverdicted · none · ref 9 · internal anchor
Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning · cs.CL · 2026-05-09 · unverdicted · none · ref 10 · internal anchor
The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs · cs.LG · 2026-05-09 · unverdicted · none · ref 12 · internal anchor
Rubric-based On-policy Distillation · cs.LG · 2026-05-08 · unverdicted · none · ref 31 · internal anchor
Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization · cs.LG · 2026-05-06 · unverdicted · none · ref 5 · internal anchor
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate · cs.CL · 2026-05-02 · unverdicted · none · ref 12 · internal anchor
Self-Distilled RLVR · cs.LG · 2026-04-03 · unverdicted · none · ref 38 · internal anchor
Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning · cs.LG · 2026-05-21 · unverdicted · none · ref 39 · internal anchor
On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation · cs.LG · 2026-05-20 · conditional · none · ref 8 · internal anchor
SOD: Step-wise On-policy Distillation for Small Language Model Agents · cs.CL · 2026-05-08 · unverdicted · none · ref 35 · internal anchor
SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation · cs.CL · 2026-05-08 · unverdicted · none · ref 38 · internal anchor
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe · cs.LG · 2026-04-14 · unverdicted · none · ref 4 · internal anchor
Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation · cs.CL · 2026-05-19 · unverdicted · none · ref 8 · internal anchor
$\boldsymbol{f}$-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control · cs.LG · 2026-05-18 · unverdicted · none · ref 26 · internal anchor
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation · cs.CL · 2026-05-12 · unverdicted · none · ref 7 · 3 links · internal anchor
On-Policy Distillation with Best-of-N Teacher Rollout Selection · cs.CV · 2026-05-10 · unverdicted · none · ref 10 · 2 links · internal anchor
Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair · cs.AI · 2026-05-08 · unverdicted · none · ref 8 · internal anchor
Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation · cs.CL · 2026-05-13 · unreviewed · ref 12 · internal anchor
