
TIP: Token Importance in On-Policy Distillation

2026 · cs.LG · arXiv 2604.14084

Canonical reference. 80% of citing Pith papers cite this work as background.

19 Pith papers citing it



abstract

On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher-student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining 50% of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to 47%. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than 10% of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher-student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher-student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on fewer than 20% of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.
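The two regions named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the thresholds `h_max` and `d_min`, the choice of KL(student || teacher) as the divergence, and the deterministic top-fraction rule (standing in for the paper's entropy-based sampling, whose exact form the abstract does not state) are all assumptions made for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def kl(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0) in q."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0.0)

def top_entropy_positions(student, keep_frac=0.5):
    """Indices of the keep_frac highest-entropy positions (assumed rule)."""
    k = max(1, round(keep_frac * len(student)))
    ranked = sorted(range(len(student)),
                    key=lambda t: entropy(student[t]), reverse=True)
    return sorted(ranked[:k])

def overconfident_positions(student, teacher, h_max, d_min):
    """Positions where the student is confident yet far from the teacher.

    h_max and d_min are hypothetical thresholds, not values from the paper.
    """
    return [t for t, (s, q) in enumerate(zip(student, teacher))
            if entropy(s) <= h_max and kl(s, q) >= d_min]

# Toy rollout: 4 positions over a 3-token vocabulary (made-up numbers).
student = [
    [0.34, 0.33, 0.33],  # uncertain student
    [0.98, 0.01, 0.01],  # confident student, teacher disagrees
    [0.98, 0.01, 0.01],  # confident student, teacher agrees
    [0.50, 0.40, 0.10],  # moderately uncertain student
]
teacher = [
    [0.60, 0.30, 0.10],
    [0.05, 0.90, 0.05],
    [0.97, 0.02, 0.01],
    [0.50, 0.40, 0.10],
]

by_entropy = top_entropy_positions(student, keep_frac=0.5)
overconfident = overconfident_positions(student, teacher, h_max=0.2, d_min=1.0)
selected = sorted(set(by_entropy) | set(overconfident))

print(by_entropy)     # [0, 3]
print(overconfident)  # [1]
print(selected)       # [0, 1, 3]
```

In this toy example the entropy rule alone never selects position 1, because the student's distribution there is sharply peaked. Only the second rule, which also looks at the teacher, picks it up. That is the gap the abstract describes for entropy-only selection.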


citation-role summary

background: 4 · method: 1

citation-polarity summary

background: 4 · use method: 1

representative citing papers

TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

TRIAGE augments GRPO with role-typed segment rewards derived from a judge that detects regression and exploration, yielding higher success rates and fewer turns on ALFWorld, Search-QA, and WebShop.

Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

Token teachability, based on local compatibility of teacher and student distributions, predicts on-policy distillation gains better than raw KL disagreement and enables TA-OPD to match or exceed full-token performance with 5% tokens across Qwen models.

Not only where, But when: Temporal Scheduling for RLVR

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.

TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preserving OOD performance.

The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.

Rubric-based On-policy Distillation

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.

Regime-Aware Peer Specialization for Robust RAG under Heterogeneous Knowledge Conflicts

cs.CL · 2026-06-29 · unverdicted · novelty 6.0

RAPS-DA improves RAG robustness to heterogeneous knowledge conflicts by training regime-specific peer specialists with hard routing and a dual-layer token selector for focused supervision.

ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents

cs.AI · 2026-06-26 · unverdicted · novelty 6.0

ATOD anneals from on-policy distillation to RL with turn-level reweighting to improve multi-turn agent success rates on ALFWorld, WebShop, and Search-QA.

AsyncOPD: How Stale Can On-Policy Distillation Be?

cs.LG · 2026-06-23 · conditional · novelty 6.0

AsyncOPD shows asynchronous OPD training reaches 1.6-3.8x higher throughput than synchronous baselines with comparable accuracy by using forward-KL estimators and multi-sample Monte Carlo correction for finite teacher caches.

ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation

cs.LG · 2026-06-22 · unverdicted · novelty 6.0

ReNIO reweights negative student-generated trajectories in LLM on-policy distillation using probability ratios, reporting relative gains up to 10% on reasoning benchmarks.

Finding the Evidence: Discovering Decision-Supporting Tokens for On-Policy Reasoning Distillation

cs.AI · 2026-06-22 · unverdicted · novelty 6.0

DEAR identifies decision tokens via entropy and evidence tokens via cosine similarity plus divergence to improve on-policy reasoning distillation over standard methods.

SOD: Step-wise On-policy Distillation for Small Language Model Agents

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

cs.CL · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.

DOPD: Dual On-policy Distillation

cs.AI · 2026-06-29 · unverdicted · novelty 5.0

DOPD is an advantage-aware dual distillation method that dynamically assigns token supervision from either privileged teacher or student to transfer capability while mitigating non-replicable information asymmetry in on-policy distillation.

DriftGuard: Safety-Aware Multi-Monitor Detection and Selective Adaptation for Evolving Toxicity Moderation

cs.CL · 2026-06-27 · unverdicted · novelty 5.0

DriftGuard introduces multi-monitor safety-aware drift detection paired with hard-mix selective adaptation, reporting toxic recall gains to 0.8777 on Civil Comments and 0.8523 on DynaHate under temporal and cross-dataset shifts.

Blockwise Policy-Drift Gating for On-Policy Distillation

cs.LG · 2026-06-23 · unverdicted · novelty 5.0

Blockwise policy-drift gating raises mean pass@8 from 0.4978 to 0.5160 on four math benchmarks by reweighting OPD losses with detached mean-normalized gates from student policy drift over 64-token blocks.

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

cs.LG · 2026-06-01 · unverdicted · novelty 5.0

FiRe-OPD introduces a two-stage filter-then-soft-reweight procedure for trajectory- and token-level supervision in on-policy distillation, claiming gains over prior token-level methods.

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

cs.CL · 2026-05-12 · unverdicted · novelty 5.0 · 3 refs

On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.

A Formula-Driven Survey and Research Agenda for On-Policy Distillation

cs.AI · 2026-06-22 · unverdicted · novelty 4.0

A survey creates a taxonomy for on-policy distillation in LLMs that separates temporal credit assignment from vocabulary-level probability routing.

citing papers explorer

Showing 18 of 18 citing papers after filters.

TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning · cs.LG · 2026-06-30 · unverdicted · none · ref 6
Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation · cs.LG · 2026-05-26 · unverdicted · none · ref 1
Not only where, But when: Temporal Scheduling for RLVR · cs.LG · 2026-05-25 · unverdicted · none · ref 27
TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment · cs.AI · 2026-05-11 · unverdicted · none · ref 19
The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs · cs.LG · 2026-05-09 · unverdicted · none · ref 42
Rubric-based On-policy Distillation · cs.LG · 2026-05-08 · unverdicted · none · ref 17
Regime-Aware Peer Specialization for Robust RAG under Heterogeneous Knowledge Conflicts · cs.CL · 2026-06-29 · unverdicted · none · ref 4
ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents · cs.AI · 2026-06-26 · unverdicted · none · ref 22
ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation · cs.LG · 2026-06-22 · unverdicted · none · ref 19
Finding the Evidence: Discovering Decision-Supporting Tokens for On-Policy Reasoning Distillation · cs.AI · 2026-06-22 · unverdicted · none · ref 20
SOD: Step-wise On-policy Distillation for Small Language Model Agents · cs.CL · 2026-05-08 · unverdicted · none · ref 41
SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation · cs.CL · 2026-05-08 · unverdicted · none · ref 39 · 2 links
DOPD: Dual On-policy Distillation · cs.AI · 2026-06-29 · unverdicted · none · ref 46
DriftGuard: Safety-Aware Multi-Monitor Detection and Selective Adaptation for Evolving Toxicity Moderation · cs.CL · 2026-06-27 · unverdicted · none · ref 34
Blockwise Policy-Drift Gating for On-Policy Distillation · cs.LG · 2026-06-23 · unverdicted · none · ref 17
Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation · cs.LG · 2026-06-01 · unverdicted · none · ref 16
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation · cs.CL · 2026-05-12 · unverdicted · none · ref 23 · 3 links
A Formula-Driven Survey and Research Agenda for On-Policy Distillation · cs.AI · 2026-06-22 · unverdicted · none · ref 37
