Entropy-aware on-policy distillation of language models

Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, Kimin Lee · 2026 · cs.LG · arXiv 2603.07079

Canonical reference. 100% of citing Pith papers cite this work as background.

29 Pith papers citing it

Background 100% of classified citations

abstract

On-policy distillation is a promising approach for transferring knowledge between language models, where a student learns from dense token-level signals along its own trajectories. This framework typically uses reverse KL divergence, encouraging the student to match the teacher's high-confidence predictions. However, we show that the mode-seeking property of reverse KL reduces generation diversity and yields unstable learning signals when the teacher distribution has high entropy. To address this, we introduce Entropy-Aware On-Policy Distillation. Our key idea is augmenting the standard reverse KL objective with forward KL when teacher entropy is high, capturing the full range of plausible outputs while retaining precise imitation elsewhere. It balances mode-seeking precision with mode-covering robustness without sacrificing on-policy training efficiency. Experiments show that our method maintains generation diversity (sustained token-level entropy) and improves student-teacher alignment (lower forward KL on high-entropy tokens). Across six math reasoning benchmarks, this yields Pass@8 accuracy gains of +1.37 for Qwen3-0.6B-Base, +2.39 for Qwen3-1.7B-Base, and +5.05 for Qwen3-4B-Base compared to baseline on-policy distillation methods. These results demonstrate that accounting for teacher uncertainty is essential for maintaining diversity and achieving effective knowledge transfer.
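The abstract's key idea, adding a forward KL term to the reverse KL objective only where the teacher is uncertain, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the hard entropy gate, the names `entropy_threshold` and `fkl_weight`, and the function `entropy_aware_loss` are all assumptions made for illustration, and the paper may combine the two terms differently.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy_aware_loss(student_logits, teacher_logits,
                       entropy_threshold=1.0, fkl_weight=1.0):
    """Per-token loss: reverse KL everywhere, plus forward KL on tokens
    where teacher entropy exceeds a threshold.

    entropy_threshold and fkl_weight are hypothetical knobs, not values
    taken from the paper. Inputs have shape (num_tokens, vocab_size).
    """
    eps = 1e-12  # guards log(0)
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    log_s = np.log(p_s + eps)
    log_t = np.log(p_t + eps)

    # Reverse KL, KL(student || teacher): mode-seeking.
    reverse_kl = (p_s * (log_s - log_t)).sum(axis=-1)
    # Forward KL, KL(teacher || student): mode-covering.
    forward_kl = (p_t * (log_t - log_s)).sum(axis=-1)

    # Teacher entropy per token, in nats.
    teacher_entropy = -(p_t * log_t).sum(axis=-1)
    # Assumed hard gate: 1.0 where the teacher is uncertain, else 0.0.
    gate = (teacher_entropy > entropy_threshold).astype(float)

    per_token = reverse_kl + fkl_weight * gate * forward_kl
    return per_token, gate

# Two token positions over a 4-word vocabulary: a confident teacher
# (peaked logits) and an uncertain one (uniform logits).
teacher = np.array([[10.0, 0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0, 0.0]])
student = np.array([[2.0, 1.0, 0.0, 0.0],
                    [3.0, 0.0, 0.0, 0.0]])
loss, gate = entropy_aware_loss(student, teacher)
print(gate, loss)
```

In this sketch the forward KL term switches on only at the second position, where the teacher distribution is uniform. In an actual on-policy setup the logits would come from scoring the student's own sampled trajectories with both models.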

citation-role summary

background: 6 · method: 1

citation-polarity summary

background: 7

representative citing papers

Visual-Advantage On-Policy Distillation for Vision-Language Models

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.

Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.

The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.

KL for a KL: On-Policy Distillation with Control Variate Baseline

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.

Rubric-based On-policy Distillation

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.

MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.

TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

cs.LG · 2026-04-27 · unverdicted · novelty 7.0

TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.

ARKD: Adaptive Reinforcement Learning-Guided Bidirectional KL Divergence Distillation for Text Generation

cs.CL · 2026-06-29 · unverdicted · novelty 6.0

ARKD uses an RL policy network to adaptively balance FKL and RKL in LLM distillation, claiming gains of 0.4-0.6 points on Rouge-L and BertScore over baselines.

SEAD: Competence-Aware On-Policy Distillation via Entropy-Guided Supervision

cs.CL · 2026-06-26 · unverdicted · novelty 6.0

SEAD applies entropy-guided token selection, KL annealing, and easy-to-hard curriculum to on-policy distillation and reports +4.8 average accuracy gain over vanilla OPD on six math benchmarks with OLMo-3 models.

ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents

cs.AI · 2026-06-26 · unverdicted · novelty 6.0

ATOD anneals from on-policy distillation to RL with turn-level reweighting to improve multi-turn agent success rates on ALFWorld, WebShop, and Search-QA.

Rethinking the Role of Temperature in Large Language Model Distillation

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Including temperature scaling makes forward KL divergence outperform reverse KL in LLM distillation on instruction benchmarks, overturning the τ=1 preference for reverse KL.

Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.

On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation

cs.LG · 2026-05-20 · conditional · novelty 6.0

On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.

When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Position-Weighted On-Policy Self-Distillation (PW-OPSD) weights later tokens more heavily after a diagnostic shows position predicts teacher reliability better than entropy, yielding +1.0 and +1.1 Avg@12 gains on AIME 2024/2025.

Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

cs.CL · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

Local teachability collapse occurs in later trajectory segments during strong-to-weak OPD; a margin-based release rule using top-K teacher advantage and BIC change-point detection on sentence segments outperforms full-trajectory supervision on five in-domain benchmarks and preserves out-of-domain pe

Multi-Rollout On-Policy Distillation via Peer Successes and Failures

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

MOPD improves on-policy distillation by using peer successes and failures from multiple rollouts to construct more informative teacher signals, yielding consistent gains over baselines on reasoning benchmarks.

Flow-OPD: On-Policy Distillation for Flow Matching Models

cs.CV · 2026-05-08 · conditional · novelty 6.0 · 5 refs

Flow-OPD is a two-stage on-policy distillation method for flow matching models that lifts GenEval from 63 to 92 and OCR from 59 to 94 on SD 3.5 Medium while preserving fidelity.

SOD: Step-wise On-policy Distillation for Small Language Model Agents

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

cs.CL · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

cs.CL · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

UniSD unifies self-distillation components for autoregressive LLMs and its full integrated version improves base models by 5.4 points and baselines by 2.8 points across six benchmarks.

Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

cs.LG · 2026-05-05 · unverdicted · novelty 6.0

Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.

DOPD: Dual On-policy Distillation

cs.AI · 2026-06-29 · unverdicted · novelty 5.0

DOPD is an advantage-aware dual distillation method that dynamically assigns token supervision from either privileged teacher or student to transfer capability while mitigating non-replicable information asymmetry in on-policy distillation.
