Stable On-Policy Distillation through Adaptive Target Reformulation

Hyunggu Lim; Ijun Jang; Jewon Yeom; Juan Yeo; Taesup Kim

arxiv: 2601.07155 · v2 · submitted 2026-01-12 · 💻 cs.LG · cs.AI

Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, Taesup Kim

Pith reviewed 2026-05-16 15:04 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: knowledge distillation, on-policy distillation, large language models, logit space, Veto objective, gradient stabilization, distributional alignment, adaptive target

The pith

Veto builds a geometric bridge in logit space to stabilize on-policy distillation from teacher to student models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve training instabilities that arise when distilling knowledge from large language models directly onto student-generated outputs. Standard on-policy approaches either produce pathological gradients under forward KL or suffer diversity collapse under reverse KL because the gap between novice and expert is too wide. Veto instead constructs an intermediate target distribution geometrically inside the logit space, using a single tunable beta to create a controlled bridge. This reformulation suppresses harmful gradients on low-confidence tokens while letting the student align without collapsing variety. Experiments on reasoning and generation tasks show consistent gains over supervised fine-tuning and prior on-policy methods.

Core claim

Veto is an objective-level reformulation that constructs a geometric bridge in the logit space. By introducing a tunable parameter beta, Veto serves as an Adaptive Gradient Veto that stabilizes optimization by suppressing harmful gradients on low-confidence tokens, while simultaneously acting as a Decisiveness Knob to balance reward-driven performance with output diversity.

What carries the argument

The Veto objective, which creates an intermediate target distribution geometrically in logit space to bridge the distributional gap between student and teacher without mixing data samples.
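The only formula this page quotes from the paper is the target Q(y|x) ∝ exp(z_T + β · z_S), restated there as P_T(y|x) · P_S(y|x)^β. A minimal sketch of that target construction follows, in plain Python. It covers only the intermediate distribution; the full Veto loss is not given on this page and is not reproduced here. The function names (`softmax`, `veto_target`, `product_form`) and the example logits are hypothetical, and reading "low-confidence tokens" as tokens the student assigns low probability is an assumption. Note that the two forms agree only after normalization, since exp(z_T + β · z_S) and P_T · P_S^β differ by a constant factor.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def veto_target(teacher_logits, student_logits, beta):
    """Intermediate target Q ∝ exp(z_T + beta * z_S), normalized over the vocabulary.

    Hypothetical helper: only the target from the quoted formula,
    not the paper's training objective.
    """
    combined = [zt + beta * zs for zt, zs in zip(teacher_logits, student_logits)]
    return softmax(combined)

def product_form(teacher_logits, student_logits, beta):
    """Same target written as P_T * P_S**beta, then renormalized."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    weights = [pt * ps ** beta for pt, ps in zip(p_t, p_s)]
    total = sum(weights)
    return [w / total for w in weights]

# Made-up logits over a 4-token vocabulary (illustration only).
teacher_logits = [2.0, 1.0, 0.5, -1.0]
student_logits = [0.0, 3.0, -4.0, 0.5]

q = veto_target(teacher_logits, student_logits, beta=0.5)
p_teacher = softmax(teacher_logits)

for token, (pt, qt) in enumerate(zip(p_teacher, q)):
    print(f"token {token}: teacher={pt:.4f}  target={qt:.4f}")
```

With beta = 0 the target reduces to the teacher distribution, and for beta > 0 tokens that the student considers unlikely (token 2 in the example) receive less target mass than the teacher alone would give them. Working on summed logits avoids forming P_S^β explicitly, which can underflow for very unlikely tokens.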

If this is right

Pathological gradients from direct forward KL are suppressed by the veto on low-confidence tokens.
Diversity collapse typical of reverse KL regimes is avoided through the decisiveness control.
The same beta parameter simultaneously improves both reward alignment and output variety.
No sample mixing is required; the bridge operates entirely inside the logit space.
Outperformance holds across multiple reasoning and generation benchmarks without task-specific redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The geometric logit-space construction could transfer to other teacher-student alignment settings where direct matching fails.
A single beta knob may reduce the need for multi-stage curricula in model compression pipelines.
If the bridge scales to very large gaps, it could reduce the number of distillation rounds needed for tiny models.
Similar geometric intermediates might stabilize other on-policy objectives outside pure distillation.

Load-bearing premise

An intermediate target built geometrically in logit space will close the gap between student and teacher reliably without introducing new optimization pathologies or forcing extensive per-task beta retuning.

What would settle it

Training runs on a very small student distilled from a much larger teacher on hard reasoning tasks where the initial output distributions diverge sharply, checking whether gradients remain stable and performance still exceeds baselines.

Original abstract

Knowledge distillation (KD) is a widely adopted technique for transferring knowledge from large language models to smaller student models; however, conventional supervised KD often suffers from a distribution mismatch between training and inference. While on-policy KD approaches attempt to mitigate this issue by learning directly from student-generated outputs, they frequently encounter training instabilities because the distributional gap between the novice student and the expert teacher is often too wide to bridge directly. These challenges manifest as pathological gradients in forward KL objectives or diversity collapse in reverse KL regimes. To address these limitations, we propose Veto, an objective-level reformulation that constructs a geometric bridge in the logit space. Unlike prior methods that mix data samples, Veto creates an intermediate target distribution that promotes alignment between the teacher and the student. By introducing a tunable parameter beta, Veto serves as an Adaptive Gradient Veto that stabilizes optimization by suppressing harmful gradients on low-confidence tokens, while simultaneously acting as a Decisiveness Knob to balance reward-driven performance with output diversity. Extensive experiments across various reasoning and generation tasks demonstrate that Veto consistently outperforms supervised fine-tuning and existing on-policy baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Veto's logit-space geometric bridge with beta as both gradient veto and diversity knob is a plausible stabilization tweak for on-policy KD, but the abstract's lack of equations and tuning details leaves the gains looking potentially fragile to hyperparameter choice.

The core move is constructing an intermediate target distribution by geometric interpolation in logit space between the student and teacher, then using a single scalar beta to suppress gradients on low-confidence tokens while also controlling how decisive the student becomes. This is framed as fixing the instability that comes from jumping straight to the teacher's distribution in on-policy settings.

The paper does a clean job naming the two classic failure modes (pathological gradients in forward KL and diversity collapse in reverse KL) and positions the geometric bridge as a way to create a softer target without mixing raw samples. That framing is useful and the dual role for beta is a neat way to think about the stability-diversity trade-off in one knob. Credit for trying to keep the method inside the on-policy regime rather than falling back to supervised fine-tuning.

The soft spots are exactly where the stress-test note points: beta is explicitly tunable, yet the abstract gives no equations for the objective, no ablation on its sensitivity, no error bars, and no indication whether a single value or narrow schedule works across tasks. If optimal beta shifts noticeably between reasoning and generation benchmarks, the reported wins could trace to extra search effort rather than the reformulation itself. The central claim that this bridge reliably closes the gap without introducing new optimization issues therefore rests on experiments we cannot inspect here.

This is for people already working on LLM distillation and compression who need practical on-policy tricks. A reader who wants to try the method would get value from the full paper if it includes the loss definition, beta selection protocol, and per-task ablations. I would send it to peer review so referees can check whether the gains survive when beta is held fixed or when the search budget is matched to the baselines.

Referee Report

2 major / 2 minor

Summary. The paper proposes Veto, an on-policy knowledge distillation objective for LLMs that constructs an intermediate target distribution via geometric interpolation in logit space controlled by a tunable scalar beta. This is positioned as an 'Adaptive Gradient Veto' that suppresses gradients on low-confidence tokens while acting as a 'Decisiveness Knob' to trade off reward-driven performance against output diversity. The central claim is that Veto mitigates the instabilities of direct forward/reverse KL on-policy KD and the train-inference mismatch of supervised KD, with extensive experiments across reasoning and generation tasks showing consistent outperformance over SFT and prior on-policy baselines.

Significance. If the experimental superiority holds under fixed beta schedules and without per-task retuning, the method would offer a practical stabilization technique for on-policy distillation that avoids both gradient pathologies and diversity collapse. The logit-space geometric bridge is a clean reformulation that could generalize beyond the reported tasks. However, the absence of any parameter-free derivation or external benchmark for beta, combined with the explicit framing of beta as a tunable knob, means the significance is conditional on demonstrating that the performance gains are not artifacts of additional hyperparameter search.

major comments (2)

[§3 and §4] §3 (Method) and §4 (Experiments): The claim that Veto 'consistently outperforms' baselines rests on the choice of beta, yet the manuscript supplies no ablation on beta sensitivity, no cross-task variation statistics, and no protocol for selecting beta (e.g., grid search budget or fixed schedule). If optimal beta differs materially across tasks, the reported gains may reflect extra tuning rather than the geometric reformulation itself.
[§4] §4 (Experiments): No error bars, confidence intervals, or statistical significance tests are reported for the 'consistent outperformance' claim. Without these, it is impossible to determine whether the gains over on-policy baselines exceed run-to-run variance, especially given the known sensitivity of on-policy objectives to sampling and optimization details.

minor comments (2)

[§3] Notation for the geometric interpolation (logit-space mixing) should be introduced with an explicit equation early in §3 rather than described only in prose.
[Abstract and §1] The abstract and introduction use the terms 'Adaptive Gradient Veto' and 'Decisiveness Knob' without first defining them formally; a short definitional sentence would improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional analyses and statistical reporting as outlined.

Referee: [§3 and §4] §3 (Method) and §4 (Experiments): The claim that Veto 'consistently outperforms' baselines rests on the choice of beta, yet the manuscript supplies no ablation on beta sensitivity, no cross-task variation statistics, and no protocol for selecting beta (e.g., grid search budget or fixed schedule). If optimal beta differs materially across tasks, the reported gains may reflect extra tuning rather than the geometric reformulation itself.

Authors: We agree that the current manuscript would benefit from more explicit analysis of beta. While the method is designed with beta as an interpretable knob for the diversity-performance trade-off, we will add a dedicated ablation subsection in the revision. This will include performance curves for beta in [0.1, 0.9] across all reported tasks, summary statistics on cross-task variation (mean and std of optimal beta), and a recommended selection protocol (default beta=0.5 with optional small grid search on a held-out validation set). These additions will demonstrate that the core gains arise from the geometric interpolation rather than task-specific tuning.
revision: yes

Referee: [§4] §4 (Experiments): No error bars, confidence intervals, or statistical significance tests are reported for the 'consistent outperformance' claim. Without these, it is impossible to determine whether the gains over on-policy baselines exceed run-to-run variance, especially given the known sensitivity of on-policy objectives to sampling and optimization details.

Authors: This is a fair criticism of the current presentation. In the revised manuscript we will rerun the main experiments with 5 independent random seeds, report mean performance with standard deviation error bars, include 95% confidence intervals, and add paired statistical significance tests (Wilcoxon signed-rank) against each baseline. These results will be presented in updated tables and figures to substantiate the outperformance claims.
revision: yes
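
The reporting promised in the second response (mean, standard deviation, 95% confidence interval, paired Wilcoxon signed-rank test) can be sketched as below, using only the Python standard library. The scores are made-up placeholders, not results from the paper, and the helper names are hypothetical. The sketch assumes the pairing unit is the random seed, which the rebuttal does not state. Under that assumption an exact two-sided Wilcoxon signed-rank test on 5 pairs cannot return a p-value below 2/32 = 0.0625, so it cannot reach the conventional 0.05 level; pairing over seed-task combinations or running more seeds would be needed.

```python
import math
from itertools import product
from statistics import mean, stdev

# Two-sided 95% Student-t critical values, keyed by degrees of freedom.
T_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776}

def mean_std_ci95(scores):
    """Mean, sample std, and t-based 95% CI for 2 to 5 runs."""
    n = len(scores)
    m, s = mean(scores), stdev(scores)
    half_width = T_975[n - 1] * s / math.sqrt(n)
    return m, s, (m - half_width, m + half_width)

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def wilcoxon_exact_p(method_scores, baseline_scores):
    """Exact two-sided Wilcoxon signed-rank p-value by enumerating sign patterns.

    Zero differences are dropped. Enumeration is only practical for small n.
    """
    diffs = [a - b for a, b in zip(method_scores, baseline_scores) if a != b]
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    hits = 0
    for signs in product((False, True), repeat=len(ranks)):
        w = sum(r for r, keep in zip(ranks, signs) if keep)
        if min(w, total - w) <= observed:
            hits += 1
    return hits / 2 ** len(ranks)

# Placeholder accuracy percentages for 5 seeds (illustration only).
veto_scores = [61, 63, 60, 64, 62]
baseline_scores = [58, 59, 57, 60, 59]

m, s, ci = mean_std_ci95(veto_scores)
p = wilcoxon_exact_p(veto_scores, baseline_scores)
print(f"mean={m:.2f} std={s:.2f} ci95=({ci[0]:.2f}, {ci[1]:.2f}) p={p:.4f}")
```

Even when the method wins on every seed, as in the placeholder data, the exact p-value is 0.0625, which is the floor for 5 pairs.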

Circularity Check

0 steps flagged

No significant circularity in Veto formulation or claims

The paper proposes Veto as a new objective reformulation that inserts a tunable beta-controlled geometric interpolation between student and teacher logits. This construction is presented explicitly as a design choice (not derived from a parameter-free first-principles argument), and the claimed stability and performance gains are supported by empirical experiments rather than by showing that any reported metric equals the input data or a fitted hyperparameter by algebraic identity. No self-citation chain, uniqueness theorem, or ansatz is invoked to force the result; beta is openly labeled tunable and its balancing role is part of the method definition. The derivation therefore remains self-contained against external benchmarks and does not reduce to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The claim depends on one free parameter beta whose value controls both gradient suppression and output diversity, plus the domain assumption that logit-space interpolation forms a stable bridge between mismatched distributions.

free parameters (1)

beta
Tunable scalar that simultaneously acts as adaptive gradient veto and decisiveness knob; its specific value determines the claimed stability and performance gains.

axioms (1)

Domain assumption: A geometrically constructed intermediate target in logit space meaningfully reduces the distributional gap between student and teacher without introducing new instabilities.
Invoked when the abstract states that Veto creates an intermediate target distribution that promotes alignment.

invented entities (1)

Veto objective (no independent evidence)
purpose: Reformulated distillation loss that constructs the geometric bridge and supplies the adaptive veto mechanism.
Newly defined training objective whose properties are asserted to solve the stated instabilities.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Q(y|x) ∝ exp(z_T + β · z_S) = P_T(y|x) · P_S(y|x)^β ... Adaptive Gradient Veto that suppresses harmful gradients on low-confidence tokens

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear
Paper passage: Veto serves as a Decisiveness Knob to balance reward-driven performance with output diversity

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation
cs.LG 2026-05 unverdicted novelty 7.0

Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasonin...
The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
cs.LG 2026-05 unverdicted novelty 7.0

On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
cs.CV 2026-05 unverdicted novelty 7.0

D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by minimizing distribution differences between a text-only student and a multimodal teacher on the student's o...
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
cs.CL 2026-05 unverdicted novelty 7.0

MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
cs.LG 2026-04 unverdicted novelty 7.0

TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.
SOD: Step-wise On-policy Distillation for Small Language Model Agents
cs.CL 2026-05 unverdicted novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...
SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
cs.LG 2026-05 unverdicted novelty 6.0

Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
cs.LG 2026-04 unverdicted novelty 6.0

On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
On-Policy Distillation with Best-of-N Teacher Rollout Selection
cs.CV 2026-05 unverdicted novelty 5.0

BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.
On-Policy Distillation with Best-of-N Teacher Rollout Selection
cs.CV 2026-05 unverdicted novelty 5.0

BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.