Draft-OPD: On-Policy Distillation for Speculative Draft Models

2026 · cs.CL · arXiv 2605.29343

1 Pith paper cites this work.

abstract

Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, such as EAGLE-3 or DFlash, is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SFT quickly plateaus: the draft model's acceptance length on test data stops improving. The reason is an offline-to-inference mismatch: in SFT, the drafter learns from fixed target-generated trajectories, whereas during speculative decoding it is evaluated on blocks proposed under its own policy. This motivates on-policy distillation (OPD), where the target model supervises the drafter on draft-induced states. Yet OPD remains difficult for draft models: they cannot reliably roll out complete sequences on their own, while target-assisted generation makes the collected sequences follow the target distribution and thus eliminates the on-policy signal. We therefore propose Draft-OPD, which uses target-assisted rollout for stable continuations and replays drafting from the error positions exposed by verification. This allows the drafter to learn from target feedback on both accepted and rejected proposals, focusing training on the draft-induced errors that limit speculative acceptance. Experiments show that Draft-OPD achieves over 5x lossless acceleration for thinking models across diverse tasks, improving over EAGLE-3 and DFlash by 23% and 13%, respectively.
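To make the terms in the abstract concrete, the following is a minimal Python sketch of a target-assisted rollout with greedy verification. It is an illustration under stated assumptions, not the authors' implementation: the function names (`verify_block`, `target_assisted_rollout`), the toy next-token functions, and the greedy matching rule are all hypothetical, and verification is written as a sequential loop for readability even though a real system would score the whole block in one parallel target pass. The sketch records the acceptance length per block and the positions where verification rejected a draft token, which is the kind of position the abstract describes replaying from. The replay and training loss are not shown, since the abstract does not specify them.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical interface: a greedy next-token function over a token context.
NextToken = Callable[[List[Token]], Token]


def verify_block(prefix: List[Token], proposal: List[Token],
                 target_next: NextToken) -> Tuple[int, Token]:
    """Greedy verification of a drafted block.

    Returns (number of accepted draft tokens, the target's own token at the
    stopping position). Sequential here for clarity only.
    """
    context = list(prefix)
    for i, tok in enumerate(proposal):
        expected = target_next(context)
        if tok != expected:
            return i, expected  # first rejected position, with the target's correction
        context.append(tok)
    return len(proposal), target_next(context)  # all accepted: target adds a bonus token


def target_assisted_rollout(prompt: List[Token], draft_next: NextToken,
                            target_next: NextToken, block_size: int,
                            max_new_tokens: int):
    """Draft-then-verify generation that logs draft error positions."""
    seq = list(prompt)
    error_positions: List[int] = []  # sequence indices where a draft token was rejected
    accept_lengths: List[int] = []   # tokens committed per verification step
    while len(seq) - len(prompt) < max_new_tokens:
        # The drafter proposes a block under its own policy.
        ctx, proposal = list(seq), []
        for _ in range(block_size):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        n_acc, correction = verify_block(seq, proposal, target_next)
        if n_acc < block_size:
            error_positions.append(len(seq) + n_acc)
        seq.extend(proposal[:n_acc])
        seq.append(correction)  # the committed sequence always follows the target
        accept_lengths.append(n_acc + 1)
    return seq, error_positions, accept_lengths


# Toy models (assumptions for illustration): the target counts upward mod 10;
# the drafter agrees except right after token 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


seq, errors, lengths = target_assisted_rollout([0], toy_draft, toy_target,
                                               block_size=3, max_new_tokens=8)
mean_accept = sum(lengths) / len(lengths)
```

In this toy run the committed sequence is identical to what the target alone would produce, which is the sense in which the speed-up is lossless. The single logged error position marks the one state where the drafter's own policy diverged from the target.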

representative citing papers

Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

cs.LG · 2026-06-10 · unverdicted · novelty 5.0

Bebop counters entropy-driven drops in MTP acceptance during RL by switching to probabilistic rejection sampling and training with an end-to-end TV loss, delivering up to 1.8x end-to-end acceleration on Qwen models without online MTP retraining.
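For readers unfamiliar with the terms in this summary, the sketch below shows the standard speculative-sampling acceptance rule and its link to total variation (TV) distance. It assumes that "probabilistic rejection sampling" refers to this standard rule and that "TV" means total variation, neither of which the summary confirms. All function names are hypothetical and the distributions are toy values. Under this rule a draft token x sampled from q is accepted with probability min(1, p(x)/q(x)), and the expected acceptance rate equals 1 - TV(p, q).

```python
import random
from typing import List, Sequence, Tuple


def acceptance_probability(p: Sequence[float], q: Sequence[float]) -> float:
    """Expected acceptance rate of the standard rule: sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance: 0.5 * sum_x |p(x) - q(x)|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def residual_distribution(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Distribution to resample from after a rejection: normalized max(0, p - q)."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(raw)
    if z == 0.0:  # p == q, so a rejection cannot occur; fall back to p
        return list(p)
    return [r / z for r in raw]


def speculative_sample_step(p: Sequence[float], q: Sequence[float],
                            rng: random.Random) -> Tuple[int, bool]:
    """Sample one token distributed according to p, using a proposal from q."""
    x = rng.choices(range(len(q)), weights=q, k=1)[0]  # draft proposal, so q[x] > 0
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    r = residual_distribution(p, q)
    return rng.choices(range(len(r)), weights=r, k=1)[0], False


# Toy target (p) and draft (q) distributions over three tokens (illustrative values).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
acc = acceptance_probability(p, q)
tv = total_variation(p, q)
token, accepted = speculative_sample_step(p, q, random.Random(0))
```

Because the acceptance rate is 1 - TV(p, q), lowering the TV distance between draft and target distributions directly raises expected acceptance, which gives one plausible reading of why a TV-based loss would be paired with this sampling rule.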
