Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.
hub
Sequence-level knowledge distillation
13 Pith papers cite this work. Polarity classification is still indexing.
abstract
Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. However to reach competitive performance, NMT models need to be exceedingly large. In this paper we consider applying knowledge distillation approaches (Bucila et al., 2006; Hinton et al., 2015) that have proven successful for reducing the size of neural models in other domains to the problem of NMT. We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and also introduce two novel sequence-level versions of knowledge distillation that further improve performance, and somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model). Our best student model runs 10 times faster than its state-of-the-art teacher with little loss in performance. It is also significantly better than a baseline model trained without knowledge distillation: by 4.2/1.7 BLEU with greedy decoding/beam search. Applying weight pruning on top of knowledge distillation results in a student model that has 13 times fewer parameters than the original teacher model, with a decrease of 0.4 BLEU.
hub tools
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
PACED applies student pass-rate weighting w(p)=p(1-p) to distillation, concentrating on the zone of proximal development and delivering up to +8.2 gains on AIME tasks with reduced forgetting.
Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.
Position-Weighted On-Policy Self-Distillation (PW-OPSD) weights later tokens more heavily after a diagnostic shows position predicts teacher reliability better than entropy, yielding +1.0 and +1.1 Avg@12 gains on AIME 2024/2025.
A two-axis taxonomy of student entropy and teacher-student divergence identifies informative tokens in on-policy distillation, allowing near-full performance with 10-50% of tokens.
A modified divergence decouples top-K teacher probabilities from the distribution tail during distillation, yielding competitive performance on decoder models with standard compute.
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
NoNN partitions a teacher model into disjoint compressed students via network science for distributed IoT inference, matching teacher accuracy with far lower per-device memory and communication.
The paper proposes CDCD, a continuous-time and continuous-space diffusion framework for categorical data, and reports results on language modeling tasks.
NPD accelerates on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to achieve 68.73% SOTA score.
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
Knowledge distillation from an external RNN language model to a seq2seq ASR model yields 9.3% CER on Chinese datasets, an 18.42% relative improvement over the baseline without test-time fusion components.
A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.
citing papers explorer
-
Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation
Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.
-
PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence
PACED applies student pass-rate weighting w(p)=p(1-p) to distillation, concentrating on the zone of proximal development and delivering up to +8.2 gains on AIME tasks with reduced forgetting.
-
Accelerating Large Language Model Decoding with Speculative Sampling
Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.
-
When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning
Position-Weighted On-Policy Self-Distillation (PW-OPSD) weights later tokens more heavily after a diagnostic shows position predicts teacher reliability better than entropy, yielding +1.0 and +1.1 Avg@12 gains on AIME 2024/2025.
-
TIP: Token Importance in On-Policy Distillation
A two-axis taxonomy of student entropy and teacher-student divergence identifies informative tokens in on-policy distillation, allowing near-full performance with 10-50% of tokens.
-
Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation
A modified divergence decouples top-K teacher probabilities from the distribution tail during distillation, yielding competitive performance on decoder models with standard compute.
-
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
-
Memory- and Communication-Aware Model Compression for Distributed Deep Learning Inference on IoT
NoNN partitions a teacher model into disjoint compressed students via network science for distributed IoT inference, matching teacher accuracy with far lower per-device memory and communication.
-
Continuous diffusion for categorical data
The paper proposes CDCD, a continuous-time and continuous-space diffusion framework for categorical data, and reports results on language modeling tasks.
-
Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing
NPD accelerates on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to achieve 68.73% SOTA score.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
-
Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition
Knowledge distillation from an external RNN language model to a seq2seq ASR model yields 9.3% CER on Chinese datasets, an 18.42% relative improvement over the baseline without test-time fusion components.
-
A Survey on Knowledge Distillation of Large Language Models
A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.