Sequence-level knowledge distillation

13 Pith papers cite this work. Polarity classification is still indexing.

abstract

Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. However, to reach competitive performance, NMT models need to be exceedingly large. In this paper we consider applying knowledge distillation approaches (Bucila et al., 2006; Hinton et al., 2015) that have proven successful for reducing the size of neural models in other domains to the problem of NMT. We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and also introduce two novel sequence-level versions of knowledge distillation that further improve performance, and somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model). Our best student model runs 10 times faster than its state-of-the-art teacher with little loss in performance. It is also significantly better than a baseline model trained without knowledge distillation: by 4.2/1.7 BLEU with greedy decoding/beam search. Applying weight pruning on top of knowledge distillation results in a student model that has 13 times fewer parameters than the original teacher model, with a decrease of 0.4 BLEU.
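
As a rough illustration of the word-level variant mentioned in the abstract, the sketch below computes the cross-entropy of a student's per-token distribution against a teacher's per-token distribution. The function names (`softmax`, `word_level_kd_loss`) and the toy logits are hypothetical and not taken from the paper; this is a minimal pure-Python sketch that assumes both models emit logits over the same vocabulary, not the authors' implementation. The sequence-level variants are not shown.

```python
import math

def softmax(logits):
    """Convert a list of logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def word_level_kd_loss(teacher_logits, student_logits):
    """Mean cross-entropy of the student against the teacher, per position.

    Both arguments are lists of per-position logit vectors over the
    same vocabulary (an assumption of this sketch).
    """
    total = 0.0
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        q = softmax(t_logits)  # teacher distribution (soft targets)
        p = softmax(s_logits)  # student distribution
        total -= sum(qk * math.log(pk) for qk, pk in zip(q, p))
    return total / len(teacher_logits)

# Toy example: two target positions, vocabulary of three words.
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
matched = word_level_kd_loss(teacher, teacher)
mismatched = word_level_kd_loss(teacher, [[-1.0, 0.5, 2.0], [1.5, 0.1, 0.3]])
```

For a fixed teacher, this loss is smallest when the student's distribution matches the teacher's at every position, so `matched` comes out below `mismatched` in the toy example.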

citation-role summary

background 1

citation-polarity summary

roles

background 1

polarities

background 1

representative citing papers

TIP: Token Importance in On-Policy Distillation

cs.LG · 2026-04-15 · unverdicted · novelty 6.0 · 3 refs

A two-axis taxonomy of student entropy and teacher-student divergence identifies informative tokens in on-policy distillation, allowing near-full performance with 10-50% of tokens.

Continuous diffusion for categorical data

cs.CL · 2022-11-28 · unverdicted · novelty 5.0

The paper proposes CDCD, a continuous-time and continuous-space diffusion framework for categorical data, and reports results on language modeling tasks.
