Well-read students learn better: The impact of student initialization on knowledge distillation

Turc, I · 2019 · arXiv 1908.08962

8 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

LLM-ODE: Data-driven Discovery of Dynamical Systems with Large Language Models

cs.LG · 2026-03-21 · unverdicted · novelty 7.0

LLM-ODE integrates large language models into genetic programming to guide symbolic search for governing equations of dynamical systems, outperforming classical GP on 91 test cases in efficiency and solution quality.

DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models

cs.CL · 2022-10-17 · conditional · novelty 7.0

DiffuSeq adapts diffusion models to conditional sequence-to-sequence text generation and reports performance matching or exceeding strong baselines including pretrained language model systems while generating more diverse outputs.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

cs.CL · 2019-09-26 · accept · novelty 7.0

ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.

Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

Heterogeneous visual agents form shared symbols via decentralized Metropolis-Hastings captioning, where encoder similarity shapes the content and symmetry of the resulting language.

Spectrum-Adaptive Generalization Bounds for Trained Deep Transformers

stat.ML · 2026-05-08 · unverdicted · novelty 6.0

Spectrum-adaptive post-hoc generalization bounds for multi-layer Transformers are derived using layerwise Schatten quantities whose indices are chosen after training based on singular-value profiles.

Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension

cs.LG · 2025-02-07 · unverdicted · novelty 6.0

In ridgeless regression with low intrinsic dimension, discrepancy between weak and strong models reduces W2S generalization variance by dim(V_s)/N in the discrepant subspace while inheriting it in the overlap.

Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed

cs.LG · 2021-01-07 · unverdicted · novelty 6.0

Denoising Student distills the multi-step denoising process of score-based and diffusion models into a single forward pass, matching GAN sampling speed while producing comparable sample quality on CIFAR-10, CelebA, and 256x256 LSUN.

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

cs.CL · 2019-10-02 · unverdicted · novelty 6.0

DistilBERT compresses BERT by 40% via pre-training distillation with a triple loss, retaining 97% performance and running 60% faster.

citing papers explorer

Showing 8 of 8 citing papers.

LLM-ODE: Data-driven Discovery of Dynamical Systems with Large Language Models

cs.LG · 2026-03-21 · unverdicted · none · ref 35

DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models

cs.CL · 2022-10-17 · conditional · none · ref 9

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

cs.CL · 2019-09-26 · accept · none · ref 34

Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning

cs.CV · 2026-05-12 · unverdicted · none · ref 37

Spectrum-Adaptive Generalization Bounds for Trained Deep Transformers

stat.ML · 2026-05-08 · unverdicted · none · ref 27

Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension

cs.LG · 2025-02-07 · unverdicted · none · ref 22

Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed

cs.LG · 2021-01-07 · unverdicted · none · ref 41

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

cs.CL · 2019-10-02 · unverdicted · none · ref 45
