Distilling the Knowledge in a Neural Network
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
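The two ideas in the abstract, averaging an ensemble's predictions and then training a single model to reproduce them, can be illustrated with a small sketch. The abstract does not spell out the compression technique, so this example assumes one common formulation of distillation: the student is trained against temperature-softened "soft targets" from the ensemble. The function names, the temperature value, and the logits are all hypothetical and chosen only for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_average(prob_lists):
    """Average the class probabilities predicted by several models."""
    n = len(prob_lists)
    return [sum(column) / n for column in zip(*prob_lists)]

def distillation_loss(student_logits, teacher_probs, temperature=1.0):
    """Cross-entropy between teacher soft targets and the student's softened prediction."""
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s)
                for t, s in zip(teacher_probs, student_probs) if t > 0)

# Hypothetical logits from three ensemble members for one 3-class example.
ensemble_logits = [[2.0, 1.0, 0.1], [1.8, 1.2, 0.0], [2.2, 0.7, 0.3]]
T = 2.0  # assumed temperature, not a value taken from the abstract

# Soft targets: the ensemble's averaged, temperature-softened prediction.
soft_targets = ensemble_average([softmax(l, T) for l in ensemble_logits])

# Loss for a hypothetical single student model on the same example.
loss = distillation_loss([1.5, 1.0, 0.2], soft_targets, T)
```

In a real training loop, a loss of this kind would be minimized over the student's parameters across a whole dataset; here it is evaluated once only to show the shape of the computation.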
Forward citations
Cited by 60 Pith papers
-
PrimeKG-CL: A Continual Graph Learning Benchmark on Evolving Biomedical Knowledge Graphs
PrimeKG-CL supplies the first continual graph learning benchmark using authentic temporal snapshots from nine biomedical databases, showing strong interactions between embedding decoders and learning strategies plus l...
-
Inference-Time Refinement Closes the Synthetic-Real Gap in Tabular Diffusion
Inference-time refinement of pre-trained tabular diffusion models via Bidirectional Chamfer Refinement achieves median 8.6% better downstream performance than real data across 15 benchmarks while preserving fidelity a...
-
Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters
Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.
-
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
-
Emerging Properties in Self-Supervised Vision Transformers
Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.
-
Language Models are Few-Shot Learners
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
-
Learning Through Noise: Why Subliminal Learning Works and When It Fails
Subliminal learning occurs via compatible auxiliary and class output heads on task-unrelated inputs, even with random hidden layers or architecture changes, with theory and upper bounds on failure.
-
Slimmable ConvNeXt: Width-Adaptive Inference for Efficient Multi-Device Deployment
Slimmable ConvNeXt adapts ConvNeXt for width-adaptive inference using LayerNorm and inverted bottlenecks, reaching 80.8% top-1 at 4.5 GMACs and outperforming HydraViT, MatFormer, and SortedNet on ImageNet-1k.
-
Dual-Integrated Low-Latency Single-Lens Infrared Computational Imaging for Object Detection
PDI-Net integrates physics-aware priors into a dual network that shares semi-reconstruction features with a YOLO detector, cutting inference time 84% while raising mAP 5% on low-SNR M3FD data.
-
Visual-Advantage On-Policy Distillation for Vision-Language Models
VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.
-
X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation
X-Token proposes projection-guided P-KL and H-KL losses to fix uncommon-token suppression and over-conservative matching in logit-based cross-tokenizer distillation, yielding gains over GOLD on Llama-3.2-1B.
-
Layer-wise Token Compression for Efficient Document Reranking
Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, ...
-
Layer-wise Token Compression for Efficient Document Reranking
Layer-wise Token Compression applies adaptive pooling at middle transformer layers to increase QPS by up to 116% on document ranking with little or no loss in quality.
-
Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning
Proposes weighted aggregation of clusters and self-distillation-driven token pruning to improve both accuracy and efficiency in ViT-based visual place recognition.
-
Code Generation by Differential Test Time Scaling
DiffCodeGen clusters code candidates by behavioral similarity from fuzzing-synthesized inputs and selects the largest cluster's medoid, matching or exceeding prior test-time scaling methods with far less token and time cost.
-
When Does Model Collapse Occur in Structured Interactive Learning?
Model collapse occurs in structured interactive learning if and only if the directed interaction graph satisfies a specific topological condition, with finite-sample guarantees for linear regression and asymptotic res...
-
Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding
Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.
-
Self-Distillation is Optimal Among Spectral Shrinkage Estimators in Spiked Covariance Models
s-step self-distillation is optimal among spectral shrinkage estimators for s-spiked covariance matrices and necessary for optimality.
-
Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space
In a combinatorial toy setting, winning lottery tickets preserve families of compatible feature locations in early feature space that balance proximity to final codes with low interference, rather than specific weight...
-
When Bits Break Recourse: Counterfactual-Faithful Quantization
CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.
-
Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation
Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasonin...
-
Continual Learning of Domain-Invariant Representations
Introduces replay-based continual learning with sequential invariance alignment to learn domain-invariant representations, outperforming baselines on generalization to unseen domains across six datasets in vision, med...
-
DIPA: Distilled Preconditioned Algorithms for Solving Imaging Inverse Problems
DIPA learns preconditioning operators via distillation from a teacher with a better sensing matrix to improve reconstruction quality for the student's physically constrained matrix in imaging inverse problems.
-
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio versions via TTS and noise, showing model-dependent performance with small text-to-voice gaps of 1.8-4.8 points on Confetti and When2Call.
-
TILT: Target-induced loss tilting under covariate shift
TILT adds a target-data penalty on an auxiliary predictor component to induce effective importance weighting for unsupervised domain adaptation under covariate shift.
-
Evolving Layer-Specific Scalar Functions for Hardware-Aware Transformer Adaptation
Genetic programming evolves heterogeneous layer-specific scalar functions to approximate layer normalization in pre-trained ViTs, capturing 91.6% variance versus 70.2% for uniform baselines and recovering 84.25% Image...
-
Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning
SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.
-
Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation
Reward-Weighted On-Policy Distillation with an open property-equivalence verifier produces a 7B model that surpasses prior SOTA on NL-to-SVA generation across pass@1/5/10 metrics.
-
On the Generalization of Knowledge Distillation: An Information-Theoretic View
Derives upper and lower generalization bounds for the student relative to the teacher using a new distillation divergence, plus a loss-sharpness-aware bound and a bias-variance-rank decomposition in the linear Gaussian case.
-
On the Generalization of Knowledge Distillation: An Information-Theoretic View
Knowledge distillation generalization bounds are derived via a new distillation divergence measuring teacher-student kernel difference, with tighter bounds from teacher loss flatness.
-
SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting
SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-P...
-
When to Trust Confidence Thresholding: Calibration Diagnostics for Pseudo-Labelled Regression
Attenuation bias from confidence thresholding in pseudo-labelled regression equals a closed-form function of residual score variance V* after partialling out controls X, yielding a (V*, κ) safety rule computable befor...
-
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
-
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
-
Minimax Rates and Spectral Distillation for Tree Ensembles
Spectral analysis of tree ensembles produces minimax rates for random forests governed by kernel eigenvalue decay and enables distillation of RFs and GBMs into compact models via leading eigenfunctions and singular vectors.
-
DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers
DORA uses an online RL agent to adaptively merge tokens in Vision Transformers, reporting better accuracy-efficiency trade-offs than static baselines on ImageNet and OOD sets.
-
Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery
SkyPart uses learnable prototypes for patch grouping, altitude modulation only in training, graph-attention readout, and Kendall-weighted loss to set new state-of-the-art single-pass performance on SUES-200, Universit...
-
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
-
Generative Diffusion Prior Distillation for Long-Context Knowledge Transfer
GDPD treats partial student features as degraded observations and uses a learned diffusion prior over teacher features to sample restorative long-context targets for improved partial time-series classification.
-
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
-
Locking Pretrained Weights via Deep Low-Rank Residual Distillation
DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via mo...
-
Identified-Set Geometry of Distributional Model Extraction under Top-$K$ Censored API Access
Top-K logit censoring bounds the total-variation diameter of compatible teacher distributions by U_K but permits substantial capability transfer via distillation even when KL divergence is near zero.
-
Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks
In extensive-width networks, features are recovered sequentially through sharp phase transitions, yielding an effective width k_c that unifies Bayes-optimal generalization error scaling as Θ(k_c d / n).
-
Simpson's Paradox in Behavioral Curves: How Aggregation Distorts Parametric Models of User Dynamics
Aggregation distorts parametric behavioral curve peaks by factors of 3-5x via Simpson's paradox and survival bias, shown by individual vs. aggregate comparisons on Goodreads and Amazon datasets with a negative control.
-
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
-
CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.
-
The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.
-
Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation
Energy-navigated trajectory shaping during training produces 8-step discrete flow matching students that achieve 32% lower perplexity than 1024-step teachers on 170M language models with unchanged inference cost.
-
Characterizing and Correcting Effective Target Shift in Online Learning
Online kernel regression equals offline regression with shifted targets; correcting the targets lets online learning match offline performance and outperform true targets in continual image classification.
-
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...
-
Stochastic Transition-Map Distillation for Fast Probabilistic Inference
STMD distills the full transition map of diffusion sampling SDEs into a conditional Mean Flow model to enable fast one- or few-step stochastic sampling without teacher models or bi-level optimization.
-
Rubric-based On-policy Distillation
Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...
-
Zero-Shot Neural Network Evaluation with Sample-Wise Activation Patterns
SWAP-Score evaluates neural networks without training by quantifying sample-wise activation patterns, achieving high correlation with true performance on CIFAR-10 for CNNs and GLUE for Transformers while enabling fast NAS.
-
Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control
Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
-
LoopQ: Quantization for Recursive Transformers
LoopQ provides a loop-aware PTQ framework for recursive Transformers that mitigates distribution shift, state reuse, and recursive error accumulation, yielding 68.8% higher average accuracy and 87.7% lower perplexity ...
-
Locally Near Optimal Piecewise Linear Regression in High Dimensions via Difference of Max-Affine Functions
ABGD parametrizes piecewise linear functions as difference of max-affine functions and converges linearly to an epsilon-accurate solution with O(d max(sigma/epsilon,1)^2) samples under sub-Gaussian noise, which is min...
-
SMolLM: Small Language Models Learn Small Molecular Grammar
A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.
-
A Testable Certificate for Constant Collapse in Teacher-Guided VAEs
For any fixed nonconstant teacher T, the best constant student has alignment cost exactly equal to the teacher mutual information I_T(X;T); a latent-only witness below this threshold with margin cannot be constant.
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by minimizing distribution differences between a text-only student and a multimodal teacher on the student's o...