Unifying distillation and privileged information

Bernhard Schölkopf; David Lopez-Paz; Léon Bottou; Vladimir Vapnik

arxiv: 1511.03643 · v3 · pith:N65CK4FSnew · submitted 2015-11-11 · 📊 stat.ML · cs.LG

classification 📊 stat.ML cs.LG

keywords distillation, machines, data, generalized, information, learn, privileged, techniques

Distillation (Hinton et al., 2015) and privileged information (Vapnik & Izmailov, 2015) are two techniques that enable machines to learn from other machines. This paper unifies these two techniques into generalized distillation, a framework to learn from multiple machines and data representations. We provide theoretical and causal insight about the inner workings of generalized distillation, extend it to unsupervised, semisupervised and multitask learning scenarios, and illustrate its efficacy on a variety of numerical simulations on both synthetic and real-world data.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

On the Generalization of Knowledge Distillation: An Information-Theoretic View
cs.IT 2026-05 unverdicted novelty 7.0

Derives upper and lower generalization bounds for the student relative to the teacher using a new distillation divergence, plus a loss-sharpness-aware bound and a bias-variance-rank decomposition in the linear Gaussian case.
On the Generalization of Knowledge Distillation: An Information-Theoretic View
cs.IT 2026-05 unverdicted novelty 7.0

Knowledge distillation generalization bounds are derived via a new distillation divergence measuring teacher-student kernel difference, with tighter bounds from teacher loss flatness.
Revisiting the Capacity Gap in Chain-of-Thought Distillation from a Practical Perspective
cs.LG 2026-04 unverdicted novelty 7.0

CoT distillation frequently degrades student performance versus pre-distillation baselines, and capacity gap effects do not consistently dominate under a realistic protocol that includes original baselines.
SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

SD-Search derives step-level supervision for search queries in reasoning agents via on-policy hindsight self-distillation using the policy as both student and teacher.
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
cs.LG 2026-05 unverdicted novelty 6.0

Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.
PEPR: Privileged Event-based Predictive Regularization for Domain Generalization
cs.CV 2026-02 unverdicted novelty 6.0

PEPR reframes learning with privileged event data as predicting latent event features from RGB to improve domain generalization in object detection and segmentation without direct cross-modal alignment.
Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning
cs.AI 2026-05 unverdicted novelty 5.0

Search-E1 interleaves vanilla GRPO with offline self-distillation via token-level forward KL alignment to privileged sibling trajectories, reaching 0.440 average EM on seven QA benchmarks with Qwen2.5-3B and beating o...
Neurosymbolic Imitation Learning with Human Guidance: A Privileged Information Approach
cs.LG 2026-05 unverdicted novelty 5.0

A neurosymbolic imitation learning approach uses privileged gaze data during training to handle high-dimensional inputs while achieving better generalization than pure neural or symbolic methods.