Recognition: 2 theorem links · Lean Theorem
Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed
Pith reviewed 2026-05-17 03:52 UTC · model grok-4.3
The pith
Knowledge distillation turns a multi-step denoising model into a fast single-step generator matching GAN quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We establish a novel connection between knowledge distillation and image generation with a technique that distills a multi-step denoising process into a single step, resulting in a sampling speed similar to other single-step generative models. Our Denoising Student generates high quality samples comparable to GANs on the CIFAR-10 and CelebA datasets, without adversarial training. We demonstrate that our method scales to higher resolutions through experiments on 256 x 256 LSUN.
What carries the argument
The Denoising Student, a network trained via knowledge distillation to replicate the final clean image produced by a teacher's full multi-step denoising trajectory in a single forward pass.
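To make that concrete, here is a minimal sketch of one distillation step under the setup described above; it is not the authors' code. It assumes a frozen `teacher_sample` function that runs the full multi-step denoising trajectory from a noise vector and a one-pass `student` network, and it uses a plain L2 regression onto the teacher's output as a simplifying stand-in for the paper's exact objective.

```python
# Minimal sketch (assumed names, not the paper's code): one distillation step in which
# the student regresses, in a single forward pass, onto the teacher's fully denoised output.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher_sample, optimizer, batch_size, img_shape, device="cuda"):
    # Shared initial noise: both teacher and student start from the same z.
    z = torch.randn(batch_size, *img_shape, device=device)
    with torch.no_grad():
        x_teacher = teacher_sample(z)        # full multi-step denoising trajectory (frozen teacher)
    x_student = student(z)                   # single forward pass through the student
    loss = F.mse_loss(x_student, x_teacher)  # simple L2 target; the paper's exact loss may differ
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because both networks start from the same noise vector, the student is fit as a deterministic map from noise to the teacher's final sample rather than directly to the data distribution, which is what makes a single forward pass possible.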
If this is right
- Sampling speed improves by two to three orders of magnitude, matching single-step generative models such as GANs and VAEs.
- Sample quality remains comparable to GANs on CIFAR-10 and CelebA without adversarial training.
- The method scales to higher-resolution images such as 256 by 256 on LSUN.
- No additional regularization is needed beyond standard distillation to achieve the speed-quality tradeoff.
Where Pith is reading between the lines
- The same distillation idea could be tested on conditional versions of the teacher to enable fast controlled generation.
- Diversity statistics of the student outputs could be compared directly to the teacher to check mode coverage.
- The approach might transfer to other iterative generative tasks such as audio waveform synthesis where multi-step refinement is common.
Load-bearing premise
A single forward pass through the student can faithfully approximate the distribution produced by the full multi-step teacher denoising trajectory without requiring additional regularization or architectural changes.
What would settle it
If samples from the single-step student show substantially higher Fréchet Inception Distance scores or visibly poorer fidelity than samples from the multi-step teacher on the same CIFAR-10 or CelebA test sets, the single-pass approximation would fail.
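One direct check is to compute FID between student and teacher samples themselves, in addition to FID against the test set. The sketch below is the standard Fréchet distance on precomputed Inception-v3 pool features; the array names (`feats_student`, `feats_teacher`) and the assumption that 2048-dimensional features have already been extracted are illustrative, not part of the paper.

```python
# Sketch: FID between two sample sets from precomputed Inception features (assumed inputs).
import numpy as np
from scipy import linalg

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between two Gaussians fitted to feature statistics."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(cov1 @ cov2, disp=False)
    covmean = covmean.real  # drop small imaginary components from numerical error
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))

def fid_student_vs_teacher(feats_student, feats_teacher):
    # feats_*: (N, 2048) arrays of Inception-v3 pool features for each sample set.
    mu_s, cov_s = feats_student.mean(axis=0), np.cov(feats_student, rowvar=False)
    mu_t, cov_t = feats_teacher.mean(axis=0), np.cov(feats_teacher, rowvar=False)
    return frechet_distance(mu_s, cov_s, mu_t, cov_t)
```

A student-teacher FID close to the teacher's FID against real data would support the premise; a large gap would indicate mode omission or variance collapse in the single-pass student.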
read the original abstract
Iterative generative models, such as noise conditional score networks and denoising diffusion probabilistic models, produce high quality samples by gradually denoising an initial noise vector. However, their denoising process has many steps, making them 2-3 orders of magnitude slower than other generative models such as GANs and VAEs. In this paper, we establish a novel connection between knowledge distillation and image generation with a technique that distills a multi-step denoising process into a single step, resulting in a sampling speed similar to other single-step generative models. Our Denoising Student generates high quality samples comparable to GANs on the CIFAR-10 and CelebA datasets, without adversarial training. We demonstrate that our method scales to higher resolutions through experiments on 256 x 256 LSUN. Code and checkpoints are available at https://github.com/tcl9876/Denoising_Student
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a knowledge distillation technique to compress the multi-step denoising trajectory of iterative generative models (noise conditional score networks and denoising diffusion probabilistic models) into a single forward pass through a student network called the Denoising Student. This yields sampling speeds comparable to single-step models such as GANs while producing high-quality images on CIFAR-10, CelebA, and 256x256 LSUN without adversarial training. Code and checkpoints are released.
Significance. If the distillation successfully preserves the teacher's distribution in one step, the work would meaningfully address the sampling-speed bottleneck of diffusion-style models and provide a practical alternative to GANs for high-resolution generation. The public release of code and checkpoints is a clear strength that supports reproducibility.
major comments (2)
- [Abstract] The central claim that the single-step student faithfully approximates the teacher's multi-step denoising distribution rests on the distillation objective, yet the abstract states neither the explicit form of the loss (reconstruction, perceptual, or distributional terms) nor any quantitative comparison between student and teacher samples. Direct evaluation against the teacher's own outputs (rather than only against real data or GAN baselines) is required to substantiate the approximation.
- [Experiments] Comparability to GANs is reported only qualitatively; without FID, IS, or precision/recall numbers, ablations on the loss components, or teacher-student distribution distance metrics, it is not possible to judge whether the single-pass student avoids mode omission or variance collapse relative to the full iterative teacher.
minor comments (1)
- [Abstract] The abstract would benefit from a one-sentence statement of the precise distillation loss used.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.
read point-by-point responses
-
Referee: [Abstract] The central claim that the single-step student faithfully approximates the teacher's multi-step denoising distribution rests on the distillation objective, yet the abstract states neither the explicit form of the loss (reconstruction, perceptual, or distributional terms) nor any quantitative comparison between student and teacher samples. Direct evaluation against the teacher's own outputs (rather than only against real data or GAN baselines) is required to substantiate the approximation.
Authors: We agree that explicitly describing the distillation loss in the abstract will strengthen the central claim. The objective is a combination of an L2 reconstruction term on the denoised output and a perceptual loss using pretrained VGG features (a minimal sketch of this combined objective appears after these responses). We will revise the abstract to state this form. We will also add a direct quantitative comparison (FID and other metrics) between student samples and the teacher's multi-step outputs to the experiments section and reference it in the abstract. revision: yes
-
Referee: [Experiments] Comparability to GANs is reported only qualitatively; without FID, IS, or precision/recall numbers, ablations on the loss components, or teacher-student distribution distance metrics, it is not possible to judge whether the single-pass student avoids mode omission or variance collapse relative to the full iterative teacher.
Authors: The current version includes FID scores comparing the student to real data and to GAN baselines on CIFAR-10 and CelebA. We acknowledge that adding IS, precision/recall, loss-component ablations, and explicit teacher-student distribution distances (e.g., FID between student and teacher) would allow a clearer assessment of mode coverage and variance. We will expand the experiments section with these metrics and ablations in the revision. revision: yes
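For reference, here is a minimal sketch of the combined objective described in the first response: an L2 term on the denoised output plus a perceptual term on pretrained VGG-16 features. The chosen feature layer, the `perceptual_weight` value, and the omission of ImageNet input normalization are assumptions made for brevity, not details confirmed by the paper.

```python
# Sketch of the rebuttal's stated objective: L2 reconstruction + VGG perceptual loss (assumed details).
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen VGG-16 feature extractor; the cut at layer 16 (through conv3_3) is an arbitrary choice here.
_vgg_features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in _vgg_features.parameters():
    p.requires_grad_(False)

def distillation_loss(x_student, x_teacher, perceptual_weight=0.1):
    l2 = F.mse_loss(x_student, x_teacher)                                        # pixel-space reconstruction
    perceptual = F.mse_loss(_vgg_features(x_student), _vgg_features(x_teacher))  # feature-space match
    return l2 + perceptual_weight * perceptual
```

In training, `x_teacher` would come from the teacher's full multi-step trajectory, as in the Denoising Student setup sketched earlier.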
Circularity Check
No significant circularity; derivation applies standard distillation independently
full rationale
The paper proposes distilling a multi-step iterative denoising process (from score networks or diffusion models) into a single forward pass via knowledge distillation, yielding faster sampling while matching quality to GANs. No equations, fitted parameters, or self-citations appear in the abstract or description that would reduce the claimed single-step approximation to a tautology or make it hold by construction. The central technique rests on applying existing distillation objectives to a new target (the teacher's full trajectory), which remains an independent methodological step without self-definitional loops or renamed known results. This qualifies as a self-contained contribution.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
We establish a novel connection between knowledge distillation and image generation with a technique that distills a multi-step denoising process into a single step
-
IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Our Denoising Student generates high quality samples comparable to GANs on the CIFAR-10 and CelebA datasets, without adversarial training
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
-
Consistency Models
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
-
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
-
Stochastic Transition-Map Distillation for Fast Probabilistic Inference
STMD distills the full transition map of diffusion sampling SDEs into a conditional Mean Flow model to enable fast one- or few-step stochastic sampling without teacher models or bi-level optimization.
-
One Step Diffusion via Shortcut Models
Shortcut models enable high-quality single or few-step sampling in diffusion models with one network and training phase by conditioning on desired step size.
-
Elucidating the Design Space of Diffusion-Based Generative Models
Organizing diffusion model design choices yields SOTA FID of 1.79 on CIFAR-10 with only 35 network evaluations per image and similar gains on ImageNet-64.
-
Progressive Distillation for Fast Sampling of Diffusion Models
Progressive distillation halves sampling steps repeatedly in diffusion models, reaching 4 steps with FID 3.0 on CIFAR-10 from 8192-step samplers.
-
Diffusion Models Beat GANs on Image Synthesis
Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
-
Fast Image Super-Resolution via Consistency Rectified Flow
FlowSR enables single-step image super-resolution by learning a rectified flow from LR to HR with consistency distillation, HR regularization, and dual fast-slow timestep scheduling.
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
-
MixFlow: Mixed Source Distributions Improve Rectified Flows
Mixing unconditional Gaussian noise with a κ-conditioned source during training of rectified flows reduces path curvature, yielding 12% better FID scores and faster sampling than standard rectified flows.
-
Jeffreys Flow: Robust Boltzmann Generators for Rare Event Sampling via Parallel Tempering Distillation
Jeffreys Flow distills Parallel Tempering trajectories via Jeffreys divergence to produce robust Boltzmann generators that suppress mode collapse and correct sampling inaccuracies for rare event sampling.
-
MAGI-1: Autoregressive Video Generation at Scale
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
-
DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models
DPM-Solver++ enables high-quality guided sampling of diffusion models in 15-20 steps via data-prediction ODE solving and multistep stabilization.
-
Lightning Unified Video Editing via In-Context Sparse Attention
ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...
-
Reward-Aware Trajectory Shaping for Few-step Visual Generation
RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.
-
Elucidating the SNR-t Bias of Diffusion Probabilistic Models
Diffusion models have an SNR-timestep mismatch during inference that the authors mitigate with per-frequency differential correction, raising generation quality across IDDPM, ADM, DDIM and others.
-
Discrete Meanflow Training Curriculum
A DMF curriculum initialized from pretrained flow models achieves one-step FID 3.36 on CIFAR-10 after only 2000 epochs by exploiting a discretized consistency property in the Meanflow objective.
Reference graph
Works this paper leans on
-
[1]
Large Scale GAN Training for High Fidelity Natural Image Synthesis
Andrew Brock, Jeff Donahue, and Karen Simonyan. “Large Scale GAN Training for High Fidelity Natural Image Synthesis”. In: International Conference on Learning Representations. 2019. url: https://openreview.net/forum?id=B1xsqj09Fm
work page 2019
-
[2]
Cristian Bucilua, Rich Caruana, and Alexandru Niculescu-Mizil. “Model Compression”. In: New York, NY, USA: Association for Computing Machinery, 2006. doi: 10.1145/1150402.1150464. url: https://doi.org/10.1145/1150402.1150464
-
[3]
Distilling Knowledge from Ensembles of Neural Networks for Speech Recognition
Yevgen Chebotar and Austin Waters. “Distilling Knowledge from Ensembles of Neural Networks for Speech Recognition.” In: Interspeech. 2016, pp. 3439–3443
work page 2016
-
[4]
Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images
Rewon Child. Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images. 2020. arXiv: 2011.10650 [cs.LG]
-
[5]
Implicit generation and modeling with energy based models
Yilun Du and Igor Mordatch. “Implicit generation and modeling with energy based models”. In: Advances in Neural Information Processing Systems 32 (2019), pp. 3608–3618
work page 2019
-
[6]
Efficient Knowledge Distillation from an Ensemble of Teachers
T. Fukuda et al. “Efficient Knowledge Distillation from an Ensemble of Teachers”. In: INTERSPEECH. 2017
work page 2017
-
[7]
Tommaso Furlanello et al. Born Again Neural Networks . 2018. arXiv: 1805.04770 [stat.ML]
-
[8]
Learning Energy-Based Models by Diffusion Recovery Likelihood
Ruiqi Gao et al. Learning Energy-Based Models by Diffusion Recovery Likelihood
- [9]
-
[10]
Yan Gao, Titouan Parcollet, and Nicholas Lane. “Distilling Knowledge from Ensembles of Acoustic Models for Joint CTC-Attention End-to-End Speech Recognition”. In: arXiv preprint arXiv:2005.09310 (2020)
-
[11]
Ian Goodfellow et al. “Generative Adversarial Nets”. In: Advances in Neural Information Processing Systems. Ed. by Z. Ghahramani et al. Vol. 27. Curran Associates, Inc., 2014, pp. 2672–2680. url: https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
work page 2014
-
[12]
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
Martin Heusel et al. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium . 2018. arXiv: 1706.08500 [cs.LG]
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[13]
Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network. 2015. arXiv: 1503.02531 [stat.ML]
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[14]
Denoising Diffusion Probabilistic Models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. 2020. arXiv: 2006.11239 [cs.LG]
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[15]
Tinybert: Distilling bert for natural language understanding
Xiaoqi Jiao et al. “Tinybert: Distilling bert for natural language understanding”. In: arXiv preprint arXiv:1909.10351 (2019)
-
[16]
Analyzing and Improving the Image Quality of StyleGAN
Tero Karras et al. Analyzing and Improving the Image Quality of StyleGAN . 2020. arXiv: 1912.04958 [cs.CV]
-
[17]
Training Generative Adversarial Networks with Limited Data
Tero Karras et al. Training Generative Adversarial Networks with Limited Data
- [18]
-
[19]
Knowledge distillation using output errors for self-attention end-to-end models
Ho-Gyeong Kim et al. “Knowledge distillation using output errors for self-attention end-to-end models”. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2019, pp. 6181–6185
work page 2019
-
[20]
Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv: 1412.6980 [cs.LG]
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
-
[22]
Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes . 2014. arXiv: 1312.6114 [stat.ML]
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[23]
Learning Multiple Layers of Features from Tiny Images
Alex Krizhevsky. “Learning Multiple Layers of Features from Tiny Images”. In: University of Toronto (May 2012)
work page 2012
-
[24]
A tutorial on energy-based learning
Yann Lecun et al. “A tutorial on energy-based learning”. English (US). In: Predicting structured data. Ed. by G. Bakir et al. MIT Press, 2006
work page 2006
-
[25]
Deep learning face attributes in the wild
Ziwei Liu et al. “Deep learning face attributes in the wild”. In: Proceedings of the IEEE international conference on computer vision . 2015, pp. 3730–3738
work page 2015
-
[26]
Spectral Normalization for Generative Adversarial Networks
Takeru Miyato et al. Spectral Normalization for Generative Adversarial Networks. arXiv: 1802.05957 [cs.LG]
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
-
[28]
Learning in Implicit Generative Models
Shakir Mohamed and Balaji Lakshminarayanan. Learning in Implicit Generative Models. 2017. arXiv: 1610.03483 [stat.ML]
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[29]
Learning Implicit Generative Models with the Method of Learned Moments
Suman Ravuri et al. Learning Implicit Generative Models with the Method of Learned Moments. 2018. arXiv: 1806.11006 [cs.LG]
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[30]
FitNets: Hints for Thin Deep Nets
Adriana Romero et al. FitNets: Hints for Thin Deep Nets . 2015. arXiv: 1412.6550 [cs.LG]
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[31]
Improved Techniques for Training GANs
Tim Salimans et al. Improved Techniques for Training GANs. 2016. arXiv: 1606.03498 [cs.LG]
work page 2016
-
[32]
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Victor Sanh et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. 2020. arXiv: 1910.01108 [cs.CL]
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[33]
Deep Unsupervised Learning using Nonequilibrium Thermodynamics
Jascha Sohl-Dickstein et al. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. 2015. arXiv: 1503.03585 [cs.LG]
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[34]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models. 2020. arXiv: 2010.02502 [cs.LG]
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[35]
Generative modeling by estimating gradients of the data distribution
Yang Song and Stefano Ermon. “Generative modeling by estimating gradients of the data distribution”. In: Advances in Neural Information Processing Systems . 2019, pp. 11918–11930
work page 2019
-
[36]
Improved Techniques for Training Score-Based Generative Models
Yang Song and Stefano Ermon. Improved Techniques for Training Score-Based Generative Models. 2020. arXiv: 2006.09011 [cs.LG]
-
[37]
Score-Based Generative Modeling through Stochastic Differential Equations
Yang Song et al. Score-Based Generative Modeling through Stochastic Differential Equations. 2020. arXiv: 2011.13456 [cs.LG]
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[38]
Contrastive Distillation on Intermediate Representations for Language Model Compression
Siqi Sun et al. “Contrastive Distillation on Intermediate Representations for Language Model Compression”. In: arXiv preprint arXiv:2009.14167 (2020)
-
[39]
Patient Knowledge Distillation for BERT Model Compression
Siqi Sun et al. Patient Knowledge Distillation for BERT Model Compression . 2019. arXiv: 1908.09355 [cs.CL]
-
[40]
Contrastive Representation Distillation
Yonglong Tian, Dilip Krishnan, and Phillip Isola. “Contrastive Representation Distillation”. In: International Conference on Learning Representations. 2019
work page 2019
-
[41]
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
Iulia Turc et al. Well-Read Students Learn Better: On the Importance of Pre-training Compact Models. 2019. arXiv: 1908.08962 [cs.CL]
-
[42]
NVAE: A Deep Hierarchical Variational Autoencoder
Arash Vahdat and Jan Kautz. NVAE: A Deep Hierarchical Variational Autoencoder
- [43]
-
[44]
A connection between score matching and denoising autoencoders
Pascal Vincent. “A connection between score matching and denoising autoencoders”. In: Neural computation 23.7 (2011), pp. 1661–1674
work page 2011
-
[45]
Student-teacher network learning with enhanced features
Shinji Watanabe et al. “Student-teacher network learning with enhanced features”. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2017, pp. 5275–5279
work page 2017
-
[46]
Improving GAN Training with Probability Ratio Clipping and Sample Reweighting
Yue Wu et al. “Improving GAN Training with Probability Ratio Clipping and Sample Reweighting”. In: Advances in Neural Information Processing Systems 33 (2020)
work page 2020
-
[47]
VAEBM: A Symbiosis between Variational Autoencoders and Energy-based Models
Zhisheng Xiao et al. VAEBM: A Symbiosis between Variational Autoencoders and Energy-based Models. 2020. arXiv: 2010.00654 [cs.LG]
-
[48]
Knowledge distillation meets self-supervision
Guodong Xu et al. “Knowledge distillation meets self-supervision”. In: European Conference on Computer Vision . Springer. 2020, pp. 588–604
work page 2020
-
[49]
A gift from knowledge distillation: Fast optimization, network minimization and transfer learning
Junho Yim et al. “A gift from knowledge distillation: Fast optimization, network minimization and transfer learning”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . 2017, pp. 4133–4141
work page 2017
-
[50]
LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop
Fisher Yu et al. LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop . 2016. arXiv: 1506.03365 [cs.CV]
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[51]
Sergey Zagoruyko and Nikos Komodakis. “Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer”. In: International Conference on Learning Representations. 2017.
work page 2017