pith. machine review for the scientific record.

arxiv: 2101.02388 · v1 · submitted 2021-01-07 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 03:52 UTC · model grok-4.3

classification 💻 cs.LG
keywords: knowledge distillation · denoising diffusion · generative models · sampling speed · image generation · CIFAR-10 · CelebA · LSUN

The pith

Knowledge distillation turns a multi-step denoising model into a fast single-step generator matching GAN quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Iterative generative models such as denoising diffusion probabilistic models produce high-quality images by applying many gradual denoising steps to an initial noise vector, but this makes them far slower than GANs or VAEs. The paper connects knowledge distillation to this setting by training a student network to produce in one forward pass the same result that the multi-step teacher would reach after its full trajectory. This yields sampling speeds comparable to single-step models while avoiding adversarial training entirely. The resulting Denoising Student achieves sample quality similar to GANs on CIFAR-10 and CelebA and extends to 256 by 256 images on LSUN.
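A minimal sketch of that setup, for orientation only. Here `teacher_step` stands in for one denoising update of the pretrained multi-step teacher the paper assumes is available, and the toy MLP student and plain L2 objective are illustrative placeholders rather than the paper's architecture or exact loss.

```python
import torch
import torch.nn as nn

DIM = 3 * 32 * 32  # flattened 32x32 RGB image, e.g. CIFAR-10

class DenoisingStudent(nn.Module):
    """Toy one-step generator: maps image-shaped noise to an image in a single pass."""
    def __init__(self, dim=DIM, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, dim), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

@torch.no_grad()
def teacher_output(z, teacher_step, num_steps=1000):
    """Run the frozen teacher's full denoising trajectory from noise z to a clean image."""
    x = z
    for t in reversed(range(num_steps)):
        x = teacher_step(x, t)  # one denoising update of the pretrained teacher
    return x

def distill(student, teacher_step, iters=10_000, batch=64, lr=2e-4):
    """Train the student to reproduce the teacher's final output in one forward pass."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(iters):
        z = torch.randn(batch, DIM)
        target = teacher_output(z, teacher_step)  # teacher's multi-step result
        pred = student(z)                         # student's single forward pass
        loss = ((pred - target) ** 2).mean()      # plain L2 distillation loss (placeholder)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```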

Core claim

We establish a novel connection between knowledge distillation and image generation with a technique that distills a multi-step denoising process into a single step, resulting in a sampling speed similar to other single-step generative models. Our Denoising Student generates high quality samples comparable to GANs on the CIFAR-10 and CelebA datasets, without adversarial training. We demonstrate that our method scales to higher resolutions through experiments on 256 x 256 LSUN.

What carries the argument

The Denoising Student, a network trained via knowledge distillation to replicate the final clean image produced by a teacher's full multi-step denoising trajectory in a single forward pass.

If this is right

  • Sampling speed improves by two to three orders of magnitude, matching single-step models such as GANs and VAEs (see the timing sketch after this list).
  • Sample quality remains comparable to GANs on CIFAR-10 and CelebA without adversarial training.
  • The method scales to higher-resolution images such as 256 by 256 on LSUN.
  • No additional regularization is needed beyond standard distillation to achieve the speed-quality tradeoff.
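The speed claim in the first bullet follows from step counts: with roughly comparable per-step cost, a thousand-step sampler pays about a thousand sequential network evaluations where the distilled student pays one. A rough, illustrative timing sketch with an arbitrary stand-in network, not the paper's benchmark:

```python
import time
import torch
import torch.nn as nn

# Arbitrary stand-in network: same per-call cost whether used once ("student")
# or chained for many steps ("teacher").
net = nn.Sequential(nn.Linear(3072, 3072), nn.ReLU(), nn.Linear(3072, 3072))
x0 = torch.randn(64, 3072)

@torch.no_grad()
def sample(num_steps):
    x = x0
    for _ in range(num_steps):
        x = net(x)
    return x

def timed(fn, reps=3):
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - start) / reps

t_one = timed(lambda: sample(1))      # single-step student
t_many = timed(lambda: sample(1000))  # 1000-step iterative teacher
print(f"1 step: {t_one:.4f}s, 1000 steps: {t_many:.2f}s, ratio ~{t_many / t_one:.0f}x")
```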

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation idea could be tested on conditional versions of the teacher to enable fast controlled generation.
  • Diversity statistics of the student outputs could be compared directly to the teacher to check mode coverage.
  • The approach might transfer to other iterative generative tasks such as audio waveform synthesis where multi-step refinement is common.

Load-bearing premise

A single forward pass through the student can faithfully approximate the distribution produced by the full multi-step teacher denoising trajectory without requiring additional regularization or architectural changes.

What would settle it

If samples from the single-step student show substantially higher Fréchet Inception Distance scores or visibly poorer fidelity than samples from the multi-step teacher on the same CIFAR-10 or CelebA test sets, the single-pass approximation would fail.
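That check can be made concrete by computing FID directly between student samples and teacher samples rather than only against real data. A minimal sketch of the FID computation, assuming `feats_student` and `feats_teacher` are (N, D) arrays of Inception-style features extracted from the two sample sets; the feature extractor itself is out of scope here.

```python
import numpy as np
from scipy import linalg

def fid(feats_a, feats_b):
    """Frechet Inception Distance between two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Matrix square root of the covariance product; small imaginary parts are numerical noise.
    covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

# Hypothetical usage: print(fid(feats_student, feats_teacher))
```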

Original abstract

Iterative generative models, such as noise conditional score networks and denoising diffusion probabilistic models, produce high quality samples by gradually denoising an initial noise vector. However, their denoising process has many steps, making them 2-3 orders of magnitude slower than other generative models such as GANs and VAEs. In this paper, we establish a novel connection between knowledge distillation and image generation with a technique that distills a multi-step denoising process into a single step, resulting in a sampling speed similar to other single-step generative models. Our Denoising Student generates high quality samples comparable to GANs on the CIFAR-10 and CelebA datasets, without adversarial training. We demonstrate that our method scales to higher resolutions through experiments on 256 x 256 LSUN. Code and checkpoints are available at https://github.com/tcl9876/Denoising_Student

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a knowledge distillation technique to compress the multi-step denoising trajectory of iterative generative models (noise conditional score networks and denoising diffusion probabilistic models) into a single forward pass through a student network called the Denoising Student. This yields sampling speeds comparable to single-step models such as GANs while producing high-quality images on CIFAR-10, CelebA, and 256x256 LSUN without adversarial training. Code and checkpoints are released.

Significance. If the distillation successfully preserves the teacher's distribution in one step, the work would meaningfully address the sampling-speed bottleneck of diffusion-style models and provide a practical alternative to GANs for high-resolution generation. The public release of code and checkpoints is a clear strength that supports reproducibility.

major comments (2)
  1. [Abstract] The central claim that the single-step student faithfully approximates the teacher's multi-step denoising distribution rests on the distillation objective, yet the abstract gives neither the explicit form of the loss (reconstruction, perceptual, or distributional terms) nor any quantitative comparison between student samples and teacher samples. Direct evaluation against the teacher's own outputs (rather than only against real data or GAN baselines) is required to substantiate the approximation.
  2. [Experiments] Comparability to GANs is stated only qualitatively; without FID, IS, or precision/recall numbers, ablations on loss components, or teacher-student distribution distance metrics, it is not possible to judge whether the single-pass student avoids mode omission or variance collapse relative to the full iterative teacher.
minor comments (1)
  1. [Abstract] The abstract would benefit from a one-sentence statement of the precise distillation loss used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the single-step student faithfully approximates the teacher's multi-step denoising distribution rests on the distillation objective, yet the abstract gives neither the explicit form of the loss (reconstruction, perceptual, or distributional terms) nor any quantitative comparison between student samples and teacher samples. Direct evaluation against the teacher's own outputs (rather than only against real data or GAN baselines) is required to substantiate the approximation.

    Authors: We agree that explicitly describing the distillation loss in the abstract will strengthen the central claim. The objective is a combination of an L2 reconstruction term on the denoised output and a perceptual loss using pretrained VGG features. We will revise the abstract to state this form. We will also add a direct quantitative comparison (FID and other metrics) between student samples and the teacher's multi-step outputs to the experiments section and reference it in the abstract. revision: yes

  2. Referee: [Experiments] Comparability to GANs is stated only qualitatively; without FID, IS, or precision/recall numbers, ablations on loss components, or teacher-student distribution distance metrics, it is not possible to judge whether the single-pass student avoids mode omission or variance collapse relative to the full iterative teacher.

    Authors: The current version includes FID scores comparing the student to real data and to GAN baselines on CIFAR-10 and CelebA. We acknowledge that adding IS, precision/recall, loss-component ablations, and explicit teacher-student distribution distances (e.g., FID between student and teacher) would allow a clearer assessment of mode coverage and variance. We will expand the experiments section with these metrics and ablations in the revision. revision: yes
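A minimal sketch of the combined objective described in the first response above (an L2 term on the denoised output plus a VGG-feature perceptual term). Because the rebuttal is simulated, this is illustrative rather than the paper's implementation; the layer cutoff and weight `lam` are assumptions, and it presumes a recent torchvision for the pretrained VGG-16 features.

```python
import torch
import torch.nn as nn
import torchvision

class DistillationLoss(nn.Module):
    """L2 reconstruction term plus a perceptual term on frozen VGG-16 features."""
    def __init__(self, lam=0.1, cutoff=16):
        super().__init__()
        # Frozen feature extractor; cutoff layer chosen arbitrarily for illustration.
        vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:cutoff]
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg.eval()
        self.lam = lam

    def forward(self, student_out, teacher_out):
        # Both inputs: (B, 3, H, W) images in the same value range.
        l2 = ((student_out - teacher_out) ** 2).mean()
        perceptual = ((self.vgg(student_out) - self.vgg(teacher_out)) ** 2).mean()
        return l2 + self.lam * perceptual
```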

Circularity Check

0 steps flagged

No significant circularity; derivation applies standard distillation independently

Full rationale

The paper proposes distilling a multi-step iterative denoising process (from score networks or diffusion models) into a single forward pass via knowledge distillation, yielding faster sampling while matching quality to GANs. No equations, fitted parameters, or self-citations are shown in the abstract or description that reduce the claimed single-step approximation to a tautology or assume it by construction. The central technique rests on applying existing distillation objectives to a new target (the teacher's full trajectory), which remains an independent methodological step without self-definitional loops or renamed known results. This qualifies as a self-contained contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the approach appears to rely on standard assumptions of knowledge distillation and the existence of a pre-trained multi-step teacher model.

pith-pipeline@v0.9.0 · 5440 in / 1123 out tokens · 36081 ms · 2026-05-17T03:52:10.568856+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Consistency Models

    cs.LG 2023-03 conditional novelty 8.0

    Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

  2. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    cs.LG 2022-09 unverdicted novelty 8.0

    Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

  3. Stochastic Transition-Map Distillation for Fast Probabilistic Inference

    cs.LG 2026-05 unverdicted novelty 7.0

    STMD distills the full transition map of diffusion sampling SDEs into a conditional Mean Flow model to enable fast one- or few-step stochastic sampling without teacher models or bi-level optimization.

  4. One Step Diffusion via Shortcut Models

    cs.LG 2024-10 conditional novelty 7.0

    Shortcut models enable high-quality single or few-step sampling in diffusion models with one network and training phase by conditioning on desired step size.

  5. Elucidating the Design Space of Diffusion-Based Generative Models

    cs.CV 2022-06 accept novelty 7.0

    Organizing diffusion model design choices yields SOTA FID of 1.79 on CIFAR-10 with only 35 network evaluations per image and similar gains on ImageNet-64.

  6. Progressive Distillation for Fast Sampling of Diffusion Models

    cs.LG 2022-02 unverdicted novelty 7.0

    Progressive distillation halves sampling steps repeatedly in diffusion models, reaching 4 steps with FID 3.0 on CIFAR-10 from 8192-step samplers.

  7. Diffusion Models Beat GANs on Image Synthesis

    cs.LG 2021-05 accept novelty 7.0

    Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.

  8. Fast Image Super-Resolution via Consistency Rectified Flow

    cs.CV 2026-05 unverdicted novelty 6.0

    FlowSR enables single-step image super-resolution by learning a rectified flow from LR to HR with consistency distillation, HR regularization, and dual fast-slow timestep scheduling.

  9. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  10. MixFlow: Mixed Source Distributions Improve Rectified Flows

    cs.CV 2026-04 unverdicted novelty 6.0

    Mixing unconditional Gaussian noise with a κ-conditioned source during training of rectified flows reduces path curvature, yielding 12% better FID scores and faster sampling than standard rectified flows.

  11. Jeffreys Flow: Robust Boltzmann Generators for Rare Event Sampling via Parallel Tempering Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    Jeffreys Flow distills Parallel Tempering trajectories via Jeffreys divergence to produce robust Boltzmann generators that suppress mode collapse and correct sampling inaccuracies for rare event sampling.

  12. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  13. DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

    cs.LG 2022-11 conditional novelty 6.0

    DPM-Solver++ enables high-quality guided sampling of diffusion models in 15-20 steps via data-prediction ODE solving and multistep stabilization.

  14. Lightning Unified Video Editing via In-Context Sparse Attention

    cs.CV 2026-05 unverdicted novelty 5.0

    ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...

  15. Reward-Aware Trajectory Shaping for Few-step Visual Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.

  16. Elucidating the SNR-t Bias of Diffusion Probabilistic Models

    cs.CV 2026-04 unverdicted novelty 4.0

    Diffusion models have an SNR-timestep mismatch during inference that the authors mitigate with per-frequency differential correction, raising generation quality across IDDPM, ADM, DDIM and others.

  17. Discrete Meanflow Training Curriculum

    cs.LG 2026-04 unverdicted novelty 4.0

    A DMF curriculum initialized from pretrained flow models achieves one-step FID 3.36 on CIFAR-10 after only 2000 epochs by exploiting a discretized consistency property in the Meanflow objective.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 17 Pith papers · 14 internal anchors

  1. [1]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. “Large Scale GAN Training for High Fidelity Natural Image Synthesis”. In: International Conference on Learning Representations. 2019. url: https://openreview.net/forum?id=B1xsqj09Fm

  2. [2]

    Model Compression

    Cristian Bucilua, Rich Caruana, and Alexandru Niculescu-Mizil. “Model Compression”. In: New York, NY, USA: Association for Computing Machinery, 2006. doi: 10.1145/1150402.1150464. url: https://doi.org/10.1145/1150402.1150464

  3. [3]

    Distilling Knowledge from Ensembles of Neural Networks for Speech Recognition

    Yevgen Chebotar and Austin Waters. “Distilling Knowledge from Ensembles of Neural Networks for Speech Recognition.” In: Interspeech. 2016, pp. 3439–3443

  4. [4]

    Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images

    Rewon Child. Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images. 2020. arXiv: 2011.10650 [cs.LG]

  5. [5]

    Implicit generation and modeling with energy based models

    Yilun Du and Igor Mordatch. “Implicit generation and modeling with energy based models”. In: Advances in Neural Information Processing Systems 32 (2019), pp. 3608–3618

  6. [6]

    Efficient Knowledge Distillation from an Ensemble of Teachers

    T. Fukuda et al. “Efficient Knowledge Distillation from an Ensemble of Teachers”. In: INTERSPEECH. 2017

  7. [7]

    Born Again Neural Networks

    Tommaso Furlanello et al. Born Again Neural Networks . 2018. arXiv: 1805.04770 [stat.ML]

  8. [8]

    Learning Energy-Based Models by Diffusion Recovery Likelihood

    Ruiqi Gao et al. Learning Energy-Based Models by Diffusion Recovery Likelihood

  9. [9]

    arXiv: 2012.08125 [cs.LG]

  10. [10]

    Distilling Knowledge from Ensembles of Acoustic Models for Joint CTC-Attention End-to-End Speech Recognition

    Yan Gao, Titouan Parcollet, and Nicholas Lane. “Distilling Knowledge from Ensembles of Acoustic Models for Joint CTC-Attention End-to-End Speech Recognition”. In: arXiv preprint arXiv:2005.09310 (2020)

  11. [11]

    Generative Adversarial Nets

    Ian Goodfellow et al. “Generative Adversarial Nets”. In: Advances in Neural Information Processing Systems. Ed. by Z. Ghahramani et al. Vol. 27. Curran Associates, Inc., 2014, pp. 2672–2680. url: https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf

  12. [12]

    GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

    Martin Heusel et al. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium . 2018. arXiv: 1706.08500 [cs.LG]

  13. [13]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network. 2015. arXiv: 1503.02531 [stat.ML]

  14. [14]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. 2020. arXiv: 2006.11239 [cs.LG]

  15. [15]

    Tinybert: Distilling bert for natural language understanding

    Xiaoqi Jiao et al. “Tinybert: Distilling bert for natural language understanding”. In: arXiv preprint arXiv:1909.10351 (2019)

  16. [16]

    Analyzing and Improving the Image Quality of StyleGAN

    Tero Karras et al. Analyzing and Improving the Image Quality of StyleGAN . 2020. arXiv: 1912.04958 [cs.CV]

  17. [17]

    Training Generative Adversarial Networks with Limited Data

    Tero Karras et al. Training Generative Adversarial Networks with Limited Data

  18. [18]

    arXiv: 2006.06676 [cs.CV]

  19. [19]

    Knowledge distillation using output errors for self-attention end-to-end models

    Ho-Gyeong Kim et al. “Knowledge distillation using output errors for self-attention end-to-end models”. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2019, pp. 6181–6185

  20. [20]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization

  21. [21]

    arXiv: 1412.6980 [cs.LG]

  22. [22]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes . 2014. arXiv: 1312.6114 [stat.ML]

  23. [23]

    Learning Multiple Layers of Features from Tiny Images

    Alex Krizhevsky. “Learning Multiple Layers of Features from Tiny Images”. In: University of Toronto (May 2012)

  24. [24]

    A tutorial on energy-based learning

    Yann Lecun et al. “A tutorial on energy-based learning”. English (US). In: Predicting structured data. Ed. by G. Bakir et al. MIT Press, 2006

  25. [25]

    Deep learning face attributes in the wild

    Ziwei Liu et al. “Deep learning face attributes in the wild”. In: Proceedings of the IEEE international conference on computer vision . 2015, pp. 3730–3738

  26. [26]

    Spectral Normalization for Generative Adversarial Networks

    Takeru Miyato et al. Spectral Normalization for Generative Adversarial Networks

  27. [27]

    arXiv: 1802.05957 [cs.LG]

  28. [28]

    Learning in Implicit Generative Models

    Shakir Mohamed and Balaji Lakshminarayanan. Learning in Implicit Generative Models. 2017. arXiv: 1610.03483 [stat.ML]

  29. [29]

    Learning Implicit Generative Models with the Method of Learned Moments

    Suman Ravuri et al. Learning Implicit Generative Models with the Method of Learned Moments. 2018. arXiv: 1806.11006 [cs.LG]

  30. [30]

    FitNets: Hints for Thin Deep Nets

    Adriana Romero et al. FitNets: Hints for Thin Deep Nets . 2015. arXiv: 1412.6550 [cs.LG]

  31. [31]

    Improved Techniques for Training GANs

    Tim Salimans et al. Improved Techniques for Training GANs. 2016. arXiv: 1606.03498 [cs.LG]

  32. [32]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Victor Sanh et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. 2020. arXiv: 1910.01108 [cs.CL]

  33. [33]

    Deep Unsupervised Learning using Nonequilibrium Thermodynamics

    Jascha Sohl-Dickstein et al. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. 2015. arXiv: 1503.03585 [cs.LG]

  34. [34]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models. 2020. arXiv: 2010.02502 [cs.LG]

  35. [35]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. “Generative modeling by estimating gradients of the data distribution”. In: Advances in Neural Information Processing Systems . 2019, pp. 11918–11930

  36. [36]

    Improved Techniques for Training Score-Based Generative Models

    Yang Song and Stefano Ermon. Improved Techniques for Training Score-Based Generative Models. 2020. arXiv: 2006.09011 [cs.LG]

  37. [37]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song et al. Score-Based Generative Modeling through Stochastic Differential Equations. 2020. arXiv: 2011.13456 [cs.LG]

  38. [38]

    Contrastive Distillation on Intermediate Representations for Language Model Compression

    Siqi Sun et al. “Contrastive Distillation on Intermediate Representations for Language Model Compression”. In: arXiv preprint arXiv:2009.14167 (2020)

  39. [39]

    Patient Knowledge Distillation for BERT Model Compression

    Siqi Sun et al. Patient Knowledge Distillation for BERT Model Compression . 2019. arXiv: 1908.09355 [cs.CL]

  40. [40]

    Contrastive Representation Distillation

    Yonglong Tian, Dilip Krishnan, and Phillip Isola. “Contrastive Representation Distillation”. In: International Conference on Learning Representations. 2019

  41. [41]

    Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

    Iulia Turc et al. Well-Read Students Learn Better: On the Importance of Pre-training Compact Models. 2019. arXiv: 1908.08962 [cs.CL]

  42. [42]

    NVAE: A Deep Hierarchical Variational Autoencoder

    Arash Vahdat and Jan Kautz. NVAE: A Deep Hierarchical Variational Autoencoder

  43. [43]

    arXiv: 2007.03898 [stat.ML]

  44. [44]

    A connection between score matching and denoising autoencoders

    Pascal Vincent. “A connection between score matching and denoising autoencoders”. In: Neural computation 23.7 (2011), pp. 1661–1674

  45. [45]

    Student-teacher network learning with enhanced features

    Shinji Watanabe et al. “Student-teacher network learning with enhanced features”. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2017, pp. 5275–5279

  46. [46]

    Improving GAN Training with Probability Ratio Clipping and Sample Reweighting

    Yue Wu et al. “Improving GAN Training with Probability Ratio Clipping and Sample Reweighting”. In: Advances in Neural Information Processing Systems 33 (2020)

  47. [47]

    VAEBM: A Symbiosis between Variational Autoencoders and Energy-based Models

    Zhisheng Xiao et al. VAEBM: A Symbiosis between Variational Autoencoders and Energy-based Models. 2020. arXiv: 2010.00654 [cs.LG]

  48. [48]

    Knowledge distillation meets self-supervision

    Guodong Xu et al. “Knowledge distillation meets self-supervision”. In: European Conference on Computer Vision . Springer. 2020, pp. 588–604

  49. [49]

    A gift from knowledge distillation: Fast optimization, network minimization and transfer learning

    Junho Yim et al. “A gift from knowledge distillation: Fast optimization, network minimization and transfer learning”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . 2017, pp. 4133–4141

  50. [50]

    LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

    Fisher Yu et al. LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop . 2016. arXiv: 1506.03365 [cs.CV]

  51. [51]

    Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer

    Sergey Zagoruyko and Nikos Komodakis. “Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer”. In: International Conference on Learning Representations. 2017