
arxiv: 2605.19729 · v2 · pith:2BNY7G4Unew · submitted 2026-05-19 · 💻 cs.CV · cs.AI

LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models

Pith reviewed 2026-05-21 07:27 UTC · model grok-4.3

classification: cs.CV · cs.AI
keywords: knowledge distillation · diffusion models · model compression · lightweight networks · denoising process · coarse-to-fine training · adaptive loss weighting

The pith

Breaking the teacher's denoising into coarse linear alignment then locally adaptive fine refinement lets tiny students train stably.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard knowledge distillation collapses when the student diffusion model is reduced to roughly 1.6 percent of the teacher's size because the full complex denoising trajectory is too hard to copy directly. The proposed method first trains the student on a simplified coarse objective obtained by linear fitting of the teacher's outputs, then switches to a refinement stage that applies piecewise local scaling factors to the loss according to per-region error levels. This staged, adaptive guidance produces stable convergence and an FID of 15.73 on a 1.3-million-parameter student where conventional distillation yields FID scores of 50–200 or worse. The same procedure works for both pixel-space and latent-space diffusion, U-Net and DiT backbones, unconditional and conditional tasks, and even extends to flow-matching models such as MMDiT.

Core claim

The teacher's complex denoising process can be decomposed into an initial coarse-alignment stage learned via linear fitting of outputs and a subsequent fine-refinement stage whose loss is locally re-weighted by error-based partitioning; training the student sequentially on these two stages yields stable optimization and high-quality generation even when the student capacity is reduced by more than 98 percent.
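One way to read the linear-fitting stage is as an ordinary least-squares fit between student and teacher outputs, followed by a loss with a coarse term and a fine term. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes the teacher output is regressed on the student output, that the coarse term pulls the fitted intercept toward 0 and the slope toward 1 (as the Figure 4 caption describes), and that the fine term is the mean squared residual scaled by a weight w. The names fit_line and lift_style_loss are hypothetical.

```python
from statistics import fmean


def fit_line(student, teacher):
    """Least-squares fit teacher ~ b0 + b1 * student over flattened outputs."""
    ms, mt = fmean(student), fmean(teacher)
    var_s = sum((s - ms) ** 2 for s in student)
    if var_s == 0.0:
        # Assumption: for a constant student output the slope is undefined,
        # so fall back to slope 1 and absorb the mean gap into the intercept.
        return mt - ms, 1.0
    cov = sum((s - ms) * (t - mt) for s, t in zip(student, teacher))
    b1 = cov / var_s
    b0 = mt - b1 * ms
    return b0, b1


def lift_style_loss(student, teacher, w=1.0):
    """Coarse term on (b0, b1) plus w times the mean squared fit residual."""
    b0, b1 = fit_line(student, teacher)
    # Coarse part: push the fitted line toward the identity (b0 -> 0, b1 -> 1).
    coarse = b0 ** 2 + (b1 - 1.0) ** 2
    # Fine part: whatever the fitted line cannot explain.
    residuals = [t - (b0 + b1 * s) for s, t in zip(student, teacher)]
    fine = fmean([r * r for r in residuals])
    return coarse + w * fine, (b0, b1)


if __name__ == "__main__":
    student_out = [0.0, 1.0, 2.0, 3.0]
    teacher_out = [1.0, 3.0, 5.0, 7.0]  # exactly 1 + 2 * student
    loss, (b0, b1) = lift_style_loss(student_out, teacher_out)
    print(b0, b1, loss)
```

In the toy data the teacher is an exact affine function of the student, so the residual vanishes and the loss comes entirely from the coarse term: (1 - 0)² + (2 - 1)² = 2.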

What carries the argument

LIFT performs linear-fitting-based distillation to separate coarse alignment from fine refinement; PLACE then partitions the output space by local error magnitude to compute spatially adaptive loss coefficients.
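The partition-then-fit step can be sketched in the same spirit. The code below assumes, following the Figure 4 caption, that output elements are ranked by absolute student-teacher error, split into K rank-ordered groups of near-equal size, and given one least-squares (intercept, slope) pair per group. The function names, the flattened-list representation, and the toy data are assumptions made for illustration only.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Least-squares fit ys ~ b0 + b1 * xs; slope 1 if xs is constant."""
    mx, my = fmean(xs), fmean(ys)
    var_x = sum((x - mx) ** 2 for x in xs)
    if var_x == 0.0:
        return my - mx, 1.0  # assumption: fall back to a pure offset
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var_x
    return my - b1 * mx, b1


def error_ranked_groups(student, teacher, k):
    """Split element indices into k groups ordered by |teacher - student|."""
    n = len(student)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of elements")
    order = sorted(range(n), key=lambda i: abs(teacher[i] - student[i]))
    # Integer boundaries keep group sizes within one element of each other.
    return [order[g * n // k:(g + 1) * n // k] for g in range(k)]


def per_group_coefficients(student, teacher, k):
    """Fit one (b0, b1) pair inside every error-ranked group."""
    coeffs = []
    for idx in error_ranked_groups(student, teacher, k):
        s = [student[i] for i in idx]
        t = [teacher[i] for i in idx]
        coeffs.append(fit_line(s, t))
    return coeffs


if __name__ == "__main__":
    student_out = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    # First half already matches the teacher; second half is off by a factor 2.
    teacher_out = [0.0, 1.0, 2.0, 3.0, 8.0, 10.0, 12.0, 14.0]
    for b0, b1 in per_group_coefficients(student_out, teacher_out, k=2):
        print(b0, b1)
```

With the toy data the low-error group recovers the identity line and the high-error group recovers slope 2, which is the kind of per-group difference a single global fit would average away.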

If this is right

  • Stable training remains possible even when the student is only 1.6 percent of teacher size.
  • The same procedure transfers across pixel versus latent diffusion spaces and across U-Net versus DiT architectures.
  • The framework also improves distillation for flow-based generative models such as MMDiT.
  • Performance holds for both unconditional and class-conditional generation tasks.
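The 1.6 percent figure can be checked with a few lines of arithmetic. The sketch below assumes the teacher is the 78.7M-parameter model named in the Figure 7 caption and the student is the 1.3M-parameter model from the abstract; pairing those two numbers, and the variable names, are assumptions of this example.

```python
# Parameter counts quoted on this page, in millions.
TEACHER_PARAMS_M = 78.7  # teacher size from the Figure 7 caption
STUDENT_PARAMS_M = 1.3   # student size from the abstract


def size_ratio(student: float, teacher: float) -> float:
    """Return the student's size as a fraction of the teacher's."""
    return student / teacher


ratio = size_ratio(STUDENT_PARAMS_M, TEACHER_PARAMS_M)
reduction = 1.0 - ratio

print(f"student / teacher  = {ratio:.2%}")      # 1.65%
print(f"capacity reduction = {reduction:.2%}")  # 98.35%
```

Both numbers agree with the page's wording: roughly 1.6 percent of the teacher, or a reduction of more than 98 percent.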

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Edge devices could run high-quality diffusion sampling with far smaller memory footprints if the coarse-to-fine schedule is adopted.
  • The error-partitioning idea may transfer to other teacher–student gaps in generative modeling beyond diffusion.
  • A natural next test is whether the same staged guidance improves distillation for video or 3-D diffusion models.

Load-bearing premise

The teacher's denoising trajectory contains separable coarse and fine components that error-based local re-weighting can usefully expose to a much smaller student.

What would settle it

Train a 1.3M-parameter student with the full LIFT-plus-PLACE pipeline on a standard benchmark; if the resulting FID exceeds 50 or training diverges, the claim that the decomposition supplies stable guidance is false.
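The test above reduces to a simple decision rule over a training run's FID log. The sketch below is a hypothetical helper, not part of the paper: it assumes FID is logged as a list of floats over training and treats any non-finite value as a proxy for divergence.

```python
import math

FID_THRESHOLD = 50.0  # cut-off taken from the settling test on this page


def claim_survives(fid_history):
    """Return True if no logged FID is non-finite and the final FID is <= 50."""
    if not fid_history:
        raise ValueError("fid_history must contain at least one value")
    # Assumption: a NaN or infinite FID is taken as evidence of divergence.
    diverged = any(not math.isfinite(f) for f in fid_history)
    return (not diverged) and fid_history[-1] <= FID_THRESHOLD


if __name__ == "__main__":
    print(claim_survives([120.0, 40.0, 15.73]))   # stable run ending below 50
    print(claim_survives([80.0, 60.0]))           # finite, but final FID above 50
    print(claim_survives([90.0, float("nan")]))   # non-finite value in the log
```

A real replication would also need a divergence criterion on the training loss itself; the non-finite check here is only the simplest stand-in.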

Figures

Figures reproduced from arXiv: 2605.19729 by Hyunsoo Han, Jaejun Yoo, Sangyeop Yeo.

Figure 1: Impact of teacher network scale on distilling diffusion…
Figure 2: Regression-based correction analysis. At each time step…
Figure 3: Visualization of (a) input image, latent error map…
Figure 4: Overview of LIFT and PLACE. LIFT parameterizes KD via linear regression, regularizing (β0 → 0, β1 → 1) to align low-order moments “Coarse–Easy” and using the residual to learn “Fine–Hard” with an adaptive weight w. PLACE ranks error magnitudes E, partitions outputs into equal-sized groups, estimates (β0,i, β1,i) and applies LIFT in each group for difficulty adaptive estimation.
Figure 5: Qualitative results of pruned SD 2.1. Our method achieves improved semantic adherence to red-highlighted background cues.
Figure 6: (a) Numerical labels indicate FID at each itera…
Figure 7: Effects of group-size K. Students are 90% pruned on CelebA and distilled from the 78.7M-parameter teacher.
Figure 8: Error map of lightweight student models after being…
Figure 9: Visualization of error map of TinyFusion (i.e., DiT-D7):…
Figure 10: Visualization of pixel space diffusion models with LSUN Bedroom.
Figure 11: Visualization of pruned Stable Diffusion 2.1.
Figure 12: Visualization results for DiT-D14 and DiT-D7. The top row compares DiT-D14, and the bottom row compares DiT-D7.
Original abstract

We demonstrate that in knowledge distillation for diffusion models, the teacher network's highly complex denoising process - stemming from its substantially larger capacity - poses a significant challenge for the student model to faithfully mimic. To address this problem, we propose a coarse-to-fine distillation framework with LInear FiTting-based distillation (LIFT) and Piecewise Local Adaptive Coefficient Estimation (PLACE). First, LIFT decomposes the objective into a "coarse" alignment and a "fine" refinement. The student is then trained on coarse alignment before proceeding to hard refinement. Second, PLACE extends LIFT to address spatially non-uniform errors by partitioning outputs into error-based groups, providing locally adaptive guidance. Our experiments show that LIFT and PLACE is effective across diffusion spaces (image/latent), backbones (U-Net/DiT), tasks (unconditional/conditional), datasets, and even extends to flow-based models such as MMDiT (SD3). Furthermore, under extreme compression with a 1.3M-parameter student (only 1.6% of the teacher), conventional KD fails to provide sufficient guidance for stable training, with FID scores often degrading to 50-200+, but our method remains stably convergent and achieves an FID of 15.73.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes LIFT (Linear Fitting-based distillation) and PLACE (Piecewise Local Adaptive Coefficient Estimation) as a coarse-to-fine knowledge distillation framework for lightweight diffusion models. LIFT decomposes the teacher's complex denoising into an initial coarse alignment stage followed by fine refinement, while PLACE partitions outputs into error-based groups to supply locally adaptive guidance. Experiments claim stable convergence and strong FID (15.73) for a 1.3M-parameter student (1.6% of teacher size) where conventional KD degrades to FID 50-200+, with demonstrations across image/latent spaces, U-Net/DiT backbones, unconditional/conditional tasks, and extension to flow-based models like MMDiT.

Significance. If the central stability claims hold under rigorous controls, the framework could meaningfully advance practical deployment of diffusion models on edge devices by enabling reliable extreme compression without training collapse. The cross-backbone and cross-task generality, plus the parameter-free flavor of the linear-fitting core, would be notable strengths.

major comments (2)
  1. [Experiments] Experiments section: the headline comparison treats direct denoising-output imitation as the sole conventional KD baseline. Stronger standard techniques (intermediate feature matching, attention transfer, or multi-scale losses) that are common in the diffusion KD literature are not shown to fail under the same 1.6% compression regime; without these controls the necessity of the LIFT/PLACE error-based decomposition for stability is not established.
  2. [§3.2] §3.2 (PLACE description): the partitioning into error-based groups relies on ad-hoc thresholds whose selection is not ablated or justified; it is unclear whether the reported stability is robust to reasonable variations in these thresholds or whether they must be tuned per dataset/backbone.
minor comments (2)
  1. [Abstract] The abstract states results 'across diffusion spaces (image/latent), backbones (U-Net/DiT), tasks (unconditional/conditional), datasets', yet the main text would benefit from a consolidated table summarizing FID/PSNR across all these axes rather than scattered figures.
  2. [Method] Notation for the linear-fitting coefficients in LIFT and the local adaptive coefficients in PLACE could be unified or given a single table of definitions to reduce reader effort when tracing the coarse-to-fine schedule.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We address the major concerns point by point below and will make the necessary revisions to improve the paper.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the headline comparison treats direct denoising-output imitation as the sole conventional KD baseline. Stronger standard techniques (intermediate feature matching, attention transfer, or multi-scale losses) that are common in the diffusion KD literature are not shown to fail under the same 1.6% compression regime; without these controls the necessity of the LIFT/PLACE error-based decomposition for stability is not established.

    Authors: We agree that a more comprehensive set of baselines would strengthen the claims. Direct output imitation is a standard approach for distilling diffusion models, but we acknowledge that feature matching and attention transfer are also used in the broader KD literature. To establish the necessity of our LIFT/PLACE framework under extreme compression, we will include additional experiments comparing against these stronger baselines in the revised version. This will better demonstrate where conventional techniques fail and why our coarse-to-fine decomposition is beneficial. revision: yes

  2. Referee: [§3.2] §3.2 (PLACE description): the partitioning into error-based groups relies on ad-hoc thresholds whose selection is not ablated or justified; it is unclear whether the reported stability is robust to reasonable variations in these thresholds or whether they must be tuned per dataset/backbone.

    Authors: The thresholds in PLACE were selected empirically to partition the error distribution into groups of roughly equal size, ensuring that the local adaptive coefficients are meaningful. We appreciate the concern regarding robustness. In the revision, we will add an ablation study varying the thresholds and report performance across different datasets and backbones to show that the stability is not overly sensitive to exact threshold choices. revision: yes
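If the groups are formed by rank into near-equal sizes, as the response states, then the error thresholds are implied by the data and the group count K rather than set by hand. The sketch below illustrates that point with a hypothetical helper; the supplementary excerpt quoted in the reference graph reports K=16, while the toy example uses K=4 for readability.

```python
def implied_thresholds(errors, k):
    """Return the k-1 error values that separate k rank-ordered, near-equal groups."""
    n = len(errors)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of errors")
    ranked = sorted(errors)
    # Each threshold is the smallest error that falls into the next group.
    return [ranked[g * n // k] for g in range(1, k)]


if __name__ == "__main__":
    per_element_error = [0.9, 0.1, 0.4, 0.2, 0.7, 0.3, 0.8, 0.5]
    print(implied_thresholds(per_element_error, k=4))  # [0.3, 0.5, 0.8]
```

Under this reading, an ablation over thresholds amounts to an ablation over K, since the boundaries move automatically with the error distribution of each batch.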

Circularity Check

0 steps flagged

No significant circularity: LIFT/PLACE are defined algorithmic steps independent of target outputs or self-referential fits.

full rationale

The paper introduces LIFT as an explicit two-stage decomposition (coarse alignment then fine refinement) and PLACE as error-based partitioning for local coefficients. These are presented as new procedural choices rather than quantities fitted from the student-teacher outputs or derived via self-citation chains. The central claim (stable convergence at 1.6% compression) rests on empirical comparison to a conventional KD baseline, not on any equation that reduces to its own inputs by construction. No load-bearing step matches the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on domain assumptions about decomposability of the denoising objective and the utility of error-based partitioning; no new physical entities are introduced.

free parameters (1)
  • error group thresholds in PLACE
    Likely tuned to define partitions for local adaptation, though exact values are not specified in the abstract.
axioms (1)
  • domain assumption: The denoising process admits a useful coarse-to-fine decomposition for student learning
    Invoked to justify the LIFT stage separation.

pith-pipeline@v0.9.0 · 5764 in / 1195 out tokens · 43811 ms · 2026-05-21T07:27:26.211096+00:00 · methodology


Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 4 internal anchors

  [1] Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4794–4802, 2019.

  [2] Jiwoo Chung, Sangeek Hyun, Sang-Heon Shim, and Jae-Pil Heo. Diversity-aware channel pruning for stylegan compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7902–7911,

  [3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.

  [4] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning,

  [5] Gongfan Fang, Xinyin Ma, and Xinchao Wang. Structural pruning for diffusion models. In Advances in Neural Information Processing Systems, pages 16716–16728. Curran Associates, Inc., 2023.

  [6] Gongfan Fang, Kunjun Li, Xinyin Ma, and Xinchao Wang. Tinyfusion: Diffusion transformers learned shallow. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18144–18154, 2025.

  [7] Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Lingjie Liu, and Joshua M Susskind. Boot: Data-free distillation of denoising diffusion models with bootstrapping. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.

  [8] Yatharth Gupta, Vishnu V Jaddipal, Harish Prabhala, Sayak Paul, and Patrick von Platen. Progressive knowledge distillation of stable diffusion xl using layer level loss. arXiv preprint arXiv:2401.02677, 2024.

  [9] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718,

  [10] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.

  [11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.

  [12] Tao Huang, Shan You, Fei Wang, Chen Qian, and Chang Xu. Knowledge distillation from a stronger teacher. Advances in Neural Information Processing Systems, 35:33716–33727,

  [13] Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, and Shinkook Choi. Bk-sdm: A lightweight, fast, and cheap version of stable diffusion. In European Conference on Computer Vision, pages 381–399. Springer, 2024.

  [14] Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory models: Learning probability flow ode trajectory of diffusion. arXiv preprint arXiv:2310.02279, 2023.

  [15] Dohyun Kim, Sehwan Park, Geonhee Han, Seung Wook Kim, and Paul Hongsuck Seo. Random conditioning with distillation for data-efficient diffusion model compression. arXiv preprint arXiv:2504.02011, 2025.

  [16] Hyeonjin Kim and Jaejun Yoo. Singular value scaling: Efficient generative model compression via pruned weights refinement. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 17859–17867, 2025.

  [17] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. Advances in neural information processing systems, 32, 2019.

  [18] Youngwan Lee, Kwanyong Park, Yoorhim Cho, Yong-Ju Lee, and Sung Ju Hwang. Koala: Empirical lessons toward memory-efficient and fast diffusion models for text-to-image synthesis. Advances in Neural Information Processing Systems, 37:51597–51633, 2024.

  [19] Tong Li, Long Liu, Yihang Hu, Hu Chen, and Shifeng Chen. Dual-forward path teacher knowledge distillation: Bridging the capacity gap between teacher and student. arXiv preprint arXiv:2506.18244, 2025.

  [20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.

  [21] Yuchen Liu, Zhixin Shu, Yijun Li, Zhe Lin, Federico Perazzi, and Sun-Yuan Kung. Content-aware gan compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12156–12166, 2021.

  [22] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.

  [23] Xin Men, Mingyu Xu, Qingyu Zhang, Qianhao Yuan, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20192–20204, 2025.

  [24] Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI conference on artificial intelligence, pages 5191–5198, 2020.

  [25] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205,

  [26] Chengyao Qian, Trung Le, and Mehrtash Harandi. A good teacher adapts their knowledge for distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1239–1248, 2025.

  [27] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.

  [28] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.

  [29] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.

  [30] Christoph Schuhmann and Romain Beaumont. Laion-aesthetics. https://laion.ai/blog/laion-aesthetics/, 2022. Accessed: 2025-08-12.

  [31] Wonchul Son, Jaemin Na, Junyong Choi, and Wonjun Hwang. Densely guided knowledge distillation using multiple teacher assistants. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9395–9404,

  [32] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

  [33] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. 2023.

  [34] Yuda Song, Zehao Sun, and Xuanwu Yin. Sdxs: Real-time one-step latent diffusion models with image conditions. arXiv preprint arXiv:2403.16627, 2024.

  [35] Chaofei Wang, Qisen Yang, Rui Huang, Shiji Song, and Gao Huang. Efficient knowledge distillation from model checkpoints. Advances in Neural Information Processing Systems, 35:607–619, 2022.

  [36] Qianlong Xiang, Miao Zhang, Yuzhang Shang, Jianlong Wu, Yan Yan, and Liqiang Nie. Dkdm: Data-free knowledge distillation for diffusion models with any architecture. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2955–2965, 2025.

  [37] Guodong Xu, Yuenan Hou, Ziwei Liu, and Chen Change Loy. Mind the gap in distilling stylegans. In European Conference on Computer Vision, pages 423–439. Springer, 2022.

  [38] Sangyeop Yeo, Yoojin Jang, and Jaejun Yoo. Nickel and diming your gan: A dual-method approach to enhancing gan efficiency via knowledge distillation. In European Conference on Computer Vision, pages 104–121. Springer, 2024.

  [39] Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024.

  [40] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024.

  [41] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

  [42] Dingkun Zhang, Sijia Li, Chen Chen, Qingsong Xie, and Haonan Lu. Laptop-diff: Layer pruning and normalized distillation for compressing diffusion models. arXiv preprint arXiv:2404.11098, 2024.

  [43] Yang Zhao, Hao Zhang, and Xiuyuan Hu. Penalizing gradient norm for efficiently improving generalization in deep learning. In International conference on machine learning, pages 26982–26992. PMLR, 2022.

  [44] Experimental Results of Figure 1. Fig. 1 illustrates the challenge of the teacher–student capacity gap in KD for lightweight diffusion models. We fix a 90%-pruned 1.3M-student and distill it from four teachers of varying capacities (78.7M, 19.7M, 16.6M, and 9.2M), evaluating two KD objectives under fixed hyperparameters. Each setting is run five times,...

  [45] Experiments Details. Across all experiments, we fix PLACE’s group size to K=16, as determined by our ablation study (see Fig. 7). For image space diffusion models, we use the Diff-Pruning base pruned model with varying pruning ratios, where the pruning ratio denotes the fraction of teacher channels removed. We set λ_diff = 1 and λ_FeatKD = 1e−6 for all such ex- ...

  [46] Non-uniform Error Across Architectures. In Sec. 3.2 of the main paper, we showed that the distillation error between teacher and student is spatially non-uniform and exhibits a highly structured pattern that correlates with semantic content. Here, we examine whether this phenomenon persists across diffusion models with substantially different archit...

  [47] Additional Ablation Studies. We provide detailed ablation studies to validate our LIFT and PLACE. All student models in these experiments are distilled from the strongest 78.7M-teacher model. While the CelebA experiments in Tab. 1 (i.e., 19.7M- and 1.3M-student models) reveal that OutKD+FeatKD occasionally leads to performance degradation compared to Feat...

  [48] Related Works. 12.1. Efficient Diffusion Model. Although diffusion models [11, 25, 27, 32] demonstrate outstanding performance, their inherent iterative denoising process not only demands substantial computational resources but also makes it challenging to apply existing compression methods designed for feed-forward networks [2, 16, 21, 37, 38]. To ad...

  [49] Visualization Results. We provide visualization results of our experiments. The following figures present representative samples for image and latent space diffusion models. Each Figs. 10 to 12 correspond to Tabs. 1 to 3, respectively. The results show that across all model sizes (see Fig. 10), our method produces noticeably more stable and realistic s...