pith. machine review for the scientific record. sign in

arxiv: 2402.13929 · v3 · submitted 2024-02-21 · 💻 cs.CV · cs.AI· cs.LG

Recognition: 2 theorem links

· Lean Theorem

SDXL-Lightning: Progressive Adversarial Diffusion Distillation

Authors on Pith no claims yet

Pith reviewed 2026-05-17 05:04 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG
keywords diffusion distillationadversarial distillationprogressive trainingtext-to-imageone-step generationfew-step generationSDXL
0
0 comments X

The pith

A distillation method combines progressive and adversarial training to enable one-step high-quality 1024-pixel image generation from SDXL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes a diffusion distillation approach that delivers state-of-the-art performance for generating 1024-pixel text-to-image outputs in just one or a few steps. It does so by merging progressive distillation, which reduces steps gradually, with adversarial training that uses a discriminator to guide the model toward realistic and diverse results. A sympathetic reader would care because current high-quality diffusion models like SDXL are computationally expensive due to their multi-step nature, limiting their use in real-time or resource-constrained settings. The work includes analysis of the method, design of the discriminator, formulation of the model, and training strategies, along with open-sourced implementations.

Core claim

The authors claim that through progressive adversarial diffusion distillation on the SDXL model, they achieve new state-of-the-art results in one-step and few-step 1024px text-to-image generation by balancing perceptual quality with mode coverage, supported by theoretical analysis and specific training techniques.

What carries the argument

The progressive adversarial distillation process, which integrates staged reduction of diffusion steps with adversarial losses from a discriminator to preserve both fidelity and variety in the generated images.

If this is right

  • The resulting SDXL-Lightning models generate images in one or few steps instead of many.
  • They maintain better mode coverage than previous distillation methods.
  • Both LoRA adapters and full model weights are made available for users.
  • The method scales to 1024px resolution without major quality degradation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying similar distillation to other diffusion-based models could accelerate generation in related domains like video or 3D synthesis.
  • The open-sourced weights may enable community experiments on further optimization or fine-tuning for specific tasks.
  • Testing the approach on even larger models might reveal if the balance between quality and coverage holds at greater scales.

Load-bearing premise

The progressive adversarial training maintains both high perceptual quality and broad mode coverage without causing artifacts or mode collapse at the scale of the SDXL model.

What would settle it

Running the model on a diverse set of prompts and measuring diversity metrics or human evaluations showing significant drop in variety or introduction of artifacts compared to full SDXL would indicate the claim does not hold.

read the original abstract

We propose a diffusion distillation method that achieves new state-of-the-art in one-step/few-step 1024px text-to-image generation based on SDXL. Our method combines progressive and adversarial distillation to achieve a balance between quality and mode coverage. In this paper, we discuss the theoretical analysis, discriminator design, model formulation, and training techniques. We open-source our distilled SDXL-Lightning models both as LoRA and full UNet weights.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SDXL-Lightning, a diffusion distillation method that combines progressive and adversarial distillation applied to the SDXL model. It claims new state-of-the-art results for one-step and few-step 1024px text-to-image generation by balancing perceptual quality and mode coverage. The manuscript covers theoretical analysis, discriminator design, model formulation, training techniques, and experimental validation, with open-sourced LoRA and full UNet weights.

Significance. If the central claims hold, the work would advance efficient inference for high-capacity diffusion models at high resolution. The progressive adversarial combination and explicit discriminator scaling to SDXL represent a practical engineering contribution. Open-sourcing both LoRA and full weights is a clear strength that aids reproducibility and downstream use.

major comments (2)
  1. [§5.2] §5.2 (Quantitative Results): The reported metrics focus on FID, CLIP score, and qualitative examples, but no recall, precision-recall curves, or intra-class diversity statistics are provided to verify mode coverage. This is load-bearing for the abstract claim that the method achieves a balance between quality and mode coverage without collapse when the discriminator is scaled to the full SDXL UNet.
  2. [§4.1] §4.1 (Discriminator Design): The architecture description scales the discriminator to SDXL but does not include an explicit diversity regularization term or ablation on its effect. Without this, the assumption that progressive adversarial training avoids the common artifact and collapse regimes at 1024 px remains untested in the reported experiments.
minor comments (2)
  1. [Figure 4] Figure 4 caption: Add the exact prompt templates and random seeds used for the qualitative comparisons to improve reproducibility.
  2. [§3.3] §3.3: The weighting schedule between progressive and adversarial losses is described qualitatively; a precise equation or pseudocode for the schedule would clarify implementation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. The comments highlight important aspects of our quantitative evaluation and discriminator design that we address below. We have prepared revisions to incorporate additional analysis where feasible.

read point-by-point responses
  1. Referee: [§5.2] §5.2 (Quantitative Results): The reported metrics focus on FID, CLIP score, and qualitative examples, but no recall, precision-recall curves, or intra-class diversity statistics are provided to verify mode coverage. This is load-bearing for the abstract claim that the method achieves a balance between quality and mode coverage without collapse when the discriminator is scaled to the full SDXL UNet.

    Authors: We agree that recall and precision-recall analysis would provide more direct evidence for mode coverage claims. Standard metrics like FID and CLIP score are used in the field, and our qualitative examples at 1024px demonstrate diversity without visible collapse. To strengthen the manuscript, we will add precision-recall curves and recall statistics computed on a held-out set to Section 5.2 in the revision. revision: yes

  2. Referee: [§4.1] §4.1 (Discriminator Design): The architecture description scales the discriminator to SDXL but does not include an explicit diversity regularization term or ablation on its effect. Without this, the assumption that progressive adversarial training avoids the common artifact and collapse regimes at 1024 px remains untested in the reported experiments.

    Authors: Our theoretical analysis in the paper argues that the progressive schedule combined with adversarial distillation promotes coverage by gradually increasing the discriminator's capacity, which helps avoid collapse even at full SDXL scale. We did not add an explicit diversity regularization term to keep the objective focused. We acknowledge that an ablation isolating the progressive component's role in preventing artifacts would be valuable, and we will include such an ablation study in the revised Section 4.1. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper describes a combination of progressive and adversarial distillation applied to the SDXL model for one-step and few-step text-to-image generation. No equations, model formulations, or training procedures are presented in the provided abstract or context that reduce any claimed prediction or result to a fitted parameter defined by the target metric itself, nor do they rely on self-citations or imported uniqueness theorems in a load-bearing manner. The central claims rest on empirical application of existing distillation ideas to a new architecture and scale, with the derivation remaining self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard diffusion model assumptions plus new training choices whose details are not visible in the abstract; no explicit free parameters or invented entities are named.

pith-pipeline@v0.9.0 · 5364 in / 1075 out tokens · 21396 ms · 2026-05-17T05:04:24.605423+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Continuous-Time Distribution Matching for Few-Step Diffusion Distillation

    cs.CV 2026-05 unverdicted novelty 8.0

    CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.

  2. Asymmetric Flow Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...

  3. Inverse Design for Conditional Distribution Matching

    cs.LG 2026-05 unverdicted novelty 7.0

    Defines Conditional Distribution Matching (CDM) as finding inputs whose induced conditional distributions match a target distribution and proposes the MLGD-F inference-time algorithm using pretrained diffusion models ...

  4. GeoEdit: Local Frames for Fast, Training-Free On-Manifold Editing in Diffusion Models

    cs.LG 2026-04 unverdicted novelty 7.0

    GeoEdit constructs local tangent frames from small perturbations to initial noise, enabling Jacobian-free on-manifold edits in diffusion models via alternating tangent steps and diffusion projections.

  5. Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.

  6. 1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation

    cs.CV 2026-04 conditional novelty 7.0

    1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.

  7. Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting

    cs.CV 2026-03 unverdicted novelty 7.0

    Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.

  8. FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching

    cs.CV 2026-05 unverdicted novelty 6.0

    FlashClear delivers up to 122x faster object removal than prior diffusion models via adversarial step distillation and asymmetric attention caching while preserving visual quality.

  9. FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching

    cs.CV 2026-05 unverdicted novelty 6.0

    FlashClear achieves up to 8.26x speedup over its base diffusion model and 122x over OmniPaint for image object removal via region-aware adversarial distillation and foreground-prioritized caching while claiming to mai...

  10. Efficient Diffusion Distillation via Embedding Loss

    cs.CV 2026-04 unverdicted novelty 6.0

    Embedding Loss aligns feature distributions via MMD in random network embeddings to boost one-step diffusion distillation, reaching SOTA FID of 1.475 on CIFAR-10 unconditional generation.

  11. Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

    cs.CV 2026-04 unverdicted novelty 6.0

    By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.

  12. BiasIG: Benchmarking Multi-dimensional Social Biases in Text-to-Image Models

    cs.CY 2026-04 conditional novelty 6.0

    BiasIG is a multi-dimensional benchmark for social biases in T2I models that shows debiasing interventions frequently cause confounding discrimination effects.

  13. Continuous Adversarial Flow Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...

  14. ExpressEdit: Fast Editing of Stylized Facial Expressions with Diffusion Models in Photoshop

    cs.CV 2026-04 unverdicted novelty 6.0

    ExpressEdit delivers fast, artifact-free stylized facial expression editing inside Photoshop via a diffusion model plugin and an accompanying expression database.

  15. WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

    cs.CV 2025-12 unverdicted novelty 6.0

    WorldPlay uses dual action representation, reconstituted context memory, and context forcing distillation to produce consistent 720p streaming video at 24 FPS for interactive world modeling.

  16. Teacher-Feature Drifting: One-Step Diffusion Distillation with Pretrained Diffusion Representations

    cs.CV 2026-05 unverdicted novelty 5.0

    A simplified one-step diffusion distillation uses pretrained teacher features directly for drifting loss plus a mode coverage term, achieving FID 1.58 on ImageNet-64 and 18.4 on SDXL.

  17. Reward-Aware Trajectory Shaping for Few-step Visual Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.

  18. TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    TurboTalk uses progressive distillation from 4 steps to 1 step with distribution matching and adversarial training to achieve 120x faster single-step audio-driven talking avatar video generation.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 17 Pith papers · 1 internal anchor

  1. [1]

    https : / / civitai

    AAM-XL Anime Mix. https : / / civitai . com / models/269232. 9

  2. [2]

    Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023. 1

  3. [3]

    Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis

    A. Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent dif- fusion models. 2023 IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR) , pages 22563–22575,

  4. [4]

    Coyo-700m: Image-text pair dataset

    Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https : / / github . com / kakaobrain/coyo-dataset, 2022. 6

  5. [5]

    Pixart-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image syn- thesis

    Junsong Chen, Jincheng YU, Chongjian GE, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image syn- thesis. In The Twelfth International Conference on Learning Representations, 2024. 1

  6. [6]

    Flashattention-2: Faster attention with better par- allelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better par- allelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024. 6

  7. [7]

    Flashattention: Fast and memory-efficient exact attention with IO-awareness

    Tri Dao, Daniel Y Fu, Stefano Ermon, Atri Rudra, and Christopher Re. Flashattention: Fast and memory-efficient exact attention with IO-awareness. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems , 2022. 6

  8. [8]

    Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. Com- munications of the ACM, 63:139 – 144, 2014. 3, 4

  9. [9]

    Smooth diffusion: Crafting smooth latent spaces in dif- fusion models, 2023

    Jiayi Guo, Xingqian Xu, Yifan Pu, Zanlin Ni, Chaofei Wang, Manushree Vasu, Shiji Song, Gao Huang, and Humphrey Shi. Smooth diffusion: Crafting smooth latent spaces in dif- fusion models, 2023. 4

  10. [10]

    Animatediff: Animate your personalized text- to-image diffusion models without specific tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text- to-image diffusion models without specific tuning. In The Twelfth International Conference on Learning Representa- tions, 2024. 1, 2, 3

  11. [11]

    Gaussian error linear units (gelus), 2023

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus), 2023. 5

  12. [12]

    Gans trained by a two time-scale update rule converge to a local nash equilib- rium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. In Isabelle Guyon, Ulrike von Luxburg, Samy Ben- gio, Hanna M. Wallach, Rob Fergus, S. V . N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Informa- tion Processing S...

  13. [13]

    Kingma, Ben Poole, Mohammad Norouzi, David J

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Sali- mans. Imagen video: High definition video generation with diffusion models, 2022. 1

  14. [14]

    Denoising diffu- sion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan- Tien Lin, editors, Advances in Neural Information Process- ing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtua...

  15. [15]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 5

  16. [16]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InIn- ternational Conference on Learning Representations , 2022. 2, 3, 4

  17. [17]

    Scaling up gans for text-to-image synthesis

    Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. 2023 IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR) , pages 10124–10134, 2023. 5, 6

  18. [18]

    MSG-GAN: multi- scale gradients for generative adversarial networks

    Animesh Karnewar and Oliver Wang. MSG-GAN: multi- scale gradients for generative adversarial networks. In 2020 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, CVPR 2020, Seattle, WA, USA, June 13- 19, 2020, pages 7796–7805. IEEE, 2020. 6

  19. [19]

    Elucidating the design space of diffusion-based generative models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Informa- tion Processing Systems, 2022. 2

  20. [20]

    Training generative ad- versarial networks with limited data

    Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative ad- versarial networks with limited data. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Informa- tion Processing Systems 33: Annual Conference on Neural Information Pr...

  21. [21]

    Analyzing and improving the image quality of stylegan

    Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In 2020 IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020 , pages 8107–

  22. [22]

    Consistency trajectory mod- els: Learning probability flow ODE trajectory of diffusion

    Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Mu- rata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory mod- els: Learning probability flow ODE trajectory of diffusion. In The Twelfth International Conference on Learning Repre- sentations, 2024. 2, 3 10

  23. [23]

    The lipschitz constant of self-attention

    Hyunjik Kim, George Papamakarios, and Andriy Mnih. The lipschitz constant of self-attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 5562–5571. PMLR, 2021. 5

  24. [24]

    Kingma and Jimmy Ba

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Represen- tations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. 6

  25. [25]

    Kingma and Max Welling

    Diederik P. Kingma and Max Welling. Auto-encoding vari- ational bayes. In Yoshua Bengio and Yann LeCun, editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Confer- ence Track Proceedings, 2014. 2

  26. [26]

    Common diffusion noise schedules and sample steps are flawed

    Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , pages 5404– 5411, January 2024. 5

  27. [27]

    Diffusion model with per- ceptual loss, 2024

    Shanchuan Lin and Xiao Yang. Diffusion model with per- ceptual loss, 2024. 4

  28. [28]

    Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll ´ar, and C

    Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll ´ar, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014. 8

  29. [29]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maxim- ilian Nickel, and Matthew Le. Flow matching for genera- tive modeling. In The Eleventh International Conference on Learning Representations, 2023. 2, 4

  30. [30]

    Pseudo numerical methods for diffusion models on manifolds

    Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. InIn- ternational Conference on Learning Representations , 2022. 2

  31. [31]

    Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022. 2, 3, 4

  32. [32]

    Instaflow: One step is enough for high-quality diffusion-based text-to-image generation

    Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, and qiang liu. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. In The Twelfth In- ternational Conference on Learning Representations , 2024. 2, 3

  33. [33]

    Decoupled weight de- cay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight de- cay regularization. In International Conference on Learning Representations, 2019. 6

  34. [34]

    DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongx- uan Li, and Jun Zhu. DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. 2

  35. [35]

    Dpm-solver++: Fast solver for guided sam- pling of diffusion probabilistic models, 2023

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sam- pling of diffusion probabilistic models, 2023. 2

  36. [36]

    Latent consistency models: Synthesizing high- resolution images with few-step inference, 2023

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high- resolution images with few-step inference, 2023. 2, 3, 7, 8, 9

  37. [37]

    Lcm-lora: A universal stable-diffusion acceleration module, 2023

    Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolin´ario Passos, Longbo Huang, Jian Li, and Hang Zhao. Lcm-lora: A universal stable-diffusion acceleration module, 2023. 2, 3, 4, 6, 8, 9

  38. [38]

    SDEdit: Guided image synthesis and editing with stochastic differential equa- tions

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jia- jun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equa- tions. In International Conference on Learning Representa- tions, 2022. 3, 6

  39. [39]

    Mescheder, Andreas Geiger, and Sebastian Nowozin

    Lars M. Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually con- verge? In Jennifer G. Dy and Andreas Krause, editors, Pro- ceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsm ¨assan, Stockholm, Swe- den, July 10-15, 2018, volume 80 ofProceedings of Machine Learning Research, ...

  40. [40]

    Mixed precision training

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In International Conference on Learning Representations, 2018. 6

  41. [41]

    Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Je- gou, Julien Mairal, Patr...

  42. [42]

    On aliased resizing and surprising subtleties in gan evaluation

    Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. 2022 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 11400–11410, 2022. 8

  43. [43]

    W ¨urstchen: An ef- ficient architecture for large-scale text-to-image diffusion models

    Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, and Marc Aubreville. W ¨urstchen: An ef- ficient architecture for large-scale text-to-image diffusion models. In The Twelfth International Conference on Learn- ing Representations, 2024. 1

  44. [44]

    SDXL: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth Interna- tional Conference on Learning Representations , 2024. 1, 2, 7, 8

  45. [45]

    Barron, and Ben Milden- hall

    Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Milden- hall. Dreamfusion: Text-to-3d using 2d diffusion. In The Eleventh International Conference on Learning Representa- tions, 2023. 3

  46. [46]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen 11 Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th Interna- tional Conference on Ma...

  47. [47]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16, 2019. 6

  48. [48]

    Prajit Ramachandran, Barret Zoph, and Quoc V . Le. Search- ing for activation functions, 2017. 5

  49. [49]

    Hierarchical text-conditional image gener- ation with clip latents, 2022

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gener- ation with clip latents, 2022. 1

  50. [50]

    https://civitai.com/models/ 139562

    RealVisXL V4.0. https://civitai.com/models/ 139562. 9

  51. [51]

    Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer

    Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021. 1, 2

  52. [52]

    U-Net: Convolutional Networks for Biomedical Image Segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. ArXiv, abs/1505.04597, 2015. 4

  53. [53]

    Fleet, and Mohammad Norouzi

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Alice H. Oh, Alekh Agar- wal, Danielle Belgrave, and ...

  54. [54]

    Progressive distillation for fast sampling of diffusion models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Confer- ence on Learning Representations, 2022. 2, 3

  55. [55]

    https://civitai.com/ models/81270

    Samaritan 3D Cartoon V4. https://civitai.com/ models/81270. 9

  56. [56]

    Projected gans converge faster

    Axel Sauer, Kashyap Chitta, Jens M ¨uller, and Andreas Geiger. Projected gans converge faster. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neu- ral Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14,...

  57. [57]

    Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis

    Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis. In International Confer- ence on Machine Learning , 2023. 3

  58. [58]

    Adversarial diffusion distillation, 2023

    Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation, 2023. 2, 3, 5, 7, 8, 9

  59. [59]

    LAION-5b: An open large-scale dataset for train- ing next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for train- ing next generation image-text...

  60. [60]

    https://huggingface

    SDXL-ControlNet Canny. https://huggingface. co/diffusers/controlnet-canny-sdxl-1.0 . 9

  61. [61]

    https://huggingface

    SDXL-ControlNet Depth. https://huggingface. co/diffusers/controlnet-depth-sdxl-1.0 . 9

  62. [62]

    Make-a-video: Text-to-video generation without text-video data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations, 2023. 1

  63. [63]

    Weiss, Niru Mah- eswaranathan, and Surya Ganguli

    Jascha Sohl-Dickstein, Eric A. Weiss, Niru Mah- eswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Fran- cis R. Bach and David M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015 , volume 37 of JMLR Workshop and Conference Proceedi...

  64. [64]

    Denois- ing diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models. In International Conference on Learning Representations, 2021. 2

  65. [65]

    Improved techniques for training consistency models

    Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. In The Twelfth International Conference on Learning Representations, 2024. 2, 3

  66. [66]

    Consistency models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, 2023. 2, 3

  67. [67]

    Score-based generative modeling through stochastic differential equa- tions

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions. In International Conference on Learning Represen- tations, 2021. 1, 2

  68. [68]

    Rethinking the in- ception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the in- ception architecture for computer vision. In 2016 IEEE Con- ference on Computer Vision and Pattern Recognition, CVPR 2016, Las V egas, NV , USA, June 27-30, 2016 , pages 2818–

  69. [69]

    IEEE Computer Society, 2016. 8

  70. [70]

    Diffusers: State-of-the-art diffusion models

    Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https : / / github . com / huggingface / diffusers, 2022. 5

  71. [71]

    Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distilla- tion

    Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distilla- tion. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. 3

  72. [72]

    Group normalization

    Yuxin Wu and Kaiming He. Group normalization. Inter- national Journal of Computer Vision , 128:742 – 755, 2018. 5 12

  73. [73]

    Tackling the generative learning trilemma with denoising diffusion GANs

    Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. In International Conference on Learning Represen- tations, 2022. 3, 4

  74. [74]

    Ufogen: You forward once large scale text-to-image genera- tion via diffusion gans, 2023

    Yanwu Xu, Yang Zhao, Zhisheng Xiao, and Tingbo Hou. Ufogen: You forward once large scale text-to-image genera- tion via diffusion gans, 2023. 2, 3

  75. [75]

    Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models, 2023

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models, 2023. 2, 3

  76. [76]

    Freeman, and Taesung Park

    Tianwei Yin, Micha ¨el Gharbi, Richard Zhang, Eli Shecht- man, Fredo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation,

  77. [77]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3813–3824, 2023. 2, 3, 9

  78. [78]

    Efros, Eli Shecht- man, and Oliver Wang

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shecht- man, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In 2018 IEEE Con- ference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 586–

  79. [79]

    IEEE Computer Society, 2018. 4

  80. [80]

    Unipc: A unified predictor-corrector framework for fast sampling of diffusion models

    Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. Unipc: A unified predictor-corrector framework for fast sampling of diffusion models. NeurIPS, 2023. 2

Showing first 80 references.