arxiv: 2402.13929 · v3 · submitted 2024-02-21 · 💻 cs.CV · cs.AI· cs.LG

Recognition: 2 theorem links

· Lean Theorem

SDXL-Lightning: Progressive Adversarial Diffusion Distillation

Shanchuan Lin , Anran Wang , Xiao Yang

Authors on Pith no claims yet

Pith reviewed 2026-05-17 05:04 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG

keywords diffusion distillationadversarial distillationprogressive trainingtext-to-imageone-step generationfew-step generationSDXL

0 comments

The pith

A distillation method combines progressive and adversarial training to enable one-step high-quality 1024-pixel image generation from SDXL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes a diffusion distillation approach that delivers state-of-the-art performance for generating 1024-pixel text-to-image outputs in just one or a few steps. It does so by merging progressive distillation, which reduces steps gradually, with adversarial training that uses a discriminator to guide the model toward realistic and diverse results. A sympathetic reader would care because current high-quality diffusion models like SDXL are computationally expensive due to their multi-step nature, limiting their use in real-time or resource-constrained settings. The work includes analysis of the method, design of the discriminator, formulation of the model, and training strategies, along with open-sourced implementations.

Core claim

The authors claim that through progressive adversarial diffusion distillation on the SDXL model, they achieve new state-of-the-art results in one-step and few-step 1024px text-to-image generation by balancing perceptual quality with mode coverage, supported by theoretical analysis and specific training techniques.

What carries the argument

The progressive adversarial distillation process, which integrates staged reduction of diffusion steps with adversarial losses from a discriminator to preserve both fidelity and variety in the generated images.

If this is right

The resulting SDXL-Lightning models generate images in one or few steps instead of many.
They maintain better mode coverage than previous distillation methods.
Both LoRA adapters and full model weights are made available for users.
The method scales to 1024px resolution without major quality degradation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applying similar distillation to other diffusion-based models could accelerate generation in related domains like video or 3D synthesis.
The open-sourced weights may enable community experiments on further optimization or fine-tuning for specific tasks.
Testing the approach on even larger models might reveal if the balance between quality and coverage holds at greater scales.

Load-bearing premise

The progressive adversarial training maintains both high perceptual quality and broad mode coverage without causing artifacts or mode collapse at the scale of the SDXL model.

What would settle it

Running the model on a diverse set of prompts and measuring diversity metrics or human evaluations showing significant drop in variety or introduction of artifacts compared to full SDXL would indicate the claim does not hold.

read the original abstract

We propose a diffusion distillation method that achieves new state-of-the-art in one-step/few-step 1024px text-to-image generation based on SDXL. Our method combines progressive and adversarial distillation to achieve a balance between quality and mode coverage. In this paper, we discuss the theoretical analysis, discriminator design, model formulation, and training techniques. We open-source our distilled SDXL-Lightning models both as LoRA and full UNet weights.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SDXL-Lightning, a diffusion distillation method that combines progressive and adversarial distillation applied to the SDXL model. It claims new state-of-the-art results for one-step and few-step 1024px text-to-image generation by balancing perceptual quality and mode coverage. The manuscript covers theoretical analysis, discriminator design, model formulation, training techniques, and experimental validation, with open-sourced LoRA and full UNet weights.

Significance. If the central claims hold, the work would advance efficient inference for high-capacity diffusion models at high resolution. The progressive adversarial combination and explicit discriminator scaling to SDXL represent a practical engineering contribution. Open-sourcing both LoRA and full weights is a clear strength that aids reproducibility and downstream use.

major comments (2)

[§5.2] §5.2 (Quantitative Results): The reported metrics focus on FID, CLIP score, and qualitative examples, but no recall, precision-recall curves, or intra-class diversity statistics are provided to verify mode coverage. This is load-bearing for the abstract claim that the method achieves a balance between quality and mode coverage without collapse when the discriminator is scaled to the full SDXL UNet.
[§4.1] §4.1 (Discriminator Design): The architecture description scales the discriminator to SDXL but does not include an explicit diversity regularization term or ablation on its effect. Without this, the assumption that progressive adversarial training avoids the common artifact and collapse regimes at 1024 px remains untested in the reported experiments.

minor comments (2)

[Figure 4] Figure 4 caption: Add the exact prompt templates and random seeds used for the qualitative comparisons to improve reproducibility.
[§3.3] §3.3: The weighting schedule between progressive and adversarial losses is described qualitatively; a precise equation or pseudocode for the schedule would clarify implementation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. The comments highlight important aspects of our quantitative evaluation and discriminator design that we address below. We have prepared revisions to incorporate additional analysis where feasible.

read point-by-point responses

Referee: [§5.2] §5.2 (Quantitative Results): The reported metrics focus on FID, CLIP score, and qualitative examples, but no recall, precision-recall curves, or intra-class diversity statistics are provided to verify mode coverage. This is load-bearing for the abstract claim that the method achieves a balance between quality and mode coverage without collapse when the discriminator is scaled to the full SDXL UNet.

Authors: We agree that recall and precision-recall analysis would provide more direct evidence for mode coverage claims. Standard metrics like FID and CLIP score are used in the field, and our qualitative examples at 1024px demonstrate diversity without visible collapse. To strengthen the manuscript, we will add precision-recall curves and recall statistics computed on a held-out set to Section 5.2 in the revision. revision: yes
Referee: [§4.1] §4.1 (Discriminator Design): The architecture description scales the discriminator to SDXL but does not include an explicit diversity regularization term or ablation on its effect. Without this, the assumption that progressive adversarial training avoids the common artifact and collapse regimes at 1024 px remains untested in the reported experiments.

Authors: Our theoretical analysis in the paper argues that the progressive schedule combined with adversarial distillation promotes coverage by gradually increasing the discriminator's capacity, which helps avoid collapse even at full SDXL scale. We did not add an explicit diversity regularization term to keep the objective focused. We acknowledge that an ablation isolating the progressive component's role in preventing artifacts would be valuable, and we will include such an ablation study in the revised Section 4.1. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper describes a combination of progressive and adversarial distillation applied to the SDXL model for one-step and few-step text-to-image generation. No equations, model formulations, or training procedures are presented in the provided abstract or context that reduce any claimed prediction or result to a fitted parameter defined by the target metric itself, nor do they rely on self-citations or imported uniqueness theorems in a load-bearing manner. The central claims rest on empirical application of existing distillation ideas to a new architecture and scale, with the derivation remaining self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard diffusion model assumptions plus new training choices whose details are not visible in the abstract; no explicit free parameters or invented entities are named.

pith-pipeline@v0.9.0 · 5364 in / 1075 out tokens · 21396 ms · 2026-05-17T05:04:24.605423+00:00 · methodology

discussion (0)

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Continuous-Time Distribution Matching for Few-Step Diffusion Distillation
cs.CV 2026-05 unverdicted novelty 8.0

CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.
Asymmetric Flow Models
cs.CV 2026-05 unverdicted novelty 7.0

Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...
Inverse Design for Conditional Distribution Matching
cs.LG 2026-05 unverdicted novelty 7.0

Defines Conditional Distribution Matching (CDM) as finding inputs whose induced conditional distributions match a target distribution and proposes the MLGD-F inference-time algorithm using pretrained diffusion models ...
GeoEdit: Local Frames for Fast, Training-Free On-Manifold Editing in Diffusion Models
cs.LG 2026-04 unverdicted novelty 7.0

GeoEdit constructs local tangent frames from small perturbations to initial noise, enabling Jacobian-free on-manifold edits in diffusion models via alternating tangent steps and diffusion projections.
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation
cs.CV 2026-04 conditional novelty 7.0

1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.
Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting
cs.CV 2026-03 unverdicted novelty 7.0

Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.
FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching
cs.CV 2026-05 unverdicted novelty 6.0

FlashClear delivers up to 122x faster object removal than prior diffusion models via adversarial step distillation and asymmetric attention caching while preserving visual quality.
FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching
cs.CV 2026-05 unverdicted novelty 6.0

FlashClear achieves up to 8.26x speedup over its base diffusion model and 122x over OmniPaint for image object removal via region-aware adversarial distillation and foreground-prioritized caching while claiming to mai...
Efficient Diffusion Distillation via Embedding Loss
cs.CV 2026-04 unverdicted novelty 6.0

Embedding Loss aligns feature distributions via MMD in random network embeddings to boost one-step diffusion distillation, reaching SOTA FID of 1.475 on CIFAR-10 unconditional generation.
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
cs.CV 2026-04 unverdicted novelty 6.0

By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.
BiasIG: Benchmarking Multi-dimensional Social Biases in Text-to-Image Models
cs.CY 2026-04 conditional novelty 6.0

BiasIG is a multi-dimensional benchmark for social biases in T2I models that shows debiasing interventions frequently cause confounding discrimination effects.
Continuous Adversarial Flow Models
cs.LG 2026-04 unverdicted novelty 6.0

Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...
ExpressEdit: Fast Editing of Stylized Facial Expressions with Diffusion Models in Photoshop
cs.CV 2026-04 unverdicted novelty 6.0

ExpressEdit delivers fast, artifact-free stylized facial expression editing inside Photoshop via a diffusion model plugin and an accompanying expression database.
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
cs.CV 2025-12 unverdicted novelty 6.0

WorldPlay uses dual action representation, reconstituted context memory, and context forcing distillation to produce consistent 720p streaming video at 24 FPS for interactive world modeling.
Teacher-Feature Drifting: One-Step Diffusion Distillation with Pretrained Diffusion Representations
cs.CV 2026-05 unverdicted novelty 5.0

A simplified one-step diffusion distillation uses pretrained teacher features directly for drifting loss plus a mode coverage term, achieving FID 1.58 on ImageNet-64 and 18.4 on SDXL.
Reward-Aware Trajectory Shaping for Few-step Visual Generation
cs.CV 2026-04 unverdicted novelty 5.0

RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.
TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation
cs.CV 2026-04 unverdicted novelty 5.0

TurboTalk uses progressive distillation from 4 steps to 1 step with distribution matching and adversarial training to achieve 120x faster single-step audio-driven talking avatar video generation.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 17 Pith papers · 1 internal anchor

[1]

https : / / civitai

AAM-XL Anime Mix. https : / / civitai . com / models/269232. 9

work page
[2]

Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023. 1

work page 2023
[3]

Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis

A. Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent dif- fusion models. 2023 IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR) , pages 22563–22575,

work page 2023
[4]

Coyo-700m: Image-text pair dataset

Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https : / / github . com / kakaobrain/coyo-dataset, 2022. 6

work page 2022
[5]

Pixart-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image syn- thesis

Junsong Chen, Jincheng YU, Chongjian GE, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image syn- thesis. In The Twelfth International Conference on Learning Representations, 2024. 1

work page 2024
[6]

Flashattention-2: Faster attention with better par- allelism and work partitioning

Tri Dao. Flashattention-2: Faster attention with better par- allelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024. 6

work page 2024
[7]

Flashattention: Fast and memory-efficient exact attention with IO-awareness

Tri Dao, Daniel Y Fu, Stefano Ermon, Atri Rudra, and Christopher Re. Flashattention: Fast and memory-efficient exact attention with IO-awareness. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems , 2022. 6

work page 2022
[8]

Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. Com- munications of the ACM, 63:139 – 144, 2014. 3, 4

work page 2014
[9]

Smooth diffusion: Crafting smooth latent spaces in dif- fusion models, 2023

Jiayi Guo, Xingqian Xu, Yifan Pu, Zanlin Ni, Chaofei Wang, Manushree Vasu, Shiji Song, Gao Huang, and Humphrey Shi. Smooth diffusion: Crafting smooth latent spaces in dif- fusion models, 2023. 4

work page 2023
[10]

Animatediff: Animate your personalized text- to-image diffusion models without specific tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text- to-image diffusion models without specific tuning. In The Twelfth International Conference on Learning Representa- tions, 2024. 1, 2, 3

work page 2024
[11]

Gaussian error linear units (gelus), 2023

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus), 2023. 5

work page 2023
[12]

Gans trained by a two time-scale update rule converge to a local nash equilib- rium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. In Isabelle Guyon, Ulrike von Luxburg, Samy Ben- gio, Hanna M. Wallach, Rob Fergus, S. V . N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Informa- tion Processing S...

work page 2017
[13]

Kingma, Ben Poole, Mohammad Norouzi, David J

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Sali- mans. Imagen video: High definition video generation with diffusion models, 2022. 1

work page 2022
[14]

Denoising diffu- sion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan- Tien Lin, editors, Advances in Neural Information Process- ing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtua...

work page 2020
[15]

Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 5

work page 2021
[16]

LoRA: Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InIn- ternational Conference on Learning Representations , 2022. 2, 3, 4

work page 2022
[17]

Scaling up gans for text-to-image synthesis

Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. 2023 IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR) , pages 10124–10134, 2023. 5, 6

work page 2023
[18]

MSG-GAN: multi- scale gradients for generative adversarial networks

Animesh Karnewar and Oliver Wang. MSG-GAN: multi- scale gradients for generative adversarial networks. In 2020 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, CVPR 2020, Seattle, WA, USA, June 13- 19, 2020, pages 7796–7805. IEEE, 2020. 6

work page 2020
[19]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Informa- tion Processing Systems, 2022. 2

work page 2022
[20]

Training generative ad- versarial networks with limited data

Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative ad- versarial networks with limited data. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Informa- tion Processing Systems 33: Annual Conference on Neural Information Pr...

work page 2020
[21]

Analyzing and improving the image quality of stylegan

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In 2020 IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020 , pages 8107–

work page 2020
[22]

Consistency trajectory mod- els: Learning probability flow ODE trajectory of diffusion

Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Mu- rata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory mod- els: Learning probability flow ODE trajectory of diffusion. In The Twelfth International Conference on Learning Repre- sentations, 2024. 2, 3 10

work page 2024
[23]

The lipschitz constant of self-attention

Hyunjik Kim, George Papamakarios, and Andriy Mnih. The lipschitz constant of self-attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 5562–5571. PMLR, 2021. 5

work page 2021
[24]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Represen- tations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. 6

work page 2015
[25]

Kingma and Max Welling

Diederik P. Kingma and Max Welling. Auto-encoding vari- ational bayes. In Yoshua Bengio and Yann LeCun, editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Confer- ence Track Proceedings, 2014. 2

work page 2014
[26]

Common diffusion noise schedules and sample steps are flawed

Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , pages 5404– 5411, January 2024. 5

work page 2024
[27]

Diffusion model with per- ceptual loss, 2024

Shanchuan Lin and Xiao Yang. Diffusion model with per- ceptual loss, 2024. 4

work page 2024
[28]

Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll ´ar, and C

Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll ´ar, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014. 8

work page 2014
[29]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maxim- ilian Nickel, and Matthew Le. Flow matching for genera- tive modeling. In The Eleventh International Conference on Learning Representations, 2023. 2, 4

work page 2023
[30]

Pseudo numerical methods for diffusion models on manifolds

Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. InIn- ternational Conference on Learning Representations , 2022. 2

work page 2022
[31]

Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022. 2, 3, 4

work page 2022
[32]

Instaflow: One step is enough for high-quality diffusion-based text-to-image generation

Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, and qiang liu. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. In The Twelfth In- ternational Conference on Learning Representations , 2024. 2, 3

work page 2024
[33]

Decoupled weight de- cay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight de- cay regularization. In International Conference on Learning Representations, 2019. 6

work page 2019
[34]

DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongx- uan Li, and Jun Zhu. DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. 2

work page 2022
[35]

Dpm-solver++: Fast solver for guided sam- pling of diffusion probabilistic models, 2023

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sam- pling of diffusion probabilistic models, 2023. 2

work page 2023
[36]

Latent consistency models: Synthesizing high- resolution images with few-step inference, 2023

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high- resolution images with few-step inference, 2023. 2, 3, 7, 8, 9

work page 2023
[37]

Lcm-lora: A universal stable-diffusion acceleration module, 2023

Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolin´ario Passos, Longbo Huang, Jian Li, and Hang Zhao. Lcm-lora: A universal stable-diffusion acceleration module, 2023. 2, 3, 4, 6, 8, 9

work page 2023
[38]

SDEdit: Guided image synthesis and editing with stochastic differential equa- tions

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jia- jun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equa- tions. In International Conference on Learning Representa- tions, 2022. 3, 6

work page 2022
[39]

Mescheder, Andreas Geiger, and Sebastian Nowozin

Lars M. Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually con- verge? In Jennifer G. Dy and Andreas Krause, editors, Pro- ceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsm ¨assan, Stockholm, Swe- den, July 10-15, 2018, volume 80 ofProceedings of Machine Learning Research, ...

work page 2018
[40]

Mixed precision training

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In International Conference on Learning Representations, 2018. 6

work page 2018
[41]

Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Je- gou, Julien Mairal, Patr...

work page 2024
[42]

On aliased resizing and surprising subtleties in gan evaluation

Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. 2022 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 11400–11410, 2022. 8

work page 2022
[43]

W ¨urstchen: An ef- ficient architecture for large-scale text-to-image diffusion models

Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, and Marc Aubreville. W ¨urstchen: An ef- ficient architecture for large-scale text-to-image diffusion models. In The Twelfth International Conference on Learn- ing Representations, 2024. 1

work page 2024
[44]

SDXL: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth Interna- tional Conference on Learning Representations , 2024. 1, 2, 7, 8

work page 2024
[45]

Barron, and Ben Milden- hall

Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Milden- hall. Dreamfusion: Text-to-3d using 2d diffusion. In The Eleventh International Conference on Learning Representa- tions, 2023. 3

work page 2023
[46]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen 11 Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th Interna- tional Conference on Ma...

work page 2021
[47]

Zero: Memory optimizations toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16, 2019. 6

work page 2019
[48]

Prajit Ramachandran, Barret Zoph, and Quoc V . Le. Search- ing for activation functions, 2017. 5

work page 2017
[49]

Hierarchical text-conditional image gener- ation with clip latents, 2022

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gener- ation with clip latents, 2022. 1

work page 2022
[50]

https://civitai.com/models/ 139562

RealVisXL V4.0. https://civitai.com/models/ 139562. 9

work page
[51]

Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer

Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021. 1, 2

work page 2022
[52]

U-Net: Convolutional Networks for Biomedical Image Segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. ArXiv, abs/1505.04597, 2015. 4

work page internal anchor Pith review Pith/arXiv arXiv 2015
[53]

Fleet, and Mohammad Norouzi

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Alice H. Oh, Alekh Agar- wal, Danielle Belgrave, and ...

work page 2022
[54]

Progressive distillation for fast sampling of diffusion models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Confer- ence on Learning Representations, 2022. 2, 3

work page 2022
[55]

https://civitai.com/ models/81270

Samaritan 3D Cartoon V4. https://civitai.com/ models/81270. 9

work page
[56]

Projected gans converge faster

Axel Sauer, Kashyap Chitta, Jens M ¨uller, and Andreas Geiger. Projected gans converge faster. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neu- ral Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14,...

work page 2021
[57]

Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis

Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis. In International Confer- ence on Machine Learning , 2023. 3

work page 2023
[58]

Adversarial diffusion distillation, 2023

Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation, 2023. 2, 3, 5, 7, 8, 9

work page 2023
[59]

LAION-5b: An open large-scale dataset for train- ing next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for train- ing next generation image-text...

work page 2022
[60]

https://huggingface

SDXL-ControlNet Canny. https://huggingface. co/diffusers/controlnet-canny-sdxl-1.0 . 9

work page
[61]

https://huggingface

SDXL-ControlNet Depth. https://huggingface. co/diffusers/controlnet-depth-sdxl-1.0 . 9

work page
[62]

Make-a-video: Text-to-video generation without text-video data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations, 2023. 1

work page 2023
[63]

Weiss, Niru Mah- eswaranathan, and Surya Ganguli

Jascha Sohl-Dickstein, Eric A. Weiss, Niru Mah- eswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Fran- cis R. Bach and David M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015 , volume 37 of JMLR Workshop and Conference Proceedi...

work page 2015
[64]

Denois- ing diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models. In International Conference on Learning Representations, 2021. 2

work page 2021
[65]

Improved techniques for training consistency models

Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. In The Twelfth International Conference on Learning Representations, 2024. 2, 3

work page 2024
[66]

Consistency models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, 2023. 2, 3

work page 2023
[67]

Score-based generative modeling through stochastic differential equa- tions

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions. In International Conference on Learning Represen- tations, 2021. 1, 2

work page 2021
[68]

Rethinking the in- ception architecture for computer vision

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the in- ception architecture for computer vision. In 2016 IEEE Con- ference on Computer Vision and Pattern Recognition, CVPR 2016, Las V egas, NV , USA, June 27-30, 2016 , pages 2818–

work page 2016
[69]

IEEE Computer Society, 2016. 8

work page 2016
[70]

Diffusers: State-of-the-art diffusion models

Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https : / / github . com / huggingface / diffusers, 2022. 5

work page 2022
[71]

Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distilla- tion

Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distilla- tion. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. 3

work page 2023
[72]

Group normalization

Yuxin Wu and Kaiming He. Group normalization. Inter- national Journal of Computer Vision , 128:742 – 755, 2018. 5 12

work page 2018
[73]

Tackling the generative learning trilemma with denoising diffusion GANs

Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. In International Conference on Learning Represen- tations, 2022. 3, 4

work page 2022
[74]

Ufogen: You forward once large scale text-to-image genera- tion via diffusion gans, 2023

Yanwu Xu, Yang Zhao, Zhisheng Xiao, and Tingbo Hou. Ufogen: You forward once large scale text-to-image genera- tion via diffusion gans, 2023. 2, 3

work page 2023
[75]

Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models, 2023

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models, 2023. 2, 3

work page 2023
[76]

Freeman, and Taesung Park

Tianwei Yin, Micha ¨el Gharbi, Richard Zhang, Eli Shecht- man, Fredo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation,

work page
[77]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3813–3824, 2023. 2, 3, 9

work page 2023
[78]

Efros, Eli Shecht- man, and Oliver Wang

Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shecht- man, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In 2018 IEEE Con- ference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 586–

work page 2018
[79]

IEEE Computer Society, 2018. 4

work page 2018
[80]

Unipc: A unified predictor-corrector framework for fast sampling of diffusion models

Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. Unipc: A unified predictor-corrector framework for fast sampling of diffusion models. NeurIPS, 2023. 2

work page 2023

Showing first 80 references.