arxiv: 2211.01324 · v5 · submitted 2022-11-02 · 💻 cs.CV · cs.LG

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, Ming-Yu Liu

Pith reviewed 2026-05-15 01:40 UTC · model grok-4.3

keywords text-to-image generation · diffusion models · ensemble of experts · denoisers · prompt alignment · image synthesis · conditional generation

The pith

An ensemble of stage-specialized diffusion models improves text alignment in image synthesis at the same inference cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that text-to-image diffusion models change their reliance on conditioning during sampling: early iterations depend strongly on the text prompt to build aligned content, while later iterations largely disregard it. A single model with shared parameters across all steps is therefore suboptimal. The authors address this by first training one base model and then splitting it into an ensemble of expert denoisers, each fine-tuned for a narrow range of sampling stages. The resulting system, eDiff-I, delivers stronger prompt adherence, keeps visual quality high, and runs at the original computational budget. Additional conditioning options, including CLIP image embeddings for style transfer and a paint-with-words interface, are shown to work naturally within the same framework.

Core claim

We propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark.

What carries the argument

An ensemble of expert denoisers, each fine-tuned on a narrow window of the iterative sampling trajectory after an initial shared pre-training stage.
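As a rough illustration of why stage specialization need not raise inference cost, the sketch below routes each sampling step to exactly one expert, so the number of denoiser calls equals the number of steps. The `Expert` class, the two-way split, the step indexing (a higher index means more noise) and the toy scaling denoisers are all assumptions made for illustration; none of them come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: a denoiser maps (sample, step index) to a new sample.
Denoiser = Callable[[List[float], int], List[float]]


@dataclass(frozen=True)
class Expert:
    """One specialist covering sampling steps first_step..last_step (inclusive)."""
    name: str
    first_step: int
    last_step: int
    denoise: Denoiser


def pick_expert(experts: Sequence[Expert], step: int) -> Expert:
    """Return the single expert whose step window contains `step`."""
    for expert in experts:
        if expert.first_step <= step <= expert.last_step:
            return expert
    raise ValueError(f"no expert covers step {step}")


def sample(experts: Sequence[Expert], x: List[float], num_steps: int) -> Tuple[List[float], int]:
    """Run the sampling loop from the noisiest step down to step 0.

    Returns the final sample and the number of denoiser calls made.
    """
    calls = 0
    for step in range(num_steps - 1, -1, -1):
        x = pick_expert(experts, step).denoise(x, step)
        calls += 1  # exactly one expert runs per step
    return x, calls


def make_toy_experts(num_steps: int, split: int) -> List[Expert]:
    """Two stand-in experts; real experts would be neural denoisers."""
    high = Expert("high-noise", split, num_steps - 1, lambda x, s: [0.5 * v for v in x])
    low = Expert("low-noise", 0, split - 1, lambda x, s: [0.9 * v for v in x])
    return [high, low]
```

Calling `sample(make_toy_experts(10, 6), [1.0], 10)` makes ten denoiser calls, the same as a single model run for ten steps. The cost the ensemble does add is memory and storage, since every expert's weights must be available during sampling.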

If this is right

Better text-to-image alignment on standard benchmarks than prior large-scale diffusion models.
No increase in inference compute or sampling steps relative to a single model.
Retention of high visual fidelity while adding controllable behaviors via multiple conditioning embeddings.
Support for intuitive style transfer from reference images using CLIP image embeddings.
User-level control via a paint-with-words mechanism that lets selected prompt words directly influence output regions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same staged-specialization pattern could be tested on other iterative generative tasks such as video or 3D synthesis where conditioning importance may also vary across steps.
Focusing capacity on the phases where conditioning actually matters might reduce the parameter count needed for high performance compared with scaling a monolithic model.
The paint-with-words interface suggests a path toward more interactive, region-specific editing tools that operate inside the diffusion loop rather than post hoc.
Because the ensemble is created by splitting a shared base, the method may offer a practical route for adapting large pre-trained diffusion models to new domains without retraining from scratch.

Load-bearing premise

The synthesis process changes qualitatively so that text conditioning drives early steps but is largely ignored later, rendering a single shared-parameter model suboptimal.

What would settle it

If a single diffusion model trained with the same total compute budget produces equal or higher text-alignment scores on the standard benchmark, the premise that stage specialization is required would be refuted.

Original abstract

Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiff-I's "paint-with-words" capability. A user can select the word in the input text and paint it in a canvas to control the output, which is very handy for crafting the desired image in mind. The project page is available at https://deepimagination.cc/eDiff-I/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

eDiff-I splits diffusion into stage-specific experts after base training and adds multi-embedding conditioning, but gains may trace to extra fine-tuning steps rather than the ensemble alone.

The main takeaway is that this paper trains one base text-to-image diffusion model, then splits its parameters into specialists for early and late sampling stages and continues training each on its assigned timestep range. They report better text alignment at unchanged inference cost, plus they combine T5 text, CLIP text, and CLIP image embeddings and introduce a paint-with-words control where users paint selected words on a canvas. The qualitative observation that text conditioning matters early but fades later is useful and matches what many practitioners see in sampling trajectories. The multi-embedding experiments are straightforward and show distinct behaviors, with the CLIP image path enabling simple style transfer from a reference. The paint-with-words method is a practical user-facing addition that does not require new architecture.

The soft spot is the comparison to baselines. The specialists receive continued gradient steps on restricted intervals, so they accumulate more total optimization than the single shared-parameter models they are measured against. Without an ablation that holds total training compute or steps fixed, it is difficult to isolate the benefit of removing parameter sharing from the benefit of extra targeted training. The abstract claims outperformance on a standard benchmark but supplies no numbers or error bars, which leaves the magnitude of the improvement unclear.

This paper is for researchers working on large-scale diffusion models who are already thinking about adaptive or modular denoisers. A reader focused on conditioning choices or controllable generation would pick up concrete ideas from the embedding and paint-with-words sections. It deserves a serious referee because the core procedure is simple to reproduce and the qualitative findings are grounded, even if the quantitative claims need tighter controls on training budget. I would send it to review with a request for compute-matched baselines and full tables.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes eDiff-I, an ensemble of expert denoisers for text-to-image diffusion models. A single base model is first trained across all timesteps and then split into stage-specific specialists that receive continued training on restricted timestep intervals. The central claim is that this yields improved text alignment at unchanged inference cost and visual quality, outperforming prior large-scale models on a standard benchmark; additional results cover conditioning with T5/CLIP embeddings and a paint-with-words interface.

Significance. If the performance gains are shown to arise from the ensemble structure rather than extra optimization steps, the work would demonstrate that parameter sharing across the full diffusion trajectory is suboptimal and that stage-specialized experts can improve conditioning adherence without raising inference cost. This would be a useful empirical finding for diffusion-based generative modeling.

major comments (1)

[Training procedure] (described in abstract and §3) Each specialist receives additional gradient steps on its assigned timestep interval after the base model is split. The manuscript compares against single-model baselines that appear to have received fewer total optimization steps. This leaves open the possibility that measured gains in text alignment are driven by extra training compute rather than removal of parameter sharing, directly undermining the claim that a shared-parameter model is suboptimal.

minor comments (1)

[Abstract] The claim of outperformance on 'the standard benchmark' comes with no quantitative metrics, error bars, ablation tables, or benchmark name, which prevents verification of the result.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for identifying the key question of whether gains arise from specialization or extra optimization steps. We address this concern directly below and will revise the manuscript accordingly.

Referee: Training procedure (described in abstract and §3): each specialist receives additional gradient steps on its assigned timestep interval after the base model is split. The manuscript compares against single-model baselines that appear to have received fewer total optimization steps. This leaves open the possibility that measured gains in text alignment are driven by extra training compute rather than removal of parameter sharing, directly undermining the claim that a shared-parameter model is suboptimal.

Authors: We agree that the current experimental setup does not fully isolate the effect of parameter sharing from total training compute. After the initial base model is trained, each specialist receives continued gradient steps on its restricted timestep interval, resulting in higher aggregate optimization steps for the ensemble than for the reported single-model baselines. To address this, we will add a controlled ablation in the revised manuscript: a single shared-parameter model trained for a total number of gradient steps matching the sum of steps used across all specialists. We will report text-alignment metrics (e.g., CLIP score) and visual quality for this equal-compute baseline alongside eDiff-I. If the ensemble still outperforms, this will strengthen the claim that stage-specific specialization is beneficial beyond extra training. We will also explicitly document the step counts for the base model and each specialist in §3 and the appendix.

Revision: yes
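The step accounting behind an equal-compute baseline can be made concrete with a small sketch. It assumes the matched single model should receive the base training steps plus every specialist's continued steps; the rebuttal's wording leaves open whether base steps are counted, so this reading is an assumption, and the function names and any step counts passed in are hypothetical rather than figures from the paper.

```python
from typing import Sequence


def ensemble_total_steps(base_steps: int, specialist_steps: Sequence[int]) -> int:
    """Aggregate gradient steps: shared base training plus each specialist's continued training."""
    if base_steps < 0 or any(s < 0 for s in specialist_steps):
        raise ValueError("step counts must be non-negative")
    return base_steps + sum(specialist_steps)


def extra_steps_for_matched_baseline(base_steps: int, specialist_steps: Sequence[int]) -> int:
    """Additional steps a single shared model needs beyond base training to match the ensemble."""
    return ensemble_total_steps(base_steps, specialist_steps) - base_steps
```

Counting gradient steps treats every step as equally expensive. If the specialists and the baseline used different batch sizes or resolutions, a fairer match would compare samples seen or FLOPs instead.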

Circularity Check

0 steps flagged

No circularity: purely empirical training procedure with no derivations

The paper describes an empirical procedure: train a base diffusion model, split parameters into stage-specific experts, and continue training each on its timestep interval. No equations, predictions, or first-principles derivations are presented that could reduce to fitted inputs by construction. The central claim (ensemble improves text alignment) rests on benchmark comparisons rather than any self-definitional or self-citation load-bearing step. External benchmarks and qualitative observations are independent of the training split itself, satisfying the criteria for a self-contained empirical result.
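The procedure summarized above (train a base model, split it, continue training each expert on its own timestep interval) can be sketched at the level of timestep bookkeeping. This is a minimal sketch under stated assumptions: the even split point, the half-open integer intervals and the uniform draw are illustrative choices, not details given on this page, and the function names are hypothetical.

```python
import random
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [lo, hi) range of timestep indices


def split_interval(interval: Interval) -> Tuple[Interval, Interval]:
    """Split one timestep interval into two child intervals at its midpoint (assumed split rule)."""
    lo, hi = interval
    if hi - lo < 2:
        raise ValueError("interval too small to split")
    mid = (lo + hi) // 2
    return (lo, mid), (mid, hi)


def draw_training_steps(interval: Interval, batch_size: int, rng: random.Random) -> List[int]:
    """Draw timesteps for one fine-tuning batch, restricted to the expert's interval."""
    lo, hi = interval
    return [rng.randrange(lo, hi) for _ in range(batch_size)]
```

Each expert would start from a copy of the base weights and see only timesteps drawn from its own interval, which is what lets it specialize while the base training cost is shared.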

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the empirical observation of stage-dependent text reliance.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.EightTick eight_tick_period · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Consistency Models
cs.LG · 2023-03 · conditional · novelty 8.0

Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

A Flow Matching Algorithm for Many-Shot Adaptation to Unseen Distributions
cs.LG · 2026-05 · unverdicted · novelty 7.0

FP-FM adapts flow matching models to unseen distributions via least-squares projection onto basis functions spanning training velocity fields, yielding improved precision and recall without inference-time training.

Benchmarking Layout-Guided Diffusion Models through Unified Semantic-Spatial Evaluation in Closed and Open Settings
cs.CV · 2026-04 · conditional · novelty 7.0

Introduces closed-set C-Bench and open-set O-Bench for layout-guided diffusion models, a unified semantic-spatial scoring protocol, and ranks six models after generating and evaluating 319,086 images.

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
cs.CV · 2024-03 · unverdicted · novelty 7.0

ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
cs.CV · 2023-07 · unverdicted · novelty 7.0

A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.

Leveraging Verifier-Based Reinforcement Learning in Image Editing
cs.CV · 2026-04 · unverdicted · novelty 6.0

Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

Temporally Extended Mixture-of-Experts Models
cs.LG · 2026-04 · unverdicted · novelty 6.0

Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.

OFA-Diffusion Compression: Compressing Diffusion Model in One-Shot Manner
cs.CV · 2026-04 · conditional · novelty 6.0

OFA-Diffusion Compression trains diffusion models once to yield multiple size-specific compressed subnetworks via restricted candidate spaces, importance-based channel allocation, and reweighting.

PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer
cs.CV · 2026-04 · unverdicted · novelty 6.0

PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
cs.CV · 2024-10 · unverdicted · novelty 6.0

Sana-0.6B produces high-resolution images with strong text alignment at 20x smaller size and 100x higher throughput than Flux-12B by combining 32x image compression, linear DiT blocks, and a decoder-only LLM text encoder.

CameraCtrl: Enabling Camera Control for Text-to-Video Generation
cs.CV · 2024-04 · unverdicted · novelty 6.0

CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
cs.CV · 2023-10 · unverdicted · novelty 6.0

Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.

MVDream: Multi-view Diffusion for 3D Generation
cs.CV · 2023-08 · conditional · novelty 6.0

MVDream is a multi-view diffusion model that functions as a generalizable 3D prior, enabling more consistent text-to-3D generation and few-shot 3D concept learning from 2D examples.

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
cs.CV · 2023-08 · unverdicted · novelty 6.0

IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
cs.CV · 2023-07 · conditional · novelty 6.0

SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...

Embody4D: A Generalist 4D World Model for Embodied AI
cs.CV · 2026-05 · unverdicted · novelty 5.0

Embody4D generates high-fidelity, view-consistent novel views from monocular videos for embodied scenarios via 3D-aware data synthesis, adaptive noise injection, and interaction-aware attention.

DiffMagicFace: Identity Consistent Facial Editing of Real Videos
cs.CV · 2026-04 · unverdicted · novelty 5.0

DiffMagicFace uses concurrent fine-tuned text and image diffusion models plus a rendered multi-view dataset to achieve identity-consistent text-conditioned editing of real facial videos.

ADP-DiT: Text-Guided Diffusion Transformer for Brain Image Generation in Alzheimer's Disease Progression
cs.CV · 2026-04 · unverdicted · novelty 5.0

ADP-DiT is a text-conditioned diffusion transformer for synthesizing longitudinal Alzheimer's MRI scans, reporting SSIM 0.8739 and PSNR 29.32 dB with improvements over a DiT baseline.

3D Smoke Scene Reconstruction Guided by Vision Priors from Multimodal Large Language Models
cs.CV · 2026-04 · unverdicted · novelty 5.0

A framework that combines MLLM-based image enhancement with a medium-aware 3D Gaussian Splatting model to reconstruct and render smoke scenes.

Movie Gen: A Cast of Media Foundation Models
cs.CV · 2024-10 · unverdicted · novelty 5.0

A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

Cosmos World Foundation Model Platform for Physical AI
cs.CV · 2025-01 · unverdicted · novelty 3.0

The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · cited by 21 Pith papers · 11 internal anchors

[1]

Efﬁcient large scale language modeling with mixtures of experts

Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mi- haylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, et al. Efﬁcient large scale language modeling with mixtures of experts. arXiv preprint arXiv:2112.10684, 2021. 5

work page arXiv 2021
[2]

Blended latent diffusion

Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. arXiv preprint arXiv:2206.02779, 2022. 4

work page arXiv 2022
[3]

Blended diffusion for text-driven editing of natural images

Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proc. CVPR, 2022. 4, 14

work page 2022
[4]

Estimating the optimal covariance with imperfect mean in diffusion probabilistic models

Fan Bao, Chongxuan Li, Jiacheng Sun, Jun Zhu, and Bo Zhang. Estimating the optimal covariance with imperfect mean in diffusion probabilistic models. In Proc. ICML, 2022. 4

work page 2022
[5]

Analytic- DPM: An analytic estimate of the optimal reverse variance in diffusion probabilistic models

Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic- DPM: An analytic estimate of the optimal reverse variance in diffusion probabilistic models. In Proc. ICLR, 2022. 4

work page 2022
[6]

Paint by word

David Bau, Alex Andonian, Audrey Cui, YeonHwan Park, Ali Jahanian, Aude Oliva, and Antonio Torralba. Paint by word. arXiv preprint arXiv:2103.10951, 2021. 14

work page arXiv 2021
[7]

Semi-Parametric Neural Image Synthesis

Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas M¨uller, and Bj¨orn Ommer. Semi-Parametric Neural Image Synthesis. In Proc. NeurIPS, 2022. 4

work page 2022
[8]

Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Proc. NeurIPS, 2020. 5

work page 2020
[9]

Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W. Cohen. Re-Imagen: Retrieval-Augmented Text-to-Image Generator. arXiv preprint arXiv:2209.14491, 2022. 4

work page arXiv 2022
[10]

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways.arXiv preprint arXiv:2204.02311, 2022. 5

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

Improving diffusion models for inverse problems using manifold constraints

Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving diffusion models for inverse problems using manifold constraints. In Proc. NeurIPS, 2022. 4

work page 2022
[12]

DiffEdit: Diffusion-based seman- tic image editing with mask guidance

Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. DiffEdit: Diffusion-based seman- tic image editing with mask guidance. arXiv preprint arXiv:2210.11427, 2022. 4

work page arXiv 2022
[13]

Diffusion models beat GANs on image synthesis

Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. In Proc. NeurIPS,

work page
[14]

Differentially private diffusion models

Tim Dockhorn, Tianshi Cao, Arash Vahdat, and Karsten Kreis. Differentially private diffusion models. arXiv:2210.09929,

work page arXiv
[15]

GENIE: Higher-order denoising diffusion solvers

Tim Dockhorn, Arash Vahdat, and Karsten Kreis. GENIE: Higher-order denoising diffusion solvers. In Proc. NeurIPS,

work page
[16]

Score- based generative modeling with critically-damped Langevin diffusion

Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Score- based generative modeling with critically-damped Langevin diffusion. In Proc. ICLR, 2022. 4

work page 2022
[17]

Make-a-scene: Scene-based text-to-image generation with human priors

Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. arXiv preprint arXiv:2203.13131, 2022. 9, 10

work page arXiv 2022
[18]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image genera- tion using textual inversion.arXiv preprint arXiv:2208.01618,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Vector quan- tized diffusion model for text-to-image synthesis

Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quan- tized diffusion model for text-to-image synthesis. In Proc. CVPR, 2022. 4

work page 2022
[20]

Flexible diffusion modeling of long videos

William Harvey, Saeid Naderiparizi, Vaden Masrani, Chris- tian Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. arXiv preprint arXiv:2205.11495, 2022. 4

work page arXiv 2022
[21]

Scaling Laws for Autoregressive Generative Modeling

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christo- pher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Pra- fulla Dhariwal, Scott Gray, et al. Scaling laws for autoregres- sive generative modeling. arXiv preprint arXiv:2010.14701,

work page internal anchor Pith review Pith/arXiv arXiv 2010
[22]

Prompt-to-Prompt Image Editing with Cross Attention Control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kﬁr Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt im- age editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022. 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

Deep Learning Scaling is Predictable, Empirically

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bern- hard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. 9

work page 2017
[25]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Ima- gen Video: High deﬁnition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022. 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

Denoising diffu- sion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. In Proc. NeurIPS, 2020. 2, 4

work page 2020
[27]

Fleet, Mohammad Norouzi, and Tim Salimans

Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high ﬁdelity image generation. JMLR, 23(47):1– 33, 2022. 4, 7

work page 2022
[28]

Classiﬁer-free diffusion guidance

Jonathan Ho and Tim Salimans. Classiﬁer-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 4, 7

work page 2021
[29]

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2022. 4

work page 2022
[30]

Multimodal conditional image synthesis with product- of-experts GANs

Xun Huang, Arun Mallya, Ting-Chun Wang, and Ming-Yu Liu. Multimodal conditional image synthesis with product- of-experts GANs. In Proc. ECCV, 2022. 14

work page 2022
[31]

Estimation of non-normalized statistical models by score matching

Aapo Hyv¨arinen. Estimation of non-normalized statistical models by score matching. JMLR, 6(24):695–709, 2005. 4, 5 20

work page 2005
[32]

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial net- works. In Proc. CVPR, 2017. 14

work page 2017
[33]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

work page internal anchor Pith review Pith/arXiv arXiv 2001
[34]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Proc. NeurIPS, 2022. 4, 5

work page 2022
[35]

Denoising diffusion restoration models

Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. In Proc. NeurIPS, 2022. 4

work page 2022
[36]

JPEG artifact correction using denoising diffusion restoration models

Bahjat Kawar, Jiaming Song, Stefano Ermon, and Michael Elad. JPEG artifact correction using denoising diffusion restoration models. In NeurIPS 2022 Workshop on Score- Based Methods, 2022. 4

work page 2022
[37]

Imagic: Text-based real image editing with diffusion models

Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276, 2022. 4

work page arXiv 2022
[38]

Scaling laws for deep learning based image reconstruction

Tobit Klug and Reinhard Heckel. Scaling laws for deep learning based image reconstruction. arXiv preprint arXiv:2209.13435, 2022. 2

work page arXiv 2022
[39]

DiffWave: A versatile diffusion model for audio synthesis

Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. In Proc. ICLR, 2021. 4

work page 2021
[40]

Shamma, Michael Bernstein, and Li Fei-Fei

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- ditis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2017. 9

work page 2017
[41]

Hashimoto

Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B. Hashimoto. Diffusion-LM improves control- lable text generation. arXiv preprint arXiv:2205.14217, 2022. 4

work page arXiv 2022
[42]

Microsoft COCO: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proc. ECCV, 2014. 9

work page 2014
[43]

Pseudo numerical methods for diffusion models on manifolds

Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. In Proc. ICLR, 2022. 4

work page 2022
[44]

Imaginaire

Ming-Yu Liu, Ting-Chun Wang, Xun Huang, and Arun Mallya. Imaginaire. https://github.com/NVlabs/imaginaire, 2020. 8

work page 2020
[45]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proc. ICLR, 2019. 8

work page 2019
[46]

DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Proc. NeurIPS, 2022. 4

work page 2022
[47]

RePaint: Inpainting using denoising diffusion probabilistic models

Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In Proc. CVPR,

work page
[48]

Improving diffusion model efficiency through patching

Troy Luhman and Eric Luhman. Improving diffusion model efficiency through patching. arXiv preprint arXiv:2207.04316,

work page arXiv
[49]

Diffusion probabilistic models for 3D point cloud generation

Shitong Luo and Wei Hu. Diffusion probabilistic models for 3D point cloud generation. In Proc. CVPR, 2021. 4

work page 2021
[50]

SDEdit: Guided image synthesis and editing with stochastic differential equations

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In Proc. ICLR, 2022. 4

work page 2022
[51]

GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Proc. ICML, 2022. 3, 4, 9, 10, 14

work page 2022
[52]

Diffusion models for adversarial purification

Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Anima Anandkumar. Diffusion models for adversarial purification. In Proc. ICML, 2022. 4

work page 2022
[53]

Semantic image synthesis with spatially-adaptive normalization

Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proc. CVPR, 2019. 14

work page 2019
[54]

PyTorch: An imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Proc. NeurIPS, 2019. 8

work page 2019
[55]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proc. ICML,

work page
[56]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21(140):1–67, 2020. 3, 5, 7

work page 2020
[57]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125,

work page internal anchor Pith review Pith/arXiv arXiv
[59]

Scaling vision with sparse mixture of experts

Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. In Proc. NeurIPS, 2021. 5

work page 2021
[60]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proc. CVPR, 2022. 2, 3, 4, 7, 9, 10

work page 2022
[61]

Stable diffusion v1-4

Robin Rombach and Patrick Esser. Stable diffusion v1-4. https://huggingface.co/CompVis/stable-diffusion-v1-4, July 2022. 3, 4

work page 2022
[62]

DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022. 4

work page arXiv 2022
[63]

Palette: Image-to-image diffusion models

Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In Proc. SIGGRAPH, 2022. 4

work page 2022
[64]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022. 2,...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[65]

Image super-resolution via iterative refinement

Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Trans. Pattern Analysis and Machine Intelligence, 2022. 4

work page 2022
[66]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In Proc. ICLR, 2017. 5

work page 2017
[67]

KNN-Diffusion: Image Generation via Large-Scale Retrieval

Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman. KNN-Diffusion: Image Generation via Large-Scale Retrieval. arXiv preprint arXiv:2204.02849, 2022. 4

work page arXiv 2022
[68]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022. 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[69]

D2C: Diffusion-decoding models for few-shot conditional generation

Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon. D2C: Diffusion-decoding models for few-shot conditional generation. In Proc. NeurIPS, 2021. 4

work page 2021
[70]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proc. ICML, 2015. 2, 4

work page 2015
[71]

Denoising diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In Proc. ICLR, 2021. 4, 5

work page 2021
[72]

Generative modeling by estimating gradients of the data distribution

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Proc. NeurIPS,

work page
[73]

Improved techniques for training score-based generative models

Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. In Proc. NeurIPS,

work page
[74]

Score-based generative modeling through stochastic differential equations

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In Proc. ICLR, 2021. 2, 4, 5

work page 2021
[75]

Dropout: a simple way to prevent neural networks from overfitting

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014. 7

work page 2014
[76]

The bitter lesson

Rich Sutton. The bitter lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson.html, March 2019. 2

work page 2019
[77]

CSDI: Conditional score-based diffusion models for probabilistic time series imputation

Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. CSDI: Conditional score-based diffusion models for probabilistic time series imputation. In Proc. NeurIPS, 2021. 4

work page 2021
[78]

Score-based generative modeling in latent space

Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. In Proc. NeurIPS, 2021. 4

work page 2021
[79]

UniTune: Text-driven image editing by fine tuning an image generation model on a single image

Dani Valevski, Matan Kalman, Yossi Matias, and Yaniv Leviathan. UniTune: Text-driven image editing by fine tuning an image generation model on a single image. arXiv preprint arXiv:2210.09477, 2022. 4

work page arXiv 2022
[80]

A connection between score matching and denoising autoencoders

Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011. 4, 5

work page 2011

Showing first 80 references.