eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
Pith reviewed 2026-05-15 01:40 UTC · model grok-4.3
The pith
An ensemble of stage-specialized diffusion models improves text alignment in image synthesis at the same inference cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark.
What carries the argument
An ensemble of expert denoisers, each fine-tuned on a narrow window of the iterative sampling trajectory after an initial shared pre-training stage.
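A minimal sketch of how such an ensemble could be dispatched at sampling time, assuming the trajectory is indexed by a normalized time t in [0, 1] (t = 1 is pure noise) and each expert owns one contiguous window. The names `Expert`, `pick_expert`, `sample`, and the toy scalar denoisers are hypothetical and not taken from the paper. The sketch only illustrates that each step evaluates exactly one denoiser, so the number of network calls equals that of a single model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Expert:
    t_lo: float  # inclusive lower bound of this expert's time window
    t_hi: float  # exclusive upper bound (the final window also accepts t == 1.0)
    denoise: Callable[[float, float], float]  # toy stand-in: (x, t) -> x


def pick_expert(experts: List[Expert], t: float) -> Expert:
    """Return the single expert whose window contains t."""
    for e in experts:
        is_last_window = e.t_hi >= 1.0
        if e.t_lo <= t < e.t_hi or (is_last_window and t == e.t_hi):
            return e
    raise ValueError(f"no expert covers t={t}")


def sample(experts: List[Expert], x: float, num_steps: int) -> Tuple[float, int]:
    """Run a toy sampling loop from t=1 toward t=0, counting denoiser calls."""
    calls = 0
    for i in range(num_steps):
        t = 1.0 - i / num_steps
        x = pick_expert(experts, t).denoise(x, t)  # exactly one call per step
        calls += 1
    return x, calls


# Hypothetical two-expert split: a low-noise and a high-noise specialist.
experts = [
    Expert(0.0, 0.5, lambda x, t: x * 0.9),  # late steps (low noise)
    Expert(0.5, 1.0, lambda x, t: x * 0.5),  # early steps (high noise)
]
x_final, calls = sample(experts, 1.0, 4)
```

With four steps the loop makes four denoiser calls in total, regardless of how many experts exist. Only the choice of which parameters to use changes from step to step.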
If this is right
- Better text-to-image alignment on standard benchmarks than prior large-scale diffusion models.
- No increase in inference compute or sampling steps relative to a single model.
- Retention of high visual fidelity while adding controllable behaviors via multiple conditioning embeddings.
- Support for intuitive style transfer from reference images using CLIP image embeddings.
- User-level control via a paint-with-words mechanism that lets selected prompt words directly influence output regions.
Where Pith is reading between the lines
- The same staged-specialization pattern could be tested on other iterative generative tasks such as video or 3D synthesis where conditioning importance may also vary across steps.
- Focusing capacity on the phases where conditioning actually matters might reduce the parameter count needed for high performance compared with scaling a monolithic model.
- The paint-with-words interface suggests a path toward more interactive, region-specific editing tools that operate inside the diffusion loop rather than post hoc.
- Because the ensemble is created by splitting a shared base, the method may offer a practical route for adapting large pre-trained diffusion models to new domains without retraining from scratch.
Load-bearing premise
The synthesis process changes qualitatively so that text conditioning drives early steps but is largely ignored later, rendering a single shared-parameter model suboptimal.
What would settle it
If a single diffusion model trained with the same total compute budget produces equal or higher text-alignment scores on the standard benchmark, the premise that stage specialization is required would be refuted.
Original abstract
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiff-I's "paint-with-words" capability. A user can select the word in the input text and paint it in a canvas to control the output, which is very handy for crafting the desired image in mind. The project page is available at https://deepimagination.cc/eDiff-I/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes eDiff-I, an ensemble of expert denoisers for text-to-image diffusion models. A single base model is first trained across all timesteps and then split into stage-specific specialists that receive continued training on restricted timestep intervals. The central claim is that this yields improved text alignment at unchanged inference cost and visual quality, outperforming prior large-scale models on a standard benchmark; additional results cover conditioning with T5/CLIP embeddings and a paint-with-words interface.
Significance. If the performance gains are shown to arise from the ensemble structure rather than extra optimization steps, the work would demonstrate that parameter sharing across the full diffusion trajectory is suboptimal and that stage-specialized experts can improve conditioning adherence without raising inference cost. This would be a useful empirical finding for diffusion-based generative modeling.
major comments (1)
- [Training procedure] Training procedure (described in abstract and §3): each specialist receives additional gradient steps on its assigned timestep interval after the base model is split. The manuscript compares against single-model baselines that appear to have received fewer total optimization steps. This leaves open the possibility that measured gains in text alignment are driven by extra training compute rather than removal of parameter sharing, directly undermining the claim that a shared-parameter model is suboptimal.
minor comments (1)
- [Abstract] Abstract: the claim of outperformance on 'the standard benchmark' supplies no quantitative metrics, error bars, ablation tables, or benchmark name, preventing verification of the result.
Simulated Author's Rebuttal
We thank the referee for the detailed review and for identifying the key question of whether gains arise from specialization or extra optimization steps. We address this concern directly below and will revise the manuscript accordingly.
Point-by-point responses
Referee: Training procedure (described in abstract and §3): each specialist receives additional gradient steps on its assigned timestep interval after the base model is split. The manuscript compares against single-model baselines that appear to have received fewer total optimization steps. This leaves open the possibility that measured gains in text alignment are driven by extra training compute rather than removal of parameter sharing, directly undermining the claim that a shared-parameter model is suboptimal.
Authors: We agree that the current experimental setup does not fully isolate the effect of parameter sharing from total training compute. After the initial base model is trained, each specialist receives continued gradient steps on its restricted timestep interval, resulting in higher aggregate optimization steps for the ensemble than for the reported single-model baselines. To address this, we will add a controlled ablation in the revised manuscript: a single shared-parameter model trained for a total number of gradient steps matching the sum of steps used across all specialists. We will report text-alignment metrics (e.g., CLIP score) and visual quality for this equal-compute baseline alongside eDiff-I. If the ensemble still outperforms, this will strengthen the claim that stage-specific specialization is beneficial beyond extra training. We will also explicitly document the step counts for the base model and each specialist in §3 and the appendix.
Revision: yes
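The step-matching arithmetic behind the proposed ablation can be sketched as follows. The function name `matched_baseline_steps` and all step counts are hypothetical, chosen only for illustration. The rebuttal does not say whether the base-model steps count toward the matched budget, so the sketch exposes that choice as a flag. It also assumes every step uses the same batch size and model size, which is what makes gradient steps a fair proxy for compute.

```python
from typing import Sequence


def matched_baseline_steps(
    base_steps: int,
    specialist_steps: Sequence[int],
    include_base: bool = True,
) -> int:
    """Gradient-step budget for a single shared-parameter baseline.

    include_base=True counts the shared pre-training steps plus every
    specialist's continued-training steps; include_base=False counts only
    the specialists' steps.
    """
    total = sum(specialist_steps)
    return total + base_steps if include_base else total


# Hypothetical numbers, not reported in the paper.
base_steps = 500_000
specialist_steps = [100_000, 100_000, 100_000]

with_base = matched_baseline_steps(base_steps, specialist_steps)
specialists_only = matched_baseline_steps(
    base_steps, specialist_steps, include_base=False
)
print(with_base, specialists_only)  # prints "800000 300000"
```

Under these made-up counts, a baseline that simply continues the base checkpoint would need 300,000 further steps to match the ensemble, for 800,000 steps in total.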
Circularity Check
No circularity: purely empirical training procedure with no derivations
Full rationale
The paper describes an empirical procedure: train a base diffusion model, split parameters into stage-specific experts, and continue training each on its timestep interval. No equations, predictions, or first-principles derivations are presented that could reduce to fitted inputs by construction. The central claim (ensemble improves text alignment) rests on benchmark comparisons rather than any self-definitional or self-citation load-bearing step. External benchmarks and qualitative observations are independent of the training split itself, satisfying the criteria for a self-contained empirical result.
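The split-then-specialize procedure described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: parameters are a plain dictionary, the helper names `split_into_experts` and `sample_training_time` are hypothetical, time is normalized to [0, 1], and training times are drawn uniformly within each expert's interval (the actual noise-level distribution is not given in this summary).

```python
import copy
import random
from typing import Dict, List, Tuple

Interval = Tuple[float, float]


def split_into_experts(base_params: Dict, intervals: List[Interval]) -> List[Dict]:
    """Give every interval its own independent copy of the base parameters."""
    return [
        {"interval": iv, "params": copy.deepcopy(base_params)} for iv in intervals
    ]


def sample_training_time(interval: Interval, rng: random.Random) -> float:
    """Draw a training time restricted to one expert's interval (uniform here)."""
    lo, hi = interval
    return rng.uniform(lo, hi)


# Stand-in for a base model trained on the full trajectory.
base_params = {"w": [0.0, 1.0]}
intervals = [(0.0, 0.5), (0.5, 1.0)]
experts = split_into_experts(base_params, intervals)

# Simulate an update to the first expert; copies must not share storage.
experts[0]["params"]["w"][0] = 42.0

rng = random.Random(0)
draws = [sample_training_time(experts[1]["interval"], rng) for _ in range(1000)]
```

Deep-copying the parameters is the design choice that matters here: after the split, an update to one expert leaves the base checkpoint and the other experts unchanged, so each copy can drift toward its own interval.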
Lean theorems connected to this paper
- Foundation.EightTick · eight_tick_period (tag: echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
- Consistency Models
  Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
- A Flow Matching Algorithm for Many-Shot Adaptation to Unseen Distributions
  FP-FM adapts flow matching models to unseen distributions via least-squares projection onto basis functions spanning training velocity fields, yielding improved precision and recall without inference-time training.
- Benchmarking Layout-Guided Diffusion Models through Unified Semantic-Spatial Evaluation in Closed and Open Settings
  Introduces closed-set C-Bench and open-set O-Bench for layout-guided diffusion models, a unified semantic-spatial scoring protocol, and ranks six models after generating and evaluating 319,086 images.
- ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
  ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.
- AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
  A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.
- Leveraging Verifier-Based Reinforcement Learning in Image Editing
  Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.
- Temporally Extended Mixture-of-Experts Models
  Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.
- OFA-Diffusion Compression: Compressing Diffusion Model in One-Shot Manner
  OFA-Diffusion Compression trains diffusion models once to yield multiple size-specific compressed subnetworks via restricted candidate spaces, importance-based channel allocation, and reweighting.
- PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer
  PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.
- SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
  Sana-0.6B produces high-resolution images with strong text alignment at 20x smaller size and 100x higher throughput than Flux-12B by combining 32x image compression, linear DiT blocks, and a decoder-only LLM text encoder.
- CameraCtrl: Enabling Camera Control for Text-to-Video Generation
  CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
  Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.
- MVDream: Multi-view Diffusion for 3D Generation
  MVDream is a multi-view diffusion model that functions as a generalizable 3D prior, enabling more consistent text-to-3D generation and few-shot 3D concept learning from 2D examples.
- IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
  IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
  SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...
- Embody4D: A Generalist 4D World Model for Embodied AI
  Embody4D generates high-fidelity, view-consistent novel views from monocular videos for embodied scenarios via 3D-aware data synthesis, adaptive noise injection, and interaction-aware attention.
- DiffMagicFace: Identity Consistent Facial Editing of Real Videos
  DiffMagicFace uses concurrent fine-tuned text and image diffusion models plus a rendered multi-view dataset to achieve identity-consistent text-conditioned editing of real facial videos.
- ADP-DiT: Text-Guided Diffusion Transformer for Brain Image Generation in Alzheimer's Disease Progression
  ADP-DiT is a text-conditioned diffusion transformer for synthesizing longitudinal Alzheimer's MRI scans, reporting SSIM 0.8739 and PSNR 29.32 dB with improvements over a DiT baseline.
- 3D Smoke Scene Reconstruction Guided by Vision Priors from Multimodal Large Language Models
  A framework that combines MLLM-based image enhancement with a medium-aware 3D Gaussian Splatting model to reconstruct and render smoke scenes.
- Movie Gen: A Cast of Media Foundation Models
  A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
- Cosmos World Foundation Model Platform for Physical AI
  The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.