Diffusion Domain Expansion: Learning to Coordinate Pre-trained Diffusion Models

Egor Lifar; Semyon Savkin; Shangyuan Tong; Timur Garipov; Tommi Jaakkola

arxiv: 2605.23275 · v1 · pith:MCCFVMZAnew · submitted 2026-05-22 · 💻 cs.LG

Pith reviewed 2026-05-25 04:40 UTC · model grok-4.3

classification 💻 cs.LG

keywords diffusion models · domain expansion · coordinated generation · image generation · audio generation · pre-trained models · conditional generation · model coordination

The pith

A compact coordinator network lets pre-trained diffusion models generate outputs in domains larger than their original training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Diffusion Domain Expansion as a way to extend pre-trained diffusion models to larger objects and more complex conditioning without full retraining. It introduces a small trainable network whose role is to coordinate the denoised outputs produced by those models. The central demonstration is that this coordinator remains simple in design yet can generalize to domain sizes it never saw during its own training. Experiments apply the approach to long audio generation and conditional image synthesis, where it is reported to outperform alternative coordination techniques on both qualitative and quantitative measures. A sympathetic reader would see this as a route to reuse existing powerful models on bigger tasks with only modest additional training.

Core claim

Diffusion Domain Expansion trains a compact network to coordinate the denoised outputs of multiple pre-trained diffusion models, thereby enabling coherent generation of larger objects and more intricate conditioning signals than the individual models were originally designed to produce, with the coordinator itself generalizing beyond the domain sizes encountered in its training.

What carries the argument

The compact trainable coordinator network that merges denoised outputs from separate pre-trained diffusion models.
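The data flow described here can be sketched in one dimension. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_denoiser` is a hypothetical stand-in for a frozen pre-trained denoiser, and `overlap_average` is a fixed merging rule (plain averaging over overlaps, in the spirit of MultiDiffusion-style updates) that stands in for the learned coordinator. All function names and values are illustrative.

```python
def toy_denoiser(window_values):
    """Hypothetical stand-in for a frozen pre-trained denoiser: returns the window mean."""
    mean = sum(window_values) / len(window_values)
    return [mean] * len(window_values)


def overlap_average(window_outputs, starts, length):
    """Merge per-window outputs into one signal, averaging wherever windows overlap."""
    total = [0.0] * length
    count = [0] * length
    for output, start in zip(window_outputs, starts):
        for offset, value in enumerate(output):
            total[start + offset] += value
            count[start + offset] += 1
    if min(count) == 0:
        raise ValueError("windows do not cover the whole domain")
    return [t / c for t, c in zip(total, count)]


def coordinated_denoise(x, starts, window, denoiser, coordinator=overlap_average):
    """Run the frozen denoiser on each window, then let the coordinator merge the results."""
    outputs = [denoiser(x[s:s + window]) for s in starts]
    return coordinator(outputs, starts, len(x))


x = [0.0, 0.0, 0.0, 6.0, 6.0, 6.0]
print(coordinated_denoise(x, starts=[0, 2], window=4, denoiser=toy_denoiser))
# [1.5, 1.5, 3.0, 3.0, 4.5, 4.5]
```

The `coordinator` argument marks the slot a trainable network would occupy. A learned coordinator would replace the fixed averaging rule while the per-window denoiser stays frozen.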

If this is right

Pre-trained diffusion models can produce long audio tracks through coordination rather than retraining.
Conditional image generation can scale to larger resolutions or more complex prompts using the same base models.
The coordinator generalizes to domain sizes larger than those supplied during its training.
Quantitative and qualitative results exceed those of prior methods for coordinating multiple diffusion models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may lower the cost of scaling diffusion-based generation to higher resolutions or longer sequences by reusing existing models.
Coordination of this form could be tested on other generative families such as autoregressive transformers.
Minimal additional parameters might suffice for domain expansion across additional modalities like video or 3D.

Load-bearing premise

A compact trainable network can coordinate denoised outputs from pre-trained diffusion models so that the combination generalizes to unseen larger domains without needing domain-specific architecture or large amounts of new data.

What would settle it

Train the coordinator only on short audio clips or small image patches, then measure whether generation quality and coherence remain high when the same models are applied to audio tracks or images several times longer or larger than any example used in coordinator training.
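A rough sketch of how such a test grid could be laid out, assuming a one-dimensional domain covered by fixed-size overlapping windows. The helper names, window size, and stride are hypothetical and are not taken from the paper. The sketch only enumerates the extension factors and window counts to evaluate; the quality and coherence measurements themselves are left out.

```python
def window_starts(length, window, stride):
    """Start indices of overlapping windows that together cover [0, length)."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if length < window:
        raise ValueError("domain is shorter than one window")
    starts = list(range(0, length - window + 1, stride))
    # Add a final window flush with the end if the stride left a gap.
    if starts[-1] + window < length:
        starts.append(length - window)
    return starts


def extension_report(train_length, test_lengths, window, stride):
    """For each test length, report window count and extension factor vs. training."""
    train_windows = len(window_starts(train_length, window, stride))
    report = []
    for length in test_lengths:
        n_windows = len(window_starts(length, window, stride))
        report.append({
            "length": length,
            "windows": n_windows,
            "extension_factor": length / train_length,
            # True when the coordinator would see more windows than in training.
            "beyond_training": n_windows > train_windows,
        })
    return report


for row in extension_report(train_length=8, test_lengths=[8, 16, 32], window=4, stride=2):
    print(row)
```

Holding the coordinator fixed while sweeping `test_lengths` upward is the kind of ablation that would show where, if anywhere, coherence starts to degrade.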

Figures

Figures reproduced from arXiv: 2605.23275 by Egor Lifar, Semyon Savkin, Shangyuan Tong, Timur Garipov, Tommi Jaakkola.

**Figure 1.** Overview of the architecture. A large image is decomposed into a set of overlapping patches; each patch is …

**Figure 2.** Generated sample from the RRR model (left) and …

**Figure 3.** Satellite image generation example: DDE sample …

Original abstract

In this paper, we propose Diffusion Domain Expansion (DDE), a method that efficiently extends pre-trained diffusion models to generate larger objects and handle more complex conditioning beyond their original capabilities. Our method employs a compact trainable network designed to coordinate the denoised outputs of pre-trained diffusion models. We demonstrate that the coordinator can be universally simple while being capable of generalizing to domains larger than those observed during its training time. We evaluate DDE on long audio track generation and conditional image generation, demonstrating its applicability across domains. DDE outperforms other approaches to coordinated generation with diffusion models in qualitative and quantitative evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DDE adds a compact coordinator to stitch pre-trained diffusion models for larger scales, with empirical wins on audio and images but thin support for the generalization claim.

The core contribution is Diffusion Domain Expansion, which trains a small network to coordinate the outputs of multiple pre-trained diffusion models so they can produce larger objects or handle richer conditioning than any single model was built for. The paper shows this on long audio tracks and conditional image generation, where the coordinator stays simple and the results beat other coordination baselines in both qualitative samples and quantitative metrics. That reuse of existing checkpoints without full retraining is the practical angle that stands out.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes Diffusion Domain Expansion (DDE), a method employing a compact trainable coordinator network to combine denoised outputs from multiple pre-trained diffusion models. This enables generation of larger objects and more complex conditioning than the base models support. The coordinator is claimed to be universally simple yet capable of generalizing to domains strictly larger than those seen in training. Evaluations are presented on long audio track generation and conditional image generation, where DDE is reported to outperform other coordinated diffusion approaches in both qualitative and quantitative metrics.

Significance. If the generalization claim holds with supporting analysis, the result would be significant for enabling efficient scaling of pre-trained diffusion models to new resolutions or lengths without retraining large base models. This could reduce computational costs in audio and image domains. The approach of learning a lightweight coordinator rather than domain-specific redesigns is conceptually appealing, though the current evidence for cross-scale compatibility is insufficient to establish the result.

major comments (1)

[Abstract] The central generalization claim—that a fixed compact coordinator trained on smaller domains can produce coherent outputs on strictly larger domains by combining independent pre-trained denoisers—is load-bearing but unsupported. No derivation, bound, or analysis is given on the compatibility of marginal distributions or noise schedules when outputs are spatially or temporally extended. The evaluations on long audio and conditional images report success but include no ablations isolating the scale gap or testing collapse when the extension factor exceeds the training regime.

minor comments (1)

The abstract provides no equations, model architecture details, loss functions, or quantitative metrics (e.g., specific scores or error bars), which hinders assessment of the technical implementation.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the detailed and constructive report. The primary concern centers on the empirical versus theoretical support for the coordinator's generalization to strictly larger domains. We address this below and indicate planned revisions.

Referee: [Abstract] The central generalization claim—that a fixed compact coordinator trained on smaller domains can produce coherent outputs on strictly larger domains by combining independent pre-trained denoisers—is load-bearing but unsupported. No derivation, bound, or analysis is given on the compatibility of marginal distributions or noise schedules when outputs are spatially or temporally extended. The evaluations on long audio and conditional images report success but include no ablations isolating the scale gap or testing collapse when the extension factor exceeds the training regime.

Authors: We agree that the manuscript provides no formal derivation or bound on marginal distribution compatibility or noise schedule alignment under domain extension; the generalization result is presented as an empirical observation. The current experiments demonstrate successful application to longer audio tracks and higher-complexity image conditioning, but they do not isolate the precise scale gap or systematically test failure modes beyond the training regime. We will revise the manuscript to include (i) a dedicated discussion of the implicit assumptions on noise schedules and marginals, (ii) new ablations that vary the extension factor while holding the coordinator fixed, and (iii) explicit tests for output collapse when the test-time domain size substantially exceeds the training distribution. These additions will be placed in the experimental section and a new limitations paragraph.

Revision: yes

standing simulated objections not resolved

A formal theoretical bound or derivation establishing compatibility of marginal distributions and noise schedules for arbitrary domain extensions.

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent training and empirical generalization tests

full rationale

The paper introduces a compact trainable coordinator network whose parameters are learned from data on smaller domains, then evaluated for generalization on strictly larger domains. No equations or claims reduce the coordinator's behavior to a fitted input by construction, nor do any load-bearing steps depend on self-citations whose content is itself unverified. The central claim is supported by qualitative and quantitative evaluations on audio and image tasks rather than by re-labeling of training objectives. This is the most common honest finding for a method paper whose core contribution is an empirical architecture and training procedure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no information on free parameters, axioms, or invented entities; ledger is empty by necessity.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem.

$D^{[L]}(X^{[L]}(t), Y^{[L]}, t) = C^{[L]}\big([D(x_i(t), y_i, t)]_{i=1}^{L},\, [y_i]_{i=1}^{L},\, t\big)$ ... trained by minimizing the denoising error (1) ... generalize to domains larger than those observed during its training time

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

Relation between the paper passage and the cited Recognition theorem.

ViT-based coordinator ... overlap averaging ... MultiDiffusion-like updates

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 1 internal anchor

[1] [1]

2019 , eprint=

Cutting Music Source Separation Some Slakh: A Dataset to Study the Impact of Training Data Quality and Quantity , author=. 2019 , eprint=

work page 2019

[2] [2]

International Conference on Learning Representations , year=

Score-Based Generative Modeling through Stochastic Differential Equations , author=. International Conference on Learning Representations , year=

work page

[3] [3]

NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications , year=

Classifier-Free Diffusion Guidance , author=. NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications , year=

work page 2021

[4] [4]

Advances in Neural Information Processing Systems , volume=

Diffusion Models Beat GANs on Image Synthesis , author=. Advances in Neural Information Processing Systems , volume=

work page

[5] [5]

Advances in Neural Information Processing Systems , volume=

Implicit Generation and Modeling with Energy-Based Models , author=. Advances in Neural Information Processing Systems , volume=

work page

[6] [6]

Advances in Neural Information Processing Systems , volume=

Compositional Visual Generation with Energy Based Models , author=. Advances in Neural Information Processing Systems , volume=

work page

[7] [7]

Advances in Neural Information Processing Systems , volume=

Unsupervised Learning of Compositional Energy Concepts , author=. Advances in Neural Information Processing Systems , volume=

work page

[8] [8]

Advances in Neural Information Processing Systems , volume=

Learning to Compose Visual Relations , author=. Advances in Neural Information Processing Systems , volume=

work page

[9] [9]

Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XVII , pages=

Compositional Visual Generation with Composable Diffusion Models , author=. Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XVII , pages=. 2022 , organization=

work page 2022

[10] [10]

International Conference on Machine Learning , pages=

Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023

[11] [11]

International Conference on Machine Learning , pages=

Deep Unsupervised Learning using Nonequilibrium Thermodynamics , author=. International Conference on Machine Learning , pages=. 2015 , organization=

work page 2015

[12] [12]

Advances in Neural Information Processing Systems , volume=

Denoising Diffusion Probabilistic Models , author=. Advances in Neural Information Processing Systems , volume=

work page

[13] [13]

Advances in neural information processing systems , volume=

Generative Modeling by Estimating Gradients of the Data Distribution , author=. Advances in neural information processing systems , volume=

work page

[14] [14]

Advances in neural information processing systems , volume=

Improved Techniques for Training Score-Based Generative Models , author=. Advances in neural information processing systems , volume=

work page

[15] [15]

Stochastic Processes and their Applications , volume=

Reverse-time diffusion equation models , author=. Stochastic Processes and their Applications , volume=. 1982 , publisher=

work page 1982

[16] [16]

Journal of Machine Learning Research , volume=

Estimation of Non-Normalized Statistical Models by Score Matching , author=. Journal of Machine Learning Research , volume=

work page

[17] [17]

Uncertainty in Artificial Intelligence , pages=

Sliced Score Matching: A Scalable Approach to Density and Score Estimation , author=. Uncertainty in Artificial Intelligence , pages=. 2020 , organization=

work page 2020

[18] [18]

Advances in Neural Information Processing Systems , volume=

Elucidating the design space of diffusion-based generative models , author=. Advances in Neural Information Processing Systems , volume=

work page

[19] [19]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Blended diffusion for text-driven editing of natural images , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[20] [20]

International Conference on Machine Learning , pages=

MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023

[21] [21]

Advances in Neural Information Processing Systems , volume=

Syncdiffusion: Coherent montage via synchronized joint diffusions , author=. Advances in Neural Information Processing Systems , volume=

work page

[22] Huang, Ziqi and Chan, Kelvin C.K. and Jiang, Yuming and Liu, Ziwei. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[23] Diffcollage: Parallel generation of large content with diffusion models. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[24] Compositional Generative Inverse Design. The Twelfth International Conference on Learning Representations.

[25] Adding Conditional Control to Text-to-Image Diffusion Models. IEEE International Conference on Computer Vision (ICCV).

[26] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. arXiv preprint arXiv:2403.03206.

[27] High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[28] Zero-Shot Text-to-Image Generation. International Conference on Machine Learning, 2021.

[29] Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 2024.

[30] Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. Advances in Neural Information Processing Systems.

[31] Compositional Sculpting of Iterative Generative Processes. Thirty-seventh Conference on Neural Information Processing Systems.

[32] Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Thirty-seventh Conference on Neural Information Processing Systems.

[33] Particle Guidance: non-I.I.D. Diverse Sampling with Diffusion Models. The Twelfth International Conference on Learning Representations.

[34] Diffusion Probabilistic Modeling of Protein Backbones in 3D for the motif-scaffolding problem. The Eleventh International Conference on Learning Representations.

[35] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. Proceedings of Robotics: Science and Systems (RSS).

[36] Improving image generation with better captions. 2023.

[37] Fast Timing-Conditioned Latent Audio Diffusion. arXiv preprint arXiv:2402.04825.

[38] Choi, Jooyoung and Kim, Sungwon and Jeong, Yonghyun and Gwon, Youngjune and Yoon, Sungroh. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.

[39] Denoising diffusion restoration models. Advances in Neural Information Processing Systems.

[40] Lugmayr, Andreas and Danelljan, Martin and Romero, Andres and Yu, Fisher and Timofte, Radu and Van Gool, Luc. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[41] Improving diffusion models for inverse problems using manifold constraints. Advances in Neural Information Processing Systems.

[42] Video diffusion models. Advances in Neural Information Processing Systems.

[43] Palette: Image-to-image diffusion models. ACM SIGGRAPH 2022 Conference Proceedings.

[44] Multi-Source Diffusion Models for Simultaneous Music Generation and Separation. The Twelfth International Conference on Learning Representations.

[45] Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[46] An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. The Eleventh International Conference on Learning Representations.

[47] Prompt-to-Prompt Image Editing with Cross-Attention Control. The Eleventh International Conference on Learning Representations.

[48] Imagic: Text-based real image editing with diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[49] Collage diffusion. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.

[50] Chenlin Meng and Yutong He and Yang Song and Jiaming Song and Jiajun Wu and Jun-Yan Zhu and Stefano Ermon. 2022.

[51] DiffEdit: Diffusion-based semantic image editing with mask guidance. The Eleventh International Conference on Learning Representations.

[52] Training Diffusion Models with Reinforcement Learning. The Twelfth International Conference on Learning Representations.

[53] Directly Fine-Tuning Diffusion Models on Differentiable Rewards. The Twelfth International Conference on Learning Representations.

[54] Diffusion model alignment using direct preference optimization. arXiv preprint arXiv:2311.12908.

[55] Ren, Mengwei and Delbracio, Mauricio and Talebi, Hossein and Gerig, Guido and Milanfar, Peyman. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

[56] Whang, Jay and Delbracio, Mauricio and Talebi, Hossein and Saharia, Chitwan and Dimakis, Alexandros G. and Milanfar, Peyman. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[57] D-Flow: Differentiating through Flows for Controlled Generation. arXiv preprint arXiv:2402.14017.

[58] Flow network based generative models for non-iterative diverse candidate generation. Advances in Neural Information Processing Systems.

[59] RoFormer: Enhanced Transformer with Rotary Position Embedding. 2023.

[60] CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. 2016.

[61] Scalable Diffusion Models with Transformers. 2023.

[62] Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms. 2019.

[63] Averaging Weights Leads to Wider Optima and Better Generalization. 2019.

[64] On Aliased Resizing and Surprising Subtleties in GAN Evaluation. CVPR.