AsyncPatch Diffusion: spatially-flexible image generation

Daniel Sýkora; Guillaume Couairon; Klaus Greff; Romuald Elie; Samuele Papa; Valentin De Bortoli

arxiv: 2606.07079 · v1 · pith:7KOXAEU5new · submitted 2026-06-05 · 💻 cs.CV

Pith reviewed 2026-06-27 22:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords: diffusion models · image generation · inpainting · spatially adaptive generation · ELBO · noise scheduling · joint diffusion

The pith

AsyncPatch Diffusion assigns independent noise levels to different image regions in a single joint-diffusion model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard diffusion models apply one shared noise level across an entire image, so every region follows the same denoising path. AsyncPatch Diffusion instead lets each input dimension, such as a pixel or latent token, receive its own noise level. The paper shows that this asynchronous corruption still defines a valid generative process and proves an evidence lower bound (ELBO) for it. It also introduces a controlled noise-level sampler that keeps training from over-focusing on highly heterogeneous noise configurations. One resulting model can then generate images with spatially varying denoising schedules, perform inpainting, and apply input guidance without any task-specific fine-tuning. Experiments on ImageNet 256 and LSUN show quality comparable to conventional diffusion while enabling adaptive strategies such as uncertainty-guided acceleration.

Core claim

AsyncPatch Diffusion is a joint-diffusion framework that assigns distinct noise levels to separate input dimensions such as pixels or latent tokens. This asynchronous corruption defines a valid generative process and supports a richer family of spatially heterogeneous denoising trajectories; the paper proves the first valid ELBO for the process. A controlled noise-level sampler regulates both average corruption and spatial variability so that homogeneous configurations remain well represented during training.

What carries the argument

The asynchronous corruption mechanism that assigns independent noise levels to different spatial dimensions while preserving a valid joint generative process.
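
To make the mechanism concrete, here is a minimal NumPy sketch of corrupting an image with one noise level per patch. It is an illustration only, not the paper's implementation: the cosine variance-preserving schedule, the patch layout, and the names `cosine_alpha_sigma` and `corrupt_async` are all assumptions introduced here.

```python
import numpy as np

def cosine_alpha_sigma(t):
    """Assumed variance-preserving cosine schedule: alpha^2 + sigma^2 = 1."""
    t = np.asarray(t, dtype=np.float64)
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def corrupt_async(x0, t_patch, patch, rng):
    """Corrupt image x0 of shape (H, W, C) with one noise level per patch.

    t_patch has shape (H // patch, W // patch) with values in [0, 1],
    where 0 means clean and 1 means pure noise.
    """
    # Expand per-patch levels to a per-pixel map of shape (H, W, 1).
    t_map = np.repeat(np.repeat(t_patch, patch, axis=0), patch, axis=1)[..., None]
    alpha, sigma = cosine_alpha_sigma(t_map)
    eps = rng.standard_normal(x0.shape)
    # Each pixel is mixed with noise according to its own level.
    return alpha * x0 + sigma * eps, eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8, 3))
t_patch = np.array([[0.0, 1.0], [0.3, 0.7]])  # four 4x4 patches
x_t, eps = corrupt_async(x0, t_patch, patch=4, rng=rng)
```

In this toy example the top-left patch stays clean, the top-right patch is pure noise, and the bottom two patches sit in between. A standard shared-noise model corresponds to the special case where every entry of `t_patch` is equal.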

If this is right

A single pretrained model can perform inpainting by denoising unknown regions while holding known regions at low or zero noise.
Input guidance from clean or partially corrupted regions improves texture matching and local consistency in generated areas.
Uncertainty-guided acceleration and autoregressive sampling become native capabilities of the same model.
Spatially adaptive generation works without any task-specific fine-tuning on standard benchmarks.
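
The inpainting consequence listed above can be sketched as a per-position noise-level map: known positions are pinned at zero noise while unknown positions follow an ordinary schedule from full noise to clean. The helper name `inpainting_timestep_map`, the 4x4 token grid, and the five-step linear schedule are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def inpainting_timestep_map(known_mask, t_unknown, t_known=0.0):
    """Per-position noise levels: known positions at t_known, the rest at t_unknown."""
    known_mask = np.asarray(known_mask, dtype=bool)
    return np.where(known_mask, t_known, t_unknown)

# Left half of a 4x4 token grid is known; the right half is to be generated.
known = np.zeros((4, 4), dtype=bool)
known[:, :2] = True

# The unknown region walks from full noise (t = 1) down to clean (t = 0).
schedule = np.linspace(1.0, 0.0, num=5)
maps = [inpainting_timestep_map(known, t) for t in schedule]
```

Each map in `maps` would be passed to the denoiser alongside the current sample, so the same network sees clean context and noisy targets in one forward pass. Setting `t_known` slightly above zero would correspond to holding the known region at low rather than zero noise.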

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework may reduce the number of specialized models needed for conditional or region-specific image tasks.
Extending the per-dimension noise idea to video or 3D data could allow independent temporal or depth denoising schedules.
If the sampler generalizes, similar asynchronous corruption might apply to other generative processes such as score-based or flow models.

Load-bearing premise

The controlled noise-level sampler can balance homogeneous and heterogeneous configurations during training without degrading performance on standard uniform-noise trajectories.
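
One way to picture what such a sampler has to do is the toy sketch below. It is not the paper's sampler: the function `sample_timesteps`, the parameters `p_homogeneous` and `max_spread`, and the uniform offsets are assumptions made purely for illustration. It draws an average corruption level first, then either shares it across all tokens or spreads tokens around it by a bounded amount.

```python
import numpy as np

def sample_timesteps(n_tokens, rng, p_homogeneous=0.5, max_spread=0.5):
    """Hypothetical sampler: draw a mean level, then a bounded spread around it."""
    t_mean = rng.uniform(0.0, 1.0)
    if rng.uniform() < p_homogeneous:
        # Homogeneous configuration: every token shares the same noise level.
        return np.full(n_tokens, t_mean)
    # Heterogeneous configuration: tokens deviate from the mean by at most `spread`.
    spread = rng.uniform(0.0, max_spread)
    offsets = rng.uniform(-1.0, 1.0, size=n_tokens)
    return np.clip(t_mean + spread * offsets, 0.0, 1.0)

rng = np.random.default_rng(0)
batch = np.stack([sample_timesteps(256, rng) for _ in range(2000)])
per_image_mean = batch.mean(axis=1)
per_image_std = batch.std(axis=1)
# Fraction of draws whose tokens all share (numerically) one noise level.
frac_homogeneous = float((per_image_std < 1e-12).mean())
```

With these assumed settings, roughly half of the draws are uniform-noise configurations and the per-image mean level covers the whole range from clean to fully noisy. The premise above amounts to asking whether the paper's actual sampler keeps a comparable share of homogeneous configurations without hurting uniform-noise inference.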

What would settle it

Train the AsyncPatch model on ImageNet 256 with the controlled sampler and check whether FID scores remain within a few points of a matched standard diffusion baseline under identical architecture and compute.

Figures

Figures reproduced from arXiv: 2606.07079 by Daniel Sýkora, Guillaume Couairon, Klaus Greff, Romuald Elie, Samuele Papa, Valentin De Bortoli.

**Figure 2.** Perlin sampling produces an inpainting-like clean/noisy partition, patchwise sampling draws ...

**Figure 3.** Sampled training timesteps and distribution of the mean timestep per image. From left to ...

**Figure 4.** Qualitative performance of the models on ImageNet 256 and LSUN bedroom. Shown are ...

**Figure 5.** Effect of input guidance (0.0 for top vs 2.0 for bottom) in the inpainting of the right part of ...

**Figure 7.** On the left: autoregressive sampling from full noise to sample. In the middle and right: ...

**Figure 8.** Additional details on timestep sampling used during training.

**Figure 9.** Effect of input guidance and classifier-free guidance (CFG) on both LPIPS and FID ...

**Figure 10.** Comparison of timestep sampling methods on four ImageNet-64 classes. The same seed ...

**Figure 11.** Generated ImageNet 256 samples using AsyncPatch latent diffusion.

**Figure 12.** Generated ImageNet 256 samples using AsyncPatch latent diffusion.

**Figure 13.** Generated ImageNet 256 samples using AsyncPatch latent diffusion.

**Figure 14.** Qualitative comparison of texture synthesis. Images are original and un-altered, zoom-in ...

Original abstract

Standard diffusion models corrupt an entire sample with a single shared noise level, forcing all spatial regions to follow the same denoising trajectory. We introduce AsyncPatch Diffusion, a joint-diffusion framework that assigns distinct noise levels to different input dimensions, such as image pixels or latent tokens. We show how this asynchronous corruption defines a valid generative process while supporting a richer family of spatially heterogeneous denoising trajectories, and prove the first valid ELBO for this process. We show that a single pretrained model can perform spatially adaptive generation, where different regions are denoised on different schedules. A key challenge is training: naive independent noise-level sampling overemphasizes highly heterogeneous configurations and underrepresents homogeneous noise levels, which are crucial during sampling. We address this with a controlled noise-level sampler that regulates both the average corruption level and its spatial variability. AsyncPatch achieves generation quality comparable to conventional diffusion on ImageNet 256 and LSUN, while being natively suited for inpainting without task-specific fine-tuning. We further introduce input guidance, which uses clean or partially corrupted regions to guide the generation of unknown regions, improving local consistency and texture matching. Finally, we demonstrate adaptive generation strategies including uncertainty-guided acceleration and autoregressive sampling.
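
The training challenge described in the abstract can be checked numerically with a short sketch. It assumes 256 tokens per image and noise levels drawn independently from U(0, 1); both are illustrative assumptions, not settings reported in the paper. Under them, the mean level per image concentrates tightly around 0.5 while the spread within each image stays large, so near-homogeneous configurations are essentially never drawn.

```python
import numpy as np

n_tokens = 256  # assumed token count, e.g. a 16 x 16 latent grid
rng = np.random.default_rng(0)

# Naive scheme: every token draws its noise level independently from U(0, 1).
t = rng.uniform(0.0, 1.0, size=(10_000, n_tokens))
mean_per_image = t.mean(axis=1)
std_within_image = t.std(axis=1)

# Analytic values for i.i.d. U(0, 1) levels.
expected_std_of_mean = np.sqrt(1.0 / (12.0 * n_tokens))  # about 0.018
expected_within_std = np.sqrt(1.0 / 12.0)                # about 0.289

print(mean_per_image.std(), expected_std_of_mean)
print(std_within_image.mean(), expected_within_std)
```

Standard uniform-noise sampling, by contrast, needs configurations with zero spread within an image and a mean level anywhere between 0 and 1, which is the regime the naive scheme leaves uncovered.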

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AsyncPatch lets a diffusion model use an independent noise level per patch, with a claimed ELBO, a controlled sampler, and native inpainting, but the mathematical details and the evidence for sampler balance are thin.

AsyncPatch Diffusion assigns separate noise levels to different image regions or tokens instead of one shared schedule across the whole sample. The authors claim this defines a valid process, derive what they call the first ELBO for it, and introduce a controlled sampler to keep training from skewing too far toward mixed-noise cases.

The practical results are the strongest part. A single model handles spatially adaptive generation, does inpainting without fine-tuning, and uses input guidance from clean or partly corrupted areas to improve consistency in the unknown parts. They also show uncertainty-guided acceleration and autoregressive sampling. On ImageNet 256 and LSUN the quality matches standard diffusion, which is a useful check that the changes do not break uniform-trajectory performance.

The soft spots sit in the foundations. The abstract states a valid ELBO but supplies no derivation or proof outline, so it is impossible to judge whether the asynchronous corruption really yields a proper generative model. The controlled sampler is presented as fixing the bias from naive independent sampling, yet there is no analysis or ablation showing it still supplies enough homogeneous batches for standard inference. If the sampler under-represents uniform noise levels, the score function could be biased and the comparable benchmark numbers would rest on shaky ground. That assumption is load-bearing and currently unverified.

The work is aimed at people building diffusion models for image synthesis and editing who want built-in spatial flexibility. A reader focused on new mechanisms for heterogeneous denoising or task-free inpainting would get concrete ideas from it. It deserves peer review because the core mechanism is distinct from shared-noise diffusion and the empirical demonstrations are clear enough to test, even though the technical claims need the full derivations and controls to stand up.

Referee Report

2 major / 0 minor

Summary. The paper introduces AsyncPatch Diffusion, a joint-diffusion framework assigning distinct noise levels to different input dimensions (e.g., image pixels or latent tokens). It claims this asynchronous corruption defines a valid generative process supporting spatially heterogeneous denoising trajectories, provides the first valid ELBO for the process, introduces a controlled noise-level sampler to balance training configurations, and demonstrates a single pretrained model performing spatially adaptive generation, native inpainting, input guidance, uncertainty-guided acceleration, and autoregressive sampling, with generation quality comparable to standard diffusion on ImageNet 256 and LSUN.

Significance. If the ELBO is valid and the controlled sampler preserves coverage of homogeneous trajectories without bias, the framework would enable richer spatially adaptive generation and task-agnostic inpainting in a single model, representing a meaningful extension of diffusion models for heterogeneous denoising schedules.

major comments (2)

[Abstract] The controlled noise-level sampler is introduced to regulate average corruption level and spatial variability, addressing the issue that naive independent sampling overemphasizes heterogeneous configurations. However, no derivation, density analysis, or ablation demonstrates that the induced training distribution maintains sufficient measure on homogeneous noise levels (uniform-t trajectories) used at inference. This is load-bearing for the claim of comparable ImageNet/LSUN quality under conventional uniform sampling.
[Abstract] The paper asserts a valid ELBO for the asynchronous corruption process and a new sampler, but provides no derivation details, experimental controls, or error analysis. The soundness of the central generative-process claim cannot be assessed from the given information.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comments point by point below, agreeing that additional details will strengthen the presentation, and will revise accordingly.

Referee: [Abstract] The controlled noise-level sampler is introduced to regulate average corruption level and spatial variability, addressing the issue that naive independent sampling overemphasizes heterogeneous configurations. However, no derivation, density analysis, or ablation demonstrates that the induced training distribution maintains sufficient measure on homogeneous noise levels (uniform-t trajectories) used at inference. This is load-bearing for the claim of comparable ImageNet/LSUN quality under conventional uniform sampling.

Authors: We agree that explicit verification of coverage on homogeneous trajectories is important for supporting the quality claims. The controlled sampler is intended to balance average corruption level and spatial variability to avoid overemphasizing heterogeneous cases. In the revised manuscript we will add a formal derivation of the induced training distribution, density analysis quantifying measure on uniform-t trajectories, and ablations showing that generation quality remains comparable when the sampler is used versus naive sampling. This will directly address the load-bearing concern.

revision: yes

Referee: [Abstract] The paper asserts a valid ELBO for the asynchronous corruption process and a new sampler, but provides no derivation details, experimental controls, or error analysis. The soundness of the central generative-process claim cannot be assessed from the given information.

Authors: The manuscript states a proof of ELBO validity for the asynchronous process. We acknowledge that the current presentation may not provide sufficient accessible details for full assessment. In revision we will expand the derivation with additional step-by-step explanations, include experimental controls that validate the ELBO under asynchronous schedules, and add error analysis to quantify approximation quality. These additions will make the soundness of the generative-process claim easier to evaluate.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; ELBO derivation and sampler are independent of fitted inputs

full rationale

The paper presents an explicit derivation of a valid ELBO for asynchronous per-dimension noise corruption and introduces a controlled noise-level sampler to address training distribution issues. No equations or claims reduce a prediction to a fitted parameter by construction, no self-citation chains justify core premises, and no ansatz is smuggled via prior work. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no concrete information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5754 in / 1021 out tokens · 27872 ms · 2026-06-27T22:31:31.707144+00:00 · methodology



Jakub Fišer, Ondˇrej Jamriška, Michal Lukáˇc, Eli Shechtman, Paul Asente, Jingwan Lu, and Daniel Sýkora. StyLit: Illumination-Guided Example-Based Stylization of 3D Renderings. ACM Transactions on Graphics, 35(4):92, 2016

2016

[15] [15]

Mathis Gerdes, Max Welling, and Miranda C. N. Cheng. GUD: Generation with Unified Diffusion, October 2024. URL http://arxiv.org/abs/2410.02667. arXiv:2410.02667 [cs]

arXiv 2024

[16] [16]

Gradpaint: Gradient-guided inpainting with diffusion models.Computer Vision and Image Understanding, 244:103928, 2025

Asya Grechka, Guillaume Couairon, and Matthieu Cord. Gradpaint: Gradient-guided inpainting with diffusion models.Computer Vision and Image Understanding, 244:103928, 2025

2025

[17] [17]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in neural information processing systems, volume 33, pages 6840–6851, 2020

2020

[18] [18]

Video diffusion models.Advances in Neural Information Processing Systems, 35:5733–5747, 2022

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, Tim Salimans, and others. Video diffusion models.Advances in Neural Information Processing Systems, 35:5733–5747, 2022

2022

[19] [19]

Peter Holderrieth, Marton Havasi, Jason Yim, Neta Shaul, Itai Gat, Tommi Jaakkola, Brian Karrer, Ricky T. Q. Chen, and Yaron Lipman. Generator matching: Generative modeling with arbitrary markov processes. InThe Thirteenth International Conference on Learning Representations (ICLR), 2025. 10

2025

[20] [20]

A variational perspective on diffusion- based generative models and score matching.Advances in Neural Information Processing Systems, 34:22863–22876, 2021

Chin-Wei Huang, Jae Hyun Lim, and Aaron C Courville. A variational perspective on diffusion- based generative models and score matching.Advances in Neural Information Processing Systems, 34:22863–22876, 2021

2021

[21] [21]

Self Tuning Texture Optimization.Computer Graphics Forum, 34(2):349–360, 2015

Alexandre Kaspar, Boris Neubert, Dani Lischinski, Mark Pauly, and Johannes Kopf. Self Tuning Texture Optimization.Computer Graphics Forum, 34(2):349–360, 2015

2015

[22] [22]

A versatile diffusion transformer with mixture of noise levels for audiovisual generation

Gunwoo Kim, Alejandro Martinez, Yu-Chuan Su, et al. A versatile diffusion transformer with mixture of noise levels for audiovisual generation. InNeurIPS, 2024

2024

[23] [23]

RAD: Region-Aware Diffusion Models for Image Inpainting, December 2024

Sora Kim, Sungho Suh, and Minsik Lee. RAD: Region-Aware Diffusion Models for Image Inpainting, December 2024. URL http://arxiv.org/abs/2412.09191. arXiv:2412.09191 [cs]

arXiv 2024

[24] [24]

Don’t Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation, October 2025

Woojin Kim and Jaeyoung Do. Don’t Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation, October 2025. URL http://arxiv.org/abs/2510. 26200. arXiv:2510.26200 [cs]

arXiv 2025

[25] [25]

DiffWave: a versatile diffusion model for audio synthesis

Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: a versatile diffusion model for audio synthesis. InInternational conference on learning representations, 2021

2021

[26] [26]

Essa, Aaron F

Vivek Kwatra, Irfan A. Essa, Aaron F. Bobick, and Nipun Kwatra. Texture optimization for example-based synthesis.ACM Transactions on Graphics, 24(3):795–802, 2005

2005

[27] [27]

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space, June 2025

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. FLUX.1 Kontext: Flow Matching for In-Context Image ...

Pith/arXiv arXiv 2025

[28] [28]

Omniflow: Any-to-any generation with multi-modal rectified flows

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Zichun Liao, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Omniflow: Any-to-any generation with multi-modal rectified flows. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025

[29] [29]

RePaint: Inpainting using Denoising Diffusion Probabilistic Models, August 2022

Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using Denoising Diffusion Probabilistic Models, August 2022. URL http://arxiv.org/abs/2201.09865. arXiv:2201.09865 [cs]

arXiv 2022

[30] [30]

Hd-painter: high-resolution and prompt-faithful text-guided image in- painting with diffusion models

Hayk Manukyan, Andranik Sargsyan, Barsegh Atanyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Hd-painter: high-resolution and prompt-faithful text-guided image in- painting with diffusion models. InThe Thirteenth International Conference on Learning Representations, 2023

2023

[31] [31]

Efficient zero-shot inpainting with decoupled diffusion guidance.arXiv preprint arXiv:2512.18365, 2025

Badr Moufad, Navid Bagheri Shouraki, Alain Oliviero Durmus, Thomas Hirtz, Eric Moulines, Jimmy Olsson, and Yazid Janati. Efficient zero-shot inpainting with decoupled diffusion guidance.arXiv preprint arXiv:2512.18365, 2025

Pith/arXiv arXiv 2025

[32] [32]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, March 2022

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, March 2022. URL http://arxiv.org/abs/2112. 10741. arXiv:2112.10741 [cs]

Pith/arXiv arXiv 2022

[33] [33]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023

[34] [34]

Film: Visual reasoning with a general conditioning layer

Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. InAAAI Conference on Artificial Intelligence, 2018

2018

[35] [35]

DreamFusion: Text-to-3D using 2D diffusion

Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. InInternational conference on learning representations, 2023. 11

2023

[36] [36]

Ye, and Molei Tao

Kevin Rojas, Yuchen Zhu, Sichen Zhu, Felix X.-F. Ye, and Molei Tao. Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces, June 2025. URL http://arxiv. org/abs/2506.07903. arXiv:2506.07903 [cs]

arXiv 2025

[37] [37]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

2022

[38] [38]

Mm-diffusion: Learning multi-modal diffusion models for joint audio and video generation

Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, and Baining Guo. Mm-diffusion: Learning multi-modal diffusion models for joint audio and video generation. InCVPR, 2023

2023

[39] [39]

Rolling Diffusion Models, September 2024

David Ruhe, Jonathan Heek, Tim Salimans, and Emiel Hoogeboom. Rolling Diffusion Models, September 2024. URLhttp://arxiv.org/abs/2402.09470. arXiv:2402.09470 [cs]

arXiv 2024

[40] [40]

Palette: Image-to-image diffusion models

Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. InACM SIGGRAPH 2022 conference proceedings, pages 1–10, 2022

2022

[41] [41]

Denoising, fast and slow: Difficulty-aware adaptive sampling for image generation

Johannes Schusterbauer, Ming Gui, Yusong Li, Pingchuan Ma, Felix Krause, and Björn Om- mer. Denoising, fast and slow: Difficulty-aware adaptive sampling for image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

2026

[42] [42]

Large-scale text-to-image model with inpainting is a zero-shot subject-driven image generator

Chaehun Shin, Jooyoung Choi, Heeseung Kim, and Sungroh Yoon. Large-scale text-to-image model with inpainting is a zero-shot subject-driven image generator. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7986–7996, 2025

2025

[43] [43]

Deep unsuper- vised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsuper- vised learning using nonequilibrium thermodynamics. InInternational conference on machine learning, pages 2256–2265, 2015

2015

[44] [44]

History-Guided Video Diffusion, July 2025

Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-Guided Video Diffusion, July 2025. URL http://arxiv.org/abs/2502.06764. arXiv:2502.06764 [cs]

Pith/arXiv arXiv 2025

[45] [45]

Resolution-robust Large Mask Inpainting with Fourier Convolutions.arXiv preprint arXiv:2109.07161, 2021

Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lem- pitsky. Resolution-robust Large Mask Inpainting with Fourier Convolutions.arXiv preprint arXiv:2109.07161, 2021

arXiv 2021

[46] [46]

Unified multimodal discrete diffusion.arXiv preprint arXiv:2503.20853, 2025

Alexander Swerdlow, Mihir Prabhudesai, Siddharth Gandhi, Deepak Pathak, and Katerina Fragkiadaki. Unified multimodal discrete diffusion.arXiv preprint arXiv:2503.20853, 2025. doi: 10.48550/arXiv.2503.20853

work page doi:10.48550/arxiv.2503.20853 2025

[47] [47]

De novo design of protein structure and function with RFdiffusion.Nature, 620(7976):1089–1100, 2023

Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, and others. De novo design of protein structure and function with RFdiffusion.Nature, 620(7976):1089–1100, 2023

2023

[48] [48]

Spatial reasoning with denoising models

Christopher Wewer, Bart Pogodzinski, Bernt Schiele, and Jan Eric Lenssen. Spatial reasoning with denoising models. InInternational Conference on Machine Learning, 2025. doi: 10. 48550/arXiv.2502.21075

arXiv 2025

[49] [49]

AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation, December 2023

Tong Wu, Zhihao Fan, Xiao Liu, Yeyun Gong, Yelong Shen, Jian Jiao, Hai-Tao Zheng, Juntao Li, Zhongyu Wei, Jian Guo, Nan Duan, and Weizhu Chen. AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation, December 2023. URL http://arxiv.org/abs/2305. 09515. arXiv:2305.09515 [cs]

arXiv 2023

[50] [50]

Turbofill: adapting few-step text- to-image model for fast image inpainting

Liangbin Xie, Daniil Pakhomov, Zhonghao Wang, Zongze Wu, Ziyan Chen, Yuqian Zhou, Haitian Zheng, Zhifei Zhang, Zhe Lin, Jiantao Zhou, et al. Turbofill: adapting few-step text- to-image model for fast image inpainting. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7613–7622, 2025. 12

2025

[51] [51]

Energy-Based Diffusion Language Models for Text Generation, March 2025

Minkai Xu, Tomas Geffner, Karsten Kreis, Weili Nie, Yilun Xu, Jure Leskovec, Stefano Ermon, and Arash Vahdat. Energy-Based Diffusion Language Models for Text Generation, March 2025. URLhttp://arxiv.org/abs/2410.21357. arXiv:2410.21357 [cs]

arXiv 2025

[52] [52]

remaining capacity

Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: construction of a large-scale image dataset using deep learning with humans in the loop.CoRR, abs/1506.03365, 2015. 13 A Proofs of the Lemmas A.1 Proof Lemma 1 Before we begin the proof, we must setup the basic definition and a preliminary Lemma. Definition: Generalized Denoising Sco...

Pith/arXiv arXiv 2015