pith. machine review for the scientific record.

arxiv: 2604.19141 · v1 · submitted 2026-04-21 · 💻 cs.CV

Recognition: unknown

Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · image generation · adaptive sampling · patch-level denoising · difficulty prediction · ImageNet · text-to-image synthesis

The pith

Patch-level timesteps and difficulty prediction let diffusion models advance easy image regions first to inform harder ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion models for images normally apply the same denoising timestep and number of function evaluations to every patch, even though natural images contain regions of very different difficulty. The paper shows that naively assigning different timesteps per patch during training creates overly informative training states that never occur at inference, so the authors introduce a timestep sampler that caps the maximum information available per patch. Adding a small per-patch difficulty head then lets the sampler move easier patches forward in time, so they can supply context before more compute is spent on difficult patches. This combination, called Patch Forcing, produces better class-conditional ImageNet samples and extends to text-to-image generation while remaining compatible with guidance and alignment techniques.

Core claim

Moving from global to patch-level timesteps, controlled by a sampler that limits maximum patch information, already improves generation. Augmenting the model with a lightweight per-patch difficulty head enables adaptive allocation of denoising steps. Combined with noise levels that vary over both space and diffusion time, this yields Patch Forcing, which advances easier regions earlier so they can provide context for harder ones and achieves superior results on class-conditional ImageNet while scaling to text-to-image synthesis.

What carries the argument

Patch Forcing (PF), the framework that pairs spatially varying timesteps with an adaptive sampler driven by a per-patch difficulty head to prioritize context from easy patches before refining hard ones.

If this is right

  • Superior sample quality on class-conditional ImageNet generation compared with uniform-timestep baselines.
  • Compatibility with existing representation-alignment and classifier-free guidance methods.
  • Successful scaling from class-conditional to text-to-image synthesis without architectural overhaul.
  • Patch-level denoising schedules form a foundation for further adaptive image generation techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same per-patch difficulty signal could be reused at inference to decide where to spend extra function evaluations, potentially cutting total compute (see the sketch after this list).
  • The approach may transfer to video or 3D diffusion where spatial and temporal heterogeneity is even stronger.
  • Combining Patch Forcing with model distillation could compound efficiency gains by reducing steps while preserving the adaptive schedule.

Load-bearing premise

The assumption that a specially designed timestep sampler can prevent the model from seeing patch-wise noise combinations during training that never appear at inference time.

What would settle it

Training the same architecture with random per-patch timesteps but without the proposed sampler, then comparing FID or perceptual quality on ImageNet against both uniform-timestep baselines and the full Patch Forcing method, would directly test whether the sampler is necessary.

Figures

Figures reproduced from arXiv: 2604.19141 by Björn Ommer, Felix Krause, Johannes Schusterbauer, Ming Gui, Pingchuan Ma, Yusong Li.

Figure 1: Patch Forcing overview. We show that heterogeneous patch timesteps during training can improve image generation, even with basic Euler sampling, but only when paired with the right timestep sampling strategy. This further enables adaptive inference strategies that allocate more compute to difficult patches, yielding further gains under the same sampling budget.

Figure 2: Patch Forcing inference. Given the input, our model predicts a patch difficulty map over the velocity field, separating confident (easy) from uncertain (hard) denoising regions. In our difficulty-aware samplers, confident regions move faster along the denoising trajectory and provide intermediate, cleaner context for uncertain regions, improving generation performance.

Figure 3: Timestep sampling during training. The per-sample tmax distribution under SRM's [56] uniform t̄ schedule places very high mass near t = 1, meaning that for almost every training sample, there is at least one patch that is (near-)fully denoised. This introduces context leakage and a train-test mismatch. Our Logit-Normal Truncated Gaussian (LTG) sampler reduces this effect by controlling tmax while spreading …

Figure 4: Illustration of adaptive samplers. Left: 3 × 3 patchification of an image. Middle: for 9 patches, we visualize different sampling strategies. In Dual-Loop and Context Look-ahead, confident patches move faster than uncertain patches. Right: generation performance across scheduling strategies. While Patch Forcing already outperforms the SiT baseline, the ordering choice is important: structured approaches lik…

Figure 5: Our Logit-Normal Truncated Gaussian (LTG) sampler. (a) tmax determines where the truncated Gaussian is located, while std controls its spread. Setting std = 0 collapses the distribution to a Dirac delta distribution, reducing our method to standard Flow Matching. (b) Our sampler enables fast sampling via parallelization instead of SRM's recursive sampling.

Figure 7: Validation loss on uncertain regions with and without additional context from advanced confident patches. Providing "future" context consistently reduces the loss in uncertain areas. (Panels plot validation loss against model uncertainty at t = 0.2, 0.4, and 0.6, with R = 0.07, 0.43, and 0.53 respectively.)

Figure 6: Visualization of one-step predictions at varying timesteps and their corresponding uncertainty maps. These results show that the model develops a strong intuition about which regions are easy or difficult to generate from very early t.

Figure 9: Effect of context on predicted uncertainty. When confident regions are advanced and provide context, the predicted uncertainty in the remaining high-uncertainty regions decreases, evaluated at the same timestep t.

Figure 10: Patch difficulty prediction aligns with the model's x1 prediction variance.

Figure 11: Scaling sampling steps. Across increasing numbers of function evaluations (NFE), our PFT-B/2 model consistently outperforms the SiT-B/2 ODE and SDE baselines. Our uncertainty-aware samplers further improve over parallel PFT sampling, with dual-loop and look-ahead achieving the best FID across NFEs.

Figure 12: More context reduces uncertainty. Predicted uncertainty aligns with our qualitative findings in …

Figure 13: Qualitative text-to-image results at 512 px resolution (baseline vs. Patch Forcing). Prompts include graffiti on a brick wall spelling "CVPR 2026" in colorful font; a neon sign over a bar that reads "Patch-Forcing 24/7" in blue font; and a birthday cake with icing that spells "Paulina" in pink font.

Figure 14: Text rendering comparison. Our PFT shows superior text rendering compared to an equivalent model trained with vanilla Flow Matching, under identical training and inference settings (plain Euler sampler, fixed NFE, same seed). Additional uncurated samples in Figure S6.
original abstract

Diffusion- and flow-based models usually allocate compute uniformly across space, updating all patches with the same timestep and number of function evaluations. While convenient, this ignores the heterogeneity of natural images: some regions are easy to denoise, whereas others benefit from more refinement or additional context. Motivated by this, we explore patch-level noise scales for image synthesis. We find that naively varying timesteps across image tokens performs poorly, as it exposes the model to overly informative training states that do not occur at inference. We therefore introduce a timestep sampler that explicitly controls the maximum patch-level information available during training, and show that moving from global to patch-level timesteps already improves image generation over standard baselines. By further augmenting the model with a lightweight per-patch difficulty head, we enable adaptive samplers that allocate compute dynamically where it is most needed. Combined with noise levels varying over both space and diffusion time, this yields Patch Forcing (PF), a framework that advances easier regions earlier so they can provide context for harder ones. PF achieves superior results on class-conditional ImageNet, remains orthogonal to representation alignment and guidance methods, and scales to text-to-image synthesis. Our results suggest that patch-level denoising schedules provide a promising foundation for adaptive image generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes Patch Forcing (PF), a framework for adaptive image generation in diffusion models. It identifies that uniform timestep allocation across patches ignores spatial heterogeneity in denoising difficulty. The authors introduce a timestep sampler that controls the maximum patch-level information during training to avoid overly informative states absent at inference, augment the model with a lightweight per-patch difficulty head, and enable adaptive samplers that advance easier regions first to provide context for harder ones. Combined with spatially and temporally varying noise, PF is claimed to yield superior results on class-conditional ImageNet, remain orthogonal to representation alignment and guidance methods, and scale to text-to-image synthesis.

Significance. If the central claims are substantiated with rigorous experiments, the work could meaningfully advance efficient generative modeling by exploiting per-patch difficulty heterogeneity rather than uniform compute allocation. The orthogonality to existing techniques would make it a useful complement, and the emphasis on aligning training and inference distributions addresses a common pitfall in adaptive sampling methods.

major comments (1)
  1. [Method description of the timestep sampler and adaptive inference procedure] The skeptic concern is load-bearing: the central claim that gains arise because easier patches provide context for harder ones requires that the timestep sampler produce a joint distribution over patch timesteps (including spatial correlations and conditional dependencies) that is statistically close to the distribution encountered under difficulty-driven adaptive inference. The manuscript does not appear to verify this alignment beyond bounding per-patch maxima; without such verification (e.g., via distribution distance metrics or ablation on higher-order statistics), observed improvements could stem from the added difficulty head or increased conditioning capacity rather than the claimed mechanism.
minor comments (2)
  1. The abstract asserts performance gains and orthogonality but supplies no quantitative metrics, baselines, or ablation details; the full manuscript should include these in the results section to allow assessment of effect sizes.
  2. Notation for patch-level timesteps and the difficulty head should be introduced with explicit equations early in the method section to improve clarity for readers unfamiliar with the framework.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for acknowledging the potential of Patch Forcing to advance adaptive sampling in diffusion models by exploiting per-patch difficulty heterogeneity. We address the major comment below and outline revisions that will strengthen the empirical support for the claimed mechanism.

point-by-point responses
  1. Referee: [Method description of the timestep sampler and adaptive inference procedure] The skeptic concern is load-bearing: the central claim that gains arise because easier patches provide context for harder ones requires that the timestep sampler produce a joint distribution over patch timesteps (including spatial correlations and conditional dependencies) that is statistically close to the distribution encountered under difficulty-driven adaptive inference. The manuscript does not appear to verify this alignment beyond bounding per-patch maxima; without such verification (e.g., via distribution distance metrics or ablation on higher-order statistics), observed improvements could stem from the added difficulty head or increased conditioning capacity rather than the claimed mechanism.

    Authors: We agree that explicit verification of the joint distribution alignment is necessary to isolate the contribution of the context-providing mechanism. The timestep sampler is constructed to enforce a per-patch upper bound on information content (equivalently, a lower bound on noise level) that mirrors the states reachable under difficulty-driven adaptive inference, where easier patches are denoised first. This bound, combined with the spatially varying noise schedule, is intended to preclude overly informative training states, in which some patches are already near-fully denoised while others remain noisy, that the adaptive inference policy never produces. Nevertheless, we acknowledge that bounding per-patch maxima alone does not automatically guarantee matching higher-order statistics such as spatial correlations or conditional dependencies. In the revised manuscript we will therefore add: (i) an ablation that trains with the difficulty head but disables the adaptive sampler at inference (reverting to uniform timesteps), and (ii) quantitative distribution-alignment diagnostics, including marginal histograms of per-patch timesteps and a simple measure of pairwise spatial correlation between patch timesteps sampled from the training procedure versus trajectories simulated from the adaptive inference policy. These additions will directly test whether the observed gains are attributable to the intended mechanism rather than auxiliary model capacity. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with explicit controls and experimental validation

full rationale

The paper introduces a timestep sampler and per-patch difficulty head as explicit training mechanisms to address observed mismatches between uniform and patch-level denoising. These are not derived from fitted parameters or self-referential equations but are motivated by empirical findings and validated through ablation and benchmark results on ImageNet and text-to-image tasks. No load-bearing step reduces to a self-citation chain, ansatz smuggling, or renaming of known results; the central claims rest on the introduced controls and their measured performance rather than tautological definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; the per-patch difficulty head is a learned module whose training details and assumptions remain unspecified.

pith-pipeline@v0.9.0 · 5536 in / 1208 out tokens · 57414 ms · 2026-05-10T03:02:37.463525+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Inline Critic Steers Image Editing

    cs.CV 2026-05 conditional novelty 7.0

    Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.

Reference graph

Works this paper leans on

63 extracted references · 19 canonical work pages · cited by 1 Pith paper · 10 internal anchors

  1. [1] Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin, and Seungryong Kim. Self-rectifying diffusion sampling with perturbed-attention guidance. In European Conference on Computer Vision, pages 1–17. Springer, 2024.

  2. [2] Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.

  3. [3] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics, 28(3):24, 2009.

  4. [4] Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pages 417–424, 2000.

  5. [5] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.

  6. [6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

  7. [7] Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. COYO-700M: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.

  8. [8] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. MaskGIT: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022.

  9. [9] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-Excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.

  10. [10] Hila Chefer, Patrick Esser, Dominik Lorenz, Dustin Podell, Vikash Raja, Vinh Tong, Antonio Torralba, and Robin Rombach. Self-supervised flow matching for scalable multi-modal synthesis. arXiv preprint arXiv:2603.06507, 2026.

  11. [11] Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024.

  12. [12] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023.

  13. [13] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.

  14. [14] Alexei A. Efros and William T. Freeman. Image quilting for texture synthesis and transfer. In Seminal Graphics Papers: Pushing the Boundaries, volume 2, pages 571–576, 2023.

  15. [15] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023.

  16. [16] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.

  17. [17] Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis, 2023.

  18. [18] Dhruba Ghosh, Hanna Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment, 2023.

  19. [19] Ming Gui, Johannes Schusterbauer, Timy Phan, Felix Krause, Josh Susskind, Miguel Angel Bautista, and Björn Ommer. Adapting self-supervised representations as a latent space for efficient generation. arXiv preprint arXiv:2510.14630, 2025.

  20. [20] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.

  21. [21] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.

  22. [22] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

  23. [23] Susung Hong, Gyuseong Lee, Wooseok Jang, and Seungryong Kim. Improving sample quality of diffusion models using self-attention guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7462–7471, 2023.

  24. [24] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023.

  25. [25] Yi Huang, Jiancheng Huang, Yifan Liu, Mingfu Yan, Jiaxi Lv, Jianzhuang Liu, Wei Xiong, He Zhang, Liangliang Cao, and Shifeng Chen. Diffusion model-based image editing: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  26. [26] Tariq Berrada Ifriqi, Adriana Romero-Soriano, Michal Drozdzal, Jakob Verbeek, and Karteek Alahari. Entropy rectifying guidance for diffusion and flow models. arXiv preprint arXiv:2504.13987, 2025.

  27. [27] Jireh Jam, Connah Kendrick, Kevin Walker, Vincent Drouard, Jison Gee-Sern Hsu, and Moi Hoon Yap. A comprehensive review of past and present image inpainting methods. Computer Vision and Image Understanding, 203:103147, 2021.

  28. [28] Myunsoo Kim, Donghyeon Ki, Seong-Woong Shim, and Byung-Jun Lee. Adaptive non-uniform timestep sampling for diffusion model training. arXiv preprint arXiv:2411.09998, 2024.

  29. [29] Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, and Tomer Michaeli. FlowEdit: Inversion-free text-based editing using pre-trained flow models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19721–19730, 2025.

  30. [30] Black Forest Labs. FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2, 2025.

  31. [31] José Lezama, Huiwen Chang, Lu Jiang, and Irfan Essa. Improved masked image generation with Token-Critic. In European Conference on Computer Vision, pages 70–86. Springer, 2022.

  32. [32] Mingxin Li, Yanzhao Zhang, Dingkun Long, Chen Keqin, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A unified framework for state-of-the-art multimodal retrieval and ranking. arXiv preprint arXiv:2601.04720, 2026.

  33. [33] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

  34. [34] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models, 2023.

  35. [35] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

  36. [36] Yong Liu, Hang Dong, Jinshan Pan, Qingji Dong, Kai Chen, Rongxiang Zhang, Lean Fu, and Fei Wang. PatchScaler: An efficient patch-independent diffusion model for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11283–11293, 2025.

  37. [37] Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, and Yuqing Yang. Region-adaptive sampling for diffusion transformers. arXiv preprint arXiv:2502.10389, 2025.

  38. [38] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  39. [39] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471, 2022.

  40. [40] Bingqi Ma, Zhuofan Zong, Guanglu Song, Hongsheng Li, and Yu Liu. Exploring the role of large language models in prompt encoding for diffusion models. arXiv preprint arXiv:2406.11831, 2024.

  41. [41] Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. arXiv preprint arXiv:2401.08740, 2024.

  42. [42] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022.

  43. [43] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.

  44. [44] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.

  45. [45] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024.

  46. [46] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.

  47. [47] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685. IEEE, 2022.

  48. [48] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

  49. [49] Maximilian Seitzer, Arash Tavakoli, Dimitrije Antic, and Georg Martius. On the pitfalls of heteroscedastic uncertainty estimation with probabilistic neural networks. In International Conference on Learning Representations, 2022.

  50. [50] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.

  51. [51] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

  52. [52] Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding, 2021.

  53. [53] Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, et al. JourneyDB: A benchmark for generative image understanding. Advances in Neural Information Processing Systems, 36:49659–49678, 2023.

  54. [54] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation, 2024.

  55. [55] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, and Zhongyuan Wang. Emu3: Next-token prediction is all you need, 2024.

  56. [56] Christopher Wewer, Bart Pogodzinski, Bernt Schiele, and Jan Eric Lenssen. Spatial reasoning with denoising models. arXiv preprint arXiv:2502.21075, 2025.

  57. [57] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In The Thirteenth International Conference on Learning Representations, 2025.

  58. [58] Hui Zhang, Zuxuan Wu, Zhen Xing, Jie Shao, and Yu-Gang Jiang. AdaDiff: Adaptive step selection for fast diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9914–9922, 2025.

  59. [59] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.

  60. [60] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025.

  61. [61] Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023.

  62. [62] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.

  63. [63] Rui Zhu, Yingwei Pan, Yehao Li, Ting Yao, Zhenglong Sun, Tao Mei, and Chang Wen Chen. SD-DiT: Unleashing the power of self-supervised discrimination in diffusion transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8435–8445, 2024.