arxiv: 2510.02307 · v2 · pith:GDW3MDSRnew · submitted 2025-10-02 · 💻 cs.CV · cs.AI

NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation

Pith reviewed 2026-05-21 21:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: diffusion models, low-resolution generation, noise recalibration, text-to-image synthesis, train-test mismatch, Stable Diffusion, image quality

The pith

Re-indexing noise conditioning with a learned resolution mapping restores consistency and improves low-resolution diffusion generation quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text-to-image diffusion models lose quality when generating at resolutions below their training range because the scheduled noise levels no longer match the corruption levels perceived at smaller scales. This mismatch mis-calibrates the denoiser's timestep and noise embedding, leading to degraded outputs. NoiseShift addresses the issue by keeping the original noise schedule fixed and instead learning a resolution-specific mapping that re-indexes the conditioning noise fed to the denoiser. The mapping is obtained through lightweight calibration on a small set of image-text pairs and adds no overhead at inference time. Experiments on Stable Diffusion 3, 3.5, and Flux-Dev show consistent quality gains at low resolutions such as 128x128 and 64x64.

Core claim

The paper claims that low-resolution degradation arises from a train-test mismatch in noise conditioning where the same scheduled noise corresponds to different perceptual corruption at reduced resolutions. NoiseShift corrects this by learning a resolution-specific mapping from scheduler noise to conditioning noise through coarse-to-fine calibration on a small image-text set, thereby restoring local forward-reverse consistency without changing the noise sampling schedule or incurring inference overhead. This produces measurable improvements, including FID reductions from 203 to 171 for SD3 and from 310 to 277 for SD3.5 at 128x128 on LAION-COCO, and a smaller gain for Flux-Dev at 64x64.
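
In symbols, the re-indexing can be sketched as follows. The notation is an assumption made for illustration and is not taken from the paper: it supposes a rectified-flow sampler with an Euler update, scheduler noise levels \sigma_i, a velocity network v_\theta, a prompt embedding c, and a hypothetical resolution-specific map g_r.

```latex
\begin{align}
  \text{default:}\quad
    & x_{i+1} = x_i + (\sigma_{i+1} - \sigma_i)\, v_\theta(x_i, \sigma_i, c) \\
  \text{re-indexed:}\quad
    & x_{i+1} = x_i + (\sigma_{i+1} - \sigma_i)\, v_\theta\bigl(x_i, \hat{\sigma}_i, c\bigr),
    \qquad \hat{\sigma}_i = g_r(\sigma_i)
\end{align}
```

The step size (\sigma_{i+1} - \sigma_i) is identical in both lines; only the noise level passed to the network as conditioning changes.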

What carries the argument

NoiseShift, a training-free recalibration method that learns a resolution-specific mapping to re-index the denoiser's noise conditioning and restore forward-reverse consistency at lower resolutions.
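
To make the distinction between the sampling schedule and the conditioning concrete, here is a minimal sketch of a sampling loop in which the two are decoupled. It assumes an Euler-style flow-matching update; the function name, the list-of-floats "latent", and the toy model are all hypothetical stand-ins, not the authors' code or any library's API.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def sample_with_reindexed_conditioning(
    x: Vector,
    scheduler_sigmas: Sequence[float],
    conditioning_sigmas: Sequence[float],
    velocity_model: Callable[[Vector, float], Vector],
) -> Vector:
    """Euler sampling where step sizes follow the scheduler but the model
    is conditioned on a separately supplied (recalibrated) noise level."""
    if len(conditioning_sigmas) != len(scheduler_sigmas) - 1:
        raise ValueError("need one conditioning sigma per sampling step")
    for i in range(len(scheduler_sigmas) - 1):
        sigma, sigma_next = scheduler_sigmas[i], scheduler_sigmas[i + 1]
        # Only the conditioning input is re-indexed ...
        velocity = velocity_model(x, conditioning_sigmas[i])
        # ... the update itself still uses the original schedule.
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, velocity)]
    return x


# Toy usage: a stand-in "model" that records the sigma it was conditioned on.
seen: List[float] = []


def toy_model(x: Vector, sigma: float) -> Vector:
    seen.append(sigma)
    return [1.0 for _ in x]


result = sample_with_reindexed_conditioning(
    [1.0], [1.0, 0.5, 0.0], [0.9, 0.4], toy_model
)
```

In this sketch the loop body is the same size as an unmodified Euler loop, which is consistent with the claim of no added inference cost: the recalibrated values are a precomputed list, not an extra network call.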

If this is right

  • SD3 generation at 128x128 improves FID from 203 to 171 on LAION-COCO.
  • SD3.5 achieves FID reduction from 310 to 277 at the same 128x128 resolution.
  • Flux-Dev receives a modest FID improvement from 120 to 113 at 64x64.
  • The recalibration adds no inference overhead and requires only minimal code changes.
  • The approach works across multiple pretrained models without any retraining.
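
For scale, the reported FID pairs can be turned into relative reductions. The snippet below is plain arithmetic on the numbers quoted above; the helper name is ours.

```python
# FID pairs (before, after) as quoted in the abstract.
reported = {
    "SD3 @ 128x128": (203, 171),
    "SD3.5 @ 128x128": (310, 277),
    "Flux-Dev @ 64x64": (120, 113),
}


def relative_reduction_percent(before: float, after: float) -> float:
    """Percentage drop in FID relative to the baseline (lower FID is better)."""
    return 100.0 * (before - after) / before


reductions = {
    name: round(relative_reduction_percent(before, after), 1)
    for name, (before, after) in reported.items()
}
```

This gives roughly 15.8% for SD3, 10.6% for SD3.5, and 5.8% for Flux-Dev, matching the description of the Flux-Dev gain as modest.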

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same conditioning mismatch may appear in other diffusion tasks such as image editing or super-resolution when resolution changes.
  • If the mapping generalizes across prompts, it could enable efficient pipelines that switch resolutions on the fly during sampling.
  • Similar recalibration might be tested on video or 3D diffusion models to check whether per-task recalibration is always required.

Load-bearing premise

A lightweight coarse-to-fine calibration on a small set of image-text pairs produces a general resolution-specific mapping that transfers reliably to arbitrary new prompts and images without overfitting to the calibration set.

What would settle it

Applying the learned mapping to a fresh collection of prompts and images at the target low resolution and finding no improvement or a worsening in FID scores or visual quality compared to the unadjusted baseline would indicate the mapping fails to reduce the train-test mismatch.
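
A minimal sketch of how such a check could be set up, assuming prompts are plain strings and FID is computed elsewhere. The hashing-based split and both function names are illustrative assumptions; the paper does not describe its split this way.

```python
import hashlib
from typing import Iterable, List, Tuple


def split_calibration_eval(
    prompts: Iterable[str], calibration_fraction: float = 0.1
) -> Tuple[List[str], List[str]]:
    """Deterministically assign each distinct prompt to exactly one split."""
    calibration: List[str] = []
    evaluation: List[str] = []
    for prompt in sorted(set(prompts)):
        digest = hashlib.sha256(prompt.encode("utf-8")).digest()
        # Map the first 8 bytes of the hash to a number between 0 and 1.
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        if bucket < calibration_fraction:
            calibration.append(prompt)
        else:
            evaluation.append(prompt)
    return calibration, evaluation


def mapping_is_refuted(baseline_fid: float, calibrated_fid: float) -> bool:
    """The criterion above: no improvement, or a worse FID, on held-out data."""
    return calibrated_fid >= baseline_fid


prompts = [f"prompt {i}" for i in range(200)]
calibration_set, evaluation_set = split_calibration_eval(prompts)
```

Because the split depends only on each prompt's hash, a prompt can never land in both sets, and rerunning the split reproduces the same partition.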

Figures

Figures reproduced from arXiv: 2510.02307 by Moayed Haji-Ali, Ruozhen He, Vicente Ordonez, Ziyan Yang.

Figure 1
Resolution-dependent perceptual effect of noise. At the same sampling noise level σ_t, lower-resolution images experience more severe visual and structural corruption than high-resolution counterparts. [Adjacent paper text, truncated: "… resolution images lose semantic details more rapidly due to pixel aggregation, while high-resolution images retain details due to spatial redundancy [16] …"]
Figure 2
Training-testing misalignment in diffusion sampling. The forward (noise addition) and reverse (denoising) processes are theoretically symmetric but diverge during test-time sampling. (a) illustrates the conceptual discrepancy. (b) plots the mean squared error between the predicted and actual noisy image across sampling steps. [Adjacent paper text, truncated: "Resolution-Dependent Misalignment. While minor forward–reverse discrepancies are…"]
Figure 3
Qualitative comparison of Flux-Dev. Generated image examples before and after applying NoiseShift are on CelebA (left) and LAION-COCO (right). [Plot residue from a sigma-schedule figure: panels "SD3: Default & Calibrated Sigma vs. Step across Resolutions" and "SD3.5: Default & Calibrated Sigma vs. Step acr…"; x-axis: Step; y-axis: Sigma (0.0 to 1.0); legend: Default, 64×64, 128×128, 256×256, 512×512, 1024×1024.]
Figure 5
Ablation studies. Ablation studies on the number of samples used during calibration and the new sigmas obtained at 128×128 and 256×256. [Adjacent paper text, truncated: "… expectations without modifying its architecture or training. As shown in …"]
Figure 6
Qualitative comparison of SD3.5. Generated image examples before and after applying NoiseShift are on CelebA (top) and LAION-COCO (bottom). [Adjacent paper text, truncated: "… remains unchanged or slightly reduced, likely because it is the resolution of the final stage training, reducing the impact of calibration. These results suggest that NoiseShift complements, but does not replace, the resolution-aware scheduling baked into the model it…"]
Original abstract

Text-to-image diffusion models often degrade when sampled at resolutions outside the final training resolution set. Prior work has largely emphasized higher resolution generation, enabling pretrained diffusion models to extrapolate beyond the resolutions seen during training. In this work, we instead target lower-resolution generation, performing inference at reduced resolution to significantly cut computational cost. We show that network conditioning of the noise level induces a train-test mismatch that directly degrades low-resolution generation: the same scheduled noise level can correspond to a different perceptual corruption level at lower resolutions, mis-calibrating the denoiser timestep and noise embedding. To this end, we propose NoiseShift, a training-free recalibration method that keeps the original noise sampling schedule unchanged and instead re-indexes the noise conditioning of the denoiser to restore local forward-reverse consistency. Using a lightweight coarse-to-fine calibration on a small set of image-text pairs, NoiseShift learns a resolution-specific mapping from scheduler noise to conditioning noise, reducing train-test mismatch and improving lower-resolution generation quality. When NoiseShift is applied to Stable Diffusion 3 (SD3), Stable Diffusion 3.5 (SD3.5), and Flux-Dev, generation quality at low resolutions improves consistently. Particularly, SD3 generation at 128x128 resolution gets an improved FID score from 203 to 171, and SD3.5 gets an improved FID score from 310 to 277 on LAION-COCO. Even Flux-Dev which already implements a complementary time-shifting strategy gets a modest boost from NoiseShift with an improved FID score from 120 to 113 at 64x64 resolution. More importantly, NoiseShift achieves such improvements with minimal implementation changes and no additional inference overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that pretrained text-to-image diffusion models suffer from a train-test mismatch in noise conditioning when sampled at resolutions below their training resolution, causing degraded generation quality. NoiseShift addresses this by learning a resolution-specific mapping from scheduler noise levels to denoiser conditioning noise via a lightweight coarse-to-fine calibration on a small set of image-text pairs. The mapping is applied at inference to re-index the noise conditioning while keeping the original noise schedule unchanged, restoring local forward-reverse consistency with no added overhead. Experiments report FID improvements on SD3 (203 to 171 at 128x128), SD3.5 (310 to 277), and Flux-Dev (120 to 113 at 64x64) on LAION-COCO.

Significance. If the learned mapping generalizes reliably, NoiseShift would offer a practical, training-free technique for efficient low-resolution sampling from high-resolution pretrained models, reducing compute costs while improving quality over naive downsampling. Its complementarity to existing strategies like time-shifting in Flux and minimal implementation changes could make it broadly useful for resource-constrained deployment of diffusion models.

major comments (2)
  1. [Method] The central claim that the resolution-specific mapping captures general forward-process effects and transfers to arbitrary new prompts rests on the calibration procedure. However, the manuscript provides no information on the size or prompt diversity of the image-text calibration set, nor any held-out validation that the mapping was tested on data disjoint from the LAION-COCO evaluation set (see Calibration subsection of the Method). Without this, the reported FID gains could reflect overfitting rather than resolution-aware recalibration.
  2. [Experiments] Table reporting FID scores (e.g., SD3 at 128x128: 203→171) presents point estimates without error bars, standard deviations across multiple runs, or statistical tests. This undermines confidence in the robustness of the improvements, especially given the empirical nature of the central claim (see Experiments section).
minor comments (2)
  1. [Abstract] The abstract refers to 'local forward-reverse consistency' without a brief inline definition or pointer to the relevant equations; adding this would improve immediate clarity for readers.
  2. [Introduction] Notation for the learned mapping (scheduler noise vs. conditioning noise) could be introduced more explicitly in the introduction with a simple equation to distinguish the re-indexing operation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our calibration procedure and the robustness of our empirical results. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Method] The central claim that the resolution-specific mapping captures general forward-process effects and transfers to arbitrary new prompts rests on the calibration procedure. However, the manuscript provides no information on the size or prompt diversity of the image-text calibration set, nor any held-out validation that the mapping was tested on data disjoint from the LAION-COCO evaluation set (see Calibration subsection of the Method). Without this, the reported FID gains could reflect overfitting rather than resolution-aware recalibration.

    Authors: We agree that additional details on the calibration set would strengthen the manuscript and address potential concerns about generalization. The calibration uses a small, fixed set of image-text pairs drawn from public sources to learn a resolution-dependent mapping that corrects for the mismatch between scheduled noise and perceptual corruption at lower resolutions; this mapping is fundamentally resolution-driven rather than prompt-specific. In the revised manuscript, we will expand the Calibration subsection to report the exact number of pairs, describe their prompt diversity (covering a range of semantic categories), and explicitly state that the LAION-COCO evaluation set is held-out and disjoint from the calibration data. These additions will confirm that the observed FID improvements arise from restoring forward-reverse consistency rather than overfitting to the calibration examples. revision: yes

  2. Referee: [Experiments] Table reporting FID scores (e.g., SD3 at 128x128: 203→171) presents point estimates without error bars, standard deviations across multiple runs, or statistical tests. This undermines confidence in the robustness of the improvements, especially given the empirical nature of the central claim (see Experiments section).

    Authors: We acknowledge that reporting variability across runs would increase confidence in the results. The FID values are obtained via the standard protocol on LAION-COCO, but stochastic sampling in the diffusion process can introduce run-to-run variation. In the revised Experiments section, we will augment the FID table with standard deviations computed over multiple independent generations (e.g., five runs per setting) and include a brief discussion of the magnitude of the observed improvements relative to this variability. This will provide a clearer picture of robustness without altering the core experimental design. revision: yes
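
The kind of summary the authors promise could look like the sketch below, which reports mean and sample standard deviation per setting and expresses the improvement in units of run-to-run spread. The function names are ours, and the numbers are placeholders chosen to make the arithmetic easy; they are not measurements from the paper.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def summarize_runs(fid_per_run: Sequence[float]) -> Dict[str, float]:
    """Mean and sample standard deviation of FID over independent runs."""
    return {"mean": mean(fid_per_run), "std": stdev(fid_per_run)}


def gap_in_std_units(baseline: Sequence[float], calibrated: Sequence[float]) -> float:
    """Mean FID improvement divided by the larger of the two run-to-run stds."""
    b, c = summarize_runs(baseline), summarize_runs(calibrated)
    return (b["mean"] - c["mean"]) / max(b["std"], c["std"])


# PLACEHOLDER values for illustration only; these are not measurements.
placeholder_baseline = [10.0, 12.0, 11.0]
placeholder_calibrated = [8.0, 9.0, 7.0]
ratio = gap_in_std_units(placeholder_baseline, placeholder_calibrated)
```

With these placeholder inputs both settings have a standard deviation of 1.0 and the means differ by 3.0, so the ratio is 3.0. A ratio near or below 1 on real runs would suggest the improvement is hard to distinguish from sampling noise. The helper needs at least two runs per setting and non-zero spread in at least one of them.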

Circularity Check

0 steps flagged

No significant circularity; empirical recalibration validated by external FID measurements

full rationale

The paper presents NoiseShift as an empirical, training-free method that performs lightweight coarse-to-fine calibration on a small set of image-text pairs to learn a resolution-specific mapping from scheduler noise to conditioning noise, then re-indexes the denoiser input at inference time. The claimed restoration of forward-reverse consistency and quality gains are not derived mathematically or by construction from the mapping itself; instead, they are supported by measured FID improvements on the LAION-COCO evaluation set (e.g., 203→171 for SD3 at 128²). No equations reduce the output metric to the calibration fit, no self-citation chain justifies a uniqueness claim, and the method does not rename a known result or smuggle an ansatz. The derivation chain is self-contained as a practical recalibration technique whose effectiveness is assessed through separate experimental benchmarks rather than tautological redefinition of inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the learned mapping and the existence of a resolvable train-test mismatch; no new physical entities or unstated mathematical axioms are introduced beyond standard diffusion assumptions.

free parameters (1)
  • resolution-specific noise mapping
    Learned via coarse-to-fine calibration on small set of image-text pairs; directly determines the re-indexing applied at inference.
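
As an illustration of what a coarse-to-fine fit of this parameter might involve, the sketch below grid-searches one conditioning value per scheduler noise level and narrows the search window each round. It is a generic search under stated assumptions, not the paper's procedure: `step_loss` is a hypothetical callable that, in a real setup, would noise calibration latents to a given sigma, run one denoising step with the candidate conditioning value, and return the error against the forward-noised target. Here it is replaced by a toy quadratic.

```python
from typing import Callable, List, Sequence


def coarse_to_fine_search(
    loss: Callable[[float], float],
    center: float,
    radius: float = 0.2,
    points: int = 9,
    rounds: int = 3,
) -> float:
    """Grid-search a scalar in [0, 1], shrinking the window each round."""
    best = center
    for _ in range(rounds):
        left, right = max(0.0, best - radius), min(1.0, best + radius)
        step = (right - left) / (points - 1)
        candidates = [left + k * step for k in range(points)]
        best = min(candidates, key=loss)
        radius = step  # next round: one grid cell either side of the winner
    return best


def calibrate_mapping(
    scheduler_sigmas: Sequence[float],
    step_loss: Callable[[float, float], float],
) -> List[float]:
    """One conditioning sigma per scheduler sigma, each found independently."""
    return [
        coarse_to_fine_search(lambda cand, s=sigma: step_loss(s, cand), center=sigma)
        for sigma in scheduler_sigmas
    ]


# Toy stand-in for the forward-reverse error: pretend the best conditioning
# value is sigma + 0.1 (an arbitrary choice made only for this example).
def toy_step_loss(sigma: float, candidate: float) -> float:
    return (candidate - (sigma + 0.1)) ** 2


schedule = [0.8, 0.5, 0.2]
mapping = calibrate_mapping(schedule, toy_step_loss)
```

The output is a plain list with one entry per sampling step, which is why the ledger counts the mapping as a single learned object: once fitted for a resolution, it is looked up rather than recomputed.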

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 5 internal anchors

  1. [1] Stability AI. Stable diffusion 3. https://stability.ai/news/stable-diffusion-3-announcement,
  2. [2] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. In ICML, 2023.
  3. [3] Ting Chen. On the importance of noise scheduling for diffusion models. arXiv, 2023.
  4. [4] Ting Chen. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972, 2023.
  5. [5] Jiaxiang Cheng, Pan Xie, Xin Xia, Jiashi Li, Jie Wu, Yuxi Ren, Huixia Li, Xuefeng Xiao, Min Zheng, and Lean Fu. Resadapter: Domain consistent resolution adapter for diffusion models. ArXiv, abs/2403.02084, 2024.
  6. [6] Cody Crockett, Tushar Patil, Laura Weidinger, et al. Flux: A modern diffusion transformer. https://github.com/fluxml/flux-diffusion, 2024.
  7. [7] Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, and Zhanyu Ma. Demofusion: Democratising high-resolution image generation with no $$$. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6159–6168, 2024.
  8. [8] Patrick Esser, Sumith Kulal, A. Blattmann, Rahim Entezari, Jonas Muller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. ArXiv, abs/2403.03206, 2024.
  9. [9] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024.
  10. [10] Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, et al. Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation. In European Conference on Computer Vision, pages 39–55. Springer, 2024.
  11. [11] Qiushan Guo, Sifei Liu, Yizhou Yu, and Ping Luo. Rethinking the noise schedule of diffusion-based generative models. 2023.
  12. [12] Moayed Haji-Ali, Guha Balakrishnan, and Vicente Ordonez. Elasticdiffusion: Training-free arbitrary size image generation through global-local content separation, 2024.
  13. [13] Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models. In The Twelfth International Conference on Learning Representations, 2023.
  14. [14] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. ArXiv, abs/2104.08718, 2021.
  15. [15] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Neural Information Processing Systems, 2017.
  16. [16] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. Simple diffusion: End-to-end diffusion for high resolution images. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023.
  17. [17] Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. Fouriscale: A frequency perspective on training-free high-resolution image synthesis. In European Conference on Computer Vision, pages 196–212. Springer, 2024.
  18. [18] Juno Hwang, Yong-Hyun Park, and Junghyo Jo. Resolution chromatography of diffusion models. arXiv preprint arXiv:2401.10247, 2023.
  19. [19] Zhiyu Jin, Xuli Shen, Bin Li, and Xiangyang Xue. Training-free diffusion model adaptation for variable-sized text-to-image synthesis. Advances in Neural Information Processing Systems, 36:70847–70860, 2023.
  20. [20] Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, 2022.
  21. [21] Mingxiao Li, Tingyu Qu, Ruicong Yao, Wei Sun, and Marie-Francine Moens. Alleviating exposure bias in diffusion models through sampling with shifted time steps. arXiv preprint arXiv:2305.15583, 2023.
  22. [22] Yotam Lipman, Emiel Hoogeboom, Ajay Jain, Jacob Menick, Arash Vahdat, Tim Salimans, David J Fleet, and Jonathan Heek. Flow matching for generative modeling. arXiv preprint arXiv:2305.08891, 2023.
  23. [23] Hanyu Liu, Zhen Xu, Wei Shi, Yuntao Bai, Hongyuan Zhao, Stefano Ermon, and Xiao Wang. Flow matching models for learning reliable dynamics. arXiv preprint arXiv:2305.19591,
  24. [24] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
  25. [25] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
  26. [26] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022.
  27. [27] Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2text: Describing images using 1 million captioned photographs. In Neural Information Processing Systems, 2011.
  28. [28] William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, 2023.
  29. [29] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv, 2023.
  30. [30] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  31. [31] Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, and Ziwei Liu. Freescale: Unleashing the resolution of diffusion models via tuning-free scale fusion. arXiv preprint arXiv:2412.09626, 2024.
  32. [32] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  33. [33] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention, 2015.
  34. [34] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text mode...
  35. [35] Christoph Schuhmann, Andreas A. Köpf, Theo Coombes Richard Vencu, and Ross Beaumont. Laioncoco: 600m synthetic captions from laion2b-en, 2023.
  36. [36] Sitian Shen, Zilin Zhu, Linqian Fan, Harry Zhang, and Xinxiao Wu. Diffclip: Leveraging stable diffusion for language grounded 3d classification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3596–3605, 2024.
  37. [37] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. arXiv preprint arXiv:2409.11340, 2024.
  38. [38] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023.
  39. [39] Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, and Kai Chen. Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds. arXiv preprint arXiv:2407.01494, 2024.
  40. [40] Qingping Zheng, Yuanfan Guo, Jiankang Deng, Jianhua Han, Ying Li, Songcen Xu, and Hang Xu. Any-size-diffusion: Toward efficient text-driven synthesis for any-size hd images. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7571–7578, 2024.
  41. [41] Zhen Zou, Hu Yu, Jie Xiao, and Feng Zhao. Exposure bias reduction for enhancing diffusion transformer feature caching. arXiv preprint arXiv:2503.07120, 2025.