arXiv: 2511.07756 · v6 · submitted 2025-11-11 · 💻 cs.CV

Determinism of Randomness: Prompt-Residual Seed Shaping for Diffusion Generation

Pith reviewed 2026-05-18 00:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords: diffusion models · latent space geometry · seed sensitivity · prompt conditioning · training-free generation · semantic anisotropy

The pith

A prompt-residual proxy for semantic-sensitive directions in initial noise improves diffusion generation quality and alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion models generate images from random Gaussian noise, yet changing only the seed can produce widely varying results in how well the output matches the text prompt. The paper attributes this to the semantic projection being many-to-one, which concentrates semantically sensitive variation in a small subspace of the noise space. To address this, it proposes a training-free method that uses a single high-noise prompt residual to shape the seed: only the residual's tangential component is added, and the seed is then retracted to its original radius so that it stays on the Gaussian shell. The paper reports improved alignment and quality metrics across multiple generation benchmarks. The approach thus offers both a practical improvement and a geometric explanation for the seed lottery.

Core claim

The paper argues that the semantic map from initial noise to generated meaning induces a degenerate pullback semi-metric on the latent space: most directions are nearly semantic-invariant, and semantic-sensitive variation is confined to a much smaller horizontal subspace. Motivated by this, it introduces a training-free prompt-residual seed-shaping procedure. The procedure uses a single high-noise cold-start prompt residual as a model-coupled proxy for this subspace, injects only the residual's tangential component, and retracts the seed to its original radius so that it stays compatible with the Gaussian prior.
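One way to write the geometric claim down, as a sketch under assumed notation rather than the paper's own formulation: let $\Phi$ be the sampling flow, $\pi$ the semantic projection, $S = \pi \circ \Phi$ the semantic map from the initial noise $z \in \mathbb{R}^d$, and $h$ a metric on the semantic space. All of these symbols are introduced here for illustration only.

```latex
% Pullback of the semantic metric h along S = \pi \circ \Phi (assumed notation)
g_z(u, v) = h_{S(z)}\big(dS_z\,u,\; dS_z\,v\big), \qquad u, v \in T_z\mathbb{R}^d .

% Degeneracy: directions in the kernel of dS_z have zero semantic length
\mathcal{V}_z = \ker dS_z, \qquad g_z(u, u) = 0 \quad \text{for all } u \in \mathcal{V}_z .

% Horizontal subspace: Euclidean orthogonal complement of the kernel
\mathcal{H}_z = \mathcal{V}_z^{\perp}, \qquad
\dim \mathcal{H}_z = \operatorname{rank} dS_z \le \dim(\text{semantic space}) .
```

Because $\Phi$ is locally invertible, $dS_z$ has the same rank as $d\pi$ at $\Phi(z)$, so any degeneracy of $g$ comes from the projection alone. The exact kernel is an idealization: the page describes most directions as only nearly semantic-invariant, which would correspond to very small rather than exactly zero values of $g_z(u, u)$.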

What carries the argument

The prompt-residual seed-shaping procedure: it uses a high-noise cold-start prompt residual as a proxy for the semantic-sensitive subspace, injects the residual's tangential component into the seed, and retracts the result to the Gaussian shell.
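The three steps named above (probe, tangential injection, retraction) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `shape_seed`, the `strength` step size, and the reading of the prompt residual as the conditional minus the unconditional noise prediction at the first (highest-noise) step are all assumptions, and random arrays stand in for real model outputs.

```python
import numpy as np


def shape_seed(seed, residual, strength=0.1):
    """Inject the tangential part of `residual` into `seed`, then restore the seed's norm.

    `strength` is a hypothetical step size, expressed relative to the seed radius.
    """
    z = np.asarray(seed, dtype=np.float64).ravel()
    r = np.asarray(residual, dtype=np.float64).ravel()
    radius = np.linalg.norm(z)
    unit = z / radius

    # Tangential component: remove the part of the residual along the seed direction.
    tangent = r - np.dot(r, unit) * unit
    t_norm = np.linalg.norm(tangent)
    if t_norm <= 1e-12 * max(np.linalg.norm(r), 1.0):
        # Residual is (numerically) radial or zero: nothing to inject.
        return np.array(seed, dtype=np.float64)

    # Step along the normalized tangent direction.
    moved = z + strength * radius * (tangent / t_norm)

    # Retraction: rescale back to the original radius.
    shaped = moved * (radius / np.linalg.norm(moved))
    return shaped.reshape(np.shape(seed))


rng = np.random.default_rng(0)
seed = rng.standard_normal((4, 8, 8))          # initial Gaussian latent
eps_cond = rng.standard_normal(seed.shape)     # stand-in for the conditional prediction
eps_uncond = rng.standard_normal(seed.shape)   # stand-in for the unconditional prediction
shaped = shape_seed(seed, eps_cond - eps_uncond, strength=0.1)

print(np.linalg.norm(seed), np.linalg.norm(shaped))  # the two norms agree
```

Projecting out the radial component before stepping means the update only rotates the seed on its shell. The final rescaling removes the small radius growth that the step introduces, which is how this sketch keeps the seed's norm at the value the Gaussian draw gave it.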

If this is right

  • Generation quality and prompt alignment improve over standard sampling on multiple benchmarks.
  • The method requires only one additional conditional/unconditional probe before standard sampling.
  • The approach remains compatible with the Gaussian prior of the diffusion model.
  • Semantic anisotropy in the latent space gains support as an explanation for seed sensitivity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar proxy methods could be applied to other stochastic generative processes to reduce variance.
  • Further research might explore recovering more of the horizontal subspace using multiple residuals for even better control.
  • The geometric view could inform the design of better initialization strategies in related models.

Load-bearing premise

A single high-noise cold-start prompt residual provides an adequate model-coupled proxy for the semantic-sensitive horizontal subspace.

What would settle it

Running the seed-shaping procedure on standard generation benchmarks and observing no improvement or a decrease in alignment and quality metrics would falsify the effectiveness of the proxy.
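One way to make this check concrete, sketched under assumptions: score the same prompts with the same metric under standard sampling and under seed shaping, then bootstrap the mean paired difference. The function name `paired_bootstrap_ci` is hypothetical, and the scores in the usage lines are synthetic placeholders, not results from the paper.

```python
import numpy as np


def paired_bootstrap_ci(baseline, shaped, n_boot=10_000, alpha=0.05, seed=0):
    """Mean paired difference (shaped - baseline) with a percentile bootstrap interval."""
    baseline = np.asarray(baseline, dtype=np.float64)
    shaped = np.asarray(shaped, dtype=np.float64)
    if baseline.shape != shaped.shape or baseline.ndim != 1:
        raise ValueError("expected two 1-D arrays of equal length")

    diff = shaped - baseline
    rng = np.random.default_rng(seed)
    # Resample prompts with replacement; each pair stays intact.
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diff.mean()), float(lo), float(hi)


# Synthetic placeholder scores (NOT real data): one score per prompt.
rng = np.random.default_rng(1)
base_scores = rng.normal(0.30, 0.05, size=200)
shaped_scores = base_scores + rng.normal(0.0, 0.01, size=200)  # no systematic change

mean_diff, lo, hi = paired_bootstrap_ci(base_scores, shaped_scores)
print(f"mean difference {mean_diff:+.4f}, 95% interval [{lo:+.4f}, {hi:+.4f}]")
```

Under this reading, an interval that includes zero or lies entirely below it on the standard benchmarks would count against the proxy, while an interval clearly above zero would not by itself rule out generic effects of the extra probe.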

Figures

Figures reproduced from arXiv: 2511.07756 by Chenfeng Wang, GuanYe Xiong, Jian Yang, Min Li, Song Yan, Tao Zhang, Wei Zhai, Xinliang Bi, Yancheng Cai, Yunwei Lan, Yusen Zhang, Zheng-Jun Zha.

Figure 1: By deeply investigating the patterns within the seemingly random noise of diffusion models, we design …
Figure 2: Influence of Noise Semantics on Model Generation. To investigate the impact of semantics in noise on model generation, we design a minimalist diffusion model that exclusively transforms a Gaussian distribution into three elementary distributions. Diverging from conventional training paradigms, we deliberately overfit certain random seeds to distinct functional distributions during training, thereby simulat…
Figure 3: Impact of noise semantics on generation across different models. We use the semantic content of images generated from specific noise with an empty prompt to represent the inherent semantics in the noise. Then, we generate outputs with two types of prompts: one consistent and one inconsistent with the noise. As shown in …
Figure 4: Alignment Comparisons of SDXL, FLUX, WAN and TRELLIS. More qualitative results can be found in Appendix O. T2I. For the T2I task, as shown in …
Figure 5: Impact of Sample Size in Noise Semantic Erasure and Injection on Model. In theory, increasing the number of noise during erasure should continuously improve generation quality by averaging semantic information. However, as shown in …
Figure 6: Impact of Semantic Injection on Model Generation. We evaluate three configurations: the standard pipeline, noise-only semantic erasure, and joint semantic erasure with injection. The results show that semantic injection significantly enhances semantic alignment and overcomes the performance ceiling of erasure. As the number of noise erasure samples increases, the improvement in generation quality reaches …
Figure 7: The impact of semantic injection at different stages and the value of …
Figure 8: The semantic distribution of 20,000 random noise in …
Figure 9: Semantic Coupling in Diffusion Model. … the noise semantic and the target semantic leads to a decrease in the consistency of the generation. Our approach can effectively adjust the semantic of the noise to be closer to the target semantic, thereby significantly enhancing the consistency of the model’s generation. From the above, it is not difficult to discern that what appears to be an isotropic Gaussian dis…
Figure 10: Visualization Method in Fig …
Figure 11: Semantic Injection and Its Impact on Complex Scene Generation and Alignment
Figure 12: Schematic Diagram of Questionnaire Design
Figure 13: The impact of the value of δ(t) on model generation (FLUX)
Figure 14: More Results of SDXL
Figure 15: More Results of FLUX
Figure 16: More Results of WAN 2.1 1.3B
Figure 17: More Results of WAN 2.1 1.3B
Figure 18: More Results of TRELLIS Text Xlarge
Figure 19: More Results of TRELLIS Text Xlarge
original abstract

Diffusion models start generation from an isotropic Gaussian latent, yet changing only the random seed can lead to large differences in prompt faithfulness, composition, and visual quality. We study this seed sensitivity through the semantic map from initial noise to generated meaning. Although the sampling flow is locally invertible, the subsequent semantic projection is many-to-one, inducing a degenerate pullback semi-metric on the latent space: most local directions are nearly semantic-invariant, while semantic-sensitive variation is concentrated in a much smaller horizontal subspace. This provides an explanatory geometric view of the seed lottery. Motivated by this view, we introduce a training-free prompt-residual seed-shaping procedure. Rather than claiming to recover the exact horizontal space, the method uses a single high-noise cold-start prompt residual as a model-coupled proxy, injects only its tangential component, and retracts the seed to the original Gaussian radius shell. This keeps the initialization prior-compatible while adding only one conditional/unconditional probe before standard sampling. Across multiple generation benchmarks, the method improves alignment and quality metrics over standard sampling, supporting both the practical value of the proxy and the explanatory relevance of semantic anisotropy.
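The degeneracy described in the abstract can be illustrated with a toy linear "semantic map". This is only an illustration under strong assumptions: a linear map `A` from a `d`-dimensional latent to a `k`-dimensional semantic space, with `d`, `k`, and `A` chosen arbitrarily here. The map studied in the paper is nonlinear and model-dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 3                       # latent and semantic dimensions (arbitrary toy sizes)
A = rng.standard_normal((k, d))    # toy linear semantic map S(z) = A @ z

# Pullback of the Euclidean metric on the semantic space: G = A^T A (d x d, rank <= k).
G = A.T @ A
eigvals, eigvecs = np.linalg.eigh(G)   # eigenvalues in ascending order

horizontal = eigvecs[:, -k:]       # directions with nonzero eigenvalues (semantic-sensitive)
vertical = eigvecs[:, :-k]         # directions with (numerically) zero eigenvalues

z = rng.standard_normal(d)
step_v = vertical[:, 0]            # unit step along a semantic-invariant direction
step_h = horizontal[:, -1]         # unit step along the most sensitive direction

change_v = np.linalg.norm(A @ (z + step_v) - A @ z)
change_h = np.linalg.norm(A @ (z + step_h) - A @ z)
rank = int(np.linalg.matrix_rank(G))

print(f"rank of pullback metric: {rank} of {d}")
print(f"semantic change, vertical step:   {change_v:.2e}")
print(f"semantic change, horizontal step: {change_h:.2e}")
```

In this toy setting a unit step in any of the `d - k` vertical directions leaves the semantic output essentially unchanged, while a unit step in a horizontal direction moves it noticeably. That is the sense in which a randomly drawn seed perturbation mostly lands in directions that do not matter.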

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that seed sensitivity in diffusion models stems from a degenerate pullback semi-metric on the latent space induced by the many-to-one semantic projection, concentrating variation in a low-dimensional horizontal subspace. It introduces a training-free prompt-residual seed-shaping procedure that approximates this subspace via the tangential component of a single high-noise cold-start prompt residual, injects it into the initial Gaussian seed, and retracts to the original radius. The method is reported to improve alignment and quality metrics over standard sampling on multiple generation benchmarks, supporting both practical utility and the explanatory role of semantic anisotropy.

Significance. If the empirical gains prove robust and attributable to the proposed geometric proxy rather than generic effects, the work would provide a practical training-free enhancement to prompt faithfulness in diffusion generation alongside a geometric lens on latent-space sensitivity. The training-free nature and use of an external proxy are strengths that could influence follow-up work on understanding and controlling randomness in generative models.

major comments (2)
  1. [Abstract / Experimental evaluation] Abstract and experimental evaluation: The abstract asserts metric improvements in alignment and quality but supplies no quantitative values, error bars, statistical tests, or ablation results on proxy choice (e.g., noise level or number of residuals). This absence makes it difficult to assess whether the gains support the explanatory claim of semantic anisotropy or could arise from incidental effects of the added probe and retraction.
  2. [Method] Method description: The procedure relies on the tangential component of one high-noise prompt residual serving as an adequate model-coupled proxy for the semantic-sensitive horizontal subspace. No derivation is provided showing why this single cold-start residual preferentially captures the relevant directions rather than generic perturbations, which is central to linking the practical method to the geometric interpretation of the degenerate semi-metric.
minor comments (2)
  1. [Introduction / Geometric view] The introduction of the 'horizontal subspace' would benefit from an accompanying equation or illustrative diagram to clarify its relation to the pullback semi-metric.
  2. [Method] Notation for the prompt residual and its tangential projection could be made more explicit to facilitate reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comments point by point below, clarifying the geometric motivation and empirical support while indicating where we will strengthen the manuscript.

point-by-point responses
  1. Referee: [Abstract / Experimental evaluation] Abstract and experimental evaluation: The abstract asserts metric improvements in alignment and quality but supplies no quantitative values, error bars, statistical tests, or ablation results on proxy choice (e.g., noise level or number of residuals). This absence makes it difficult to assess whether the gains support the explanatory claim of semantic anisotropy or could arise from incidental effects of the added probe and retraction.

    Authors: We agree that the abstract would be strengthened by explicit quantitative values. In the revised manuscript we will update the abstract to report concrete improvements (e.g., average gains in CLIP-based alignment and perceptual quality metrics across benchmarks) together with references to the experimental sections that contain error bars, statistical significance tests, and ablations on noise level and number of residuals. These additions will make clearer that the observed gains are tied to the proposed proxy rather than generic probe effects. revision: yes

  2. Referee: [Method] Method description: The procedure relies on the tangential component of one high-noise prompt residual serving as an adequate model-coupled proxy for the semantic-sensitive horizontal subspace. No derivation is provided showing why this single cold-start residual preferentially captures the relevant directions rather than generic perturbations, which is central to linking the practical method to the geometric interpretation of the degenerate semi-metric.

    Authors: The method is explicitly framed as an approximation that uses a single model-coupled proxy rather than claiming exact recovery of the horizontal subspace. The high-noise cold-start choice follows from the observation that semantic degeneracy is strongest at large noise scales, so the residual’s tangential component preferentially aligns with the low-dimensional sensitive directions induced by the many-to-one projection. We will expand the method section with additional geometric intuition and a short sketch relating the tangential projection to the degenerate pullback semi-metric. We will also add a brief comparison showing that the chosen proxy outperforms isotropic or low-noise perturbations, thereby tightening the link between the practical procedure and the explanatory geometric view. revision: partial

Circularity Check

0 steps flagged

Derivation self-contained with independent empirical support

full rationale

The paper reasons from the many-to-one character of semantic projection to a degenerate pullback semi-metric whose semantic variation lies in a low-dimensional horizontal subspace. It then proposes a training-free proxy that injects only the tangential component of a single high-noise prompt residual and retracts to the Gaussian shell. Reported metric gains on standard generation benchmarks constitute external evidence that does not reduce to a fitted parameter or to a self-referential definition. No equations or claims in the provided text equate the proxy construction to the target explanatory claim by construction, and no load-bearing self-citations are invoked. The interpretive framework therefore supplies motivation rather than a definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that the semantic projection is many-to-one and that the prompt residual at high noise approximates the tangential semantic directions; no free parameters are explicitly fitted, but the choice of high-noise level and the definition of tangential component are model-dependent.

axioms (2)
  • [domain assumption] The sampling flow is locally invertible
    Invoked to justify the existence of a pullback semi-metric on the latent space.
  • [domain assumption] Semantic projection from latent to meaning is many-to-one
    Core premise inducing the degenerate semi-metric and concentration of semantic variation in a smaller subspace.
invented entities (1)
  • horizontal subspace [no independent evidence]
    purpose: To represent the low-dimensional directions in latent space that carry semantic-sensitive variation
    Postulated as the complement to the nearly semantic-invariant directions; no independent falsifiable handle provided beyond the proxy method itself.

pith-pipeline@v0.9.0 · 5531 in / 1505 out tokens · 32421 ms · 2026-05-18T00:09:17.111134+00:00 · methodology



Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 5 internal anchors

  1. [1]

    Aishwarya Agarwal, Srikrishna Karanam, K. J. Joseph, Apoorv Saxena, Koustava Goswami, and Balaji Vasan Srini- vasan. A-STAR: test-time attention segregation and reten- tion for text-to-image synthesis. InIEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 2283–2293. IEEE, 2023. 3

  2. [2]

    Qwen2.5-vl technical report, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report,...

  3. [3]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. ediff-i: Text-to-image diffusion models with an ensem- ble of expert denoisers.CoRR, abs/2211.01324, 2022. 2

  4. [4]

    Meta 3d gen.CoRR, abs/2407.02599, 2024

    Raphael Bensadoun, Tom Monnier, Yanir Kleiman, Filippos Kokkinos, Yawar Siddiqui, Mahendra Kariya, Omri Harosh, Roman Shapovalov, Benjamin Graham, Emilien Garreau, Animesh Karnewar, Ang Cao, Idan Azuri, Iurii Makarov, Eric-Tuan Le, Antoine Toisoul, David Novotn´y, Oran Gafni, Natalia Neverova, and Andrea Vedaldi. Meta 3d gen.CoRR, abs/2407.02599, 2024. 2

  5. [5]

    Im- proving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, † TimBrooks, Jian- feng Wang, Linjie Li, † LongOuyang, † JuntangZhuang, † JoyceLee, † YufeiGuo, † WesamManassra, † PrafullaDhari- wal, † CaseyChu, † YunxinJiao, and Aditya Ramesh. Im- proving image generation with better captions. 2

  6. [6]

    Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models.ACM Trans

    Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models.ACM Trans. Graph., 42(4):148:1–148:10, 2023. 3

  7. [7]

    Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models, 2023

    Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models, 2023. 2

  8. [8]

    Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis,

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis,

  9. [9]

    Chen, Yaron Vaxman, Elad Ben Baruch, David Asulin, Aviad Moreshet, Kuo-Chin Lien, Misha Sra, and Pradeep Sen

    Sherry X. Chen, Yaron Vaxman, Elad Ben Baruch, David Asulin, Aviad Moreshet, Kuo-Chin Lien, Misha Sra, and Pradeep Sen. Tino-edit: Timestep and noise optimization for robust diffusion-based image editing, 2024. 2

  10. [10]

    Yago Vicente, Thomas Dideriksen, Himanshu Arora, Matthieu Guillaumin, and Jitendra Malik

    Jasmine Collins, Shubham Goel, Kenan Deng, Achlesh- war Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F. Yago Vicente, Thomas Dideriksen, Himanshu Arora, Matthieu Guillaumin, and Jitendra Malik. ABO: dataset and benchmarks for real-world 3d object understanding. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, L...

  11. [11]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. InIEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 13142–13153. IEEE, 2023. 3

  12. [12]

    Scaling rec- tified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rec- tified flow transformers for high-resolution image synthesis. InForty-first International Conference on Machine Learn- ing, ICML 2024,...

  13. [13]

    Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang

    Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Ar- jun R. Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured dif- fusion guidance for compositional text-to-image synthesis. InThe Eleventh International Conference on Learning Rep- resentations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. 3

  14. [14]

    Initno: Boosting text-to-image dif- fusion models via initial noise optimization

    Xiefan Guo, Jinlin Liu, Miaomiao Cui, Jiankai Li, Hongyu Yang, and Di Huang. Initno: Boosting text-to-image dif- fusion models via initial noise optimization. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 9380–9389. IEEE, 2024. 3, 4

  15. [15]

    Initno: Boosting text-to-image diffu- sion models via initial noise optimization, 2024

    Xiefan Guo, Jinlin Liu, Miaomiao Cui, Jiankai Li, Hongyu Yang, and Di Huang. Initno: Boosting text-to-image diffu- sion models via initial noise optimization, 2024. 2

  16. [16]

    Clipscore: A reference-free evaluation met- ric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation met- ric for image captioning. InProceedings of the 2021 Confer- ence on Empirical Methods in Natural Language Process- ing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 7514–7528. Associa- tion for Com...

  17. [17]

    Gans trained by a two time-scale update rule converge to a local nash equilib- rium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. InAdvances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 6626–6637, 2017. 3, 5

  18. [18]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Represen- tations, ICLR 2022, Virtual Event, April 25-29, 2022. Open- Review.net, 2022. 3

  19. [19]

    One more step: A versatile plug-and-play module for rectifying diffusion schedule flaws and enhancing low-frequency controls

    Minghui Hu, Jianbin Zheng, Chuanxia Zheng, Chaoyue Wang, Dacheng Tao, and Tat-Jen Cham. One more step: A versatile plug-and-play module for rectifying diffusion schedule flaws and enhancing low-frequency controls. 2023. 3, 4

  20. [20]

    Predicting scores of various aesthetic attribute sets by learning from overall score labels

    Heng Huang, Xin Jin, Yaqi Liu, Hao Lou, Chaoen Xiao, Shuai Cui, Xining Li, and Dongqing Zou. Predicting scores of various aesthetic attribute sets by learning from overall score labels. InProceedings of the 2nd International Work- shop on Multimedia Content Generation and Evaluation: New Methods and Practice, McGE 2024, Melbourne, VIC, Australia, 28 Octob...

  21. [21]

    VBench: Com- prehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Com- prehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Reco...

  22. [22]

    Open- clip, 2021

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Han- naneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Open- clip, 2021. If you use this software, please cite it as below. 3

  23. [23]

    Chang, and Manolis Savva

    Mukul Khanna, Yongsen Mao, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X. Chang, and Manolis Savva. Habitat synthetic scenes dataset (HSSD-200): an analysis of 3d scene scale and realism tradeoffs for objectgoal nav- igation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2...

  24. [24]

    Kingma and Jimmy Ba

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. 7

  25. [25]

    Pick-a-pic: an open dataset of user preferences for text-to-image generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Ma- tiana, Joe Penna, and Omer Levy. Pick-a-pic: an open dataset of user preferences for text-to-image generation. InPro- ceedings of the 37th International Conference on Neural In- formation Processing Systems, Red Hook, NY , USA, 2023. Curran Associates Inc. 5, 3, 4

  26. [26]

    Kolors: Effective training of diffusion model for photorealistic text-to-image synthesis.arXiv preprint, 2024

    Kolors. Kolors: Effective training of diffusion model for photorealistic text-to-image synthesis.arXiv preprint, 2024. 2

  27. [27]

    Flux.https://github.com/ black-forest-labs/flux, 2024

    Black Forest Labs. Flux.https://github.com/ black-forest-labs/flux, 2024. 4, 3

  28. [28]

    Divide & bind your attention for improved generative seman- tic nursing

    Yumeng Li, Margret Keuper, Dan Zhang, and Anna Khoreva. Divide & bind your attention for improved generative seman- tic nursing. In34th British Machine Vision Conference 2023, BMVC 2023, Aberdeen, UK, November 20-24, 2023, page

  29. [29]

    BMV A Press, 2023. 3

  30. [30]

    Evaluating text-to-visual generation with image-to-text models.preprint arXiv:2404.01291, 2024

    Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text gen- eration.arXiv preprint arXiv:2404.01291, 2024. 5, 4

  31. [31]

    Alignment of diffusion models: Fundamentals, challenges, and future, 2024

    Buhua Liu, Shitong Shao, Bao Li, Lichen Bai, Zhiqiang Xu, Haoyi Xiong, James Kwok, Sumi Helal, and Zeke Xie. Alignment of diffusion models: Fundamentals, challenges, and future, 2024. 2

  32. [32]

    Tenenbaum

    Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models. InComputer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, Octo- ber 23-27, 2022, Proceedings, Part XVII, pages 423–439. Springer, 2022. 3

  33. [33]

    Repaint: Inpainting using denoising diffusion probabilistic models, 2022

    Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models, 2022. 2

  34. [34]

    3d- rpe: Enhancing long-context modeling through 3d rotary po- sition encoding

    Xindian Ma, Wenyuan Liu, Peng Zhang, and Nan Xu. 3d- rpe: Enhancing long-context modeling through 3d rotary po- sition encoding. InAAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA, pages 24804–24811. AAAI Press, 2025. 3

  35. [35]

    Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rab- bat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herv ´e J´egou, Julien Mairal, P...

  36. [36]

    Scalable diffusion models with transformers, 2023

    William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. 2

  37. [37]

    Courville

    Ethan Perez, Florian Strub, Harm de Vries, Vincent Du- moulin, and Aaron C. Courville. Film: Visual reasoning with a general conditioning layer. InProceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial In- telligence (IAAI-18), and the 8th AAAI Symposium on Edu- cational Advan...

  38. [38]

    Richter, Christo- pher J

    Pablo Pernias, Dominic Rampas, Mats L. Richter, Christo- pher J. Pal, and Marc Aubreville. Wuerstchen: An efficient architecture for large-scale text-to-image diffusion models,

  39. [39]

    Richter, Christo- pher Pal, and Marc Aubreville

    Pablo Pernias, Dominic Rampas, Mats L. Richter, Christo- pher Pal, and Marc Aubreville. W ¨urstchen: An efficient architecture for large-scale text-to-image diffusion models. InThe Twelfth International Conference on Learning Rep- resentations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. 3

  40. [40]

    A Gentle Introduction to the Kernel Distance

    Jeff M. Phillips and Suresh Venkatasubramanian. A gentle introduction to the kernel distance.CoRR, abs/1103.1625,

  41. [41]

    SDXL: improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. SDXL: improving latent diffusion models for high-resolution image synthesis. InThe Twelfth Interna- tional Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. 4, 3

  42. [42]

    Not all noises are created equally: diffusion noise selection and optimization, 2024

    Zipeng Qi, Lichen Bai, Haoyi Xiong, and Zeke Xie. Not all noises are created equally: diffusion noise selection and optimization, 2024. 2

  43. [43]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020. 2, 3

  44. [44]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. CoRR, abs/2204.06125, 2022. 2

  45. [45]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10674–10685. IEEE, 2022. 3

  46. [46]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Computer Vision and Pattern Recognition, pages 10684–10695. IEEE, 2022. 2

  47. [47]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015 - 18th International Conference Munich, Germany, October 5 - 9, 2015, Proceedings, Part III, pages 234–241. Springer, 2015. 3

  48. [48]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, pages 3647...

  49. [49]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Sys...

  50. [50]

    A picture is worth a thousand words: Principled recaptioning improves image generation

    Eyal Segalis, Dani Valevski, Danny Lumen, Yossi Matias, and Yaniv Leviathan. A picture is worth a thousand words: Principled recaptioning improves image generation. CoRR, abs/2310.16656, 2023. 2

  51. [51]

    Stefan Stojanov, Anh Thai, and James M. Rehg. Using shape to categorize: Low-shot learning with an explicit shape bias

  52. [52]

    Rethinking the inception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2818–

  53. [53]

    IEEE Computer Society, 2016. 5

  54. [54]

    Diffusion lens: Interpreting text encoders in text-to-image pipelines, 2024

    Michael Toker, Hadas Orgad, Mor Ventura, Dana Arad, and Yonatan Belinkov. Diffusion lens: Interpreting text encoders in text-to-image pipelines, 2024. 2

  55. [55]

    Random Fourier signature features

    Csaba Tóth, Harald Oberhauser, and Zoltán Szabó. Random Fourier signature features. SIAM J. Math. Data Sci., 7(1):329–354, 2025. 7

  56. [56]

    Visualizing data using t-SNE

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008. 8

  57. [57]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Xiaofeng Meng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing Wa...

  58. [58]

    Uncovering the disentanglement capability in text-to-image diffusion models, 2022

    Qiucheng Wu, Yujian Liu, Handong Zhao, Ajinkya Kale, Trung Bui, Tong Yu, Zhe Lin, Yang Zhang, and Shiyu Chang. Uncovering the disentanglement capability in text-to-image diffusion models, 2022. 2

  59. [59]

    Density-aware chamfer distance as a comprehensive metric for point cloud completion

    Tong Wu, Liang Pan, Junzhe Zhang, Tai Wang, Ziwei Liu, and Dahua Lin. Density-aware chamfer distance as a comprehensive metric for point cloud completion. CoRR, abs/2111.12702, 2021. 8

  60. [60]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341,

  61. [61]

    Structured 3d latents for scalable and versatile 3d generation

    Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 21469–21480. Computer Vision Foundation / IEEE, 2025. 5, 3

  62. [62]

    ImageReward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 ...

  63. [63]

    RAPHAEL: text-to-image generation via large mixture of diffusion paths

    Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, and Ping Luo. RAPHAEL: text-to-image generation via large mixture of diffusion paths. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. 2

  64. [64]

    Uncovering the text embedding in text-to-image diffusion models, 2024

    Hu Yu, Hao Luo, Fan Wang, and Feng Zhao. Uncovering the text embedding in text-to-image diffusion models, 2024. 2

  65. [65]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 586–

  66. [66]

    Computer Vision Foundation / IEEE Computer Society,

  67. [67]

    Golden noise for diffusion models: A learning framework

    Zikai Zhou, Shitong Shao, Lichen Bai, Shufei Zhang, Zhiqiang Xu, Bo Han, and Zeke Xie. Golden noise for diffusion models: A learning framework. In International Conference on Computer Vision, 2025. 3, 4

  68. [68]

    Sparse3d: Distilling multiview-consistent diffusion for object reconstruction from sparse views

    Zixin Zou, Weihao Cheng, Yan-Pei Cao, Shi-Sheng Huang, Ying Shan, and Song-Hai Zhang. Sparse3d: Distilling multiview-consistent diffusion for object reconstruction from sparse views. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteent...

  69. [69]

    Content Text Alignment in Diffusion Models

    Related Works: 2.1. Content Text Alignment in Diffusion Models; 2.2. Initial Noise Optimization for Diffusion Models

  70. [70]

    Semantic Information in Random Noise

    Preliminary: 3.1. Semantic Information in Random Noise; 3.2. Denoising and Semantic Injection Equivalence; 3.3. Denoising Phase Priorities ...

  71. [71]

    Semantic Erasure via Noise Normalization

    Method: 4.1. Semantic Erasure via Noise Normalization; 4.2. Semantic Injection via Temporal Weighting; 4.3. Equivalence in Conditional Flow Matching ...

  72. [72]

    Settings

    Experiment and Analysis: 5.1. Settings; 5.2. Qualitative Evaluation; 5.3. Quantitative Evaluation ...

  73. [73]

    Implementation Details

    Conclusion. A. Implementation Details: A.1. Model Details; A.2. Benchmarks; A.2.1. T2I Benchmarks ...

  74. [74]

    video generation quality

    ViT-G/14 and CLIP ViT-L/14, giving dual-text awareness. Training uses a multi-aspect-ratio mixture (64% 1:1, 20% 2:3, 16% 3:2) with resolution-aware noise scheduling. After 1.3M GPU-hours on 13M high-resolution image–text pairs, the base model yields 64×64 → 96×96 latents. An optional 2.3B-parameter refiner UNet, trained on the same data but with hi...

  75. [75]

    Already encapsulate semantic injection capabilities

  76. [76]

    Permit direct utilization as noise semantic injectors

  77. [77]

    velocity

    Enable weighted aggregation across flow time-steps: $v_{\mathrm{agg}} = \sum_k w_k \, v_{t_k}(x \mid y)$ (54), with weights $w_k \propto t_k^{\gamma}$ controlling precision/fidelity tradeoffs.

    I. Equivalence of Sec. 3.3 in Conditional Flow Matching Models

    Flow models map simple distributions (e.g., Gaussian noise) to complex data distributions through reversible transformations. The generation process is...
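
    As a rough illustration of the weighted aggregation in Eq. (54) above, the following minimal Python sketch combines per-time-step velocity vectors with weights proportional to t_k raised to the power gamma. The function name `aggregate_velocities`, the plain-list representation of velocities, and the normalization of the weights to sum to one are all assumptions made for this example; the source only states the proportionality and does not specify how the weights are normalized.

```python
from typing import List, Sequence, Tuple


def aggregate_velocities(
    velocities: Sequence[Sequence[float]],
    timesteps: Sequence[float],
    gamma: float,
) -> Tuple[List[float], List[float]]:
    """Return (v_agg, weights) for v_agg = sum_k w_k * v_{t_k}.

    Weights follow w_k proportional to t_k ** gamma. Normalizing them so
    they sum to 1 is an assumption of this sketch, not taken from the source.
    Time-steps are assumed to be strictly positive.
    """
    if not velocities or len(velocities) != len(timesteps):
        raise ValueError("need one velocity vector per time-step")
    if any(t <= 0 for t in timesteps):
        raise ValueError("time-steps must be strictly positive")

    # Unnormalized weights t_k ** gamma, then normalize (assumed convention).
    raw = [t ** gamma for t in timesteps]
    total = sum(raw)
    weights = [r / total for r in raw]

    dim = len(velocities[0])
    v_agg = [0.0] * dim
    for w, v in zip(weights, velocities):
        if len(v) != dim:
            raise ValueError("all velocity vectors must have the same length")
        for i, component in enumerate(v):
            v_agg[i] += w * component
    return v_agg, weights
```

    Under this convention, gamma = 0 reduces to a uniform average over time-steps, while larger gamma places more weight on velocities evaluated at larger t_k.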