arxiv: 2411.19182 · v2 · submitted 2024-11-28 · 💻 cs.CV · cs.AI

SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation

Pith reviewed 2026-05-23 08:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords diffusion models · image generation · multimodal large language models · contextual coherence · one-way diffusion · text-to-image synthesis · condition fidelity

The pith

Selective One-Way Diffusion uses MLLMs to dynamically control diffusion for coherent and faithful image generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion models suffer from chaotic information spread that disrupts image coherence. The paper introduces Cyclic One-Way Diffusion to enable precise unidirectional transfer. Building on this, Selective One-Way Diffusion employs MLLMs to identify semantic and spatial relationships, then uses attention to adjust diffusion direction and intensity accordingly. This results in images that match input conditions at the pixel level while keeping visual and semantic consistency across the whole image, all without additional training.
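To make "unidirectional transfer" concrete, here is a toy sketch. It is not the paper's algorithm: a 1-D signal is repeatedly smoothed, as a stand-in for information spreading between image regions, and the cells that hold the condition are re-imposed after every step, so they influence their neighbours but are never changed by them. The names `mix` and `one_way_mix` are hypothetical and exist only for this illustration.

```python
def mix(values):
    """One smoothing step: each cell becomes the mean of itself and its neighbours."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out


def one_way_mix(values, condition, steps):
    """Smooth `values` while keeping the cells listed in `condition` fixed.

    `condition` maps a cell index to the value it must keep (an assumed,
    simplified stand-in for a pixel-level visual condition).
    """
    x = list(values)
    for index, value in condition.items():
        x[index] = value
    for _ in range(steps):
        x = mix(x)
        # Re-impose the condition: information flows out of these cells,
        # but nothing flows back into them.
        for index, value in condition.items():
            x[index] = value
    return x


start = [0.0, 0.0, 1.0, 0.0, 0.0]
clamped = one_way_mix(start, {2: 1.0}, steps=3)
print(clamped)
```

With ordinary two-way smoothing the centre value decays as it blends into its surroundings; with the re-imposed condition it stays exact while its neighbours still move toward it.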

Core claim

By reframing diffusion as a controlled process guided by MLLM-derived contextual relationships, SOW achieves pixel-level condition fidelity and maintains visual and semantic coherence in text-vision-to-image generation tasks.

What carries the argument

Selective One-Way Diffusion (SOW), which integrates MLLM clarification of relationships with attention mechanisms to regulate the direction and intensity of diffusion.
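Regulating direction and intensity through attention can be sketched as a bias on attention logits. This is a minimal, dependency-free illustration under stated assumptions, not the authors' implementation: the function `one_way_attention_weights`, the per-token `source` flags, and the additive `strength` bias are all hypothetical. Direction is enforced by stopping source tokens from attending to anything outside the source region; intensity is set by boosting how strongly the other tokens attend to the source.

```python
import math


def softmax(row):
    """Numerically stable softmax; entries of -inf receive zero weight."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def one_way_attention_weights(logits, source, strength=1.0):
    """Return row-normalized attention weights with one-way modulation.

    logits:   n x n list of raw attention scores (query i, key j)
    source:   n booleans, True for tokens in the conditioned region
    strength: assumed additive bias from non-source queries to source keys
    """
    n = len(logits)
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            score = logits[i][j]
            if source[i] and not source[j]:
                score = -math.inf   # direction: source ignores everything else
            elif not source[i] and source[j]:
                score += strength   # intensity: others read more from source
            row.append(score)
        weights.append(softmax(row))
    return weights


uniform = [[0.0] * 4 for _ in range(4)]
flags = [True, True, False, False]
for row in one_way_attention_weights(uniform, flags, strength=math.log(3)):
    print([round(w, 3) for w in row])
```

Every row keeps at least one finite logit (a source token can always attend to itself), so the softmax is well defined. Setting `strength` to zero leaves only the directional mask.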

If this is right

  • Precise information transfer with minimized interference between image regions.
  • Learning-free improvement to existing diffusion models for better detail preservation.
  • Adaptable generation that responds to contextual relationships dynamically.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could be tested on more complex multi-object scenes to check if MLLM relationship detection scales reliably.
  • Similar control strategies might apply to other stochastic processes in generative modeling beyond static images.
  • Integration with different backbone diffusion models could reveal how general the COW and SOW frameworks are across architectures.

Load-bearing premise

Multimodal large language models can accurately clarify semantic and spatial relationships in the image to set the correct diffusion parameters without adding new errors.

What would settle it

Generating an image where the MLLM misidentifies object positions or relationships, leading to either mismatched details from the input or visible inconsistencies like distorted objects or illogical layouts.

Figures

Figures reproduced from arXiv: 2411.19182 by Olga Russakovsky, Ruoyu Wang, Ye Zhu, Yongqi Yang, Yuhan Pei, Yu Wu.

Figure 1. Comparisons with existing methods [9], [10], [12], [13] for maintaining the fidelity of text and visual conditions in …
Figure 2. A cognitive-inspired approach for image generation. Starting with a partial visual input (left), we leverage a multimodal …
Figure 3. The pipeline of our proposed SOW method. Initially, given the visual condition and text condition, we employ an MLLM …
Figure 4. Illustration of “diffusion in diffusion”. In experiment (a), we invert the pictures of pure gray and white to xt, merge them together, and then regenerate them to x0 via deterministic denoising. In experiment (b), we enhance the attention scores of the upper right quartile to the lower left quartile, while in experiment (c) we suppress attention scores from the upper right quartile towards other areas. The…
Figure 5. Schematic of attentional modulation. We modulate …
Figure 6. Ablation study on Seed Initialization. Given the left …
Figure 8. More analysis of the cycling process that diffuses “visual …
Figure 9. For further analysis of generalized applications, we …
Figure 10. The adaptation of the visual condition to align with the text condition while maintaining the semantic and pixel-level …

Figure 1. Text-sensitivity during denoising process. Each line …
Figure 2. Different sizes come with different semantic formation …
Figure 3. Illustration of the semantic formation process and …
Figure 4. Comparison of our SOW-generated images with TV2I baselines.
Original abstract

Originating from the diffusion phenomenon in physics, which describes the random movement and collisions of particles, diffusion generative models simulate a random walk in the data space along the denoising trajectory. This allows information to diffuse across regions, yielding harmonious outcomes. However, the chaotic and disordered nature of information diffusion in diffusion models often results in undesired interference between image regions, causing degraded detail preservation and contextual inconsistency. In this work, we address these challenges by reframing disordered diffusion as a powerful tool for text-vision-to-image generation (TV2I) tasks, achieving pixel-level condition fidelity while maintaining visual and semantic coherence throughout the image. We first introduce Cyclic One-Way Diffusion (COW), which provides an efficient unidirectional diffusion framework for precise information transfer while minimizing disruptive interference. Building on COW, we further propose Selective One-Way Diffusion (SOW), which utilizes Multimodal Large Language Models (MLLMs) to clarify the semantic and spatial relationships within the image. Based on these insights, SOW combines attention mechanisms to dynamically regulate the direction and intensity of diffusion according to contextual relationships. Extensive experiments demonstrate the untapped potential of controlled information diffusion, offering a path to more adaptive and versatile generative models in a learning-free manner.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Cyclic One-Way Diffusion (COW) as a unidirectional framework for precise information transfer in diffusion models and Selective One-Way Diffusion (SOW) that leverages MLLMs to extract semantic and spatial relationships, then uses attention mechanisms to dynamically control diffusion direction and intensity. The central claim is that this yields pixel-level condition fidelity and visual/semantic coherence for text-vision-to-image tasks in a learning-free manner, with extensive experiments demonstrating the approach.

Significance. If validated, the work would demonstrate a training-free method for controlled diffusion that exploits MLLM-derived context to reduce interference, offering a practical route to more coherent generative outputs without retraining.

major comments (2)
  1. [Abstract] The claim of pixel-level fidelity and coherence rests on MLLM outputs correctly setting per-region diffusion parameters via attention, yet the description provides no mechanism to detect or mitigate MLLM extraction errors (e.g., incorrect containment or spatial relations); this is load-bearing because noisy inputs would directly propagate into the regulated diffusion trajectory.
  2. [SOW description] SOW construction: The paper states that attention 'dynamically regulate[s] the direction and intensity of diffusion according to contextual relationships,' but without reported ablation on MLLM accuracy or error-correction steps, it is unclear whether the coherence gains follow from the COW/SOW framework or require external verification of the MLLM inputs.
minor comments (1)
  1. [Abstract] The acronym 'TV2I' is used without prior definition.
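The first major comment asks how MLLM extraction errors could be detected. One simple possibility, sketched here as an illustration and not as anything the paper describes, is a geometric consistency check between the boxes an MLLM reports and the relations it asserts. The `Box` type, the relation names `left_of`, `above`, and `inside`, and the function `find_inconsistencies` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in normalized image coordinates (y grows downward)."""
    x0: float
    y0: float
    x1: float
    y1: float


def box_is_valid(b: Box) -> bool:
    """A box must have positive area and lie inside the unit square."""
    return 0.0 <= b.x0 < b.x1 <= 1.0 and 0.0 <= b.y0 < b.y1 <= 1.0


def relation_holds(relation: str, a: Box, b: Box) -> bool:
    """Check a claimed spatial relation between boxes a and b."""
    if relation == "left_of":
        return a.x1 <= b.x0
    if relation == "above":
        return a.y1 <= b.y0
    if relation == "inside":
        return b.x0 <= a.x0 and a.x1 <= b.x1 and b.y0 <= a.y0 and a.y1 <= b.y1
    raise ValueError(f"unknown relation: {relation}")


def find_inconsistencies(boxes, relations):
    """Return human-readable problems found in an assumed MLLM layout output."""
    problems = []
    for name, box in boxes.items():
        if not box_is_valid(box):
            problems.append(f"invalid box for {name!r}")
    for subject, relation, obj in relations:
        if subject not in boxes or obj not in boxes:
            problems.append(f"unknown object in ({subject}, {relation}, {obj})")
        elif not relation_holds(relation, boxes[subject], boxes[obj]):
            problems.append(f"geometry contradicts ({subject}, {relation}, {obj})")
    return problems


boxes = {"face": Box(0.1, 0.1, 0.4, 0.5), "hat": Box(0.15, 0.0, 0.35, 0.1)}
claims = [("hat", "above", "face"), ("face", "left_of", "hat")]
print(find_inconsistencies(boxes, claims))
```

A check like this only catches outputs that contradict themselves. It cannot tell whether a self-consistent layout matches the actual image, which is the harder part of the referee's concern.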

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The two major comments both concern the reliance on MLLM outputs without explicit error handling or ablations; we address each point below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract] The claim of pixel-level fidelity and coherence rests on MLLM outputs correctly setting per-region diffusion parameters via attention, yet the description provides no mechanism to detect or mitigate MLLM extraction errors (e.g., incorrect containment or spatial relations); this is load-bearing because noisy inputs would directly propagate into the regulated diffusion trajectory.

    Authors: We agree that the manuscript does not describe explicit mechanisms for detecting or correcting MLLM extraction errors. The SOW design treats MLLM outputs as reliable contextual guidance (consistent with other MLLM-augmented generation works) and relies on the attention-based regulation to translate those outputs into diffusion control. Because this assumption is load-bearing, we will add a dedicated paragraph in the revised manuscript discussing potential MLLM inaccuracies, their possible propagation, and why the overall experimental results still support the claimed fidelity and coherence.
    Revision: yes

  2. Referee: [SOW description] SOW construction: The paper states that attention 'dynamically regulate[s] the direction and intensity of diffusion according to contextual relationships,' but without reported ablation on MLLM accuracy or error-correction steps, it is unclear whether the coherence gains follow from the COW/SOW framework or require external verification of the MLLM inputs.

    Authors: The manuscript demonstrates coherence improvements via direct comparisons against baselines that lack the selective one-way control; these gains are therefore attributable to the COW/SOW mechanisms rather than external MLLM verification. Nevertheless, the referee is correct that no dedicated ablation isolating MLLM accuracy is reported. We will incorporate additional analysis (including qualitative examples of MLLM outputs and their effect on final images) in the revision to make this distinction clearer.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; new procedural method with no derivation reducing to inputs

Full rationale

The paper introduces COW and SOW as novel diffusion frameworks that use MLLMs for relationship clarification and attention for regulating diffusion direction/intensity. No equations, fitted parameters, self-citations, or uniqueness theorems are described that would cause any claimed result to reduce by construction to its own inputs. The contribution is a learning-free procedural proposal rather than a mathematical derivation chain, making it self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on standard diffusion model assumptions plus the new procedural inventions of COW and SOW; no explicit free parameters, additional axioms, or invented physical entities are stated in the abstract.

axioms (1)
  • domain assumption: Diffusion generative models simulate a random walk in the data space along the denoising trajectory, allowing information to diffuse across regions.
    Stated in the opening of the abstract as the physical origin of the models being modified.
invented entities (2)
  • Cyclic One-Way Diffusion (COW) (no independent evidence)
    purpose: Provides an efficient unidirectional diffusion framework for precise information transfer while minimizing disruptive interference.
    New named framework introduced to address chaotic diffusion.
  • Selective One-Way Diffusion (SOW) (no independent evidence)
    purpose: Utilizes MLLMs to clarify semantic and spatial relationships and combines attention to regulate diffusion direction and intensity.
    Core new method proposed for contextual coherence.
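
The random-walk assumption in the ledger can be made concrete with the standard forward (noising) process of denoising diffusion probabilistic models [21]. This is textbook background, not something specific to COW or SOW. The noise-schedule symbols β_t and ᾱ_t are the usual notation and do not appear in the text above.

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right),
\qquad
\bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s).
```

Equivalently, x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε with ε drawn from a standard normal distribution. Each step adds independent Gaussian noise, and this is the sense in which the trajectory is a random walk in data space.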

pith-pipeline@v0.9.0 · 5760 in / 1292 out tokens · 28124 ms · 2026-05-23T08:34:39.041997+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 14 internal anchors

  1. [1] J. Philibert, “One and a half century of diffusion: Fick, Einstein before and beyond,” 2006.
  2. [2] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International conference on machine learning. PMLR, 2015, pp. 2256–2265.
  3. [3] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456, 2020.
  4. [4] Y. Song, C. Durkan, I. Murray, and S. Ermon, “Maximum likelihood training of score-based diffusion models,” in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 1415–1428. [Online]. Available: https://proceedings.neurips.cc/paper_file...
  5. [5] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps,” Advances in Neural Information Processing Systems, vol. 35, pp. 5775–5787, 2022.
  6. [6] H. Chefer, Y. Alaluf, Y. Vinker, L. Wolf, and D. Cohen-Or, “Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models,” ACM Transactions on Graphics (TOG), vol. 42, no. 4, pp. 1–10, 2023.
  7. [7] Y. Li, M. Keuper, D. Zhang, and A. Khoreva, “Divide & bind your attention for improved generative semantic nursing,” in 34th British Machine Vision Conference 2023, BMVC 2023, 2023.
  8. [8] K. Huang, K. Sun, E. Xie, Z. Li, and X. Liu, “T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation,” Advances in Neural Information Processing Systems, vol. 36, pp. 78 723–78 747, 2023.
  9. [9] R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,” arXiv preprint arXiv:2208.01618, 2022.
  10. [10] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, “DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation,” arXiv preprint arXiv:2208.12242, 2022.
  11. [11] Z. Dong, P. Wei, and L. Lin, “DreamArtist: Towards controllable one-shot text-to-image generation via contrastive prompt-tuning,” arXiv preprint arXiv:2211.11337, 2022.
  12. [12] L. Zhang and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” arXiv preprint arXiv:2302.05543, 2023.
  13. [13] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR, 2022, pp. 10 684–10 695.
  14. [14] R. Wang, Y. Yang, Z. Qian, Y. Zhu, and Y. Wu, “Diffusion in diffusion: Cyclic one-way diffusion for text-vision-conditioned generation,” arXiv preprint arXiv:2306.08247, 2023.
  15. [15] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
  16. [16] P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image synthesis,” NeurIPS, vol. 34, pp. 8780–8794, 2021.
  17. [17] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in ICML, 2021, pp. 8162–8171.
  18. [18] E. Hoogeboom, A. A. Gritsenko, J. Bastings, B. Poole, R. v. d. Berg, and T. Salimans, “Autoregressive diffusion models,” arXiv preprint arXiv:2110.02037, 2021.
  19. [19] J. Z. Wu, Y. Ge, X. Wang, W. Lei, Y. Gu, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou, “Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation,” arXiv preprint arXiv:2212.11565, 2022.
  20. [20] Y. Zhu, Y. Wu, K. Olszewski, J. Ren, S. Tulyakov, and Y. Yan, “Discrete contrastive diffusion for cross-modal music and image generation,” in ICLR, 2023.
  21. [21] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” NeurIPS, vol. 33, pp. 6840–6851, 2020.
  22. [22] G. Raya and L. Ambrogioni, “Spontaneous symmetry breaking in generative diffusion models,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  23. [23] G. Biroli, T. Bonnaire, V. De Bortoli, and M. Mézard, “Dynamical regimes of diffusion models,” arXiv preprint arXiv:2402.18491, 2024.
  24. [24] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in ICML, 2016, pp. 1060–1069.
  25. [25] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks,” in ICCV, 2017, pp. 5907–5915.
  26. [26] H. Dong, S. Yu, C. Wu, and Y. Guo, “Semantic image synthesis via adversarial learning,” in ICCV, 2017, pp. 5706–5714.
  27. [27] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “StackGAN++: Realistic image synthesis with stacked generative adversarial networks,” TPAMI, vol. 41, no. 8, pp. 1947–1962, 2018.
  28. [28] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks,” in CVPR, 2018, pp. 1316–1324.
  29. [29] C.-H. Lin, E. Yumer, O. Wang, E. Shechtman, and S. Lucey, “ST-GAN: Spatial transformer generative adversarial networks for image compositing,” in CVPR, 2018, pp. 9455–9464.
  30. [30] H. Wu, S. Zheng, J. Zhang, and K. Huang, “GP-GAN: Towards realistic high-resolution image blending,” in ACM MM, 2019, pp. 2487–2495.
  31. [31] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in CVPR, 2019, pp. 4401–4410.
  32. [32] W. Li, P. Zhang, L. Zhang, Q. Huang, X. He, S. Lyu, and J. Gao, “Object-driven text-to-image synthesis via adversarial training,” in CVPR, 2019, pp. 12 174–12 182.
  33. [33] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of StyleGAN,” in CVPR, 2020, pp. 8110–8119.
  34. [34] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021.
  35. [35] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with CLIP latents,” arXiv preprint arXiv:2204.06125, 2022.
  36. [36] J. Oppenlaender, “The creativity of text-to-image generation,” in Proceedings of the 25th International Academic Mindtrek Conference, 2022, pp. 192–202.
  37. [37] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in ICML, 2021, pp. 8821–8831.
  38. [38] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” arXiv preprint arXiv:2207.12598, 2022.
  39. [39] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” NeurIPS, pp. 36 479–36 494, 2022.
  40. [40] T. Brooks, A. Holynski, and A. A. Efros, “InstructPix2Pix: Learning to follow image editing instructions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18 392–18 402.
  41. [41] Y. Wu, L. Jiang, and Y. Yang, “Switchable novel object captioner,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 45, no. 1, pp. 1162–1173, 2023.
  42. [42] Y. Wei, Y. Zhang, Z. Ji, J. Bai, L. Zhang, and W. Zuo, “ELITE: Encoding visual concepts into textual embeddings for customized text-to-image generation,” arXiv preprint arXiv:2302.13848, 2023.
  43. [43] Y. Guo, C. Yang, A. Rao, Y. Wang, Y. Qiao, D. Lin, and B. Dai, “AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning,” arXiv preprint arXiv:2307.04725, 2023.
  44. [44] J. Choi, S. Kim, Y. Jeong, Y. Gwon, and S. Yoon, “ILVR: Conditioning method for denoising diffusion probabilistic models,” arXiv preprint arXiv:2108.02938, 2021.
  45. [45] N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J.-Y. Zhu, “Multi-concept customization of text-to-image diffusion,” arXiv preprint arXiv:2212.04488, 2022.
  46. [46] N. Ruiz, Y. Li, V. Jampani, W. Wei, T. Hou, Y. Pritch, N. Wadhwa, M. Rubinstein, and K. Aberman, “HyperDreamBooth: Hypernetworks for fast personalization of text-to-image models,” arXiv preprint arXiv:2307.06949, 2023.
  47. [47] H. Chen, Y. Zhang, X. Wang, X. Duan, Y. Zhou, and W. Zhu, “DisenBooth: Disentangled parameter-efficient tuning for subject-driven text-to-image generation,” arXiv preprint arXiv:2305.03374, 2023.
  48. [48] J. Shi, W. Xiong, Z. Lin, and H. J. Jung, “InstantBooth: Personalized text-to-image generation without test-time finetuning,” arXiv preprint arXiv:2304.03411, 2023.
  49. [49] Y. Li, H. Liu, Y. Wen, and Y. J. Lee, “Generate anything anywhere in any scene,” arXiv preprint arXiv:2306.17154, 2023.
  50. [50] Y. Kim, J. Lee, J.-H. Kim, J.-W. Ha, and J.-Y. Zhu, “Dense text-to-image generation with attention modulation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7701–7711.
  51. [51] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
  52. [52] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  53. [53] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
  54. [54] L. Lian, B. Li, A. Yala, and T. Darrell, “LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models,” arXiv preprint arXiv:2305.13655, 2023.
  55. [55] T.-H. Wu, L. Lian, J. E. Gonzalez, B. Li, and T. Darrell, “Self-correcting LLM-controlled diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 6327–6336.
  56. [56] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: A family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.
  57. [57] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., “PaLM-E: An embodied multimodal language model,” arXiv preprint arXiv:2303.03378, 2023.
  58. [58] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “MiniGPT-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592, 2023.
  59. [59] Y. Zhu, Y. Wu, Z. Deng, O. Russakovsky, and Y. Yan, “Boundary guided mixing trajectory for semantic control with diffusion models,” NeurIPS, 2023.
  60. [60] R. Mokady, A. Hertz, K. Aberman, Y. Pritch, and D. Cohen-Or, “Null-text inversion for editing real images using guided diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6038–6047.
  61. [61] A. Asperti, G. Colasuonno, and A. Guerra, “Head rotation in denoising diffusion models,” arXiv preprint arXiv:2308.06057, 2023.
  62. [62] C.-H. Lee, Z. Liu, L. Wu, and P. Luo, “MaskGAN: Towards diverse and interactive facial image manipulation,” in CVPR, 2020, pp. 5549–5558.
  63. [63] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE signal processing letters, vol. 23, no. 10, pp. 1499–1503, 2016.
  64. [64] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in CVPR, 2015, pp. 815–823.
    Yuhan Pei received a bachelor’s degree from the School of Computer Science and Technology, Xidian University (Xi’an, China) in 2024. She is currently pursuing a master’s degree in computer science at the School of Cy...
  65. [65] He served as the Area Chair for CVPR, ICCV, ECCV, and NeurIPS, and also served as the Workshop Chair of CVPR 2023. In the supplementary material, Sec. A showcases an array of TV2I generation results along with comprehensive analyses. Furthermore, we conduct extensive qualitative and quantitative image comparisons with baseline methods, and detailed r...