SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation

Olga Russakovsky; Ruoyu Wang; Ye Zhu; Yongqi Yang; Yuhan Pei; Yu Wu

arXiv: 2411.19182 · v2 · submitted 2024-11-28 · cs.CV · cs.AI

Yuhan Pei, Ruoyu Wang, Yongqi Yang, Ye Zhu, Olga Russakovsky, Yu Wu

Pith reviewed 2026-05-23 08:34 UTC · model grok-4.3

keywords: diffusion models · image generation · multimodal large language models · contextual coherence · one-way diffusion · text-to-image synthesis · condition fidelity

The pith

Selective One-Way Diffusion uses MLLMs to dynamically control diffusion for coherent and faithful image generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion models suffer from chaotic information spread that disrupts image coherence. The paper introduces Cyclic One-Way Diffusion to enable precise unidirectional transfer. Building on this, Selective One-Way Diffusion employs MLLMs to identify semantic and spatial relationships, then uses attention to adjust diffusion direction and intensity accordingly. This results in images that match input conditions at the pixel level while keeping visual and semantic consistency across the whole image, all without additional training.

Core claim

By reframing diffusion as a controlled process guided by MLLM-derived contextual relationships, SOW achieves pixel-level condition fidelity and maintains visual and semantic coherence in text-vision-to-image generation tasks.

What carries the argument

Selective One-Way Diffusion (SOW), which integrates MLLM clarification of relationships with attention mechanisms to regulate the direction and intensity of diffusion.
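
To make "regulating the direction and intensity of diffusion" through attention more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function `modulated_attention`, the parameters `boost` and `block`, and the boolean token masks `src_mask` and `dst_mask` are all hypothetical names chosen for illustration. The sketch assumes a single self-attention layer over flattened image tokens, strengthens attention from a destination region toward a source region, and optionally stops the source region from reading anything outside itself.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; rows may contain -inf (masked) entries.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def modulated_attention(q, k, v, src_mask, dst_mask, boost=2.0, block=True):
    """Self-attention over N tokens with a hypothetical one-way bias.

    src_mask, dst_mask: boolean arrays of shape (N,), assumed disjoint.
    Queries in dst get `boost` added to their logits for keys in src,
    so more information flows src -> dst (intensity).
    If `block` is True, queries in src cannot attend to keys outside src,
    so nothing flows back into src (direction).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    bias = np.zeros_like(logits)
    bias[np.ix_(dst_mask, src_mask)] += boost
    if block:
        bias[np.ix_(src_mask, ~src_mask)] = -np.inf
    weights = softmax(logits + bias, axis=-1)
    return weights @ v, weights

# Toy usage: 16 tokens, the first 4 form the source, the last 4 the destination.
rng = np.random.default_rng(0)
N, d = 16, 8
q, k, v = (rng.normal(size=(N, d)) for _ in range(3))
src = np.zeros(N, dtype=bool)
src[:4] = True
dst = np.zeros(N, dtype=bool)
dst[12:] = True
out, w = modulated_attention(q, k, v, src, dst)
_, w_plain = modulated_attention(q, k, v, src, dst, boost=0.0)
```

In this toy setup, `w` gives destination tokens more attention mass on source tokens than `w_plain` does, while source tokens place zero weight on everything outside the source. In the method described above, the regions and strengths would come from the MLLM's reading of the scene; here they are fixed by hand.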

If this is right

Precise information transfer with minimized interference between image regions.
Learning-free improvement to existing diffusion models for better detail preservation.
Adaptable generation that responds to contextual relationships dynamically.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could be tested on more complex multi-object scenes to check if MLLM relationship detection scales reliably.
Similar control strategies might apply to other stochastic processes in generative modeling beyond static images.
Integration with different backbone diffusion models could reveal how general the COW and SOW frameworks are across architectures.

Load-bearing premise

Multimodal large language models can accurately clarify semantic and spatial relationships in the image to set the correct diffusion parameters without adding new errors.

What would settle it

Generating an image where the MLLM misidentifies object positions or relationships, leading to either mismatched details from the input or visible inconsistencies like distorted objects or illogical layouts.

Figures

Figures reproduced from arXiv: 2411.19182 by Olga Russakovsky, Ruoyu Wang, Ye Zhu, Yongqi Yang, Yuhan Pei, Yu Wu.

**Figure 1.** Comparisons with existing methods [9], [10], [12], [13] for maintaining the fidelity of text and visual conditions in …

**Figure 2.** A cognitive-inspired approach for image generation. Starting with a partial visual input (left), we leverage a multimodal …

**Figure 3.** The pipeline of our proposed SOW method. Initially, given the visual condition and text condition, we employ a MLLM …

**Figure 4.** Illustration of “diffusion in diffusion”. In experiment (a), we invert the pictures of pure gray and white to xt, merge them together, and then regenerate them to x0 via deterministic denoising. In experiment (b), we enhance the attention scores of the upper right quartile to the lower left quartile, while in experiment (c) we suppress attention scores from the upper right quartile towards other areas. The…

**Figure 5.** Schematic of attentional modulation. We modulate …

**Figure 6.** Ablation study on Seed Initialization. Given the left …

**Figure 8.** More analysis of the cycling process that diffuses “visual …

**Figure 9.** For further analysis of generalized applications, we …

**Figure 10.** The adaptation of the visual condition to align with the text condition while maintaining the semantic and pixel-level …

**Figure 1.** Text-sensitivity during denoising process. Each line …

**Figure 2.** Different sizes come with different semantic formation …

**Figure 3.** Illustration of the semantic formation process and …

**Figure 4.** Comparison of our SOW-generated images with TV2I baselines.

Original abstract

Originating from the diffusion phenomenon in physics, which describes the random movement and collisions of particles, diffusion generative models simulate a random walk in the data space along the denoising trajectory. This allows information to diffuse across regions, yielding harmonious outcomes. However, the chaotic and disordered nature of information diffusion in diffusion models often results in undesired interference between image regions, causing degraded detail preservation and contextual inconsistency. In this work, we address these challenges by reframing disordered diffusion as a powerful tool for text-vision-to-image generation (TV2I) tasks, achieving pixel-level condition fidelity while maintaining visual and semantic coherence throughout the image. We first introduce Cyclic One-Way Diffusion (COW), which provides an efficient unidirectional diffusion framework for precise information transfer while minimizing disruptive interference. Building on COW, we further propose Selective One-Way Diffusion (SOW), which utilizes Multimodal Large Language Models (MLLMs) to clarify the semantic and spatial relationships within the image. Based on these insights, SOW combines attention mechanisms to dynamically regulate the direction and intensity of diffusion according to contextual relationships. Extensive experiments demonstrate the untapped potential of controlled information diffusion, offering a path to more adaptive and versatile generative models in a learning-free manner.
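
The abstract does not spell out how COW makes diffusion unidirectional, so the following is only a toy picture of what one-way transfer can mean, not the paper's algorithm. It assumes a 2D grid, uses neighbour averaging as a stand-in for a denoising step, and re-plants a fixed condition patch before every step. The names `smooth`, `one_way_cycles`, `n_cycles`, and `steps_per_cycle` are hypothetical. Because the patch is reset each time, information spreads from the patch into the rest of the grid but never the other way.

```python
import numpy as np

def smooth(x):
    # Stand-in for one denoising step: average each cell with its
    # four neighbours, replicating values at the border.
    p = np.pad(x, 1, mode="edge")
    return (p[1:-1, 1:-1] + p[:-2, 1:-1] + p[2:, 1:-1]
            + p[1:-1, :-2] + p[1:-1, 2:]) / 5.0

def one_way_cycles(canvas, condition, mask, n_cycles=10, steps_per_cycle=5):
    """Repeatedly re-plant `condition` inside `mask`, then let values spread.

    The masked region is overwritten before every step, so it influences
    its surroundings while staying untouched by them.
    """
    x = canvas.copy()
    for _ in range(n_cycles):
        for _ in range(steps_per_cycle):
            x[mask] = condition[mask]  # re-plant the condition
            x = smooth(x)              # let information spread
    x[mask] = condition[mask]          # final re-plant keeps the patch exact
    return x

# Toy usage: a bright 4x4 patch in the middle of a dark 16x16 canvas.
canvas = np.zeros((16, 16))
condition = np.ones((16, 16))
mask = np.zeros((16, 16), dtype=bool)
mask[6:10, 6:10] = True
result = one_way_cycles(canvas, condition, mask)
```

After the run, the patch still holds the condition values exactly, and cells close to the patch have picked up more of its value than cells far away. A real pipeline would replace `smooth` with a diffusion model's denoising step and would work on noised latents rather than raw values.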

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proposes COW and SOW to steer diffusion with MLLM-derived attention for coherence, but the gains depend on unverified MLLM accuracy with no error correction shown.

The paper's main move is to treat diffusion as something that can be made unidirectional and then selectively guided by MLLM outputs on semantic and spatial relations, with attention adjusting direction and intensity per region. The authors introduce COW as the base unidirectional setup and SOW as the version that adds the MLLM step, all in a learning-free way for text-vision-to-image (TV2I) tasks. This framing of the interference problem in standard diffusion is clear, and the unidirectional idea is a straightforward way to reduce cross-region interference. The learning-free claim is also useful if it holds up, since many control methods add training overhead. The abstract does a reasonable job connecting the physics intuition to the practical goal of pixel-level fidelity plus overall coherence.

The soft spot is exactly the one in the stress-test note. SOW assumes the MLLM can reliably clarify relationships to set the attention parameters, yet the description gives no mechanism to catch or fix cases where the MLLM hallucinates positions or containment. If those inputs are noisy, the claimed regulation cannot deliver the fidelity. The abstract mentions extensive experiments but supplies none of the baselines, metrics, or failure cases, so it is impossible to tell whether the gains come from the new mechanism or from other factors.

This work is aimed at people already working on controllable diffusion models who might want another knob to turn. It shows clear engagement with the diffusion literature and the MLLM integration question, so the thinking is serious even if the evidence is thin so far. I would bring it to a reading group to see the actual implementation and results. I would not cite it yet. It is worth sending to peer review so referees can check whether the MLLM step is robust in practice.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Cyclic One-Way Diffusion (COW) as a unidirectional framework for precise information transfer in diffusion models and Selective One-Way Diffusion (SOW) that leverages MLLMs to extract semantic and spatial relationships, then uses attention mechanisms to dynamically control diffusion direction and intensity. The central claim is that this yields pixel-level condition fidelity and visual/semantic coherence for text-vision-to-image tasks in a learning-free manner, with extensive experiments demonstrating the approach.

Significance. If validated, the work would demonstrate a training-free method for controlled diffusion that exploits MLLM-derived context to reduce interference, offering a practical route to more coherent generative outputs without retraining.

major comments (2)

[Abstract] Abstract: The claim of pixel-level fidelity and coherence rests on MLLM outputs correctly setting per-region diffusion parameters via attention, yet the description provides no mechanism to detect or mitigate MLLM extraction errors (e.g., incorrect containment or spatial relations); this is load-bearing because noisy inputs would directly propagate into the regulated diffusion trajectory.
[SOW description] SOW construction: The paper states that attention 'dynamically regulate[s] the direction and intensity of diffusion according to contextual relationships,' but without reported ablation on MLLM accuracy or error-correction steps, it is unclear whether the coherence gains follow from the COW/SOW framework or require external verification of the MLLM inputs.

minor comments (1)

[Abstract] The acronym 'TV2I' is used without prior definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The two major comments both concern the reliance on MLLM outputs without explicit error handling or ablations; we address each point below and indicate planned revisions.

Referee: [Abstract] Abstract: The claim of pixel-level fidelity and coherence rests on MLLM outputs correctly setting per-region diffusion parameters via attention, yet the description provides no mechanism to detect or mitigate MLLM extraction errors (e.g., incorrect containment or spatial relations); this is load-bearing because noisy inputs would directly propagate into the regulated diffusion trajectory.

Authors: We agree that the manuscript does not describe explicit mechanisms for detecting or correcting MLLM extraction errors. The SOW design treats MLLM outputs as reliable contextual guidance (consistent with other MLLM-augmented generation works) and relies on the attention-based regulation to translate those outputs into diffusion control. Because this assumption is load-bearing, we will add a dedicated paragraph in the revised manuscript discussing potential MLLM inaccuracies, their possible propagation, and why the overall experimental results still support the claimed fidelity and coherence. revision: yes
Referee: [SOW description] SOW construction: The paper states that attention 'dynamically regulate[s] the direction and intensity of diffusion according to contextual relationships,' but without reported ablation on MLLM accuracy or error-correction steps, it is unclear whether the coherence gains follow from the COW/SOW framework or require external verification of the MLLM inputs.

Authors: The manuscript demonstrates coherence improvements via direct comparisons against baselines that lack the selective one-way control; these gains are therefore attributable to the COW/SOW mechanisms rather than external MLLM verification. Nevertheless, the referee is correct that no dedicated ablation isolating MLLM accuracy is reported. We will incorporate additional analysis (including qualitative examples of MLLM outputs and their effect on final images) in the revision to make this distinction clearer. revision: yes

Circularity Check

0 steps flagged

No significant circularity; new procedural method with no derivation reducing to inputs

The paper introduces COW and SOW as novel diffusion frameworks that use MLLMs for relationship clarification and attention for regulating diffusion direction/intensity. No equations, fitted parameters, self-citations, or uniqueness theorems are described that would cause any claimed result to reduce by construction to its own inputs. The contribution is a learning-free procedural proposal rather than a mathematical derivation chain, making it self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on standard diffusion model assumptions plus the new procedural inventions of COW and SOW; no explicit free parameters, additional axioms, or invented physical entities are stated in the abstract.

axioms (1)

Domain assumption: Diffusion generative models simulate a random walk in the data space along the denoising trajectory, allowing information to diffuse across regions.
Stated in the opening of the abstract as the physical origin of the models being modified.

invented entities (2)

Cyclic One-Way Diffusion (COW) · no independent evidence
Purpose: Provides an efficient unidirectional diffusion framework for precise information transfer while minimizing disruptive interference.
New named framework introduced to address chaotic diffusion.
Selective One-Way Diffusion (SOW) · no independent evidence
Purpose: Utilizes MLLMs to clarify semantic and spatial relationships and combines attention to regulate diffusion direction and intensity.
Core new method proposed for contextual coherence.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: SOW combines attention mechanisms to dynamically regulate the direction and intensity of diffusion according to contextual relationships... utilizes MLLMs to clarify the semantic and spatial relationships within the image

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Cyclic One-Way Diffusion (COW)... Selective One-Way Diffusion (SOW)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 14 internal anchors

[1]

One and a half century of diffusion: Fick, einstein before and beyond,

J. Philibert, “One and a half century of diffusion: Fick, einstein before and beyond,” 2006

work page 2006
[2]

Deep unsupervised learning using nonequilibrium thermodynamics,

J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International conference on machine learning . PMLR, 2015, pp. 2256– 2265

work page 2015
[3]

Score-Based Generative Modeling through Stochastic Differential Equations

Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456 , 2020

work page internal anchor Pith review Pith/arXiv arXiv 2011
[4]

Maximum likelihood training of score-based diffusion models,

Y . Song, C. Durkan, I. Murray, and S. Ermon, “Maximum likelihood training of score-based diffusion models,” in Advances in Neural Information Processing Systems , M. Ranzato, A. Beygelzimer, Y . Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 1415–1428. [Online]. Available: https://proceedings.neurips.cc/paper_file...

work page 2021
[5]

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps,

C. Lu, Y . Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps,” Advances in Neural Information Processing Systems , vol. 35, pp. 5775–5787, 2022

work page 2022
[6]

Attend-and- excite: Attention-based semantic guidance for text-to-image diffusion models,

H. Chefer, Y . Alaluf, Y . Vinker, L. Wolf, and D. Cohen-Or, “Attend-and- excite: Attention-based semantic guidance for text-to-image diffusion models,” ACM Transactions on Graphics (TOG) , vol. 42, no. 4, pp. 1–10, 2023

work page 2023
[7]

Divide & bind your attention for improved generative semantic nursing,

Y . Li, M. Keuper, D. Zhang, and A. Khoreva, “Divide & bind your attention for improved generative semantic nursing,” in 34th British Machine Vision Conference 2023, BMVC 2023 , 2023

work page 2023
[8]

T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation,

K. Huang, K. Sun, E. Xie, Z. Li, and X. Liu, “T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation,” Advances in Neural Information Processing Systems , vol. 36, pp. 78 723–78 747, 2023

work page 2023
[9]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

R. Gal, Y . Alaluf, Y . Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personal- izing text-to-image generation using textual inversion,” arXiv preprint arXiv:2208.01618, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[10]

Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,

N. Ruiz, Y . Li, V . Jampani, Y . Pritch, M. Rubinstein, and K. Aberman, “Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,” arXiv preprint arXiv:2208.12242 , 2022

work page arXiv 2022
[11]

Dreamartist: Towards controllable one-shot text-to-image generation via contrastive prompt-tuning,

Z. Dong, P. Wei, and L. Lin, “Dreamartist: Towards controllable one-shot text-to-image generation via contrastive prompt-tuning,” arXiv preprint arXiv:2211.11337, 2022

work page arXiv 2022
[12]

Adding Conditional Control to Text-to-Image Diffusion Models

L. Zhang and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” arXiv preprint arXiv:2302.05543 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

High- resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” in CVPR, 2022, pp. 10 684–10 695

work page 2022
[14]

Diffusion in diffusion: Cyclic one-way diffusion for text-vision-conditioned generation,

R. Wang, Y . Yang, Z. Qian, Y . Zhu, and Y . Wu, “Diffusion in diffusion: Cyclic one-way diffusion for text-vision-conditioned generation,” arXiv preprint arXiv:2306.08247, 2023

work page arXiv 2023
[15]

Denoising Diffusion Implicit Models

J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502 , 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[16]

Diffusion models beat gans on image synthesis,

P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” NeurIPS, vol. 34, pp. 8780–8794, 2021

work page 2021
[17]

Improved denoising diffusion probabilistic models,

A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in ICML, 2021, pp. 8162–8171

work page 2021
[18]

Autoregressive diffusion models,

E. Hoogeboom, A. A. Gritsenko, J. Bastings, B. Poole, R. v. d. Berg, and T. Salimans, “Autoregressive diffusion models,” arXiv preprint arXiv:2110.02037, 2021

work page arXiv 2021
[19]

Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,

J. Z. Wu, Y . Ge, X. Wang, W. Lei, Y . Gu, W. Hsu, Y . Shan, X. Qie, and M. Z. Shou, “Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,” arXiv preprint arXiv:2212.11565 , 2022

work page arXiv 2022
[20]

Discrete contrastive diffusion for cross-modal music and image generation,

Y . Zhu, Y . Wu, K. Olszewski, J. Ren, S. Tulyakov, and Y . Yan, “Discrete contrastive diffusion for cross-modal music and image generation,” in ICLR, 2023

work page 2023
[21]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” NeurIPS, vol. 33, pp. 6840–6851, 2020

work page 2020
[22]

Spontaneous symmetry breaking in generative diffusion models,

G. Raya and L. Ambrogioni, “Spontaneous symmetry breaking in generative diffusion models,” Advances in Neural Information Processing Systems, vol. 36, 2024

work page 2024
[23]

Dynamical regimes of diffusion models,

G. Biroli, T. Bonnaire, V . De Bortoli, and M. Mézard, “Dynamical regimes of diffusion models,” arXiv preprint arXiv:2402.18491 , 2024

work page arXiv 2024
[24]

Generative adversarial text to image synthesis,

S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in ICML, 2016, pp. 1060–1069

work page 2016
[25]

StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks,

H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks,” in ICCV, 2017, pp. 5907–5915

work page 2017
[26]

Semantic image synthesis via adversarial learning,

H. Dong, S. Yu, C. Wu, and Y . Guo, “Semantic image synthesis via adversarial learning,” in ICCV, 2017, pp. 5706–5714

work page 2017
[27]

StackGAN++: Realistic image synthesis with stacked gen- erative adversarial networks,

H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “StackGAN++: Realistic image synthesis with stacked gen- erative adversarial networks,” TPAMI, vol. 41, no. 8, pp. 1947–1962, 2018

work page 1947
[28]

AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks,

T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks,” in CVPR, 2018, pp. 1316–1324

work page 2018
[29]

ST- GAN: Spatial transformer generative adversarial networks for image compositing,

C.-H. Lin, E. Yumer, O. Wang, E. Shechtman, and S. Lucey, “ST- GAN: Spatial transformer generative adversarial networks for image compositing,” in CVPR, 2018, pp. 9455–9464

work page 2018
[30]

GP-GAN: Towards realistic high-resolution image blending,

H. Wu, S. Zheng, J. Zhang, and K. Huang, “GP-GAN: Towards realistic high-resolution image blending,” in ACM MM, 2019, pp. 2487–2495

work page 2019
[31]

A style-based generator architecture for generative adversarial networks,

T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in CVPR, 2019, pp. 4401–4410

work page 2019
[32]

Object- driven text-to-image synthesis via adversarial training,

W. Li, P. Zhang, L. Zhang, Q. Huang, X. He, S. Lyu, and J. Gao, “Object- driven text-to-image synthesis via adversarial training,” in CVPR, 2019, pp. 12 174–12 182

work page 2019
[33]

Analyzing and improving the image quality of stylegan,

T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of stylegan,” in CVPR, 2020, pp. 8110–8119

work page 2020
[34]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[35]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

The creativity of text-to-image generation,

J. Oppenlaender, “The creativity of text-to-image generation,” in Pro- ceedings of the 25th International Academic Mindtrek Conference , 2022, pp. 192–202

work page 2022
[37]

Zero-shot text-to-image generation,

A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. V oss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in ICML, 2021, pp. 8821–8831

work page 2021
[38]

Classifier-Free Diffusion Guidance

J. Ho and T. Salimans, “Classifier-free diffusion guidance,”arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[39]

Photorealistic text-to-image diffusion models with deep language understanding,

C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al. , “Photorealistic text-to-image diffusion models with deep language understanding,” NeurIPS, pp. 36 479–36 494, 2022

work page 2022
[40]

Instructpix2pix: Learning to follow image editing instructions,

T. Brooks, A. Holynski, and A. A. Efros, “Instructpix2pix: Learning to follow image editing instructions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2023, pp. 18 392–18 402

work page 2023
[41]

Switchable novel object captioner,

Y . Wu, L. Jiang, and Y . Yang, “Switchable novel object captioner,”IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) , vol. 45, no. 1, pp. 1162–1173, 2023

work page 2023
[42]

Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation,

Y . Wei, Y . Zhang, Z. Ji, J. Bai, L. Zhang, and W. Zuo, “Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation,” arXiv preprint arXiv:2302.13848 , 2023

work page arXiv 2023
[43]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Y . Guo, C. Yang, A. Rao, Y . Wang, Y . Qiao, D. Lin, and B. Dai, “Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,” arXiv preprint arXiv:2307.04725 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Ilvr: Conditioning method for denoising diffusion probabilistic models,

J. Choi, S. Kim, Y . Jeong, Y . Gwon, and S. Yoon, “Ilvr: Conditioning method for denoising diffusion probabilistic models,” arXiv preprint arXiv:2108.02938, 2021

work page arXiv 2021
[45]

Multi-concept customization of text-to-image diffusion,

N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J.-Y . Zhu, “Multi-concept customization of text-to-image diffusion,” arXiv preprint arXiv:2212.04488, 2022

work page arXiv 2022
[46]

Hyperdreambooth: Hypernet- works for fast personalization of text-to-image models,

N. Ruiz, Y . Li, V . Jampani, W. Wei, T. Hou, Y . Pritch, N. Wad- hwa, M. Rubinstein, and K. Aberman, “Hyperdreambooth: Hypernet- works for fast personalization of text-to-image models,” arXiv preprint arXiv:2307.06949, 2023

work page arXiv 2023
[47]

Disenbooth: Disentangled parameter-efficient tuning for subject-driven text-to-image generation,

H. Chen, Y . Zhang, X. Wang, X. Duan, Y . Zhou, and W. Zhu, “Disenbooth: Disentangled parameter-efficient tuning for subject-driven text-to-image generation,” arXiv preprint arXiv:2305.03374 , 2023

work page arXiv 2023
[48]

Instantbooth: Personalized text-to-image generation without test-time finetuning,

J. Shi, W. Xiong, Z. Lin, and H. J. Jung, “Instantbooth: Personalized text-to-image generation without test-time finetuning,” arXiv preprint arXiv:2304.03411, 2023

work page arXiv 2023
[49]

Generate anything anywhere in any scene,

Y . Li, H. Liu, Y . Wen, and Y . J. Lee, “Generate anything anywhere in any scene,” arXiv preprint arXiv:2306.17154 , 2023

work page arXiv 2023
[50]

Dense text-to-image generation with attention modulation,

Y . Kim, J. Lee, J.-H. Kim, J.-W. Ha, and J.-Y . Zhu, “Dense text-to-image generation with attention modulation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023, pp. 7701–7711

work page 2023
[51]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al. , “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al. , “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al. , “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[54]

LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models

L. Lian, B. Li, A. Yala, and T. Darrell, “Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models,” arXiv preprint arXiv:2305.13655 , 2023

work page arXiv 2023
[55]

Self-correcting llm-controlled diffusion models,

T.-H. Wu, L. Lian, J. E. Gonzalez, B. Li, and T. Darrell, “Self-correcting llm-controlled diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024, pp. 6327–6336. 14

[56]

Gemini: A Family of Highly Capable Multimodal Models

G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: A family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

[57]

PaLM-E: An Embodied Multimodal Language Model

D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., “PaLM-E: An embodied multimodal language model,” arXiv preprint arXiv:2303.03378, 2023

[58]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “MiniGPT-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592, 2023

[59]

Boundary guided mixing trajectory for semantic control with diffusion models

Y. Zhu, Y. Wu, Z. Deng, O. Russakovsky, and Y. Yan, “Boundary guided mixing trajectory for semantic control with diffusion models,” NeurIPS, 2023

[60]

Null-text inversion for editing real images using guided diffusion models

R. Mokady, A. Hertz, K. Aberman, Y. Pritch, and D. Cohen-Or, “Null-text inversion for editing real images using guided diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6038–6047

[61]

Head rotation in denoising diffusion models

A. Asperti, G. Colasuonno, and A. Guerra, “Head rotation in denoising diffusion models,” arXiv preprint arXiv:2308.06057, 2023

[62]

MaskGAN: Towards diverse and interactive facial image manipulation

C.-H. Lee, Z. Liu, L. Wu, and P. Luo, “MaskGAN: Towards diverse and interactive facial image manipulation,” in CVPR, 2020, pp. 5549–5558

[63]

Joint face detection and alignment using multitask cascaded convolutional networks

K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016

[64]

FaceNet: A unified embedding for face recognition and clustering

F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in CVPR, 2015, pp. 815–823.

Yuhan Pei received a bachelor’s degree from the School of Computer Science and Technology, Xidian University (Xi’an, China) in 2024. She is currently pursuing a master’s degree in computer science at the School of Cy...

He served as the Area Chair for CVPR, ICCV, ECCV, and NeurIPS, and also served as the Workshop Chair of CVPR 2023.

In the supplementary material, Sec. A showcases an array of TV2I generation results along with comprehensive analyses. Furthermore, we conduct extensive qualitative and quantitative image comparisons with baseline methods, and detailed r...
