SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation
Pith reviewed 2026-05-23 08:34 UTC · model grok-4.3
The pith
Selective One-Way Diffusion uses MLLMs to dynamically control diffusion for coherent and faithful image generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By reframing diffusion as a controlled process guided by MLLM-derived contextual relationships, SOW achieves pixel-level condition fidelity and maintains visual and semantic coherence in text-vision-to-image generation tasks.
What carries the argument
Selective One-Way Diffusion (SOW), which integrates MLLM clarification of relationships with attention mechanisms to regulate the direction and intensity of diffusion.
If this is right
- Precise information transfer with minimized interference between image regions.
- Learning-free improvement to existing diffusion models for better detail preservation.
- Adaptable generation that responds to contextual relationships dynamically.
Where Pith is reading between the lines
- This method could be tested on more complex multi-object scenes to check if MLLM relationship detection scales reliably.
- Similar control strategies might apply to other stochastic processes in generative modeling beyond static images.
- Integration with different backbone diffusion models could reveal how general the COW and SOW frameworks are across architectures.
Load-bearing premise
Multimodal large language models can accurately clarify semantic and spatial relationships in the image to set the correct diffusion parameters without adding new errors.
What would settle it
Generating an image where the MLLM misidentifies object positions or relationships, leading to either mismatched details from the input or visible inconsistencies like distorted objects or illogical layouts.
Figures
read the original abstract
Originating from the diffusion phenomenon in physics, which describes the random movement and collisions of particles, diffusion generative models simulate a random walk in the data space along the denoising trajectory. This allows information to diffuse across regions, yielding harmonious outcomes. However, the chaotic and disordered nature of information diffusion in diffusion models often results in undesired interference between image regions, causing degraded detail preservation and contextual inconsistency. In this work, we address these challenges by reframing disordered diffusion as a powerful tool for text-vision-to-image generation (TV2I) tasks, achieving pixel-level condition fidelity while maintaining visual and semantic coherence throughout the image. We first introduce Cyclic One-Way Diffusion (COW), which provides an efficient unidirectional diffusion framework for precise information transfer while minimizing disruptive interference. Building on COW, we further propose Selective One-Way Diffusion (SOW), which utilizes Multimodal Large Language Models (MLLMs) to clarify the semantic and spatial relationships within the image. Based on these insights, SOW combines attention mechanisms to dynamically regulate the direction and intensity of diffusion according to contextual relationships. Extensive experiments demonstrate the untapped potential of controlled information diffusion, offering a path to more adaptive and versatile generative models in a learning-free manner.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Cyclic One-Way Diffusion (COW) as a unidirectional framework for precise information transfer in diffusion models and Selective One-Way Diffusion (SOW) that leverages MLLMs to extract semantic and spatial relationships, then uses attention mechanisms to dynamically control diffusion direction and intensity. The central claim is that this yields pixel-level condition fidelity and visual/semantic coherence for text-vision-to-image tasks in a learning-free manner, with extensive experiments demonstrating the approach.
Significance. If validated, the work would demonstrate a training-free method for controlled diffusion that exploits MLLM-derived context to reduce interference, offering a practical route to more coherent generative outputs without retraining.
major comments (2)
- [Abstract] Abstract: The claim of pixel-level fidelity and coherence rests on MLLM outputs correctly setting per-region diffusion parameters via attention, yet the description provides no mechanism to detect or mitigate MLLM extraction errors (e.g., incorrect containment or spatial relations); this is load-bearing because noisy inputs would directly propagate into the regulated diffusion trajectory.
- [SOW description] SOW construction: The paper states that attention 'dynamically regulate[s] the direction and intensity of diffusion according to contextual relationships,' but without reported ablation on MLLM accuracy or error-correction steps, it is unclear whether the coherence gains follow from the COW/SOW framework or require external verification of the MLLM inputs.
minor comments (1)
- [Abstract] The acronym 'TV2I' is used without prior definition.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. The two major comments both concern the reliance on MLLM outputs without explicit error handling or ablations; we address each point below and indicate planned revisions.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claim of pixel-level fidelity and coherence rests on MLLM outputs correctly setting per-region diffusion parameters via attention, yet the description provides no mechanism to detect or mitigate MLLM extraction errors (e.g., incorrect containment or spatial relations); this is load-bearing because noisy inputs would directly propagate into the regulated diffusion trajectory.
Authors: We agree that the manuscript does not describe explicit mechanisms for detecting or correcting MLLM extraction errors. The SOW design treats MLLM outputs as reliable contextual guidance (consistent with other MLLM-augmented generation works) and relies on the attention-based regulation to translate those outputs into diffusion control. Because this assumption is load-bearing, we will add a dedicated paragraph in the revised manuscript discussing potential MLLM inaccuracies, their possible propagation, and why the overall experimental results still support the claimed fidelity and coherence. revision: yes
-
Referee: [SOW description] SOW construction: The paper states that attention 'dynamically regulate[s] the direction and intensity of diffusion according to contextual relationships,' but without reported ablation on MLLM accuracy or error-correction steps, it is unclear whether the coherence gains follow from the COW/SOW framework or require external verification of the MLLM inputs.
Authors: The manuscript demonstrates coherence improvements via direct comparisons against baselines that lack the selective one-way control; these gains are therefore attributable to the COW/SOW mechanisms rather than external MLLM verification. Nevertheless, the referee is correct that no dedicated ablation isolating MLLM accuracy is reported. We will incorporate additional analysis (including qualitative examples of MLLM outputs and their effect on final images) in the revision to make this distinction clearer. revision: yes
Circularity Check
No significant circularity; new procedural method with no derivation reducing to inputs
full rationale
The paper introduces COW and SOW as novel diffusion frameworks that use MLLMs for relationship clarification and attention for regulating diffusion direction/intensity. No equations, fitted parameters, self-citations, or uniqueness theorems are described that would cause any claimed result to reduce by construction to its own inputs. The contribution is a learning-free procedural proposal rather than a mathematical derivation chain, making it self-contained against external benchmarks with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Diffusion generative models simulate a random walk in the data space along the denoising trajectory allowing information to diffuse across regions.
invented entities (2)
-
Cyclic One-Way Diffusion (COW)
no independent evidence
-
Selective One-Way Diffusion (SOW)
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
SOW combines attention mechanisms to dynamically regulate the direction and intensity of diffusion according to contextual relationships... utilizes MLLMs to clarify the semantic and spatial relationships within the image
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Cyclic One-Way Diffusion (COW)... Selective One-Way Diffusion (SOW)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
One and a half century of diffusion: Fick, einstein before and beyond,
J. Philibert, “One and a half century of diffusion: Fick, einstein before and beyond,” 2006
work page 2006
-
[2]
Deep unsupervised learning using nonequilibrium thermodynamics,
J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International conference on machine learning . PMLR, 2015, pp. 2256– 2265
work page 2015
-
[3]
Score-Based Generative Modeling through Stochastic Differential Equations
Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456 , 2020
work page internal anchor Pith review Pith/arXiv arXiv 2011
-
[4]
Maximum likelihood training of score-based diffusion models,
Y . Song, C. Durkan, I. Murray, and S. Ermon, “Maximum likelihood training of score-based diffusion models,” in Advances in Neural Information Processing Systems , M. Ranzato, A. Beygelzimer, Y . Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 1415–1428. [Online]. Available: https://proceedings.neurips.cc/paper_file...
work page 2021
-
[5]
Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps,
C. Lu, Y . Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps,” Advances in Neural Information Processing Systems , vol. 35, pp. 5775–5787, 2022
work page 2022
-
[6]
Attend-and- excite: Attention-based semantic guidance for text-to-image diffusion models,
H. Chefer, Y . Alaluf, Y . Vinker, L. Wolf, and D. Cohen-Or, “Attend-and- excite: Attention-based semantic guidance for text-to-image diffusion models,” ACM Transactions on Graphics (TOG) , vol. 42, no. 4, pp. 1–10, 2023
work page 2023
-
[7]
Divide & bind your attention for improved generative semantic nursing,
Y . Li, M. Keuper, D. Zhang, and A. Khoreva, “Divide & bind your attention for improved generative semantic nursing,” in 34th British Machine Vision Conference 2023, BMVC 2023 , 2023
work page 2023
-
[8]
T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation,
K. Huang, K. Sun, E. Xie, Z. Li, and X. Liu, “T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation,” Advances in Neural Information Processing Systems , vol. 36, pp. 78 723–78 747, 2023
work page 2023
-
[9]
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
R. Gal, Y . Alaluf, Y . Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personal- izing text-to-image generation using textual inversion,” arXiv preprint arXiv:2208.01618, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[10]
Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,
N. Ruiz, Y . Li, V . Jampani, Y . Pritch, M. Rubinstein, and K. Aberman, “Dreambooth: Fine tuning text-to-image diffusion models for subject- driven generation,” arXiv preprint arXiv:2208.12242 , 2022
-
[11]
Dreamartist: Towards controllable one-shot text-to-image generation via contrastive prompt-tuning,
Z. Dong, P. Wei, and L. Lin, “Dreamartist: Towards controllable one-shot text-to-image generation via contrastive prompt-tuning,” arXiv preprint arXiv:2211.11337, 2022
-
[12]
Adding Conditional Control to Text-to-Image Diffusion Models
L. Zhang and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” arXiv preprint arXiv:2302.05543 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[13]
High- resolution image synthesis with latent diffusion models,
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” in CVPR, 2022, pp. 10 684–10 695
work page 2022
-
[14]
Diffusion in diffusion: Cyclic one-way diffusion for text-vision-conditioned generation,
R. Wang, Y . Yang, Z. Qian, Y . Zhu, and Y . Wu, “Diffusion in diffusion: Cyclic one-way diffusion for text-vision-conditioned generation,” arXiv preprint arXiv:2306.08247, 2023
-
[15]
Denoising Diffusion Implicit Models
J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502 , 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[16]
Diffusion models beat gans on image synthesis,
P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” NeurIPS, vol. 34, pp. 8780–8794, 2021
work page 2021
-
[17]
Improved denoising diffusion probabilistic models,
A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in ICML, 2021, pp. 8162–8171
work page 2021
-
[18]
Autoregressive diffusion models,
E. Hoogeboom, A. A. Gritsenko, J. Bastings, B. Poole, R. v. d. Berg, and T. Salimans, “Autoregressive diffusion models,” arXiv preprint arXiv:2110.02037, 2021
-
[19]
Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,
J. Z. Wu, Y . Ge, X. Wang, W. Lei, Y . Gu, W. Hsu, Y . Shan, X. Qie, and M. Z. Shou, “Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,” arXiv preprint arXiv:2212.11565 , 2022
-
[20]
Discrete contrastive diffusion for cross-modal music and image generation,
Y . Zhu, Y . Wu, K. Olszewski, J. Ren, S. Tulyakov, and Y . Yan, “Discrete contrastive diffusion for cross-modal music and image generation,” in ICLR, 2023
work page 2023
-
[21]
Denoising diffusion probabilistic models,
J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” NeurIPS, vol. 33, pp. 6840–6851, 2020
work page 2020
-
[22]
Spontaneous symmetry breaking in generative diffusion models,
G. Raya and L. Ambrogioni, “Spontaneous symmetry breaking in generative diffusion models,” Advances in Neural Information Processing Systems, vol. 36, 2024
work page 2024
-
[23]
Dynamical regimes of diffusion models,
G. Biroli, T. Bonnaire, V . De Bortoli, and M. Mézard, “Dynamical regimes of diffusion models,” arXiv preprint arXiv:2402.18491 , 2024
-
[24]
Generative adversarial text to image synthesis,
S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in ICML, 2016, pp. 1060–1069
work page 2016
-
[25]
StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks,
H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks,” in ICCV, 2017, pp. 5907–5915
work page 2017
-
[26]
Semantic image synthesis via adversarial learning,
H. Dong, S. Yu, C. Wu, and Y . Guo, “Semantic image synthesis via adversarial learning,” in ICCV, 2017, pp. 5706–5714
work page 2017
-
[27]
StackGAN++: Realistic image synthesis with stacked gen- erative adversarial networks,
H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “StackGAN++: Realistic image synthesis with stacked gen- erative adversarial networks,” TPAMI, vol. 41, no. 8, pp. 1947–1962, 2018
work page 1947
-
[28]
AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks,
T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks,” in CVPR, 2018, pp. 1316–1324
work page 2018
-
[29]
ST- GAN: Spatial transformer generative adversarial networks for image compositing,
C.-H. Lin, E. Yumer, O. Wang, E. Shechtman, and S. Lucey, “ST- GAN: Spatial transformer generative adversarial networks for image compositing,” in CVPR, 2018, pp. 9455–9464
work page 2018
-
[30]
GP-GAN: Towards realistic high-resolution image blending,
H. Wu, S. Zheng, J. Zhang, and K. Huang, “GP-GAN: Towards realistic high-resolution image blending,” in ACM MM, 2019, pp. 2487–2495
work page 2019
-
[31]
A style-based generator architecture for generative adversarial networks,
T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in CVPR, 2019, pp. 4401–4410
work page 2019
-
[32]
Object- driven text-to-image synthesis via adversarial training,
W. Li, P. Zhang, L. Zhang, Q. Huang, X. He, S. Lyu, and J. Gao, “Object- driven text-to-image synthesis via adversarial training,” in CVPR, 2019, pp. 12 174–12 182
work page 2019
-
[33]
Analyzing and improving the image quality of stylegan,
T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of stylegan,” in CVPR, 2020, pp. 8110–8119
work page 2020
-
[34]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[35]
Hierarchical Text-Conditional Image Generation with CLIP Latents
A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[36]
The creativity of text-to-image generation,
J. Oppenlaender, “The creativity of text-to-image generation,” in Pro- ceedings of the 25th International Academic Mindtrek Conference , 2022, pp. 192–202
work page 2022
-
[37]
Zero-shot text-to-image generation,
A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. V oss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in ICML, 2021, pp. 8821–8831
work page 2021
-
[38]
Classifier-Free Diffusion Guidance
J. Ho and T. Salimans, “Classifier-free diffusion guidance,”arXiv preprint arXiv:2207.12598, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[39]
Photorealistic text-to-image diffusion models with deep language understanding,
C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al. , “Photorealistic text-to-image diffusion models with deep language understanding,” NeurIPS, pp. 36 479–36 494, 2022
work page 2022
-
[40]
Instructpix2pix: Learning to follow image editing instructions,
T. Brooks, A. Holynski, and A. A. Efros, “Instructpix2pix: Learning to follow image editing instructions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2023, pp. 18 392–18 402
work page 2023
-
[41]
Switchable novel object captioner,
Y . Wu, L. Jiang, and Y . Yang, “Switchable novel object captioner,”IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) , vol. 45, no. 1, pp. 1162–1173, 2023
work page 2023
-
[42]
Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation,
Y . Wei, Y . Zhang, Z. Ji, J. Bai, L. Zhang, and W. Zuo, “Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation,” arXiv preprint arXiv:2302.13848 , 2023
-
[43]
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Y . Guo, C. Yang, A. Rao, Y . Wang, Y . Qiao, D. Lin, and B. Dai, “Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,” arXiv preprint arXiv:2307.04725 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[44]
Ilvr: Conditioning method for denoising diffusion probabilistic models,
J. Choi, S. Kim, Y . Jeong, Y . Gwon, and S. Yoon, “Ilvr: Conditioning method for denoising diffusion probabilistic models,” arXiv preprint arXiv:2108.02938, 2021
-
[45]
Multi-concept customization of text-to-image diffusion,
N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J.-Y . Zhu, “Multi-concept customization of text-to-image diffusion,” arXiv preprint arXiv:2212.04488, 2022
-
[46]
Hyperdreambooth: Hypernet- works for fast personalization of text-to-image models,
N. Ruiz, Y . Li, V . Jampani, W. Wei, T. Hou, Y . Pritch, N. Wad- hwa, M. Rubinstein, and K. Aberman, “Hyperdreambooth: Hypernet- works for fast personalization of text-to-image models,” arXiv preprint arXiv:2307.06949, 2023
-
[47]
Disenbooth: Disentangled parameter-efficient tuning for subject-driven text-to-image generation,
H. Chen, Y . Zhang, X. Wang, X. Duan, Y . Zhou, and W. Zhu, “Disenbooth: Disentangled parameter-efficient tuning for subject-driven text-to-image generation,” arXiv preprint arXiv:2305.03374 , 2023
-
[48]
Instantbooth: Personalized text-to-image generation without test-time finetuning,
J. Shi, W. Xiong, Z. Lin, and H. J. Jung, “Instantbooth: Personalized text-to-image generation without test-time finetuning,” arXiv preprint arXiv:2304.03411, 2023
-
[49]
Generate anything anywhere in any scene,
Y . Li, H. Liu, Y . Wen, and Y . J. Lee, “Generate anything anywhere in any scene,” arXiv preprint arXiv:2306.17154 , 2023
-
[50]
Dense text-to-image generation with attention modulation,
Y . Kim, J. Lee, J.-H. Kim, J.-W. Ha, and J.-Y . Zhu, “Dense text-to-image generation with attention modulation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023, pp. 7701–7711
work page 2023
-
[51]
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al. , “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al. , “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[53]
Llama 2: Open Foundation and Fine-Tuned Chat Models
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al. , “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[54]
L. Lian, B. Li, A. Yala, and T. Darrell, “Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models,” arXiv preprint arXiv:2305.13655 , 2023
-
[55]
Self-correcting llm-controlled diffusion models,
T.-H. Wu, L. Lian, J. E. Gonzalez, B. Li, and T. Darrell, “Self-correcting llm-controlled diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024, pp. 6327–6336. 14
work page 2024
-
[56]
Gemini: A Family of Highly Capable Multimodal Models
G. Team, R. Anil, S. Borgeaud, Y . Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al. , “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[57]
PaLM-E: An Embodied Multimodal Language Model
D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al. , “Palm-e: An embodied multimodal language model,” arXiv preprint arXiv:2303.03378 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[58]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[59]
Boundary guided mixing trajectory for semantic control with diffusion models,
Y . Zhu, Y . Wu, Z. Deng, O. Russakovsky, and Y . Yan, “Boundary guided mixing trajectory for semantic control with diffusion models,” NeurIPS, 2023
work page 2023
-
[60]
Null- text inversion for editing real images using guided diffusion models,
R. Mokady, A. Hertz, K. Aberman, Y . Pritch, and D. Cohen-Or, “Null- text inversion for editing real images using guided diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6038–6047
work page 2023
-
[61]
Head rotation in denoising diffusion models,
A. Asperti, G. Colasuonno, and A. Guerra, “Head rotation in denoising diffusion models,” arXiv preprint arXiv:2308.06057 , 2023
-
[62]
Maskgan: Towards diverse and interactive facial image manipulation,
C.-H. Lee, Z. Liu, L. Wu, and P. Luo, “Maskgan: Towards diverse and interactive facial image manipulation,” in CVPR, 2020, pp. 5549–5558
work page 2020
-
[63]
Joint face detection and alignment using multitask cascaded convolutional networks,
K. Zhang, Z. Zhang, Z. Li, and Y . Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE signal processing letters, vol. 23, no. 10, pp. 1499–1503, 2016
work page 2016
-
[64]
Facenet: A unified embedding for face recognition and clustering,
F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in CVPR, 2015, pp. 815–823. Yuhan Pei received a bachelor’s degree from the School of Computer Science and Technology, Xidian University (Xi’an, China) in 2024. She is currently pursuing a master’s degree in computer science at the School of Cy...
work page 2015
-
[65]
15 In the supplementary material, Sec
He served as the Area Chair for CVPR, ICCV , ECCV , and NeurIPS, and also served as the Workshop Chair of CVPR 2023. 15 In the supplementary material, Sec. A showcases an array of TV2I generation results along with comprehensive analyses. Furthermore, we conduct extensive qualitative and quantitative image comparisons with baseline methods, and detailed r...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.