pith. machine review for the scientific record.

arxiv: 2605.04609 · v1 · submitted 2026-05-06 · 💻 cs.CV

Recognition: 2 theorem links

Advancing Aesthetic Image Generation via Composition Transfer

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords composition transfer · aesthetic image generation · text-to-image synthesis · diffusion models · semantic-agnostic control · conditional guidance · large vision-language models · image composition

The pith

Composer extracts composition from reference images and applies it separately from content to guide diffusion models toward higher aesthetic quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Composer as a way to treat composition as an independent aesthetic principle rather than something entangled with specific objects or scenes. It extracts composition-aware features from a reference image and feeds them through a dedicated guidance module into an existing diffusion model, so the new output keeps the layout and balance while changing the semantic content. When no reference is supplied, the system uses large vision-language models to propose a composition from a text theme or fine-tunes the guidance module to plan composition implicitly. The authors support the approach with a new dataset of two million image-text pairs. If successful, the result is text-to-image outputs that users can steer more precisely toward pleasing visual structure.
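To make the moving parts concrete, below is a minimal sketch of how such a pipeline could be wired, assuming a frozen base denoiser and an additive, ControlNet-style coupling. The `CompositionEncoder` and `GuidanceModule` names, their shapes, and the residual coupling are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionEncoder(nn.Module):
    """Illustrative encoder: maps a reference image to a coarse, low-resolution
    'composition map' meant to carry layout and balance, not object identity
    (a hypothetical stand-in for the paper's composition-aware representation)."""
    def __init__(self, channels: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, stride=2, padding=1),
        )

    def forward(self, ref_image: torch.Tensor) -> torch.Tensor:
        return self.net(ref_image)  # (B, C, H/4, W/4)

class GuidanceModule(nn.Module):
    """Illustrative conditional guidance branch: turns the composition map into
    a residual that is added to the frozen denoiser's noise prediction."""
    def __init__(self, channels: int = 8, latent_channels: int = 4):
        super().__init__()
        self.proj = nn.Conv2d(channels, latent_channels, 1)

    def forward(self, comp_map: torch.Tensor, latent_hw) -> torch.Tensor:
        comp = F.interpolate(comp_map, size=latent_hw)
        return self.proj(comp)

def guided_denoise_step(denoiser, latents, t, text_emb, guidance, comp_map,
                        scale: float = 1.0):
    """One denoising step with composition guidance: the frozen base denoiser
    predicts noise from (latents, t, text), and the guidance module adds a
    composition-conditioned residual scaled by `scale`."""
    eps = denoiser(latents, t, text_emb)               # frozen pre-trained model
    eps = eps + scale * guidance(comp_map, latents.shape[-2:])
    return eps
```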

Core claim

Composer models composition in a semantic-agnostic manner by first extracting key composition-aware representations from a reference image and then applying them via a tailored conditional guidance module on top of pre-trained diffusion models. It further supports theme-driven composition retrieval through in-context learning in large vision-language models and performs text-to-composition fine-tuning on the control module to enable implicit planning when no reference is given.
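One way the theme-driven branch could be exercised in practice is an in-context prompt that asks an LVLM for a composition description, then matches it against a pool of composition-tagged references. The prompt wording, the few-shot examples, and the `query_lvlm` callable below are all assumptions for illustration, not the paper's actual interface.

```python
# Hypothetical sketch of theme-driven composition retrieval via in-context
# prompting; query_lvlm stands in for whichever LVLM API is actually used.

FEW_SHOT = [
    ("a lone lighthouse at dusk",
     "rule-of-thirds, subject on right vertical, low horizon"),
    ("a crowded street market",
     "diagonal leading lines, high visual density, balanced left-right"),
]

def build_prompt(theme: str) -> str:
    """Assemble an in-context prompt asking for composition only
    (layout, balance, framing), independent of the objects themselves."""
    lines = ["Given a theme, propose only the composition (layout, balance,"
             " framing), independent of the objects themselves."]
    for ex_theme, ex_comp in FEW_SHOT:
        lines.append(f"Theme: {ex_theme}\nComposition: {ex_comp}")
    lines.append(f"Theme: {theme}\nComposition:")
    return "\n\n".join(lines)

def retrieve_composition(theme: str, reference_pool, query_lvlm):
    """Ask the LVLM for a composition plan, then pick the reference image
    whose precomputed composition tags overlap most with that plan."""
    plan = query_lvlm(build_prompt(theme))
    scored = [(len(set(plan.split()) & set(tags.split())), img)
              for img, tags in reference_pool]
    return max(scored, key=lambda s: s[0])[1] if scored else None
```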

What carries the argument

The Composer framework, which extracts composition-aware representations from a reference image and routes them through a conditional guidance module to steer pre-trained diffusion models without altering semantic content.

If this is right

  • Users gain explicit control to transfer the layout and balance of one image onto generations with entirely different subjects.
  • In reference-free settings the system can still produce planned compositions either by querying large vision-language models or through the fine-tuned implicit module.
  • Aesthetic quality improves in standard text-to-image workflows because the model receives direct guidance on structural principles rather than learning them only implicitly.
  • The approach supports personalized creative workflows by letting users specify composition via example or theme without rewriting the prompt.
  • The curated two-million-pair dataset provides training data that keeps composition signals decoupled from content labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation could be tested in video or 3D generation where consistent framing across frames matters.
  • Interface designers could expose the guidance module as a simple drag-and-drop composition template for non-expert users.
  • Combining the module with other conditioning signals such as depth or pose maps would allow layered control without mutual interference.

Load-bearing premise

That composition can be reliably extracted and applied in a semantic-agnostic manner using a tailored conditional guidance module on top of pre-trained diffusion models, without the separation breaking down in practice.
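A direct way to probe whether that separation holds, assuming access to the extracted composition features, is a linear-probe test: if a simple classifier can recover the reference image's subject category from the composition representation alone, semantics are leaking. The probe below is an editorial sketch, not an experiment reported in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def semantic_leakage_probe(comp_features: np.ndarray,
                           subject_labels: np.ndarray) -> float:
    """Train a linear probe to predict the reference image's subject class
    from its composition features. Accuracy near chance supports the
    semantic-agnostic claim; accuracy well above chance indicates that the
    'composition' representation still encodes content."""
    probe = LogisticRegression(max_iter=1000)
    scores = cross_val_score(probe, comp_features, subject_labels, cv=5)
    return float(scores.mean())

# Usage (placeholders): features from the composition encoder, labels from
# the reference dataset; compare the returned accuracy to 1 / num_classes.
```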

What would settle it

A controlled comparison in which images generated with Composer receive no higher aesthetic ratings or composition-consistency scores than images from the same base diffusion model without the guidance module.
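A hedged sketch of what that controlled comparison could look like: generate paired images from the same prompts and seeds, with and without the guidance module, and score both sides with the same off-the-shelf aesthetic or preference model. The `generate` and `aesthetic_score` callables are placeholders, not interfaces from the paper.

```python
from statistics import mean

def paired_comparison(prompts, seeds, generate, aesthetic_score, comp_ref=None):
    """For each (prompt, seed), generate one image with composition guidance
    and one without, score both with the same aesthetic model, and report the
    mean per-pair difference. A difference near zero would count as evidence
    against the paper's central claim."""
    deltas = []
    for prompt, seed in zip(prompts, seeds):
        with_guidance = generate(prompt, seed=seed, composition=comp_ref)
        without_guidance = generate(prompt, seed=seed, composition=None)
        deltas.append(aesthetic_score(with_guidance)
                      - aesthetic_score(without_guidance))
    return mean(deltas), deltas
```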

read the original abstract

Composition is a cornerstone of visual aesthetics, influencing the appeal of an image. While its principles operate independently of specific content, in practice, composition is often coupled with semantics. As a result, existing methods often enhance composition either through implicit learning or by semantics-based layout control, rather than explicitly modeling composition itself. To address this gap, we introduce Composer, a framework rooted in aesthetic theory, designed to model composition in a semantic-agnostic manner. First, it supports composition transfer by extracting key composition-aware representations from a reference image and leveraging a tailored conditional guidance module to control composition based on pre-trained diffusion models. Second, when users specify only text themes without a composition reference, Composer supports theme-driven composition retrieval by leveraging the in-context learning capabilities of Large Vision-Language Models (LVLMs), achieving explicit composition planning. To enhance composition in a reference-free mode, we conduct text-to-composition fine-tuning on the trained control module to enable implicit composition planning. Furthermore, we curated a high-quality dataset comprising 2 million image-text pairs using state-of-the-art generative models to support model training. Experimental results demonstrate that Composer significantly enhances aesthetic quality in text-to-image tasks and facilitates personalized composition control and transfer, offering users precision and flexibility in the creative process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Composer, a framework for aesthetic image generation that explicitly models composition in a semantic-agnostic manner. It extracts composition-aware representations from reference images and applies them via a tailored conditional guidance module on pre-trained diffusion models for composition transfer. It further supports theme-driven composition retrieval using LVLMs when no reference is provided, performs text-to-composition fine-tuning for reference-free operation, and introduces a curated 2-million image-text pair dataset. Experimental results are claimed to show significant gains in aesthetic quality and personalized composition control for text-to-image tasks.

Significance. If the claims hold and the semantic decoupling is validated, the work could advance controllable generation by providing an explicit, theory-rooted alternative to implicit or semantics-entangled composition control, potentially improving precision and user flexibility in diffusion-based image synthesis.

major comments (2)
  1. [Abstract] Abstract: The abstract asserts that 'experimental results demonstrate that Composer significantly enhances aesthetic quality' and introduces a 'high-quality dataset comprising 2 million image-text pairs,' yet supplies no quantitative metrics, baselines, ablation studies, or error analysis. This omission prevents assessment of whether reported gains are robust or influenced by dataset curation choices.
  2. [Abstract] Abstract (method overview): The central claim rests on extracting and applying 'key composition-aware representations' in a 'semantic-agnostic manner' via the conditional guidance module. No details are given on the extraction process or how it avoids semantic leakage from pre-trained features; if entanglement occurs, the method reduces to semantics-based control rather than pure composition transfer, undermining the aesthetic improvement and personalization claims.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'leveraging the in-context learning capabilities of Large Vision-Language Models (LVLMs)' for theme-driven retrieval would benefit from a brief example or reference to the specific LVLM used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's thorough review and valuable feedback on our manuscript. We have addressed the major comments by revising the abstract to provide more context on the experimental results and methodological details. Our responses to each comment are detailed below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract asserts that 'experimental results demonstrate that Composer significantly enhances aesthetic quality' and introduces a 'high-quality dataset comprising 2 million image-text pairs,' yet supplies no quantitative metrics, baselines, ablation studies, or error analysis. This omission prevents assessment of whether reported gains are robust or influenced by dataset curation choices.

    Authors: We agree that the abstract, constrained by length, does not include specific quantitative metrics, baselines, or ablation details. These are comprehensively covered in the Experiments section of the full manuscript, including quantitative evaluations using aesthetic scoring models, comparisons to existing methods, and ablations on the conditional guidance module. The dataset curation process is explained in the paper to ensure reproducibility and address potential concerns. To improve the abstract's informativeness, we have revised it to briefly mention the experimental validation and key findings without adding unsubstantiated numbers. revision: yes

  2. Referee: [Abstract] Abstract (method overview): The central claim rests on extracting and applying 'key composition-aware representations' in a 'semantic-agnostic manner' via the conditional guidance module. No details are given on the extraction process or how it avoids semantic leakage from pre-trained features; if entanglement occurs, the method reduces to semantics-based control rather than pure composition transfer, undermining the aesthetic improvement and personalization claims.

    Authors: The abstract serves as a high-level summary of the framework. Detailed explanations of the composition-aware representation extraction and the conditional guidance module are provided in Sections 3.1 and 3.2 of the manuscript. Our approach is designed to be semantic-agnostic by focusing on structural and aesthetic composition elements (e.g., spatial arrangement, balance) extracted via a specialized module that is trained separately from semantic content. We validate the lack of semantic leakage through experiments demonstrating composition transfer across different themes. To address the concern, we have added a clarifying clause in the revised abstract regarding the semantic-agnostic nature of the representations. revision: yes
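As a rough editorial illustration of the kind of "spatial arrangement and balance" features the rebuttal describes, one could reduce a reference image to a coarse saliency-style grid plus a few balance statistics. This sketch is not the module defined in the paper's Sections 3.1 and 3.2; it only shows what a representation that carries visual weight but little object identity might look like.

```python
import numpy as np

def coarse_composition_features(image: np.ndarray, grid: int = 8):
    """Reduce an HxWx3 image to (a) a grid of local contrast as a crude
    saliency proxy and (b) center-of-mass and left-right balance statistics.
    Such features describe where visual weight sits, not what the objects are."""
    gray = image.mean(axis=2)
    h, w = gray.shape
    cell_h, cell_w = h // grid, w // grid
    cells = gray[:cell_h * grid, :cell_w * grid].reshape(grid, cell_h, grid, cell_w)
    contrast = cells.std(axis=(1, 3))                  # grid x grid "visual weight"
    weight = contrast / (contrast.sum() + 1e-8)
    ys, xs = np.mgrid[0:grid, 0:grid]
    center_of_mass = (float((ys * weight).sum()) / (grid - 1),
                      float((xs * weight).sum()) / (grid - 1))
    lr_balance = float(weight[:, : grid // 2].sum() - weight[:, grid // 2:].sum())
    return weight, center_of_mass, lr_balance
```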

Circularity Check

0 steps flagged

No significant circularity in the method or claims

full rationale

The paper presents Composer as an empirical framework that extracts composition-aware representations from references and applies them via a new conditional guidance module on top of external pre-trained diffusion models, plus LVLM-based retrieval and fine-tuning. No equations, derivations, or first-principles results are described that reduce to fitted parameters, self-definitions, or self-citation chains. The approach builds on independent external components and a separately curated dataset; experimental claims rest on those additions rather than tautological renaming or input-equivalent predictions. This is the normal non-circular case for a methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that composition principles can be isolated from semantics and on the capabilities of existing pre-trained diffusion models and LVLMs; no new physical entities or ad-hoc constants are introduced.

axioms (1)
  • domain assumption Composition principles operate independently of specific content semantics
    Stated directly in the abstract as the foundational premise for the semantic-agnostic modeling goal.

pith-pipeline@v0.9.0 · 5521 in / 1274 out tokens · 53525 ms · 2026-05-08T18:17:41.841031+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

49 extracted references · 16 canonical work pages · 4 internal anchors

  1. [1] Martin, F.D.: The Power of the Center: A Study of Composition in the Visual Arts. JSTOR (1983)
  2. [2] Liu, L., Chen, R., Wolf, L., Cohen-Or, D.: Optimizing photo composition. In: Computer Graphics Forum, vol. 29, pp. 469–478 (2010). Wiley Online Library
  3. [3] Dai, X., Hou, J., Ma, C.-Y., Tsai, S., Wang, J., Wang, R., Zhang, P., Vandenhende, S., Wang, X., Dubey, A., et al.: Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807 (2023)
  4. [4] Chen, J., Ge, C., Xie, E., Wu, Y., Yao, L., Ren, X., Wang, Z., Luo, P., Lu, H., Li, Z.: PixArt-σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. arXiv preprint arXiv:2403.04692 (2024)
  5. [5] Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., et al.: Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3), 8 (2023)
  6. [6] Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C., Lee, Y.J.: GLIGEN: Open-set grounded text-to-image generation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pp. 22511–22521 (2023). https://doi.org/10.1109/CVPR52729.2023.02156
  7. [7] Phung, Q., Ge, S., Huang, J.: Grounded text-to-image synthesis with attention refocusing. arXiv preprint arXiv:2306.05427 (2023)
  8. [8] Zhou, D., Li, Y., Ma, F., Zhang, X., Yang, Y.: MIGC: Multi-instance generation controller for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  9. [9] Cohen-Or, D., Sorkine, O., Gal, R., Leyvand, T., Xu, Y.-Q.: Color harmonization. In: ACM SIGGRAPH 2006 Papers, pp. 624–630 (2006)
  10. [10] Obrador, P., Schmidt-Hackenberg, L., Oliver, N.: The role of image composition in image aesthetics. In: 2010 IEEE International Conference on Image Processing, pp. 3185–3188 (2010). IEEE
  11. [11] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)
  12. [12] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
  13. [13] Black Forest Labs: Black Forest Labs; Frontier AI Lab (2024). https://blackforestlabs.ai/
  14. [14] Datta, R., Joshi, D., Li, J., Wang, J.Z.: Studying aesthetics in photographic images using a computational approach. In: Computer Vision – ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006, Proceedings, Part III 9, pp. 288–301 (2006). Springer
  15. [15] Liu, Z., Wang, Z., Yao, Y., Zhang, L., Shao, L.: Deep active learning with contaminated tags for image aesthetics assessment. IEEE Transactions on Image Processing (2018)
  16. [16] Kong, S., Shen, X., Lin, Z., Mech, R., Fowlkes, C.: Photo aesthetics ranking network with attributes and content adaptation. In: Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pp. 662–679 (2016). Springer
  17. [17] Damera-Venkata, N., Kite, T.D., Geisler, W.S., Evans, B.L., Bovik, A.C.: Image quality assessment based on a degradation model. IEEE Transactions on Image Processing 9(4), 636–650 (2000)
  18. [18] Zeng, H., Cao, Z., Zhang, L., Bovik, A.C.: A unified probabilistic formulation of image aesthetic assessment. IEEE Transactions on Image Processing 29, 1548–1561 (2019)
  19. [19] Talebi, H., Milanfar, P.: NIMA: Neural image assessment. IEEE Transactions on Image Processing 27(8), 3998–4011 (2018)
  20. [20] Yang, S., Wu, T., Shi, S., Lao, S., Gong, Y., Cao, M., Wang, J., Yang, Y.: MANIQA: Multi-dimension attention network for no-reference image quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1191–1200 (2022)
  21. [21] He, S., Ming, A., Li, Y., Sun, J., Zheng, S., Ma, H.: Thinking image color aesthetics assessment: Models, datasets and benchmarks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 21838–21847 (2023)
  22. [22] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35, 25278–25294 (2022)
  23. [23] Zhang, H., Duan, Z., Wang, X., Chen, Y., Zhang, Y.: EliGen: Entity-level controlled image generation with regional attention. arXiv preprint arXiv:2501.01097 (2025)
  24. [24] Zhang, X., Yang, L., Cai, Y., Yu, Z., Xie, J., Tian, Y., Xu, M., Tang, Y., Yang, Y., Cui, B.: RealCompo: Dynamic equilibrium between realism and compositionality improves text-to-image diffusion models. arXiv preprint arXiv:2402.12908 (2024)
  25. [25] Yang, L., Yu, Z., Meng, C., Xu, M., Ermon, S., Bin, C.: Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal LLMs. In: Forty-first International Conference on Machine Learning (2024)
  26. [26] Liu, M., Zhang, L., Tian, Y., Qu, X., Liu, L., Liu, T.: Draw like an artist: Complex scene generation with diffusion model via composition, painting, and retouching. arXiv preprint arXiv:2408.13858 (2024)
  27. [27] Liu, Z., Ning, M., Zhang, Q., Yang, S., Wang, Z., Yang, Y., Xu, X., Song, Y., Chen, W., Wang, F., et al.: CoT-lized diffusion: Let's reinforce T2I generation step-by-step. arXiv preprint arXiv:2507.04451 (2025)
  28. [28] Chen, H., Xu, X., Li, W., Ren, J., Ye, T., Liu, S., Chen, Y.-C., Zhu, L., Wang, X.: POSTA: A go-to framework for customized artistic poster generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 28694–28704 (2025)
  29. [29] Chen, S., Lai, J., Gao, J., Ye, T., Chen, H., Shi, H., Shao, S., Lin, Y., Fei, S., Xing, Z., et al.: PosterCraft: Rethinking high-quality aesthetic poster generation in a unified framework. arXiv preprint arXiv:2506.10741 (2025)
  30. [30] Achanta, R., Hemami, S., Estrada, F., Susstrunk, S.: Frequency-tuned salient region detection. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1597–1604 (2009). IEEE
  31. [31] Standard, C., et al.: Colorimetry – Part 4: CIE 1976 L*a*b* colour space. International Standard, 2019–06 (2007)
  32. [32] Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S.: SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(11), 2274–2282 (2012)
  33. [33] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pp. 3813–3824 (2023). https://doi.org/10.1109/ICCV51070.2023.00355
  34. [34] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
  35. [35] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
  36. [36] Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: LLaVA-NeXT: Improved reasoning, OCR, and world knowledge (2024)
  37. [37] ProGamerGov: Synthetic Dataset 1M DALLE3 High Quality Captions. https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions. Accessed: 2024-10-01 (2024)
  38. [38] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp. 10674–10685 (2022). https://doi.org/10.1109/CVPR52688.2022.01042
  39. [39] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)
  40. [40] Wu, X., Hao, Y., Sun, K., Chen, Y., Zhu, F., Zhao, R., Li, H.: Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341 (2023)
  41. [41] Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., Dong, Y.: ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems 36 (2024)
  42. [42] Li, M., Yang, T., Kuang, H., Wu, J., Wang, Z., Xiao, X., Chen, C.: ControlNet++: Improving conditional controls with efficient consistency feedback. In: European Conference on Computer Vision, pp. 129–147 (2025). Springer
  43. [43] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755 (2014). Springer
  44. [44] Yu, J., Xu, Y., Koh, J.Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B.K., et al.: Scaling autoregressive models for content-rich text-to-image generation. Transactions on Machine Learning Research
  45. [45] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
  46. [46] Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)
  47. [47] Feng, Y., Gong, B., Chen, D., Shen, Y., Liu, Y., Zhou, J.: Ranni: Taming text-to-image diffusion for accurate instruction following. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4744–4753 (2024)
  48. [48] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
  49. [49] Sun, K., Pan, J., Ge, Y., Li, H., Duan, H., Wu, X., Zhang, R., Zhou, A., Qin, Z., Wang, Y., et al.: JourneyDB: A benchmark for generative image understanding. Advances in Neural Information Processing Systems 36, 49659–49678 (2023)