
arxiv: 2412.04300 · v3 · submitted 2024-12-05 · 💻 cs.CV · cs.AI

T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts

Pith reviewed 2026-05-23 07:57 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords text-to-image generation · factuality evaluation · benchmark · knowledge-intensive concepts · visual question answering · image synthesis evaluation · concept composition

The pith

A new benchmark shows state-of-the-art text-to-image models still fall short on factual rendering of knowledge-intensive concepts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents T2I-FactualBench as the largest dataset of its kind for testing whether text-to-image models can produce accurate images when prompts involve specialized knowledge. It organizes evaluation into three tiers that progress from recalling single facts to combining multiple facts in one image. A multi-round visual question answering protocol then checks whether the generated images contain the required factual details. Experiments using this setup find that leading models leave substantial room for improvement on these tasks.

Core claim

T2I-FactualBench supplies the first large-scale, three-tiered benchmark specifically for factuality in knowledge-intensive text-to-image generation, paired with a multi-round VQA evaluation framework; experiments with current models confirm they remain limited in producing factually correct images across memorization, single-concept, and multi-concept composition tasks.

What carries the argument

T2I-FactualBench, a three-tiered benchmark (basic memorization to multi-concept composition) paired with multi-round VQA evaluation to measure factual accuracy in generated images.
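
A minimal sketch of how multi-round VQA verdicts could be rolled up into a per-tier score. Everything here is an illustrative assumption rather than the paper's implementation: the `RoundResult` type, the round labels, the rule that a failed round stops later rounds from counting, and the equal weighting of rounds are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RoundResult:
    name: str            # illustrative round label, e.g. "concept_factuality"
    answers: list[bool]  # one yes/no verdict per VQA question in this round


def image_score(rounds: list[RoundResult]) -> float:
    """Fraction of rounds passed, counting only rounds before the first failure.

    Assumption: a round passes only if it has at least one question and every
    answer is True; later rounds are ignored once a round fails.
    """
    if not rounds:
        return 0.0
    passed = 0
    for r in rounds:
        if not (r.answers and all(r.answers)):
            break
        passed += 1
    return passed / len(rounds)


def tier_score(images: list[list[RoundResult]]) -> float:
    """Mean image score over all images generated for one tier."""
    return mean(image_score(rs) for rs in images) if images else 0.0


demo = [
    [RoundResult("concept_factuality", [True, True]),
     RoundResult("composition_factuality", [True, False])],
    [RoundResult("concept_factuality", [True]),
     RoundResult("composition_factuality", [True])],
]
print(tier_score(demo))  # 0.75
```

The gating rule is one design choice among several: an ungated average would credit an image for a correct composition even when the underlying concept is rendered wrongly.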

Load-bearing premise

The multi-round VQA evaluation framework accurately measures the factuality of generated images for the selected knowledge-intensive concepts without introducing its own biases or errors in question design or answer interpretation.

What would settle it

A controlled study that shows the VQA questions systematically misclassify correct images as incorrect, or a new model that scores near-perfect factuality on the full three-tier benchmark while matching prior models on other metrics.
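
One way to run the first of these checks, sketched under stated assumptions: collect images that human annotators have verified as factually correct, run the VQA judge on them, and estimate how often it rejects them. The function names are hypothetical, and the Wilson score interval is a standard choice for a binomial proportion, not something the paper specifies.

```python
import math


def false_rejection_rate(judge_verdicts: list[bool]) -> float:
    """Share of human-verified-correct images that the judge marked incorrect."""
    if not judge_verdicts:
        raise ValueError("need at least one verdict")
    return judge_verdicts.count(False) / len(judge_verdicts)


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k events in n trials (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical verdicts on four images that humans judged factually correct.
verdicts = [True, False, True, True]
rate = false_rejection_rate(verdicts)
low, high = wilson_interval(verdicts.count(False), len(verdicts))
print(rate, low, high)
```

A lower bound well above zero on a reasonably sized sample would suggest the judge itself contributes to the reported gaps.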

Figures

Figures reproduced from arXiv: 2412.04300 by Fangxun Shu, Fei Wu, Hao Jiang, Haoyuan Li, Leilei Gan, Long Chan, Quanyu Long, Wanggui He, Yandi Wang, Zhelun Yu, Ziwei Huang.

Figure 1: General Concepts vs. Knowledge Concepts. We use the SOTA T2I model Stable Diffusion 3.5 (SD 3.5; Esser et al., 2024) as an example to illustrate the challenges posed by knowledge-intensive concepts versus general concepts. When given prompts with general concepts (indicated in green), SD 3.5 effectively generates images (left in Generation) that fulfill the instructions. However, when presented with speci…
Figure 2: Multi-Round VQA based Factuality Evalua…
Figure 3: Concept Factuality Scores across 8 domains
Figure 4: Concept Factuality Scores across 8 domains in the SKCM level for 11 Models.
Figure 5: Feature details for eight knowledge concept categories.
Figure 6: The prompt we used for Concept Factuality Evaluation with GPT-4o.
Figure 7: The prompt we used for Instantiation Completeness Evaluation with GPT-4o. A case of Size variation.
Figure 8: The prompt we used for Composition Factuality Evaluation of…
Figure 9: The prompt we used for Composition Factuality Evaluation of…
Figure 10: iTAG Interface for Concept Factuality Evaluation.
Figure 11: Qualitative results. Error cases of diversity models in T2I-FactualBench
Abstract

Evaluating the quality of synthesized images remains a significant challenge in the development of text-to-image (T2I) generation. Most existing studies in this area primarily focus on evaluating text-image alignment, image quality, and object composition capabilities, with comparatively fewer studies addressing the evaluation of the factuality of T2I models, particularly when the concepts involved are knowledge-intensive. To mitigate this gap, we present T2I-FactualBench in this work - the largest benchmark to date in terms of the number of concepts and prompts specifically designed to evaluate the factuality of knowledge-intensive concept generation. T2I-FactualBench consists of a three-tiered knowledge-intensive text-to-image generation framework, ranging from the basic memorization of individual knowledge concepts to the more complex composition of multiple knowledge concepts. We further introduce a multi-round visual question answering (VQA) based evaluation framework to assess the factuality of three-tiered knowledge-intensive text-to-image generation tasks. Experiments on T2I-FactualBench indicate that current state-of-the-art (SOTA) T2I models still leave significant room for improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces T2I-FactualBench, described as the largest benchmark for evaluating the factuality of text-to-image (T2I) models on knowledge-intensive concepts. It proposes a three-tiered framework ranging from basic memorization of individual concepts to complex multi-concept composition. A multi-round visual question answering (VQA) evaluation framework is used to assess factuality, with experiments indicating that current SOTA T2I models have significant room for improvement.

Significance. This benchmark addresses a gap in evaluating factual accuracy in T2I generation beyond text-image alignment and quality. If the evaluation framework is reliable, it could provide a valuable tool for measuring and improving model performance on knowledge-intensive tasks, which has implications for applications requiring accurate visual representations of real-world knowledge.

major comments (2)
  1. [§3 (Evaluation Framework)] The multi-round VQA pipeline is central to the claims, yet the manuscript does not report quantitative validation metrics such as human agreement rates with the VQA answers or error analysis for question design. Without this, it is unclear whether the reported performance gaps reflect true model shortcomings or artifacts from question phrasing, visual grounding, or VQA model errors, as highlighted in the stress-test concern.
  2. [§2 (Benchmark Construction)] The criteria for selecting the knowledge-intensive concepts and ensuring they are indeed knowledge-intensive are not detailed with sufficient rigor or examples, which is load-bearing for the benchmark's validity and the generalizability of the findings on SOTA models.
minor comments (2)
  1. [Abstract] The abstract claims it is 'the largest benchmark to date in terms of the number of concepts and prompts' but does not provide the actual numbers for comparison with prior work.
  2. [Throughout] Some figures or tables showing example prompts and generated images would benefit from clearer captions explaining the factual errors identified.
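
The human agreement check requested in major comment 1 could be reported with a chance-corrected statistic. The sketch below uses Cohen's kappa on binary factual/not-factual labels; this is an assumed choice for illustration, and the manuscript may use a different agreement measure. The function name and the sample labels are hypothetical.

```python
def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Cohen's kappa for two raters assigning binary labels to the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: share of items where both raters gave the same label.
    p_observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected chance agreement from each rater's marginal rate of True labels.
    p_human, p_judge = sum(human) / n, sum(judge) / n
    p_expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if p_expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


human_labels = [True, True, False, False]
judge_labels = [True, False, False, False]
print(cohens_kappa(human_labels, judge_labels))  # 0.5
```

Raw percent agreement alone can look high when most images fall into one class, which is why a chance-corrected figure is usually reported alongside it.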

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of the evaluation framework and benchmark construction. We address each major comment below and will revise the manuscript to strengthen these sections.

  1. Referee: [§3 (Evaluation Framework)] The multi-round VQA pipeline is central to the claims, yet the manuscript does not report quantitative validation metrics such as human agreement rates with the VQA answers or error analysis for question design. Without this, it is unclear whether the reported performance gaps reflect true model shortcomings or artifacts from question phrasing, visual grounding, or VQA model errors, as highlighted in the stress-test concern.

    Authors: We agree that explicit quantitative validation metrics would further support the reliability of the multi-round VQA pipeline. The manuscript includes a stress-test to probe for artifacts, but does not report human agreement rates or a dedicated error analysis. In the revised version, we will add these: human agreement rates from annotator studies on VQA outputs and an expanded discussion of question design considerations and how the multi-round process reduces VQA model errors. revision: yes

  2. Referee: [§2 (Benchmark Construction)] The criteria for selecting the knowledge-intensive concepts and ensuring they are indeed knowledge-intensive are not detailed with sufficient rigor or examples, which is load-bearing for the benchmark's validity and the generalizability of the findings on SOTA models.

    Authors: We acknowledge that the selection criteria for knowledge-intensive concepts require more explicit detail and examples to fully establish the benchmark's rigor. The manuscript describes the three-tier framework and sources concepts from established knowledge resources, but does not provide a dedicated subsection on selection metrics. In the revision, we will expand §2 with precise criteria (e.g., measures of concept rarity and specificity) and concrete examples across tiers. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark and VQA framework are independent measurement tools


The paper introduces T2I-FactualBench as a new dataset and multi-round VQA evaluation protocol for assessing factuality in knowledge-intensive T2I generation. No equations, fitted parameters, or derivations are present that reduce to self-defined terms or self-citations. The central claim (SOTA models show room for improvement) rests on empirical measurements against an externally defined benchmark rather than any construction that forces the outcome by definition. The evaluation framework is presented as a measurement instrument, not a result derived from the data it evaluates.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard assumptions in computer vision evaluation (VQA can proxy image factuality) and domain assumptions about what counts as knowledge-intensive; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption Multi-round VQA responses provide a reliable proxy for factual correctness in generated images.
    Invoked in the description of the evaluation framework in the abstract.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models

    cs.CV 2026-05 unverdicted novelty 5.0

    FAGER is a new agentic framework that creates structured factual rubrics to evaluate and refine text-to-image outputs for implicit factual correctness across science, history, products, and culture.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 1 Pith paper · 9 internal anchors

  1. [1]

    Dominik Lorenz Andreas Blattmann, Axel Sauer. 2024. https://blackforestlabs.ai/#get-flux Flux.1

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923

  3. [3]

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. 2023. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8

  4. [4]

    Jinhe Bi, Yifan Wang, Danqi Yan, Xun Xiao, Artur Hecker, Volker Tresp, and Yunpu Ma. 2025. Prism: Self-pruning intrinsic selection method for training-free multimodal data selection. arXiv preprint arXiv:2502.12119

  5. [5]

    Jinhe Bi, Yujun Wang, Haokun Chen, Xun Xiao, Artur Hecker, Volker Tresp, and Yunpu Ma. 2024. Visual instruction tuning with 500x fewer parameters through modality linear representation-steering. arXiv preprint arXiv:2412.12359

  6. [6]

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650--9660

  7. [7]

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James T. Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. 2024a. https://openreview.net/forum?id=eAKmQPe3m1 PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations, ICLR 2...

  8. [8]

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. 2024b. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185--24198

  9. [9]

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. 2024. https://doi.org/10.48550/ARXIV.2403.03206 Scaling rectified flow transformers for high-resolution ...

  10. [10]

    Xingyu Fu, Muyu He, Yujie Lu, William Yang Wang, and Dan Roth. 2024. Commonsense-t2i challenge: Can text-to-image generation models understand commonsense? arXiv preprint arXiv:2406.07546

  11. [11]

    Sebastian Hartwig, Dominik Engel, Leon Sick, Hannah Kniesel, Tristan Payer, Timo Ropinski, et al. 2024. Evaluating text to image synthesis: Survey and taxonomy of image quality metrics. arXiv preprint arXiv:2403.11821

  12. [12]

    Wanggui He, Siming Fu, Mushui Liu, Xierui Wang, Wenyi Xiao, Fangxun Shu, Yi Wang, Lei Zhang, Zhelun Yu, Haoyuan Li, et al. 2024. Mars: Mixture of auto-regressive models for fine-grained text-to-image synthesis. arXiv preprint arXiv:2407.07614

  13. [13]

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. 2021. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718

  14. [14]

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30

  15. [15]

    Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. 2023. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20406--20417

  16. [16]

    Hsin-Ping Huang, Xinyi Wang, Yonatan Bitton, Hagai Taitelbaum, Gaurav Singh Tomar, Ming-Wei Chang, Xuhui Jia, Kelvin CK Chan, Hexiang Hu, Yu-Chuan Su, et al. 2024. Kitten: A knowledge-intensive evaluation of image generation on visual entities. arXiv preprint arXiv:2410.11824

  17. [17]

    Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. 2023. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723--78747

  18. [18]

    Nasiba Komil Qizi Jumaeva. 2024. Using the hyponyms for improving the beginners' vocabulary range. Academic research in educational sciences, 5(CSPU Conference 1):669--673

  19. [19]

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. 2023. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:36652--36663

  20. [20]

    Klaus Krippendorff. 2018. Content analysis: An introduction to its methodology. Sage publications

  21. [21]

    Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. 2023a. Viescore: Towards explainable metrics for conditional image synthesis evaluation. arXiv preprint arXiv:2312.14867

  22. [22]

    Max Ku, Tianle Li, Kai Zhang, Yujie Lu, Xingyu Fu, Wenwen Zhuang, and Wenhu Chen. 2023b. Imagenhub: Standardizing the evaluation of conditional image generation models. arXiv preprint arXiv:2310.01596

  23. [23]

    Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park, Agrim Gupta, Yunzhi Zhang, Deepak Narayanan, Hannah Teufel, Marco Bellagente, et al. 2024. Holistic evaluation of text-to-image models. Advances in Neural Information Processing Systems, 36

  24. [24]

    Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, et al. 2024a. Genai-bench: Evaluating and improving compositional text-to-visual generation. arXiv preprint arXiv:2406.13743

  25. [25]

    Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Xide Xia, Pengchuan Zhang, Graham Neubig, and Deva Ramanan. 2024b. Evaluating and improving compositional text-to-visual generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5290--5301

  26. [26]

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. 2024c. https://doi.org/10.48550/ARXIV.2402.17245 Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. CoRR, abs/2402.17245

  27. [27]

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888--12900. PMLR

  28. [28]

    Tianwei Lin, Jiang Liu, Wenqiao Zhang, Zhaocheng Li, Yang Dai, Haoyuan Li, Zhelun Yu, Wanggui He, Juncheng Li, Hao Jiang, et al. 2024. Teamlora: Boosting low-rank adaptation with expert collaboration and competition. arXiv preprint arXiv:2408.09856

  29. [29]

    Jiang Liu, Bolin Li, Haoyuan Li, Tianwei Lin, Wenqiao Zhang, Tao Zhong, Zhelun Yu, Jinghao Wei, Hao Cheng, Wanggui He, et al. 2024. Boosting private domain understanding of efficient mllms: A tuning-free, adaptive, universal prompt optimization framework. arXiv preprint arXiv:2412.19684

  30. [30]

    Mushui Liu, Yuhang Ma, Zhen Yang, Jun Dan, Yunlong Yu, Zeng Zhao, Zhipeng Hu, Bai Liu, and Changjie Fan. 2025. Llm4gen: Leveraging semantic representation of llms for text-to-image generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 5523--5531

  31. [31]

    Yujie Lu, Dongfu Jiang, Wenhu Chen, William Wang, Yejin Choi, and Bill Yuchen Lin. 2024a. Wildvision arena: Benchmarking multimodal llms in the wild (february 2024). URL https://huggingface.co/spaces/WildVision/vision-arena

  32. [32]

    Yujie Lu, Xianjun Yang, Xiujun Li, Xin Eric Wang, and William Yang Wang. 2024b. Llmscore: Unveiling the power of large language models in text-to-image synthesis evaluation. Advances in Neural Information Processing Systems, 36

  33. [33]

    Giuliano Martinelli, Francesco Molfese, Simone Tedeschi, Alberte Fernández-Castro, and Roberto Navigli. 2024. Cner: Concept and named entity recognition. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8329--8344

  34. [34]

    Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, et al. 2024. Phybench: A physical commonsense benchmark for evaluating text-to-image models. arXiv preprint arXiv:2406.11802

  35. [35]

    Roberto Navigli and Simone Paolo Ponzetto. 2012. Babelnet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial intelligence, 193:217--250

  36. [36]

    OpenAI. 2024. Chatgpt. https://openai.com/index/gpt-4o-system-card/

  37. [37]

    Jonas Oppenlaender. 2022. The creativity of text-to-image generation. In Proceedings of the 25th international academic mindtrek conference, pages 192--202

  38. [38]

    Dong Huk Park, Samaneh Azadi, Xihui Liu, Trevor Darrell, and Anna Rohrbach. 2021. Benchmark for compositional text-to-image synthesis. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)

  39. [39]

    Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. 2024. Dreambench++: A human-aligned benchmark for personalized image generation. arXiv preprint arXiv:2406.16855

  40. [40]

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. 2024. https://openreview.net/forum?id=di52zR8xgf SDXL: improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, ...

  41. [41]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748--8763. PMLR

  42. [42]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. https://doi.org/10.1109/CVPR52688.2022.01042 High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10674--10685. IEEE

  43. [43]

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479--36494

  44. [44]

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training gans. Advances in neural information processing systems, 29

  45. [45]

    D She, Mushui Liu, Jingxuan Pang, Jin Wang, Zhen Yang, Wanggui He, Guanghao Zhang, Yi Wang, Qihan Huang, Haobin Tang, et al. 2025. Customvideox: 3d reference attention driven dynamic adaptation for zero-shot customized video diffusion transformers. arXiv preprint arXiv:2502.06527

  46. [46]

    Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. 2024a. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525

  47. [47]

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. 2024b. Generative multimodal models are in-context learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14398--14409

  48. [48]

    Shanu Vashishtha, Abhinav Prakash, Lalitesh Morishetti, Kaushiki Nag, Yokila Arora, Sushant Kumar, and Kannan Achan. 2024. Chaining text-to-image and large language model: A novel approach for generating personalized e-commerce banners. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5825--5835

  49. [49]

    Veera Vimpari, Annakaisa Kultima, Perttu Hämäläinen, and Christian Guckelsberger. 2023. “an adapt-or-die type of situation”: Perception, adoption, and use of text-to-image-generation ai by game industry professionals. Proceedings of the ACM on Human-Computer Interaction, 7(CHI PLAY):131--164

  50. [50]

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. 2024a. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191

  51. [51]

    Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, et al. 2025a. Visualprm: An effective process reward model for multimodal reasoning. arXiv preprint arXiv:2503.10291

  52. [52]

    X. Wang, Siming Fu, Qihan Huang, Wanggui He, and Hao Jiang. 2024b. https://doi.org/10.48550/ARXIV.2406.07209 Ms-diffusion: Multi-subject zero-shot image personalization with layout guidance. CoRR, abs/2406.07209

  53. [53]

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. 2024c. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869

  54. [54]

    Yi Wang, Mushui Liu, Wanggui He, Longxiang Zhang, Ziwei Huang, Guanghao Zhang, Fangxun Shu, Zhong Tao, Dong She, Zhelun Yu, et al. 2025b. Mint: Multi-modal chain of thought in unified generative models for enhanced image generation. arXiv preprint arXiv:2503.01298

  55. [55]

    Tong Wu, Guandao Yang, Zhibing Li, Kai Zhang, Ziwei Liu, Leonidas Guibas, Dahua Lin, and Gordon Wetzstein. 2024a. GPT-4V(ision) is a human-aligned evaluator for text-to-3d generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22227--22238

  56. [56]

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. 2023. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341

  57. [57]

    Xindi Wu, Dingli Yu, Yangsibo Huang, Olga Russakovsky, and Sanjeev Arora. 2024b. Conceptmix: A compositional image generation benchmark with controllable difficulty. arXiv preprint arXiv:2408.14339

  58. [58]

    Wenyi Xiao, Ziwei Huang, Leilei Gan, Wanggui He, Haoyuan Li, Zhelun Yu, Hao Jiang, Fei Wu, and Linchao Zhu. 2024a. Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback. arXiv preprint arXiv:2404.14233

  59. [59]

    Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, and Fei Wu. 2024b. A comprehensive survey of datasets, theories, variants, and applications in direct preference optimization. arXiv preprint arXiv:2410.15595

  60. [60]

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. 2024. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36

  61. [61]

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. 2022. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5

  62. [62]

    Xinlu Zhang, Yujie Lu, Weizhi Wang, An Yan, Jun Yan, Lianke Qin, Heng Wang, Xifeng Yan, William Yang Wang, and Linda Ruth Petzold. 2023a. GPT-4V(ision) as a generalist evaluator for vision-language tasks. arXiv preprint arXiv:2311.01361

  63. [63]

    Yuxuan Zhang, Jiaming Liu, Yiren Song, Rui Wang, Hao Tang, Jinpeng Yu, Huaxia Li, Xu Tang, Yao Hu, Han Pan, and Zhongliang Jing. 2023b. https://doi.org/10.48550/ARXIV.2312.16272 Ssr-encoder: Encoding selective subject representation for subject-driven generation. CoRR, abs/2312.16272

  64. [64]

    Ziyu Zhao, Leilei Gan, Guoyin Wang, Wangchunshu Zhou, Hongxia Yang, Kun Kuang, and Fei Wu. 2024. Loraretriever: Input-aware lora retrieval and composition for mixed tasks in the wild. arXiv preprint arXiv:2402.09997
