DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models
Pith reviewed 2026-05-08 13:49 UTC · model grok-4.3
The pith
DynT2I-Eval evaluates text-to-image models with continuously refreshed prompts, generated by decomposing long-form descriptions into controllable dimensions, to avoid fixed-set overfitting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DynT2I-Eval constructs a structured visual semantic space from long-form descriptions, decomposing prompts into controllable dimensions such as subject, logical constraint, environment, and composition. This enables the continuous generation of fresh prompts through task-specific spaces and difficulty-aware sampling. Model performance is assessed across text alignment, perceptual quality, and aesthetics by unifying outputs into prompt-conditioned pairwise comparisons. A dynamic scheduler with micro-batch aggregation and weighted Bayesian updates maintains a stable online leaderboard despite evolving prompt distributions and model additions. Experiments with independently sampled prompt streams show that continually refreshed prompts reduce the impact of prompt-set-specific tuning, while simulations and ablations confirm a balance among cold-start convergence, late-entry discovery, and long-run ranking fidelity.
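To make the mechanism concrete, here is a minimal sketch of difficulty-aware sampling over a dimension-decomposed prompt space. The dimension inventory, difficulty scores, and function names are invented for illustration; the paper mines its semantic space from long-form descriptions rather than hand-listing values as done here.

```python
import random

# Hypothetical dimension inventory; the paper's actual semantic space is
# mined from long-form descriptions, not hand-listed like this.
DIMENSIONS = {
    "subject": ["a red fox", "two glass chess pieces", "an elderly violinist"],
    "logical_constraint": ["exactly three of them", "none of them touching",
                           "each a different color"],
    "environment": ["in a foggy pine forest", "on a checkered marble floor"],
    "composition": ["viewed from above", "centered with shallow depth of field"],
}

# Assumed difficulty scores per constraint (e.g., historical failure rates).
DIFFICULTY = {"exactly three of them": 0.9, "none of them touching": 0.8,
              "each a different color": 0.6}

def sample_prompt(target_difficulty: float) -> str:
    """Draw one fresh prompt, biasing constraint choice toward the target difficulty."""
    constraint = min(
        DIMENSIONS["logical_constraint"],
        key=lambda c: abs(DIFFICULTY.get(c, 0.5) - target_difficulty),
    )
    parts = [
        random.choice(DIMENSIONS["subject"]),
        constraint,
        random.choice(DIMENSIONS["environment"]),
        random.choice(DIMENSIONS["composition"]),
    ]
    return ", ".join(parts)

print(sample_prompt(target_difficulty=0.85))
```

Because every dimension is sampled independently, the stream of prompts never repeats a fixed set, which is what removes the target a model could overfit to.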
What carries the argument
The structured visual semantic space, decomposed into controllable dimensions (subject, logical constraint, environment, and composition), which supports task-specific spaces and difficulty-aware sampling for ongoing prompt generation and dynamic ranking maintenance.
Load-bearing premise
The decomposition into controllable dimensions and difficulty-aware sampling produce prompts that remain representative of real usage and do not introduce new unmeasured biases in the evaluation.
What would settle it
If models tuned specifically to the generated prompt streams still show inflated performance compared to independently sampled streams, or if rankings become inconsistent across multiple independent prompt streams from the framework.
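The second condition is directly testable with no new machinery: run the framework on two independently sampled prompt streams and correlate the induced rankings. A minimal sketch of that consistency check, using Kendall's tau as the agreement statistic (our choice of statistic, not necessarily the paper's):

```python
from itertools import combinations

def kendall_tau(rank_a: dict[str, int], rank_b: dict[str, int]) -> float:
    """Kendall rank correlation between two model rankings (1 = identical order)."""
    models = list(rank_a)
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        s = (rank_a[m] - rank_a[n]) * (rank_b[m] - rank_b[n])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Rankings produced by two independently sampled prompt streams (toy data).
stream_1 = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
stream_2 = {"model_a": 1, "model_b": 3, "model_c": 2, "model_d": 4}
print(kendall_tau(stream_1, stream_2))  # low values would signal instability
```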
Original abstract
Existing text-to-image (T2I) benchmarks largely rely on fixed prompt sets, leaving them vulnerable to overfitting and benchmark contamination once publicly released and repeatedly reused. In this work, we propose DynT2I-Eval, a fully automated dynamic evaluation framework for T2I models. It constructs a structured visual semantic space from long-form descriptions, decomposing prompts into controllable dimensions (e.g., subject, logical constraint, environment, and composition). This enables the continuous generation of fresh prompts via task-specific spaces and difficulty-aware sampling. DynT2I-Eval evaluates model performance across text alignment, perceptual quality, and aesthetics. Heterogeneous outputs are unified into prompt-conditioned pairwise comparisons, allowing a dynamic scheduler, micro-batch aggregation, and weighted Bayesian updates to maintain a stable online leaderboard despite changing prompt distributions and model injection. Experiments with independently sampled prompt streams demonstrate that continually refreshed prompts provide a robust evaluation protocol, reducing the impact of prompt-set-specific tuning. Simulations and ablations further confirm that the proposed ranking framework achieves a strong balance among cold-start convergence, late-entry discovery, and long-run ranking fidelity.
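The rating machinery described here sits in the Elo/Glicko/TrueSkill family ([1], [7], [10]). The paper's exact update rule is not reproduced in this review; the sketch below shows only the general shape of an uncertainty-weighted, micro-batched rating update, with the K-factor schedule and initial rating as our assumptions.

```python
import math
from collections import defaultdict

class Leaderboard:
    """Uncertainty-weighted Elo-style ratings updated per micro-batch
    (a sketch of the general shape, not the paper's exact rule)."""

    def __init__(self, base_k: float = 32.0):
        self.rating = defaultdict(lambda: 1500.0)
        self.games = defaultdict(int)
        self.base_k = base_k

    def _k(self, model: str) -> float:
        # New entrants move fast (late-entry discovery); veterans stabilize.
        return self.base_k / math.sqrt(1 + self.games[model])

    def update_batch(self, outcomes: list[tuple[str, str, float]]) -> None:
        """outcomes: (model_a, model_b, score) with score 1=A wins, 0.5=tie, 0=B wins.
        Rating deltas are accumulated over the micro-batch, then applied once."""
        delta = defaultdict(float)
        for a, b, score in outcomes:
            expected = 1 / (1 + 10 ** ((self.rating[b] - self.rating[a]) / 400))
            delta[a] += self._k(a) * (score - expected)
            delta[b] += self._k(b) * (expected - score)
            self.games[a] += 1
            self.games[b] += 1
        for model, d in delta.items():
            self.rating[model] += d

lb = Leaderboard()
lb.update_batch([("sdxl", "flux", 0.0), ("sdxl", "flux", 0.5)])
print(sorted(lb.rating.items(), key=lambda kv: -kv[1]))
```

Applying deltas once per micro-batch, rather than per comparison, is what keeps the leaderboard stable when the prompt distribution shifts mid-stream.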
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DynT2I-Eval, a dynamic evaluation framework for text-to-image models. It constructs a structured visual semantic space from long-form descriptions by decomposing prompts into controllable dimensions (subject, logical constraint, environment, composition), enabling continuous generation of fresh prompts via difficulty-aware sampling. Model outputs are unified into prompt-conditioned pairwise comparisons, with a dynamic scheduler and weighted Bayesian updates maintaining an online leaderboard. The central claim is that experiments with independently sampled prompt streams show that continually refreshed prompts reduce the impact of prompt-set-specific tuning, with simulations confirming a balance among cold-start convergence, late-entry discovery, and ranking fidelity.
Significance. If the central claim holds with proper validation, the framework could address a persistent problem in T2I benchmarking by reducing overfitting and contamination from fixed public prompt sets, offering a more reliable protocol for ongoing model evaluation. The combination of dimension-decomposed prompt generation and Bayesian aggregation is a concrete technical contribution that could generalize to other generative model evaluation settings.
Major comments (2)
- [Abstract] The claim that 'Experiments with independently sampled prompt streams demonstrate that continually refreshed prompts provide a robust evaluation protocol' is supported by no quantitative results, tables, error bars, or statistical tests in the manuscript. This is load-bearing because robustness to prompt-set-specific tuning is the primary empirical contribution.
- [Framework description] Decomposition and sampling: No validation is provided that the decomposition into subject/logical-constraint/environment/composition dimensions, or the difficulty-aware sampling from long-form descriptions, produces prompts whose distribution matches real user T2I usage. Without distributional divergence metrics or human preference alignment against an external corpus, the stability observed in independent streams could simply reflect shared bias in the generated subspace rather than true robustness.
Minor comments (2)
- [Abstract] The abstract refers to 'task-specific spaces' without defining how they differ from the main structured visual semantic space or how they are instantiated.
- No mention of how heterogeneous outputs (text alignment, perceptual quality, aesthetics) are normalized before pairwise comparison, which could affect the stability of the Bayesian updates (a sketch of one plausible normalization follows this list).
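The normalization question can be made concrete. Below is a minimal sketch of one plausible scheme, not the paper's: each axis is reduced to a per-prompt vote before anything is weighted, so raw score scales never mix. The axis weights and tie margin are invented.

```python
def pairwise_outcome(scores_a: dict[str, float], scores_b: dict[str, float],
                     weights: dict[str, float], tie_margin: float = 0.05) -> float:
    """Collapse per-axis scores for two images of the same prompt into one
    pairwise outcome (1 = A wins, 0.5 = tie, 0 = B wins). Axis scales never
    mix: each axis casts a vote, and only the votes are weighted."""
    vote = 0.0
    for axis, w in weights.items():
        diff = scores_a[axis] - scores_b[axis]
        if abs(diff) > tie_margin:
            vote += w if diff > 0 else -w
    if vote > 0:
        return 1.0
    return 0.5 if vote == 0 else 0.0

# Toy per-axis scores for two images generated from the same prompt.
a = {"alignment": 0.91, "quality": 0.62, "aesthetics": 0.55}
b = {"alignment": 0.74, "quality": 0.70, "aesthetics": 0.58}
w = {"alignment": 0.5, "quality": 0.25, "aesthetics": 0.25}
print(pairwise_outcome(a, b, w))  # alignment vote outweighs the other two
```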
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive suggestions. We address each major comment below and indicate the planned revisions to strengthen the manuscript.
Point-by-point responses
-
Referee: [Abstract] The claim that 'Experiments with independently sampled prompt streams demonstrate that continually refreshed prompts provide a robust evaluation protocol' is supported by no quantitative results, tables, error bars, or statistical tests in the manuscript. This is load-bearing because robustness to prompt-set-specific tuning is the primary empirical contribution.
Authors: We thank the referee for this important feedback. We will revise the abstract to cite key quantitative results from our experiments with independently sampled prompt streams, such as measures of ranking stability and robustness to prompt variation, and to point to the corresponding tables, error bars, and statistical tests in the manuscript. revision: yes
-
Referee: [Framework description] Decomposition and sampling: No validation is provided that the decomposition into subject/logical-constraint/environment/composition dimensions, or the difficulty-aware sampling from long-form descriptions, produces prompts whose distribution matches real user T2I usage. Without distributional divergence metrics or human preference alignment against an external corpus, the stability observed in independent streams could simply reflect shared bias in the generated subspace rather than true robustness.
Authors: We agree with the referee that additional validation would be beneficial. The dimension decomposition is derived from an analysis of visual semantics in long-form descriptions, but we did not include explicit distributional metrics or human alignment studies. We will revise the framework description section to provide more detailed motivation for the chosen dimensions, supported by references to related work on T2I prompt structures. We will also add a limitations paragraph discussing the potential for subspace bias and outlining plans for future validation against user corpora. revision: partial
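The deferred validation has a standard shape. Here is a minimal sketch of one such check, comparing dimension-value marginals of generated prompts against a user corpus with Jensen-Shannon divergence; the metric, categories, and numbers are our assumptions, not the authors'.

```python
import math

def js_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """Jensen-Shannon divergence between two categorical distributions
    over the same dimension values (0 = identical, log 2 = disjoint)."""
    keys = set(p) | set(q)
    def kl(x, y):
        return sum(x.get(k, 0) * math.log(x.get(k, 0) / y[k])
                   for k in keys if x.get(k, 0) > 0)
    m = {k: 0.5 * (p.get(k, 0) + q.get(k, 0)) for k in keys}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy marginals over the 'subject' dimension: generated stream vs. user corpus.
generated = {"animal": 0.40, "person": 0.30, "object": 0.20, "scene": 0.10}
user_logs = {"person": 0.45, "animal": 0.25, "scene": 0.20, "object": 0.10}
print(js_divergence(generated, user_logs))  # large values flag subspace bias
```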
Circularity Check
No circularity in derivation chain
Full rationale
The paper proposes DynT2I-Eval as a methodological framework for dynamic T2I evaluation via prompt decomposition into dimensions like subject and composition, followed by difficulty-aware sampling and Bayesian leaderboard updates. No mathematical derivations, equations, or first-principles results are presented that reduce any claim (such as robustness from refreshed prompts) to its own inputs by construction. The key demonstration relies on external experiments with independently sampled prompt streams rather than self-referential definitions or fitted parameters renamed as predictions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the provided text; the evaluation protocol is grounded in external model outputs rather than in its own constructions.
Reference graph
Works this paper leans on
-
[1]
Paul CH Albers and Han de Vries. Elo-rating as a tool in the sequential estimation of dominance strengths. Animal Behaviour, pages 489–495, 2001.
-
[2]
Shuo Cao, Jiayang Li, Xiaohui Li, Yuandong Pu, Kaiwen Zhu, Yuanting Gao, Siqi Luo, Yi Xin, Qi Qin, Yu Zhou, Xiangyu Chen, Wenlong Zhang, Bin Fu, Yu Qiao, and Yihao Liu. Unipercept: Towards unified perceptual-level image understanding across aesthetics, quality, structure, and texture. arXiv preprint arXiv:2512.21675, 2025.
-
[3]
Shuo Cao, Nan Ma, Jiayang Li, Xiaohui Li, Lihao Shao, Kaiwen Zhu, Yu Zhou, Yuandong Pu, Jiarui Wu, Jiaquan Wang, Bo Qu, Wenhai Wang, Yu Qiao, Dajuin Yao, and Yihao Liu. Artimuse: Fine-grained image aesthetics assessment with joint scoring and expert-level understanding. arXiv preprint arXiv:2507.14533, 2025.
-
[4]
Zijian Chen, Yuze Sun, Yuan Tian, Wenjun Zhang, and Guangtao Zhai. Maceval: A multi-agent continual evaluation network for large models. arXiv preprint arXiv:2511.09139, 2025.
-
[5]
Zijian Chen, Wenjun Zhang, and Guangtao Zhai. Evaluating from benign to dynamic adversarial: A squid game for large language models. arXiv preprint arXiv:2511.10691, 2026.
-
[6]
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024.
-
[7]
Mark E Glickman. Example of the Glicko-2 system. Boston University, 28:2012, 2012.
-
[8]
Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594, 2025.
-
[9]
Reinhard Heckel, Nihar B. Shah, Kannan Ramchandran, and Martin J. Wainwright. Active ranking from pairwise comparisons and when parametric assumptions don't help. arXiv preprint arXiv:1606.08842, 2016.
-
[10]
Ralf Herbrich, Tom Minka, and Thore Graepel. TrueSkill™: A Bayesian skill rating system. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 19, 2006.
-
[11]
Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. ELLA: Equip diffusion models with LLM for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024.
-
[12]
Kaiyi Huang, Chengqi Duan, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2I-CompBench++: An enhanced and comprehensive benchmark for compositional text-to-image generation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 47(5):3563–3579, 2025.
-
[13]
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21807–21818, 2024.
-
[14]
Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu, and Marjan Ghazvininejad. GenEval 2: Addressing benchmark drift in text-to-image evaluation. arXiv preprint arXiv:2512.16853, 2025.
-
[15]
Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5148–5157, 2021.
-
[16]
Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
-
[17]
Black Forest Labs. FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2, 2025.
-
[18]
Sangwu Lee, Titus Ebbecke, Erwann Millon, Will Beddow, Le Zhuo, Iker García-Ferrero, Liam Esparraguera, Mihai Petrescu, Gian Saß, Gabriel Menezes, and Victor Perez. Flux.1 krea [dev]. https://github.com/krea-ai/flux-krea, 2025.
-
[19]
Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, et al. Genai-bench: Evaluating and improving compositional text-to-visual generation. arXiv preprint arXiv:2406.13743, 2024.
-
[20]
Weiqi Li, Xuanyu Zhang, Shijie Zhao, Yabin Zhang, Junlin Li, Li Zhang, and Jian Zhang. Q-insight: Understanding image quality via visual reinforcement learning. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2025.
-
[21]
Zhikai Li, Xuewen Liu, Dongrong Joe Fu, Jianquan Li, Qingyi Gu, Kurt Keutzer, and Zhen Dong. K-sort arena: Efficient and reliable benchmarking for generative models via k-wise human preferences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9131–9141, June 2025.
-
[22]
Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 366–384, 2024.
-
[23]
Xiaoyue Mi, Fan Tang, Juan Cao, Qiang Sheng, Ziyao Huang, Peng Li, Yang Liu, and Tong-Yee Lee. Interactive visual assessment for text-to-image generation models. IEEE Transactions on Visualization and Computer Graphics (TVCG), 2026.
-
[24]
Tom Minka, Ryan Cleven, and Yordan Zaykov. TrueSkill 2: An improved Bayesian skill rating system. Technical Report, 2018.
-
[25]
Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing (TIP), 21(12):4695–4708, 2012.
-
[26]
Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Making a "completely blind" image quality analyzer. IEEE Signal Processing Letters, 20(3):209–212, 2012.
-
[27]
Sewoong Oh. Rank centrality: Ranking from pairwise comparisons. Operations Research, 2017.
-
[28]
Yasumasa Onoe, Sunayana Rane, Zachary Berger, Yonatan Bitton, Jaemin Cho, Roopal Garg, Alexander Ku, Zarana Parekh, Jordi Pont-Tuset, Garrett Tanzer, Su Wang, and Jason Baldridge. DOCCI: Descriptions of Connected and Contrasting Images. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.
-
[29]
Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned benchmark for personalized image generation. In Proceedings of the International Conference on Learning Representations (ICLR).
-
[30]
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
-
[31]
Rishav Pramanik, Ian E Nielsen, Jeff Smith, Saurav Pandit, Ravi P Ramachandran, and Zhaozheng Yin. Saneval: Open-vocabulary compositional benchmarks with failure-mode diagnosis. arXiv preprint arXiv:2602.00249, 2026.
-
[32]
Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026.
-
[33]
Hossein Talebi and Peyman Milanfar. Nima: Neural image assessment. IEEE Transactions on Image Processing (TIP), 27(8):3998–4011, 2018.
-
[34]
Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, et al. Longcat-Image technical report. arXiv preprint arXiv:2512.07584, 2025.
-
[35]
Z-Image Team. Z-Image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699, 2025.
-
[36]
Yixin Wan and Kai-Wei Chang. Compalign: Improving compositional text-to-image generation with a complex benchmark and fine-grained feedback. arXiv preprint arXiv:2505.11178, 2025.
-
[37]
Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring CLIP for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 37, pages 2555–2563, 2023.
-
[38]
Jiarui Wang, Huiyu Duan, Jing Liu, Shi Chen, Xiongkuo Min, and Guangtao Zhai. Aigciqa2023: A large-scale image quality assessment database for AI generated images: From the perspectives of quality, authenticity and correspondence. In Artificial Intelligence: Third CAAI International Conference (CICAI), pages 46–57, Berlin, Heidelberg, 2023. Springer-Verlag.
-
[39]
Jiarui Wang, Huiyu Duan, Yu Zhao, Juntong Wang, Guangtao Zhai, and Xiongkuo Min. Lmm4lmm: Benchmarking and evaluating large-multimodal image generation with LMMs. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 17312–17323, 2025.
-
[40]
Juntong Wang, Huiyu Duan, Jiarui Wang, Ziheng Jia, Guangtao Zhai, and Xiongkuo Min. Tit-score: Evaluating long-prompt based text-to-image alignment via text-to-image-to-text consistency. arXiv preprint arXiv:2510.02987, 2025.
-
[41]
Juntong Wang, Jiarui Wang, Huiyu Duan, Jiaxiang Kang, Guangtao Zhai, and Xiongkuo Min. I2i-bench: A comprehensive benchmark suite for image-to-image editing models. arXiv preprint arXiv:2512.04660, 2025.
-
[42]
Zengbin Wang, Xuecai Hu, Yong Wang, Feng Xiong, Man Zhang, and Xiangxiang Chu. Everything in its place: Benchmarking spatial intelligence of text-to-image models. arXiv preprint arXiv:2601.20354, 2026.
-
[43]
Xinyu Wei, Jinrui Zhang, Zeqing Wang, Hongyang Wei, Zhen Guo, and Lei Zhang. Tiif-bench: How does your T2I model follow your instructions? arXiv preprint arXiv:2506.02161, 2025.
-
[44]
Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, et al. Q-align: Teaching LMMs for visual scoring via discrete text-defined levels. In Proceedings of the International Conference on Machine Learning (ICML), pages 54015–54029, 2024.
-
[45]
Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, and Kede Ma. Visualquality-r1: Reasoning-induced image quality assessment via reinforcement learning to rank. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2025.
-
[46]
Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. Maniqa: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1191–1200, 2022.
-
[47]
Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, and Chao Dong. Teaching large language models to regress accurate image quality scores using score distribution. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 14483–14494, 2025.
-
[48]
Ming Zhang, Yujiong Shen, Jingyi Deng, Yuhui Wang, Huayu Sha, Kexin Tan, Qiyuan Peng, Yue Zhang, Junzhe Wang, Shichun Liu, et al. Llmeval-fair: A large-scale longitudinal study on robust and fair evaluation of large language models. arXiv preprint arXiv:2508.05452, 2025.
-
[49]
Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, and Jie Tang. Cogview3: Finer and faster text-to-image generation via relay diffusion. arXiv preprint arXiv:2403.05121, 2024.
-
[50]
Hanwei Zhu, Haoning Wu, Yixuan Li, Zicheng Zhang, Baoliang Chen, Lingyu Zhu, Yuming Fang, Guangtao Zhai, Weisi Lin, and Shiqi Wang. Adaptive image quality assessment via teaching large multimodal model to compare. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS), pages 32611–32629, 2024.
-
[51]
Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. In Proceedings of the International Conference on Machine Learning (ICML), pages 10–18, 2014.