TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design

Alexandria Minetti; Allison Nulty; Elad Hirsch; Haonan Zhu; Purvanshi Mehta

arXiv: 2605.20731 · v1 · pith:RKSCC552new · submitted 2026-05-20 · 💻 cs.CV · cs.AI · stat.AP


Pith reviewed 2026-05-21 05:05 UTC · model grok-4.3


keywords: graphic design · preference dataset · text-to-image evaluation · designer annotations · multi-dimensional scoring · AI preference learning · visual quality assessment


The pith

A small pairwise-difference head trained on designer ratings for graphic design reaches 0.611 macro agreement with the five-designer majority.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper releases TASTE, a dataset in which ten professional designers rank AI-generated graphic designs from four text-to-image models along nine criteria such as typography, layout, and color harmony. Existing pre-trained scorers, including large vision-language models and dedicated T2I judges, top out at 0.55 macro agreement with the five-designer majority vote. Training a lightweight pairwise-difference head on the new data lifts agreement to 0.611, which the authors describe as closing roughly half the gap to the 0.741 agreement that a single designer reaches against the same majority.
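As a quick arithmetic check on the "roughly half" figure: the fraction of a gap that is closed depends on the baseline the gap is measured from, which this summary does not state. The sketch below assumes two candidate baselines, the 0.55 upper bound on pre-trained scorers and a 0.50 chance level for balanced pairwise comparisons. Both baselines and the helper name `gap_closed` are illustrative assumptions, not the paper's definition.

```python
def gap_closed(score: float, baseline: float, ceiling: float) -> float:
    """Fraction of the baseline-to-ceiling gap covered by `score`."""
    return (score - baseline) / (ceiling - baseline)


HEAD = 0.611     # trained pairwise-difference head (from the abstract)
CEILING = 0.741  # single-rater ceiling (from the abstract)

# Assumption 1: measure the gap from the 0.55 bound on pre-trained scorers.
from_pretrained_bound = gap_closed(HEAD, 0.55, CEILING)

# Assumption 2: measure the gap from a 0.50 chance level.
from_chance = gap_closed(HEAD, 0.50, CEILING)

print(f"from 0.55: {from_pretrained_bound:.3f}")
print(f"from 0.50: {from_chance:.3f}")
```

Under these assumptions the head covers about 32% of the gap measured from 0.55 and about 46% measured from chance. Because no pre-trained system exceeds 0.55, any lower actual baseline would yield a fraction above 32%.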

Core claim

Ten designers ranked outputs across two disjoint cohorts, supplying 1,600 ratings per criterion plus per-image hallucination flags on the holistic-preference cohorts. A criterion-agnostic signal test using Kendall's tau, majority probability, and Condorcet cycles places designer agreement between food and movie preferences on one side and photo-style image quality on the other. No benchmarked pre-trained system exceeds 0.55 macro agreement with the five-designer majority, while a small head trained directly on TASTE reaches 0.611, which the paper reports as closing roughly half the gap to the single-rater ceiling of 0.741.
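Two of the three statistics named above can be sketched in a few lines. The code below is a minimal illustration, assuming each rater submits a strict ranking encoded as a rank vector (`r[i]` is the position of item `i`, 0 being best) and that the panel size is odd so every pairwise majority is decided. Function names are hypothetical, and the majority-probability statistic is omitted because its definition is not given on this page.

```python
from itertools import combinations, permutations


def kendall_tau(a, b):
    """Kendall's tau between two tie-free rank vectors over the same items."""
    item_pairs = list(combinations(range(len(a)), 2))
    s = sum(1 if (a[i] - a[j]) * (b[i] - b[j]) > 0 else -1 for i, j in item_pairs)
    return s / len(item_pairs)


def mean_pairwise_tau(rankings):
    """Average tau over all rater pairs (10 pairs when there are 5 raters)."""
    rater_pairs = list(combinations(rankings, 2))
    return sum(kendall_tau(a, b) for a, b in rater_pairs) / len(rater_pairs)


def majority_prefers(rankings, i, j):
    """True if a strict majority of raters rank item i ahead of item j."""
    votes = sum(1 for r in rankings if r[i] < r[j])
    return 2 * votes > len(rankings)


def has_condorcet_cycle(rankings):
    """With an odd panel and no ties, the majority relation is a tournament,
    which is intransitive exactly when it contains a 3-cycle."""
    n_items = len(rankings[0])
    return any(
        majority_prefers(rankings, i, j)
        and majority_prefers(rankings, j, k)
        and majority_prefers(rankings, k, i)
        for i, j, k in permutations(range(n_items), 3)
    )


# Classic three-rater paradox: A>B>C, B>C>A, C>A>B (as rank vectors).
cycle_panel = [[0, 1, 2], [2, 0, 1], [1, 2, 0]]
# Five raters, four items, full agreement.
unanimous_panel = [[0, 1, 2, 3]] * 5

print(has_condorcet_cycle(cycle_panel), mean_pairwise_tau(cycle_panel))
print(has_condorcet_cycle(unanimous_panel), mean_pairwise_tau(unanimous_panel))
```

The paradox panel produces a cycle with a mean pairwise tau of -1/3, while the unanimous panel has no cycle and a mean tau of 1.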

What carries the argument

The TASTE dataset of multi-criterion designer rankings paired with a small pairwise-difference head that predicts relative preference between image pairs.
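The page does not describe the head's architecture, so the following is only a sketch of what a pairwise-difference head could look like: a linear scorer applied to the difference of two image embeddings and trained with a logistic loss. The embedding source, dimensions, hyperparameters, and synthetic data are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def train_pairwise_head(emb_a, emb_b, y, lr=0.5, epochs=500, l2=1e-3):
    """Fit w so that sigmoid(w . (emb_a - emb_b)) approximates P(A preferred).

    emb_a, emb_b: (n_pairs, dim) embeddings of the two images in each pair.
    y: (n_pairs,) array of 1.0 where A was preferred, else 0.0.
    """
    diff = emb_a - emb_b
    w = np.zeros(diff.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(diff @ w)))
        grad = diff.T @ (p - y) / len(y) + l2 * w  # logistic loss + L2 penalty
        w -= lr * grad
    return w


def prefers_a(w, emb_a, emb_b):
    """Predicted preference; swapping the inputs flips the sign of the score."""
    return (emb_a - emb_b) @ w > 0


# Synthetic stand-in data (hypothetical): preferences follow a hidden direction.
rng = np.random.default_rng(0)
dim, n_pairs = 16, 400
hidden_w = rng.normal(size=dim)
emb_a = rng.normal(size=(n_pairs, dim))
emb_b = rng.normal(size=(n_pairs, dim))
y = ((emb_a - emb_b) @ hidden_w > 0).astype(float)

w = train_pairwise_head(emb_a[:300], emb_b[:300], y[:300])
held_out_pred = prefers_a(w, emb_a[300:], emb_b[300:])
accuracy = float(np.mean(held_out_pred == (y[300:] == 1.0)))
print(f"held-out agreement on synthetic pairs: {accuracy:.2f}")
```

Scoring the difference rather than each image separately makes the prediction antisymmetric by construction, so the head cannot exhibit position bias between the two slots.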

If this is right

Text-to-image systems can be fine-tuned or guided using separate scores for typography, visual hierarchy, and brief fidelity rather than a single overall preference label.
The signal-test framework can quantify rater agreement in other creative domains such as UI design or product photography.
Future model releases can be ranked by how closely their outputs match the nine separate axes instead of by aggregate human preference.
The per-image hallucination flags collected on the holistic cohorts provide a direct way to measure and penalize factual or semantic errors in generated layouts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the multi-axis scores prove additive, an AI design tool could let users dial typography quality independently of color harmony without retraining the generator.
The dataset's scale of 1,600 ratings per criterion suggests it could serve as a seed for active-learning loops that keep improving the head as more designer labels arrive.
Extending the same annotation protocol to video or 3-D outputs would test whether the current agreement levels and head performance generalize beyond static graphic design.

Load-bearing premise

The five-designer majority vote on the holistic-preference cohorts serves as a stable and representative target for both measuring and training automated scorers.

What would settle it

Retraining the pairwise-difference head on TASTE and evaluating it on a fresh set of designs from the same four models yields macro agreement no higher than 0.55 with the held-out five-designer majority.

Figures

Figures reproduced from arXiv: 2605.20731 by Alexandria Minetti, Allison Nulty, Elad Hirsch, Haonan Zhu, Purvanshi Mehta.

**Figure 1.** Per-prompt T distribution (the mean pairwise Kendall τ across the 10 evaluator pairs) for each TASTE sub-dimension, with horizontal lines showing the median T of the three cross-domain reference anchors. Aesthetics sub-dimensions are shown in blue and Descriptions in green. All nine sub-dimensions sit close to the Sushi (food) and MovieLens (movies) medians, well below the median of HPSv2-test restricted t…

**Figure 2.** Per-sub-dimension Condorcet cycle rate. All nine sub-dimensions sit at or below…

**Figure 3.** Per-sub-dimension mean pairwise τ, sorted descending, color-coded by cohort. Description-fidelity sub-dimensions (green) dominate the top of the ordering; aesthetic sub-dimensions (blue) dominate the bottom. Reference anchor mean τs are shown as dashed lines. Sushi anchor; the five aesthetic criteria cluster below it, with Color Harmony at the bottom. This ordering is consistent with the observation that …

**Figure 4.** Per-evaluator agreeableness on the Aesthetics (left) and Descriptions (right) co…

Original abstract

Text-to-image models produce graphic design at production scale, but their supervision comes from photo-style preference data with a single overall verdict per comparison. Designers evaluate along several distinct axes, including typography, visual hierarchy, color harmony, layout, and brief fidelity, and a single label collapses them. We release TASTE (Typography, Aesthetics, Spatial, Tone, Etc.): ten professional designers ranked outputs from four current text-to-image models on nine criteria across two disjoint cohorts, yielding 1,600 ratings per criterion plus per-image hallucination flags on the holistic-preference cohorts. We pair the dataset with three contributions. First, a criterion-agnostic signal test framework, using Kendall's tau, majority probability, and Condorcet cycles against exact iid-uniform nulls at p = 4 and R = 5, places designer agreement on graphic design between food and movie preferences and photo-style image quality, with every TASTE criterion rejecting the random-rater null. Second, no pre-trained system in our benchmark, including six open-weight VLM judges from 3B to 33B parameters and three dedicated T2I scorers, HPSv2.1, PickScore-v1, and LAION-Aesthetic-V2, exceeds 0.55 macro agreement with the 5-designer majority; VLM judges trade off position bias against content sensitivity, so scaling moves along this frontier without improving accuracy. Third, a small pairwise-difference head trained on TASTE reaches 0.611, closing roughly half the gap to the 0.741 single-rater ceiling.
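The headline metric, macro agreement with the 5-designer majority, is not defined on this page. One plausible reading, sketched below, computes the fraction of comparisons where a scorer's preferred image matches the designers' majority choice within each group (a criterion or cohort), then takes the unweighted mean over groups. The data layout, group names, and function names are hypothetical.

```python
def majority_prefers_a(votes):
    """votes: 0/1 flags, 1 meaning that designer preferred image A."""
    votes = list(votes)
    return 2 * sum(votes) > len(votes)


def macro_agreement(records_by_group):
    """Unweighted mean over groups of per-group agreement with the majority.

    records_by_group maps a group name to a list of
    (scorer_prefers_a, designer_votes) tuples.
    """
    per_group = []
    for records in records_by_group.values():
        hits = [
            scorer_prefers_a == majority_prefers_a(votes)
            for scorer_prefers_a, votes in records
        ]
        per_group.append(sum(hits) / len(hits))
    return sum(per_group) / len(per_group)


# Toy records with five designer votes per comparison (made-up values).
toy = {
    "typography": [
        (True, [1, 1, 1, 0, 0]),
        (False, [1, 1, 0, 1, 0]),
        (False, [0, 0, 0, 1, 1]),
    ],
    "layout": [
        (True, [1, 0, 1, 1, 1]),
        (False, [0, 0, 1, 0, 0]),
    ],
}

score = macro_agreement(toy)
print(f"macro agreement: {score:.4f}")
```

In the toy data the scorer matches 2 of 3 typography majorities and 2 of 2 layout majorities, so the macro average is 5/6, whereas pooling all five comparisons would give 4/5. Macro averaging keeps a group with many comparisons from dominating the score.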

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TASTE gives a new multi-criteria designer dataset for graphic design outputs with concrete benchmarks, but the 5-rater majority target looks noisy and untested for stability.


The paper's main contribution is releasing TASTE, ratings from ten designers on nine axes like typography, layout, and color harmony for text-to-image graphic outputs, plus a signal test that shows real non-random agreement and benchmarks where existing scorers top out at 0.55 macro agreement with the majority. A small pairwise head trained on the data reaches 0.611, against a 0.741 single-rater ceiling. This is new because prior preference datasets were mostly single overall scores on photos, not these design-specific dimensions with the exact statistical checks against uniform nulls.

The work does well at documenting the collection across two cohorts and reporting specific numbers on agreement and model performance without overclaiming. The criterion-agnostic tests and the gap to human levels are useful empirical anchors.

The soft spot is the 5-designer majority as the training and evaluation target. The 0.741 single-rater ceiling already signals substantial disagreement, so the majority label for any image can flip with small rater changes or biases. Without leave-one-out, bootstrap, or per-cohort variance numbers in the abstract, it is unclear how stable that target really is or whether the 0.611 gain is robust or partly fitting noise. Data exclusion rules and exact inter-rater calculations also need the full text to verify.

This paper is for people building or evaluating text-to-image systems aimed at graphic design practice, or anyone collecting multi-dimensional human feedback. A reader working on preference modeling or design tools will get direct value from the dataset and the numbers. It deserves a serious referee because the dataset release and the benchmark results are concrete and internally consistent even if the agreement analysis needs tightening. I would send it to review with a request for stability checks on the majority labels.

Referee Report

2 major / 2 minor

Summary. The paper introduces the TASTE dataset, consisting of rankings by ten professional designers on outputs from four text-to-image models across nine criteria (including typography, visual hierarchy, color harmony, layout, and brief fidelity) in two cohorts, yielding 1,600 ratings per criterion plus hallucination flags. It presents a criterion-agnostic statistical framework using Kendall's tau, majority probability, and Condorcet cycles against iid-uniform nulls, benchmarks existing VLMs and T2I scorers (none exceeding 0.55 macro agreement with the 5-designer majority), and reports a small pairwise-difference head trained on TASTE achieving 0.611 macro agreement, closing roughly half the gap to the 0.741 single-rater ceiling.

Significance. If the empirical measurements hold, TASTE supplies a much-needed multi-axis preference resource for graphic design that goes beyond single-verdict photo-style data, while the benchmark and new head establish concrete baselines showing current systems fall short and that modest task-specific training can improve alignment with designer judgments on axes such as layout and brief fidelity.

major comments (2)

[Abstract] Abstract: the headline claim that the pairwise-difference head at 0.611 'closes roughly half the gap' to the 0.741 single-rater ceiling treats the 5-designer majority as a fixed, stable target. The single-rater ceiling already encodes substantial disagreement; without leave-one-out, bootstrap, or per-cohort variance statistics on majority stability, it remains unclear whether the reported improvement is robust or partly an artifact of label noise in the aggregate.
[Abstract] Abstract and statistical framework section: the signal-test results (every criterion rejects the random-rater null at p=4, R=5) are presented as placing graphic-design agreement between food/movie preferences and photo-style quality, yet the exact data-exclusion rules, inter-rater reliability formula, and handling of ties or missing ratings are not stated, preventing independent verification of the reported agreement numbers.
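The stability concern in the first major comment can be made concrete with a rater bootstrap. The sketch below resamples the five raters with replacement and records how often the resampled majority differs from the full-panel majority. The vote matrix is made up, and the function name and resampling scheme are assumptions rather than the authors' procedure.

```python
import numpy as np


def majority_flip_rate(votes, n_boot=1000, seed=0):
    """Per-comparison share of rater bootstraps whose majority flips.

    votes: (n_comparisons, n_raters) array of 0/1 flags, with an odd number
    of raters so that neither the full nor the resampled panel can tie.
    """
    rng = np.random.default_rng(seed)
    votes = np.asarray(votes)
    n_comparisons, n_raters = votes.shape
    full_majority = 2 * votes.sum(axis=1) > n_raters
    flips = np.zeros(n_comparisons)
    for _ in range(n_boot):
        # Draw a panel of the same size, sampling raters with replacement.
        idx = rng.integers(0, n_raters, size=n_raters)
        boot_majority = 2 * votes[:, idx].sum(axis=1) > n_raters
        flips += boot_majority != full_majority
    return flips / n_boot


# Two made-up comparisons: a unanimous 5-0 vote and a 3-2 split.
toy_votes = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0],
])
rates = majority_flip_rate(toy_votes)
print(rates)
```

A unanimous comparison never flips, whereas a 3-2 split flips whenever at most two of the five resampled raters come from the majority side, which has probability P(Binomial(5, 0.6) <= 2), about 0.317. A target dominated by 3-2 splits would therefore be considerably noisier than one dominated by 5-0 or 4-1 votes.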

minor comments (2)

[Abstract] Abstract: the phrase '1,600 ratings per criterion' should be clarified as total across designers or per cohort to avoid ambiguity.
[Benchmark] Benchmark section: the architecture, training hyperparameters, and exact loss for the 0.611 head should be reported in more detail to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the robustness and reproducibility of our claims regarding the TASTE dataset and its associated benchmarks. We address each point below and have incorporated revisions to strengthen the presentation.


Referee: [Abstract] Abstract: the headline claim that the pairwise-difference head at 0.611 'closes roughly half the gap' to the 0.741 single-rater ceiling treats the 5-designer majority as a fixed, stable target. The single-rater ceiling already encodes substantial disagreement; without leave-one-out, bootstrap, or per-cohort variance statistics on majority stability, it remains unclear whether the reported improvement is robust or partly an artifact of label noise in the aggregate.

Authors: We agree that quantifying the stability of the 5-designer majority is necessary to substantiate the gap-closure claim. In the revised manuscript we add leave-one-out analyses across the ten designers and bootstrap resampling (1,000 iterations) of the majority labels, both per criterion and per cohort. These show that the macro agreement of 0.611 remains within the reported range relative to the single-rater ceiling of 0.741, with the improvement holding after accounting for label variance. The abstract has been updated to reference these additional statistics. revision: yes
Referee: [Abstract] Abstract and statistical framework section: the signal-test results (every criterion rejects the random-rater null at p=4, R=5) are presented as placing graphic-design agreement between food/movie preferences and photo-style quality, yet the exact data-exclusion rules, inter-rater reliability formula, and handling of ties or missing ratings are not stated, preventing independent verification of the reported agreement numbers.

Authors: We acknowledge the omission of these implementation details. The statistical framework section has been expanded to specify: (i) exclusion criteria (images with fewer than three complete rankings or unresolved ties after the tie-breaking protocol are removed), (ii) the inter-rater reliability computation as the average pairwise Kendall's tau with the standard tie correction, and (iii) confirmation that the collected data contain no missing ratings. We also include pseudocode for the exact null-model simulation in the appendix to enable direct replication of the p=4, R=5 rejection results. revision: yes
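The pseudocode mentioned in the response is not reproduced on this page. As a stand-in, the sketch below enumerates an exact iid-uniform null for the mean pairwise Kendall tau at p = 4 items and R = 5 raters, assuming strict rankings with no ties. It uses the fact that tau is unchanged when all raters' items are relabeled together, so the first rater can be fixed at the identity ranking and only the other four enumerated (24^4 cases). Variable names are hypothetical, and the paper's own statistics may be defined differently.

```python
from fractions import Fraction
from itertools import permutations

import numpy as np

P, R = 4, 5  # items per prompt, raters per prompt (from the abstract)
perms = list(permutations(range(P)))  # 24 rank vectors; perms[0] is identity
n = len(perms)


def concordance(a, b):
    """Concordant minus discordant item pairs between two rank vectors."""
    return sum(
        1 if (a[i] - a[j]) * (b[i] - b[j]) > 0 else -1
        for i in range(P)
        for j in range(i + 1, P)
    )


D = np.array([[concordance(a, b) for b in perms] for a in perms], dtype=np.int64)
d0 = D[0]  # concordance of each ranking with the fixed identity rater

# Sum concordance over all 10 rater pairs for every choice of raters 2..5.
S = (
    d0[:, None, None, None] + d0[None, :, None, None]
    + d0[None, None, :, None] + d0[None, None, None, :]
    + D[:, :, None, None] + D[:, None, :, None] + D[:, None, None, :]
    + D[None, :, :, None] + D[None, :, None, :] + D[None, None, :, :]
)

item_pairs = P * (P - 1) // 2   # 6
rater_pairs = R * (R - 1) // 2  # 10
values, counts = np.unique(S, return_counts=True)
null_pmf = {
    Fraction(int(v), item_pairs * rater_pairs): Fraction(int(c), n ** (R - 1))
    for v, c in zip(values, counts)
}


def p_value(t_obs):
    """One-sided exact p-value: P(T >= t_obs) under the uniform null."""
    return sum(prob for t, prob in null_pmf.items() if t >= t_obs)


print(float(p_value(Fraction(1, 2))))
```

Exact rational arithmetic avoids Monte Carlo error. Under this null the mean of T is exactly 0, and perfect agreement (T = 1) has probability 1/24^4.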

Circularity Check

0 steps flagged

No circularity: empirical dataset release and direct measurements on new annotations


The paper collects fresh multi-criteria rankings from ten designers on outputs from four T2I models, yielding 1,600 ratings per criterion plus hallucination flags. It computes agreement statistics (Kendall's tau, majority probability, Condorcet cycles) against exact iid-uniform nulls, benchmarks six VLMs and three dedicated scorers against the 5-designer majority on the holistic cohorts, and trains a small pairwise-difference head on TASTE to reach 0.611 macro agreement. No equations, self-definitional loops, or fitted-parameter renamings appear; the reported figures are direct comparisons to the collected labels and standard supervised training on those labels. The derivation chain consists of data collection followed by independent empirical evaluation and model fitting, with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard statistical tests for ranking agreement and on the assumption that professional designer judgments constitute useful supervision; no free parameters or new invented entities are introduced.

axioms (1)

[standard math] Kendall's tau, majority probability, and Condorcet cycle detection against exact iid-uniform nulls at p=4 and R=5 constitute a valid criterion-agnostic test for signal in preference data.
Invoked to place designer agreement relative to random and to other domains.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "a small pairwise-difference head trained on TASTE reaches 0.611 macro agreement with the 5-designer majority"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "every TASTE criterion rejecting the random-rater null"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

E-comIQ-ZH: A human-aligned dataset and benchmark for fine- grained evaluation of E-commerce posters with chain-of-thought

Anonymous et al. E-comIQ-ZH: A human-aligned dataset and benchmark for fine- grained evaluation of E-commerce posters with chain-of-thought. arXiv preprint arXiv:2602.21698, 2026

work page arXiv 2026

[2] [2]

FLUX: Open-source text-to-image generation models

Black Forest Labs. FLUX: Open-source text-to-image generation models. Technical report,https://blackforestlabs.ai, 2024

work page 2024

[3] [3]

Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

work page 1952

[4] [4]

Rank analysis of incomplete block designs: I

Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

work page 1952

[5] [5]

Learning vi- sual importance for graphic designs and data visualizations

Zoya Bylinskii, Nam Wook Kim, Peter O’Donovan, Sami Alsheikh, Spandan Madan, Hanspeter Pfister, Fredo Durand, Bryan Russell, and Aaron Hertzmann. Learning vi- sual importance for graphic designs and data visualizations. InProceedings of the 30th Annual ACM Symposium on User Interface Software and Technology (UIST), 2017

work page 2017

[6] [6]

Seedream: Native high-resolution bilingual image generation foundation model

ByteDance. Seedream: Native high-resolution bilingual image generation foundation model. Technical report, ByteDance, 2024. 20ZHU ET AL.: TASTE: DESIGNER-ANNOTA TED PREFERENCE FOR AI DESIGN

work page 2024

[7] Zijian Chen, Wei Sun, Yuan Tian, Jun Jia, Zicheng Zhang, Jiarui Wang, Ru Huang, Xiongkuo Min, Guangtao Zhai, and Wenjun Zhang. Exploring the naturalness of AI-generated images. IEEE Transactions on Circuits and Systems for Video Technology, 2024

[8] Adrienne Deganutti, Elad Hirsch, Haonan Zhu, Jaejung Seol, and Purvanshi Mehta. Graphic-Design-Bench: A comprehensive benchmark for evaluating AI on graphic design tasks. arXiv preprint arXiv:2604.04192, 2026

[9] Camilo Fosco, Vincent Casser, Amish Kumar Bedi, Peter O’Donovan, Aaron Hertzmann, and Zoya Bylinskii. Predicting visual importance across graphic design types. arXiv preprint arXiv:2008.02912, 2020

[10] Sang gil Lim, Heesoo Jung, Choonghan Kim, Hyunwoo Park, Hwanhee Lee, and Pilsung Kang. I-HallA: Evaluating image hallucination in text-to-image generation with question answering. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2025

[11] Google DeepMind. Imagen and the Nano-Banana image generator. Technical report, Google DeepMind, 2024

[12] Divyanshu Goyal, Akhil Eppa, and Vanya Bannihatti Kumar. DistortBench: Benchmarking vision language models on image distortion identification. arXiv preprint arXiv:2604.19966, 2026

[13] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19:1–19:19, 2015

[14] Elad Hirsch, Shubham Yadav, Mohit Garg, and Purvanshi Mehta. LICA: Layered image composition annotations for graphic design research. arXiv preprint arXiv:2603.16098, 2026

[15] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogniti...

[16] Toshihiro Kamishima. Nantonac collaborative filtering: Recommendation based on order responses. Technical report, ACM SIGKDD, 2003. Sushi preference dataset

[17] Maurice G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938

[18] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. In Advances in Neural Information Processing Systems (NeurIPS), 2023

[19] Klaus Krippendorff. Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research, 30(3):411–433, 2004

[20] Klaus Krippendorff. Content Analysis: An Introduction to Its Methodology. SAGE, 4th edition, 2018

[21] Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, and Deva Ramanan. GenAI-Bench: Evaluating and improving compositional text-to-visual generation. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024

[22] Chunyi Li, Zicheng Zhang, Haoning Wu, Wei Sun, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai, and Weisi Lin. AGIQA-3K: An open database for AI-generated image quality assessment. IEEE Transactions on Circuits and Systems for Video Technology, 2023

[23] Chunyi Li, Tengchuan Kou, Yixuan Gao, Yuqin Cao, Wei Sun, Zicheng Zhang, Yingjie Zhou, Zhichao Zhang, Weixia Zhang, Haoning Wu, Xiaohong Liu, Xiongkuo Min, and Guangtao Zhai. AIGIQA-20K: A large database for AI-generated image quality assessment. In CVPR Workshops (NTIRE), 2024

[24] Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking. arXiv, 2026

[25] Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, et al. Rich human feedback for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

[26] Hao Liu et al. A survey of multimodal hallucination evaluation and detection. arXiv preprint arXiv:2507.19024, 2025

[27] R. Duncan Luce. Individual Choice Behavior: A Theoretical Analysis. John Wiley & Sons, 1959

[28] Yuhang Ma, Xiaoshi Wu, Keqiang Sun, Hongsheng Li, et al. HPSv3: Towards wide-spectrum human preference score. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

[29] C. L. Mallows. Non-null ranking models. Biometrika, 44(1/2):114–130, 1957

[30] OpenAI. GPT-Image and DALL·E 3: Text-to-image generation. Technical report, OpenAI, 2024

[31] Yi-Hao Peng, Jeffrey P. Bigham, and Jason Wu. DesignPref: Capturing personal preferences in visual design generation. arXiv preprint arXiv:2511.20513, 2025

[32] Robin L. Plackett. The analysis of permutations. Journal of the Royal Statistical Society: Series C, 24(2):193–202, 1975

[33] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems (NeurIPS), 2023

[34] Tao Sun et al. ImagenWorld: Stress-testing image generation models with explainable human evaluation on open-ended real-world tasks. arXiv preprint arXiv:2603.27862, 2026

[35] Yu Tian, Tianqi Liu, Zhiyuan Liu, Jie Yang, and Cordelia Schmid. Identifying and mitigating position bias of multi-image vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2503.13792

[36] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023

[37] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In Advances in Neural Information Processing Systems (NeurIPS), 2023

[38] Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qinkai Li, Mingyi Zhang, et al. VisionReward: Fine-grained multi-dimensional human preference learning for image and video generation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026

[39] Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, and Zhongyuan Wang. Learning multi-dimensional human preference for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

[40] Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Wang, Jingwen He, Fan Zhang, Yuanhan Zhang, Jingkang Yang, Yu Qiao, and Ziwei Liu. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025

[41] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023

[42] Xingxing Zou, Wen Zhang, and Nanxuan Zhao. From fragment to one piece: A survey on AI-driven graphic design. arXiv preprint arXiv:2503.18641, 2025

A Statistical definitions

This appendix collects the formal definitions, support sets, and null PMFs of the three signal-test statistics introduced in §4. Notation: a sample is one prompt rated by R=5 evaluato...