arxiv: 2601.16836 · v3 · submitted 2026-01-23 · 💻 cs.CV · cs.CL

ColorConceptBench: A Benchmark for Probabilistic Color-Concept Understanding in Text-to-Image Models

Chenxi Ruan, Yihan Hou, Yu Xiao, Guosheng Hu, Wei Zeng

Pith reviewed 2026-05-16 11:57 UTC · model grok-4.3


keywords text-to-image models, color-concept associations, benchmark, implicit concepts, probabilistic distributions, model evaluation, abstract semantics


The pith

Text-to-image models vary widely in color associations for implicit concepts and show low sensitivity to abstract semantics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces ColorConceptBench, an expert-annotated benchmark that supplies probabilistic color distributions for 1,281 implicit concepts drawn from human judgments. It evaluates nine leading text-to-image models on these distributions and finds that accuracy differs markedly by semantic category, with particular weakness on abstract ideas such as emotions and visual states. The gaps remain even after applying classifier-free guidance scaling during generation. A reader would care because many everyday text prompts imply colors through concepts rather than naming them outright, so closing this gap would make generated images more faithful to nuanced descriptions.

Core claim

ColorConceptBench supplies human-derived probabilistic color distributions for 1,281 implicit concepts and uses them to show that nine current text-to-image models produce color associations that vary substantially across semantic categories and display marked insensitivity to abstract semantics; these shortcomings persist under classifier-free guidance scaling and indicate that models require changes in how they acquire and represent implicit meaning.

What carries the argument

ColorConceptBench, a dataset of 6,584 expert annotations that define probabilistic color distributions as ground truth for implicit concepts.

If this is right

Model performance varies substantially across semantic categories.
Models exhibit a significant lack of sensitivity to abstract semantics.
These limitations persist even when applying classifier-free guidance scaling at inference time.
Human-like color understanding requires a shift in how models learn and represent implicit semantic meaning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Training data that explicitly pairs implicit concepts with color statistics could reduce the observed gaps.
Creative tools that generate images from emotional or state-based prompts would gain reliability if models adopted the benchmark's distributions.
The same evaluation approach could be extended to other implicit attributes such as texture or lighting.
Architectural differences among the nine models likely influence how well implicit color associations are encoded.

Load-bearing premise

The 6,584 expert annotations accurately capture human probabilistic color-concept associations for the 1,281 implicit concepts and serve as reliable ground truth.

What would settle it

Re-annotating the same 1,281 concepts with a new, larger panel of humans and obtaining substantially different color probability distributions would undermine the benchmark's ground truth.

Figures

Figures reproduced from arXiv: 2601.16836 by Chenxi Ruan, Guosheng Hu, Wei Zeng, Yihan Hou, Yu Xiao.

**Figure 1.** Unlike explicit color matching (top), ColorConceptBench evaluates implicit semantic alignment using probabilistic color distributions (bottom).

… 2025) rely on explicit color specifications, such as color names (e.g., ‘green’) or color codes (e.g., ‘#00FF00’) within the text prompt (e.g., ‘A clipart of {color} forest’) during the generation phase. The subsequent evaluation relies on deterministic verificat…

**Figure 2.** Dataset Statistics and Construction Pipeline. An overview of the hierarchical concept distribution and our …

**Figure 3.** Qualitative comparison of color-concept association across different text-to-image models. Colors shift …

**Figure 4.** Models consistently exhibit lower color shift …

**Figure 5.** Impact of Guidance Scale. Increasing the …

**Figure 6.** A gallery of our human-annotated dataset.

**Figure 7.** Misalignment color-concept association with …

**Figure 8.** Annotation system interface.

… select one sketch that best satisfies predefined criteria. Images with artifacts, messy backgrounds, or ambiguous structures are discarded (see …

**Figure 9.** Our sketch …

… conducted to document the participant’s reasoning and intended interpretation. If the two rounds showed noticeable discrepancies, we further investigated potential causes such as misunderstanding of the target concept or misinterpretation of the provided prompt. The participant was then asked to repeat the annotation process until a stable and self-consistent result was obtained. This iterati…

**Figure 10.** Quality control system interface.

… as either Consistent or Inconsistent with the semantic meaning of the concept. Decision Rule: Concepts that fail to secure a majority vote are removed from the final dataset. Finally, we perform quality control on 562 colored images and discard 36 inconsistent samples. This hybrid approach ensures that our dataset retains rich, diverse color associations for abstract c…

**Figure 11.** Examples of quality control for human-grounded color annotations. For concepts with high inter-annotator variance, we conducted a blind expert verification Y/N task. Samples marked in red (left column) were identified as semantically inconsistent outliers (e.g., a "forbidding castle" colored in bright pastels) and excluded. The retained instances (right columns) preserve the diverse but valid color dist…

**Figure 12.** Color grounding using SAM.

| Method | Style: Clipart | Style: Natural |
| --- | --- | --- |
| Flux | 0.665 | 0.683 |
| OmniGen2 | 0.904 | 0.734 |
| Qwen-Image | 0.868 | 1.029 |
| Sana | 0.722 | 0.627 |
| SD 3 | 0.787 | 0.771 |
| SD 3.5 | 0.864 | 0.767 |
| SD XL | 0.752 | 0.579 |

**Figure 13.** Human judgment system interface.

| Method | Modifier: Visual State | Modifier: Emotional |
| --- | --- | --- |
| Flux | 0.676 | 0.658 |
| OmniGen2 | 0.830 | 0.739 |
| Qwen-Image | 0.979 | 0.739 |
| Sana | 0.679 | 0.644 |
| SD 3 | 0.786 | 0.734 |
| SD 3.5 | 0.826 | 0.739 |
| SD XL | 0.676 | 0.596 |

**Figure 14.** Qualitative results. Compared to the Human …

Abstract

Text-to-image (T2I) models have advanced considerably in generating high-quality images from textual descriptions. However, their ability to associate colors with concepts remains largely constrained to explicit color names or codes, while their capacity to handle implicit concepts, such as emotions and visual states, remains underexplored. To address this gap, we introduce ColorConceptBench, an expert-annotated benchmark that systematically evaluates color-concept associations through probabilistic color distributions. ColorConceptBench moves beyond explicit color specifications by examining how models interpret 1,281 implicit color concepts, grounded in 6,584 human annotations. Our evaluation of nine leading T2I models reveals that performance varies substantially across semantic categories, and models exhibit a significant lack of sensitivity to abstract semantics. These limitations persist even when applying classifier-free guidance scaling at inference time, suggesting that achieving human-like color understanding demands a shift in how models learn and represent implicit semantic meaning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ColorConceptBench introduces a new probabilistic benchmark for implicit color concepts in T2I models, but the abstract leaves annotation reliability and statistical details too thin to judge the strength of the main claims.


The one thing to know is that this paper brings a new benchmark called ColorConceptBench with 1,281 implicit concepts and 6,584 expert annotations that treat color associations as probability distributions rather than single labels. It then tests nine current T2I models and reports that they vary a lot by semantic category and show little sensitivity to abstract concepts even when classifier-free guidance is scaled up. That pattern is the core empirical result.

What is actually new is the shift to probabilistic, implicit color semantics at this scale; prior work mostly stuck to explicit color names or codes, so the dataset and the cross-model comparison fill a clear gap. The paper does a clean job stating the motivation and framing why abstract semantics matter for realistic image generation. The evaluation setup looks straightforward on the surface and the findings are presented without overclaiming.

The soft spots sit in the missing details. The abstract gives no information on how experts were recruited, what exact prompts or instructions they received, how the probabilistic judgments were aggregated, or what inter-annotator agreement looked like. Without those pieces, the headline claim that models lack sensitivity rests on treating the 6,584 annotations as faithful ground truth, and any systematic bias or low agreement there would weaken the reported gaps. No error bars, data splits, or statistical tests are mentioned either, which makes it hard to gauge how robust the category-wise differences really are.

This is the kind of work that belongs in a reading group for people who build or benchmark text-to-image systems. Anyone working on multimodal semantics or evaluation datasets will find the resource and the reported trends useful even before the methods are fully vetted.

I would send it to peer review rather than desk-reject; the benchmark idea is worth referee time, but the authors need to supply the annotation protocol and basic stats before the conclusions can be taken as solid.

Referee Report

2 major / 2 minor

Summary. The paper introduces ColorConceptBench, an expert-annotated benchmark consisting of 6,584 human annotations over 1,281 implicit color concepts (e.g., emotions and visual states). It evaluates nine leading text-to-image models on their ability to produce images whose color distributions match the human-annotated probabilistic ground truth, reporting substantial performance variation across semantic categories and a lack of sensitivity to abstract semantics that persists under classifier-free guidance scaling.

Significance. If the evaluation protocol and ground-truth annotations prove reliable, the benchmark would provide a useful diagnostic for a previously underexplored limitation in T2I models: their inability to capture probabilistic color associations with implicit rather than explicit concepts. The persistence of the deficit under guidance scaling would strengthen the case that current training regimes do not adequately encode abstract semantic color knowledge.

major comments (2)

[Benchmark Construction] Benchmark Construction / Annotation Protocol: The claim that the 6,584 expert annotations constitute reliable probabilistic ground truth for 1,281 implicit concepts is load-bearing for all reported performance gaps and the insensitivity conclusion. No inter-annotator agreement statistics, validation against larger crowdsourced studies, or analysis of prompt-phrasing effects are provided, leaving open the possibility that systematic biases in expert selection or aggregation produce the observed model deficiencies.
[Evaluation] Evaluation Methodology: The central finding of 'significant lack of sensitivity to abstract semantics' is stated without error bars, statistical significance tests, data-split details, or confidence intervals on the per-category scores. Without these, it is impossible to determine whether the reported variation across semantic categories and the null effect of guidance scaling exceed what would be expected from annotation noise alone.

minor comments (2)

[Abstract] The abstract and introduction would benefit from a concise table summarizing the nine evaluated models, their training data scale, and the exact metric used to compare generated color distributions against the human annotations.
[Methods] Notation for the probabilistic color distributions (e.g., how the 6,584 annotations are aggregated into per-concept histograms) should be defined explicitly with an equation in the methods section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point-by-point below and will revise the manuscript to strengthen the presentation of annotation reliability and statistical analysis.


Referee: [Benchmark Construction] Benchmark Construction / Annotation Protocol: The claim that the 6,584 expert annotations constitute reliable probabilistic ground truth for 1,281 implicit concepts is load-bearing for all reported performance gaps and the insensitivity conclusion. No inter-annotator agreement statistics, validation against larger crowdsourced studies, or analysis of prompt-phrasing effects are provided, leaving open the possibility that systematic biases in expert selection or aggregation produce the observed model deficiencies.

Authors: We agree that demonstrating annotation reliability is essential. In the revised manuscript we will add inter-annotator agreement statistics (Fleiss' kappa and Krippendorff's alpha) computed on the subset of concepts annotated by multiple experts. We will also report an analysis of prompt-phrasing sensitivity by re-evaluating a random sample of concepts with paraphrased prompts. A full-scale crowdsourced validation study lies outside the scope of the current revision, but we will explicitly discuss the choice of expert annotators and potential selection biases, and we will list the absence of such a study as a limitation of the benchmark. revision: partial
Referee: [Evaluation] Evaluation Methodology: The central finding of 'significant lack of sensitivity to abstract semantics' is stated without error bars, statistical significance tests, data-split details, or confidence intervals on the per-category scores. Without these, it is impossible to determine whether the reported variation across semantic categories and the null effect of guidance scaling exceed what would be expected from annotation noise alone.

Authors: We accept this criticism and will substantially expand the evaluation section. The revision will include (i) error bars (standard deviation across concepts and bootstrap confidence intervals) on all per-category and guidance-scaling plots, (ii) statistical significance tests (ANOVA for category differences and paired t-tests for guidance scaling), (iii) explicit description of the evaluation splits and aggregation procedure, and (iv) a short analysis comparing observed differences against simulated annotation noise. These additions will allow readers to assess whether the reported insensitivity exceeds what annotation variability alone would produce. revision: yes
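As an illustration of item (i), the following is a minimal sketch of a percentile bootstrap confidence interval for a per-category mean score. It is not the authors' analysis code: the function name `bootstrap_ci`, the resample count, and the example scores are all assumptions made for this sketch.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-concept scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample concepts with replacement and record the resample mean.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return statistics.fmean(scores), means[lo_idx], means[hi_idx]


# Hypothetical per-concept scores for one semantic category (not from the paper).
example_scores = [0.61, 0.70, 0.66, 0.74, 0.58, 0.69, 0.72, 0.65]
mean, lower, upper = bootstrap_ci(example_scores)
print(f"mean={mean:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

Resampling at the concept level, as here, treats concepts as the unit of variation; resampling annotators or generated images instead would answer a different question.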

Circularity Check

0 steps flagged

No circularity: benchmark uses external human annotations as independent ground truth


The paper's derivation chain consists of collecting 6,584 expert annotations to define probabilistic color distributions for 1,281 implicit concepts, then directly comparing T2I model outputs against these fixed external distributions. No parameters are fitted to model predictions, no self-citations bear load on the central claims, and no ansatz or uniqueness result is smuggled in. The reported performance gaps and insensitivity findings are empirical comparisons to an independently sourced ground truth rather than reductions by construction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of human annotations as ground truth and the assumption that the selected implicit concepts are representative.

axioms (1)

domain assumption: Expert human annotations provide reliable probabilistic ground truth for color-concept associations.
The benchmark and all model comparisons depend on the 6,584 annotations being accurate representations of human judgment.

pith-pipeline@v0.9.0 · 5471 in / 1174 out tokens · 50057 ms · 2026-05-16T11:57:18.142093+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

[1]

Ahnaf Mozib Samin, M

The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121. Ahnaf Mozib Samin, M. Firoz Ahmed, and Md Mushtaq Shahriyar Rafee. 2025. ColorFoil: Investigating color blindness in large vision and language models. In Proceedings of the North American Chapter of the Association for Computational Linguist...

[2]

Qwen-Image Technical Report

Clex: a lexicon for exploring color, concept and emotion associations in language. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, pages 306–314. Anna Wierzbicka. 1990. The meaning of color terms: semantics, culture, and cognition. Cognitive Linguistics, pages 99–150. Chenfei Wu, Jiahao Li, Jing...

[3]

The colors should reflect realistic associations rather than cartoonish or childlike styles

Color Selection Guidelines. Annotators were asked to use the most common colors associated with each concept according to their own memory and understanding. The colors should reflect realistic associations rather than cartoonish or childlike styles

[4]

grassland

Task-specific color association. Colors must be applied strictly according to the given concept. Annotators should focus on the concept-relevant parts of the image, leaving unrelated areas uncolored. For example, for the concept “grassland”, only the grass should be colored; mountains, sky, and other elements should remain blank. For the concept “coffee”,...

[5]

Reference examples. Example images were provided to help annotators understand the target coloring goals

[6]

Consent Form. All annotators were required to read and sign an informed consent form before beginning the experiment, ensuring that participation was voluntary and ethically compliant

[7]

These initial annotations were manually reviewed to ensure compliance with the instructions before proceeding to the formal experiment

Pre-experiment practice. Before beginning the main annotation task, each annotator completed a small pre-experiment of three images. These initial annotations were manually reviewed to ensure compliance with the instructions before proceeding to the formal experiment. Payment. We pay each participant $15 for their time and effort. B.4 Quality Control. To ...

[8]

We first conducted a manual review process to identify annotations that may deviate from common-sense or domain-consistent interpretations of the corresponding concepts

Qualitative Review and Iterative Verification. We first conducted a manual review process to identify annotations that may deviate from common-sense or domain-consistent interpretations of the corresponding concepts. All colorized results were examined by domain experts with experience in visual semantics and design. For annotations considered potentia...

[9]

Specifically, for each concept c, we extracted the color distributions in the UW71 space from the five annotated images, denoted as {p1,

Quantitative Consistency Check. To guarantee high-quality ground truth, each concept was independently annotated by five professional designers who underwent specific training for this task. Specifically, for each concept c, we extracted the color distributions in the UW71 space from the five annotated images, denoted as {p1, …, p5}. We then com...
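The excerpt above is cut off before it says how the five distributions are compared. One natural reading, given the "average EMD scores" mentioned elsewhere in the extracted text, is a mean pairwise distance between annotators. The sketch below shows that aggregation only; it deliberately uses total variation distance as a simple stand-in, because a faithful earth mover's distance needs a ground-distance matrix over the palette that is not given here. The function names and toy distributions are assumptions.

```python
from itertools import combinations


def total_variation(p, q):
    """Stand-in distance between two discrete distributions (NOT the EMD)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def mean_pairwise_distance(distributions, distance=total_variation):
    """Average the distance over all unordered pairs of annotators."""
    pairs = list(combinations(distributions, 2))
    return sum(distance(p, q) for p, q in pairs) / len(pairs)


# Toy 4-bin distributions for five annotators (hypothetical, not UW71 data).
annotations = [
    [0.7, 0.2, 0.1, 0.0],
    [0.6, 0.3, 0.1, 0.0],
    [0.7, 0.1, 0.2, 0.0],
    [0.5, 0.3, 0.1, 0.1],
    [0.6, 0.2, 0.2, 0.0],
]
score = mean_pairwise_distance(annotations)
print(f"mean pairwise distance over 10 pairs: {score:.3f}")
```

Passing a real EMD implementation as the `distance` argument would leave the aggregation unchanged.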

[10]

long-tail

Expert Verification for High-Variance Concepts. Recognizing that abstract or complex concepts may naturally exhibit higher variance (e.g., ‘rotten apple’ and ‘lonely cabin’), we perform a targeted review on the “long-tail” data. We identify the top 10% of concepts with the highest average EMD scores, indicating the lowest agreement. To distinguish be...
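The top-10% selection described above can be sketched as a simple ranking step. The scores below are invented for illustration, and the rounding rule used to turn 10% into a count is an assumption, since the excerpt does not state one.

```python
def flag_high_variance(emd_by_concept, fraction=0.10):
    """Return the concepts with the highest average EMD (lowest agreement)."""
    # Assumption: round to the nearest whole concept, keeping at least one.
    k = max(1, round(fraction * len(emd_by_concept)))
    ranked = sorted(emd_by_concept, key=emd_by_concept.get, reverse=True)
    return ranked[:k]


# Hypothetical average EMD scores; the concept names appear in the source text.
emd_scores = {
    "rotten apple": 0.41, "lonely cabin": 0.47, "fresh banana": 0.12,
    "elephant": 0.15, "butterfly": 0.30, "blackberry": 0.10,
    "dreamy castle": 0.38, "frozen waterfall": 0.22,
    "ancient cabin": 0.27, "brooding grassland": 0.35,
}
flagged = flag_high_variance(emd_scores)
print(flagged)
```

Only the flagged concepts would then go on to the expert verification step.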

[11]

forbidding castle

Inter-Expert Agreement Analysis. To quantify inter-expert agreement during the binary validation process, we further analyze the distribution of expert votes across all reviewed images. Each image is independently evaluated by three experts and labeled as either Consistent or Inconsistent with respect to the semantic meaning of the concept. As shown ...

[12]

This choice enables us to evaluate if the model captures the colors of concepts universally, or if its performance is biased towards a specific visual style

Visual Style. We explore natural photo and clipart cartoon as two common domains. This choice enables us to evaluate if the model captures the colors of concepts universally, or if its performance is biased towards a specific visual style

[13]

Classifier-Free Guidance (CFG). We focus on the CFG scale, a hyperparameter that controls the trade-off between alignment to the text prompt and image diversity. To investigate whether the guidance scale influences the model’s color-concept association or the diversity of color selection, we evaluate the model across 7 distinct guidance scales. This...

[14]

A [Style] of a [Adjective] [Object], centered composition

Sampling Strategy. For each unique combination of concept, style, and guidance scale, we generate 5 independent samples at a resolution of 1024×1024 with 50 inference steps, using distinct random seeds. C.2 Prompt Templates. We utilize standardized prompt templates to trigger concept generation across different styles: • Implicit Association (Ours): “A [St...
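The sampling grid described above can be sketched as a job list. The template string follows the snippet quoted for this entry, which may itself be truncated. The two guidance-scale values in the example are placeholders, because the seven scales used in the paper are cut off in this excerpt. `build_jobs` and the seed scheme are assumptions, and no image-generation library is called.

```python
from itertools import product

PROMPT_TEMPLATE = "A {style} of a {adjective} {obj}, centered composition"
SAMPLES_PER_SETTING = 5      # stated in the text
RESOLUTION = (1024, 1024)    # stated in the text
INFERENCE_STEPS = 50         # stated in the text


def build_jobs(concepts, styles, guidance_scales, base_seed=0):
    """Enumerate one job per (concept, style, guidance scale, sample)."""
    jobs = []
    for (adjective, obj), style, scale in product(concepts, styles, guidance_scales):
        prompt = PROMPT_TEMPLATE.format(style=style, adjective=adjective, obj=obj)
        for _ in range(SAMPLES_PER_SETTING):
            jobs.append({
                "prompt": prompt,
                "guidance_scale": scale,
                "seed": base_seed + len(jobs),  # distinct seed per job (assumed scheme)
                "width": RESOLUTION[0],
                "height": RESOLUTION[1],
                "steps": INFERENCE_STEPS,
            })
    return jobs


jobs = build_jobs(
    concepts=[("dreamy", "castle"), ("frozen", "waterfall")],
    styles=["natural photo", "clipart cartoon"],
    guidance_scales=[3.0, 7.0],  # placeholder values, not the paper's
)
print(len(jobs), jobs[0]["prompt"])
```

Each dictionary could then be handed to whichever generation pipeline is under evaluation.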

[15]

We calculate the CIEDE2000 distance between the dominant colors of different images, clustering images into the same visual group if their dominant colors are perceptually similar (∆E00 ≤ 12)
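The grouping step can be sketched as threshold-based clustering over dominant colors in CIELAB. Two things here are assumptions rather than the paper's method. First, the greedy "compare against each group's first member" rule, since the excerpt does not say how clusters are formed. Second, the distance function: the sketch uses CIE76 (Euclidean distance in Lab) as a simpler stand-in for CIEDE2000, so the threshold of 12 is not calibrated for it.

```python
import math

THRESHOLD = 12.0  # the ∆E00 ≤ 12 cutoff quoted in the text


def delta_e76(lab1, lab2):
    """CIE76 colour difference: a stand-in for CIEDE2000, not equivalent to it."""
    return math.dist(lab1, lab2)


def group_by_dominant_color(dominant_labs, distance=delta_e76, threshold=THRESHOLD):
    """Greedily assign each image to the first group whose representative is close."""
    groups = []  # each group is a list of image indices; index 0 is the representative
    for idx, lab in enumerate(dominant_labs):
        for group in groups:
            if distance(dominant_labs[group[0]], lab) <= threshold:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups


# Hypothetical dominant colours (L*, a*, b*) for four generated images.
labs = [(50, 60, 40), (52, 58, 43), (80, -40, 60), (51, 65, 38)]
groups = group_by_dominant_color(labs)
print(groups)
```

A CIEDE2000 implementation can be passed in through the `distance` parameter without changing the grouping logic.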

[16]

Within each group, we aggregate pixels and quantize them using 8×8×8 RGB bins
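Quantizing into 8×8×8 RGB bins can be sketched as integer division of each channel, followed by a normalized count. The sketch assumes 8-bit channels (0 to 255) and uniform bin widths of 32, neither of which is stated in the excerpt. The helper names are illustrative.

```python
from collections import Counter

BINS_PER_CHANNEL = 8
BIN_WIDTH = 256 // BINS_PER_CHANNEL  # 32 levels per bin, assuming 8-bit channels


def quantize(pixel):
    """Map an (r, g, b) pixel to its bin index along each channel."""
    r, g, b = pixel
    return (r // BIN_WIDTH, g // BIN_WIDTH, b // BIN_WIDTH)


def color_histogram(pixels):
    """Normalized histogram over the occupied bins (at most 512 of them)."""
    counts = Counter(quantize(p) for p in pixels)
    total = sum(counts.values())
    return {bin_: n / total for bin_, n in counts.items()}


def bin_center(bin_):
    """Representative RGB colour at the middle of a bin."""
    return tuple(i * BIN_WIDTH + BIN_WIDTH // 2 for i in bin_)


# Hypothetical pixels aggregated from one visual group.
pixels = [(255, 0, 0), (250, 10, 5), (0, 0, 255), (240, 20, 20)]
hist = color_histogram(pixels)
print(hist, bin_center((7, 0, 0)))
```

The most frequent bins in such a histogram are natural candidates for the later top-colors step.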

[17]

selecting the model whose color distribution best matches the human ground-truth

The top 20 colors undergo a final merging process where color centers indistinguishable … [Figure 12: Color grounding using SAM. Panels: Original and Segmentation, for Clipart and Natural Image; example concepts: dreamy castle, brooding grassland, butterfly, fresh banana, frozen waterfall, ancient cabin, elephant, blackberry.] Method Style Clipart Natural Flux 0.665 0.683 OmniGen2 0.90...
