pith. machine review for the scientific record.

arxiv: 2604.06422 · v1 · submitted 2026-04-07 · 💻 cs.CL · cs.AI · cs.CV

Recognition: no theorem link

When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don't

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:20 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV
keywords vision-language models · introspective reasoning · faithfulness · color attribution · Graded Color Attribution dataset · trustworthy AI · self-knowledge

The pith

Vision-language models systematically violate the color-labeling rules they state for themselves, while humans follow their own rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Graded Color Attribution dataset of line drawings with controlled pixel color coverage to elicit thresholds for when an object counts as a certain color and to test whether participants stick to those thresholds in later judgments. It shows that models contradict their own stated thresholds (GPT-5-mini does so in nearly 60 percent of cases for objects with strong color associations), whereas humans stay consistent and their few departures match known tendencies to overestimate color presence. Models prove accurate at judging actual pixel coverage yet still override their own reasoning in final answers, and world knowledge worsens this inconsistency in models but not in people. The work therefore questions whether VLM failures stem mainly from task difficulty and points to miscalibrated self-knowledge as a separate problem for reliable use.

Core claim

Using line drawings that vary color coverage across world-knowledge recolorings, counterfactual recolorings, and objects without color priors, the study elicits minimum pixel thresholds for color labels from both VLMs and humans, then measures adherence to those thresholds. VLMs violate their stated rules (GPT-5-mini does so in nearly 60 percent of cases on objects with strong color priors), while humans remain faithful, with deviations explained by overestimation of color coverage. VLMs estimate color coverage accurately yet contradict their own reasoning in final responses, and world-knowledge priors degrade faithfulness for models in ways that do not occur for humans.

What carries the argument

The Graded Color Attribution (GCA) dataset of line drawings that systematically vary pixel-level color coverage to elicit and test adherence to color-label thresholds across different prior conditions.
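To make the coverage manipulation concrete, here is a minimal sketch of how a graded-coverage stimulus could be generated by recoloring a controlled fraction of an object's mask pixels. This is an editorial illustration, not the authors' construction pipeline: the mask source, the random pixel selection, and the file names are all assumptions.

```python
# Minimal sketch: recolor a controlled fraction of an object's mask pixels.
# Editorial illustration only; the mask source, random pixel selection, and
# file names are assumptions, not the authors' construction pipeline.
import numpy as np
from PIL import Image

def recolor_fraction(image_path, mask, color, coverage, seed=0):
    """Recolor `coverage` (0..1) of the pixels inside `mask` with RGB `color`."""
    rng = np.random.default_rng(seed)
    img = np.array(Image.open(image_path).convert("RGB"))
    ys, xs = np.nonzero(mask)                       # pixel coordinates of the object
    n = int(round(coverage * len(ys)))              # how many object pixels to paint
    idx = rng.choice(len(ys), size=n, replace=False)
    img[ys[idx], xs[idx]] = color
    return Image.fromarray(img)

# Hypothetical usage: a 40%-red ant outline.
# stim = recolor_fraction("ant_outline.png", ant_mask, (220, 30, 30), coverage=0.40)
```

Sweeping coverage over a grid of values yields the graded series against which stated thresholds can later be tested; the actual dataset may recolor contiguous object parts rather than scattered pixels.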

If this is right

  • VLMs' introspective self-knowledge is miscalibrated.
  • Reasoning failures in VLMs are not primarily difficulty-driven.
  • World-knowledge priors degrade faithfulness in models differently than in humans.
  • The mismatch carries direct implications for high-stakes deployment where models must predict or explain their own behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar rule-elicitation tests could be applied to other domains such as causal or ethical judgments to check for parallel inconsistencies.
  • Training or prompting techniques that enforce explicit consistency between stated rules and outputs might reduce the observed violations.
  • Evaluation benchmarks for VLMs should include direct checks of rule adherence in addition to accuracy alone.

Load-bearing premise

That the rules participants state when prompted truly reflect the processes they use to make color judgments, rather than being shaped by how the questions are worded or how color is measured.

What would settle it

An experiment that forces models to output only decisions consistent with their previously stated thresholds and measures whether overall accuracy on the color task then falls below human levels or matches the rate of stated violations.
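As a sketch of how that experiment, and the underlying violation rate, could be scored, assume each trial records the stated threshold, the ground-truth coverage, the freely produced label, and the label the dataset intends. The trial schema and field names below are hypothetical, not the paper's evaluation code.

```python
# Minimal sketch of the violation rate and a threshold-forced decision rule.
# The trial schema below is an assumption, not the paper's evaluation code.
from dataclasses import dataclass

@dataclass
class Trial:
    stated_threshold: float  # minimum coverage the participant said is required (0..1)
    true_coverage: float     # ground-truth fraction of recolored pixels (0..1)
    free_label: bool         # did the participant apply the color label when answering freely?
    gold_label: bool         # label the dataset's coverage level is intended to license

def rule_implied_label(t: Trial) -> bool:
    """Label that follows from the participant's own stated rule."""
    return t.true_coverage >= t.stated_threshold

def violation_rate(trials: list[Trial]) -> float:
    """Fraction of trials whose free answer contradicts the stated rule."""
    return sum(t.free_label != rule_implied_label(t) for t in trials) / len(trials)

def forced_accuracy(trials: list[Trial]) -> float:
    """Accuracy if every answer were forced to follow the stated threshold."""
    return sum(rule_implied_label(t) == t.gold_label for t in trials) / len(trials)
```

Comparing violation_rate with forced_accuracy separates the two outcomes the experiment above distinguishes: rules that are stated but not used, versus rules that would cost accuracy if actually enforced.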

Figures

Figures reproduced from arXiv: 2604.06422 by Carsten Eickhoff, Jonathan Nemitz, Junyi Jessy Li, Kyle Mahowald, Michal Golovanevsky, William Rudman.

Figure 1: Responses from GPT-5 on Graded Color Attribution. When images are re-colored …
Figure 2: Examples from GCA showing prior-aligned (red ant), counterfactual (blue straw) …
Figure 3: Examples of different CoT setups to elicit introspective rules in VLMs.
Figure 4: Proportion of “color” responses as a function of color threshold …
Figure 5: Value of VLM stated thresholds averaged across different CoT variants. Analysis …
Figure 6: Human faithfulness vs. percent recolored. Top: introspection first; bottom: last.
Figure 7: Average human confidence for introspection-first (light blue) and introspection-last (dark blue) groups. Humans are faithful, but with miscalculation errors …
Figure 8: VLM faithfulness to introspective rules averaged over all Chain-of-Thought …
Figure 9: VLM responses to “What percent of pixels are [COLOR]?” VLM faithfulness depends on both model capacity and visual input.
Figure 10: Left: Distribution of stimulus types per survey. Right: Example trial from the web-based human experiment. Participants viewed a colored outline image and selected the perceived object color from three options. After making a color decision, participants reported their certainty on a 10-point scale ranging from “very uncertain” to “very certain.”
Figure 11: Overview of the dataset construction and filtering pipeline. Starting from 493 …
Figure 12: Example results of the automatic mask generation procedure. For each object, …
Figure 13: Proportion of “color” responses as a function of color threshold in GCA. Error bars …
Figure 14: Value of VLM stated thresholds for all models, stimulus types and Chain-of-Thought …
Figure 15: VLM faithfulness to introspective rules for all models, stimulus types and over …
Original abstract

Understanding when Vision-Language Models (VLMs) will behave unexpectedly, whether models can reliably predict their own behavior, and if models adhere to their introspective reasoning are central challenges for trustworthy deployment. To study this, we introduce the Graded Color Attribution (GCA) dataset, a controlled benchmark designed to elicit decision rules and evaluate participant faithfulness to these rules. GCA consists of line drawings that vary pixel-level color coverage across three conditions: world-knowledge recolorings, counterfactual recolorings, and shapes with no color priors. Using GCA, both VLMs and human participants establish a threshold: the minimum percentage of pixels of a given color an object must have to receive that color label. We then compare these rules with their subsequent color attribution decisions. Our findings reveal that models systematically violate their own introspective rules. For example, GPT-5-mini violates its stated introspection rules in nearly 60% of cases on objects with strong color priors. Human participants remain faithful to their stated rules, with any apparent violations being explained by a well-documented tendency to overestimate color coverage. In contrast, we find that VLMs are excellent estimators of color coverage, yet blatantly contradict their own reasoning in their final responses. Across all models and strategies for eliciting introspective rules, world-knowledge priors systematically degrade faithfulness in ways that do not mirror human cognition. Our findings challenge the view that VLM reasoning failures are difficulty-driven and suggest that VLM introspective self-knowledge is miscalibrated, with direct implications for high-stakes deployment.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Graded Color Attribution (GCA) dataset of line drawings varying in pixel-level color coverage across world-knowledge recolorings, counterfactual recolorings, and no-prior shapes. It elicits color-labeling thresholds (minimum pixel percentage required for a label) from VLMs and humans, then compares these to subsequent color-attribution decisions on the same images. The central claim is that VLMs systematically violate their own elicited thresholds (e.g., GPT-5-mini in ~60% of strong-prior cases), while humans remain largely faithful (with apparent violations attributable to overestimation of coverage); VLMs are accurate at pixel-coverage estimation yet still contradict their rules, with world-knowledge priors degrading faithfulness in a manner unlike human cognition. This is taken to indicate miscalibrated introspective self-knowledge in VLMs.

Significance. If the core comparison holds after addressing elicitation validity, the work supplies a controlled quantitative benchmark distinguishing rule-following from prior-driven behavior in VLMs versus humans, with direct relevance to trustworthy deployment and self-knowledge evaluation. The GCA dataset and the contrast between accurate coverage estimation and rule violation are concrete strengths that could support falsifiable follow-up tests.

major comments (3)
  1. [Methods (rule elicitation and decision phases); Results (violation rates)] The central claim that VLMs 'violate their own introspective rules' requires that the threshold elicited in the first phase is the operative decision criterion in the second phase. No ablation is described that forces the model to condition its color label on the previously stated threshold (e.g., by including the threshold in the decision prompt), nor is stability of thresholds across re-promptings of the same image set reported. Without these, the ~60% violation rate on strong-prior objects could be an artifact of separate prompting steps rather than evidence of miscalibrated self-knowledge.
  2. [§4 (GCA dataset construction and measurement); Results (coverage estimation vs. attribution)] The paper reports that VLMs remain accurate at pixel-coverage estimation yet contradict their thresholds. However, it is unclear how 'violation' is operationalized when the model's internal representation of coverage may differ from the GCA pixel measurement; if the model uses a different (unelicited) coverage estimate in its decision, the faithfulness metric does not isolate introspective failure from perceptual mismatch.
  3. [Results (human vs. VLM faithfulness; prior conditions)] The claim that 'world-knowledge priors systematically degrade faithfulness in ways that do not mirror human cognition' rests on the human-model comparison. The human explanation invokes a documented overestimation bias, but no parallel analysis checks whether model violations are similarly explained by systematic over- or under-estimation of coverage on the same images, or whether the prior effect is driven by the specific recoloring conditions in GCA.
minor comments (2)
  1. [Methods] Clarify the exact prompting templates used for threshold elicitation versus decision queries, including any differences in phrasing that might affect consistency.
  2. [Results] The abstract states 'across all models and strategies,' but the main text should tabulate violation rates broken down by elicitation strategy and model to support that generalization.
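As a concrete form of the breakdown requested in minor comment 2, here is a minimal sketch assuming per-trial records that already carry a violation flag; the column names ("model", "cot_strategy", "violated") are hypothetical, not the paper's released data format.

```python
# Minimal sketch: violation rates broken down by model and elicitation strategy.
# Column names are assumptions about per-trial records, not the paper's data format.
import pandas as pd

def violation_table(trials: pd.DataFrame) -> pd.DataFrame:
    """Return a model x strategy table of mean violation rates with trial counts."""
    grouped = trials.groupby(["model", "cot_strategy"])["violated"]
    return grouped.agg(violation_rate="mean", n_trials="count").reset_index()

# Toy example:
# df = pd.DataFrame({
#     "model": ["gpt-5-mini", "gpt-5-mini", "other-vlm"],
#     "cot_strategy": ["introspect-first", "introspect-last", "introspect-first"],
#     "violated": [1, 0, 1],
# })
# print(violation_table(df))
```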

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas where additional controls and analyses can strengthen the evidence that VLMs exhibit miscalibrated introspective self-knowledge. We address each point below and outline planned revisions.

Point-by-point responses
  1. Referee: The central claim that VLMs 'violate their own introspective rules' requires that the threshold elicited in the first phase is the operative decision criterion in the second phase. No ablation is described that forces the model to condition its color label on the previously stated threshold (e.g., by including the threshold in the decision prompt), nor is stability of thresholds across re-promptings of the same image set reported. Without these, the ~60% violation rate on strong-prior objects could be an artifact of separate prompting steps rather than evidence of miscalibrated self-knowledge.

    Authors: We agree that an explicit conditioning ablation would provide stronger causal evidence. In the revision we will add a condition in which the decision prompt includes the model's own previously elicited threshold and report the resulting change in violation rates. We will also re-elicit thresholds on a held-out subset of images to quantify stability (e.g., via intra-class correlation or percentage agreement) and include these metrics. These additions directly test whether the observed violations persist when the rule is made salient in the same prompt context. revision: yes

  2. Referee: The paper reports that VLMs remain accurate at pixel-coverage estimation yet contradict their thresholds. However, it is unclear how 'violation' is operationalized when the model's internal representation of coverage may differ from the GCA pixel measurement; if the model uses a different (unelicited) coverage estimate in its decision, the faithfulness metric does not isolate introspective failure from perceptual mismatch.

    Authors: Violation is defined strictly as a mismatch between the final color label and the label that would be predicted by applying the elicited threshold to the ground-truth GCA pixel coverage. Because the paper already demonstrates that VLMs produce accurate coverage estimates when queried directly, perceptual mismatch is unlikely to explain the violations. In the revision we will add an explicit analysis that substitutes each model's own coverage estimate (obtained in a separate query) into the threshold rule and recomputes violation rates; this will isolate whether any residual violations remain after accounting for the model's internal coverage representation. revision: partial

  3. Referee: The claim that 'world-knowledge priors systematically degrade faithfulness in ways that do not mirror human cognition' rests on the human-model comparison. The human explanation invokes a documented overestimation bias, but no parallel analysis checks whether model violations are similarly explained by systematic over- or under-estimation of coverage on the same images, or whether the prior effect is driven by the specific recoloring conditions in GCA.

    Authors: The manuscript already reports that VLMs are highly accurate at coverage estimation, which rules out systematic estimation bias as the driver of their violations (in contrast to humans). To further isolate the role of world-knowledge priors, the revision will include (i) a breakdown of violation rates by the three GCA conditions and (ii) a correlation analysis between violation magnitude and the strength of the object's color prior. These analyses will demonstrate that the degradation is tied to prior interference rather than the particular recoloring manipulations. revision: yes
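To make the analysis proposed in response 2 concrete, here is a minimal sketch that recomputes violations against the model's own queried coverage estimate instead of the ground-truth pixel measurement. The record keys are hypothetical, not the authors' code.

```python
# Minimal sketch of the substitution analysis proposed in response 2: recompute
# violations using the model's own coverage estimate in place of the ground-truth
# GCA pixel measurement. The record keys are hypothetical.

def violates(label: bool, coverage: float, threshold: float) -> bool:
    """True when the final label contradicts the threshold rule applied to `coverage`."""
    return label != (coverage >= threshold)

def violation_rates(records: list[dict]) -> dict:
    """records: dicts with keys 'label', 'stated_threshold',
    'true_coverage', and 'model_coverage_estimate'."""
    n = len(records)
    vs_truth = sum(violates(r["label"], r["true_coverage"], r["stated_threshold"])
                   for r in records) / n
    vs_own = sum(violates(r["label"], r["model_coverage_estimate"], r["stated_threshold"])
                 for r in records) / n
    # If vs_own stays close to vs_truth, perceptual mismatch cannot explain the
    # contradictions; if it collapses, the violations instead track a gap between
    # the model's coverage estimate and the ground-truth measurement.
    return {"vs_ground_truth": vs_truth, "vs_own_estimate": vs_own}
```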

Circularity Check

0 steps flagged

Empirical benchmark study with self-contained derivation

Full rationale

The paper introduces the GCA dataset and performs a direct empirical comparison: thresholds are elicited separately via prompting from VLMs and humans, then color attribution decisions are measured on the same images and checked for adherence. No parameters are fitted to the target violation rates, no thresholds are defined in terms of the decisions themselves, and no load-bearing premises rely on self-citations or imported uniqueness theorems. The central claim (models violate elicited rules while humans do not) is a measurement against external participant responses and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the assumption that the GCA benchmark validly isolates introspective faithfulness and that human overestimation of color coverage is the primary explanation for apparent human violations; no free parameters are introduced, and the only newly introduced entity is the GCA dataset itself.

axioms (1)
  • domain assumption Controlled variations in pixel color coverage can isolate the effect of world-knowledge priors on decision rules.
    Invoked in the dataset construction and condition design described in the abstract.
invented entities (1)
  • Graded Color Attribution (GCA) dataset no independent evidence
    purpose: To elicit and compare decision rules with subsequent color attribution decisions under controlled conditions.
    New benchmark created specifically for this study; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5601 in / 1444 out tokens · 35825 ms · 2026-05-10T19:20:16.092857+00:00 · methodology

discussion (0)

