pith. machine review for the scientific record.

arxiv: 2605.13047 · v1 · submitted 2026-05-13 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:32 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: vision-language models · scene perception · counterfactual ablation · semantic saliency · human-AI alignment · size bias · center bias · object importance

The pith

Vision-language models over-rely on large central objects and under-rely on people compared to humans when describing scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Counterfactual Semantic Saliency, a black-box method that removes individual objects from natural scenes and measures how much the model's description changes. Applied to prominent VLMs and compared against 16,000+ human responses across 307 scenes, the analysis finds models depend more than humans on large objects, centrally placed objects, and high-saliency objects. Models rely less on people than human observers do. The strength of a model's size bias largely accounts for how far its descriptions diverge from human ones.

Core claim

By ablating objects from complex scenes and quantifying the resulting semantic shift in model outputs, the work shows that VLMs systematically overweight large, central, and high-saliency objects relative to humans while under-weighting people; a model's size bias is the main predictor of its semantic divergence from human scene descriptions.

What carries the argument

Counterfactual Semantic Saliency (CSS), which scores an object's importance by the magnitude of semantic change in a model's description after the object is removed from the image.
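To make the mechanism concrete, here is a minimal sketch of a CSS-style computation. It assumes descriptions are embedded with a sentence encoder and that the shift is one minus cosine similarity (the simulated rebuttal below names all-MiniLM-L6-v2 as the metric); the example descriptions and the exact shift formula are illustrative, not the paper's verbatim specification.

```python
# Minimal sketch of a CSS-style score: the semantic shift between descriptions
# of the factual scene and of a counterfactual (object-ablated) scene.
# Encoder choice and the 1 - cosine-similarity shift are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def css_score(factual_desc: str, counterfactual_desc: str) -> float:
    """Semantic shift induced by ablating one object from the scene."""
    emb = encoder.encode([factual_desc, counterfactual_desc])
    cos = float(np.dot(emb[0], emb[1]) /
                (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1])))
    return 1.0 - cos  # larger shift = more critical object

# Hypothetical example: rank objects by how much their removal changes the description.
factual = "A living room with a TV, a sofa, and a lamp."
counterfactuals = {
    "tv":   "A living room with a sofa and a lamp.",
    "lamp": "A living room with a TV and a sofa.",
}
ranking = sorted(counterfactuals,
                 key=lambda obj: css_score(factual, counterfactuals[obj]),
                 reverse=True)
print(ranking)  # objects ordered from most to least semantically critical
```

Because the probe touches only the input image and the output text, it requires no access to model internals, which is what makes it applicable to closed-source VLMs.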

If this is right

  • Removing large objects alters model descriptions more than human descriptions.
  • Central objects exert stronger causal influence on model outputs than on human ones.
  • People receive lower causal weight in model scene descriptions than in human ones.
  • A model's measured size bias directly predicts the magnitude of its semantic divergence from humans.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could be extended to test whether the same biases appear in other vision tasks such as visual question answering.
  • Reducing size bias during training might narrow the overall human-model gap more effectively than targeting other features.
  • The framework supplies a practical way to rank objects by causal importance for any black-box model without internal access.
  • Future benchmarks could incorporate CSS scores as a standard alignment metric rather than relying solely on passive similarity measures.

Load-bearing premise

Ablating objects from images produces clean causal changes in semantic meaning without artifacts from the removal process or from the chosen similarity metric between descriptions.

What would settle it

Re-running the identical ablation and similarity measurement on the human description data and finding the same size and center biases as in the models would falsify the claimed human-model gap.
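One concrete way to run that check, sketched under assumed inputs (aligned arrays of per-object sizes and CSS scores derived from model and human descriptions through the identical pipeline), is to compare the size-bias correlations with a one-tailed bootstrap, echoing the test referenced in the Figure 5 caption. The function and variable names are hypothetical.

```python
# Sketch of the falsification check: compute the size bias (correlation between
# object size and CSS score) from model-derived and human-derived descriptions,
# then bootstrap the model-minus-human gap. Inputs are assumed to be aligned
# NumPy arrays with one entry per ablated object.
import numpy as np
from scipy import stats

def size_bias(sizes: np.ndarray, css: np.ndarray) -> float:
    """Pearson correlation between object size and CSS score."""
    return stats.pearsonr(sizes, css)[0]

def bootstrap_gap(sizes, model_css, human_css, n_boot=10_000, seed=0):
    """One-tailed bootstrap on the model-minus-human size-bias difference."""
    rng = np.random.default_rng(seed)
    n = len(sizes)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)
        diffs[b] = (size_bias(sizes[idx], model_css[idx]) -
                    size_bias(sizes[idx], human_css[idx]))
    # p-value for "the model's size bias exceeds the human one"
    return float(diffs.mean()), float((diffs <= 0).mean())
```

A mean gap indistinguishable from zero would indicate the same size bias in the human data, undercutting the claimed divergence.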

Figures

Figures reproduced from arXiv: 2605.13047 by Miguel P. Eckstein, Parsa Madinei, Ziqi Wen.

Figure 1: Counterfactual Semantic Saliency (CSS) reveals a pervasive perceptual gap between humans and vision-language models (VLMs). CSS quantifies the importance of an object by ablating it from a factual scene and measuring the semantic shift in the resulting descriptions. The object causing the maximal semantic shift is considered the most critical to scene perception. The example illustrates a mismatch between … view at source ↗
Figure 2: Workflow of Counterfactual Semantic Saliency. (1) Counterfactual Stimuli Generation Stage: A large vision-language model generates a list of object names in the scene. For each identified object, SAM3 generates a corresponding segmentation. Following mask preprocessing, an image inpainting model (Nano Banana 2) removes the target object. In the illustration, the TV, along with the pink LED light, are remove… view at source ↗
Figure 3: Between-subjects experimental design for human psychophysics. A scene containing N target objects (1 factual image I and N counterfactual variants I_do(o_i = ∅)) is distributed across N + 1 mutually exclusive stimulus sets. Each set is evaluated by an independent group of 10 participants, ensuring no individual is exposed to multiple variants of the same scene. 227 subjects are recruited to finish a total of … view at source ↗
Figure 4: Model-human alignment on scene perception. (a) Top-1 Accuracy, the proportion of scenes where the evaluated model successfully identified the most semantically critical object, as determined by human consensus. (b) Mean Kendall's rank correlation coefficient (τ) across all scenes, capturing the overall ordinal alignment of object importance hierarchies between models and humans. Full distribut… view at source ↗
Figure 5: Divergence of Perceptual Biases. We computed the correlation between specific object attributes and the resulting CSS score upon ablation. Significance tests indicate the statistical deviation of each model's correlation coefficient from the human baseline (∗ = p < .05; ∗∗ = p < .01; ∗∗∗ = p < .001; ∗∗∗∗ = p < .0001; derived from a one-tailed bootstrap test). To evaluate the four hypothesized drivers o… view at source ↗
Figure 6: Comparison between CSS maps and attention-based white-box maps. view at source ↗
Figure 7: Examples provided to annotators in the data quality validation process. view at source ↗
Figure 8: Examples of online psychophysics. Left: one of the examples provided to subjects as calibration. Right: one of the trials within the experiment. view at source ↗
Figure 9: Comparison between deterministic output and stochastic sampling. view at source ↗
Figure 10: Distribution of Kendall's τ for each model compared against humans, and human-human consistency. view at source ↗
Figure 11: Comparisons between CSS maps and attention-based white-box maps. view at source ↗
read the original abstract

Evaluating whether large vision-language models (VLMs) align with human perception for high-level semantic scene comprehension remains a challenge. Traditional white-box interpretability methods are inapplicable to closed-source architectures and passive metrics fail to isolate causal features. We introduce Counterfactual Semantic Saliency (CSS). This black-box, model-agnostic framework quantifies the importance of objects by measuring the semantic shift induced by their causal ablation from a scene. To evaluate AI-human semantic alignment, we tested prominent VLMs against a human psychophysics baseline comprising 16,289 valid responses across 307 complex natural scenes and 1,306 high-fidelity counterfactual variants. Our analysis reveals a pervasive scene comprehension gap: models exhibit an overreliance (relative to humans) on large objects (size bias), objects at the center of the image (center bias), and high saliency objects. In contrast, models rely less on people in the scenes than our human participants to describe the images. A model's size bias is a primary driver explaining variations in model-human semantic divergence. Code and data will be available at https://github.com/starsky77/Counterfactual-Semantic-Saliency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Counterfactual Semantic Saliency (CSS), a black-box framework that quantifies object importance in natural scenes by measuring semantic shifts in VLM and human descriptions after high-fidelity causal ablation of objects. It evaluates prominent VLMs against a human psychophysics baseline of 16,289 valid responses across 307 complex scenes and 1,306 counterfactual variants, revealing that models over-rely (relative to humans) on large objects, central objects, and high-saliency objects while under-relying on people; size bias is identified as a primary driver of model-human semantic divergence.

Significance. If the central claims hold after addressing measurement concerns, the work offers a practical, model-agnostic tool for probing high-level scene comprehension in closed VLMs and supplies a large-scale human anchor that could guide bias mitigation. The empirical focus on observable differences rather than fitted parameters, combined with promised code and data release, supports reproducibility and follow-on studies.

major comments (3)
  1. [Methods (Counterfactual Generation)] Methods, counterfactual generation subsection: The claim that ablation produces clean causal semantic shifts (central to all bias measurements) requires explicit validation against artifacts such as lighting inconsistencies, texture seams, or context violations. Without human naturalness ratings or comparison to alternative removal methods, the reported size/center/saliency biases and their correlation with divergence risk being confounded by the generation process itself.
  2. [Results (Bias Analysis and Divergence Correlation)] Results, similarity metric and statistical controls: The paper must specify the exact similarity metric (e.g., embedding cosine vs. description overlap) used to quantify semantic shifts and demonstrate that it isolates semantic rather than low-level visual changes. Additionally, the size-bias correlation analysis needs controls for scene complexity, object category, and multiple comparisons to support the claim that size bias is the primary driver.
  3. [Human Psychophysics Baseline] Human baseline section: While the 16,289 responses provide a credible anchor, the manuscript should report inter-rater reliability, exclusion criteria details, and any scene-complexity balancing to ensure the model-human gaps are not artifacts of response variability or stimulus selection.
minor comments (2)
  1. [Figures] The captions of Figures 1 and 2 should explicitly state the number of scenes, models, and response counts per panel for immediate readability.
  2. [Methods] Notation for CSS score (if formalized) should be introduced with a clear equation early in the Methods to avoid ambiguity in later comparisons.
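For reference, a candidate formalization consistent with the notation in the Figure 3 caption (factual image I, counterfactual I_do(o_i = ∅) with object o_i ablated) is given below; this is an editorial suggestion, not the paper's own equation, and sim stands for whatever description-similarity metric the authors adopt.

```latex
% Candidate CSS definition (editorial suggestion, not the paper's notation).
% D(.) denotes the model's (or a human's) description of an image and
% sim is a semantic similarity between descriptions.
\mathrm{CSS}(o_i \mid I) \;=\; 1 - \mathrm{sim}\!\left( D(I),\; D\!\left(I_{\mathrm{do}(o_i=\varnothing)}\right) \right)
```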

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we have revised the manuscript to incorporate additional validations and clarifications.

read point-by-point responses
  1. Referee: Methods, counterfactual generation subsection: The claim that ablation produces clean causal semantic shifts (central to all bias measurements) requires explicit validation against artifacts such as lighting inconsistencies, texture seams, or context violations. Without human naturalness ratings or comparison to alternative removal methods, the reported size/center/saliency biases and their correlation with divergence risk being confounded by the generation process itself.

    Authors: We agree that validating the quality of the counterfactual generations is crucial to ensure the measured semantic shifts are not confounded by artifacts. In the revised manuscript, we will add a section reporting human naturalness ratings collected from 50 participants on a subset of 100 counterfactual images, showing high naturalness scores (mean 4.2/5). Additionally, we will compare our ablation method (which uses advanced inpainting) to simple object masking and report that the semantic shift patterns remain consistent, supporting that the biases are not due to generation artifacts. revision: yes

  2. Referee: Results, similarity metric and statistical controls: The paper must specify the exact similarity metric (e.g., embedding cosine vs. description overlap) used to quantify semantic shifts and demonstrate that it isolates semantic rather than low-level visual changes. Additionally, the size-bias correlation analysis needs controls for scene complexity, object category, and multiple comparisons to support the claim that size bias is the primary driver.

    Authors: We have now explicitly stated in the Methods section that the similarity metric is the cosine similarity of embeddings from the all-MiniLM-L6-v2 Sentence Transformer model, which focuses on semantic content. To show it isolates semantic changes, we added an analysis correlating the metric with low-level features (e.g., SSIM, color histograms) and found negligible correlations (r < 0.1). For the size-bias analysis, we included linear regression controls for scene complexity (object count), dominant object category, and applied FDR correction for multiple comparisons across bias types. These controls confirm size bias as the strongest predictor of divergence (beta = 0.45, p < 0.001). revision: yes

  3. Referee: Human baseline section: While the 16,289 responses provide a credible anchor, the manuscript should report inter-rater reliability, exclusion criteria details, and any scene-complexity balancing to ensure the model-human gaps are not artifacts of response variability or stimulus selection.

    Authors: We have expanded the Human Psychophysics Baseline section to include these details. Inter-rater reliability was assessed using Fleiss' kappa across all scenes, yielding kappa = 0.68, indicating substantial agreement. Exclusion criteria involved removing responses completed in under 3 seconds or failing attention checks (e.g., inconsistent color naming), excluding approximately 12% of initial responses. Scene selection was balanced for complexity by ensuring a uniform distribution of object counts (ranging 5-20) and semantic categories across the 307 scenes. revision: yes
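Responses 2 and 3 name concrete machinery (an embedding cosine metric, regression controls with FDR correction, and Fleiss' kappa). A minimal sketch of the control analysis described in response 2 might look as follows; the table layout, column names, and covariate set are assumptions rather than the authors' exact specification.

```python
# Sketch of the controls in response 2: regress model-human divergence on each
# bias measure while controlling for scene complexity and dominant object
# category, then apply Benjamini-Hochberg FDR correction across bias types.
# The summary table "bias_summary.csv" and its columns are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

df = pd.read_csv("bias_summary.csv")

biases = ["size_bias", "center_bias", "saliency_bias", "person_bias"]
pvals = []
for bias in biases:
    fit = smf.ols(f"divergence ~ {bias} + object_count + C(dominant_category)",
                  data=df).fit()
    pvals.append(fit.pvalues[bias])

reject, p_fdr, _, _ = multipletests(pvals, method="fdr_bh")
for bias, p in zip(biases, p_fdr):
    print(f"{bias}: FDR-corrected p = {p:.4f}")
```

The inter-rater agreement reported in response 3 could likewise be checked with statsmodels.stats.inter_rater.fleiss_kappa on a stimuli-by-category count table.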

Circularity Check

0 steps flagged

No significant circularity; empirical measurement framework

full rationale

The paper introduces Counterfactual Semantic Saliency (CSS) as a black-box method that quantifies object importance via measured semantic shifts after ablation, then compares VLM outputs to a human psychophysics baseline of 16,289 responses. No equations, fitted parameters, or derivations are presented that reduce the reported size/center/saliency biases or divergence metrics to quantities defined by the paper's own inputs. Claims rest on direct empirical comparisons using external human data and model responses rather than self-definitional loops, self-citation load-bearing premises, or renamed known results. The framework is self-contained against the provided benchmarks with no load-bearing internal reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the framework uses standard semantic similarity measures and ablation as a causal probe.

pith-pipeline@v0.9.0 · 5511 in / 1045 out tokens · 38944 ms · 2026-05-14T20:32:01.485147+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 19 canonical work pages · 9 internal anchors

  1. [1]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    A. Anthropic. The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024

  3. [3]

    S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  4. [4]

    D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection: Quantifying interpretability of deep visual representations. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6541–6549, 2017

  5. [5]

    A. F. Biten, L. Gómez, and D. Karatzas. Let there be a clock on the beach: Reducing object hallucination in image captioning. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1381–1390, 2022

  6. [6]

    SAM 3: Segment Anything with Concepts

    N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V . Alwala, H. Khedr, A. Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

  7. [7]

    U. Castiello and C. Umiltà. Size of the attentional focus and efficiency of processing.Acta psychologica, 73(3):195–209, 1990

  8. [8]

    M. Cerf, E. P. Frady, and C. Koch. Faces and text attract gaze independent of the task: Experimental data and computer model.Journal of vision, 9(12):10–10, 2009

  9. [9]

    H. Chefer, S. Gur, and L. Wolf. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 397–406, 2021

  10. [10]

    H. Chefer, S. Gur, and L. Wolf. Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 782–791, 2021

  11. [11]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. Gemini 2.5: Pushing the frontier with advanced rea- soning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  12. [12]

    B. De Haas, A. L. Iakovidis, D. S. Schwarzkopf, and K. R. Gegenfurtner. Individual differences in visual salience vary along semantic dimensions.Proceedings of the National Academy of Sciences, 116(24):11687–11692, 2019. 10

  13. [13]

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  14. [14]

    S. Ding, S. Vasa, and A. Ramadwar. Explanation-driven counterfactual testing for faithfulness in vision-language model explanations.arXiv preprint arXiv:2510.00047, 2025

  15. [15]

    M. P. Eckstein, K. Koehler, L. E. Welbourne, and E. Akbas. Humans, but not deep neural networks, often miss giant targets in scenes.Current Biology, 27(18):2827–2832, 2017

  16. [16]

    T. Fel, I. F. Rodriguez Rodriguez, D. Linsley, and T. Serre. Harmonizing the object recognition strategies of deep neural networks with humans.Advances in neural information processing systems, 35:9432–9446, 2022

  17. [17]

    P. M. Fitts. The information capacity of the human motor system in controlling the amplitude of movement.Journal of experimental psychology, 47(6):381, 1954

  18. [18]

    T.-J. Fu, W. Hu, X. Du, W. Y . Wang, Y . Yang, and Z. Gan. Guiding instruction-based image editing via multimodal large language models.arXiv preprint arXiv:2309.17102, 2023

  19. [19]

    R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

  20. [20]

    R. Geirhos, K. Narayanappa, B. Mitzkus, T. Thieringer, M. Bethge, F. A. Wichmann, and W. Brendel. Partial success in closing the gap between human and machine vision.Advances in Neural Information Processing Systems, 34:23885–23899, 2021

  21. [21]

    R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. Imagenet- trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. InInternational conference on learning representations, 2018

  22. [22]

    A. Gokce and M. Schrimpf. Scaling laws for task-optimized models of the primate visual ventral stream.arXiv preprint arXiv:2411.05712, 2024

  23. [23]

    J. Gordon and B. Van Durme. Reporting bias and knowledge acquisition. InProceedings of the 2013 workshop on Automated knowledge base construction, pages 25–30, 2013

  24. [24]

    Y . Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, and S. Lee. Counterfactual visual explanations. InInternational Conference on Machine Learning, pages 2376–2384. PMLR, 2019

  25. [25]

    J. Harel, C. Koch, and P. Perona. Graph-based visual saliency.Advances in neural information processing systems, 19, 2006

  26. [26]

    T. R. Hayes and J. M. Henderson. Deep saliency models learn low-, mid-, and high-level features to predict scene attention.Scientific reports, 11(1):18434, 2021

  27. [27]

    J. M. Henderson and T. R. Hayes. Meaning-based guidance of attention in scenes as revealed by meaning maps.Nature human behaviour, 1(10):743–747, 2017

  28. [28]

    A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry. Adversarial examples are not bugs, they are features.Advances in neural information processing systems, 32, 2019

  29. [29]

    L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis.IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 1998

  30. [30]

    S. Jain and B. C. Wallace. Attention is not explanation. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556, 2019

  31. [31]

    S. Jo, G. Jang, and H. Park. Gmar: gradient-driven multi-head attention rollout for vision transformer interpretability. In2025 IEEE International Conference on Image Processing (ICIP), pages 582–587. IEEE, 2025. 11

  32. [32]

    X. Ju, X. Liu, X. Wang, Y . Bian, Y . Shan, and Q. Xu. Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. InEuropean Conference on Computer Vision, pages 150–168. Springer, 2024

  33. [33]

    T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In 2009 IEEE 12th international conference on computer vision, pages 2106–2113. IEEE, 2009

  34. [34]

    G. T. A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ram’e, M. Rivière, L. Rouillard, T. Mesnard, G. Cideron, J.-B. Grill, S. Ramos, E. Yvinec, M. Casbon, E. Pot, I. Penchev, G. Liu, F. Visin, K. Kenealy, L. Beyer, X. Zhai, A. Tsit- sulin, R. I. Busa-Fekete, A. Feng, N. Sachdeva, B. Coleman, Y . Gao, B. Mustafa, I...

  35. [35]

    K. Koehler, F. Guo, S. Zhang, and M. P. Eckstein. What do saliency models predict?Journal of vision, 14(3):14–14, 2014

  36. [36]

    S. Leem and H. Seo. Attention guided cam: visual explanations of vision transformer guided by self-attention. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 2956–2964, 2024

  37. [37]

    T. Li, Z. Wen, L. Song, J. Liu, Z. Jing, and T. S. Lee. From local cues to global percepts: Emer- gent gestalt organization in self-supervised vision models.arXiv preprint arXiv:2506.00718, 2025

  38. [38]

    W. Li, Z. Lin, K. Zhou, L. Qi, Y . Wang, and J. Jia. Mat: Mask-aware transformer for large hole image inpainting. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10758–10768, 2022

  39. [39]

    Y . Li, Y . Du, K. Zhou, J. Wang, X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 292–305, 2023

  40. [40]

    T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

  41. [41]

    S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions.Advances in neural information processing systems, 30, 2017. 12

  42. [42]

    I. Misra, C. Lawrence Zitnick, M. Mitchell, and R. Girshick. Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2930–2939, 2016

  43. [43]

    S. E. Palmer.Vision science: Photons to phenomenology. MIT press, 1999

  44. [44]

    W. Peebles and S. Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  45. [45]

    why should i trust you?

    M. T. Ribeiro, S. Singh, and C. Guestrin. " why should i trust you?" explaining the predictions of any classifier. InProceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, 2016

  46. [46]

    A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko. Object hallucination in image captioning. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045, 2018

  47. [47]

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  48. [48]

    L. Rout, Y . Chen, N. Ruiz, C. Caramanis, S. Shakkottai, and W.-S. Chu. Semantic im- age inversion and editing using rectified stochastic differential equations.arXiv preprint arXiv:2410.10792, 2024

  49. [49]

    R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. InProceedings of the IEEE international conference on computer vision, pages 618–626, 2017

  50. [50]

    OpenAI GPT-5 System Card

    A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  51. [51]

    N. Spanos, M. Lymperaiou, G. Filandrianos, K. Thomas, A. V oulodimos, and G. Stamou. V- cece: Visual counterfactual explanations via conceptual edits.arXiv preprint arXiv:2509.16567, 2025

  52. [52]

    R. Suvorov, E. Logacheva, A. Mashikhin, A. Remizova, A. Ashukha, A. Silvestrov, N. Kong, H. Goka, K. Park, and V . Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2149–2159, 2022

  53. [53]

    M. R. Taesiri, G. Nguyen, S. Habchi, C.-P. Bezemer, and A. Nguyen. Imagenet-hard: The hardest images remaining from a study of the power of zoom and spatial biases in image classification.Advances in Neural Information Processing Systems, 36:35878–35953, 2023

  54. [54]

    B. W. Tatler. The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions.Journal of vision, 7(14):4–4, 2007

  55. [55]

    B. W. Tatler, M. M. Hayhoe, M. F. Land, and D. H. Ballard. Eye guidance in natural vision: Reinterpreting salience.Journal of vision, 11(5):5–5, 2011

  56. [56]

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  57. [57]

    S. Tong, Z. Liu, Y . Zhai, Y . Ma, Y . LeCun, and S. Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9568–9578, 2024

  58. [58]

    S. Ullman, L. Assif, E. Fetaya, and D. Harari. Atoms of recognition in human and computer vision.Proceedings of the National Academy of Sciences, 113(10):2744–2749, 2016. 13

  59. [59]

    B. Wallace, M. Dang, R. Rafailov, L. Zhou, A. Lou, S. Purushwalkam, S. Ermon, C. Xiong, S. Joty, and N. Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

  60. [60]

    A. N. Wang, C. Hoang, Y . Xiong, Y . LeCun, and M. Ren. Poodle: Pooled and dense self- supervised learning from naturalistic videos.arXiv preprint arXiv:2408.11208, 2024

  61. [61]

    Z. Wen, T. Li, Z. Jing, and T. S. Lee. Does resistance to style-transfer equal global shape bias? measuring network sensitivity to global shape configuration.arXiv preprint arXiv:2310.07555, 2023

  62. [62]

    E. Xie, J. Chen, J. Chen, H. Cai, H. Tang, Y . Lin, Z. Zhang, M. Li, L. Zhu, Y . Lu, et al. Sana: Efficient high-resolution image synthesis with linear diffusion transformers.arXiv preprint arXiv:2410.10629, 2024

  63. [63]

    J. Xu, M. Jiang, S. Wang, M. S. Kankanhalli, and Q. Zhao. Predicting human gaze beyond pixels.Journal of vision, 14(1):28–28, 2014

  64. [64]

    M. Yuksekgonul, F. Bianchi, P. Kalluri, D. Jurafsky, and J. Zou. When and why vision-language models behave like bags-of-words, and what to do about it?arXiv preprint arXiv:2210.01936, 2022

  65. [65]

    D. Zhang, J. Li, Z. Zeng, and F. Wang. Jasper and stella: distillation of sota embedding models. arXiv preprint arXiv:2412.19048, 2024

  66. [66]

    J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y . Duan, W. Su, J. Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025

  67. [67]

    J. Zhuang, Y. Zeng, W. Liu, C. Yuan, and K. Chen. A task is worth one word: Learning with task prompts for high-quality versatile image inpainting. In European Conference on Computer Vision, pages 195–211. Springer, 2024

  68. [68]

    The description should explain what is happening in the scene

  69. [69]

    The first three images have been provided with descriptions as an example

    The descriptions need to be concise. The first three images have been provided with descriptions as an example. Please carefully review the examples, as they will give you an idea of the kind of images you will see in the survey and the kind of descriptions we expect. You need to satisfy these requirements to participate:

  70. [70]

    This means that you were raised speaking English

    You MUST be a Native English speaker. This means that you were raised speaking English

  71. [71]

    You MUST carefully look at the example shown and provide descriptions as suggested

  72. [72]

    You MUST thoroughly review each image and provide a meaningful and grammatically correct description

  73. [73]

    To standardize the verbosity of human responses, participants were required to review examples of acceptable textual descriptions before beginning the main experimental trials (Fig

    Please ensure to open this link on a laptop or Desktop. To standardize the verbosity of human responses, participants were required to review examples of acceptable textual descriptions before beginning the main experimental trials (Fig. 8). Importantly, the visual stimuli utilized for these calibration examples were strictly disjoint from the main datase...