arxiv: 2605.17826 · v1 · pith:EAOAA6PGnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI

CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models

Pith reviewed 2026-05-20 12:16 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language models · counting bias · counterfactual reasoning · attention modulation · multimodal grounding · object priors

The pith

When learned priors and visual evidence conflict, vision-language models tend to count from the priors rather than from what the image shows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models answer counting questions well when an image matches what they already know about typical scenes, yet their accuracy degrades when the image is edited to change the count while keeping everything else the same. CounterCount supplies paired factual and counterfactual images, verified answers, and annotations that mark the exact visual tokens carrying the count information. Tests on recent models show the drop occurs even when the relevant evidence is clearly present, which the paper attributes to the models assigning too little attention weight to those tokens. A single inference-time change that boosts the weight on the selected tokens lifts counterfactual accuracy by up to 8 percent. The framework therefore both diagnoses the prior-driven failure and supplies a practical mitigation that works across multiple models.

Core claim

The paper establishes that recent VLMs maintain high accuracy on factual counting images yet show consistent drops when count-relevant attributes are edited to produce contradictory visual evidence. Localized evidence annotations indicate that these errors are not solely due to missing or ambiguous evidence; they also arise from under-attention to the precise visual tokens that determine the count. A unified inference-time attention modulation method that reweights the selected tokens improves counterfactual counting performance by up to 8 percent across tested models.

What carries the argument

CounterCount paired factual-counterfactual image set with localized evidence annotations that mark count-relevant visual tokens, plus an inference-time attention reweighting operation that increases the weight on those tokens.
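
The reweighting operation can be sketched for a single attention row (one head, one query position). This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `modulated_attention`, the per-token factors, and the choice to renormalize after scaling are hypothetical. The example factors (2.0 on target tokens with 0.75 on background, and 1.75 on target tokens with background masked to 0) mirror the settings quoted in the figure captions.

```python
import math


def modulated_attention(logits, target_idx, target_factor=1.75, background_factor=1.0):
    """Reweight one attention row and renormalize it to sum to 1.

    logits: pre-softmax attention scores for each key token (list of floats).
    target_idx: indices of the count-relevant tokens (assumed to be known).
    A factor > 1 amplifies, a factor in (0, 1) suppresses, and 0 masks.
    """
    target = set(target_idx)
    peak = max(logits)  # subtract the max before exp for numerical stability
    weights = []
    for j, z in enumerate(logits):
        factor = target_factor if j in target else background_factor
        weights.append(factor * math.exp(z - peak))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("every token was masked out")
    return [w / total for w in weights]


# Four tokens with equal logits; tokens 1 and 2 are the assumed target region.
row = [0.0, 0.0, 0.0, 0.0]
amplify_and_suppress = modulated_attention(row, [1, 2], target_factor=2.0, background_factor=0.75)
amplify_and_mask = modulated_attention(row, [1, 2], target_factor=1.75, background_factor=0.0)
```

Scaling `exp(z)` by a positive factor is the same as adding `log(factor)` to the logit, which is one way amplification, suppression, and masking can share a single formulation.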

If this is right

  • Future VLMs will require explicit mechanisms to override object-level priors when the visual evidence contradicts them.
  • Diagnostic datasets built on minimal attribute edits can expose similar grounding failures in other reasoning tasks.
  • Attention modulation at inference time offers a lightweight way to improve visual grounding without retraining.
  • Counting performance on real images may still be inflated by prior knowledge unless counterfactual checks are applied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prior-over-evidence pattern is likely to appear in other attribute judgments such as size or color when those conflict with canonical expectations.
  • Training objectives that explicitly penalize mismatches between attention maps and human-provided evidence annotations could reduce the need for post-hoc fixes.
  • Applications that depend on accurate object counts in unusual scenes, such as inventory or safety monitoring, would benefit from running both factual and counterfactual versions of each query.

Load-bearing premise

The counterfactual edits change only the count-relevant attributes and leave all other scene properties unchanged, and the provided annotations correctly identify the visual tokens that should drive the count.

What would settle it

Models would achieve nearly identical accuracy on both factual and counterfactual versions of the same scenes, or the attention reweighting step would produce no measurable gain in counterfactual accuracy.
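
The first test can be made concrete with a small scoring sketch over paired results. Everything here is illustrative: the record keys (`canonical`, `cf_truth`, `factual_pred`, `cf_pred`) and the function name `counting_report` are hypothetical and do not come from the paper's release. Besides the factual and counterfactual accuracies, it reports how many counterfactual errors simply repeat the canonical count, which is the signature of a prior-driven answer.

```python
def counting_report(records):
    """Summarize paired factual / counterfactual counting results.

    Each record is a dict with (hypothetical) keys:
      canonical    - ground-truth count on the factual image
      cf_truth     - ground-truth count on the counterfactual image
      factual_pred - model answer on the factual image
      cf_pred      - model answer on the counterfactual image
    """
    n = len(records)
    if n == 0:
        raise ValueError("no records to score")
    factual_acc = sum(r["factual_pred"] == r["canonical"] for r in records) / n
    cf_acc = sum(r["cf_pred"] == r["cf_truth"] for r in records) / n
    cf_errors = [r for r in records if r["cf_pred"] != r["cf_truth"]]
    # Share of counterfactual errors that fall back to the canonical count.
    prior_errors = sum(r["cf_pred"] == r["canonical"] for r in cf_errors)
    prior_error_rate = prior_errors / len(cf_errors) if cf_errors else 0.0
    return {
        "factual_acc": factual_acc,
        "cf_acc": cf_acc,
        "gap": factual_acc - cf_acc,
        "prior_error_rate": prior_error_rate,
    }
```

A gap near zero on the same scenes would count against the prior-reliance reading; a large gap with a high `prior_error_rate` would support it.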

Figures

Figures reproduced from arXiv: 2605.17826 by Abdelrahman Eldesokey, Bernard Ghanem, Bushra Bin Hemid, Hassan Alshanqiti, Reem Alzahrani, Zaid Alyafeai.

Figure 1: CounterCount exposes prior-driven counting failures in VLMs. (a) On factual images, correct answers may reflect either visual counting or canonical priors. (b) Under counterfactual edits, VLMs often default to the prior and answer incorrectly. (c) Our attention modulation of count-relevant visual tokens improves counterfactual counting. This ambiguity has become an active area of investigation [47, 46, 35, …
Figure 2: Representative factual–counterfactual pairs from CounterCount, spanning all ten semantic categories. Each column pair shows one instance: the left image (teal border) shows the corresponding factual image with canonical counts intact, while the right image (red border) is the counterfactual (CF), in which a canonical attribute count is deliberately violated.
Figure 3: Region-level annotation pipeline. (a) Counterfactual input image. (b) Segmentation mask. (c) Bounding box (BB) provides compact spatial localization. (d) Mask-BB intersection. (e) Corresponding 15×15 visual token grid; tokens within the Mask-BB region are highlighted and selected for region-specific attention modulation. The badge indicates the number of selected tokens.
Figure 4: Qualitative examples of counting failures with Qwen3-VL-8B on our dataset. Each case shows the original image (left) with the region of interest highlighted, and the corresponding averaged attention map over the late 50% of layers (right).
Figure 5: Average attention over all vision tokens (blue) and Mask-BB tokens (pink) per layer across …
Figure 6: Inference-time attention modulation. We modulate attention to selected visual-token regions, including WholeImg, Mask, BB, and Mask-BB. The framework supports amplification, suppression, and masking through the same logit-level formulation. We illustrate the T↑B↓ setting, which amplifies count-relevant target tokens while suppressing background tokens. $\tilde{A}_{h,i,j} = \frac{\alpha \exp(z_{h,i,j})}{\sum_{k=1}^{n} \Gamma_k \exp(z_{h,i,k})} = \exp(\ldots$
Figure 7: Illustrates the dataset generation pipeline, showing how each instance is first validated on the factual image to confirm the model recognizes the subject and its canonical attribute, before being tested on the counterfactual image under both open-ended and multiple-choice question formats.
Figure 8: Accuracy convergence with increasing number of evaluation images. Baseline open-ended CF accuracy (%) is computed on subsets of increasing size, sampled uniformly across categories. At each size, 50 random draws are performed; the solid line shows the mean, and the shaded region indicates ±1 standard deviation. Accuracy stabilizes by approximately 80 images for both models, confirming that the full datase…
Figure 9: Average attention over all vision tokens (blue) and Mask-BB tokens (pink) per layer for …
Figure 10: Average attention over all vision tokens (blue) and Mask-BB tokens (pink) per layer …
Figure 11: Qualitative examples of counting failures with Gemma-3-4B on our benchmark. Each case shows the original image (left) with the region of interest highlighted, and the corresponding averaged attention map over the late 50% of layers (right).
Figure 12: Miscounting corrections with Qwen3-VL-8B. Each case shows the original image (left) with the region of interest highlighted, the baseline attention map (middle), and the attention map under T↑B∅ (1.75, 0, BB, All) (right), averaged over the late 50% of layers.
Figure 13: Language prior corrections with Qwen3-VL-8B. Each case shows the original image (left) with the region of interest highlighted, the baseline attention map (middle), and the attention map under T↑B∅ (1.75, 0, BB, All) (right), averaged over the late 50% of layers.
Figure 14: Miscounting corrections with Gemma-3-4B. Each case shows the original image (left) with the region of interest highlighted, the baseline attention map (middle), and the attention map under T↑B↓ (2.0, 0.75, BB, All) (right), averaged over the late 50% of layers.
Figure 15: Language prior corrections with Gemma-3-4B. Each case shows the original image (left) with the region of interest highlighted, the baseline attention map (middle), and the attention map under T↑B↓ (2.0, 0.75, BB, All) (right), averaged over the late 50% of layers.
Figure 16: Examples where attention modulation corrects baseline errors on counterfactual images …
Figure 17: Examples of recurring counting failures on counterfactual images under the best con…
Figure 18: Examples where attention modulation degrades performance on counterfactual images …
Figure 19: Examples of counting errors on factual images. In these cases, the model fails to correctly …
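
The region-level annotation step described in the Figure 3 caption (mask, bounding box, their intersection, then a 15×15 token grid) can be sketched as follows. This is a minimal illustration under assumptions: the function name `select_tokens`, the box convention (inclusive start, exclusive end, in pixels), and the rule that a grid cell is selected if any of its pixels lies in the intersection are all hypothetical; the page does not state the paper's exact selection rule.

```python
def select_tokens(mask, bbox, grid=15):
    """Return the (row, col) grid cells touched by the mask-and-box intersection.

    mask: 2D list of 0/1 values, one per pixel (segmentation mask).
    bbox: (x0, y0, x1, y1) in pixels, start inclusive and end exclusive (assumed).
    grid: number of visual tokens per side (15 in the Figure 3 caption).
    """
    height, width = len(mask), len(mask[0])
    x0, y0, x1, y1 = bbox
    selected = set()
    # Visit only pixels inside the box, clipped to the image bounds.
    for y in range(max(0, y0), min(height, y1)):
        for x in range(max(0, x0), min(width, x1)):
            if mask[y][x]:
                # Map the pixel to the grid cell (token) that contains it.
                selected.add((y * grid // height, x * grid // width))
    return sorted(selected)


# Toy 30x30 image whose mask covers the top-left 4x4 pixels.
toy_mask = [[1 if (y < 4 and x < 4) else 0 for x in range(30)] for y in range(30)]
tokens = select_tokens(toy_mask, (0, 0, 30, 30))
```

The number of selected cells, `len(tokens)`, corresponds to the token-count badge mentioned in the caption.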
Original abstract

Vision-Language Models (VLMs) excel at multimodal reasoning, yet it remains unclear whether their answers are grounded in visual evidence or driven by learned language and world priors. Counting provides a precise testbed: when visual evidence conflicts with canonical object knowledge, a model must rely on the image rather than a prototypical count. We introduce CounterCount, a diagnostic framework for counterfactual counting in VLMs, consisting of paired factual and counterfactual images with edited count-relevant attributes, verified answers, and localized evidence annotations. Evaluating recent VLMs, we find strong performance on factual images but consistent degradation under counterfactual attribute changes, indicating reliance on object-level priors even when contradictory visual evidence is present. Using localized annotations, we show that these failures are not solely due to missing or ambiguous visual evidence, but to models underweighting attention to count-relevant visual tokens. We introduce a unified inference-time attention modulation strategy that reweights selected visual tokens, improving counterfactual counting accuracy by up to 8% across multiple VLMs. Overall, CounterCount exposes prior-driven counting failures and provides diagnostic insights for designing future VLMs.
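
The under-attention diagnosis in the abstract can be illustrated with a per-layer measurement: the fraction of visual attention mass that lands on the annotated tokens, compared with what a uniform spread would give. This is a sketch under assumptions, not the paper's analysis code: `target_attention_share` and `uniform_share` are hypothetical names, and the input is assumed to be one attention row per layer, already averaged over heads and restricted to vision tokens.

```python
def target_attention_share(attn_by_layer, target_idx):
    """Per-layer fraction of visual attention that falls on the target tokens.

    attn_by_layer: list of rows, one per layer; each row holds non-negative
                   attention weights over the vision tokens (assumed input).
    target_idx: indices of the count-relevant tokens.
    """
    target = set(target_idx)
    shares = []
    for row in attn_by_layer:
        total = sum(row)
        on_target = sum(a for j, a in enumerate(row) if j in target)
        shares.append(on_target / total if total > 0 else 0.0)
    return shares


def uniform_share(num_tokens, target_idx):
    """Share the target region would receive if attention were spread evenly."""
    return len(set(target_idx)) / num_tokens


# Two toy layers over four vision tokens; tokens 1 and 2 are the target region.
toy_attention = [[0.25, 0.25, 0.25, 0.25], [0.5, 0.0, 0.125, 0.375]]
shares = target_attention_share(toy_attention, [1, 2])
baseline = uniform_share(4, [1, 2])
```

A layer whose share falls below the uniform baseline is attending to the count-relevant region less than chance would, which is the pattern the abstract describes as underweighting.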

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces CounterCount, a diagnostic framework consisting of paired factual and counterfactual images with edited count-relevant attributes, verified answers, and localized evidence annotations. It evaluates recent VLMs and reports strong performance on factual images but consistent degradation on counterfactual versions, which the authors attribute to reliance on object-level priors and underweighting of count-relevant visual tokens in attention. An inference-time attention modulation strategy is proposed that improves counterfactual counting accuracy by up to 8%.

Significance. If the counterfactual edits prove free of non-count artifacts, the framework supplies a precise testbed for visual grounding failures in VLMs and a lightweight mitigation technique. The localized annotations help separate missing evidence from attention misallocation, which could inform future model design.

major comments (1)
  1. [Methods section on counterfactual image construction and annotation] The central claim that performance degradation on counterfactual images demonstrates prior-driven underweighting of visual evidence (rather than edit artifacts) requires that edits alter only count-relevant attributes while exactly preserving all other scene properties. The manuscript describes 'edited count-relevant attributes' with 'verified answers' and 'localized evidence annotations' but does not report quantitative edit-fidelity metrics such as human similarity ratings on non-target attributes or automated checks for introduced inconsistencies (lighting, shadows, spatial relations). This validation is load-bearing for the interpretation of results.
minor comments (1)
  1. Specify the exact number of image pairs, the full list of evaluated VLMs, and the precise conditions under which the 8% gain is measured so that the improvement claim can be reproduced.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below and will revise the manuscript to incorporate additional validation.

Point-by-point responses
  1. Referee: [Methods section on counterfactual image construction and annotation] The central claim that performance degradation on counterfactual images demonstrates prior-driven underweighting of visual evidence (rather than edit artifacts) requires that edits alter only count-relevant attributes while exactly preserving all other scene properties. The manuscript describes 'edited count-relevant attributes' with 'verified answers' and 'localized evidence annotations' but does not report quantitative edit-fidelity metrics such as human similarity ratings on non-target attributes or automated checks for introduced inconsistencies (lighting, shadows, spatial relations). This validation is load-bearing for the interpretation of results.

    Authors: We agree that demonstrating the fidelity of the counterfactual edits, specifically that only count-relevant attributes are altered while all other scene properties are preserved, is critical for attributing performance degradation to object priors rather than edit artifacts. The current manuscript uses verified answers and localized evidence annotations to support the edits, but we acknowledge that quantitative metrics would provide stronger, more explicit evidence. In the revised manuscript, we will add a new subsection to the Methods section reporting edit-fidelity evaluation. This will include human similarity ratings on non-target attributes (e.g., lighting, shadows, spatial relations, and background elements) collected from multiple annotators with inter-rater agreement statistics, as well as automated consistency checks using perceptual similarity metrics. These additions will directly address the load-bearing concern and reinforce the validity of our conclusions.

    revision: yes
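
The rebuttal does not name an inter-rater agreement statistic. As one common choice for two annotators, Cohen's kappa can be sketched as below; treating it as the statistic the authors would report is an assumption, and the function name `cohen_kappa` and the label encoding (for example 1 for "non-target attributes preserved", 0 for "changed") are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy ratings on four image pairs from two annotators.
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the analogous choice.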

Circularity Check

0 steps flagged

No significant circularity; empirical diagnostic framework on new image pairs

full rationale

The paper constructs a new dataset of factual/counterfactual image pairs with verified answers and localized evidence annotations, then reports empirical VLM performance degradation and an attention-reweighting intervention. No derivation reduces by construction to fitted parameters, self-defined quantities, or load-bearing self-citations. The central claim (degradation indicates prior reliance) follows directly from the experimental contrast between factual and edited images rather than from any internal fit or renaming. The attention modulation is presented as an external inference-time strategy, not a tautological re-expression of the annotations themselves. This is a standard empirical diagnostic setup with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that attribute edits can isolate counting priors without introducing new visual ambiguities, plus the modeling assumption that attention weights directly indicate reliance on visual evidence.

axioms (1)
  • domain assumption: Counterfactual images can be generated by editing only count-relevant attributes while leaving all other visual properties unchanged and verifiable.
    The entire diagnostic comparison depends on this premise being true for the paired images.

pith-pipeline@v0.9.0 · 5743 in / 1271 out tokens · 37610 ms · 2026-05-20T12:16:13.738631+00:00 · methodology



Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 2 internal anchors

  1. [1]

    Alayrac, J

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. L. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. a. Bi ´nkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan. Flamingo: a visual langu...

  2. [2]

    Alghisi, G

    S. Alghisi, G. Roccabruna, M. Rizzoli, S. M. Mousavi, and G. Riccardi. [de|re]constructing vlms’ reasoning in counting, 2025

  3. [3]

    Antol, A

    S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa: Visual question answering. InProceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015

  4. [4]

    Atabuzzaman, A

    M. Atabuzzaman, A. Asgarov, and C. Thomas. Benchmarking and mitigating mcqa selection bias of large vision-language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2025

  5. [5]

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

  6. [6]

    S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...

  7. [7]

    D. I. Campbell, S. Rane, T. Giallanza, C. N. D. Sabbata, K. Ghods, A. Joshi, A. Ku, S. M. Fran- kland, T. L. Griffiths, J. D. Cohen, and T. W. Webb. Understanding the limits of vision language models through the lens of the binding problem. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  8. [8]

    T. T. Chan, V . Suresh, A. Saha, M. Hahn, and V . Demberg. System-mediated attention imbal- ances make vision-language models say yes, 2026

  9. [9]

    L. Che, Z. Xue, Y . Quan, B. Liu, Z. Shi, M. Hurst, J. Feldman, R. Tang, R. Krishna, and V . Pavlovic. Counting circuits: Mechanistic interpretability of visual reasoning in large vision- language models, 2026

  10. [10]

    A. Deng, T. Cao, Z. Chen, and B. Hooi. Words or vision: Do vision-language models have blind faith in text? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3867–3876, June 2025

  11. [11]

    Favero, L

    A. Favero, L. Zancato, M. Trager, S. Choudhary, P. Perera, A. Achille, A. Swaminathan, and S. Soatto. Multi-modal hallucination control by visual information grounding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14303–14312, June 2024. 10

  12. [12]

    S. Fu, T. Bonnen, D. Guillory, and T. Darrell. Hidden in plain sight: Vlms overlook their visual representations, 2025

  13. [13]

    I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernoncourt, T. Yu, R. Zhang, and N. K. Ahmed. Bias and fairness in large language models: A survey.Computational Linguistics, 50(3):1097–1179, Sept. 2024

  14. [14]

    Golovanevsky, W

    M. Golovanevsky, W. Rudman, M. A. Lepori, A. Bar, R. Singh, and C. Eickhoff. Pixels versus priors: Controlling knowledge priors in vision-language models through visual counterfacts. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24837–2...

  15. [15]

    S. M. Hall, F. G. Abrantes, H. Zhu, G. Sodunke, A. Shtedritski, and H. R. Kirk. Visogender: A dataset for benchmarking gender bias in image-text pronoun resolution. InThirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023

  16. [16]

    Understanding Counting Mechanisms in Large Language and Vision-Language Models

    H. Hasani, A. Izadi, F. Askari, M. Bagherian, S. Mohammadian, M. Izadi, and M. S. Baghshah. Understanding counting mechanisms in large language and vision-language models.arXiv preprint arXiv:2511.17699, 2024

  17. [17]

    Howard, A

    P. Howard, A. Madasu, T. Le, G. L. Moreno, A. Bhiwandiwalla, and V . Lal. Socialcounter- factuals: Probing and mitigating intersectional social biases in vision-language models with counterfactual examples. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11975–11985, June 2024

  18. [18]

    D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

  19. [19]

    M. I. Ismithdeen, M. U. Khattak, and S. Khan. Promptception: How sensitive are large multimodal models to prompts? In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 23950–23985, Suzhou, China, Nov. 2025. Association for Computational Linguistics

  20. [20]

    Jiang, J

    Z. Jiang, J. Chen, B. Zhu, T. Luo, Y . Shen, and X. Yang. Devils in middle layers of large vision- language models: Interpreting, detecting and mitigating object hallucinations via attention lens. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 25004–25014, June 2025

  21. [21]

    Kirillov, E

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, and et al. Segment anything. InICCV, 2023

  22. [22]

    K.-i. Lee, M. Kim, S. Yoon, M. Kim, D. Lee, H. Koh, and K. Jung. VLind-bench: Measuring language priors in large vision-language models. In L. Chiruzzo, A. Ritter, and L. Wang, editors, Findings of the Association for Computational Linguistics: NAACL 2025, pages 4129–4144, Albuquerque, New Mexico, Apr. 2025. Association for Computational Linguistics

  23. [23]

    J. Li, D. Li, C. Xiong, and S. Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors,Proceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages ...

  24. [24]

    H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc., 2023

  25. [25]

    S. Liu, K. Zheng, and W. Chen. Paying more attention to image: A training-free method for alleviating hallucination in lvlms, 2024

  26. [26]

    Y . Liu, H. Duan, Y . Zhang, B. Li, S. Zhang, W. Zhao, Y . Yuan, J. Wang, C. He, Z. Liu, K. Chen, and D. Lin. Mmbench: Is your multi-modal model an all-around player? InThe European Conference on Computer Vision (ECCV), 2024. 11

  27. [27]

    Z. Liu, Z. Chen, H. Liu, C. Luo, X. Tang, S. Wang, J. Zeng, Z. Dai, Z. Shi, T. Wei, B. Dumoulin, and H. Tong. Seeing but not believing: Probing the disconnect between visual attention and answer correctness in vlms, 2025

  28. [28]

    P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InThe Twelfth International Conference on Learning Representations, 2024

  29. [29]

    T. Luo, A. Cao, G. Lee, J. Johnson, and H. Lee. Probing visual language priors in VLMs. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors,Proceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 41120–41156. PMLR, 13–19 Jul 2025

  30. [30]

    Naous, M

    T. Naous, M. J. Ryan, A. Ritter, and W. Xu. Having beer after prayer? measuring cultural bias in large language models. In L.-W. Ku, A. Martins, and V . Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16366–16393, Bangkok, Thailand, Aug. 2024. Association for Computat...

  31. [31]

    F. Ortu, Z. Jin, D. Doimo, and A. Cazzaniga. When seeing overrides knowing: Disentangling knowledge conflicts in vision-language models. InMechanistic Interpretability Workshop at NeurIPS 2025, 2025

  32. [32]

    Paiss, A

    R. Paiss, A. Ephrat, O. Tov, S. Zada, I. Mosseri, M. Irani, and T. Dekel. Teaching clip to count to ten. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3170–3180, October 2023

  33. [33]

    Parrish, A

    A. Parrish, A. Chen, N. Nangia, V . Padmakumar, J. Phang, J. Thompson, P. M. Htut, and S. Bowman. BBQ: A hand-built bias benchmark for question answering. In S. Muresan, P. Nakov, and A. Villavicencio, editors,Findings of the Association for Computational Linguistics: ACL 2022, pages 2086–2105, Dublin, Ireland, May 2022. Association for Computational Linguistics

  34. [34]

    M. F. Qharabagh, M. Ghofrani, and K. Fountoulakis. LVLM-COUNT: Enhancing the counting ability of large vision-language models, 2025

  35. [35]

    Rahmanzadehgervi, L

    P. Rahmanzadehgervi, L. Bolton, M. R. Taesiri, and A. T. Nguyen. Vision language models are blind. InProceedings of the Asian Conference on Computer Vision (ACCV), pages 18–34, December 2024

  36. [36]

    C. Raj, A. Mukherjee, A. Caliskan, A. Anastasopoulos, and Z. Zhu. BiasDora: Exploring hidden biased associations in vision-language models. In Y . Al-Onaizan, M. Bansal, and Y .-N. Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10439–10455, Miami, Florida, USA, Nov. 2024. Association for Computational Linguistics

  37. [37]

    Rudman, M

    W. Rudman, M. Golovanevsky, A. Bar, V . Palit, Y . LeCun, C. Eickhoff, and R. Singh. For- gotten polygons: Multimodal large language models are shape-blind. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors,Findings of the Association for Computational Linguis- tics: ACL 2025, pages 11983–11998, Vienna, Austria, July 2025. Association for Com...

  38. [38]

    Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions

    S. Sengupta, N. Moradinasab, J. Liu, and D. E. Brown. Can vision-language models count? a synthetic benchmark and analysis of attention-based interventions.arXiv preprint arXiv:2511.17722, 2025

  39. [39]

    Sheng, K.-W

    E. Sheng, K.-W. Chang, P. Natarajan, and N. Peng. The woman worked as a babysitter: On biases in language generation. In K. Inui, J. Jiang, V . Ng, and X. Wan, editors,Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3407–3...

  40. [40]

    what shapes your bias?

    J. Shin, H. Song, H. Lee, S. Jeong, and J. Park. Ask LLMs directly, “what shapes your bias?”: Measuring social bias in large language models. In L.-W. Ku, A. Martins, and V . Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 16122– 16143, Bangkok, Thailand, Aug. 2024. Association for Computational Linguistics

  41. [41]

    Sterz, J

    H. Sterz, J. Pfeiffer, and I. Vuli ´c. Dare: Diverse visual question answering with robustness evaluation.Transactions of the Association for Computational Linguistics, 13:1121–1145, 09 2025

  42. [42]

    G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, L. Rouillard, T. Mesnard, G. Cideron, J. bastien Grill, S. Ramos, E. Yvinec, M. Casbon, E. Pot, I. Penchev, G. Liu, F. Visin, K. Kenealy, L. Beyer, X. Zhai, A. Tsitsulin, R. Busa-Fekete, A. Feng, N. Sachdeva, B. Coleman, Y . Gao, B. Mustafa,...

  43. [43]

    S. Tong, Z. Liu, Y . Zhai, Y . Ma, Y . LeCun, and S. Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9568–9578, June 2024

  44. [44]

    C. Tu, P. Ye, D. Zhou, L. Bai, G. Yu, T. Chen, and W. Ouyang. Attention reallocation: Towards zero-cost and controllable hallucination mitigation of mllms.Int. J. Comput. Vision, 134(1), Dec. 2025

  45. [45]

    T. Ullman. The illusion-illusion: Vision language models see illusions where there are none, 2024

  46. [46]

    V o, K.-N

    A. V o, K.-N. Nguyen, M. R. Taesiri, V . T. Dang, A. T. Nguyen, and D. Kim. Vision language models are biased, 2025

  47. [47]

    V o, K.-N

    A. V o, K.-N. Nguyen, M. R. Taesiri, V . T. Dang, A. T. Nguyen, and D. Kim. Vision language models are biased: Counting legs of an animal is surprisingly hard. In2nd AI for Math Workshop @ ICML 2025, 2025

  48. [48]

    A. V o, M. R. Taesiri, D. Kim, and A. T. Nguyen. B-score: Detecting biases in large language models using response history. InForty-second International Conference on Machine Learning, 2025

  49. [49]

    W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, Z. Wang, Z. Chen, H. Zhang, G. Yang, H. Wang, Q. Wei, J. Yin, W. Li, E. Cui, G. Chen, Z. Ding, C. Tian, Z. Wu, J. Xie, Z. Li, B. Yang, Y. Duan, X. Wang, Z. Hou, H. Hao, T. Zhang, S. Li, X. Zhao, H. Duan, N. Deng, B. Fu, Y. He, Y. Wang, C. He, B. Shi, J. He, Y. Xiong, H....

  50. [50]

    W. Wang, W. Jiao, J. Huang, R. Dai, J.-t. Huang, Z. Tu, and M. Lyu. Not all countries celebrate thanksgiving: On the cultural dominance in large language models. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6349–6384, Bangkok, Thail...

  51. [51]

    Z. Wang, B. Xu, Y. Xia, and P. Li. Vegas: Mitigating hallucinations in large vision-language models via vision-encoder attention guided adaptive steering, 2025

  52. [52]

    Z. Wu, Y. Niu, H. Gao, M. Lin, Z. Zhang, Z. Zhang, Q. Shi, Y. Wang, S. Fu, J. Xu, J. Ao, E. Dai, L. Feng, X. Zhang, and S. Wang. Lanp: Rethinking the impact of language priors in large vision-language models, 2025

  53. [53]

    T. Yang, Z. Li, J. Cao, and C. Xu. Understanding and mitigating hallucination in large vision-language models via modular attribution and intervention. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 51546–51568, 2025

  54. [54]

    T. Yang, L. Zhang, J. Lin, G. Hu, D. Wang, and L. Hu. Tracing and mitigating hallucinations in multimodal llms via dynamic attention localization, 2025

  55. [55]

    Y. Yang, C. P. Lee, S. Feng, D. Zhao, B. Wen, A. Z. Liu, Y. Tsvetkov, and B. Howe. Escaping the spuriverse: Can large vision-language models generalize beyond seen spurious correlations? In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

  56. [56]

    L. Yu, Z. Chen, P. Kuang, Z. Feng, F. Zhou, L. Wang, and G. Dobbie. Causally-grounded dual-path attention intervention for object hallucination mitigation in lvlms, 2025

  57. [57]

    W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. Mm-vet: evaluating large multimodal models for integrated capabilities. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

  58. [58]

    X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y. Liu, W. Huang, H. Sun, Y. Su, and W. Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision...

  59. [59]

    G. Zeno, N. Jedidi, and S. R. Gomez. Choosing ‘right’ from wrong: A closer look at selection bias in spatial multiple-choice questions in large multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 535–544, 2025

  60. [60]

    J. Zhang, J. Huang, S. Jin, and S. Lu. Vision-language models for vision tasks: A survey. IEEE Trans. Pattern Anal. Mach. Intell., 46(8):5625–5644, Aug. 2024

  61. [61]

    H. Zhao, S. Si, L. Chen, Y. Zhang, M. Sun, B. Chang, and M. Zhang. Looking beyond text: Reducing language bias in large vision-language models via multimodal dual-attention and soft-image guidance. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processi...

  62. [62]

    J. Zhao, F. Zhang, X. Sun, C. Feng, and Z. Tan. Tell model where to look: Mitigating hallucinations in mllms by vision-guided attention, 2025

  63. [63]

    parrot”) with its category label (e.g., “bird

    in Figure 11. Attention maps are computed by averaging over the late 50% of transformer layers following [27], with 95th-percentile contrast enhancement applied prior to normalization for visualization clarity. As observed with Qwen3-VL-8B [6] in the main paper, attention is concentrated within the relevant regions in both failure cases, suggesting that l...
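
The visualization recipe described above (average over the late half of the layers, enhance contrast at the 95th percentile, then normalize) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function name `attention_heatmap`, the `(num_layers, H, W)` input layout, the reading of "contrast enhancement" as clipping at the percentile, and the min-max normalization are all assumptions made for the example.

```python
import numpy as np


def attention_heatmap(layer_attn, late_fraction=0.5, clip_percentile=95.0):
    """Build a [0, 1] heatmap from per-layer attention over image patches.

    layer_attn: array-like of shape (num_layers, H, W). This layout is a
    hypothetical convention chosen for the sketch.
    """
    layer_attn = np.asarray(layer_attn, dtype=np.float64)
    num_layers = layer_attn.shape[0]

    # Keep only the last `late_fraction` of the layers (at least one layer).
    n_late = max(1, int(num_layers * late_fraction))
    avg = layer_attn[num_layers - n_late:].mean(axis=0)

    # Assumed form of contrast enhancement: cap values at the given
    # percentile so a few very large weights do not wash out the rest.
    cap = np.percentile(avg, clip_percentile)
    clipped = np.minimum(avg, cap)

    # Assumed normalization: min-max rescaling to [0, 1].
    lo, hi = clipped.min(), clipped.max()
    if hi - lo < 1e-12:
        return np.zeros_like(clipped)  # flat map: nothing to highlight
    return (clipped - lo) / (hi - lo)
```

Clipping before normalizing means every patch at or above the 95th percentile maps to 1.0, which makes moderately attended regions easier to see in the rendered figure; other contrast schemes would be equally consistent with the description.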