Recognition: unknown
Counting to Four is still a Chore for VLMs
Pith reviewed 2026-05-10 15:34 UTC · model grok-4.3
The pith
VLMs fail at counting because visual evidence degrades during language-stage reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Count-relevant visual evidence is strongest in the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by this, the Modality Attention Share intervention encourages a minimum budget of visual attention during answer generation, suggesting that counting failures stem from underuse of visual evidence during language-stage reasoning in addition to visual perception limits.
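To make the claimed mechanism concrete, the sketch below shows one way a minimum visual attention budget could be enforced on a single query's post-softmax attention weights. The function name, tensor layout, and the 0.3 budget are illustrative assumptions, not the paper's MAS implementation.

```python
import torch

def enforce_visual_attention_share(attn_probs: torch.Tensor,
                                    visual_mask: torch.Tensor,
                                    min_share: float = 0.3) -> torch.Tensor:
    """Rescale attention weights so image-token keys receive at least `min_share`
    of the probability mass.

    attn_probs:  (..., seq_len) post-softmax attention for one query position.
    visual_mask: (seq_len,) boolean, True where the key is an image token.
    """
    visual_mass = (attn_probs * visual_mask).sum(dim=-1, keepdim=True)
    text_mass = (1.0 - visual_mass).clamp_min(1e-6)
    below_budget = visual_mass < min_share
    # Scale visual weights up to the budget and text weights down to the remainder,
    # but only where the visual share currently falls below the budget.
    vis_scale = torch.where(below_budget, min_share / visual_mass.clamp_min(1e-6),
                            torch.ones_like(visual_mass))
    txt_scale = torch.where(below_budget, (1.0 - min_share) / text_mass,
                            torch.ones_like(text_mass))
    rescaled = attn_probs * torch.where(visual_mask, vis_scale, txt_scale)
    return rescaled / rescaled.sum(dim=-1, keepdim=True)  # keep it a distribution
```

In a real decoder this would be applied inside each attention layer (or a chosen subset) at every generated token; the layer placement and budget schedule are exactly the design choices the paper would need to specify.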
What carries the argument
The observed degradation of visual evidence across layers from modality projection to language stages, addressed by the Modality Attention Share (MAS) intervention that enforces a minimum visual attention budget during answer generation.
Load-bearing premise
That attention analysis and component-wise probing accurately reveal the degradation of visual evidence across layers without confounding factors.
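One standard way to operationalize this premise is a layer-wise linear probe: if count information can be linearly decoded from pooled hidden states at one layer but not at a later one, that is (correlational) evidence of degradation. The sketch below assumes hidden states have already been extracted and pooled per layer; the pooling choice and the logistic-regression probe are assumptions, not the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_count_by_layer(hidden_states: np.ndarray, counts: np.ndarray) -> list:
    """hidden_states: (n_layers, n_examples, d_model) pooled representations per layer.
    counts: (n_examples,) ground-truth object counts, treated as class labels.
    Returns 5-fold cross-validated probe accuracy for each layer."""
    accuracies = []
    for layer_features in hidden_states:
        probe = LogisticRegression(max_iter=2000)
        accuracies.append(cross_val_score(probe, layer_features, counts, cv=5).mean())
    return accuracies
```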
What would settle it
An experiment where the MAS intervention is applied but counting accuracy on the controlled shape-based cases shows no improvement, or where high visual attention persists yet errors remain.
Original abstract
Vision-language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skills such as object counting. Existing evaluations mostly assess only final outputs, offering limited insight into where these failures arise inside the model. In this work, we present an empirical study of VLM counting behavior through both behavioral and mechanistic analysis. We introduce COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases designed to expose vulnerabilities under different patchification layouts and adversarial prompting conditions. Using attention analysis and component-wise probing, we show that count-relevant visual evidence is strongest in the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by this finding, we further evaluate Modality Attention Share (MAS), a lightweight intervention that encourages a minimum budget of visual attention during answer generation. Our results suggest that counting failures in VLMs stem not only from visual perception limits, but also from the underuse of visual evidence during language-stage reasoning. Code and dataset will be released at https://github.com/leduy99/-CVPRW26-Modality-Attention-Share.
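The abstract's controlled shape-based cases are not reproduced here, but a counting stimulus in that spirit can be rendered along the following lines; the canvas size, circle shapes, and spacing rule are placeholder choices, not the COUNTINGTRICKS specification.

```python
import random
from PIL import Image, ImageDraw

def render_counting_case(n_shapes: int, canvas: int = 336,
                         radius: int = 18, seed: int = 0) -> Image.Image:
    """Draw `n_shapes` non-overlapping filled circles on a white canvas.
    The ground-truth answer to "How many circles are in the image?" is n_shapes."""
    rng = random.Random(seed)
    img = Image.new("RGB", (canvas, canvas), "white")
    draw = ImageDraw.Draw(img)
    centers = []
    attempts = 0
    while len(centers) < n_shapes and attempts < 10_000:
        attempts += 1
        x = rng.randint(radius, canvas - radius)
        y = rng.randint(radius, canvas - radius)
        # Reject candidates that would overlap an already-placed circle.
        if all((x - cx) ** 2 + (y - cy) ** 2 > (2 * radius) ** 2 for cx, cy in centers):
            centers.append((x, y))
            draw.ellipse((x - radius, y - radius, x + radius, y + radius), fill="black")
    return img
```

Varying the canvas size relative to the model's patch grid, or appending a misleading count to the prompt, corresponds to the patchification and adversarial-prompt axes the abstract describes.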
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study of vision-language model (VLM) failures on simple object counting. It introduces the COUNTINGTRICKS benchmark of controlled shape-based cases, performs behavioral evaluation under varying patchification and prompting, and uses attention analysis plus component-wise probing to show that count-relevant visual signals are strongest at the modality projection stage but degrade in subsequent language layers, increasing susceptibility to text priors. Motivated by this, the authors propose the lightweight Modality Attention Share (MAS) intervention to enforce a minimum visual attention budget during generation and report that it improves counting performance, concluding that failures arise from both perception limits and underuse of visual evidence in language-stage reasoning.
Significance. If the mechanistic account and intervention results hold, the work provides useful insight into internal causes of grounding failures in VLMs beyond output-only evaluation and offers a simple, deployable fix. The controlled COUNTINGTRICKS suite and planned code/dataset release are clear strengths that enable reproducibility and further study. The emphasis on layer-wise degradation of multimodal signals is a timely contribution to understanding where multimodal reasoning breaks down.
Major comments (2)
- [Mechanistic analysis] Mechanistic analysis section: attention maps and probing show that count-relevant visual evidence is strongest at the modality projection stage and degrades in later language layers, but this evidence remains correlational. No direct causal test is reported (e.g., a targeted intervention on hidden states that restores count information before the observed drop, followed by a measurement of downstream error reduction), leaving open the possibility that the degradation is a symptom of an earlier commitment to a text-prior answer rather than an independent cause of counting errors. A minimal sketch of such a patching test appears after these comments.
- [MAS intervention] MAS intervention evaluation: the reported performance lift from enforcing visual attention share is promising, yet the manuscript lacks ablations that isolate whether the gain stems specifically from restored use of visual evidence versus nonspecific changes in generation dynamics (e.g., comparison against random attention reweighting or other non-visual interventions). Without such controls, the claim that MAS directly mitigates underuse of visual evidence during language-stage reasoning is not fully supported.
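Regarding the causal test requested in the first major comment: a minimal form of hidden-state patching would cache layer outputs from a run that answers the count correctly and write them into the same layer of a failing run, then check whether the error disappears. The sketch below assumes a HuggingFace-style VLM whose decoder layers are reachable at model.language_model.model.layers; that module path and the hook-based mechanism are assumptions that vary by model family.

```python
import torch

@torch.no_grad()
def run_with_patched_layer(model, layer_idx: int,
                           donor_hidden: torch.Tensor, inputs: dict) -> torch.Tensor:
    """Forward `inputs` through `model`, overwriting the hidden states produced by
    decoder layer `layer_idx` with `donor_hidden` cached from a reference run.
    Returns the logits of the patched forward pass."""
    # NOTE: the attribute path below is an assumption; adjust it to the actual VLM.
    layer = model.language_model.model.layers[layer_idx]

    def overwrite_output(module, args, output):
        patched = donor_hidden.to(output[0].dtype)
        return (patched,) + output[1:]  # decoder layers return a tuple

    handle = layer.register_forward_hook(overwrite_output)
    try:
        logits = model(**inputs).logits
    finally:
        handle.remove()
    return logits
```

Sweeping layer_idx would map out where restoring count information actually changes the answer, the causal complement to the correlational probing.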
Minor comments (2)
- [Abstract] Abstract and results sections should include concrete quantitative details (accuracy deltas, sample sizes per condition, statistical tests) rather than qualitative statements such as 'degrades substantially' or 'boosts visual attention share.'
- [Benchmark construction] The description of COUNTINGTRICKS would benefit from an explicit table or figure summarizing the patchification layouts and adversarial prompt variants used, to make the controlled nature of the benchmark immediately clear.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review of our manuscript. The comments highlight important aspects of our mechanistic analysis and intervention evaluation that we will address to improve clarity and rigor. We provide point-by-point responses below.
Point-by-point responses
-
Referee: [Mechanistic analysis] Mechanistic analysis section: attention maps and probing show count-relevant visual evidence strongest at modality projection and degrading in later language layers, but this remains correlational. No direct causal test (e.g., targeted intervention on hidden states to restore count information before the observed drop and measurement of downstream error reduction) is reported, leaving open the possibility that degradation is a symptom of an earlier commitment to a text-prior answer rather than an independent cause of counting errors.
Authors: We agree that the mechanistic findings rely on correlational evidence from attention analysis and component-wise probing, which demonstrate the layer-wise degradation of visual signals but do not include direct causal interventions such as targeted modifications to hidden states. This correlational approach is consistent with the behavioral results showing increased susceptibility to text priors in later layers. In the revised manuscript, we will explicitly acknowledge the correlational nature of these analyses, discuss the alternative interpretation raised, and outline potential causal experiments as future work to more definitively establish the direction of causality. revision: partial
-
Referee: [MAS intervention] MAS intervention evaluation: the reported performance lift from enforcing visual attention share is promising, yet the manuscript lacks ablations that isolate whether the gain stems specifically from restored use of visual evidence versus nonspecific changes in generation dynamics (e.g., comparison against random attention reweighting or other non-visual interventions). Without such controls, the claim that MAS directly mitigates underuse of visual evidence during language-stage reasoning is not fully supported.
Authors: We thank the referee for this observation. To better isolate the effect of the Modality Attention Share intervention, we will add the suggested ablation studies in the revised manuscript. These will include comparisons against random attention reweighting and other non-visual interventions to determine whether the performance gains arise specifically from the enforced visual attention budget rather than general changes to generation dynamics. revision: yes
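For the promised ablation, one simple non-visual control would perturb the attention distribution by a comparable amount but with modality-agnostic random weights, so that any MAS-specific gain cannot be attributed to generic changes in generation dynamics. This is an illustrative baseline consistent with the referee's suggestion, not the authors' planned implementation.

```python
import torch

def random_reweighting_control(attn_probs: torch.Tensor, strength: float = 0.3,
                               generator=None) -> torch.Tensor:
    """Mix the original attention distribution with uniform random weights.
    `strength` plays the role of the MAS budget so the perturbation magnitude
    is comparable, but it is applied irrespective of token modality."""
    noise = torch.rand(attn_probs.shape, generator=generator, device=attn_probs.device)
    noise = noise / noise.sum(dim=-1, keepdim=True)
    mixed = (1.0 - strength) * attn_probs + strength * noise
    return mixed / mixed.sum(dim=-1, keepdim=True)
```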
Circularity Check
No circularity: empirical study with independent experimental support
Full rationale
The paper is a purely empirical investigation. It introduces a new controlled dataset (COUNTINGTRICKS), performs attention analysis and component-wise probing on existing VLMs, and evaluates a proposed lightweight intervention (MAS). All load-bearing claims are grounded in direct experimental measurements and intervention outcomes rather than any derivation, fitted parameter, or self-referential definition. No equations, predictions, or uniqueness theorems are invoked that reduce to the paper's own inputs. Self-citations, if present, are not load-bearing for the central claim. This matches the default non-circular outcome for empirical work whose claims are checked against controlled experiments and existing models rather than against its own constructions.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Anonymous. LVLM-Count: Improving counting in large vision-language models. In OpenReview, 2025.
- [2] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- [3] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Silvio Savarese, and Steven C. H. Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
- [4] Xuyang Guo, Zekai Huang, Zhenmei Shi, Zhao Song, and Jiahao Zhang. Your vision-language model can't even count to 20: Exposing the failures of VLMs in compositional counting. arXiv preprint arXiv:2510.04401, 2025.
- [5] Seokhyun Jeong et al. Blind faith in text: Robustness of vision-language models is undermined. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [6] Minsoo Kim et al. When language overrules: Modality imbalance in VLMs. arXiv preprint arXiv:2508.10552, 2025.
- [7] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [8] Chen Ma et al. Pairtally: Segmenting pairs for counting in VLMs. arXiv preprint arXiv:2509.13939, 2025.
- [9] OpenAI. GPT-4o system card. OpenAI Technical Report.
- [10] Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind: Failing to translate detailed visual features into words, 2025.
- [11] Viresh Ranjan, Udbhav Sharma, Thu Nguyen, and Minh Hoai. Learning to count everything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- [12] Qwen Team. Qwen2.5-VL: Scaling vision-language models. arXiv preprint arXiv:2502.13923, 2025.
- [13] Shengbang Tong et al. VLMs can't see the obvious: Benchmarking visual understanding in vision-language models. arXiv preprint arXiv:2507.04741, 2025.
- [14] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [15] Alexander Wan et al. Two effects, one trigger: Visual limitations in VLMs. arXiv preprint arXiv:2404.07983, 2024.
- [16] Yixuan Wei et al. Treat visual tokens as text? arXiv preprint arXiv:2410.06169, 2024.
- [17] Haoyu Zhang et al. See what you are told: Visual attention redistribution. arXiv preprint arXiv:2503.03321, 2025.
- [18] Yanzhe Zhang et al. What's in the image? What survives in modern multimodal LMs. arXiv preprint arXiv:2411.17491, 2024.
- [19] Yifan Zhang et al. Words or vision? Investigating the dominance of text in VLMs. arXiv preprint arXiv:2408.11039, 2024.
- [20] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [21] Simiao Zuo et al. Dynamic visual token compression. arXiv preprint arXiv:2411.19628, 2024.