pith. machine review for the scientific record.

arxiv: 2604.10039 · v1 · submitted 2026-04-11 · 💻 cs.CV

Recognition: unknown

Counting to Four is still a Chore for VLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · object counting · attention analysis · multimodal reasoning · modality attention share · visual evidence degradation · language layer reasoning

The pith

VLMs fail at counting because visual evidence degrades during language-stage reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates the internal causes of object counting failures in vision-language models. It introduces a test suite to probe these failures under controlled conditions and uses attention analysis to track how visual information is handled. The analysis reveals that relevant visual signals are strong initially but weaken as processing moves to language layers, where text influences dominate. Motivated by this, a simple intervention is tested to maintain visual attention during response generation. The work implies that improving counting requires addressing not only perception but also how evidence is used in reasoning steps.

Core claim

Count-relevant visual evidence is strongest in the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by this, the Modality Attention Share intervention encourages a minimum budget of visual attention during answer generation, suggesting that counting failures stem from underuse of visual evidence during language-stage reasoning in addition to visual perception limits.

What carries the argument

The observed degradation of visual evidence across layers from modality projection to language stages, addressed by the Modality Attention Share (MAS) intervention that enforces a minimum visual attention budget during answer generation.
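A minimal sketch of what such a visual-attention floor could look like, assuming MAS-style renormalization of an already-softmaxed attention row. The function name, the `min_share` threshold, and the exact rescaling rule are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def enforce_visual_share(attn_row, image_mask, min_share=0.2):
    """Renormalize one softmaxed attention row so image tokens keep at
    least `min_share` of the probability mass (hypothetical MAS-style floor)."""
    attn = np.asarray(attn_row, dtype=float)
    image_mask = np.asarray(image_mask, dtype=bool)
    visual = attn[image_mask].sum()
    if visual >= min_share or visual == 0.0:
        return attn  # already above the floor, or no visual mass to rescale
    out = attn.copy()
    out[image_mask] *= min_share / visual                             # lift visual mass to the floor
    out[~image_mask] *= (1.0 - min_share) / attn[~image_mask].sum()   # shrink text mass to compensate
    return out
```

On a row whose visual mass sits below the floor, the image-token share is lifted exactly to `min_share` and the text-token share shrinks proportionally, so the row still sums to one.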

Load-bearing premise

That attention analysis and component-wise probing accurately reveal the degradation of visual evidence across layers without confounding factors.
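The premise can be made concrete with a toy version of component-wise probing: fit the same fixed-capacity probe on features tapped at two stages, so any error gap reflects what each representation retains rather than probe capacity. The ridge probe and the synthetic features below are assumptions standing in for the paper's YOLO probing head:

```python
import numpy as np

def probe_tap(features, counts, ridge=1e-3):
    """Fit one fixed-capacity ridge-regression probe that predicts object
    count from tapped features; returns mean absolute error."""
    X = np.hstack([features, np.ones((len(features), 1))])  # add bias column
    w = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ counts)
    return float(np.mean(np.abs(X @ w - counts)))

# Synthetic taps: the "projector" features still encode the count,
# while the "LLM" features have largely washed it out.
rng = np.random.default_rng(0)
counts = rng.integers(1, 9, size=200).astype(float)
projector = counts[:, None] + 0.1 * rng.normal(size=(200, 8))
llm = 0.2 * counts[:, None] + rng.normal(size=(200, 8))
assert probe_tap(projector, counts) < probe_tap(llm, counts)
```

Because schedule, capacity, and data are identical across taps, the only confound left is the representation itself, which is exactly the premise being leaned on.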

What would settle it

An experiment where the MAS intervention is applied but counting accuracy on the controlled shape-based cases shows no improvement, or where high visual attention persists yet errors remain.

Figures

Figures reproduced from arXiv: 2604.10039 by Duy Le Dinh Anh, Patrick Amadeus Irawan, Tuan Van Vo.

Figure 1.
Figure 2. The Toy-32 Visual Suite. We craft multiple patchification cases to probe perception failures under different geometric alignments and isolate Vision Encoder blind spots. Object counting benchmark data: 1A–B (fixed/varied sizes), 2A–D (vertical grid alignment), 3A–D (horizontal grid alignment), 4A–D (grid intersection alignment), 5A–8A (circle diameters 2.5–4 patch sizes); alignment relative to the patch gr…
Figure 3. Evidence of Text-Prior Dominance. Our probing analysis reveals that models suffer from a "visual attention sink," where computation drifts away from image tokens toward system prompts and instructions.
Figure 4. Spatial evidence is strong in the vision stack but fades in the language layers. Detection performance (AP@0.5) peaks at the projector tap (green) and drops at the LLM tap (red) across modern architectures, quantifying the loss of spatial information. The performance difference reflects the quality of the retained representation, not the probe's learning capacity.
Figure 5. Training Dynamics (YOLO Probing Head). The Projector probe (center) reaches higher AP sooner, while the Encoder plateaus lower and LLM curves exhibit higher volatility. Since the schedule and seed are identical across taps, these differences reflect tap utility rather than optimization quirks.
Figure 6. Vision Encoder Heatmaps. Activations in the early visual backbone are sharp and distinct, perfectly localizing the object instances. This confirms that the counting signal is present upstream but fades on entering the language space. Attention maps from the Fused Tokens within the LLM (averaged over layers 15–25) show a dramatic loss of fidelity; the attention becomes diffuse and spatially …
Figure 7. Accuracy vs. Ground Truth Count. Left: scatter plots show a consistent negative correlation (r ≈ −0.78) between count magnitude and accuracy across diverse geometric cases. Right: a detailed breakdown for LLaVA-1.5 and Qwen2.5-VL reveals "blind spots" (circled in red) where models achieve 0% accuracy for specific numbers (e.g., 7, 11), evidencing strong linguistic priors.
Figure 8. Fused Token Heatmaps. Averaged over deep LLM layers (15–25), the attention becomes diffuse and misaligned. The signal "washes out," failing to retain the instance-level separation required for accurate counting.
Original abstract

Vision-language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skills such as object counting. Existing evaluations mostly assess only final outputs, offering limited insight into where these failures arise inside the model. In this work, we present an empirical study of VLM counting behavior through both behavioral and mechanistic analysis. We introduce COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases designed to expose vulnerabilities under different patchification layouts and adversarial prompting conditions. Using attention analysis and component-wise probing, we show that count-relevant visual evidence is strongest in the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by this finding, we further evaluate Modality Attention Share (MAS), a lightweight intervention that encourages a minimum budget of visual attention during answer generation. Our results suggest that counting failures in VLMs stem not only from visual perception limits, but also from the underuse of visual evidence during language-stage reasoning. Code and dataset will be released at https://github.com/leduy99/-CVPRW26-Modality-Attention-Share.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an empirical study of vision-language model (VLM) failures on simple object counting. It introduces the COUNTINGTRICKS benchmark of controlled shape-based cases, performs behavioral evaluation under varying patchification and prompting, and uses attention analysis plus component-wise probing to show that count-relevant visual signals are strongest at the modality projection stage but degrade in subsequent language layers, increasing susceptibility to text priors. Motivated by this, the authors propose the lightweight Modality Attention Share (MAS) intervention to enforce a minimum visual attention budget during generation and report that it improves counting performance, concluding that failures arise from both perception limits and underuse of visual evidence in language-stage reasoning.

Significance. If the mechanistic account and intervention results hold, the work provides useful insight into internal causes of grounding failures in VLMs beyond output-only evaluation and offers a simple, deployable fix. The controlled COUNTINGTRICKS suite and planned code/dataset release are clear strengths that enable reproducibility and further study. The emphasis on layer-wise degradation of multimodal signals is a timely contribution to understanding where multimodal reasoning breaks down.

major comments (2)
  1. [Mechanistic analysis] Mechanistic analysis section: attention maps and probing show count-relevant visual evidence strongest at modality projection and degrading in later language layers, but this remains correlational. No direct causal test (e.g., targeted intervention on hidden states to restore count information before the observed drop and measurement of downstream error reduction) is reported, leaving open the possibility that degradation is a symptom of an earlier commitment to a text-prior answer rather than an independent cause of counting errors.
  2. [MAS intervention] MAS intervention evaluation: the reported performance lift from enforcing visual attention share is promising, yet the manuscript lacks ablations that isolate whether the gain stems specifically from restored use of visual evidence versus nonspecific changes in generation dynamics (e.g., comparison against random attention reweighting or other non-visual interventions). Without such controls, the claim that MAS directly mitigates underuse of visual evidence during language-stage reasoning is not fully supported.
minor comments (2)
  1. [Abstract] Abstract and results sections should include concrete quantitative details (accuracy deltas, sample sizes per condition, statistical tests) rather than qualitative statements such as 'degrades substantially' or 'boosts visual attention share.'
  2. [Benchmark construction] The description of COUNTINGTRICKS would benefit from an explicit table or figure summarizing the patchification layouts and adversarial prompt variants used, to make the controlled nature of the benchmark immediately clear.
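The control asked for in major comment 2 could be sketched as follows: free up the same attention budget, but redirect it to randomly chosen text tokens instead of image tokens, so that any MAS gain which survives this reweighting is not visual-specific. The function name, the budget, and the half-of-text-tokens target set are illustrative assumptions:

```python
import numpy as np

def random_reweight(attn_row, image_mask, budget=0.1, seed=0):
    """Control condition: shrink all attention mass uniformly by `budget`
    and dump the freed mass on a random subset of non-visual tokens
    (instead of boosting image tokens, as a MAS-style floor would)."""
    rng = np.random.default_rng(seed)
    attn = np.asarray(attn_row, dtype=float)
    image_mask = np.asarray(image_mask, dtype=bool)
    text_idx = np.flatnonzero(~image_mask)
    target = rng.choice(text_idx, size=max(1, len(text_idx) // 2), replace=False)
    out = attn * (1.0 - budget)            # shrink every token uniformly
    out[target] += budget / len(target)    # redirect the freed mass to random text tokens
    return out
```

If counting accuracy improves as much under this control as under MAS, the gain would be attributable to generic perturbation of generation dynamics rather than restored use of visual evidence.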

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review of our manuscript. The comments highlight important aspects of our mechanistic analysis and intervention evaluation that we will address to improve clarity and rigor. We provide point-by-point responses below.

Point-by-point responses
  1. Referee: [Mechanistic analysis] Mechanistic analysis section: attention maps and probing show count-relevant visual evidence strongest at modality projection and degrading in later language layers, but this remains correlational. No direct causal test (e.g., targeted intervention on hidden states to restore count information before the observed drop and measurement of downstream error reduction) is reported, leaving open the possibility that degradation is a symptom of an earlier commitment to a text-prior answer rather than an independent cause of counting errors.

    Authors: We agree that the mechanistic findings rely on correlational evidence from attention analysis and component-wise probing, which demonstrate the layer-wise degradation of visual signals but do not include direct causal interventions such as targeted modifications to hidden states. This correlational approach is consistent with the behavioral results showing increased susceptibility to text priors in later layers. In the revised manuscript, we will explicitly acknowledge the correlational nature of these analyses, discuss the alternative interpretation raised, and outline potential causal experiments as future work to more definitively establish the direction of causality. revision: partial

  2. Referee: [MAS intervention] MAS intervention evaluation: the reported performance lift from enforcing visual attention share is promising, yet the manuscript lacks ablations that isolate whether the gain stems specifically from restored use of visual evidence versus nonspecific changes in generation dynamics (e.g., comparison against random attention reweighting or other non-visual interventions). Without such controls, the claim that MAS directly mitigates underuse of visual evidence during language-stage reasoning is not fully supported.

    Authors: We thank the referee for this observation. To better isolate the effect of the Modality Attention Share intervention, we will add the suggested ablation studies in the revised manuscript. These will include comparisons against random attention reweighting and other non-visual interventions to determine whether the performance gains arise specifically from the enforced visual attention budget rather than general changes to generation dynamics. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical study with independent experimental support

Full rationale

The paper is a purely empirical investigation. It introduces a new controlled dataset (COUNTINGTRICKS), performs attention analysis and component-wise probing on existing VLMs, and evaluates a proposed lightweight intervention (MAS). All load-bearing claims are grounded in direct experimental measurements and intervention outcomes rather than any derivation, fitted parameter, or self-referential definition. No equations, predictions, or uniqueness theorems are invoked that reduce to the paper's own inputs. Self-citations, if present, are not load-bearing for the central claim. This matches the default non-circular outcome for empirical work that remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine-learning study with no mathematical derivations. It rests on standard interpretability assumptions that attention weights reflect evidence usage and that probing layers reveals information flow.
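Under that assumption, the diagnostic itself is simple to state: the share of a generated token's attention mass that lands on image tokens, tracked per layer. The tensor layout below is an illustrative assumption, not the paper's code:

```python
import numpy as np

def visual_attention_share(attn, image_mask):
    """Per-layer fraction of attention mass placed on image tokens,
    averaged over heads. `attn` has shape (layers, heads, tokens) with
    each row already softmaxed; `image_mask` marks image-token positions."""
    attn = np.asarray(attn, dtype=float)
    image_mask = np.asarray(image_mask, dtype=bool)
    per_head = attn[..., image_mask].sum(axis=-1)  # (layers, heads) visual mass
    return per_head.mean(axis=-1)                  # (layers,) averaged over heads
```

A curve that starts high at the projection stage and decays toward zero across language layers would be the quantitative signature of the "visual attention sink" described above.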

pith-pipeline@v0.9.0 · 5507 in / 1088 out tokens · 30203 ms · 2026-05-10T15:34:39.145519+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

21 extracted references · 14 canonical work pages · 5 internal anchors

  1. [1]

    Lvlm-count: Improving counting in large vision-language models

    Anonymous. Lvlm-count: Improving counting in large vision-language models. In OpenReview, 2025.

  2. [2]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

  3. [3]

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Silvio Savarese, and Steven C.H. Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.

  4. [4]

    Your vision-language model can’t even count to 20: Exposing the failures of vlms in compositional counting

    Xuyang Guo, Zekai Huang, Zhenmei Shi, Zhao Song, and Jiahao Zhang. Your vision-language model can't even count to 20: Exposing the failures of VLMs in compositional counting. arXiv preprint arXiv:2510.04401, 2025.

  5. [5]

    Blind faith in text: Robustness of vision-language models is undermined

    Seokhyun Jeong et al. Blind faith in text: Robustness of vision-language models is undermined. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  6. [6]

    When language overrules: Modality imbalance in VLMs

    Minsoo Kim et al. When language overrules: Modality imbalance in VLMs. arXiv preprint arXiv:2508.10552, 2025.

  7. [7]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

  8. [8]

    Pairtally: Segmenting pairs for counting in VLMs

    Chen Ma et al. Pairtally: Segmenting pairs for counting in VLMs. arXiv preprint arXiv:2509.13939, 2025.

  9. [9]

    GPT-4o system card

    OpenAI. GPT-4o system card. OpenAI Technical Report.

  10. [10]

    Vision language models are blind: Failing to translate detailed visual features into words, 2025

    Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind: Failing to translate detailed visual features into words, 2025.

  11. [11]

    Learning to count everything

    Viresh Ranjan, Udbhav Sharma, Thu Nguyen, and Minh Hoai. Learning to count everything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.

  12. [12]

    Qwen2.5-VL Technical Report

    Qwen Team. Qwen2.5-VL: Scaling vision-language models. arXiv preprint arXiv:2502.13923, 2025.

  13. [13]

    VLMs can’t see the obvious: Benchmarking visual understanding in vision-language models

    Shengbang Tong et al. VLMs can’t see the obvious: Benchmarking visual understanding in vision-language models. arXiv preprint arXiv:2507.04741, 2025.

  14. [14]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

  15. [15]

    Two effects, one trigger: on the modality gap, object bias, and information imbalance in contrastive vision-language representation learning

    Alexander Wan et al. Two effects, one trigger: Visual limitations in VLMs. arXiv preprint arXiv:2404.07983, 2024.

  16. [16]

    Can clip count stars? an empirical study on quantity bias in clip

    Yixuan Wei et al. Treat visual tokens as text? arXiv preprint arXiv:2410.06169, 2024.

  17. [17]

    See what you are told: Visual attention sink in large multimodal models

    Haoyu Zhang et al. See what you are told: Visual attention redistribution. arXiv preprint arXiv:2503.03321, 2025.

  18. [18]

    What’s in the image? What survives in modern multimodal LMs

    Yanzhe Zhang et al. What’s in the image? What survives in modern multimodal LMs. arXiv preprint arXiv:2411.17491, 2024.

  19. [19]

    Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    Yifan Zhang et al. Words or vision? Investigating the dominance of text in VLMs. arXiv preprint arXiv:2408.11039,

  20. [20]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

  21. [21]

    Accelerating multimodal large language models via dynamic visual-token exit and the empirical findings

    Simiao Zuo et al. Dynamic visual token compression. arXiv preprint arXiv:2411.19628, 2024.