arxiv: 2511.17699 · v2 · submitted 2025-11-21 · 💻 cs.CV · cs.AI

Understanding Counting Mechanisms in Large Language and Vision-Language Models

Pith reviewed 2026-05-17 20:06 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords counting mechanisms · large language models · vision-language models · mechanistic interpretability · numerical representations · layerwise analysis · internal counter · causal mediation

The pith

Large language and vision-language models maintain counts through an internal counter that updates with each item and stores the total mainly in the final token or region.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how LLMs and LVLMs handle counting by running controlled tests with repeated text or image items. Behavioral tests combined with observational and causal mediation analyses show that individual tokens or visual features carry latent position-based count signals that can be extracted and transferred across contexts. These signals build gradually across layers, with lower layers encoding small counts and higher layers encoding larger ones. The work identifies an internal counter that updates with each new item and keeps the running total mainly in the final token or region. In vision-language models the same signals also shift between background and foreground parts of an image depending on spatial composition, and models often lean on separators or other structural markers as shortcuts for tracking totals.

Core claim

Counting emerges as a structured, layerwise process in which an internal counter mechanism updates with each successive item. The count value is encoded in latent positional information carried by individual tokens or visual features and is stored primarily in the final token or image region. Lower layers represent small counts while higher layers represent larger counts. In vision-language models numerical information also appears in visual embeddings and shifts between background and foreground regions according to spatial arrangement. Models further exploit structural cues such as text separators as shortcuts that strongly affect numerical accuracy.

What carries the argument

The internal counter mechanism, which increments with each new item and holds the running total chiefly in the final token or region, carries the numerical state through the model.

If this is right

  • Count information encoded in single tokens can be read out and reused in new contexts.
  • Lower layers handle small counts while higher layers handle large counts, creating a progressive buildup of numerical precision.
  • In vision-language models the same numerical signals move between background and foreground image regions based on scene layout.
  • Models rely on separators and other structural markers in text as shortcuts that can boost or hurt counting accuracy.
  • The same general counting pattern appears in both language-only and vision-language models.
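
The layerwise claim in the list above can be made concrete with a logit-lens-style readout: project the final-position hidden state at each layer onto the candidate count tokens and track the probability given to the true count. The following is a minimal sketch on hand-made toy vectors. The names `readout` and `hidden_by_layer` and the 2-dimensional states are illustrative assumptions, not CountScope and not any model's actual weights.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def count_probs(hidden, readout):
    """Project one hidden state onto each candidate count's readout vector."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in readout]
    return softmax(logits)

def ground_truth_prob_by_layer(hidden_by_layer, readout, true_count, counts):
    """Probability assigned to the true count at every layer."""
    idx = counts.index(true_count)
    return [count_probs(h, readout)[idx] for h in hidden_by_layer]

# Toy setup (hypothetical): three candidate counts, 2-d hidden states.
counts = [1, 2, 3]
readout = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
# The final-position state drifts toward the "2" direction with depth.
hidden_by_layer = [[0.0, 0.0], [0.0, 1.0], [0.0, 3.0]]

probs = ground_truth_prob_by_layer(hidden_by_layer, readout, 2, counts)
```

With these toy inputs the probability of the true count starts at chance in the first layer and rises with depth. Repeating the readout for each true count would show at which layer each count becomes decodable.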

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Targeting the final token with small activation edits might improve counting performance on tasks where models currently fail.
  • The layerwise shift from small to large counts suggests that deeper layers could be the right place to add explicit numerical modules in future architectures.
  • If similar counters exist for other sequential operations, the same analysis approach could map how models track order or time.
  • Models that depend heavily on separators may underperform on counting problems that lack clear textual boundaries.

Load-bearing premise

The observational and causal analyses truly detect real counting processes inside the models rather than side effects created by the repeated-item test prompts or the analysis tool.

What would settle it

A direct test would intervene on the final token's activations and check whether the model's reported count changes while other capabilities stay intact.
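
The shape of such an intervention can be sketched with a deliberately simple stand-in. The `run` function below is a hypothetical toy, not a transformer and not the paper's setup: each position stores a running count and the answer is read from the final position only. It shows the patching procedure itself, with a non-final patch as a control.

```python
def run(items, patch=None):
    """Toy stand-in for a forward pass (NOT a transformer).

    Position i holds a running count i + 1, and the answer is read from
    the final position only. `patch` is an optional (position, value)
    pair that overwrites one position's state before the readout.
    """
    states = list(range(1, len(items) + 1))
    if patch is not None:
        position, value = patch
        states[position] = value
    return states, states[-1]

# A source run with five items supplies the state to transplant.
source_states, _ = run(["apple"] * 5)
target = ["apple"] * 3

_, clean_answer = run(target)
# Intervention: overwrite the target's final-position state.
_, patched_answer = run(target, patch=(-1, source_states[-1]))
# Control: overwrite a non-final position instead.
_, control_answer = run(target, patch=(0, source_states[-1]))
```

In this toy the patched run answers 5 instead of 3 and the control stays at 3 by construction. With a real model the outcome is the empirical question, and one would also score unrelated prompts to check that other capabilities stay intact.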

Figures

Figures reproduced from arXiv: 2511.17699 by Amirmohammad Izadi, Fatemeh Askari, Hosein Hasani, Mahdieh Soleymani Baghshah, Mobin Bagherian, Mohammad Izadi, Sadegh Mohammadian.

Figure 1. A minimal graphical abstract illustrating how ...
Figure 2. Representational behavior of embeddings in a selected ...
Figure 3. Ground-truth (total count) probability of visual objects ...
Figure 4. Per-item latent count encoding, decoded by CountScope. Each heatmap shows the probability of decoding numbers (1–9) across ...
Figure 5. Type-specific counter behavior revealed by CountScope. ...
Figure 6. Layerwise decoding of latent counts using online patch ...
Figure 7. Per-item latent count for separators, decoded by ...
Figure 8. Example images from the visual dataset. Monotypic ...
Figure 9. Mean ground-truth probability across layer windows un...
Figure 10. Layer-wise PCA of Qwen2.5 input-token representations. PCA trajectories across layers for (a) element tokens and (b) separator tokens in the monotypic, question-first setting.
Figure 11. Layer-wise PCA of Qwen2.5 output representations. PCA embeddings across layers for generated numerical responses, colored by (a) predicted count and (b) item type, in the monotypic, question-first setting.
Figure 12. Layer-wise PCA of Qwen2.5VL embeddings. PCA trajectories across layers for (a) input item embeddings and (b) generated output embeddings in the monotypic setting.
Figure 13. Layer-wise cosine similarity of Qwen2.5 representations. Cosine similarity matrices across layers for (a) element tokens and (b) separator tokens in the monotypic, question-first setting. Cosine similarities are computed across different tasks with different item types and then averaged over the dataset.

Original abstract

Counting is one of the fundamental abilities of large language models (LLMs) and large vision-language models (LVLMs). This paper examines how these foundation models represent and compute numerical information in counting tasks. We use controlled experiments with repeated textual and visual items and analyze counting in LLMs and LVLMs through a set of behavioral, observational, and causal mediation analyses. To this end, we design a specialized tool, CountScope, for the mechanistic interpretability of numerical content. Results show that individual tokens or visual features encode latent positional count information that can be extracted and transferred across contexts. Layerwise analyses reveal a progressive emergence of numerical representations, with lower layers encoding small counts and higher layers representing larger ones. We identify an internal counter mechanism that updates with each item, stored mainly in the final token or region. In LVLMs, numerical information also appears in visual embeddings, shifting between background and foreground regions depending on spatial composition. We further reveal that models rely on structural cues such as separators in text, which act as shortcuts for tracking item counts and strongly influence the accuracy of numerical predictions. Overall, counting emerges as a structured, layerwise process in LLMs and follows the same general pattern in LVLMs, shaped by the properties of the vision encoder.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines counting in LLMs and LVLMs via behavioral, observational, and causal mediation analyses on repeated textual and visual items. It introduces CountScope for mechanistic interpretability and claims to locate an internal counter that updates per item (primarily in the final token/region), with layerwise progressive emergence of numerical representations (lower layers for small counts, higher layers for larger ones). Numerical information also appears in LVLMs' visual embeddings (shifting by spatial composition), while models rely on structural cues such as separators as shortcuts that strongly affect prediction accuracy.

Significance. If the mediation analyses and CountScope tool successfully isolate genuine internal mechanisms rather than prompt artifacts, the work would advance mechanistic interpretability of numerical reasoning in foundation models and supply a reusable tool for probing count representations. The layerwise and cross-modal findings could inform architecture design for better numerical generalization.

major comments (2)
  1. [Causal mediation analyses and CountScope results] The central claim that an internal counter mechanism (stored mainly in the final token/region) has been identified rests on repeated-item prompts; the abstract itself states that models rely on structural cues such as separators as shortcuts that strongly influence numerical predictions. No explicit test or ablation is described that rules out the possibility that CountScope and the mediation analyses are extracting these prompt-induced positional/separator signals rather than a general counting procedure.
  2. [Layerwise analyses] Layerwise analyses claim progressive emergence (lower layers encode small counts, higher layers larger ones), yet the abstract and high-level description provide no quantitative metrics, error bars, statistical tests, or ablation controls to support the layer-specific encoding distinction or its load-bearing role in the counting mechanism.
minor comments (2)
  1. [Abstract] The abstract summarizes methods and high-level results but omits key quantitative findings, sample sizes, or performance numbers that would allow readers to assess effect sizes.
  2. [Methods] Clarify the precise input format and output of CountScope (e.g., how it extracts and transfers latent positional count information) to support reproducibility.
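
For the input-format question, and for the separator ablation raised in the first major comment, a prompt generator along the following lines could serve as a starting point. This is a sketch under stated assumptions: the fruit vocabulary and the 1 to 9 length range come from the Task Details excerpt near the end of this page, while the question wording, the newline layout, the `separator` parameter, and the reading of "polytypic" as independently sampled items are guesses, not the authors' specification.

```python
import random

# Vocabulary as listed in the paper's Task Details excerpt.
FRUITS = ["apple", "orange", "peach", "fig", "mango",
          "pear", "coconut", "cherry", "plum"]

def build_prompt(n_items, monotypic=True, separator=", ",
                 question_first=False, rng=None):
    """Return (prompt, true_count) for one counting example.

    Assumed conventions (hypothetical): monotypic repeats one item,
    polytypic samples each item independently, and the question text
    below is a placeholder rather than the paper's wording.
    """
    if not 1 <= n_items <= 9:
        raise ValueError("list length must be between 1 and 9")
    if rng is None:
        rng = random.Random(0)
    if monotypic:
        items = [rng.choice(FRUITS)] * n_items
    else:
        items = [rng.choice(FRUITS) for _ in range(n_items)]
    item_list = separator.join(items)
    question = "How many items are in the list?"  # assumed wording
    parts = [question, item_list] if question_first else [item_list, question]
    return "\n".join(parts), n_items

# Same list length with and without comma separators.
with_sep, n_with = build_prompt(4, separator=", ")
without_sep, n_without = build_prompt(4, separator=" ")
```

Comparing model accuracy on `with_sep` versus `without_sep` style prompts is one way to probe how much the count readout depends on separator tokens.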

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Causal mediation analyses and CountScope results] The central claim that an internal counter mechanism (stored mainly in the final token/region) has been identified rests on repeated-item prompts; the abstract itself states that models rely on structural cues such as separators as shortcuts that strongly influence numerical predictions. No explicit test or ablation is described that rules out the possibility that CountScope and the mediation analyses are extracting these prompt-induced positional/separator signals rather than a general counting procedure.

    Authors: We agree that it is important to distinguish prompt artifacts from a general internal counting procedure. The causal mediation analyses intervene on activations within the final token while holding the input prompt (including separators) fixed, and demonstrate that these interventions causally alter the model's numerical output. CountScope similarly extracts and transfers count information from internal states across contexts. Nevertheless, we acknowledge that an explicit ablation varying or removing separators was not included. We will add this experiment in the revised manuscript, showing that the internal counter mechanism remains operative even when structural cues are minimized. revision: yes

  2. Referee: [Layerwise analyses] Layerwise analyses claim progressive emergence (lower layers encode small counts, higher layers larger ones), yet the abstract and high-level description provide no quantitative metrics, error bars, statistical tests, or ablation controls to support the layer-specific encoding distinction or its load-bearing role in the counting mechanism.

    Authors: The quantitative support for progressive emergence, including per-layer accuracy curves, error bars from multiple random seeds, and statistical comparisons, appears in Section 4.2 and the associated figures. We will revise the abstract and introduction to explicitly report key metrics (e.g., the layer thresholds at which accuracy for count k exceeds baseline) and briefly summarize the ablation controls that establish the functional role of these layer-specific representations. revision: yes
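
The "layer threshold" metric mentioned in this response can be sketched as the first layer at which decoding accuracy for a given count exceeds a baseline. The accuracies below are made-up numbers chosen only to illustrate the computation, and the function names are hypothetical. Nothing here reproduces values from the paper.

```python
def first_layer_above(acc_by_layer, baseline):
    """Index of the first layer whose accuracy exceeds `baseline`, else None."""
    for layer, acc in enumerate(acc_by_layer):
        if acc > baseline:
            return layer
    return None

def thresholds(acc, baseline):
    """Map each count k to the layer at which it first beats the baseline."""
    return {k: first_layer_above(curve, baseline) for k, curve in acc.items()}

# Made-up per-layer accuracies for three counts (illustration only).
acc = {
    1: [0.10, 0.60, 0.90, 0.95],
    5: [0.10, 0.11, 0.55, 0.90],
    9: [0.10, 0.11, 0.11, 0.40],
}
baseline = 1 / 9  # chance level for nine candidate counts
result = thresholds(acc, baseline)
```

If thresholds increase with k, as they do in this invented example, that pattern would be consistent with the claim that larger counts emerge in later layers. Error bars across seeds would be needed before reading much into any single threshold.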

Circularity Check

0 steps flagged

No circularity: purely empirical mediation and observational analyses

full rationale

The paper conducts behavioral experiments, observational analyses, and causal mediation interventions via the CountScope tool on repeated-item prompts in LLMs and LVLMs. No first-principles derivation, mathematical model, or predictive equation is presented whose output is shown to equal its inputs by construction. Layerwise emergence claims and internal-counter localization rest on direct measurements and interventions rather than any self-definitional loop, fitted-parameter renaming, or self-citation chain that would render the central result tautological. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard assumptions from mechanistic interpretability research; the abstract does not introduce or fit new free parameters or postulate new entities.

axioms (1)
  • domain assumption: Causal mediation analysis can isolate the causal contribution of specific internal representations to counting behavior.
    Invoked when the paper states it uses causal mediation analyses to study numerical content.
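
One common way to quantify such a contribution, in line with the "mean drop in the probability of the ground-truth count" that the supplementary excerpt reports after zero patching, is to average the clean-minus-patched probability over examples. The sketch below assumes the per-example probabilities have already been computed elsewhere. The function name and all numbers are hypothetical.

```python
def mean_probability_drop(clean_probs, patched_probs):
    """Average of p_clean(ground truth) - p_patched(ground truth)."""
    if len(clean_probs) != len(patched_probs) or not clean_probs:
        raise ValueError("need two non-empty lists of equal length")
    drops = [c - p for c, p in zip(clean_probs, patched_probs)]
    return sum(drops) / len(drops)

# Made-up probabilities for three examples (illustration only).
clean = [0.9, 0.8, 0.7]
patched_a = [0.1, 0.2, 0.1]    # zero-patching one hypothetical input span
patched_b = [0.85, 0.8, 0.65]  # zero-patching a different hypothetical span

drop_a = mean_probability_drop(clean, patched_a)
drop_b = mean_probability_drop(clean, patched_b)
```

A larger mean drop for one span than another suggests that the patched span mediates more of the count information, under the assumption that zero patching does not push activations too far off-distribution.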

pith-pipeline@v0.9.0 · 5556 in / 1181 out tokens · 39296 ms · 2026-05-17T20:06:21.432021+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    CounterCount shows VLMs perform well on factual counting images but degrade on counterfactual edits, revealing reliance on object priors, and introduces an attention reweighting method that improves accuracy by up to 8%.

Supplementary material excerpts

Task Details. The textual dataset is built from simple lists of item names and short counting questions. Items are sampled uniformly from a fixed vocabulary of common fruits (apple, orange, peach, fig, mango, pear, coconut, cherry, plum). Lists range from length 1 to 9. We use four prompt configurations: monotypic lists, polytypic lists, list-first (also...

Behavioral Characterization of Counting. We begin by quantifying the counting accuracy of LLMs and LVLMs across all experimental configurations. Table 6 reports the performance of two LLMs (Qwen2.5, Llama3) and two LVLMs (Qwen2.5-VL, InternVL3.5) on textual counting tasks across category types, ordering conditions, and question types. All models are of sim...

Causal Mediation Analysis. Here, we provide additional details of the experiments conducted for causal mediation analysis. Table 10 reports the mean drop in the probability of the ground-truth count after offline zero patching of context and question (for LLMs) or image and prompt (for LVLMs). The results confirm that count-related information is primari...

Layer-Wise Representational Analysis. This section provides layer-wise visualizations of representational structure for both LLMs and LVLMs. Figure 10 shows PCA projections of input tokens, and Figure 11 presents PCA of generated responses. Figure 12 reports the corresponding trajectories for the LVLM. Figure 13 shows cosine similarity patterns across la...