Understanding Counting Mechanisms in Large Language and Vision-Language Models
Pith reviewed 2026-05-17 20:06 UTC · model grok-4.3
The pith
Large language and vision-language models maintain counts through an internal counter that updates with each item and stores the total mainly in the final token or region.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Counting emerges as a structured, layerwise process in which an internal counter mechanism updates with each successive item. The count value is encoded in latent positional information carried by individual tokens or visual features and is stored primarily in the final token or image region. Lower layers represent small counts while higher layers represent larger counts. In vision-language models numerical information also appears in visual embeddings and shifts between background and foreground regions according to spatial arrangement. Models further exploit structural cues such as text separators as shortcuts that strongly affect numerical accuracy.
What carries the argument
The internal counter mechanism, which increments with each new item and holds the running total chiefly in the final token or region, carries the numerical state through the model.
If this is right
- Count information encoded in single tokens can be read out and reused in new contexts.
- Lower layers handle small counts while higher layers handle large counts, so numerical representations build up progressively with depth.
- In vision-language models the same numerical signals move between background and foreground image regions based on scene layout.
- Models rely on separators and other structural markers in text as shortcuts that can boost or hurt counting accuracy.
- The same general counting pattern appears in both language-only and vision-language models.
Where Pith is reading between the lines
- Targeting the final token with small activation edits might improve counting performance on tasks where models currently fail.
- The layerwise shift from small to large counts suggests that deeper layers could be the right place to add explicit numerical modules in future architectures.
- If similar counters exist for other sequential operations, the same analysis approach could map how models track order or time.
- Models that depend heavily on separators may underperform on counting problems that lack clear textual boundaries.
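The separator dependence raised in the last point could be probed with prompts that differ only in how items are delimited. The following is a minimal sketch under stated assumptions: the function names, the set of separators, and the question wording are hypothetical and are not taken from the paper.

```python
def build_prompt(items, separator=", "):
    """Join the items with one separator and append a counting question.

    The question wording is a hypothetical placeholder.
    """
    listing = separator.join(items)
    return f"{listing}\nHow many items are in the list?"


def separator_variants(items, separators=(", ", " ", "; ", "\n")):
    """Return one prompt per separator; the ground-truth count is the same in all."""
    return {sep: build_prompt(items, sep) for sep in separators}


# Four repeated items, rendered with four different delimiters.
variants = separator_variants(["apple"] * 4)
for sep, prompt in variants.items():
    print(repr(sep), "->", repr(prompt))
```

Comparing a model's accuracy across such variants would indicate how much of its counting rests on the delimiter rather than on the items themselves.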
Load-bearing premise
The observational and causal analyses detect genuine counting processes inside the models rather than artifacts of the repeated-item test prompts or of the analysis tool itself.
What would settle it
A direct test would intervene on the final token's activations and check whether the model's reported count changes while other capabilities stay intact.
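The logic of this test can be illustrated on a toy stand-in for a model. The sketch below is not CountScope and does not touch a real transformer. It hardcodes the hypothesis under test (a running counter whose value is read from the final position), and `run_toy_model` and its `patch` argument are hypothetical names introduced only for illustration.

```python
def run_toy_model(tokens, patch=None):
    """Toy stand-in: one scalar 'hidden state' per position.

    Each state is the number of non-separator tokens seen so far.
    `patch` optionally maps a position index to a replacement state.
    The reported count is read from the final position only.
    """
    states = []
    running = 0
    for tok in tokens:
        if tok != ",":
            running += 1
        states.append(running)
    if patch:
        for pos, value in patch.items():
            states[pos] = value
    return states, states[-1]


clean = ["apple", ",", "apple", ",", "apple"]                               # 3 items
source = ["apple", ",", "apple", ",", "apple", ",", "apple", ",", "apple"]  # 5 items

clean_states, clean_count = run_toy_model(clean)
source_states, source_count = run_toy_model(source)

# Intervention: overwrite the clean run's final state with the source run's.
_, patched_final = run_toy_model(clean, patch={len(clean) - 1: source_states[-1]})

# Control: overwrite a non-final position instead.
_, patched_early = run_toy_model(clean, patch={0: source_states[-1]})

print(clean_count, source_count, patched_final, patched_early)
```

In this toy the reported count follows the patched final state and ignores the early patch by construction. In a real model an early patch can propagate through attention, so the control is informative only there, and the check that other capabilities stay intact has no counterpart in the sketch.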
Original abstract
Counting is one of the fundamental abilities of large language models (LLMs) and large vision-language models (LVLMs). This paper examines how these foundation models represent and compute numerical information in counting tasks. We use controlled experiments with repeated textual and visual items and analyze counting in LLMs and LVLMs through a set of behavioral, observational, and causal mediation analyses. To this end, we design a specialized tool, CountScope, for the mechanistic interpretability of numerical content. Results show that individual tokens or visual features encode latent positional count information that can be extracted and transferred across contexts. Layerwise analyses reveal a progressive emergence of numerical representations, with lower layers encoding small counts and higher layers representing larger ones. We identify an internal counter mechanism that updates with each item, stored mainly in the final token or region. In LVLMs, numerical information also appears in visual embeddings, shifting between background and foreground regions depending on spatial composition. We further reveal that models rely on structural cues such as separators in text, which act as shortcuts for tracking item counts and strongly influence the accuracy of numerical predictions. Overall, counting emerges as a structured, layerwise process in LLMs and follows the same general pattern in LVLMs, shaped by the properties of the vision encoder.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines counting in LLMs and LVLMs via behavioral, observational, and causal mediation analyses on repeated textual and visual items. It introduces CountScope for mechanistic interpretability and claims to locate an internal counter that updates per item (primarily in the final token/region), with layerwise progressive emergence of numerical representations (lower layers for small counts, higher layers for larger ones). Numerical information also appears in LVLMs' visual embeddings (shifting by spatial composition), while models rely on structural cues such as separators as shortcuts that strongly affect prediction accuracy.
Significance. If the mediation analyses and CountScope tool successfully isolate genuine internal mechanisms rather than prompt artifacts, the work would advance mechanistic interpretability of numerical reasoning in foundation models and supply a reusable tool for probing count representations. The layerwise and cross-modal findings could inform architecture design for better numerical generalization.
Major comments (2)
- [Causal mediation analyses and CountScope results] The central claim that an internal counter mechanism (stored mainly in the final token/region) has been identified rests on repeated-item prompts; the abstract itself states that models rely on structural cues such as separators as shortcuts that strongly influence numerical predictions. No explicit test or ablation is described that rules out the possibility that CountScope and the mediation analyses are extracting these prompt-induced positional/separator signals rather than a general counting procedure.
- [Layerwise analyses] Layerwise analyses claim progressive emergence (lower layers encode small counts, higher layers larger ones), yet the abstract and high-level description provide no quantitative metrics, error bars, statistical tests, or ablation controls to support the layer-specific encoding distinction or its load-bearing role in the counting mechanism.
Minor comments (2)
- [Abstract] The abstract summarizes methods and high-level results but omits key quantitative findings, sample sizes, or performance numbers that would allow readers to assess effect sizes.
- [Methods] Clarify the precise input format and output of CountScope (e.g., how it extracts and transfers latent positional count information) to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.
- Referee: [Causal mediation analyses and CountScope results] The central claim that an internal counter mechanism (stored mainly in the final token/region) has been identified rests on repeated-item prompts; the abstract itself states that models rely on structural cues such as separators as shortcuts that strongly influence numerical predictions. No explicit test or ablation is described that rules out the possibility that CountScope and the mediation analyses are extracting these prompt-induced positional/separator signals rather than a general counting procedure.
  Authors: We agree that it is important to distinguish prompt artifacts from a general internal counting procedure. The causal mediation analyses intervene on activations within the final token while holding the input prompt (including separators) fixed, and demonstrate that these interventions causally alter the model's numerical output. CountScope similarly extracts and transfers count information from internal states across contexts. Nevertheless, we acknowledge that an explicit ablation varying or removing separators was not included. We will add this experiment in the revised manuscript, showing that the internal counter mechanism remains operative even when structural cues are minimized.
  Revision: yes
- Referee: [Layerwise analyses] Layerwise analyses claim progressive emergence (lower layers encode small counts, higher layers larger ones), yet the abstract and high-level description provide no quantitative metrics, error bars, statistical tests, or ablation controls to support the layer-specific encoding distinction or its load-bearing role in the counting mechanism.
  Authors: The quantitative support for progressive emergence, including per-layer accuracy curves, error bars from multiple random seeds, and statistical comparisons, appears in Section 4.2 and the associated figures. We will revise the abstract and introduction to explicitly report key metrics (e.g., the layer thresholds at which accuracy for count k exceeds baseline) and briefly summarize the ablation controls that establish the functional role of these layer-specific representations.
  Revision: yes
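The "layer threshold" metric mentioned in the second response can be made concrete with a short sketch. Everything here is an assumption for illustration: the accuracy values are invented, the chance baseline of 1/9 presumes nine equally likely counts, and the criterion (mean minus one standard error above baseline) is one reasonable choice rather than the paper's definition.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(acc_by_seed):
    """Per-layer (mean, standard error) from a list of per-seed accuracy lists."""
    n_layers = len(acc_by_seed[0])
    summary = []
    for layer in range(n_layers):
        vals = [run[layer] for run in acc_by_seed]
        sem = stdev(vals) / sqrt(len(vals)) if len(vals) > 1 else 0.0
        summary.append((mean(vals), sem))
    return summary


def threshold_layer(summary, baseline):
    """First layer whose mean minus one SEM exceeds the baseline, else None."""
    for layer, (m, sem) in enumerate(summary):
        if m - sem > baseline:
            return layer
    return None


# Invented probe accuracies: 3 seeds x 4 layers, for a small and a large count.
small_count = [[0.10, 0.60, 0.90, 0.95],
               [0.12, 0.55, 0.88, 0.96],
               [0.11, 0.58, 0.92, 0.94]]
large_count = [[0.10, 0.12, 0.40, 0.85],
               [0.11, 0.10, 0.45, 0.80],
               [0.12, 0.13, 0.42, 0.83]]
baseline = 1 / 9  # chance level if counts 1..9 were equally likely

small_layer = threshold_layer(summarize(small_count), baseline)
large_layer = threshold_layer(summarize(large_count), baseline)
print(small_layer, large_layer)
```

With these invented numbers the small count crosses the baseline one layer earlier than the large count, which is the pattern a progressive-emergence claim would predict.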
Circularity Check
No circularity: purely empirical mediation and observational analyses
The paper conducts behavioral experiments, observational analyses, and causal mediation interventions via the CountScope tool on repeated-item prompts in LLMs and LVLMs. No first-principles derivation, mathematical model, or predictive equation is presented whose output is shown to equal its inputs by construction. Layerwise emergence claims and internal-counter localization rest on direct measurements and interventions rather than any self-definitional loop, fitted-parameter renaming, or self-citation chain that would render the central result tautological. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Causal mediation analysis can isolate the causal contribution of specific internal representations to counting behavior.
Forward citations
Cited by 1 Pith paper
- CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models
  CounterCount shows VLMs perform well on factual counting images but degrade on counterfactual edits, revealing reliance on object priors, and introduces an attention reweighting method that improves accuracy by up to 8%.
[27]
How many objects are there in the image?
Task Details The textual dataset is built from simple lists of item names and short counting questions. Items are sampled uniformly from a fixed vocabulary of common fruits (apple, orange, peach, fig, mango, pear, coconut, cherry, plum). Lists range from length 1 to 9. We use four prompt configura- tions: monotypic lists, polytypic lists, list-first (also...
-
[28]
Behavioral Characterization of Counting We begin by quantifying the counting accuracy of LLMs and LVLMs across all experimental configurations. Table 6 reports the performance of two LLMs (Qwen2.5, Llama3) and two LVLMs (Qwen2.5-VL, InternVL3.5) on textual counting tasks across category types, ordering conditions, and question types. All models are of sim...
-
[29]
Causal Mediation Analysis Here, we provide additional details of the experiments con- ducted for causal mediation analysis. Table 10 reports the mean drop in the probability of the ground-truth count after offline zero patching of context and question (for LLMs) or image and prompt (for LVLMs). The results confirm that count-related information is primari...
-
[30]
Figure 10 shows PCA projections of input tokens, and Figure 11 presents PCA of generated responses
Layer-Wise Representational Analysis This section provides layer-wise visualizations of represen- tational structure for both LLMs and LVLMs. Figure 10 shows PCA projections of input tokens, and Figure 11 presents PCA of generated responses. Figure 12 reports the corresponding trajectories for the LVLM. Figure 13 shows cosine similarity patterns across la...
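The textual task construction described under Task Details can be sketched as follows. This is a minimal illustration, not the authors' generator: the function names, the comma separator, and the question wording are hypothetical, and only the monotypic and polytypic configurations are covered because the remaining ones are truncated in the excerpt.

```python
import random

# Vocabulary as listed in the Task Details excerpt.
FRUITS = ["apple", "orange", "peach", "fig", "mango",
          "pear", "coconut", "cherry", "plum"]


def make_list(length, polytypic, rng):
    """Sample a list of 1 to 9 items, uniformly from FRUITS.

    Monotypic: one item type repeated. Polytypic: each slot sampled
    independently (so repeats can still occur by chance).
    """
    if not 1 <= length <= 9:
        raise ValueError("length must be between 1 and 9")
    if polytypic:
        return [rng.choice(FRUITS) for _ in range(length)]
    return [rng.choice(FRUITS)] * length


def make_example(length, polytypic, rng):
    """Build one list-first prompt with its ground-truth count."""
    items = make_list(length, polytypic, rng)
    # Separator and question wording are hypothetical placeholders.
    prompt = ", ".join(items) + "\nHow many items are in the list?"
    return {"items": items, "prompt": prompt, "answer": length}


rng = random.Random(0)  # fixed seed for reproducibility
mono = make_example(5, polytypic=False, rng=rng)
poly = make_example(3, polytypic=True, rng=rng)
print(mono["prompt"])
print(poly["prompt"])
```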