pith. sign in

arxiv: 2605.30170 · v1 · pith:6HVVYEY3new · submitted 2026-05-28 · 💻 cs.MM · cs.CV· cs.LG

Unveiling the Visual Counting Bottleneck in Vision-Language Models

Pith reviewed 2026-06-28 23:29 UTC · model grok-4.3

classification 💻 cs.MM cs.CVcs.LG
keywords vision-language modelsvisual countingextrapolationsymbolic mappingmagnitude representationsGo boardsfractured magnitude hypothesis
0
0 comments X

The pith

Vision-language models fail at visual counting because they cannot map visual magnitudes onto number symbols.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper breaks visual counting into three stages—visual individuation, magnitude awareness, and symbolic mapping—to locate where extrapolation fails in VLMs. Linear probes on synthetic Go board images show that visual backbones keep linearly separable quantity signals even past the training range, and models can still compare quantities they cannot name. The collapse occurs at symbolic mapping, where visual magnitudes are not projected onto tokens. This pattern supports the claim that VLMs build separate statistical manifolds for numbers in each modality rather than one shared number space. The result matters because it indicates that simply scaling training data will not produce reliable counting outside the observed range.

Core claim

VLMs maintain robust, linearly separable representations of quantity in visual backbones well into the extrapolation regime and retain latent magnitude awareness that supports comparative reasoning, yet they fail to project those visual magnitudes onto symbolic tokens. The collapse at the symbolic mapping stage leads to the fractured magnitude hypothesis that VLMs acquire disjoint, modality-specific statistical manifolds instead of a universal number space, preventing cross-modal grounding for unseen quantities.

What carries the argument

The three-stage decomposition of visual counting (visual individuation, magnitude awareness, symbolic mapping) tested with linear probes on synthetic Go board images.

If this is right

  • Visual backbones preserve linearly separable quantity information beyond the training distribution.
  • Models succeed at comparative magnitude tasks on quantities they cannot enumerate.
  • Symbolic mapping is the specific stage that breaks cross-modal grounding for unseen counts.
  • Data scaling alone cannot produce a unified number space across modalities.
  • Inductive priors that enforce unified representations across vision and language are required for reliable extrapolation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mapping failure may limit performance on other tasks that require precise numerical output from visual input, such as estimating sizes or durations.
  • Training methods that force explicit alignment between visual magnitude signals and linguistic number tokens could be tested as a direct remedy.
  • The fractured manifold view predicts that VLMs will show similar modality-specific number representations in domains beyond counting, such as basic arithmetic on images.

Load-bearing premise

Linear probes on synthetic Go board images capture the same quantity representations the model uses when counting objects in natural images.

What would settle it

Measure whether improving only the symbolic mapping component on extrapolation examples raises counting accuracy on natural images that contain novel quantities.

Figures

Figures reproduced from arXiv: 2605.30170 by Junling Wang, Mrinmaya Sachan, Xingzhou Pang, Yifan Hou.

Figure 1
Figure 1. Figure 1: Deconstructing visual counting. We analyze the count￾ing bottleneck in VLMs via 3 cognitive stages: (1) visual individu￾ation, (2) magnitude awareness, and (3) symbolic mapping. and artificial intelligence (Fodor & Pylyshyn, 1988; Marcus, 2003; Lake & Baroni, 2018; Hupkes et al., 2020). On the one hand, while Large Vision-Language Models (VLMs) have demonstrated impressive proficiency in describing vi￾sual… view at source ↗
Figure 2
Figure 2. Figure 2: The Baseline Paradox: Linguistic competence does not ensure visual ability. While textual counting (bronze area) generalizes perfectly to N = 99, visual counting (red line) col￾lapses to near-zero immediately outside the training distribution (N > 49). This stark dissociation in the Visual Extrapolation (VE) regime demonstrates that the model fails to apply its known symbolic labels to unseen visual quanti… view at source ↗
Figure 4
Figure 4. Figure 4: Models can compare quantities they cannot name. In the binary magnitude comparison task, the model maintains high accuracy (> 90%) for visual inputs throughout the VE regime (50 − 99), despite having 0% accuracy in explicit enumeration. This proves the model possesses robust magnitude awareness; the failure is isolated strictly to the generative decoding stage. Experiment Results. As shown in [PITH_FULL_I… view at source ↗
Figure 5
Figure 5. Figure 5: Decoding failure manifests as mode collapse. Instead of random noise, prediction errors cluster tightly around specific “attractor” tokens: the visual training boundary (49), textual priors (90, 99), and hallucinations (9). This confirms that in the absence of a valid cross-modal path, the model abandons the counting algorithm and reverts to retrieving memorized statistical priors. 3.4.1. ANALYSIS I: ERROR… view at source ↗
Figure 8
Figure 8. Figure 8: Visual encoders retain rich signal. The Vision Gap (|NG − NH|) remains negligible (< 0.1), indicating the probe recovers the true count even when the model fails. In contrast, the Language Gap (|NP − NH|) diverges rapidly. This confirms that while real-world encoders introduce minor noise, the primary driver of failure remains the symbolic decoding bottleneck. → C) as our synthetic laboratory. 4.2. Hypothe… view at source ↗
Figure 9
Figure 9. Figure 9: Magnitude information persists deep into the model. Probing reveals that visual module (99.4%) and the first 40 lan￾guage module layers (> 99%) maintain precise quantity represen￾tations. The signal decay is restricted to the final layers, confirming that the failure is not due to “reasoning drift” or forgetting, but occurs strictly during the final projection to symbolic logits. perceptual errors not capt… view at source ↗
Figure 11
Figure 11. Figure 11: Black-stone patch embeddings causally drive counting via attention masking. On the In-Distribution set (baseline counting accuracy: 100%), we randomly sample k specific black stones in each image (k ∈ [1, 5]), and mask the corresponding tokens in the attention computation. We report Steering Accuracy, defined as the percentage of examples where the intervened prediction matches the new groundtruth N ′ G =… view at source ↗
Figure 12
Figure 12. Figure 12: Latent quantity representations are perfectly linearly separable. Linear probes trained on both the vision and language modules achieve 100% accuracy. This confirms that distinct magnitude information persists across modalities, ruling out representation collapse as the failure mechanism. Results. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Prompt template for text counting with synthetic VLM. Prompt Template for Vision Counting with Synthetic VLM Input: Question: How many black stones are there on the chessboard ? Answer: 41 [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Prompt template for vision counting with synthetic VLM. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Prompt template for text counting comparison with synthetic VLM. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Prompt template for vision counting comparison with synthetic VLM. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Prompt template for text counting with real VLM. Prompt Template for Vision Counting with Real VLM System Prompt: You are a precise counting engine. 1. Verification: Before answering, output a compact data check (e.g., list of indices for text, or row-by-row counts for images). Avoid conversational filler. 2. Format: Verification: <Dense Data> → FINAL ANSWER: <Result>. Input: Question: Analyze the 6x6 Go … view at source ↗
Figure 18
Figure 18. Figure 18: Prompt template for vision counting with real VLM. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_18.png] view at source ↗
read the original abstract

While Large Vision-Language Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting. In this work, we investigate this extrapolation bottleneck by deconstructing visual counting into three cognitive stages: visual individuation, magnitude awareness, and symbolic mapping. Using synthetic Go boards and linear probes, we demonstrate that visual backbones maintain robust, linearly separable representations of quantity well into the extrapolation regime, ruling out perceptual failure. Furthermore, models retain latent magnitude awareness, successfully performing comparative reasoning on quantities they fail to enumerate. We pinpoint the collapse to the symbolic mapping stage, where the model fails to project valid visual magnitudes onto symbolic tokens. Our findings support a frac tured magnitude hypothesis: VLMs fail to acquire a universal number space, instead learning disjoint, modality-specific statistical manifolds that prevent cross-modal grounding for unseen quantities. Validated on the state-of-the-art foundation model, our results suggest that bridging this gap requires inductive priors enforcing unified representations, as data scaling alone is insufficient.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that VLMs suffer from a visual counting extrapolation bottleneck that can be isolated to the symbolic mapping stage rather than visual individuation or magnitude awareness. Using linear probes on synthetic Go board images, the authors show that visual backbones retain linearly separable quantity representations even in extrapolation regimes and that models can perform comparative reasoning on quantities they cannot enumerate, supporting a 'fractured magnitude hypothesis' in which VLMs learn disjoint modality-specific statistical manifolds instead of a universal number space. The work concludes that data scaling alone is insufficient and that inductive priors for unified cross-modal representations are needed.

Significance. If the central empirical dissociation holds, the paper offers a useful stage-wise decomposition of counting failures in VLMs and a concrete methodological template (linear probes on controlled synthetic stimuli) for diagnosing where representational collapse occurs. The finding that magnitude awareness persists while symbolic projection fails is potentially actionable for future architecture or training interventions. The explicit contrast between interpolation success and extrapolation failure, together with the comparative-reasoning control, strengthens the diagnostic value of the results.

major comments (2)
  1. [Abstract / methods] Abstract and methods (linear-probe experiments on synthetic Go boards): the claim that perceptual failure is ruled out rests on the assumption that success of linear probes in this highly regular, non-occluded grid setting implies the same quantity representations are available and used for natural-image counting. The skeptic concern is load-bearing because the paper's real-world extrapolation claim requires this generalization; without additional controls (e.g., probes on natural scenes or ablation of positional cues), the attribution of collapse exclusively to symbolic mapping does not fully follow.
  2. [Results] Results on comparative reasoning versus enumeration: while the dissociation is interesting, the manuscript does not report whether the comparative-reasoning probes were performed on the same held-out extrapolation quantities used for the enumeration task, nor whether performance remains high when the visual backbone is frozen versus fine-tuned. This detail is needed to confirm that latent magnitude awareness truly survives in the regime where symbolic mapping fails.
minor comments (2)
  1. [Abstract] Abstract contains a typographical error: 'frac tured' should be 'fractured'.
  2. [Discussion] The term 'fractured magnitude hypothesis' is introduced without a formal definition or falsifiable prediction; a short paragraph clarifying its precise empirical content would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the constructive feedback, which helps clarify the scope and limitations of our findings. We provide point-by-point responses to the major comments below.

read point-by-point responses
  1. Referee: [Abstract / methods] Abstract and methods (linear-probe experiments on synthetic Go boards): the claim that perceptual failure is ruled out rests on the assumption that success of linear probes in this highly regular, non-occluded grid setting implies the same quantity representations are available and used for natural-image counting. The skeptic concern is load-bearing because the paper's real-world extrapolation claim requires this generalization; without additional controls (e.g., probes on natural scenes or ablation of positional cues), the attribution of collapse exclusively to symbolic mapping does not fully follow.

    Authors: We acknowledge that our experiments utilize controlled synthetic stimuli to isolate the cognitive stages of counting. This design choice enables precise manipulation of quantity while minimizing confounding factors present in natural images. While we agree that direct probes on natural scenes would provide stronger evidence for generalization, the success of linear probes in the extrapolation regime demonstrates that the visual backbone can extract quantity information even for unseen magnitudes in a setting where perceptual individuation is feasible. We will revise the manuscript to explicitly discuss this limitation and emphasize that our conclusions pertain to the controlled setting, with the implication that perceptual representations are intact under conditions where visual individuation succeeds. Additionally, we will include an ablation study on positional cues to address this concern. revision: partial

  2. Referee: [Results] Results on comparative reasoning versus enumeration: while the dissociation is interesting, the manuscript does not report whether the comparative-reasoning probes were performed on the same held-out extrapolation quantities used for the enumeration task, nor whether performance remains high when the visual backbone is frozen versus fine-tuned. This detail is needed to confirm that latent magnitude awareness truly survives in the regime where symbolic mapping fails.

    Authors: The comparative-reasoning experiments were indeed conducted on the same held-out extrapolation quantities as the enumeration task. Furthermore, all linear probes, including those for comparative reasoning, were performed with the visual backbone frozen to evaluate the pre-existing representations without any fine-tuning. We will update the methods and results sections to explicitly report these details, ensuring the dissociation is clearly tied to the extrapolation regime. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical probe results on synthetic data

full rationale

The paper presents an empirical investigation using linear probes on synthetic Go board images to localize failure to the symbolic mapping stage. All load-bearing claims rest on direct experimental measurements (probe accuracy in extrapolation regime, comparative reasoning success) rather than any derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, ansatzes, or uniqueness theorems appear; the fractured magnitude hypothesis is offered as an interpretation of the probe outcomes, not a mathematical reduction to inputs. This is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that linear probes extract the relevant quantity information and that the synthetic setup generalizes; the fractured magnitude hypothesis is introduced as an explanatory construct without external validation.

axioms (1)
  • domain assumption Linear probes on visual backbone activations can reliably extract and demonstrate the presence of quantity information even in extrapolation regimes.
    Used to conclude that perceptual failure is ruled out.
invented entities (1)
  • fractured magnitude hypothesis no independent evidence
    purpose: To explain why visual magnitudes are not projected onto symbolic tokens by positing disjoint modality-specific statistical manifolds.
    Proposed based on the observed probe successes and mapping failures; no independent falsifiable prediction outside the described experiments.

pith-pipeline@v0.9.1-grok · 5713 in / 1330 out tokens · 21052 ms · 2026-06-28T23:29:44.576027+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 24 canonical work pages · 7 internal anchors

  1. [1]

    Fine-grained analysis of sentence embeddings using auxiliary prediction tasks

    Adi, Y., Kermany, E., Belinkov, Y., Lavi, O., and Goldberg, Y. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings . OpenReview.net, 2017. URL https://openreview.net/forum?id=BJh6Ztuxl

  2. [2]

    and Bengio, Y

    Alain, G. and Bengio, Y. Understanding intermediate layers using linear classifier probes. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings . OpenReview.net, 2017. URL https://openreview.net/forum?id=HJ4-rAVtl

  3. [3]

    V., Slone, A., Gur - Ari, G., Dyer, E., and Neyshabur, B

    Anil, C., Wu, Y., Andreassen, A., Lewkowycz, A., Misra, V., Ramasesh, V. V., Slone, A., Gur - Ari, G., Dyer, E., and Neyshabur, B. Exploring length generalization in large language models. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Inf...

  4. [4]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  5. [5]

    Bender, E. M. and Koller, A. Climbing towards NLU: on meaning, form, and understanding in the age of data. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. R. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020 , pp.\ 5185--5198. Association for Computational Linguist...

  6. [6]

    Beyer, L., Steiner, A., Pinto, A. S., Kolesnikov, A., Wang, X., Salz, D., Neumann, M., Alabdulmohsin, I., Tschannen, M., Bugliarello, E., Unterthiner, T., Keysers, D., Koppula, S., Liu, F., Grycner, A., Gritsenko, A. A., Houlsby, N., Kumar, M., Rong, K., Eisenschlos, J., Kabra, R., Bauer, M., Bosnjak, M., Chen, X., Minderer, M., Voigtlaender, P., Bica, I....

  7. [7]

    The origin of concepts

    Carey, S. The origin of concepts. Journal of Cognition and Development, 1 0 (1): 0 37--41, 2000

  8. [8]

    The devil is in the detail: Simple tricks improve systematic generalization of transformers

    Csord \' a s, R., Irie, K., and Schmidhuber, J. The devil is in the detail: Simple tricks improve systematic generalization of transformers. In Moens, M., Huang, X., Specia, L., and Yih, S. W. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 Nov...

  9. [9]

    Was sind und was sollen die zahlen? In Was sind und was sollen die Zahlen?

    Dedekind, R. Was sind und was sollen die zahlen? In Was sind und was sollen die Zahlen?. Stetigkeit und Irrationale Zahlen, pp.\ 1--47. Springer, 1965

  10. [10]

    The number sense: How the mind creates mathematics

    Dehaene, S. The number sense: How the mind creates mathematics. OUP USA, 2011

  11. [11]

    K., Catt, E., Cundy, C., Hutter, M., Legg, S., Veness, J., and Ortega, P

    Del \' e tang, G., Ruoss, A., Grau - Moya, J., Genewein, T., Wenliang, L. K., Catt, E., Cundy, C., Hutter, M., Legg, S., Veness, J., and Ortega, P. A. Neural networks and the chomsky hierarchy. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net, 2023. URL https://openreview.net/f...

  12. [12]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 ...

  13. [13]

    O ., Abdulkadir, A., Marrakchi, Y., B \

    Falk, T., Mai, D., Bensch, R., C i c ek, \"O ., Abdulkadir, A., Marrakchi, Y., B \"o hm, A., Deubner, J., J \"a ckel, Z., Seiwald, K., et al. U-net: deep learning for cell counting, detection, and morphometry. Nature methods, 16 0 (1): 0 67--70, 2019

  14. [14]

    Fodor, J. A. and Pylyshyn, Z. W. Connectionism and cognitive architecture: A critical analysis. Cognition, 28 0 (1-2): 0 3--71, 1988

  15. [15]

    Vbench: Comprehensive benchmark suite for video generative models

    Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., Manocha, D., and Zhou, T. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, Jun...

  16. [16]

    Your vision-language model can't even count to 20: Exposing the failures of vlms in compositional counting

    Guo, X., Huang, Z., Shi, Z., Song, Z., and Zhang, J. Your vision-language model can't even count to 20: Exposing the failures of vlms in compositional counting. CoRR, abs/2510.04401, 2025. doi:10.48550/ARXIV.2510.04401. URL https://doi.org/10.48550/arXiv.2510.04401

  17. [17]

    Finding neurons in a haystack: Case studies with sparse probing

    Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troitskii, D., and Bertsimas, D. Finding neurons in a haystack: Case studies with sparse probing. Trans. Mach. Learn. Res., 2023, 2023. URL https://openreview.net/forum?id=JYs1R9IMJr

  18. [18]

    Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D

    Hewitt, J. and Liang, P. Designing and interpreting probes with control tasks. In Inui, K., Jiang, J., Ng, V., and Wan, X. (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019 , pp.\ 2...

  19. [19]

    Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality

    Hsieh, C., Zhang, J., Ma, Z., Kembhavi, A., and Krishna, R. Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, Ne...

  20. [20]

    Revisiting Multimodal Positional Encoding in Vision-Language Models

    Huang, J., Liu, X., Song, S., Hou, R., Chang, H., Lin, J., and Bai, S. Revisiting multimodal positional encoding in vision-language models. CoRR, abs/2510.23095, 2025. doi:10.48550/ARXIV.2510.23095. URL https://doi.org/10.48550/arXiv.2510.23095

  21. [21]

    Compositionality decomposed: How do neural networks generalise? J

    Hupkes, D., Dankers, V., Mul, M., and Bruni, E. Compositionality decomposed: How do neural networks generalise? J. Artif. Intell. Res., 67: 0 757--795, 2020. doi:10.1613/JAIR.1.11674. URL https://doi.org/10.1613/jair.1.11674

  22. [22]

    Clip-count: Towards text-guided zero-shot object counting

    Jiang, R., Liu, L., and Chen, C. Clip-count: Towards text-guided zero-shot object counting. In El - Saddik, A., Mei, T., Cucchiara, R., Bertini, M., Vallejo, D. P. T., Atrey, P. K., and Hossain, M. S. (eds.), Proceedings of the 31st ACM International Conference on Multimedia, MM 2023, Ottawa, ON, Canada, 29 October 2023- 3 November 2023 , pp.\ 4535--4545....

  23. [23]

    N., Das, P., and Reddy, S

    Kazemnejad, A., Padhi, I., Ramamurthy, K. N., Das, P., and Reddy, S. The impact of positional encoding on length generalization in transformers. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS ...

  24. [24]

    Lake, B. M. and Baroni, M. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsm \" a ssan, Stockholm, Sweden, July 10-15, 2018 , volume 80 of Proceedings of Machine Learni...

  25. [25]

    Lempitsky, V. S. and Zisserman, A. Learning to count objects in images. In Lafferty, J. D., Williams, C. K. I., Shawe - Taylor, J., Zemel, R. S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010, Vancouver, Bri...

  26. [26]

    X., and Wen, J

    Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., and Wen, J. Evaluating object hallucination in large vision-language models. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023 , pp.\ 292--305. Association for Computational Lingui...

  27. [27]

    Marcus, G. F. The algebraic mind: Integrating connectionism and cognitive science. MIT press, 2003

  28. [28]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023. doi:10.48550/ARXIV.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774

  29. [29]

    Bokhovkin, S

    Paiss, R., Ephrat, A., Tov, O., Zada, S., Mosseri, I., Irani, M., and Dekel, T. Teaching CLIP to count to ten. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023 , pp.\ 3147--3157. IEEE , 2023. doi:10.1109/ICCV51070.2023.00294. URL https://doi.org/10.1109/ICCV51070.2023.00294

  30. [30]

    T., Tenenbaum, J

    Piantadosi, S. T., Tenenbaum, J. B., and Goodman, N. D. Bootstrapping in a language of thought: A formal model of numerical concept learning. Cognition, 123 0 (2): 0 199--217, 2012

  31. [31]

    A., and Lewis, M

    Press, O., Smith, N. A., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 . OpenReview.net, 2022. URL https://openreview.net/forum?id=R8sQPpGCv0

  32. [32]

    Language models are unsupervised multitask learners

    Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019

  33. [33]

    R., and Nguyen, A

    Rahmanzadehgervi, P., Bolton, L., Taesiri, M. R., and Nguyen, A. T. Vision language models are blind. In Cho, M., Laptev, I., Tran, D., Yao, A., and Zha, H. (eds.), Computer Vision - ACCV 2024 - 17th Asian Conference on Computer Vision, Hanoi, Vietnam, December 8-12, 2024, Proceedings, Part V , volume 15476 of Lecture Notes in Computer Science, pp.\ 293--...

  34. [34]

    U-net: Convolutional networks for biomedical image segmentation

    Ronneberger, O., Fischer, P., and Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Navab, N., Hornegger, J., III, W. M. W., and Frangi, A. F. (eds.), Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015 - 18th International Conference Munich, Germany, October 5 - 9, 2015, Proceedings, Part III , volume 9351 ...

  35. [35]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G. Gemini: A family of highly capable multimodal models. CoRR, abs/2312.11805, 2023. doi:10.48550/ARXIV.2312.11805. URL https://doi.org/10.48550/arXiv.2312.11805

  36. [36]

    Gemma 3 Technical Report

    Team, G. Gemma 3 technical report. CoRR, abs/2503.19786, 2025. doi:10.48550/ARXIV.2503.19786. URL https://doi.org/10.48550/arXiv.2503.19786

  37. [37]

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A

    Tenney, I., Das, D., and Pavlick, E. BERT rediscovers the classical NLP pipeline. In Korhonen, A., Traum, D. R., and M \` a rquez, L. (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers , pp.\ 4593--4601. Association for Computational Linguis...

  38. [39]

    Otaduy, and Dan Casas

    Thrush, T., Jiang, R., Bartolo, M., Singh, A., Williams, A., Kiela, D., and Ross, C. Winoground: Probing vision and language models for visio-linguistic compositionality. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022 , pp.\ 5228--5238. IEEE , 2022 b . doi:10.1109/CVPR52688.2022.00517. ...

  39. [40]

    Do NLP models know numbers? probing numeracy in embeddings

    Wallace, E., Wang, Y., Li, S., Singh, S., and Gardner, M. Do NLP models know numbers? probing numeracy in embeddings. In Inui, K., Jiang, J., Ng, V., and Wan, X. (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong K...

  40. [41]

    H., Le, Q

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Proces...

  41. [42]

    Black, and Otmar Hilliges

    Xu, J., Le, H., Nguyen, V., Ranjan, V., and Samaras, D. Zero-shot object counting. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023 , pp.\ 15548--15557. IEEE , 2023. doi:10.1109/CVPR52729.2023.01492. URL https://doi.org/10.1109/CVPR52729.2023.01492

  42. [43]

    u ksekg \

    Y \" u ksekg \" o n \" u l, M., Bianchi, F., Kalluri, P., Jurafsky, D., and Zou, J. When and why vision-language models behave like bags-of-words, and what to do about it? In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net, 2023 a . URL https://openreview.net/forum?id=KRLUvxh8uaX

  43. [44]

    u ksekg \

    Y \" u ksekg \" o n \" u l, M., Bianchi, F., Kalluri, P., Jurafsky, D., and Zou, J. When and why vision-language models behave like bags-of-words, and what to do about it? In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net, 2023 b . URL https://openreview.net/forum?id=KRLUvxh8uaX

  44. [45]

    D., Tigges, C., Zhang, Z., Biderman, S., Raginsky, M., and Ringer, T

    Zhang, S. D., Tigges, C., Zhang, Z., Biderman, S., Raginsky, M., and Ringer, T. Transformer-based models are not yet perfect at learning to emulate structural recursion. Trans. Mach. Learn. Res., 2024, 2024. URL https://openreview.net/forum?id=Ry5CXXm1sf

  45. [46]

    Single-image crowd counting via multi-column convolutional neural network

    Zhang, Y., Zhou, D., Chen, S., Gao, S., and Ma, Y. Single-image crowd counting via multi-column convolutional neural network. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016 , pp.\ 589--597. IEEE Computer Society, 2016. doi:10.1109/CVPR.2016.70. URL https://doi.org/10.1109/CVPR.2016.70

  46. [47]

    Multimodal Chain-of-Thought Reasoning in Language Models

    Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., and Smola, A. Multimodal chain-of-thought reasoning in language models. CoRR, abs/2302.00923, 2023. doi:10.48550/ARXIV.2302.00923. URL https://doi.org/10.48550/arXiv.2302.00923

  47. [48]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...