pith. machine review for the scientific record. sign in

arxiv: 2605.10887 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Count Anything at Any Granularity

Authors on Pith no claims yet

Pith reviewed 2026-05-12 04:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords open-world object countingmulti-grained countingvisual exemplarsfine-grained text promptssynthetic dataset generationKubriCountHieraCountgranularity levels
0
0 comments X

The pith

Open-world counting becomes reliable once granularity is made explicit across five levels using text and visual exemplars together.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models and counting networks still fail when users want to count something more specific than a broad category, such as only red cars or only instances of a particular pose. The paper establishes that treating counting as a multi-grained problem, with visual examples fixing the appearance and text fixing one of five semantic levels, removes this ambiguity. To make the claim testable, the authors built KubriCount, a large dataset created automatically by controllable 3D rendering followed by VLM filtering, and trained HieraCount to process both modalities jointly. If the claim holds, counting systems can finally follow user intent at the level of detail the user actually means rather than defaulting to coarse category matches. A reader would care because precise, intent-aligned counting is required in monitoring, ecology, retail, and many other real-world settings where the same scene can be counted in multiple valid ways.

Core claim

The paper claims that open-world object counting remains brittle because granularity is left implicit, and that redefining the task as multi-grained counting—with visual exemplars specifying target appearance and fine-grained text (plus optional negatives) specifying one of five explicit semantic levels—directly addresses the failure. This redefinition is made practical by an automatic data pipeline that combines controllable 3D synthesis, consistent image editing, and VLM-based filtering to produce KubriCount, the largest counting dataset with instance-level and multi-category annotations. Systematic benchmarks show that both large multimodal models and specialist counters exhibit severe l1

What carries the argument

HieraCount, a model that jointly ingests text prompts for one of five granularity levels and visual exemplars for target appearance to output counts that respect both signals.

If this is right

  • Existing multimodal and specialist models fail to follow prompts once granularity is made finer than category level.
  • The KubriCount dataset supplies both training data and a standardized multi-grained evaluation benchmark.
  • HieraCount achieves higher multi-grained accuracy than prior methods by using text and visual exemplars as complementary signals.
  • The resulting model generalizes to real-world scenes that contain distractors and varied viewpoints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same explicit-granularity design could be applied to open-world detection or segmentation to reduce prompt ambiguity in those tasks.
  • Extending the five-level scheme to continuous or user-defined granularity might remove the need for a fixed taxonomy.
  • Video or 3D scene counting could adopt the same joint text-visual mechanism to handle temporal or spatial granularity.

Load-bearing premise

The five fixed granularity levels together with the synthetic 3D-plus-VLM pipeline are sufficient to represent the full range of user intents and to produce annotations that match real-world fine-grained distinctions without major bias or artifacts.

What would settle it

Measure whether HieraCount maintains its reported accuracy gain when tested on a held-out collection of real photographs that have been manually annotated with counting intents whose granularity falls outside the five predefined levels or that contain natural variations absent from the 3D synthesis.

Figures

Figures reproduced from arXiv: 2605.10887 by Chang Liu, Haoning Wu, Weidi Xie.

Figure 1
Figure 1. Figure 1: Multi-grained counting benchmark (KubriCount) and model evalu [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Multi-grained counting levels and semantic hierarchy. Left: [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of our automatic counting data scaling pipeline. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: KubriCount statistics. Category and count distributions show broad, bal￾anced coverage for multi-grained evaluation. Data distributions [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative analysis. Representative failure cases of prior models under multi-grained prompts, contrasted with HieraCount under the same prompts. aCount substantially improves over CountGD and achieves state-of-the-art per￾formance under positive-only prompting. These results suggest that our pipeline effectively mitigates the sim-to-real gap, and that multi-grained supervision with controlled distractors… view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative visualizations for Level 1. Level 1 corresponds to identity￾level counting, where each image contains only one object category and the task is to count all instances in the scene. Each example shows the scene and its corresponding counting prompt and GT answer [PITH_FULL_IMAGE:figures/full_fig_p023_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative visualizations for Level 2 (size mode). [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative visualizations for Level 2 (color mode). [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative visualizations for Level 3. Level 3 corresponds to category￾level counting, where each image contains two different categories and the task is to count the target category while ignoring the distractor category. Each example shows the scene and its corresponding counting prompt and GT answer [PITH_FULL_IMAGE:figures/full_fig_p026_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Qualitative visualizations for Level 4. Level 4 corresponds to instance￾level counting, where each image contains two different instance types within the same category and the task is to distinguish and count only the target type. Each example shows the scene and its corresponding counting prompt and GT answer [PITH_FULL_IMAGE:figures/full_fig_p027_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Qualitative visualizations for Level 5. Level 5 corresponds to concept￾level counting, where each image contains two categories with larger intra-category variation, requiring the model to count the target category under more diverse and challenging distractor settings. Each example shows the scene and its corresponding counting prompt and GT answer [PITH_FULL_IMAGE:figures/full_fig_p028_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Category distribution re-balance from 3D assets to generated im [PITH_FULL_IMAGE:figures/full_fig_p038_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Prediction-versus-ground-truth scatter plots on KubriCount. [PITH_FULL_IMAGE:figures/full_fig_p044_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Additional qualitative visualizations on KubriCount. [PITH_FULL_IMAGE:figures/full_fig_p045_14.png] view at source ↗
read the original abstract

Open-world object counting remains brittle: despite rapid advances in vision-language models (VLMs), reliably counting the objects a user intends is far from solved. We argue that a central reason is that counting granularity is left implicit; users may refer to a specific identity, an attribute, an instance type, a category, or an abstract concept, yet most methods treat "what to count" as a single, category-level matching problem. In this work, we redefine open-world counting as multi-grained counting, where visual exemplars specify target appearance and fine-grained text, with optional negative prompts, specifies the intended semantic granularity across five explicit levels. Making granularity explicit, however, exposes a critical data bottleneck: existing counting datasets lack the multi-category scenes, controlled distractors, and instance-level annotations needed to verify fine-grained prompt semantics. To address this, we propose the first fully automatic data-scaling pipeline that integrates controllable 3D synthesis with consistent image editing and VLM-based filtering, and use it to construct KubriCount, the largest and most comprehensively annotated counting dataset to date, supporting both training and multi-grained evaluation. Systematic benchmarking reveals that both multimodal large language models and specialist counting models exhibit severe prompt-following failures under fine-grained distinctions. Motivated by these findings, we train HieraCount, a multi-grained counting model that jointly leverages text and visual exemplars as complementary target specifications. HieraCount substantially improves multi-grained counting accuracy and generalizes robustly to challenging real-world scenarios. The project page is available here: https://verg-avesta.github.io/KubriCount/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper argues that open-world object counting is limited by implicit granularity in user prompts and redefines the task as multi-grained counting across five explicit levels (identity, attribute, instance type, category, abstract concept) specified via visual exemplars plus fine-grained text (with optional negatives). It introduces an automatic pipeline combining controllable 3D synthesis, consistent image editing, and VLM filtering to build KubriCount—the largest counting dataset with multi-category scenes, distractors, and instance-level annotations for both training and evaluation. Systematic benchmarks show severe prompt-following failures in existing VLMs and specialist counters; the authors then train HieraCount, which jointly uses text and visual exemplars, claiming substantial accuracy gains and robust generalization to real-world scenarios.

Significance. If the central claims hold, this work would be significant for the computer vision community by shifting open-world counting from category-level matching to explicit multi-grained semantics, directly addressing a practical failure mode in VLMs. The fully automatic data-scaling pipeline is a clear strength, enabling the largest and most comprehensively annotated counting dataset to date and supporting reproducible large-scale training. The systematic demonstration of prompt-following failures across model classes provides a useful diagnostic, while HieraCount's complementary use of text and exemplars offers a concrete architectural direction for future models.

major comments (3)
  1. [§4] §4 (KubriCount construction): the VLM-based filtering step lacks any reported quantitative validation (e.g., agreement with human annotators, fraction of discarded edge cases, or bias analysis across the five granularity levels), which is load-bearing for the claim that KubriCount is representative of real-world fine-grained user intents and therefore for all downstream benchmarking and generalization results.
  2. [§5] §5 (Experiments and benchmarking): the reported improvements for HieraCount over baselines are presented without error bars, confidence intervals, or details on the number of evaluation runs or statistical tests, making it impossible to assess whether the claimed substantial accuracy gains are reliable or could be explained by variance in the synthetic test distribution.
  3. [§5.3] §5.3 (real-world generalization): the evaluation on challenging real-world scenarios does not specify how the test images were selected, whether they cover all five granularity levels uniformly, or how distribution shift from the 3D-synthetic training data was measured, undermining the robustness claim.
minor comments (2)
  1. [§3] The exact operational definitions and decision boundaries between the five granularity levels (especially attribute vs. instance type and category vs. abstract concept) should be formalized with examples in §3 to reduce reader ambiguity.
  2. Figure captions for KubriCount examples and failure cases should explicitly label which granularity level each prompt targets and note any negative prompts used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of rigor in dataset validation, statistical reporting, and evaluation transparency. We have revised the manuscript to incorporate additional analyses and clarifications for each point, as detailed below.

read point-by-point responses
  1. Referee: [§4] §4 (KubriCount construction): the VLM-based filtering step lacks any reported quantitative validation (e.g., agreement with human annotators, fraction of discarded edge cases, or bias analysis across the five granularity levels), which is load-bearing for the claim that KubriCount is representative of real-world fine-grained user intents and therefore for all downstream benchmarking and generalization results.

    Authors: We agree that quantitative validation of the VLM filtering step is necessary to support claims about dataset quality and representativeness. In the revised manuscript, we have added a new paragraph in §4 describing a human validation study performed on a stratified sample of 1,000 images (200 per granularity level). This includes inter-rater agreement metrics (average Cohen's κ = 0.79 between VLM outputs and two human annotators), the overall discard rate (11.4% of candidates filtered out), and a per-level breakdown showing no statistically significant bias in filtering decisions. These results are now reported alongside the pipeline description to substantiate the dataset's alignment with real-world fine-grained intents. revision: yes

  2. Referee: [§5] §5 (Experiments and benchmarking): the reported improvements for HieraCount over baselines are presented without error bars, confidence intervals, or details on the number of evaluation runs or statistical tests, making it impossible to assess whether the claimed substantial accuracy gains are reliable or could be explained by variance in the synthetic test distribution.

    Authors: We concur that the absence of statistical details limits interpretability of the gains. The revised §5 now reports results averaged over five independent evaluation runs with distinct random seeds for test-set sampling and model training. We include standard deviation error bars, 95% confidence intervals, and p-values from paired t-tests (all key improvements p < 0.01). These additions demonstrate that the accuracy gains are statistically reliable and exceed what would be expected from variance alone in the synthetic distribution. revision: yes

  3. Referee: [§5.3] §5.3 (real-world generalization): the evaluation on challenging real-world scenarios does not specify how the test images were selected, whether they cover all five granularity levels uniformly, or how distribution shift from the 3D-synthetic training data was measured, undermining the robustness claim.

    Authors: We have expanded §5.3 with the requested details. Real-world test images were curated from a combination of public benchmarks (e.g., FSC-147, COCO) and web-sourced scenes, with explicit stratification to ensure approximately equal representation across the five granularity levels (roughly 20% each). Distribution shift is now quantified via average cosine distance in CLIP ViT-L/14 embedding space (0.28 between synthetic training and real test sets) together with t-SNE visualizations of feature distributions. These clarifications support the robustness claims while preserving the original evaluation protocol. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core argument proceeds by redefining open-world counting as multi-grained counting across five explicit levels, constructing a new synthetic dataset KubriCount via 3D synthesis plus VLM filtering to address an identified data gap, benchmarking existing models on that dataset to reveal prompt-following failures, and training HieraCount to demonstrate empirical gains. None of these steps invoke self-definitional reductions, fitted parameters renamed as predictions, load-bearing self-citations, uniqueness theorems imported from prior author work, smuggled ansatzes, or renamings of known results; the claims rest on newly generated data and model outputs rather than reducing to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the work introduces no explicit free parameters, invented entities, or additional axioms beyond standard assumptions in vision-language modeling and controllable synthesis; the five-level granularity is treated as a domain assumption.

axioms (1)
  • domain assumption Five explicit levels of granularity cover the range of user intents for open-world counting
    Invoked when redefining counting as multi-grained and specifying semantic distinctions via text and exemplars.

pith-pipeline@v0.9.0 · 5580 in / 1405 out tokens · 70734 ms · 2026-05-12T04:15:16.116453+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · 13 internal anchors

  1. [1]

    In: Proceedings of the AAAI Conference on Artificial Intelligence (2019) 4, 10

    Acharya,M.,Kafle,K.,Kanan,C.:Tallyqa:Answeringcomplexcountingquestions. In: Proceedings of the AAAI Conference on Artificial Intelligence (2019) 4, 10

  2. [2]

    In: Proceedings of the British Machine Vision Conference (2023) 2, 4

    Amini-Naieni, N., Amini-Naieni, K., Han, T., Zisserman, A.: Open-world text- specified object counting. In: Proceedings of the British Machine Vision Conference (2023) 2, 4

  3. [3]

    In: Conference on Neural Information Processing Systems (2024) 2, 4, 6, 7, 11, 13, 14, 43

    Amini-Naieni, N., Han, T., Zisserman, A.: Countgd: Multi-modal open-world counting. In: Conference on Neural Information Processing Systems (2024) 2, 4, 6, 7, 11, 13, 14, 43

  4. [4]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2025) 4, 11, 13, 14

    Amini-Naieni, N., Zisserman, A.: Countgd++: Generalized prompting for open- world counting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2025) 4, 11, 13, 14

  5. [5]

    Anthropic: Claude haiku 4.5 system card (2025),https://www.anthropic.com/ claude-haiku-4-5-system-card511, 13

  6. [6]

    Anthropic: Claude opus 4.5 system card (2025),https://www.anthropic.com/ claude-opus-4-5-system-card11, 13

  7. [7]

    Anthropic: System card: Claude sonnet 4.5 (2025),https://www.anthropic.com/ claude-sonnet-4-5-system-card11, 13

  8. [8]

    In: Proceedings of the European Conference on Computer Vision (2014) 3

    Arteta, C., Lempitsky, V., Noble, J.A., Zisserman, A.: Interactive object counting. In: Proceedings of the European Conference on Computer Vision (2014) 3

  9. [9]

    In: Proceedings of the European Conference on Computer Vision (2016) 2, 3

    Arteta, C., Lempitsky, V., Zisserman, A.: Counting in the wild. In: Proceedings of the European Conference on Computer Vision (2016) 2, 3

  10. [10]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025) 1, 4, 11, 13

  11. [11]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 1, 4, 11, 13

  12. [12]

    IEEE Transactions on Pattern Analysis and Machine In- telligence (2012) 3

    Barinova, O., Lempitsky, V., Kholi, P.: On detection of multiple object instances using hough transforms. IEEE Transactions on Pattern Analysis and Machine In- telligence (2012) 3

  13. [13]

    In:ProceedingsoftheInternationalConferenceonLearningRepresentations(2026) 1

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala,K.V.,Khedr,H.,Huang,A.,etal.:Sam3:Segmentanythingwithconcepts. In:ProceedingsoftheInternationalConferenceonLearningRepresentations(2026) 1

  14. [14]

    ShapeNet: An Information-Rich 3D Model Repository

    Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015) 8

  15. [15]

    In: Proceedings of the British Machine Vision Conference (2022) 2, 4, 11, 13

    Chang, L., Yujie, Z., Andrew, Z., Weidi, X.: Countr: Transformer-based generalised visual counting. In: Proceedings of the British Machine Vision Conference (2022) 2, 4, 11, 13

  16. [16]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2024) 13

    Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., Xia, F.: Spa- tialvlm: Endowing vision-language models with spatial reasoning capabilities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2024) 13

  17. [17]

    ACM Transactions on Graphics (TOG) (2022) 8

    Chen, Z., Wang, G., Liu, Z.: Text2light: Zero-shot text-driven hdr panorama gen- eration. ACM Transactions on Graphics (TOG) (2022) 8

  18. [18]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal Abbreviated paper title 17 models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024) 1, 11, 13

  19. [19]

    In: Win- ter Conference on Applications of Computer Vision (2025) 4

    Ciampi, L., Messina, N., Pierucci, M., Amato, G., Avvenuti, M., Falchi, F.: Mind the prompt: A novel benchmark for prompt-based class-agnostic counting. In: Win- ter Conference on Applications of Computer Vision (2025) 4

  20. [20]

    Molmo2: Open weights and data for vision-language models with video understanding and grounding, 2026

    Clark, C., Zhang, J., Ma, Z., Park, J.S., Salehi, M., Tripathi, R., Lee, S., Ren, Z., Kim, C.D., Yang, Y., et al.: Molmo2: Open weights and data for vision-language models with video understanding and grounding. arXiv preprint arXiv:2601.10611 (2026) 11, 13

  21. [21]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025) 4, 11, 13

  22. [22]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2024) 4, 10

    Dai, S., Liu, J., Cheung, N.M.: Referring expression counting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2024) 4, 10

  23. [23]

    In: Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition (2025) 4, 8, 10, 11, 12, 13, 29

    Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y., Park, J.S., Salehi, M., Muen- nighoff, N., Lo, K., Soldaini, L., et al.: Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. In: Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition (2025) 4, 8, 10, 11, 12, 13, 29

  24. [24]

    Conference on Neural Information Processing Systems (2023) 8

    Deitke, M., Liu, R., Wallingford, M., Ngo, H., Michel, O., Kusupati, A., Fan, A., Laforte, C., Voleti, V., Gadre, S.Y., et al.: Objaverse-xl: A universe of 10m+ 3d objects. Conference on Neural Information Processing Systems (2023) 8

  25. [25]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2023) 8

    Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of an- notated 3d objects. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2023) 8

  26. [26]

    International Journal of Computer Vision (2011) 3

    Desai, C., Ramanan, D., Fowlkes, C.C.: Discriminative models for multi-class ob- ject layout. International Journal of Computer Vision (2011) 3

  27. [27]

    In: IEEE International Conference on Robotics and Automation (2022) 8

    Downs, L., Francis, A., Koenig, N., Kinman, B., Hickman, R., Reymann, K., McHugh, T.B., Vanhoucke, V.: Google scanned objects: A high-quality dataset of 3d scanned household items. In: IEEE International Conference on Robotics and Automation (2022) 8

  28. [28]

    In: Proceedings of the European Conference on Computer Vision (2022) 4

    Gong, S., Zhang, S., Yang, J., Dai, D., Schiele, B.: Class-agnostic object count- ing robust to intraclass diversity. In: Proceedings of the European Conference on Computer Vision (2022) 4

  29. [29]

    googleblog

    Google: Introducing gemini 2.5 flash image, our state-of-the-art image model (2025),https : / / developers . googleblog . com / introducing - gemini - 2 - 5 - flash-image/8, 29

  30. [30]

    Google: Introducing nano banana pro (2025),https://blog.google/innovation- and-ai/products/nano-banana-pro/3, 9

  31. [31]

    Google: A new era of intelligence with gemini 3 (2025),https://blog.google/ products/gemini/gemini-3/3, 4, 8, 9, 11, 12, 13, 29, 36

  32. [32]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Let- man, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024) 11, 12, 13

  33. [33]

    In: Proceedings of the IEEE Conference on Computer Vision and Pat- tern Recognition (2022) 3, 8 18 C

    Greff, K., Belletti, F., Beyer, L., Doersch, C., Du, Y., Duckworth, D., Fleet, D.J., Gnanapragasam, D., Golemo, F., Herrmann, C., et al.: Kubric: A scalable dataset generator. In: Proceedings of the IEEE Conference on Computer Vision and Pat- tern Recognition (2022) 3, 8 18 C. Liu et al

  34. [34]

    In: Proceedings of the International Conference on Computer Vision (2017) 3

    He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the International Conference on Computer Vision (2017) 3

  35. [35]

    In: Proceedings of the International Conference on Computer Vision (2017) 3, 4, 10

    Hsieh, M.R., Lin, Y.L., Hsu, W.H.: Drone-based object counting by spatially regu- larized regional proposal network. In: Proceedings of the International Conference on Computer Vision (2017) 3, 4, 10

  36. [36]

    GPT-4o System Card

    Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Os- trow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024) 11, 13

  37. [37]

    In: Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (2013) 10

    Idrees, H., Saleemi, I., Seibert, C., Shah, M.: Multi-source multi-scale counting in extremely dense crowd images. In: Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (2013) 10

  38. [38]

    In: Proceedings of the European Conference on Computer Vision (2018) 10

    Idrees, H., Tayyab, M., Athrey, K., Zhang, D., Al-Maadeed, S., Rajpoot, N., Shah, M.:Compositionlossforcounting,densitymapestimationandlocalizationindense crowds. In: Proceedings of the European Conference on Computer Vision (2018) 10

  39. [39]

    arXiv preprint arXiv:2510.12798 (2025) 10 M

    Jiang, Q., Huo, J., Chen, X., Xiong, Y., Zeng, Z., Chen, Y., Ren, T., Yu, J., Zhang, L.: Detect anything via next point prediction. arXiv preprint arXiv:2510.12798 (2025) 1, 11, 13

  40. [40]

    In: Proceedings of the European Conference on Computer Vision (2024) 1

    Jiang, Q., Li, F., Zeng, Z., Ren, T., Liu, S., Zhang, L.: T-rex2: Towards generic object detection via text-visual prompt synergy. In: Proceedings of the European Conference on Computer Vision (2024) 1

  41. [41]

    In: ACM Multimedia (2023) 4

    Jiang, R., Liu, L., Chen, C.: Clip-count: Towards text-guided zero-shot object counting. In: ACM Multimedia (2023) 4

  42. [42]

    In: Proceedings of the AAAI Conference on Artificial Intelligence (2024) 2, 4

    Kang, S., Moon, W., Kim, E., Heo, J.P.: Vlcounter: Text-aware visual represen- tation for zero-shot object counting. In: Proceedings of the AAAI Conference on Artificial Intelligence (2024) 2, 4

  43. [43]

    In: Proceedings of the International Conference on Computer Vision (2023) 1

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the International Conference on Computer Vision (2023) 1

  44. [44]

    In: Proceedings of the International Conference on Pattern Recognition (2006) 3

    Kong, D., Gray, D., Tao, H.: A viewpoint invariant approach for crowd counting. In: Proceedings of the International Conference on Pattern Recognition (2006) 3

  45. [45]

    The American Journal of Psychology (1949) 1

    L., K.E., W., L.M., W., R.T., J., V.: The discrimination of visual number. The American Journal of Psychology (1949) 1

  46. [46]

    In: Conference on Neural Information Processing Systems (2010) 3

    Lempitsky, V., Zisserman, A.: Learning to count objects in images. In: Conference on Neural Information Processing Systems (2010) 3

  47. [47]

    LLaVA-OneVision: Easy Visual Task Transfer

    Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024) 11, 13

  48. [48]

    In: Proceedings of the International Conference on Computer Vision (2017) 3

    Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the International Conference on Computer Vision (2017) 3

  49. [49]

    In: Proceedings of the British Machine Vision Conference (2022) 4

    Lin, W., Yang, K., Ma, X., Gao, J., Liu, L., Liu, S., Hou, J., Yi, S., Chan, A.B.: Scale-prior deformable convolution for exemplar-guided class-agnostic counting. In: Proceedings of the British Machine Vision Conference (2022) 4

  50. [50]

    In: Conference on Neural Information Processing Systems (2023) 11, 12, 13

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Conference on Neural Information Processing Systems (2023) 11, 12, 13

  51. [51]

    In: Proceedings of the European Conference on Computer Vision (2024) 1, 4, 6 Abbreviated paper title 19

    Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: Proceedings of the European Conference on Computer Vision (2024) 1, 4, 6 Abbreviated paper title 19

  52. [52]

    In: Asian Conference on Computer Vision (2018) 2, 4

    Lu, E., Xie, W., Zisserman, A.: Class-agnostic counting. In: Asian Conference on Computer Vision (2018) 2, 4

  53. [53]

    In: IEE Colloquium on Image Processing for Security Ap- plications (1997) 3

    Marana, A.N., Velastín, S.A., Costa, L., Lotufo, R.: Estimation of crowd density using image processing. In: IEE Colloquium on Image Processing for Security Ap- plications (1997) 3

  54. [54]

    In: Proceedings of the AAAI Conference on Artificial Intelligence (2025) 3, 4, 10

    Mondal, A., Nag, S., Zhu, X., Dutta, A.: Omnicount: Multi-label object count- ing with semantic-geometric priors. In: Proceedings of the AAAI Conference on Artificial Intelligence (2025) 3, 4, 10

  55. [55]

    In: Proceedings of the European Conference on Computer Vision (2016) 2

    Mundhenk, T.N., Konjevod, G., Sakla, W.A., Boakye, K.: A large contextual dataset for classification, detection and counting of cars with deep learning. In: Proceedings of the European Conference on Computer Vision (2016) 2

  56. [56]

    In: 2025 International Conference on Digital Image Computing: Techniques and Applications (DICTA) (2025) 4, 11, 13

    Nguyen, G.K., Huang, Y., Hoai, M.: Can current ai models count what we mean, not what they see? a benchmark and systematic evaluation. In: 2025 International Conference on Digital Image Computing: Techniques and Applications (DICTA) (2025) 4, 11, 13

  57. [57]

    In: Proceedings of the European Conference on Computer Vision (2022) 3, 4

    Nguyen, T., Pham, C., Nguyen, K., Hoai, M.: Few-shot object counting and de- tection. In: Proceedings of the European Conference on Computer Vision (2022) 3, 4

  58. [58]

    OpenAI: Gpt-5.1: A smarter, more conversational chatgpt (2025),https : // openai.com/index/gpt-5-1/11, 13

  59. [59]

    OpenAI: Introducing gpt-5.2 (2025),https://openai.com/index/introducing- gpt-5-2/11, 13

  60. [60]

    arXiv preprint arXiv:2504.01805 (2025) 22 B

    Ouyang, K., Liu, Y., Wu, H., Liu, Y., Zhou, H., Zhou, J., Meng, F., Sun, X.: Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint arXiv:2504.01805 (2025) 13

  61. [61]

    In: Proceedings of the International Conference on Computer Vision (2023) 4, 10

    Paiss, R., Ephrat, A., Tov, O., Zada, S., Mosseri, I., Irani, M., Dekel, T.: Teaching clip to count to ten. In: Proceedings of the International Conference on Computer Vision (2023) 4, 10

  62. [62]

    In: Conference on Neural Information Processing Systems (2024) 2, 11, 13, 14

    Pelhan, J., Lukezic, A., Zavrtanik, V., Kristan, M.: A novel unified architecture for low-shot counting by detection and segmentation. In: Conference on Neural Information Processing Systems (2024) 2, 11, 13, 14

  63. [63]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2024) 2, 11, 13, 14

    Pelhan, J., Zavrtanik, V., Kristan, M., et al.: Dave-a detect-and-verify paradigm for low-shot counting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2024) 2, 11, 13, 14

  64. [64]

    In: Proceedings of the International Conference on Machine Learning (2021) 2, 4

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: Proceedings of the International Conference on Machine Learning (2021) 2, 4

  65. [65]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2021) 3, 4, 5, 7, 8, 10, 11, 12, 13, 14, 29

    Ranjan, V., Sharma, U., Nguyen, T., Hoai, M.: Learning to count everything. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2021) 3, 4, 5, 7, 8, 10, 11, 12, 13, 14, 29

  66. [66]

    SAM 2: Segment Anything in Images and Videos

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024) 1

  67. [67]

    Shi, M., Lu, H., Feng, C., Liu, C., Cao, Z.: Represent, compare, and learn: A similarity-awareframeworkforclass-agnosticcounting.In:ProceedingsoftheIEEE Conference on Computer Vision and Pattern Recognition (2022) 4

  68. [68]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) 10 20 C

    Sindagi, V.A., Yasarla, R., Patel, V.M.: Jhu-crowd++: Large-scale crowd counting dataset and a benchmark method. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) 10 20 C. Liu et al

  69. [69]

    OpenAI GPT-5 System Card

    Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card. arXiv preprint arXiv:2601.03267 (2025) 8, 11, 13, 29, 43

  70. [70]

    Kimi-VL Technical Report

    Team, K., Du, A., Yin, B., Xing, B., Qu, B., Wang, B., Chen, C., Zhang, C., Du, C., Wei, C., et al.: Kimi-vl technical report. arXiv preprint arXiv:2504.07491 (2025) 1, 13

  71. [71]

    In: Proceedings of the International Conference on Computer Vision (2023) 2, 3, 4, 11, 13, 14

    Ðukić, N., Lukežič, A., Zavrtanik, V., Kristan, M.: A low-shot object counting network with iterative prototype adaptation. In: Proceedings of the International Conference on Computer Vision (2023) 2, 3, 4, 11, 13, 14

  72. [72]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) 10

    Wang,Q.,Gao,J.,Lin,W.,Li,X.:Nwpu-crowd:Alarge-scalebenchmarkforcrowd counting and localization. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) 10

  73. [73]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025) 11, 13

  74. [74]

    Native and compact structured latents for 3d generation.arXiv preprint arXiv:2512.14692, 2025

    Xiang, J., Chen, X., Xu, S., Wang, R., Lv, Z., Deng, Y., Zhu, H., Dong, Y., Zhao, H., Yuan, N.J., et al.: Native and compact structured latents for 3d generation. arXiv preprint arXiv:2512.14692 (2025) 8, 30

  75. [75]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2025) 8

    Xiang, J., Lv, Z., Xu, S., Deng, Y., Wang, R., Zhang, B., Chen, D., Tong, X., Yang, J.: Structured 3d latents for scalable and versatile 3d generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2025) 8

  76. [76]

    Computer methods in biomechanics and biomedical engineering: Imaging & Visualization (2018) 2, 3

    Xie, W., Noble, J.A., Zisserman, A.: Microscopy cell counting and detection with fully convolutional regression networks. Computer methods in biomechanics and biomedical engineering: Imaging & Visualization (2018) 2, 3

  77. [77]

    In: Proceedings of the International Conference on Computer Vision (2025) 11, 12, 13

    Xu, G., Jin, P., Hao, L., Song, Y., Sun, L., Yuan, L.: Llava-cot: Let vision language models reason step-by-step. In: Proceedings of the International Conference on Computer Vision (2025) 11, 12, 13

  78. [78]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2023) 4

    Xu, J., Le, H., Nguyen, V., Ranjan, V., Samaras, D.: Zero-shot object counting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2023) 4

  79. [79]

    In: Winter Conference on Applications of Computer Vision (2021) 4

    Yang, S.D., Su, H.T., Hsu, W.H., Chen, W.C.: Class-agnostic few-shot object counting. In: Winter Conference on Applications of Computer Vision (2021) 4

  80. [80]

    In: Winter Conference on Applications of Computer Vision (2023) 4

    You, Z., Yang, K., Luo, W., Lu, X., Cui, L., Le, X.: Few-shot object counting with similarity-aware feature enhancement. In: Winter Conference on Applications of Computer Vision (2023) 4

Showing first 80 references.