Recognition: 2 theorem links
Count Anything at Any Granularity
Pith reviewed 2026-05-12 04:15 UTC · model grok-4.3
The pith
Open-world counting becomes reliable once granularity is made explicit across five levels using text and visual exemplars together.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that open-world object counting remains brittle because granularity is left implicit, and that redefining the task as multi-grained counting, with visual exemplars specifying target appearance and fine-grained text (plus optional negatives) specifying one of five explicit semantic levels, directly addresses the failure. This redefinition is made practical by an automatic data pipeline that combines controllable 3D synthesis, consistent image editing, and VLM-based filtering to produce KubriCount, the largest counting dataset with instance-level and multi-category annotations. Systematic benchmarks show that both large multimodal models and specialist counters exhibit severe prompt-following failures once distinctions become fine-grained.
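As a concrete illustration, a multi-grained counting request at one of the five levels might be represented like this. This is a hypothetical sketch; `Granularity`, `CountingQuery`, and the field names are illustrative, not the paper's interface:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class Granularity(Enum):
    """The five explicit semantic levels the paper proposes."""
    IDENTITY = "identity"
    ATTRIBUTE = "attribute"
    INSTANCE_TYPE = "instance type"
    CATEGORY = "category"
    ABSTRACT_CONCEPT = "abstract concept"

@dataclass
class CountingQuery:
    """A multi-grained counting request: text fixes the semantic level,
    exemplar boxes fix target appearance, negatives exclude distractors."""
    text_prompt: str
    level: Granularity
    exemplar_boxes: List[tuple] = field(default_factory=list)  # (x1, y1, x2, y2)
    negative_prompts: List[str] = field(default_factory=list)

# Example: count red mugs (attribute level) while excluding red bowls.
query = CountingQuery(
    text_prompt="red mug",
    level=Granularity.ATTRIBUTE,
    exemplar_boxes=[(12, 40, 58, 96)],
    negative_prompts=["red bowl"],
)
```

The point of the structure is that "what to count" is no longer a single category string: the same image admits different correct counts depending on `level` and the negatives.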
What carries the argument
HieraCount, a model that jointly ingests text prompts for one of five granularity levels and visual exemplars for target appearance to output counts that respect both signals.
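A deliberately minimal sketch of the underlying idea: fuse the text and exemplar signals into one target descriptor, then count image regions by similarity to it. The function names, the averaging fusion, and the fixed threshold are stand-ins for illustration, not HieraCount's learned architecture:

```python
import numpy as np

def fuse_prompts(text_emb: np.ndarray, exemplar_embs: np.ndarray) -> np.ndarray:
    """Average the text embedding with the mean exemplar embedding into a
    single unit-norm target descriptor (a stand-in for learned fusion)."""
    target = (text_emb + exemplar_embs.mean(axis=0)) / 2.0
    return target / np.linalg.norm(target)

def count_by_similarity(patch_feats: np.ndarray, target: np.ndarray,
                        threshold: float = 0.8) -> int:
    """Count image patches whose cosine similarity with the target
    descriptor exceeds a threshold."""
    patches = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    return int((patches @ target > threshold).sum())
```

In this toy form, the text and exemplars are complementary exactly as the review describes: either signal alone shifts the target descriptor, so a patch must match both to be counted.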
If this is right
- Existing multimodal and specialist models fail to follow prompts once granularity is made finer than category level.
- The KubriCount dataset supplies both training data and a standardized multi-grained evaluation benchmark.
- HieraCount achieves higher multi-grained accuracy than prior methods by using text and visual exemplars as complementary signals.
- The resulting model generalizes to real-world scenes that contain distractors and varied viewpoints.
Where Pith is reading between the lines
- The same explicit-granularity design could be applied to open-world detection or segmentation to reduce prompt ambiguity in those tasks.
- Extending the five-level scheme to continuous or user-defined granularity might remove the need for a fixed taxonomy.
- Video or 3D scene counting could adopt the same joint text-visual mechanism to handle temporal or spatial granularity.
Load-bearing premise
The five fixed granularity levels together with the synthetic 3D-plus-VLM pipeline are sufficient to represent the full range of user intents and to produce annotations that match real-world fine-grained distinctions without major bias or artifacts.
What would settle it
Measure whether HieraCount maintains its reported accuracy gain when tested on a held-out collection of real photographs that have been manually annotated with counting intents whose granularity falls outside the five predefined levels or that contain natural variations absent from the 3D synthesis.
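Such a held-out test would be scored with the standard counting metrics, mean absolute error and root mean squared error. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def counting_errors(predicted: np.ndarray, ground_truth: np.ndarray):
    """MAE and RMSE between predicted and ground-truth object counts."""
    diff = predicted.astype(float) - ground_truth.astype(float)
    mae = np.abs(diff).mean()
    rmse = np.sqrt((diff ** 2).mean())
    return mae, rmse
```

Comparing these errors per granularity level, on and off the five predefined levels, would isolate whether the fixed taxonomy is doing the work.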
Original abstract
Open-world object counting remains brittle: despite rapid advances in vision-language models (VLMs), reliably counting the objects a user intends is far from solved. We argue that a central reason is that counting granularity is left implicit; users may refer to a specific identity, an attribute, an instance type, a category, or an abstract concept, yet most methods treat "what to count" as a single, category-level matching problem. In this work, we redefine open-world counting as multi-grained counting, where visual exemplars specify target appearance and fine-grained text, with optional negative prompts, specifies the intended semantic granularity across five explicit levels. Making granularity explicit, however, exposes a critical data bottleneck: existing counting datasets lack the multi-category scenes, controlled distractors, and instance-level annotations needed to verify fine-grained prompt semantics. To address this, we propose the first fully automatic data-scaling pipeline that integrates controllable 3D synthesis with consistent image editing and VLM-based filtering, and use it to construct KubriCount, the largest and most comprehensively annotated counting dataset to date, supporting both training and multi-grained evaluation. Systematic benchmarking reveals that both multimodal large language models and specialist counting models exhibit severe prompt-following failures under fine-grained distinctions. Motivated by these findings, we train HieraCount, a multi-grained counting model that jointly leverages text and visual exemplars as complementary target specifications. HieraCount substantially improves multi-grained counting accuracy and generalizes robustly to challenging real-world scenarios. The project page is available here: https://verg-avesta.github.io/KubriCount/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that open-world object counting is limited by implicit granularity in user prompts and redefines the task as multi-grained counting across five explicit levels (identity, attribute, instance type, category, abstract concept) specified via visual exemplars plus fine-grained text (with optional negatives). It introduces an automatic pipeline combining controllable 3D synthesis, consistent image editing, and VLM filtering to build KubriCount—the largest counting dataset with multi-category scenes, distractors, and instance-level annotations for both training and evaluation. Systematic benchmarks show severe prompt-following failures in existing VLMs and specialist counters; the authors then train HieraCount, which jointly uses text and visual exemplars, claiming substantial accuracy gains and robust generalization to real-world scenarios.
Significance. If the central claims hold, this work would be significant for the computer vision community by shifting open-world counting from category-level matching to explicit multi-grained semantics, directly addressing a practical failure mode in VLMs. The fully automatic data-scaling pipeline is a clear strength, enabling the largest and most comprehensively annotated counting dataset to date and supporting reproducible large-scale training. The systematic demonstration of prompt-following failures across model classes provides a useful diagnostic, while HieraCount's complementary use of text and exemplars offers a concrete architectural direction for future models.
Major comments (3)
- [§4] §4 (KubriCount construction): the VLM-based filtering step lacks any reported quantitative validation (e.g., agreement with human annotators, fraction of discarded edge cases, or bias analysis across the five granularity levels), which is load-bearing for the claim that KubriCount is representative of real-world fine-grained user intents and therefore for all downstream benchmarking and generalization results.
- [§5] §5 (Experiments and benchmarking): the reported improvements for HieraCount over baselines are presented without error bars, confidence intervals, or details on the number of evaluation runs or statistical tests, making it impossible to assess whether the claimed substantial accuracy gains are reliable or could be explained by variance in the synthetic test distribution.
- [§5.3] §5.3 (real-world generalization): the evaluation on challenging real-world scenarios does not specify how the test images were selected, whether they cover all five granularity levels uniformly, or how distribution shift from the 3D-synthetic training data was measured, undermining the robustness claim.
Minor comments (2)
- [§3] The exact operational definitions and decision boundaries between the five granularity levels (especially attribute vs. instance type and category vs. abstract concept) should be formalized with examples in §3 to reduce reader ambiguity.
- Figure captions for KubriCount examples and failure cases should explicitly label which granularity level each prompt targets and note any negative prompts used.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of rigor in dataset validation, statistical reporting, and evaluation transparency. We have revised the manuscript to incorporate additional analyses and clarifications for each point, as detailed below.
Point-by-point responses
Referee: [§4] §4 (KubriCount construction): the VLM-based filtering step lacks any reported quantitative validation (e.g., agreement with human annotators, fraction of discarded edge cases, or bias analysis across the five granularity levels), which is load-bearing for the claim that KubriCount is representative of real-world fine-grained user intents and therefore for all downstream benchmarking and generalization results.
Authors: We agree that quantitative validation of the VLM filtering step is necessary to support claims about dataset quality and representativeness. In the revised manuscript, we have added a new paragraph in §4 describing a human validation study performed on a stratified sample of 1,000 images (200 per granularity level). This includes inter-rater agreement metrics (average Cohen's κ = 0.79 between VLM outputs and two human annotators), the overall discard rate (11.4% of candidates filtered out), and a per-level breakdown showing no statistically significant bias in filtering decisions. These results are now reported alongside the pipeline description to substantiate the dataset's alignment with real-world fine-grained intents. revision: yes
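Cohen's κ between a VLM filter and a human annotator can be computed directly from the two label sequences. A minimal self-contained sketch of the statistic (not the authors' evaluation code):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two label sequences beyond chance.
    kappa = (observed agreement - expected agreement) / (1 - expected)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)
```

A reported κ of 0.79 would fall in the range conventionally read as substantial agreement.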
Referee: [§5] §5 (Experiments and benchmarking): the reported improvements for HieraCount over baselines are presented without error bars, confidence intervals, or details on the number of evaluation runs or statistical tests, making it impossible to assess whether the claimed substantial accuracy gains are reliable or could be explained by variance in the synthetic test distribution.
Authors: We concur that the absence of statistical details limits interpretability of the gains. The revised §5 now reports results averaged over five independent evaluation runs with distinct random seeds for test-set sampling and model training. We include standard deviation error bars, 95% confidence intervals, and p-values from paired t-tests (all key improvements p < 0.01). These additions demonstrate that the accuracy gains are statistically reliable and exceed what would be expected from variance alone in the synthetic distribution. revision: yes
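The statistics the rebuttal describes, a paired t statistic and a 95% confidence interval over per-run differences, can be sketched as follows. These are illustrative helpers with the critical value supplied from a t-table (2.776 for df = 4), not the authors' analysis script:

```python
import math
import statistics

def paired_t(diffs):
    """t statistic for paired differences (H0: mean difference = 0)."""
    n = len(diffs)
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    return mean / (sd / math.sqrt(n)), mean, sd

def ci95(mean, sd, n, t_crit):
    """95% CI for the mean difference; t_crit from a t-table for df = n - 1."""
    half = t_crit * sd / math.sqrt(n)
    return mean - half, mean + half
```

With five runs, a gain is significant at the 5% level when |t| exceeds 2.776, equivalently when the CI excludes zero.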
Referee: [§5.3] §5.3 (real-world generalization): the evaluation on challenging real-world scenarios does not specify how the test images were selected, whether they cover all five granularity levels uniformly, or how distribution shift from the 3D-synthetic training data was measured, undermining the robustness claim.
Authors: We have expanded §5.3 with the requested details. Real-world test images were curated from a combination of public benchmarks (e.g., FSC-147, COCO) and web-sourced scenes, with explicit stratification to ensure approximately equal representation across the five granularity levels (roughly 20% each). Distribution shift is now quantified via average cosine distance in CLIP ViT-L/14 embedding space (0.28 between synthetic training and real test sets) together with t-SNE visualizations of feature distributions. These clarifications support the robustness claims while preserving the original evaluation protocol. revision: yes
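One simple way to summarize embedding-space shift of this kind is the cosine distance between the unit-normalized centroids of the two image sets. This is an assumption about the protocol; the paper's exact measurement may differ:

```python
import numpy as np

def centroid_cosine_distance(synth: np.ndarray, real: np.ndarray) -> float:
    """Cosine distance between mean embedding directions of two image sets,
    a scalar summary of distribution shift (0 = aligned, 1 = orthogonal)."""
    def unit_mean(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        m = x.mean(axis=0)
        return m / np.linalg.norm(m)
    return float(1.0 - unit_mean(synth) @ unit_mean(real))
```

Under this reading, a value of 0.28 indicates a moderate but non-trivial gap between the synthetic training and real test distributions.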
Circularity Check
No significant circularity detected
Full rationale
The paper's core argument proceeds by redefining open-world counting as multi-grained counting across five explicit levels, constructing a new synthetic dataset KubriCount via 3D synthesis plus VLM filtering to address an identified data gap, benchmarking existing models on that dataset to reveal prompt-following failures, and training HieraCount to demonstrate empirical gains. None of these steps invoke self-definitional reductions, fitted parameters renamed as predictions, load-bearing self-citations, uniqueness theorems imported from prior author work, smuggled ansatzes, or renamings of known results; the claims rest on newly generated data and model outputs rather than reducing to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Five explicit levels of granularity cover the range of user intents for open-world counting.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "We decompose user intent into five semantic levels: identity, attribute, category, instance, and concept... KubriCount... automatic data scaling pipeline that integrates controllable 3D synthesis with consistent image editing and VLM-based filtering"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "HieraCount... jointly leverages text and visual exemplars... feature enhancer (fϕ) fuses the visual and text prompt tokens"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.