ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation

Anindya Mondal; Anjan Dutta; Sauradip Nag

arxiv: 2606.23835 · v1 · pith:Y7KULFXTnew · submitted 2026-06-22 · 💻 cs.CV · eess.IV

Pith reviewed 2026-06-26 08:50 UTC · model grok-4.3

keywords object counting · crowd counting · image generation · vision-language model · unified foundation model · adaptive zooming · count policy

The pith

ABACUS adapts a 3B vision-language model to unify object counting and count-faithful image generation without benchmark-specific training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a single 3B-parameter unified foundation model can be adapted to perform object counting, crowd counting, referring-expression counting, and count-faithful image generation. It does this through three innovations: density-aware adaptive zooming using objectness maps, a boundary-aware count policy with GRPO, and cycle-consistent GRPO for self-critique. This approach eliminates the need for separate task-specific models or additional annotations. A reader would care because it demonstrates that understanding and generation can be bridged in one compact model, outperforming both specialists and larger models on seven benchmarks.

Core claim

ABACUS is a unified vision-language model built on an existing 3B-parameter foundation model that handles multiple counting tasks and count-faithful image generation without any benchmark-specific training. It achieves this by integrating density-aware adaptive zooming with objectness maps for spatial grounding, a boundary-aware count policy via GRPO to eliminate crop-boundary errors, and a cycle-consistent GRPO strategy where the understanding branch self-critiques generated outputs to close the understanding-generation gap without external annotations. The model achieves state-of-the-art results across seven benchmarks, outperforming task-specific specialists and larger generalist models.

What carries the argument

Cycle-consistent GRPO strategy integrated with density-aware adaptive zooming and boundary-aware count policy, enabling self-critique between understanding and generation branches in the 3B model.

If this is right

Handles object counting, crowd counting, referring-expression counting, and count-faithful image generation in one model.
Achieves state-of-the-art performance on seven benchmarks without task-specific training or external annotations.
Outperforms both task-specific specialist models and larger generalist models.
Closes the understanding-generation gap through internal self-critique mechanisms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The self-critique approach could extend to aligning other perception and generation tasks in vision-language models.
Using a smaller 3B model suggests potential for efficient deployment in resource-constrained environments.
Adaptive zooming and boundary policies might improve localization accuracy in other dense prediction tasks.

Load-bearing premise

The three proposed innovations can be successfully integrated into the base 3B model to close the understanding-generation gap without any external annotations or benchmark-specific training.

What would settle it

Demonstrating that the cycle-consistent GRPO does not improve generation quality on a counting benchmark or that performance drops below specialist models when any innovation is removed would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.23835 by Anindya Mondal, Anjan Dutta, Sauradip Nag.

**Figure 1.** ABACUS overview. A single unified model performs count-aware image generation (left), object counting, crowd counting, and referring-expression counting (right) using text-only prompts with no benchmark-specific training.

**Figure 2.** Issues in Count Generation and Understanding. (a) Text-to-image diffusion models lack any mechanism to verify output cardinality. (b) VLMs and MLLMs default to coarse magnitude estimates (e.g., “Greater than 100”) on dense scenes. (c) Existing UMMs support both tasks from a single model yet exhibit a synergy gap: the same model that correctly counts 4 apples cannot generate exactly 4.

**Figure 4.** Infusing objectness in the MLLM. The objectness map is extracted from the attention layers of the language model. Per-head isolation and learned affine alignment produce a spatial distribution over visual token positions, from which object peaks are identified.
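The last step of the Figure 4 caption, turning per-position scores into a spatial distribution and reading off peaks, can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the source does not say how the affine alignment is parameterised or how peaks are selected, so `scale`, `shift`, `min_prob`, the clamp-and-normalise step, and the 3×3 local-maximum rule are hypothetical stand-ins rather than the paper's method.

```python
from typing import List, Tuple

Grid = List[List[float]]


def spatial_distribution(scores: Grid, scale: float = 1.0, shift: float = 0.0) -> Grid:
    """Apply an affine map to raw scores, clamp at zero, and normalise to sum to 1."""
    # Hypothetical stand-in for the "learned affine alignment" named in the caption.
    values = [[max(0.0, scale * s + shift) for s in row] for row in scores]
    total = sum(sum(row) for row in values)
    if total == 0:
        raise ValueError("all scores were clamped to zero")
    return [[v / total for v in row] for row in values]


def find_peaks(dist: Grid, min_prob: float) -> List[Tuple[int, int]]:
    """Return (row, col) positions that exceed min_prob and beat all 8 neighbours."""
    h, w = len(dist), len(dist[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            v = dist[r][c]
            if v < min_prob:
                continue
            neighbours = [
                dist[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(v > n for n in neighbours):
                peaks.append((r, c))
    return peaks


# Toy 4x4 grid of per-token scores with two well-separated bumps.
scores = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 4.0],
]
dist = spatial_distribution(scores)
peaks = find_peaks(dist, min_prob=0.1)
print(peaks)
```

In this toy grid each bump becomes one peak, so the number of peaks matches the number of objects. Real attention maps are noisier, and the threshold would need tuning.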

**Figure 3.** Density-aware adaptive zooming. The zoom indicator ϕ(·) classifies each image region as sparse or dense. Dense regions are recursively partitioned into 2×2 sub-regions until resolution γ is reached; sparse regions are processed in a single pass. Local predictions are aggregated to produce the global count.
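The recursion described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: `is_dense` stands in for the zoom indicator ϕ(·), `count_region` stands in for a per-crop query to the model, and `min_size` plays the role of the resolution limit γ. All three names are hypothetical, and the toy point data exists only to make the example runnable.

```python
from typing import Callable, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open pixel box


def adaptive_count(
    box: Box,
    is_dense: Callable[[Box], bool],
    count_region: Callable[[Box], int],
    min_size: int,
) -> int:
    """Count objects in `box`, splitting into 2x2 sub-regions only where dense."""
    x0, y0, x1, y1 = box
    width, height = x1 - x0, y1 - y0
    # Stop when the region is sparse, or when a further split would fall
    # below the resolution floor (the role gamma plays in the caption).
    if not is_dense(box) or min(width, height) // 2 < min_size:
        return count_region(box)
    mx, my = x0 + width // 2, y0 + height // 2
    quadrants = [(x0, y0, mx, my), (mx, y0, x1, my), (x0, my, mx, y1), (mx, my, x1, y1)]
    # Aggregate the local predictions into the global count.
    return sum(adaptive_count(q, is_dense, count_region, min_size) for q in quadrants)


# Toy data: objects are points, so half-open boxes never double-count them.
points = [(1, 1), (2, 3), (10, 12), (11, 12), (12, 13), (13, 14), (14, 15), (3, 14)]


def points_in(box: Box) -> int:
    x0, y0, x1, y1 = box
    return sum(1 for (x, y) in points if x0 <= x < x1 and y0 <= y < y1)


def toy_is_dense(box: Box) -> bool:
    # Hypothetical density rule: more than three objects counts as dense.
    return points_in(box) > 3


total = adaptive_count((0, 0, 16, 16), toy_is_dense, points_in, min_size=2)
print(total)
```

Point objects make the aggregation exact. Objects with spatial extent can straddle a crop boundary, which is the problem the boundary-aware policy in Figure 5 addresses.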

**Figure 5.** Boundary-aware count policy. When (a) a GT dense image is partitioned into 2×2 quadrants, objects straddling crop boundaries (red) risk double-counting. The policy classifies each object as interior (green), edge (yellow, centroid on this side), or boundary (red, centroid in adjacent quadrant), and GRPO with nested rewards trains the model to produce consistent per-quadrant counts.
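The three labels in the Figure 5 caption suggest a simple geometric rule, sketched below in Python. The caption defines the labels by centroid position. The counting rule used here (a quadrant counts its interior and edge objects and skips boundary ones) is an assumption added for illustration, as are the function names and the toy boxes. The paper trains a policy with GRPO to produce such counts, which this sketch does not attempt to reproduce.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def classify(obj: Box, quad: Box) -> str:
    """Label an object relative to one quadrant."""
    ox0, oy0, ox1, oy1 = obj
    qx0, qy0, qx1, qy1 = quad
    overlaps = ox0 < qx1 and ox1 > qx0 and oy0 < qy1 and oy1 > qy0
    if not overlaps:
        return "outside"
    if ox0 >= qx0 and ox1 <= qx1 and oy0 >= qy0 and oy1 <= qy1:
        return "interior"  # fully inside this quadrant
    cx, cy = (ox0 + ox1) / 2, (oy0 + oy1) / 2
    # Half-open test so a centroid on a shared border belongs to exactly one quadrant.
    owns_centroid = qx0 <= cx < qx1 and qy0 <= cy < qy1
    return "edge" if owns_centroid else "boundary"


def quadrant_count(objects: List[Box], quad: Box) -> int:
    """Assumed rule: count only objects whose centroid lies in this quadrant."""
    return sum(classify(o, quad) in ("interior", "edge") for o in objects)


def naive_count(objects: List[Box], quad: Box) -> int:
    """Count every object that is visible in the crop at all."""
    return sum(classify(o, quad) != "outside" for o in objects)


quadrants = [(0, 0, 5, 5), (5, 0, 10, 5), (0, 5, 5, 10), (5, 5, 10, 10)]
objects = [
    (1, 1, 2, 2),      # interior to the top-left quadrant
    (4, 1, 7, 2),      # straddles the vertical split, centroid on the right
    (4, 4, 5.5, 5.5),  # touches all four quadrants, centroid top-left
    (6, 6, 9, 9),      # interior to the bottom-right quadrant
]
per_quadrant = [quadrant_count(objects, q) for q in quadrants]
naive_total = sum(naive_count(objects, q) for q in quadrants)
print(per_quadrant, sum(per_quadrant), naive_total)
```

With centroid ownership the per-quadrant counts sum to the true total of four, while counting every visible object yields eight because the straddling boxes are seen in several crops.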

**Figure 6.** Count-aware image enhancement via cycle-consistent GRPO. The generation branch samples N candidate images from a count-conditioned prompt. The understanding branch counts objects in each candidate and scores aesthetic quality. The count deviation and aesthetic score form the GRPO reward, which updates the generation branch.
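The reward signal in the Figure 6 caption can be made concrete with a short Python sketch. The caption says only that count deviation and aesthetic score form the reward. The specific combination below (negative relative deviation plus a weighted aesthetic term), the weight `w_aes`, and the sample numbers are assumptions for illustration. The group-relative normalisation follows the usual GRPO recipe of standardising rewards within a sampled group, and the use of the population standard deviation is also an assumption.

```python
from statistics import mean, pstdev
from typing import List


def count_reward(predicted: int, target: int, aesthetic: float, w_aes: float = 0.5) -> float:
    """Hypothetical reward: penalise count deviation, reward aesthetic quality."""
    deviation = abs(predicted - target) / max(target, 1)
    return -deviation + w_aes * aesthetic


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardise rewards within one group of candidates (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt asking for 4 objects, with N = 4 sampled candidates.
target = 4
predicted_counts = [4, 5, 2, 4]          # as judged by the understanding branch
aesthetic_scores = [0.8, 0.9, 0.7, 0.4]  # assumed to lie in [0, 1]

rewards = [count_reward(p, target, a) for p, a in zip(predicted_counts, aesthetic_scores)]
advantages = group_advantages(rewards)
best = max(range(len(advantages)), key=advantages.__getitem__)
print(rewards, advantages, best)
```

Candidates with above-average reward receive positive advantages and are reinforced, and the rest are pushed down. Because the scores come from the model's own understanding branch, no external count annotation enters the loop, which matches the self-critique framing in the abstract.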

**Figure 7.** Qualitative comparison on count understanding. ABACUS (Ours) tracks the ground truth across all density regimes, from sparse (GT: 4) to extremely dense (GT: 1400). CountGD++ overcounts in dense scenes (900 for GT: 298); T2I Count catastrophically fails on out-of-distribution layouts (0 for GT: 1231); WS-COC and UniLIP-3B default to coarse magnitude estimates.

**Figure 8.** Qualitative comparison on count generation. ABACUS achieves exact or near-exact counts while preserving natural spatial arrangement. UniLIP-3B [74] systematically undercounts; BAGEL [24] overcounts with unnatural compositions; CountGen [10] produces rigid grid patterns; Counting Guidance [35] exhibits mode collapse (97 donuts for a prompt of 22).

**Figure 9.** (Caption not recoverable from the extraction; adjacent body text follows.) …), satellite vehicle detection, or industrial defect inspection present distribution shifts that the current count-balanced training mixture does not cover. Lightweight domain adaptation of the LoRA adapter, without retraining the frozen backbone, could unlock these verticals while preserving the general-purpose counting ability. Inference cost on extremely dense scenes: Adaptive zooming keeps average…

**Figure 10.** Count understanding gallery. ABACUS predictions (green) across diverse categories and count ranges from FSC-147, CARPK, and ShanghaiTech. The model achieves exact or near-exact counts from sparse scenes (GT: 1, pencil) to dense crowds (GT: 261, go stones) using text-only prompts.

**Figure 11.** Count generation gallery. ABACUS generates images with the exact requested count across diverse prompts, maintaining naturalistic spatial arrangement and high aesthetic quality.

**Figure 12.** Human evaluation form. For each prompt, annotators rate five anonymized images on Count Accuracy, Aesthetic Quality, and Prompt Alignment (0–4 Likert), then select an overall preferred image.

Original abstract

ABACUS is a unified vision-language model that handles object counting, crowd counting, referring-expression counting, and count-faithful image generation without any benchmark-specific training. Our model is built on an existing 3B-parameter unified foundation model and is adapted for object localization tasks using three key innovations: density-aware adaptive zooming with objectness maps for spatial grounding; a boundary-aware count policy via GRPO to eliminate crop-boundary errors; and a cycle-consistent GRPO strategy where the understanding branch self-critiques generated outputs, closing the understanding-generation gap without any external annotations. ABACUS achieves state-of-the-art results across seven benchmarks, outperforming both task-specific specialists and larger generalist models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ABACUS adapts a 3B model with zooming, GRPO policies, and cycle-consistent self-critique for counting tasks, but the text supplies no numbers or ablations to check the SOTA claims.

The core of the paper is a 3B vision-language model adapted for object counting, crowd counting, referring-expression counting, and count-faithful generation. The three changes are density-aware adaptive zooming with objectness maps, a boundary-aware count policy using GRPO, and a cycle-consistent GRPO loop where the understanding branch critiques the outputs of the generation branch. No benchmark-specific training or external annotations are used.

The cycle-consistent loop is the part that stands out. It attempts to close the understanding-generation gap internally, which is a reasonable direction if the self-critique actually works.

The presentation identifies a real fragmentation problem in current systems that treat counting understanding and generation separately. That framing is clear.

The soft spot is the total lack of supporting evidence. The abstract states SOTA results on seven benchmarks and outperformance of both specialists and larger models, yet no scores, error bars, ablation tables, or comparisons appear in the text. Without those, it is impossible to tell whether the three adaptations produce the gains or whether the base model already performs well on these tasks. The same holds for the GRPO components: their implementation details and failure modes are not shown.

This is for researchers already working on multimodal counting or unified vision-language models. Someone in that subfield could extract the adaptation ideas for their own experiments, but the current version does not provide enough to evaluate the claims.

I would send the full manuscript to peer review if the results and ablations are added, because the unification goal is worth checking even if the current evidence is missing.

Referee Report

2 major / 0 minor

Summary. The paper introduces ABACUS, a 3B-parameter unified vision-language model adapted from an existing foundation model using three innovations (density-aware adaptive zooming with objectness maps, boundary-aware count policy via GRPO, and cycle-consistent GRPO) to handle object counting, crowd counting, referring-expression counting, and count-faithful image generation without benchmark-specific training or external annotations. It claims state-of-the-art results across seven benchmarks, outperforming both task-specific specialists and larger generalist models.

Significance. If substantiated, the result would be significant for showing that a compact unified model can close the understanding-generation gap in counting tasks via self-critique mechanisms, offering an efficient alternative to specialized models and reducing reliance on task-specific data.

major comments (2)

Abstract: The claim of achieving state-of-the-art results on seven benchmarks is presented without any quantitative metrics, tables, error bars, ablation studies, or baseline comparisons, which is load-bearing for the central empirical claim and prevents verification of whether the three innovations drive the reported gains.
The manuscript text provides no methods, equations, implementation details, or result sections to examine the integration of density-aware adaptive zooming, the GRPO policies, or the cycle-consistent loop, leaving the central claim that these close the understanding-generation gap without external annotations unverifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the feedback. We address each major comment below and will revise the manuscript to improve the verifiability of our claims.

Referee: Abstract: The claim of achieving state-of-the-art results on seven benchmarks is presented without any quantitative metrics, tables, error bars, ablation studies, or baseline comparisons, which is load-bearing for the central empirical claim and prevents verification of whether the three innovations drive the reported gains.

Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will add specific performance numbers (e.g., absolute gains on each of the seven benchmarks versus the strongest baselines) and a brief statement on the contribution of each innovation, while remaining within length limits.
Revision: yes

Referee: The manuscript text provides no methods, equations, implementation details, or result sections to examine the integration of density-aware adaptive zooming, the GRPO policies, or the cycle-consistent loop, leaving the central claim that these close the understanding-generation gap without external annotations unverifiable.

Authors: The full manuscript contains dedicated methodology and results sections that describe the density-aware adaptive zooming (with objectness-map equations), the boundary-aware GRPO policy, the cycle-consistent self-critique loop, implementation details, and supporting ablations. We will ensure these sections are more prominently cross-referenced from the abstract and introduction in the revision.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and description contain no equations, derivations, or first-principles claims. All central assertions are empirical performance results on external benchmarks after adaptation of a base model. No self-citations, fitted parameters renamed as predictions, or self-definitional steps are present or load-bearing. The work is therefore self-contained against external benchmarks with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, training details, or parameter counts beyond the base 3B model are provided, so the ledger cannot be populated with concrete free parameters or axioms from the paper.

Reference graph

Works this paper leans on

93 extracted references · 20 canonical work pages · 12 internal anchors

[1] Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In ICCV, pages 8948–8957, 2019.
[2] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2189–2202, 2012.
[3] N. Amini-Naieni, K. Amini-Naieni, T. Han, and A. Zisserman. Open-world text-specified object counting. In British Machine Vision Conference, 2023.
[4] N. Amini-Naieni, T. Han, and A. Zisserman. CountGD: Multi-modal open-world counting. In NeurIPS, 2024.
[5] Niki Amini-Naieni and Andrew Zisserman. CountGD++: Generalized prompting for open-world counting. arXiv preprint arXiv:2512.23351, 2025.
[6] Niki Amini-Naieni, Kiana Amini-Naieni, Tengda Han, and Andrew Zisserman. Open-world text-specified object counting. arXiv preprint arXiv:2306.01851, 2023.
[7] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. a... (arXiv, 2025)
[8] Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023.
[9] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science, 2(3):8, 2023. DALL-E 3 technical report.
[10] Lital Binyamin, Yoad Tewel, et al. Make it count: Text-to-image generation with an accurate number of objects, 2024. arXiv:2406.10210.
[11] Lital Binyamin, Yoad Tewel, Hilit Segev, Eran Hirsch, Royi Rassin, and Gal Chechik. Make it count: Text-to-image generation with an accurate number of objects. In CVPR, pages 13242–13251, 2025.
[12] Black-Forest-Labs. FLUX. https://github.com/black-forest-labs/flux, 2024.
[13] Prithvijit Chattopadhyay, Ramakrishna Vedantam, Ramprasaath R Selvaraju, Dhruv Batra, and Devi Parikh. Counting everyday objects in everyday scenes. In CVPR, pages 1135–1144, 2017.
[14] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. volume 42, pages 1–10, 2023.
[15] Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S-H Gary Chan, and Hongyang Zhang. Revisiting referring expression comprehension evaluation in the era of large multimodal models. In CVPR, pages 513–524,
[16] Junyu Chen, Dongyun Zou, Wenkun He, Junsong Chen, Enze Xie, Song Han, and Han Cai. DC-AE 1.5: Accelerating diffusion model convergence with structured latent space. In ICCV, pages 19628–19637, 2025.
[17] Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-Pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025.
[18] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, pages 24185–24198, 2024.
[19] Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161, 2025.
[20] Omer Dahary, Or Patashnik, Kfir Aberman, and Daniel Cohen-Or. Be yourself: Bounded attention for multi-subject text-to-image generation. In ECCV, pages 432–448, 2024.
[21] Siyang Dai, Jun Liu, and Ngai-Man Cheung. Referring expression counting. In CVPR, pages 16985–16995, June 2024.
[22] Siyang Dai, Jun Liu, and Ngai-Man Cheung. Referring expression counting. In CVPR, pages 16985–16995, 2024.
[23] Siyang Dai, Jun Liu, and Ngai-Man Cheung. Referring expression counting. In CVPR, pages 16985–16995, 2024.
[24] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025.
[25] Zhipeng Du, Jiankang Deng, and Miaojing Shi. Domain-general crowd counting in unseen scenarios. In AAAI, volume 37, pages 561–570, 2023.
[26] Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. LayoutGPT: compositional visual planning and generation with large language models. NeurIPS, 36:18225–18250, 2023.
[27] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. NeurIPS, 36:52132–52152, 2023.
[28] Eran Goldman, Roei Herzig, Aviv Eisenschtat, Jacob Goldberger, and Tal Hassner. Precise detection in densely packed scenes. In Proc. Conf. Comput. Vision Pattern Recognition (CVPR), 2019.
[29] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, pages 6325–6334, 2017. doi: 10.1109/CVPR.2017.670.
[30] Mingyue Guo, Li Yuan, Zhaoyi Yan, Binghui Chen, Yaowei Wang, and Qixiang Ye. Regressor-segmenter mutual prompt learning for crowd counting. In CVPR, pages 28380–28389,
[31] Meng-Ru Hsieh, Yen-Liang Lin, and Winston H Hsu. Drone-based object counting by spatially regularized regional proposal network. In ICCV, pages 4145–4153, 2017.
[32]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InICLR,
[33]

URL https://openreview.net/forum?id= nZeVKeeFYf9. 6, 14
[34]

CLIP-Count: Towards Text-Guided Zero-Shot Object Counting

Ruixiang Jiang, Lingbo Liu, and Changwen Chen. Clip-count: Towards text-guided zero-shot object counting.arXiv preprint arXiv:2305.07304, 2023. 3

work page internal anchor Pith review arXiv 2023
[35]

Vlcounter: Text-aware visual representation for zero- shot object counting

Seunggu Kang, WonJun Moon, Euiyeon Kim, and Jae-Pil Heo. Vlcounter: Text-aware visual representation for zero- shot object counting. InAAAI, volume 38, pages 2714–2722,
[36]

Counting guidance for high fidelity text-to-image synthesis

Wonjun Kang, Kevin Galim, Hyung Il Koo, and Nam Ik Cho. Counting guidance for high fidelity text-to-image synthesis. InWACV, pages 899–908, 2025. 1, 2, 3, 7, 9, 16

2025
[37]

Referitgame: Referring to objects in pho- tographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in pho- tographs of natural scenes. InProceedings of the 2014 con- ference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014. 4

2014
[38]

Deep- box: Learning objectness with convolutional networks

Weicheng Kuo, Bharath Hariharan, and Jitendra Malik. Deep- box: Learning objectness with convolutional networks. In ICCV, pages 2479–2487, 2015. 2

2015
[39]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Calibrating uncertainty for semi-supervised crowd counting

Chen Li, Xiaoling Hu, Shahira Abousamra, and Chao Chen. Calibrating uncertainty for semi-supervised crowd counting. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 16685–16695. IEEE, 2023. 3

2023
[41]

Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision- language understanding and generation. InICML, pages 12888–12900. PMLR, 2022. 3

2022
[42]

GLIGEN: open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Jianwei Yang, et al. GLIGEN: open-set grounded text-to-image generation. InCVPR, pages 22511–22521, 2023. 3

2023
[43]

Csrnet: Di- lated convolutional neural networks for understanding the highly congested scenes

Yuhong Li, Xiaofan Zhang, and Deming Chen. Csrnet: Di- lated convolutional neural networks for understanding the highly congested scenes. InCVPR, pages 1091–1100, 2018. 3

2018
[44]

An end-to-end transformer model for crowd localization.ECCV, 2022

Dingkang Liang, Wei Xu, and Xiang Bai. An end-to-end transformer model for crowd localization.ECCV, 2022. 3

2022
[45]

Crowdclip: Unsupervised crowd counting via vision-language model

Dingkang Liang, Jiahao Xie, Zhikang Zou, Xiaoqing Ye, Wei Xu, and Xiang Bai. Crowdclip: Unsupervised crowd counting via vision-language model. InCVPR, pages 2893–2903, 2023. 2

2023
[46]

Countr: Transformer-based generalised visual count- ing.arXiv preprint arXiv:2208.13721, 2022

Chang Liu, Yujie Zhong, Andrew Zisserman, and Weidi Xie. Countr: Transformer-based generalised visual count- ing.arXiv preprint arXiv:2208.13721, 2022. 1, 3

work page arXiv 2022
[47]

Point- query quadtree for crowd counting, localization, and more

Chengxin Liu, Hao Lu, Zhiguo Cao, and Tongliang Liu. Point- query quadtree for crowd counting, localization, and more. In ICCV, pages 1676–1685, 2023. 3

2023
[48]

Llavanext: Improved reason- ing, ocr, and world knowledge, 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reason- ing, ocr, and world knowledge, 2024. 3

2024
[49]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023. 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InECCV, pages 38–55. Springer, 2024. 4

2024
[51]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InECCV, pages 38–55. Springer, 2024. 7

2024
[52]

Countse: Soft exemplar open-set object counting

Shuai Liu, Peng Zhang, Shiwei Zhang, and Wei Ke. Countse: Soft exemplar open-set object counting. InICCV, pages 21536–21546, 2025. 7

2025
[53]

Class-agnostic counting

Erika Lu, Weidi Xie, and Andrew Zisserman. Class-agnostic counting. InComputer Vision–ACCV 2018: 14th Asian Con- ference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part III 14, pages 669–684. Springer, 2019. 3

2018
[54]

CountLoop: Training-Free High-Instance Image Generation via Iterative Agent Guidance

Anindya Mondal, Ayan Banerjee, Sauradip Nag, Josep Llados, Xiatian Zhu, and Anjan Dutta. Countloop: Training-free high- instance image generation via iterative agent guidance.arXiv preprint arXiv:2508.16644, 2025. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

A large contextual dataset for classification, de- tection and counting of cars with deep learning

T Nathan Mundhenk, Goran Konjevod, Wesam A Sakla, and Kofi Boakye. A large contextual dataset for classification, de- tection and counting of cars with deep learning. InComputer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pages 785–800. Springer, 2016. 3

2016
[56]

Towards interpreting visual information processing in vision-language models

Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez. Towards interpreting visual information processing in vision-language models. In ICLR, volume 2025, pages 57172–57189, 2025. 4

2025
[57]

Gpt-5.5 system card

OpenAI. Gpt-5.5 system card, 2026. URL https://deploymentsafety.openai.com/gpt-5-5/gpt-5-5.pdf. Accessed: 2026-05-04. 7

2026
[58]

Teaching clip to count to ten

Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching clip to count to ten. In ICCV, pages 3170–3180, 2023. 3

2023
[59]

Dave-a detect-and-verify paradigm for low-shot counting

Jer Pelhan, Vitjan Zavrtanik, Matej Kristan, et al. Dave-a detect-and-verify paradigm for low-shot counting. In CVPR, pages 23293–23302, 2024. 3

2024
[60]

Single domain generalization for crowd counting

Zhuoxuan Peng and S-H Gary Chan. Single domain generalization for crowd counting. In CVPR, pages 28025–28034,
[61]

SDXL: improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: improving latent diffusion models for high-resolution image synthesis. In ICLR, 2024. 3

2024
[62]

Lvlm-count: Enhancing the counting ability of large vision-language models

Muhammad Fetrat Qharabagh, Mohammadreza Ghofrani, and Kimon Fountoulakis. Lvlm-count: Enhancing the counting ability of large vision-language models. arXiv preprint arXiv:2412.00686, 2024. 2, 3, 4

2024
[63]

T2icount: Enhancing cross-modal understanding for zero-shot counting

Yifei Qian, Zhongliang Guo, Bowen Deng, Chun Tong Lei, Shuai Zhao, Chun Pong Lau, Xiaopeng Hong, and Michael P Pound. T2icount: Enhancing cross-modal understanding for zero-shot counting. In CVPR, 2025. 3, 7

2025
[64]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021. 3

2021
[65]

Crowddiff: Multi-hypothesis crowd density estimation using diffusion models

Yasiru Ranasinghe, Nithin Gopalakrishnan Nair, Wele Gedara Chaminda Bandara, and Vishal M Patel. Crowddiff: Multi-hypothesis crowd density estimation using diffusion models. In CVPR, pages 12809–12819, 2024. 3

2024
[66]

Learning to count everything

Viresh Ranjan, Udbhav Sharma, Thu Nguyen, and Minh Hoai. Learning to count everything. In CVPR, pages 3394–3403,
[67]

Laion-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022. 6

2022
[68]

Objects365: A large-scale, high-quality dataset for object detection

Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019. 6

2019
[69]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 2, 3, 5

2024
[70]

Revisiting perspective information for efficient crowd counting

Miaojing Shi, Zhaohui Yang, Chao Xu, and Qijun Chen. Revisiting perspective information for efficient crowd counting. In CVPR, pages 7279–7288, 2019. 3

2019
[71]

Training-free object counting with prompts

Zenglin Shi, Ying Sun, and Mengmi Zhang. Training-free object counting with prompts. In WACV, pages 323–331,
[72]

Crowd counting in the frequency domain

Weibo Shu, Jia Wan, Kay Chen Tan, Sam Kwong, and Antoni B Chan. Crowd counting in the frequency domain. In CVPR, pages 19618–19627, 2022. 2

2022
[73]

Principles of object perception

Elizabeth S Spelke. Principles of object perception. Cognitive science, 14(1):29–56, 1990. 2

1990
[74]

Countqa: How well do mllms count in the wild?

Jayant Sravan Tamarapalli, Rynaa Grover, Nilay Pande, and Sahiti Yerramilli. Countqa: How well do mllms count in the wild? arXiv preprint arXiv:2508.06585, 2025. 6, 7

2025
[75]

Unilip: Adapting clip for unified multimodal understanding, generation and editing

Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, and Liwei Wang. Unilip: Adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278, 2025. 2, 4, 5, 6, 7, 9, 14, 16

2025
[76]

Training-free consistent text-to-image generation

Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, and Yuval Atzmon. Training-free consistent text-to-image generation. ACM TOG, 43(4):1–18, 2024. 3

2024
[77]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017. 2, 4

2017
[78]

Diffusion model alignment using direct preference optimization

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In CVPR, pages 8228–8238, 2024. 3

2024
[79]

Distribution matching for crowd counting

Boyu Wang, Huidong Liu, Dimitris Samaras, and Minh Hoai Nguyen. Distribution matching for crowd counting. NeurIPS, 33:1595–1607, 2020. 3

2020
[80]

Yolov9: Learning what you want to learn using programmable gradient information

Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. Yolov9: Learning what you want to learn using programmable gradient information. In ECCV, pages 1–21. Springer, 2024. 6

2024

Showing first 80 references.
