CLIP-Count: Towards Text-Guided Zero-Shot Object Counting

Changwen Chen; Lingbo Liu; Ruixiang Jiang

arXiv: 2305.07304 · v2 · pith:YSYX45NWnew · submitted 2023-05-12 · cs.CV · cs.AI

keywords: counting, object, clip-count, models, visual, zero-shot, dense, density

Recent advances in vision-language models have shown remarkable zero-shot text-image matching ability that is transferable to downstream tasks such as object detection and segmentation. Adapting these models for object counting, however, remains a formidable challenge. In this study, we first investigate transferring vision-language models (VLMs) for class-agnostic object counting. Specifically, we propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner. To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction. Moreover, we design a hierarchical patch-text interaction module to propagate semantic information across different resolution levels of visual features. Benefiting from the full exploitation of the rich image-text alignment knowledge of pretrained VLMs, our method effectively generates high-quality density maps for objects of interest. Extensive experiments on the FSC-147, CARPK, and ShanghaiTech crowd counting datasets demonstrate the state-of-the-art accuracy and generalizability of the proposed method. Code is available at https://github.com/songrise/CLIP-Count.
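
The abstract mentions two ideas that a small example may make more concrete: reading a count off a density map, and contrasting patch-level visual embeddings against a text embedding. The sketch below is illustrative only. The function names, the InfoNCE-style form of the loss, the `positive_mask` input, and the temperature value are all assumptions made for this example; the abstract does not specify the paper's actual formulation, which may differ.

```python
import numpy as np


def count_from_density(density_map):
    """Density-map counting: the estimated count is the sum over the map."""
    return float(np.sum(density_map))


def patch_text_contrastive_loss(patch_emb, text_emb, positive_mask, temperature=0.07):
    """InfoNCE-style patch-text loss (assumed form, not taken from the paper).

    patch_emb:     (P, D) array of patch embeddings
    text_emb:      (D,) array, embedding of the text prompt
    positive_mask: (P,) bool array, True for patches assumed to contain the object
    """
    if not np.any(positive_mask):
        raise ValueError("at least one positive patch is required")

    # Cosine similarity between every patch and the text prompt.
    patch_emb = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb)
    logits = patch_emb @ text_emb / temperature  # shape (P,)

    def logsumexp(x):
        # Numerically stable log(sum(exp(x))).
        m = np.max(x)
        return m + np.log(np.sum(np.exp(x - m)))

    # -log( sum over positives / sum over all patches ), always >= 0.
    return float(logsumexp(logits) - logsumexp(logits[positive_mask]))
```

Under these assumptions the loss is small when the patches marked positive are the ones most similar to the text embedding, and grows when dissimilar patches are marked positive instead.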

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Test-Time Training for Robust Text-Guided Open-Vocabulary Object Counting
cs.CV 2026-06 unverdicted novelty 7.0

Introduces the Robust-TOOC benchmark for corrupted images and Dual-TTT, a test-time training method that updates only a text-guided denoising module, to boost robustness in open-vocabulary counting.

MambaCount: Efficient Text-guided Open-vocabulary Object Counting with Spatial Sparse State Space Duality Block
cs.CV 2026-06 unverdicted novelty 6.0

MambaCount uses S^4D blocks with spatial token selection and multi-granularity prototypes to reach MAE 12.23 on FSC-147 for open-vocabulary counting while keeping linear complexity.

RT-Counter: Real-Time Text-Guided Open-Vocabulary Object Counting
cs.CV 2026-06 unverdicted novelty 6.0

RT-Counter introduces VPT and Weaformer modules to achieve competitive accuracy and real-time inference in text-guided open-vocabulary object counting.