pith. machine review for the scientific record.

arxiv: 2604.08704 · v1 · submitted 2026-04-09 · 💻 cs.CV

Recognition: no theorem link

RS-OVC: Open-Vocabulary Counting for Remote-Sensing Data

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords open-vocabulary counting · remote sensing · aerial imagery · object counting · zero-shot learning · computer vision · image analysis

The pith

RS-OVC counts novel object classes in remote-sensing images using only text or visual prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RS-OVC as the first model for open-vocabulary counting in remote-sensing and aerial imagery. Existing methods are limited to pre-defined classes and require re-training for new ones, which is costly for real-world use. RS-OVC overcomes this by counting unseen objects based on textual descriptions or example images alone. If correct, this allows flexible, adaptive monitoring without repeated data collection and model updates.

Core claim

RS-OVC is the first open-vocabulary counting model designed specifically for remote-sensing data. It demonstrates the ability to accurately count object classes that were not encountered during training, relying exclusively on conditioning from text prompts or visual examples.

What carries the argument

The RS-OVC model architecture, which supports open-vocabulary conditioning to enable counting of arbitrary classes in aerial imagery.

If this is right

  • It removes the requirement for costly re-annotation when new object types need counting.
  • The approach supports dynamic applications in environmental monitoring and urban planning.
  • Both text-based and image-based prompts can be used interchangeably for conditioning the count.
  • Performance holds for novel classes without additional training steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This opens the door to integrating such models into automated satellite analysis pipelines for ongoing surveillance.
  • Similar techniques might apply to counting in other specialized imagery domains like medical or industrial inspection.
  • Future work could test robustness across different sensor types or resolutions in remote sensing.

Load-bearing premise

The assumption that text or visual conditioning provides sufficient information for a model to count previously unseen object classes accurately in remote-sensing scenes.

What would settle it

Evaluating the model on a held-out dataset of remote-sensing images containing a new object class such as solar panels, and checking if the predicted counts match manual tallies within acceptable error margins.
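The settling experiment above can be sketched in a few lines: compare per-image predicted counts against manual tallies and report the mean absolute error. All names and numbers here are illustrative, not taken from the paper.

```python
# Sketch of the settling experiment: MAE between predicted counts and
# manual tallies on a held-out novel class (e.g. solar panels).
# The per-image counts below are hypothetical.

def mean_absolute_error(predicted, actual):
    """MAE between predicted and manually tallied counts."""
    assert len(predicted) == len(actual) and len(predicted) > 0
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

predicted = [12, 7, 30, 0, 5]   # model counts per image
manual    = [11, 7, 28, 1, 5]   # human tallies per image

mae = mean_absolute_error(predicted, manual)
print(f"MAE = {mae:.2f}")  # 0.80 for these illustrative numbers
```

Whether an MAE of this size is "within acceptable error margins" depends on the application; monitoring tasks with large instance counts tolerate larger absolute errors than, say, counting rare infrastructure.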

Figures

Figures reproduced from arXiv: 2604.08704 by Genady Beryozkin, George Leifman, Tamir Shor.

Figure 1. Object confidence maps, illustrating spatial correspondence with aggregated textual (i.e., prompt) and visual (i.e., exemplar) conditioning. Visual exemplars are marked with red bounding boxes on each image (red arrows highlight exemplars).
Figure 2
Figure 2. Figure 2: RS-OVC Pipeline - Our modifications from the original CountGD archi￾tecture are highlighted with an orange background. Image and text encoders remain frozen during optimization, other parameters are finetuned. FAIR1M [40], DIOR [25], and DOTA [42] to the OVC setting by converting all annotated bounding boxes into point-based instance labels using their centroid coordinates. These datasets are combined with… view at source ↗
Figure 3. Curated dataset class-wise mean and standard deviation (error bars) of object-instance counts across images, for the training (top) and test (bottom) splits. As an OVD baseline, the paper compares to Locate-Anything-on-Earth (LAE) [31], a state-of-the-art remote-sensing open-vocabulary object-detection model that can be trivially adapted for counting by using the number of detected bounding boxes per prompt.
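The detector-to-counter adaptation described for the LAE baseline amounts to thresholding detection scores and counting the survivors. A sketch under assumed interfaces (the detection format and threshold are illustrative, not from the paper):

```python
# Trivial OVD-to-counting adaptation: an open-vocabulary detector returns
# (box, score) pairs per prompt; the count is the number of detections
# above a confidence threshold. Format and threshold are hypothetical.

def count_from_detections(detections, score_threshold=0.5):
    """detections: list of (box, score) pairs for one text prompt."""
    return sum(1 for _, score in detections if score >= score_threshold)

dets = [
    ((0, 0, 10, 10), 0.9),
    ((5, 5, 20, 20), 0.4),    # below threshold, not counted
    ((30, 30, 40, 40), 0.7),
]
print(count_from_detections(dets))  # 2
```

The choice of threshold directly trades false positives for misses, which is one reason detection-based counting can underperform purpose-built counting models on dense scenes.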
Figure 4. Object confidence maps for the standard setting and for joint visual–textual conditioning. Correct predictions require local object-level semantic understanding: in the top row the model correctly selects the single truck with a red front (and not other trucks or red cars) given the textual prompt. Red markings indicate visual exemplars; yellow markings upscale important image regions…
Figure 5. Object confidence maps for the standard setting and for joint visual–textual conditioning. Correct predictions require global scene-level semantic understanding. The two rightmost columns introduce textual prompts that require relational reasoning: in the top row the model must identify absolute and relative spatial positions and orientations of baseball fields and trees to count correct…
Figure 6. Object confidence maps for the standard setting and for joint visual–textual conditioning that require basic reasoning. For instance, in the first row the model must attribute the bottom boat's proximity to a pier to infer it is docking.
Figure 7. MAE as a function of the number of instances in the scene.
Original abstract

Object-Counting for remote-sensing (RS) imagery is attracting increasing research interest due to its crucial role in a wide and diverse set of applications. While several promising methods for RS object-counting have been proposed, existing methods focus on a closed, pre-defined set of object classes. This limitation necessitates costly re-annotation and model re-training to adapt current approaches for counting of novel objects that have not been seen during training, and severely inhibits their application in dynamic, real-world monitoring scenarios. To address this gap, in this work we propose RS-OVC - the first Open Vocabulary Counting (OVC) model for Remote-Sensing and aerial imagery. We show that our model is capable of accurate counting of novel object classes, that were unseen during training, based solely on textual and/or visual conditioning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes RS-OVC as the first open-vocabulary counting (OVC) model for remote-sensing and aerial imagery. It claims that this model can perform accurate counting of novel object classes unseen during training, relying solely on textual and/or visual conditioning to overcome the limitations of closed-set counting methods that require re-annotation and retraining.

Significance. Should the approach prove effective, it would offer a substantial advance in remote-sensing object counting by facilitating adaptation to new classes without retraining, which is particularly valuable for dynamic real-world monitoring applications. The concept directly targets a practical limitation in current RS counting techniques.

major comments (1)
  1. [Abstract] The manuscript consists solely of an abstract. The central claim that the model achieves 'accurate counting of novel object classes' via text/visual conditioning is presented without any methodological description, architecture details, training procedure, quantitative results, ablation studies, or validation on novel classes, making the claim impossible to evaluate or verify.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and for recognizing the potential significance of open-vocabulary counting for remote-sensing applications. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract] The manuscript consists solely of an abstract. The central claim that the model achieves 'accurate counting of novel object classes' via text/visual conditioning is presented without any methodological description, architecture details, training procedure, quantitative results, ablation studies, or validation on novel classes, making the claim impossible to evaluate or verify.

    Authors: We agree that the version provided for review contains only the abstract and therefore lacks the requested details, making independent evaluation impossible at this stage. The full manuscript (arXiv:2604.08704) includes dedicated sections on the RS-OVC architecture (a vision-language backbone adapted with a counting head), the training procedure (closed-set pre-training followed by open-vocabulary fine-tuning with text and visual prompts), quantitative results on RS datasets with held-out novel classes, ablation studies on conditioning modalities, and explicit validation experiments measuring counting accuracy for unseen object categories. We will submit a revised manuscript that incorporates these sections in full so that the claims can be properly assessed. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The abstract and context contain no equations, derivations, fitted parameters, or load-bearing self-citations. The central claim is a proposal of a new model (RS-OVC) for open-vocabulary counting based on text/visual conditioning. No step reduces by construction to its inputs, renames a known result, or relies on an unverified self-citation chain. The derivation chain, if present in the full manuscript, cannot be examined for circularity from the given text, but nothing in the provided material exhibits the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no information on free parameters, axioms, or invented entities; the model is described only at a high level as using textual/visual conditioning.

pith-pipeline@v0.9.0 · 5439 in / 1153 out tokens · 46764 ms · 2026-05-10T17:54:46.180920+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 2 canonical work pages · 1 internal anchor

  1. Amini-Naieni, N., Han, T., Zisserman, A.: CountGD: Multi-modal open-world counting. NeurIPS 37, 48810–48837 (2024)

  2. Angeoletto, F., Fellowes, M.D., Santos, J.W.: Counting Brazil's urban trees will help make Brazil's urban trees count. J. For. 116(5), 489–490 (2018)

  3. Arteta, C., Lempitsky, V., Zisserman, A.: Counting in the wild. In: ECCV. pp. 483–498. Springer (2016)

  4. Barrington, L., Ghosh, S., Greene, M., Har-Noy, S., Berger, J., Gill, S., Lin, A.Y.M., Huyck, C.: Crowdsourcing earthquake damage assessment using remote sensing imagery. Ann. Geophys. 54(6) (2011)

  5. Bernd, A., Braun, D., Ortmann, A., Ulloa-Torrealba, Y.Z., Wohlfart, C., Bell, A.: More than counting pixels – perspectives on the importance of remote sensing training in ecology and conservation. Remote Sens. Ecol. Conserv. 3(1), 38–47 (2017)

  6. Chen, G., Shang, Y.: Transformer for tree counting in aerial images. Remote Sens. 14(3), 476 (2022)

  7. Chong, K.L., Kanniah, K.D., Pohl, C., Tan, K.P.: A review of remote sensing applications for oil palm studies. Geo-spatial Inf. Sci. 20(2), 184–200 (2017)

  8. Dare, P., Fraser, C., Duthie, T.: Application of automated remote sensing techniques to dam counting. Australas. J. Water Resour. 5(2), 195–208 (2002)

  9. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: NAACL. pp. 4171–4186 (2019)

  10. Ding, G., Cui, M., Yang, D., Wang, T., Wang, S., Zhang, Y.: Object counting for remote-sensing images via adaptive density map-assisted learning. IEEE TGRS 60, 1–11 (2022)

  11. Duan, Z., Wang, S., Di, H., Deng, J.: Distillation remote sensing object counting via multi-scale context feature aggregation. IEEE TGRS 60, 1–12 (2021)

  12. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. IJCV 88(2), 303–338 (2010)

  13. Fan, Y., Wen, Q., Wang, W., Wang, P., Li, L., Zhang, P.: Quantifying disaster physical damage using remote sensing data – a technical work flow and case study of the 2014 Ludian earthquake in China. Int. J. Disaster Risk Sci. 8(4), 471–488 (2017)

  14. Farjon, G., Huijun, L., Edan, Y.: Deep-learning-based counting methods, datasets, and applications in agriculture: A review. Precis. Agric. 24(5), 1683–1711 (2023)

  15. Gao, G., Liu, Q., Wang, Y.: Counting from sky: A large-scale data set for remote sensing object counting and a benchmark method. IEEE TGRS 59(5), 3642–3655 (2020)

  16. Gao, J., Zhao, L., Li, X.: NWPU-MOC: A benchmark for fine-grained multi-category object counting in aerial images. IEEE TGRS 62, 1–14 (2024)

  17. Garrido-Valenzuela, F., Cats, O., van Cranenburgh, S.: Where are the people? Counting people in millions of street-level images to explore associations between people's urban density and urban characteristics. Comput. Environ. Urban Syst. 102, 101971 (2023)

  18. Guo, H., Gao, J., Yuan, Y.: Balanced density regression network for remote sensing object counting. IEEE TGRS 62, 1–13 (2024)

  19. Guo, X., Anisetti, M., Gao, M., Jeon, G.: Object counting in remote sensing via triple attention and scale-aware network. Remote Sens. 14(24), 6363 (2022)

  20. Guo, Y., Wu, C., Du, B., Zhang, L.: Density map-based vehicle counting in remote sensing images with limited resolution. ISPRS J. Photogramm. Remote Sens. 189, 201–217 (2022)

  21. Hollings, T., Burgman, M., van Andel, M., Gilbert, M., Robinson, T., Robinson, A.: How do you find the green sheep? A critical review of the use of remotely sensed imagery to detect and count animals. Methods Ecol. Evol. 9(4), 881–892 (2018)

  22. Kızılkaya, S., Alganci, U., Sertel, E.: VHRShips: An extensive benchmark dataset for scalable deep learning-based ship detection applications. ISPRS Int. J. Geo-Inf. 11(8), 445 (2022)

  23. Klemelä, J.S.: Smoothing of multivariate data: density estimation and visualization. John Wiley & Sons (2009)

  24. Li, C., Cheng, G., Wang, G., Zhou, P., Han, J.: Instance-aware distillation for efficient object detection in remote sensing images. IEEE TGRS 61, 1–11 (2023)

  25. Li, K., Wan, G., Cheng, G., Meng, L., Han, J.: Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 159, 296–307 (2020)

  26. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV. pp. 740–755. Springer (2014)

  27. Liu, P., Lei, S., Li, H.C.: Mamba-MOC: A multi-category remote object counting via state space model. arXiv preprint arXiv:2501.06697 (2025)

  28. Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In: ECCV. pp. 38–55. Springer (2024)

  29. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: ICCV. pp. 10012–10022 (2021)

  30. Michelat, T., Hueber, N., Raymond, P., Pichler, A., Schaal, P., Dugaret, B.: Automatic pedestrian detection and counting applied to urban planning. In: AmI. pp. 285–289. Springer (2010)

  31. Pan, J., Liu, Y., Fu, Y., Ma, M., Li, J., Paudel, D.P., Van Gool, L., Huang, X.: Locate anything on Earth: Advancing open-vocabulary object detection for remote sensing community. In: AAAI. vol. 39, pp. 6281–6289 (2025)

  32. Park, J.J., Park, K.A., Kim, T.S., Oh, S., Lee, M.: Aerial hyperspectral remote sensing detection for maritime search and surveillance of floating small objects. Adv. Space Res. 72(6), 2118–2136 (2023)

  33. Reggiannini, M., Salerno, E., Bacciu, C., D'Errico, A., Lo Duca, A., Marchetti, A., Martinelli, M., Mercurio, C., Mistretta, A., Righi, M., et al.: Remote sensing for maritime traffic understanding. Remote Sens. 16(3), 557 (2024)

  34. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE TPAMI 39(6), 1137–1149 (2016)

  35. Saleh, S.A.M., Suandi, S.A., Ibrahim, H.: Recent survey on crowd density estimation and counting for visual surveillance. Eng. Appl. Artif. Intell. 41, 103–114 (2015)

  36. Senf, C.: Seeing the system from above: The use and potential of remote sensing for studying ecosystem dynamics. Ecosystems 25(8), 1719–1737 (2022)

  37. Shen, Z., Li, G., Xia, R., Meng, H., Huang, Z.: A lightweight object counting network based on density map knowledge distillation. IEEE TCSVT (2024)

  38. Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: DINOv3. arXiv preprint arXiv:2508.10104 (2025)

  39. Song, H., Liu, X., Zhang, X., Hu, J.: Real-time monitoring for crowd counting using video surveillance and GIS. In: RSETE. pp. 1–4. IEEE (2012)

  40. Sun, X., Wang, P., Yan, Z., Xu, F., Wang, R., Diao, W., Chen, J., Li, J., Feng, Y., Xu, T., et al.: FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 184, 116–130 (2022)

  41. Wang, S., Song, Y., Xiang, J., Chen, Y., Zhong, P., Fu, R.: Mask-guided teacher–student learning for open-vocabulary object detection in remote sensing images. Remote Sens. 17, 3385 (2025)

  42. Xia, G.S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., Datcu, M., Pelillo, M., Zhang, L.: DOTA: A large-scale dataset for object detection in aerial images. In: CVPR. pp. 3974–3983 (2018)

  43. Zhang, Y., Zhou, D., Chen, S., Gao, S., Ma, Y.: Single-image crowd counting via multi-column convolutional neural network. In: CVPR. pp. 589–597 (2016)