RS-OVC: Open-Vocabulary Counting for Remote-Sensing Data
Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3
The pith
RS-OVC counts novel object classes in remote-sensing images using only text or visual prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RS-OVC is the first open-vocabulary counting model designed specifically for remote-sensing data. It demonstrates the ability to accurately count object classes that were not encountered during training, relying exclusively on conditioning from text prompts or visual examples.
What carries the argument
The RS-OVC model architecture, which supports open-vocabulary conditioning to enable counting of arbitrary classes in aerial imagery.
If this is right
- It removes the requirement for costly re-annotation when new object types need counting.
- The approach supports dynamic applications in environmental monitoring and urban planning.
- Both text-based and image-based prompts can be used interchangeably for conditioning the count.
- Performance holds for novel classes without additional training steps.
Where Pith is reading between the lines
- This opens the door to integrating such models into automated satellite analysis pipelines for ongoing surveillance.
- Similar techniques might apply to counting in other specialized imagery domains like medical or industrial inspection.
- Future work could test robustness across different sensor types or resolutions in remote sensing.
Load-bearing premise
The assumption that text or visual conditioning provides sufficient information for a model to count previously unseen object classes accurately in remote-sensing scenes.
What would settle it
Evaluating the model on a held-out dataset of remote-sensing images containing a new object class such as solar panels, and checking if the predicted counts match manual tallies within acceptable error margins.
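A check of this kind is commonly summarized with mean absolute error (MAE) and root mean squared error (RMSE) over per-image counts. The sketch below shows that computation. The abstract does not state which metrics or error margins the authors use, so the function name `counting_errors` and the sample counts are hypothetical and are not results from the paper.

```python
import math


def counting_errors(predicted, manual):
    """Return (MAE, RMSE) between predicted counts and manual tallies."""
    if len(predicted) != len(manual) or not predicted:
        raise ValueError("need two non-empty count lists of equal length")
    diffs = [p - m for p, m in zip(predicted, manual)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


# Hypothetical per-image counts for one unseen class (e.g. solar panels).
predicted = [12, 30, 7, 0]
manual = [10, 33, 7, 1]
mae, rmse = counting_errors(predicted, manual)
print(f"MAE={mae:.2f} RMSE={rmse:.2f}")
```

Reporting both values is useful because RMSE penalizes a few large miscounts more heavily than MAE does.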
Original abstract
Object-Counting for remote-sensing (RS) imagery is attracting increasing research interest due to its crucial role in a wide and diverse set of applications. While several promising methods for RS object-counting have been proposed, existing methods focus on a closed, pre-defined set of object classes. This limitation necessitates costly re-annotation and model re-training to adapt current approaches for counting of novel objects that have not been seen during training, and severely inhibits their application in dynamic, real-world monitoring scenarios. To address this gap, in this work we propose RS-OVC - the first Open Vocabulary Counting (OVC) model for Remote-Sensing and aerial imagery. We show that our model is capable of accurate counting of novel object classes, that were unseen during training, based solely on textual and/or visual conditioning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RS-OVC as the first open-vocabulary counting (OVC) model for remote-sensing and aerial imagery. It claims that this model can perform accurate counting of novel object classes unseen during training, relying solely on textual and/or visual conditioning to overcome the limitations of closed-set counting methods that require re-annotation and retraining.
Significance. Should the approach prove effective, it would offer a substantial advance in remote-sensing object counting by facilitating adaptation to new classes without retraining, which is particularly valuable for dynamic real-world monitoring applications. The concept directly targets a practical limitation in current RS counting techniques.
major comments (1)
- [Abstract] The manuscript consists solely of an abstract. The central claim that the model achieves 'accurate counting of novel object classes' via text/visual conditioning is presented without any methodological description, architecture details, training procedure, quantitative results, ablation studies, or validation on novel classes, making the claim impossible to evaluate or verify.
Simulated Author's Rebuttal
We thank the referee for their review and for recognizing the potential significance of open-vocabulary counting for remote-sensing applications. We address the major comment below.
- Referee: [Abstract] The manuscript consists solely of an abstract. The central claim that the model achieves 'accurate counting of novel object classes' via text/visual conditioning is presented without any methodological description, architecture details, training procedure, quantitative results, ablation studies, or validation on novel classes, making the claim impossible to evaluate or verify.
  Authors: We agree that the version provided for review contains only the abstract and therefore lacks the requested details, making independent evaluation impossible at this stage. The full manuscript (arXiv:2604.08704) includes dedicated sections on the RS-OVC architecture (a vision-language backbone adapted with a counting head), the training procedure (closed-set pre-training followed by open-vocabulary fine-tuning with text and visual prompts), quantitative results on RS datasets with held-out novel classes, ablation studies on conditioning modalities, and explicit validation experiments measuring counting accuracy for unseen object categories. We will submit a revised manuscript that incorporates these sections in full so that the claims can be properly assessed.
  revision: yes
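The held-out novel-class evaluation mentioned in the response depends on a class-disjoint split: no image containing a novel class may appear in training. A minimal sketch of such a split follows. The record layout, the function name `split_by_class`, and the sample annotations are hypothetical and are not taken from the manuscript.

```python
def split_by_class(annotations, novel_classes):
    """Send every image that contains a novel class to the test set."""
    novel = set(novel_classes)
    train, test = [], []
    for record in annotations:
        # A record is held out if any of its annotated classes is novel.
        if novel & set(record["counts"]):
            test.append(record)
        else:
            train.append(record)
    return train, test


# Hypothetical per-image class counts.
annotations = [
    {"image": "a.tif", "counts": {"ship": 4}},
    {"image": "b.tif", "counts": {"ship": 2, "solar panel": 9}},
    {"image": "c.tif", "counts": {"solar panel": 5}},
]
train, test = split_by_class(annotations, ["solar panel"])
print([r["image"] for r in train], [r["image"] for r in test])
```

Holding out whole images rather than only the novel annotations avoids the case where unlabeled novel objects are still visible in training images.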
Circularity Check
No significant circularity detected
The abstract and context contain no equations, derivations, fitted parameters, or load-bearing self-citations. The central claim is a proposal of a new model (RS-OVC) for open-vocabulary counting based on text/visual conditioning. No step reduces by construction to its inputs, renames a known result, or relies on an unverified self-citation chain. The derivation chain, if present in the full manuscript, cannot be examined for circularity from the given text, but nothing in the provided material exhibits the enumerated patterns.
Reference graph
Works this paper leans on
- [1] Amini-Naieni, N., Han, T., Zisserman, A.: Countgd: Multi-modal open-world counting. NeurIPS 37, 48810–48837 (2024)
- [2] Angeoletto, F., Fellowes, M.D., Santos, J.W.: Counting brazil’s urban trees will help make brazil’s urban trees count. J. For. 116(5), 489–490 (2018)
- [3]
- [4] Barrington, L., Ghosh, S., Greene, M., Har-Noy, S., Berger, J., Gill, S., Lin, A.Y.M., Huyck, C.: Crowdsourcing earthquake damage assessment using remote sensing imagery. Ann. Geophys. 54(6) (2011)
- [5] Bernd, A., Braun, D., Ortmann, A., Ulloa-Torrealba, Y.Z., Wohlfart, C., Bell, A.: More than counting pixels–perspectives on the importance of remote sensing training in ecology and conservation. Remote Sens. Ecol. Conserv. 3(1), 38–47 (2017)
- [6] Chen, G., Shang, Y.: Transformer for tree counting in aerial images. Remote Sens. 14(3), 476 (2022)
- [7] Chong, K.L., Kanniah, K.D., Pohl, C., Tan, K.P.: A review of remote sensing applications for oil palm studies. Geo-spatial Inf. Sci. 20(2), 184–200 (2017)
- [8]
- [9]
- [10] Ding, G., Cui, M., Yang, D., Wang, T., Wang, S., Zhang, Y.: Object counting for remote-sensing images via adaptive density map-assisted learning. IEEE TGRS 60, 1–11 (2022)
- [11] Duan, Z., Wang, S., Di, H., Deng, J.: Distillation remote sensing object counting via multi-scale context feature aggregation. IEEE TGRS 60, 1–12 (2021)
- [12] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. IJCV 88(2), 303–338 (2010)
- [13] Fan, Y., Wen, Q., Wang, W., Wang, P., Li, L., Zhang, P.: Quantifying disaster physical damage using remote sensing data—a technical work flow and case study of the 2014 ludian earthquake in china. Int. J. Disaster Risk Sci. 8(4), 471–488 (2017)
- [14] Farjon, G., Huijun, L., Edan, Y.: Deep-learning-based counting methods, datasets, and applications in agriculture: A review. Precis. Agric. 24(5), 1683–1711 (2023)
- [15] Gao, G., Liu, Q., Wang, Y.: Counting from sky: A large-scale data set for remote sensing object counting and a benchmark method. IEEE TGRS 59(5), 3642–3655 (2020)
- [16] Gao, J., Zhao, L., Li, X.: Nwpu-moc: A benchmark for fine-grained multicategory object counting in aerial images. IEEE TGRS 62, 1–14 (2024)
- [17] Garrido-Valenzuela, F., Cats, O., van Cranenburgh, S.: Where are the people? counting people in millions of street-level images to explore associations between people’s urban density and urban characteristics. Comput. Environ. Urban Syst. 102, 101971 (2023)
- [18] Guo, H., Gao, J., Yuan, Y.: Balanced density regression network for remote sensing object counting. IEEE TGRS 62, 1–13 (2024)
- [19] Guo, X., Anisetti, M., Gao, M., Jeon, G.: Object counting in remote sensing via triple attention and scale-aware network. Remote Sens. 14(24), 6363 (2022)
- [20] Guo, Y., Wu, C., Du, B., Zhang, L.: Density map-based vehicle counting in remote sensing images with limited resolution. ISPRS J. Photogramm. Remote Sens. 189, 201–217 (2022)
- [21] Hollings, T., Burgman, M., van Andel, M., Gilbert, M., Robinson, T., Robinson, A.: How do you find the green sheep? a critical review of the use of remotely sensed imagery to detect and count animals. Methods Ecol. Evol. 9(4), 881–892 (2018)
- [22]
- [23] Klemelä, J.S.: Smoothing of multivariate data: density estimation and visualization. John Wiley & Sons (2009)
- [24] Li, C., Cheng, G., Wang, G., Zhou, P., Han, J.: Instance-aware distillation for efficient object detection in remote sensing images. IEEE TGRS 61, 1–11 (2023)
- [25] Li, K., Wan, G., Cheng, G., Meng, L., Han, J.: Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 159, 296–307 (2020)
- [26]
- [27] Liu, P., Lei, S., Li, H.C.: Mamba-moc: A multicategory remote object counting via state space model. arXiv preprint arXiv:2501.06697 (2025)
- [28] Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: ECCV. pp. 38–55. Springer (2024)
- [29]
- [30] Michelat, T., Hueber, N., Raymond, P., Pichler, A., Schaal, P., Dugaret, B.: Automatic pedestrian detection and counting applied to urban planning. In: AmI. pp. 285–289. Springer (2010)
- [31]
- [32] Park, J.J., Park, K.A., Kim, T.S., Oh, S., Lee, M.: Aerial hyperspectral remote sensing detection for maritime search and surveillance of floating small objects. Adv. Space Res. 72(6), 2118–2136 (2023)
- [33] Reggiannini, M., Salerno, E., Bacciu, C., D’Errico, A., Lo Duca, A., Marchetti, A., Martinelli, M., Mercurio, C., Mistretta, A., Righi, M., et al.: Remote sensing for maritime traffic understanding. Remote Sens. 16(3), 557 (2024)
- [34] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE TPAMI 39(6), 1137–1149 (2016)
- [35] Saleh, S.A.M., Suandi, S.A., Ibrahim, H.: Recent survey on crowd density estimation and counting for visual surveillance. Eng. Appl. Artif. Intell. 41, 103–114 (2015)
- [36] Senf, C.: Seeing the system from above: The use and potential of remote sensing for studying ecosystem dynamics. Ecosystems 25(8), 1719–1737 (2022)
- [37] Shen, Z., Li, G., Xia, R., Meng, H., Huang, Z.: A lightweight object counting network based on density map knowledge distillation. IEEE TCSVT (2024)
- [38] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)
- [39]
- [40] Sun, X., Wang, P., Yan, Z., Xu, F., Wang, R., Diao, W., Chen, J., Li, J., Feng, Y., Xu, T., et al.: Fair1m: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 184, 116–130 (2022)
- [41] Wang, S., Song, Y., Xiang, J., Chen, Y., Zhong, P., Fu, R.: Mask-guided teacher–student learning for open-vocabulary object detection in remote sensing images. Remote Sens. 17, 3385 (2025)
- [42]
- [43] Zhang, Y., Zhou, D., Chen, S., Gao, S., Ma, Y.: Single-image crowd counting via multi-column convolutional neural network. In: CVPR. pp. 589–597 (2016)
T. Shor et al.
A Experimental Setup & Implementation Details
A.1 RS-OVC Implementation
Our implementation of RS-OVC mostly follows the official code published in the CountGD paper, which is the backbone...