SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images
Pith reviewed 2026-05-16 23:48 UTC · model grok-4.3
The pith
SAM 3 performs open-vocabulary semantic segmentation on remote sensing images without any training by fusing its heads and filtering absent categories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SAM 3 can be applied directly to remote sensing open-vocabulary semantic segmentation, change detection, and 3D segmentation without training: a mask fusion strategy combines outputs from the semantic segmentation head and Transformer decoder, while the presence head score filters out non-present categories, and the resulting fused logits support joint verification for change detection, yielding promising performance on 20 segmentation datasets, 3 change detection datasets, and 1 3D dataset.
What carries the argument
Mask fusion strategy that merges SAM 3 semantic segmentation head outputs with Transformer decoder instance-head outputs, combined with presence-score filtering to remove absent categories from the vocabulary.
If this is right
- Fused logits enable joint instance- and pixel-level verification for open-vocabulary change detection without extra modules.
- Presence-score filtering reduces false positives when processing large vocabularies over geospatial patches.
- The same pipeline supports 2D semantic segmentation, change detection, and 3D semantic segmentation without task-specific retraining.
- Performance holds across 24 remote sensing datasets spanning varied sensors and resolutions.
Where Pith is reading between the lines
- If the transfer holds, remote sensing segmentation pipelines could shift from training separate localization and classification heads toward promptable foundation models.
- The fusion-plus-filter recipe may generalize to other dense-small-object domains such as medical microscopy or autonomous driving scenes.
- Performance gaps on spectral bands outside SAM 3 training data could be closed by lightweight spectral adapters while keeping the no-training constraint.
- The approach implies that unified segmentation-recognition models reduce the need for separate CLIP encoders and post-processing heuristics in geospatial tasks.
Load-bearing premise
SAM 3 heads pre-trained on natural images transfer directly to remote sensing scenes that contain dense small targets and different spectral characteristics.
What would settle it
A remote sensing benchmark where the fused SAM 3 outputs produce lower per-pixel accuracy than a standard CLIP-based open-vocabulary baseline on the same 2D segmentation or change detection tasks.
Figures
read the original abstract
Most existing methods for training-free open-vocabulary semantic segmentation are based on CLIP. While these approaches have made progress, they often face challenges in precise localization or require complex pipelines to combine separate modules, especially in remote sensing scenarios where numerous dense and small targets are present. Recently, Segment Anything Model 3 (SAM 3) was proposed, unifying segmentation and recognition in a promptable framework. In this paper, we present a comprehensive exploration of applying SAM 3 to the remote sensing open-vocabulary tasks (i.e., 2D semantic segmentation, change detection, and 3D semantic segmentation) without any training. First, we implement a mask fusion strategy that combines the outputs from SAM 3's semantic segmentation head and the Transformer decoder (instance head). This allows us to leverage the strengths of both heads for better land coverage. Second, we utilize the presence score from the presence head to filter out categories that do not exist in the scene, reducing false positives caused by the vast vocabulary sizes and patch-level processing in geospatial scenes. Furthermore, we extend our method to open-vocabulary change detection by a joint instance- and pixel-level verification strategy built directly upon our fused logits. We evaluate our method on extensive remote sensing datasets and tasks, including 20 segmentation datasets, 3 change detection datasets, and a 3D segmentation dataset. Experiments show that our method achieves promising performance, demonstrating the potential of SAM 3 for remote sensing open-vocabulary tasks. Our code is released at https://github.com/earth-insights/SegEarth-OV-3.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SegEarth-OV3, a training-free pipeline that applies the released SAM 3 model directly to open-vocabulary semantic segmentation, change detection, and 3D segmentation on remote-sensing imagery. It introduces a mask-fusion step that combines the semantic-segmentation head with the Transformer-decoder (instance) head, uses the presence head to filter absent categories, and extends the fused logits to a joint instance- and pixel-level verification strategy for change detection. The method is evaluated on 20 segmentation datasets, three change-detection datasets, and one 3D dataset, with the central claim that it achieves promising performance and thereby demonstrates the potential of SAM 3 for remote-sensing open-vocabulary tasks.
Significance. If the reported results are robust, the work provides concrete evidence that a single unified promptable model can be used zero-shot across multiple geospatial tasks without domain-specific fine-tuning or complex CLIP-based pipelines. The public code release supports reproducibility and enables direct follow-up studies on domain-shift mitigation.
major comments (2)
- [§3.2] §3.2 (Mask Fusion Strategy): the fusion rule that combines semantic-head and instance-head outputs is presented as the key to handling dense small targets, yet no ablation quantifies the contribution of each head on remote-sensing imagery; without this, it is impossible to verify whether the claimed improvement over CLIP-based methods stems from the fusion or simply from SAM 3's base segmentation quality.
- [§4] §4 (Experiments): the presence-score filtering step is asserted to reduce false positives caused by large vocabularies and patch-level processing, but the manuscript provides no quantitative comparison of precision/recall with and without the filter on the 20 segmentation datasets; this omission leaves the central zero-shot transfer claim under-supported given the spectral and scale mismatch between natural-image training data and remote-sensing inputs.
minor comments (2)
- [Figure 3] Figure 3 caption and §3.1: the notation for the fused logit map is introduced without an explicit equation; adding a compact formula would improve clarity.
- [Table 2] Table 2: several dataset names are abbreviated without a footnote or reference; expanding the first occurrence would aid readers unfamiliar with remote-sensing benchmarks.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and the recommendation of minor revision. The comments highlight important aspects for strengthening the empirical support of our claims. We address each major comment below and have incorporated the suggested analyses into the revised manuscript.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Mask Fusion Strategy): the fusion rule that combines semantic-head and instance-head outputs is presented as the key to handling dense small targets, yet no ablation quantifies the contribution of each head on remote-sensing imagery; without this, it is impossible to verify whether the claimed improvement over CLIP-based methods stems from the fusion or simply from SAM 3's base segmentation quality.
Authors: We agree that an explicit ablation would strengthen the justification for the fusion strategy. The design is motivated by the complementary properties observed in remote-sensing scenes: the semantic head tends to provide more complete coverage for dense small targets, while the instance head contributes sharper object boundaries. In the revised manuscript we add a dedicated ablation subsection in §3.2 that reports mIoU for the semantic head alone, the instance head alone, and the fused output on four representative remote-sensing datasets. The results confirm that fusion improves performance beyond either head individually, particularly on categories with numerous small objects, thereby supporting that the gains are not attributable solely to SAM 3's base segmentation quality. revision: yes
-
Referee: [§4] §4 (Experiments): the presence-score filtering step is asserted to reduce false positives caused by large vocabularies and patch-level processing, but the manuscript provides no quantitative comparison of precision/recall with and without the filter on the 20 segmentation datasets; this omission leaves the central zero-shot transfer claim under-supported given the spectral and scale mismatch between natural-image training data and remote-sensing inputs.
Authors: We concur that a direct quantitative comparison is necessary to substantiate the role of presence-score filtering in mitigating false positives under large vocabularies and domain shift. The revised manuscript now includes an additional table in §4 that reports precision and recall (as well as mIoU) with and without the filtering step across all 20 segmentation datasets. The comparison shows a consistent reduction in false positives when the filter is applied, which helps corroborate the zero-shot transfer claim despite the spectral and scale differences between natural-image pre-training and remote-sensing inputs. revision: yes
Circularity Check
No significant circularity: direct empirical application of pre-trained SAM 3
full rationale
The paper presents a training-free exploration that applies the released SAM 3 model directly to remote-sensing tasks via mask fusion of its semantic and instance heads plus presence-score filtering. No mathematical derivations, fitted parameters, or self-citation chains appear; the method consists of simple post-processing on model outputs followed by empirical evaluation on external datasets. Claims reduce to observed performance rather than any input-by-construction equivalence.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption SAM 3's semantic segmentation head and Transformer decoder produce complementary outputs that can be fused without retraining
- domain assumption Presence scores from SAM 3 reliably indicate category absence in remote sensing patches
Forward citations
Cited by 4 Pith papers
-
Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection
Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
-
OmniOVCD: Streamlining Open-Vocabulary Change Detection with SAM 3
OmniOVCD uses SAM 3's decoupled outputs and an SFID strategy to achieve state-of-the-art IoU scores of 67.2, 66.5, 24.5, and 27.1 on four OVCD benchmarks, surpassing prior methods.
-
MemOVCD: Training-Free Open-Vocabulary Change Detection via Cross-Temporal Memory Reasoning and Global-Local Adaptive Rectification
MemOVCD reformulates change detection as cross-temporal memory reasoning with weighted bidirectional propagation and adaptive rectification to improve semantic change identification without task-specific training.
-
Seeking Consensus: Geometric-Semantic On-the-Fly Recalibration for Open-Vocabulary Remote Sensing Semantic Segmentation
SeeCo is a training-free on-the-fly recalibration method using multi-view geometric consistency and adaptive textual calibration to improve open-vocabulary semantic segmentation in remote sensing images.
Reference graph
Works this paper leans on
-
[1]
Single-stage seman- tic segmentation from image labels
Nikita Araslanov and Stefan Roth. Single-stage seman- tic segmentation from image labels. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4253–4262, 2020. 5
work page 2020
-
[2]
Sule Bai, Yong Liu, Yifei Han, Haoji Zhang, and Yansong Tang. Self-calibrated clip for training-free open-vocabulary segmentation.arXiv preprint arXiv:2411.15869, 2024. 6
-
[3]
Training-free open- vocabulary segmentation with offline diffusion-augmented prototype generation
Luca Barsellotti, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Training-free open- vocabulary segmentation with offline diffusion-augmented prototype generation. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 3689–3698, 2024. 2, 6
work page 2024
-
[4]
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the net- work.arXiv preprint arXiv:2504.13181, 2025. 4
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
Grounding everything: Emerging localiza- tion properties in vision-language transformers
Walid Bousselham, Felix Petersen, Vittorio Ferrari, and Hilde Kuehne. Grounding everything: Emerging localiza- tion properties in vision-language transformers. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3828–3837, 2024. 2, 5
work page 2024
-
[6]
Coco- stuff: Thing and stuff classes in context
Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco- stuff: Thing and stuff classes in context. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 1209–1218, 2018. 4
work page 2018
-
[7]
Vdd: Varied drone dataset for semantic segmentation.arXiv preprint arXiv:2305.13608, 2023
Wenxiao Cai, Ke Jin, Jinyan Hou, Cong Guo, Letian Wu, and Wankou Yang. Vdd: Varied drone dataset for semantic segmentation.arXiv preprint arXiv:2305.13608, 2023. 4
-
[8]
Open-vocabulary remote sensing image semantic segmenta- tion.arXiv preprint arXiv:2409.07683, 2024
Qinglong Cao, Yuntian Chen, Chao Ma, and Xiaokang Yang. Open-vocabulary remote sensing image semantic segmenta- tion.arXiv preprint arXiv:2409.07683, 2024. 2, 5
-
[9]
End-to- end object detection with transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas 7 Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- end object detection with transformers. InEuropean confer- ence on computer vision, pages 213–229. Springer, 2020. 1
work page 2020
-
[10]
SAM 3: Segment Anything with Concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoub- hik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Emerg- ing properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 2
work page 2021
-
[12]
Junbum Cha, Jonghwan Mun, and Byungseok Roh. Learn- ing to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11165–11174, 2023. 6
work page 2023
-
[13]
Hongruixuan Chen, Cuiling Lan, Jian Song, Clifford Broni- Bediako, Junshi Xia, and Naoto Yokoya. Land-cover change detection using paired openstreetmap data and optical high- resolution imagery via object-guided transformer.arXiv preprint arXiv:2310.02674, 2023. 7
-
[14]
Training-free class purification for open-vocabulary semantic segmentation
Qi Chen, Lingxiao Yang, Yun Chen, Nailong Zhao, Jian- huang Lai, Jie Shao, and Xiaohua Xie. Training-free class purification for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23124–23134, 2025. 2
work page 2025
-
[15]
Large-scale structure from motion with semantic con- straints of aerial images
Yu Chen, Yao Wang, Peng Lu, Yisong Chen, and Guoping Wang. Large-scale structure from motion with semantic con- straints of aerial images. InChinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 347–359. Springer, 2018. 4
work page 2018
-
[16]
Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per- pixel classification is not all you need for semantic segmen- tation.Advances in Neural Information Processing Systems, 34:17864–17875, 2021. 1
work page 2021
-
[17]
Masked-attention mask transformer for universal image segmentation
Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1290–1299, 2022. 1
work page 2022
-
[18]
Cat- seg: Cost aggregation for open-vocabulary semantic seg- mentation
Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. Cat- seg: Cost aggregation for open-vocabulary semantic seg- mentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4113– 4123, 2024. 2, 5, 6
work page 2024
-
[19]
The cityscapes dataset for semantic urban scene understanding
Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016. 4
work page 2016
-
[20]
Deepglobe 2018: A challenge to parse the earth through satellite images
Ilke Demir, Krzysztof Koperski, David Lindenbaum, Guan Pang, Jing Huang, Saikat Basu, Forest Hughes, Devis Tuia, and Ramesh Raskar. Deepglobe 2018: A challenge to parse the earth through satellite images. InProceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 172–181, 2018. 4
work page 2018
-
[21]
De- coupling zero-shot semantic segmentation
Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. De- coupling zero-shot semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11583–11592, 2022. 2
work page 2022
-
[22]
Aeroseg: Harnessing sam for open-vocabulary segmentation in remote sensing images
Saikat Dutta, Akhil Vasim, Siddhant Gole, Hamid Rezatofighi, and Biplab Banerjee. Aeroseg: Harnessing sam for open-vocabulary segmentation in remote sensing images. InProceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 2254–2264, 2025. 2
work page 2025
-
[23]
Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.International journal of computer vision, 88(2):303–338, 2010. 4
work page 2010
-
[24]
Scal- ing open-vocabulary image segmentation with image-level labels
Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scal- ing open-vocabulary image segmentation with image-level labels. InEuropean Conference on Computer Vision, pages 540–557. Springer, 2022. 2
work page 2022
-
[25]
xbd: A dataset for assessing building dam- age from satellite imagery, 2019
Ritwik Gupta, Richard Hosfelt, Sandra Sajeev, Nirav Patel, Bryce Goodman, Jigar Doshi, Eric Heim, Howie Choset, and Matthew Gaston. xbd: A dataset for assessing building dam- age from satellite imagery, 2019. 4
work page 2019
-
[26]
Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation
Sina Hajimiri, Ismail Ben Ayed, and Jose Dolz. Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation. In2025 IEEE/CVF Winter Conference on Ap- plications of Computer Vision (WACV), pages 5061–5071. IEEE, 2025. 2, 6
work page 2025
-
[27]
Score: Scene context matters in open-vocabulary remote sensing instance segmentation
Shiqi Huang, Shuting He, Huaiyuan Qin, and Bihan Wen. Score: Scene context matters in open-vocabulary remote sensing instance segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12559–12569, 2025. 3
work page 2025
-
[28]
Shunping Ji, Shiqing Wei, and Meng Lu. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set.IEEE Transactions on geoscience and remote sensing, 57(1):574–586, 2018. 4
work page 2018
-
[29]
Shuo Jin, Siyue Yu, Bingfeng Zhang, Mingjie Sun, Yi Dong, and Jimin Xiao. Feature purification matters: Suppressing outlier propagation for training-free open-vocabulary seman- tic segmentation. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 20291–20300,
-
[30]
In defense of lazy vi- sual grounding for open-vocabulary semantic segmentation
Dahyun Kang and Minsu Cho. In defense of lazy vi- sual grounding for open-vocabulary semantic segmentation. InEuropean Conference on Computer Vision and Pattern Recognition (ECCV), 2024. 6
work page 2024
-
[31]
Distilling spectral graph for object-context aware open-vocabulary semantic segmenta- tion
Chanyoung Kim, Dayun Ju, Woojung Han, Ming-Hsuan Yang, and Seong Jae Hwang. Distilling spectral graph for object-context aware open-vocabulary semantic segmenta- tion. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15033–15042, 2025. 2, 6
work page 2025
-
[32]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 4015–4026, 2023. 1, 2 8
work page 2023
-
[33]
Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Clearclip: Decom- posing clip representations for dense vision-language infer- ence.arXiv preprint arXiv:2407.12442, 2024. 2, 5, 6
-
[34]
Proxyclip: Proxy attention improves clip for open-vocabulary segmentation
Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Proxyclip: Proxy attention improves clip for open-vocabulary segmentation. InEuropean Conference on Computer Vision, pages 70–88. Springer, 2025. 1, 2, 5, 6
work page 2025
-
[35]
Language-driven Semantic Segmentation
Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Ren ´e Ranftl. Language-driven semantic seg- mentation.arXiv preprint arXiv:2201.03546, 2022. 2
work page internal anchor Pith review arXiv 2022
-
[36]
Bingyu Li, Haocheng Dong, Da Zhang, Zhiyuan Zhao, Junyu Gao, and Xuelong Li. Exploring efficient open- vocabulary segmentation in the remote sensing.arXiv preprint arXiv:2509.12040, 2025. 5
-
[37]
Ke Li, Fuyu Dong, Di Wang, Shaofeng Li, Quan Wang, Xinbo Gao, and Tat-Seng Chua. Show me what and where has changed? question answering and grounding for remote sensing change detection.arXiv preprint arXiv:2410.23828,
-
[38]
Kaiyu Li, Xiangyong Cao, Yupeng Deng, Chao Pang, Zepeng Xin, Deyu Meng, and Zhi Wang. Dynamicearth: How far are we from open-vocabulary change detection? arXiv preprint arXiv:2501.12931, 2025. 1
-
[39]
Annotation-free open-vocabulary segmentation for remote-sensing images
Kaiyu Li, Xiangyong Cao, Ruixun Liu, Shihong Wang, Zix- uan Jiang, Zhi Wang, and Deyu Meng. Annotation-free open-vocabulary segmentation for remote-sensing images. arXiv preprint arXiv:2508.18067, 2025. 1
-
[40]
Segearth-ov: Towards training-free open-vocabulary segmentation for remote sens- ing images
Kaiyu Li, Ruixun Liu, Xiangyong Cao, Xueru Bai, Feng Zhou, Deyu Meng, and Zhi Wang. Segearth-ov: Towards training-free open-vocabulary segmentation for remote sens- ing images. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 10545–10556, 2025. 1, 2, 4, 5
work page 2025
-
[41]
Open-vocabulary semantic segmentation with mask-adapted clip
Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 7061–7070, 2023. 2
work page 2023
-
[42]
Open-vocabulary segmentation with semantic-assisted calibration
Yong Liu, Sule Bai, Guanbin Li, Yitong Wang, and Yansong Tang. Open-vocabulary segmentation with semantic-assisted calibration. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3491– 3500, 2024. 5
work page 2024
-
[43]
Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation
Huaishao Luo, Junwei Bao, Youzheng Wu, Xiaodong He, and Tianrui Li. Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. InIn- ternational Conference on Machine Learning, pages 23033– 23044. PMLR, 2023. 2
work page 2023
-
[44]
Ye Lyu, George V osselman, Gui-Song Xia, Alper Yilmaz, and Michael Ying Yang. Uavid: A semantic segmentation dataset for uav imagery.ISPRS Journal of Photogrammetry and Remote Sensing, 165:108 – 119, 2020. 4
work page 2020
-
[45]
Can semantic labeling methods general- ize to any city? the inria aerial image labeling benchmark
Emmanuel Maggiori, Yuliya Tarabalka, Guillaume Charpiat, and Pierre Alliez. Can semantic labeling methods general- ize to any city? the inria aerial image labeling benchmark. In2017 IEEE International geoscience and remote sensing symposium (IGARSS), pages 3226–3229. IEEE, 2017. 4
work page 2017
-
[46]
PhD thesis, University of Toronto, 2013
V olodymyr Mnih.Machine Learning for Aerial Image La- beling. PhD thesis, University of Toronto, 2013. 4
work page 2013
-
[47]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[48]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 1, 5, 6
work page 2021
-
[49]
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[50]
Ex- plore the potential of clip for training-free open vocabulary semantic segmentation
Tong Shao, Zhuotao Tian, Hang Zhao, and Jingyong Su. Ex- plore the potential of clip for training-free open vocabulary semantic segmentation. InEuropean Conference on Com- puter Vision. Springer, 2024. 6
work page 2024
-
[51]
Yuheng Shi, Minjing Dong, and Chang Xu. Harnessing vision foundation models for high-performance, training- free open vocabulary segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 23487–23497, 2025. 2, 6
work page 2025
-
[52]
Lin Sun, Jiale Cao, Jin Xie, Xiaoheng Jiang, and Yanwei Pang. Cliper: Hierarchically improving spatial representa- tion of clip for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23199–23209, 2025. 6
work page 2025
-
[53]
Clip as rnn: Segment countless visual concepts without training endeavor
Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, and Siyang Li. Clip as rnn: Segment countless visual concepts without training endeavor. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 13171–13182, 2024. 2, 6
work page 2024
-
[54]
SpaceNet: A Remote Sensing Dataset and Challenge Series
Adam Van Etten, Dave Lindenbaum, and Todd M Bacastow. Spacenet: A remote sensing dataset and challenge series. arXiv preprint arXiv:1807.01232, 2018. 4
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[55]
Feng Wang, Jieru Mei, and Alan Yuille. Sclip: Rethink- ing self-attention for dense vision-language inference.arXiv preprint arXiv:2312.01597, 2023. 1, 2, 5, 6
-
[56]
Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation
Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zhong. Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation. InProceed- ings of the Neural Information Processing Systems Track on Datasets and Benchmarks. Curran Associates, Inc., 2021. 4
work page 2021
-
[57]
Junjue Wang, Weihao Xuan, Heli Qi, Zhihao Liu, Kunyi Liu, Yuhan Wu, Hongruixuan Chen, Jian Song, Junshi Xia, Zhuo Zheng, et al. Disasterm3: A remote sensing vision-language dataset for disaster damage assessment and response.arXiv preprint arXiv:2505.21089, 2025. 1
-
[58]
isaid: A large-scale dataset for instance segmentation in aerial images
Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman Khan, Guolei Sun, Fahad Shahbaz Khan, Fan Zhu, Ling 9 Shao, Gui-Song Xia, and Xiang Bai. isaid: A large-scale dataset for instance segmentation in aerial images. InPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 28–37, 2019. 4
work page 2019
-
[59]
Image-text co- decomposition for text-supervised semantic segmentation
Ji-Jia Wu, Andy Chia-Hao Chang, Chieh-Yu Chuang, Chun-Pei Chen, Yu-Lun Liu, Min-Hung Chen, Hou-Ning Hu, Yung-Yu Chuang, and Yen-Yu Lin. Image-text co- decomposition for text-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 26794–26803, 2024. 6
work page 2024
-
[60]
Clip-dinoiser: Teaching clip a few dino tricks for open- vocabulary semantic segmentation
Monika Wysocza ´nska, Oriane Sim´eoni, Micha¨el Ramamon- jisoa, Andrei Bursuc, Tomasz Trzci ´nski, and Patrick P ´erez. Clip-dinoiser: Teaching clip a few dino tricks for open- vocabulary semantic segmentation. InEuropean Conference on Computer Vision, pages 320–337. Springer, 2024. 6
work page 2024
-
[61]
Openearthmap: A benchmark dataset for global high-resolution land cover mapping
Junshi Xia, Naoto Yokoya, Bruno Adriano, and Clifford Broni-Bediako. Openearthmap: A benchmark dataset for global high-resolution land cover mapping. InProceed- ings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6254–6264, 2023. 4
work page 2023
-
[62]
Sed: A simple encoder-decoder for open- vocabulary semantic segmentation
Bin Xie, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, and Yanwei Pang. Sed: A simple encoder-decoder for open- vocabulary semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 2, 5
work page 2024
-
[63]
Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transform- ers.Advances in neural information processing systems, 34: 12077–12090, 2021. 5
work page 2021
-
[64]
Groupvit: Semantic segmentation emerges from text supervision
Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 18134–18144, 2022. 2
work page 2022
-
[65]
Side adapter network for open-vocabulary semantic segmentation
Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xi- ang Bai. Side adapter network for open-vocabulary semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2945– 2954, 2023. 2, 5
work page 2023
-
[66]
Masq- clip for open-vocabulary universal image segmentation
Xin Xu, Tianyi Xiong, Zheng Ding, and Zhuowen Tu. Masq- clip for open-vocabulary universal image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 887–898, 2023. 2
work page 2023
-
[67]
Resclip: Residual attention for training-free dense vision- language inference
Yuhang Yang, Jinhong Deng, Wen Li, and Lixin Duan. Resclip: Residual attention for training-free dense vision- language inference. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29968–29978,
-
[68]
Remotesam: Towards segment anything for earth observa- tion
Liang Yao, Fan Liu, Delong Chen, Chuanyi Zhang, Yijun Wang, Ziyun Chen, Wei Xu, Shimin Di, and Yuhui Zheng. Remotesam: Towards segment anything for earth observa- tion. InProceedings of the 33rd ACM International Confer- ence on Multimedia, pages 3027–3036, 2025. 3
work page 2025
-
[69]
To- wards open-vocabulary remote sensing image semantic seg- mentation
Chengyang Ye, Yunzhi Zhuge, and Pingping Zhang. To- wards open-vocabulary remote sensing image semantic seg- mentation. InProceedings of the AAAI Conference on Arti- ficial Intelligence, pages 9436–9444, 2025. 2, 5
work page 2025
-
[70]
Corrclip: Re- constructing patch correlations in clip for open-vocabulary semantic segmentation
Dengke Zhang, Fagui Liu, and Quan Tang. Corrclip: Re- constructing patch correlations in clip for open-vocabulary semantic segmentation. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 24677– 24687, 2025. 1, 2, 5, 6
work page 2025
-
[71]
Yijie Zheng, Weijie Wu, Qingyun Li, Xuehui Wang, Xu Zhou, Aiai Ren, Jun Shen, Long Zhao, Guoqing Li, and Xue Yang. Instructsam: A training-free framework for instruction-oriented remote sensing object recognition.arXiv preprint arXiv:2505.15818, 2025. 3
-
[72]
Extract free dense labels from clip
Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. InEuropean Conference on Com- puter Vision, pages 696–712. Springer, 2022. 1, 2, 5, 6
work page 2022
-
[73]
Qiqi Zhu, Yanan Zhang, Lizeng Wang, Yanfei Zhong, Qingfeng Guan, Xiaoyan Lu, Liangpei Zhang, and Deren Li. A global context-aware and batch-independent network for road extraction from vhr satellite imagery.ISPRS Journal of Photogrammetry and Remote Sensing, 175:353–365, 2021. 4
work page 2021
-
[74]
Qi Zhu, Jiangwei Lao, Deyi Ji, Junwei Luo, Kang Wu, Yingying Zhang, Lixiang Ru, Jian Wang, Jingdong Chen, Ming Yang, et al. Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 14733–14744, 2025. 2, 5 10
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.