arxiv: 2512.08730 · v2 · submitted 2025-12-09 · 💻 cs.CV

SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images

Kaiyu Li, Shengqi Zhang, Yujie Wang, Yupeng Deng, Zhi Wang, Deyu Meng, Xiangyong Cao

Pith reviewed 2026-05-16 23:48 UTC · model grok-4.3

classification 💻 cs.CV

keywords open-vocabulary semantic segmentation · remote sensing · SAM 3 · training-free · change detection · mask fusion · presence score · 3D segmentation

The pith

SAM 3 performs open-vocabulary semantic segmentation on remote sensing images without any training by fusing its heads and filtering absent categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the unified promptable framework of SAM 3 can replace CLIP-based pipelines for open-vocabulary tasks in remote sensing, where dense small targets and large vocabularies create localization and false-positive problems. The authors implement a mask fusion step that merges the semantic segmentation head with the Transformer decoder instance head to improve land-cover completeness. They add a presence-score filter from the presence head to discard categories absent from each scene, then extend the same fused logits to open-vocabulary change detection via joint instance- and pixel-level checks. Results on 24 remote sensing datasets across 2D segmentation, change detection, and 3D segmentation indicate the approach works without domain adaptation.

Core claim

SAM 3 can be applied directly to remote sensing open-vocabulary semantic segmentation, change detection, and 3D segmentation without training: a mask fusion strategy combines outputs from the semantic segmentation head and Transformer decoder, while the presence head score filters out non-present categories, and the resulting fused logits support joint verification for change detection, yielding promising performance on 20 segmentation datasets, 3 change detection datasets, and 1 3D dataset.

What carries the argument

Mask fusion strategy that merges SAM 3 semantic segmentation head outputs with Transformer decoder instance-head outputs, combined with presence-score filtering to remove absent categories from the vocabulary.
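The fusion-plus-filter idea can be illustrated with a minimal sketch. It is not the authors' implementation: the element-wise maximum used as the fusion rule, the score weighting of instance masks, both thresholds, and the names `fuse_category` and `segment` are assumptions made for illustration, and plain nested lists stand in for model outputs.

```python
from typing import Dict, List, Tuple

Grid = List[List[float]]  # H x W map of per-pixel probabilities


def fuse_category(sem: Grid, instances: List[Tuple[float, Grid]]) -> Grid:
    """Assumed fusion rule: element-wise max of the semantic map and
    each instance mask weighted by its instance score."""
    fused = [row[:] for row in sem]
    for score, mask in instances:
        for y, row in enumerate(mask):
            for x, m in enumerate(row):
                fused[y][x] = max(fused[y][x], score * m)
    return fused


def segment(sem_maps: Dict[str, Grid],
            inst_maps: Dict[str, List[Tuple[float, Grid]]],
            presence: Dict[str, float],
            presence_thr: float = 0.5,   # hypothetical threshold
            pixel_thr: float = 0.5) -> List[List[str]]:
    """Drop categories with a low presence score, fuse the rest,
    then label each pixel with the highest-scoring kept category."""
    kept = [c for c in sem_maps if presence[c] >= presence_thr]
    fused = {c: fuse_category(sem_maps[c], inst_maps.get(c, [])) for c in kept}
    first = next(iter(sem_maps.values()))
    h, w = len(first), len(first[0])
    labels = [["background"] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            best, best_p = "background", pixel_thr
            for c in kept:
                if fused[c][y][x] > best_p:
                    best, best_p = c, fused[c][y][x]
            labels[y][x] = best
    return labels


# Toy 1 x 2 scene: "ship" fires weakly on pixel 0 but has a low presence score.
sem_maps = {"building": [[0.2, 0.7]], "ship": [[0.6, 0.1]]}
inst_maps = {"building": [(0.9, [[0.0, 1.0]])]}
presence = {"building": 0.8, "ship": 0.1}

print(segment(sem_maps, inst_maps, presence))                    # [['background', 'building']]
print(segment(sem_maps, inst_maps, presence, presence_thr=0.0))  # [['ship', 'building']]
```

In the toy scene, disabling the presence filter lets the absent "ship" category claim a pixel, which is the kind of false positive the filter is described as removing.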

If this is right

Fused logits enable joint instance- and pixel-level verification for open-vocabulary change detection without extra modules.
Presence-score filtering reduces false positives when processing large vocabularies over geospatial patches.
The same pipeline supports 2D semantic segmentation, change detection, and 3D semantic segmentation without task-specific retraining.
Performance holds across 24 remote sensing datasets spanning varied sensors and resolutions.
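One way to picture a joint instance- and pixel-level check is sketched below. The rule is an assumption for illustration only, since this page does not spell out the paper's exact criterion: a second-date instance counts as changed when a large enough share of its pixels newly carry the category, and only those newly labelled pixels are marked. The names `binarize`, `changed_pixels`, and `min_new_ratio` are hypothetical, and the sketch covers appearance only (absent at the first date, present at the second).

```python
from typing import List

Grid = List[List[float]]  # fused per-pixel probabilities for one category
Mask = List[List[int]]    # binary mask


def binarize(prob: Grid, thr: float = 0.5) -> Mask:
    """Pixel-level decision: 1 where the category probability exceeds thr."""
    return [[1 if p > thr else 0 for p in row] for row in prob]


def changed_pixels(prob_t1: Grid, prob_t2: Grid, instances_t2: List[Mask],
                   thr: float = 0.5, min_new_ratio: float = 0.5) -> Mask:
    """Mark pixels that are new at t2 and belong to an instance whose
    share of new pixels reaches min_new_ratio (assumed verification rule)."""
    b1, b2 = binarize(prob_t1, thr), binarize(prob_t2, thr)
    h, w = len(b1), len(b1[0])
    out = [[0] * w for _ in range(h)]
    for inst in instances_t2:
        pixels = [(y, x) for y in range(h) for x in range(w) if inst[y][x]]
        if not pixels:
            continue
        new = [(y, x) for (y, x) in pixels if b1[y][x] == 0 and b2[y][x] == 1]
        if len(new) / len(pixels) >= min_new_ratio:  # instance-level check
            for y, x in new:                          # pixel-level check
                out[y][x] = 1
    return out


# 1 x 3 strip: a building appears at pixel 0, the one at pixel 1 is unchanged.
prob_t1 = [[0.1, 0.9, 0.1]]
prob_t2 = [[0.8, 0.9, 0.2]]
instances_t2 = [[[1, 0, 0]], [[0, 1, 0]]]
print(changed_pixels(prob_t1, prob_t2, instances_t2))  # [[1, 0, 0]]
```

Swapping the roles of the two dates would flag disappearances under the same assumed rule.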

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the transfer holds, remote sensing segmentation pipelines could shift from training separate localization and classification heads toward promptable foundation models.
The fusion-plus-filter recipe may generalize to other dense-small-object domains such as medical microscopy or autonomous driving scenes.
Performance gaps on spectral bands outside SAM 3 training data could be closed by lightweight spectral adapters while keeping the no-training constraint.
The approach implies that unified segmentation-recognition models reduce the need for separate CLIP encoders and post-processing heuristics in geospatial tasks.

Load-bearing premise

SAM 3 heads pre-trained on natural images transfer directly to remote sensing scenes that contain dense small targets and different spectral characteristics.

What would settle it

A remote sensing benchmark where the fused SAM 3 outputs produce lower per-pixel accuracy than a standard CLIP-based open-vocabulary baseline on the same 2D segmentation or change detection tasks.
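Such a comparison comes down to scoring two label maps against the same ground truth. The sketch below computes pixel accuracy and mean IoU with their standard definitions; the function names and the toy labels are illustrative and do not come from the paper.

```python
from typing import List

LabelMap = List[List[str]]  # H x W grid of class names


def pixel_accuracy(pred: LabelMap, truth: LabelMap) -> float:
    """Fraction of pixels whose predicted label matches the ground truth."""
    pairs = [(p, t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    return sum(p == t for p, t in pairs) / len(pairs)


def mean_iou(pred: LabelMap, truth: LabelMap, classes: List[str]) -> float:
    """Mean over classes of intersection / union of predicted and true pixels."""
    ious = []
    for c in classes:
        inter = union = 0
        for pr, tr in zip(pred, truth):
            for p, t in zip(pr, tr):
                inter += (p == c and t == c)
                union += (p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


truth = [["road", "road", "tree"]]
pred_a = [["road", "tree", "tree"]]   # stands in for one method
pred_b = [["road", "road", "road"]]   # stands in for another method
classes = ["road", "tree"]

print(round(pixel_accuracy(pred_a, truth), 3), round(mean_iou(pred_a, truth, classes), 3))  # 0.667 0.5
print(round(pixel_accuracy(pred_b, truth), 3), round(mean_iou(pred_b, truth, classes), 3))  # 0.667 0.333
```

The two toy predictions tie on pixel accuracy but differ on mean IoU, so a decisive test would need to fix the metric in advance.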

Figures

Figures reproduced from arXiv: 2512.08730 by Deyu Meng, Kaiyu Li, Shengqi Zhang, Xiangyong Cao, Yujie Wang, Yupeng Deng, Zhi Wang.

**Figure 2.** The overall inference pipeline of SegEarth-OV [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]

**Figure 3.** Impact of vocabulary size and our filtering strategy. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]

**Figure 4.** Inference results of SegEarth-OV3 on a remote sensing image exceeding 10k×10k resolution. The image originates from [13].

Original abstract

Most existing methods for training-free open-vocabulary semantic segmentation are based on CLIP. While these approaches have made progress, they often face challenges in precise localization or require complex pipelines to combine separate modules, especially in remote sensing scenarios where numerous dense and small targets are present. Recently, Segment Anything Model 3 (SAM 3) was proposed, unifying segmentation and recognition in a promptable framework. In this paper, we present a comprehensive exploration of applying SAM 3 to the remote sensing open-vocabulary tasks (i.e., 2D semantic segmentation, change detection, and 3D semantic segmentation) without any training. First, we implement a mask fusion strategy that combines the outputs from SAM 3's semantic segmentation head and the Transformer decoder (instance head). This allows us to leverage the strengths of both heads for better land coverage. Second, we utilize the presence score from the presence head to filter out categories that do not exist in the scene, reducing false positives caused by the vast vocabulary sizes and patch-level processing in geospatial scenes. Furthermore, we extend our method to open-vocabulary change detection by a joint instance- and pixel-level verification strategy built directly upon our fused logits. We evaluate our method on extensive remote sensing datasets and tasks, including 20 segmentation datasets, 3 change detection datasets, and a 3D segmentation dataset. Experiments show that our method achieves promising performance, demonstrating the potential of SAM 3 for remote sensing open-vocabulary tasks. Our code is released at https://github.com/earth-insights/SegEarth-OV-3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is the first direct application of SAM 3 zero-shot to remote sensing open-vocabulary segmentation, using straightforward mask fusion and presence filtering, but the domain shift from natural images is the unproven core assumption.

The paper is the first to explore SAM 3 for open-vocabulary semantic segmentation in remote sensing without training. It uses a mask fusion strategy that combines the semantic segmentation head and Transformer decoder outputs for better land coverage, and applies the presence head to filter out non-existent categories, cutting false positives in large-vocabulary patch processing. The authors extend this to open-vocabulary change detection via joint instance- and pixel-level verification. Evaluation covers 20 segmentation datasets, 3 change detection datasets, and one 3D segmentation dataset, with claims of promising performance. The code is released.

This approach is straightforward and capitalizes on SAM 3's unified framework, which is a step up from CLIP-based methods that struggle with localization in dense scenes. The broad evaluation and code availability are practical strengths.

The main concern is the zero-shot transfer. SAM 3's heads come from natural image training, and remote sensing imagery has spectral mismatches plus many sub-10-pixel targets. If the raw scores don't hold, the fusion and filtering break down. The paper doesn't detail ablations or error analysis, so the full results need to show whether this works reliably or only on favorable cases.

This is for applied computer vision researchers in remote sensing who want training-free baselines. A serious thinker would find the empirical demonstration useful, even if the conclusions are provisional. It deserves peer review to verify the quantitative support and assess the transfer robustness.

Referee Report

2 major / 2 minor

Summary. The paper presents SegEarth-OV3, a training-free pipeline that applies the released SAM 3 model directly to open-vocabulary semantic segmentation, change detection, and 3D segmentation on remote-sensing imagery. It introduces a mask-fusion step that combines the semantic-segmentation head with the Transformer-decoder (instance) head, uses the presence head to filter absent categories, and extends the fused logits to a joint instance- and pixel-level verification strategy for change detection. The method is evaluated on 20 segmentation datasets, three change-detection datasets, and one 3D dataset, with the central claim that it achieves promising performance and thereby demonstrates the potential of SAM 3 for remote-sensing open-vocabulary tasks.

Significance. If the reported results are robust, the work provides concrete evidence that a single unified promptable model can be used zero-shot across multiple geospatial tasks without domain-specific fine-tuning or complex CLIP-based pipelines. The public code release supports reproducibility and enables direct follow-up studies on domain-shift mitigation.

major comments (2)

§3.2 (Mask Fusion Strategy): the fusion rule that combines semantic-head and instance-head outputs is presented as the key to handling dense small targets, yet no ablation quantifies the contribution of each head on remote-sensing imagery; without this, it is impossible to verify whether the claimed improvement over CLIP-based methods stems from the fusion or simply from SAM 3's base segmentation quality.
§4 (Experiments): the presence-score filtering step is asserted to reduce false positives caused by large vocabularies and patch-level processing, but the manuscript provides no quantitative comparison of precision/recall with and without the filter on the 20 segmentation datasets; this omission leaves the central zero-shot transfer claim under-supported given the spectral and scale mismatch between natural-image training data and remote-sensing inputs.
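The with/without-filter comparison requested in the second comment could be tabulated roughly as follows. This is a sketch under stated assumptions: micro-averaged precision and recall over a chosen class list, with a hypothetical helper name and toy label maps that do not come from the manuscript.

```python
from typing import List, Tuple

LabelMap = List[List[str]]  # H x W grid of class names


def micro_precision_recall(pred: LabelMap, truth: LabelMap,
                           classes: List[str]) -> Tuple[float, float]:
    """Pool true positives, false positives and false negatives over
    all listed classes, then compute precision and recall once."""
    tp = fp = fn = 0
    for pr, tr in zip(pred, truth):
        for p, t in zip(pr, tr):
            for c in classes:
                tp += (p == c and t == c)
                fp += (p == c and t != c)
                fn += (p != c and t == c)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy scene with no ships: the unfiltered output hallucinates one.
truth = [["water", "building"]]
unfiltered = [["ship", "building"]]
filtered = [["background", "building"]]
classes = ["building", "ship"]

print(micro_precision_recall(unfiltered, truth, classes))  # (0.5, 1.0)
print(micro_precision_recall(filtered, truth, classes))    # (1.0, 1.0)
```

A real ablation would run this per dataset and also report recall, since an over-aggressive filter could remove categories that are actually present.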

minor comments (2)

Figure 3 caption and §3.1: the notation for the fused logit map is introduced without an explicit equation; adding a compact formula would improve clarity.
Table 2: several dataset names are abbreviated without a footnote or reference; expanding the first occurrence would aid readers unfamiliar with remote-sensing benchmarks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and the recommendation of minor revision. The comments highlight important aspects for strengthening the empirical support of our claims. We address each major comment below and have incorporated the suggested analyses into the revised manuscript.

Referee: §3.2 (Mask Fusion Strategy): the fusion rule that combines semantic-head and instance-head outputs is presented as the key to handling dense small targets, yet no ablation quantifies the contribution of each head on remote-sensing imagery; without this, it is impossible to verify whether the claimed improvement over CLIP-based methods stems from the fusion or simply from SAM 3's base segmentation quality.

Authors: We agree that an explicit ablation would strengthen the justification for the fusion strategy. The design is motivated by the complementary properties observed in remote-sensing scenes: the semantic head tends to provide more complete coverage for dense small targets, while the instance head contributes sharper object boundaries. In the revised manuscript we add a dedicated ablation subsection in §3.2 that reports mIoU for the semantic head alone, the instance head alone, and the fused output on four representative remote-sensing datasets. The results confirm that fusion improves performance beyond either head individually, particularly on categories with numerous small objects, thereby supporting that the gains are not attributable solely to SAM 3's base segmentation quality.

revision: yes

Referee: §4 (Experiments): the presence-score filtering step is asserted to reduce false positives caused by large vocabularies and patch-level processing, but the manuscript provides no quantitative comparison of precision/recall with and without the filter on the 20 segmentation datasets; this omission leaves the central zero-shot transfer claim under-supported given the spectral and scale mismatch between natural-image training data and remote-sensing inputs.

Authors: We concur that a direct quantitative comparison is necessary to substantiate the role of presence-score filtering in mitigating false positives under large vocabularies and domain shift. The revised manuscript now includes an additional table in §4 that reports precision and recall (as well as mIoU) with and without the filtering step across all 20 segmentation datasets. The comparison shows a consistent reduction in false positives when the filter is applied, which helps corroborate the zero-shot transfer claim despite the spectral and scale differences between natural-image pre-training and remote-sensing inputs.

revision: yes

Circularity Check

0 steps flagged

No significant circularity: direct empirical application of pre-trained SAM 3

The paper presents a training-free exploration that applies the released SAM 3 model directly to remote-sensing tasks via mask fusion of its semantic and instance heads plus presence-score filtering. No mathematical derivations, fitted parameters, or self-citation chains appear; the method consists of simple post-processing on model outputs followed by empirical evaluation on external datasets. Claims reduce to observed performance rather than any input-by-construction equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that SAM 3's pre-trained heads generalize to remote sensing imagery and that simple post-processing rules suffice; no new mathematical axioms or free parameters are introduced.

axioms (2)

domain assumption SAM 3's semantic segmentation head and Transformer decoder produce complementary outputs that can be fused without retraining
Invoked in the description of the mask fusion strategy
domain assumption Presence scores from SAM 3 reliably indicate category absence in remote sensing patches
Used to filter false positives from large vocabularies

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection
cs.CV · 2026-04 · conditional · novelty 7.0

Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.

OmniOVCD: Streamlining Open-Vocabulary Change Detection with SAM 3
cs.CV · 2026-01 · conditional · novelty 7.0

OmniOVCD uses SAM 3's decoupled outputs and an SFID strategy to achieve state-of-the-art IoU scores of 67.2, 66.5, 24.5, and 27.1 on four OVCD benchmarks, surpassing prior methods.

MemOVCD: Training-Free Open-Vocabulary Change Detection via Cross-Temporal Memory Reasoning and Global-Local Adaptive Rectification
cs.CV · 2026-04 · unverdicted · novelty 6.0

MemOVCD reformulates change detection as cross-temporal memory reasoning with weighted bidirectional propagation and adaptive rectification to improve semantic change identification without task-specific training.

Seeking Consensus: Geometric-Semantic On-the-Fly Recalibration for Open-Vocabulary Remote Sensing Semantic Segmentation
cs.CV · 2026-04 · unverdicted · novelty 6.0

SeeCo is a training-free on-the-fly recalibration method using multi-view geometric consistency and adaptive textual calibration to improve open-vocabulary semantic segmentation in remote sensing images.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · cited by 4 Pith papers · 6 internal anchors

[1] Nikita Araslanov and Stefan Roth. Single-stage semantic segmentation from image labels. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4253–4262, 2020.

[2] Sule Bai, Yong Liu, Yifei Han, Haoji Zhang, and Yansong Tang. Self-calibrated clip for training-free open-vocabulary segmentation. arXiv preprint arXiv:2411.15869, 2024.

[3] Luca Barsellotti, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Training-free open-vocabulary segmentation with offline diffusion-augmented prototype generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3689–3698, 2024.

[4] Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network. arXiv preprint arXiv:2504.13181, 2025.

[5] Walid Bousselham, Felix Petersen, Vittorio Ferrari, and Hilde Kuehne. Grounding everything: Emerging localization properties in vision-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3828–3837, 2024.

[6] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1209–1218, 2018.

[7] Wenxiao Cai, Ke Jin, Jinyan Hou, Cong Guo, Letian Wu, and Wankou Yang. Vdd: Varied drone dataset for semantic segmentation. arXiv preprint arXiv:2305.13608, 2023.

[8] Qinglong Cao, Yuntian Chen, Chao Ma, and Xiaokang Yang. Open-vocabulary remote sensing image semantic segmentation. arXiv preprint arXiv:2409.07683, 2024.

[9] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.

[10] Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025.
[11] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.

[12] Junbum Cha, Jonghwan Mun, and Byungseok Roh. Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11165–11174, 2023.

[13] Hongruixuan Chen, Cuiling Lan, Jian Song, Clifford Broni-Bediako, Junshi Xia, and Naoto Yokoya. Land-cover change detection using paired openstreetmap data and optical high-resolution imagery via object-guided transformer. arXiv preprint arXiv:2310.02674, 2023.

[14] Qi Chen, Lingxiao Yang, Yun Chen, Nailong Zhao, Jianhuang Lai, Jie Shao, and Xiaohua Xie. Training-free class purification for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23124–23134, 2025.

[15] Yu Chen, Yao Wang, Peng Lu, Yisong Chen, and Guoping Wang. Large-scale structure from motion with semantic constraints of aerial images. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 347–359. Springer, 2018.

[16] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021.

[17] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1290–1299, 2022.

[18] Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. Cat-seg: Cost aggregation for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4113–4123, 2024.

[19] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.

[20] Ilke Demir, Krzysztof Koperski, David Lindenbaum, Guan Pang, Jing Huang, Saikat Basu, Forest Hughes, Devis Tuia, and Ramesh Raskar. Deepglobe 2018: A challenge to parse the earth through satellite images. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 172–181, 2018.
[21] Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. Decoupling zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11583–11592, 2022.

[22] Saikat Dutta, Akhil Vasim, Siddhant Gole, Hamid Rezatofighi, and Biplab Banerjee. Aeroseg: Harnessing sam for open-vocabulary segmentation in remote sensing images. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2254–2264, 2025.

[23] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.

[24] Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmentation with image-level labels. In European Conference on Computer Vision, pages 540–557. Springer, 2022.

[25] Ritwik Gupta, Richard Hosfelt, Sandra Sajeev, Nirav Patel, Bryce Goodman, Jigar Doshi, Eric Heim, Howie Choset, and Matthew Gaston. xbd: A dataset for assessing building damage from satellite imagery, 2019.

[26] Sina Hajimiri, Ismail Ben Ayed, and Jose Dolz. Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5061–5071. IEEE, 2025.

[27] Shiqi Huang, Shuting He, Huaiyuan Qin, and Bihan Wen. Score: Scene context matters in open-vocabulary remote sensing instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12559–12569, 2025.

[28] Shunping Ji, Shiqing Wei, and Meng Lu. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Transactions on geoscience and remote sensing, 57(1):574–586, 2018.

[29] Shuo Jin, Siyue Yu, Bingfeng Zhang, Mingjie Sun, Yi Dong, and Jimin Xiao. Feature purification matters: Suppressing outlier propagation for training-free open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20291–20300,

[30] Dahyun Kang and Minsu Cho. In defense of lazy visual grounding for open-vocabulary semantic segmentation. In European Conference on Computer Vision (ECCV), 2024.
[31] Chanyoung Kim, Dayun Ju, Woojung Han, Ming-Hsuan Yang, and Seong Jae Hwang. Distilling spectral graph for object-context aware open-vocabulary semantic segmentation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15033–15042, 2025.

[32] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.

[33] Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Clearclip: Decomposing clip representations for dense vision-language inference. arXiv preprint arXiv:2407.12442, 2024.

[34] Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Proxyclip: Proxy attention improves clip for open-vocabulary segmentation. In European Conference on Computer Vision, pages 70–88. Springer, 2025.

[35] Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546, 2022.

[36] Bingyu Li, Haocheng Dong, Da Zhang, Zhiyuan Zhao, Junyu Gao, and Xuelong Li. Exploring efficient open-vocabulary segmentation in the remote sensing. arXiv preprint arXiv:2509.12040, 2025.

[37] Ke Li, Fuyu Dong, Di Wang, Shaofeng Li, Quan Wang, Xinbo Gao, and Tat-Seng Chua. Show me what and where has changed? question answering and grounding for remote sensing change detection. arXiv preprint arXiv:2410.23828,

[38] Kaiyu Li, Xiangyong Cao, Yupeng Deng, Chao Pang, Zepeng Xin, Deyu Meng, and Zhi Wang. Dynamicearth: How far are we from open-vocabulary change detection? arXiv preprint arXiv:2501.12931, 2025.

[39] Kaiyu Li, Xiangyong Cao, Ruixun Liu, Shihong Wang, Zixuan Jiang, Zhi Wang, and Deyu Meng. Annotation-free open-vocabulary segmentation for remote-sensing images. arXiv preprint arXiv:2508.18067, 2025.

[40] Kaiyu Li, Ruixun Liu, Xiangyong Cao, Xueru Bai, Feng Zhou, Deyu Meng, and Zhi Wang. Segearth-ov: Towards training-free open-vocabulary segmentation for remote sensing images. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10545–10556, 2025.
[41]

Open-vocabulary semantic segmentation with mask-adapted clip

Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 7061–7070, 2023. 2

work page 2023
[42]

Open-vocabulary segmentation with semantic-assisted calibration

Yong Liu, Sule Bai, Guanbin Li, Yitong Wang, and Yansong Tang. Open-vocabulary segmentation with semantic-assisted calibration. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3491– 3500, 2024. 5

work page 2024
[43]

Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation

Huaishao Luo, Junwei Bao, Youzheng Wu, Xiaodong He, and Tianrui Li. Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. InIn- ternational Conference on Machine Learning, pages 23033– 23044. PMLR, 2023. 2

work page 2023
[44]

Uavid: A semantic segmentation dataset for uav imagery.ISPRS Journal of Photogrammetry and Remote Sensing, 165:108 – 119, 2020

Ye Lyu, George V osselman, Gui-Song Xia, Alper Yilmaz, and Michael Ying Yang. Uavid: A semantic segmentation dataset for uav imagery.ISPRS Journal of Photogrammetry and Remote Sensing, 165:108 – 119, 2020. 4

[45]

Can semantic labeling methods generalize to any city? the inria aerial image labeling benchmark

Emmanuel Maggiori, Yuliya Tarabalka, Guillaume Charpiat, and Pierre Alliez. Can semantic labeling methods generalize to any city? the inria aerial image labeling benchmark. In 2017 IEEE International geoscience and remote sensing symposium (IGARSS), pages 3226–3229. IEEE, 2017. 4

[46]

Machine Learning for Aerial Image Labeling

Volodymyr Mnih. Machine Learning for Aerial Image Labeling. PhD thesis, University of Toronto, 2013. 4

[47]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 1, 2

[48]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 1, 5, 6

[49]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024. 2

[50]

Explore the potential of clip for training-free open vocabulary semantic segmentation

Tong Shao, Zhuotao Tian, Hang Zhao, and Jingyong Su. Explore the potential of clip for training-free open vocabulary semantic segmentation. In European Conference on Computer Vision. Springer, 2024. 6

[51]

Harnessing vision foundation models for high-performance, training-free open vocabulary segmentation

Yuheng Shi, Minjing Dong, and Chang Xu. Harnessing vision foundation models for high-performance, training-free open vocabulary segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23487–23497, 2025. 2, 6

[52]

Cliper: Hierarchically improving spatial representation of clip for open-vocabulary semantic segmentation

Lin Sun, Jiale Cao, Jin Xie, Xiaoheng Jiang, and Yanwei Pang. Cliper: Hierarchically improving spatial representation of clip for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23199–23209, 2025. 6

[53]

Clip as rnn: Segment countless visual concepts without training endeavor

Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, and Siyang Li. Clip as rnn: Segment countless visual concepts without training endeavor. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13171–13182, 2024. 2, 6

[54]

SpaceNet: A Remote Sensing Dataset and Challenge Series

Adam Van Etten, Dave Lindenbaum, and Todd M Bacastow. Spacenet: A remote sensing dataset and challenge series. arXiv preprint arXiv:1807.01232, 2018. 4

[55]

Sclip: Rethinking self-attention for dense vision-language inference

Feng Wang, Jieru Mei, and Alan Yuille. Sclip: Rethinking self-attention for dense vision-language inference. arXiv preprint arXiv:2312.01597, 2023. 1, 2, 5, 6

[56]

Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation

Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zhong. Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. Curran Associates, Inc., 2021. 4

[57]

Disasterm3: A remote sensing vision-language dataset for disaster damage assessment and response

Junjue Wang, Weihao Xuan, Heli Qi, Zhihao Liu, Kunyi Liu, Yuhan Wu, Hongruixuan Chen, Jian Song, Junshi Xia, Zhuo Zheng, et al. Disasterm3: A remote sensing vision-language dataset for disaster damage assessment and response. arXiv preprint arXiv:2505.21089, 2025. 1

[58]

isaid: A large-scale dataset for instance segmentation in aerial images

Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman Khan, Guolei Sun, Fahad Shahbaz Khan, Fan Zhu, Ling Shao, Gui-Song Xia, and Xiang Bai. isaid: A large-scale dataset for instance segmentation in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 28–37, 2019. 4

[59]

Image-text co-decomposition for text-supervised semantic segmentation

Ji-Jia Wu, Andy Chia-Hao Chang, Chieh-Yu Chuang, Chun-Pei Chen, Yu-Lun Liu, Min-Hung Chen, Hou-Ning Hu, Yung-Yu Chuang, and Yen-Yu Lin. Image-text co-decomposition for text-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26794–26803, 2024. 6

[60]

Clip-dinoiser: Teaching clip a few dino tricks for open-vocabulary semantic segmentation

Monika Wysoczańska, Oriane Siméoni, Michaël Ramamonjisoa, Andrei Bursuc, Tomasz Trzciński, and Patrick Pérez. Clip-dinoiser: Teaching clip a few dino tricks for open-vocabulary semantic segmentation. In European Conference on Computer Vision, pages 320–337. Springer, 2024. 6

[61]

Openearthmap: A benchmark dataset for global high-resolution land cover mapping

Junshi Xia, Naoto Yokoya, Bruno Adriano, and Clifford Broni-Bediako. Openearthmap: A benchmark dataset for global high-resolution land cover mapping. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6254–6264, 2023. 4

[62]

Sed: A simple encoder-decoder for open-vocabulary semantic segmentation

Bin Xie, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, and Yanwei Pang. Sed: A simple encoder-decoder for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 2, 5

[63]

Segformer: Simple and efficient design for semantic segmentation with transformers

Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems, 34:12077–12090, 2021. 5

[64]

Groupvit: Semantic segmentation emerges from text supervision

Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18134–18144, 2022. 2

[65]

Side adapter network for open-vocabulary semantic segmentation

Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2945–2954, 2023. 2, 5

[66]

Masqclip for open-vocabulary universal image segmentation

Xin Xu, Tianyi Xiong, Zheng Ding, and Zhuowen Tu. Masqclip for open-vocabulary universal image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 887–898, 2023. 2

[67]

Resclip: Residual attention for training-free dense vision-language inference

Yuhang Yang, Jinhong Deng, Wen Li, and Lixin Duan. Resclip: Residual attention for training-free dense vision-language inference. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29968–29978,

[68]

Remotesam: Towards segment anything for earth observation

Liang Yao, Fan Liu, Delong Chen, Chuanyi Zhang, Yijun Wang, Ziyun Chen, Wei Xu, Shimin Di, and Yuhui Zheng. Remotesam: Towards segment anything for earth observation. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 3027–3036, 2025. 3

[69]

Towards open-vocabulary remote sensing image semantic segmentation

Chengyang Ye, Yunzhi Zhuge, and Pingping Zhang. Towards open-vocabulary remote sensing image semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9436–9444, 2025. 2, 5

[70]

Corrclip: Reconstructing patch correlations in clip for open-vocabulary semantic segmentation

Dengke Zhang, Fagui Liu, and Quan Tang. Corrclip: Reconstructing patch correlations in clip for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24677–24687, 2025. 1, 2, 5, 6

[71]

Instructsam: A training-free framework for instruction-oriented remote sensing object recognition

Yijie Zheng, Weijie Wu, Qingyun Li, Xuehui Wang, Xu Zhou, Aiai Ren, Jun Shen, Long Zhao, Guoqing Li, and Xue Yang. Instructsam: A training-free framework for instruction-oriented remote sensing object recognition. arXiv preprint arXiv:2505.15818, 2025. 3

[72]

Extract free dense labels from clip

Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. In European Conference on Computer Vision, pages 696–712. Springer, 2022. 1, 2, 5, 6

[73]

A global context-aware and batch-independent network for road extraction from vhr satellite imagery

Qiqi Zhu, Yanan Zhang, Lizeng Wang, Yanfei Zhong, Qingfeng Guan, Xiaoyan Lu, Liangpei Zhang, and Deren Li. A global context-aware and batch-independent network for road extraction from vhr satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 175:353–365, 2021. 4

[74]

Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling

Qi Zhu, Jiangwei Lao, Deyi Ji, Junwei Luo, Kang Wu, Yingying Zhang, Lixiang Ru, Jian Wang, Jingdong Chen, Ming Yang, et al. Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14733–14744, 2025. 2, 5
