pith. machine review for the scientific record.

arxiv: 2604.15652 · v1 · submitted 2026-04-17 · 💻 cs.CV

Recognition: unknown

Towards Realistic Open-Vocabulary Remote Sensing Segmentation: Benchmark and Baseline

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords open-vocabulary segmentation · remote sensing · benchmark dataset · positive-incentive noise · image segmentation · OVRSISBenchV2 · transfer learning

The pith

OVRSISBenchV2 provides a more realistic and challenging benchmark for open-vocabulary remote sensing segmentation, on which the Pi-Seg baseline achieves strong performance through positive-incentive noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the fragmented and limited datasets in open-vocabulary remote sensing image segmentation by creating OVRSISBenchV2. The benchmark combines OVRSIS95K, a newly constructed dataset of roughly 95K image-mask pairs, with 10 downstream datasets to reach 170K images and 128 categories, and adds protocols for practical tasks such as building extraction. A sympathetic reader would care because current evaluations may not reflect real geospatial needs, producing methods that fail in deployment. The authors also present Pi-Seg, which uses a positive-incentive noise mechanism to improve how visual and text features align during training. If this holds, it suggests that both better data and targeted perturbations can advance generalization in this domain.
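
To make the assembly step concrete, here is a minimal sketch of how per-dataset label spaces might be merged into a single benchmark taxonomy. The dataset names and category lists are hypothetical placeholders, not the paper's actual manifest, which spans OVRSIS95K plus 10 downstream datasets and 128 categories.

    # Minimal sketch: merging per-dataset label spaces into one taxonomy.
    # Dataset names and categories below are hypothetical placeholders.
    def build_taxonomy(datasets: dict[str, list[str]]) -> dict[str, int]:
        """Union all category names (case-normalized) into stable integer ids."""
        merged: dict[str, int] = {}
        for _dataset, categories in sorted(datasets.items()):  # sort for determinism
            for cat in categories:
                key = cat.strip().lower()
                if key not in merged:
                    merged[key] = len(merged)  # assign the next free id
        return merged

    sources = {
        "ovrsis95k_subset": ["building", "road", "forest", "water"],
        "flood_task": ["water", "flooded area", "background"],
    }
    taxonomy = build_taxonomy(sources)
    print(len(taxonomy), "merged categories")  # duplicates such as "water" collapse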

Core claim

OVRSISBenchV2 substantially expands scene diversity, semantic coverage, and evaluation difficulty beyond prior work, and Pi-Seg delivers strong and consistent results, particularly on the more challenging OVRSISBenchV2, by using positive-incentive noise to broaden the visual-text feature space.

What carries the argument

The positive-incentive noise mechanism in Pi-Seg, consisting of learnable and semantically guided perturbations that broaden the visual-text feature space during training to improve transferability.
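
As a rough illustration of what a learnable, semantics-conditioned perturbation could look like, the sketch below injects Gaussian noise into CLIP text embeddings via the reparameterization trick. The module name, shapes, and the Gaussian form are assumptions made for illustration; the paper's actual Image-SPM and Text-SPM may differ in both form and conditioning.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LearnableTextPerturbation(nn.Module):
        # Hedged sketch of a semantics-guided noise module: each deterministic
        # text embedding is mapped to a Gaussian whose shift and scale are
        # learned functions of the embedding itself.
        def __init__(self, dim: int):
            super().__init__()
            self.mu = nn.Linear(dim, dim)         # learned, embedding-conditioned shift
            self.log_sigma = nn.Linear(dim, dim)  # learned, embedding-conditioned log-scale

        def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
            # text_emb: (num_classes, dim) deterministic CLIP text embeddings
            eps = torch.randn_like(text_emb)  # reparameterization trick
            noisy = text_emb + self.mu(text_emb) + eps * self.log_sigma(text_emb).exp()
            return F.normalize(noisy, dim=-1)  # keep embeddings on the unit sphere

    spm = LearnableTextPerturbation(dim=512)
    perturbed = spm(torch.randn(128, 512))  # e.g. one embedding per benchmark category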

If this is right

  • OVRSISBenchV2 enables more rigorous testing of open-vocabulary methods in remote sensing.
  • Pi-Seg shows consistent improvements across OVRSISBenchV1, V2, and downstream tasks.
  • Realistic benchmarks better reflect complex deployment scenarios like flood detection.
  • Positive-incentive noise enhances cross-dataset generalization in segmentation models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar benchmark expansion approaches could be applied to other specialized domains like medical imaging.
  • The noise perturbation idea might transfer to general open-vocabulary segmentation outside remote sensing.
  • Future work could test if the benchmark's diversity reduces selection bias in model development.
  • Integration with other sensors or modalities could further test the method's robustness.

Load-bearing premise

The manually selected categories and datasets accurately capture the full range of realistic geospatial application demands without introducing selection bias or missing key scene and sensor variations.

What would settle it

A follow-up experiment where Pi-Seg fails to outperform standard baselines on OVRSISBenchV2 or where new datasets reveal that the 128 categories do not cover important real-world cases would challenge the central claims.

Figures

Figures reproduced from arXiv: 2604.15652 by Bingyu Li, Da Zhang, Haocheng Dong, Junyu Gao, Tao Huo, Xuelong Li, Zhiyuan Zhao.

Figure 1
Figure 1: From OVRSISBenchV1/RSKT-Seg (V1 Paper) to OVRSISBenchV2/Pi-Seg (V2 Paper). (a.1) OVRSISBenchV1 adopts single-source cross-dataset evaluation. (a.2) RSKT-Seg improves OVRSIS through remote-sensing-aware cost-map construction and refinement. (b) OVRSISBenchV2 expands the training foundation to 95K pairs across five scene domains, while Pi-Seg provides a noise-aware baseline with stronger generalization under … view at source ↗
Figure 2
Figure 2: Two-stage pipeline for OVRSIS95K Construction. Given an input remote sensing image, Stage 1 first generates a detailed scene caption and parses it into candidate object categories, which are then aligned with the dataset taxonomy through filtering and matching. Stage 2 uses the resulting categories to produce instance-level masks, followed by mask combination and correction to obtain the final semantic-lev… view at source ↗
Figure 3
Figure 3: Overview of OVRSIS95K. The dataset is organized into five representative remote sensing scene domains: wasteland, industrial, town, waterfront, and forest. The upper part provides semantic descriptions and characteristic examples for each domain, while the lower part presents representative image patches with corresponding dense annotations. … view at source ↗
Figure 4
Figure 4: Evolutionary overview of the OVRSISBench series. OVRSISBenchV1 (left) focuses on standard open-vocabulary protocols with a primary emphasis on basic remote sensing datasets. In contrast, OVRSISBenchV2 (right) represents a significant expansion in both scale and diversity, utilizing the comprehensive OVRSIS95K dataset for training and incorporating multiple specialized downstream tasks to evaluate model gen… view at source ↗
Figure 6
Figure 6: Comparison between RSKT-Seg and Pi-Seg. (a) RSKT … view at source ↗
Figure 7
Figure 7: Overview of the proposed Pi-Seg. (a) Main pipeline. CLIP image and text encoders extract visual and textual features, which are perturbed by two semantic-aware perturbation modules (SPMs): Image-SPM and Text-SPM, followed by cost-map construction and upsampling-based decoding. (b) Text perturbation. A learnable noise generator transforms deterministic text embeddings into stochastic ones to improve semantic r… view at source ↗
Figure 8
Figure 8: Qualitative comparison of segmentation results on OVRSISBenchV1 and OVRSISBenchV2. The left panel shows representative results on OVRSISBenchV1, while the right panel presents corresponding examples on OVRSISBenchV2. We compare the proposed Pi-Seg with several strong baselines. Across both benchmarks, Pi-Seg produces segmentation maps that are more spatially complete, semantically coherent, and better alig… view at source ↗
Figure 9
Figure 9: Visualization of Downstream Tasks: Building Extraction, Flood Extraction, Road Extraction. … view at source ↗
Figure 10
Figure 10: Training dynamics of the proposed Pi-Seg measured by four Δ-correlation statistics. We report gt_in_mean, non_gt_mean, gap, and align_ratio throughout training. From left to right, the curves correspond to training on DLRSD, iSAID, and OVRSIS95K. As training progresses, the learned perturbation strengthens correlation responses in ground-truth regions while suppressing those in non-target regions, resulti… view at source ↗
Figure 11
Figure 11: Qualitative examples of fine-grained cost refinement with the proposed Pi-Seg. Compared with the original cost maps, Pi-Noise introduces class-aware corrections that highlight target regions and suppress spurious background responses. The refined cost maps are more spatially concentrated and better aligned with binary ground truth. … view at source ↗
Figure 12
Figure 12: Representative failure cases of Pi-Seg. Pi-Seg may … view at source ↗
read the original abstract

Open-vocabulary remote sensing image segmentation (OVRSIS) remains underexplored due to fragmented datasets, limited training diversity, and the lack of evaluation benchmarks that reflect realistic geospatial application demands. Our previous OVRSISBenchV1 established an initial cross-dataset evaluation protocol, but its limited scope is insufficient for assessing realistic open-world generalization. To address this issue, we propose OVRSISBenchV2, a large-scale and application-oriented benchmark for OVRSIS. We first construct OVRSIS95K, a balanced dataset of about 95K image-mask pairs covering 35 common semantic categories across diverse remote sensing scenes. Built upon OVRSIS95K and 10 downstream datasets, OVRSISBenchV2 contains 170K images and 128 categories, substantially expanding scene diversity, semantic coverage, and evaluation difficulty. Beyond standard open-vocabulary segmentation, it further includes downstream protocols for building extraction, road extraction, and flood detection, thereby better reflecting realistic geospatial application demands and complex deployment scenarios. We also propose Pi-Seg, a baseline for OVRSIS. Pi-Seg improves transferability through a positive-incentive noise mechanism, where learnable and semantically guided perturbations broaden the visual-text feature space during training. Extensive experiments on OVRSISBenchV1, OVRSISBenchV2, and downstream tasks show that Pi-Seg delivers strong and consistent results, particularly on the more challenging OVRSISBenchV2 benchmark. Our results highlight both the importance of realistic benchmark design and the effectiveness of perturbation-based transfer for OVRSIS. The code and datasets are available at https://github.com/LiBingyu01/RSKT-Seg/tree/Pi-Seg.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claims rest on standard computer-vision assumptions about feature alignment between images and text plus the representativeness of the assembled remote-sensing data; the positive-incentive noise is a new training device without independent prior validation.
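
For readers outside this subfield, the sketch below shows the generic CLIP-style cost-map construction that the alignment assumption in this ledger refers to: cosine similarity between dense visual features and per-class text embeddings. Shapes and the temperature value are illustrative choices, not values taken from the paper.

    import torch
    import torch.nn.functional as F

    def build_cost_map(visual: torch.Tensor, text: torch.Tensor,
                       tau: float = 0.07) -> torch.Tensor:
        # visual: (B, C, H, W) dense image features; text: (K, C) class embeddings.
        # Returns a (B, K, H, W) cosine-similarity cost map, sharpened by tau.
        v = F.normalize(visual, dim=1)  # unit norm per pixel
        t = F.normalize(text, dim=1)    # unit norm per class
        return torch.einsum("bchw,kc->bkhw", v, t) / tau

    cost = build_cost_map(torch.randn(2, 512, 32, 32), torch.randn(128, 512))
    print(cost.shape)  # torch.Size([2, 128, 32, 32])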

axioms (1)
  • domain assumption Visual and text encoders can be aligned via contrastive or perturbation-based training to support open-vocabulary generalization.
    Invoked in the description of Pi-Seg broadening the visual-text feature space.
invented entities (1)
  • positive-incentive noise mechanism no independent evidence
    purpose: Learnable semantically guided perturbations to broaden visual-text feature space during training.
    New device introduced in the paper to improve transferability; no external falsifiable evidence provided in the abstract.

pith-pipeline@v0.9.0 · 5662 in / 1288 out tokens · 42348 ms · 2026-05-10T08:16:55.367852+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

104 extracted references · 15 canonical work pages · 2 internal anchors

  1. [1]

    Fully convolutional networks for semantic segmentation,

    J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440

  2. [2]

    U-net: Convolutional networks for biomedical image segmentation,

    O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III. Springer, 2015, pp. 234–241

  3. [3]

    Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,

    L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017

  4. [4]

    Segnet: A deep convolutional encoder-decoder architecture for image segmentation,

    V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017

  5. [5]

    Rethinking Atrous Convolution for Semantic Image Segmentation

    L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017

  6. [6]

    Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,

    S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr et al., “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6881–6890

  7. [7]

    Remote sensing image segmentation advances: A meta-analysis,

    I. Kotaridis and M. Lazaridou, “Remote sensing image segmentation advances: A meta-analysis,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 173, pp. 309–322, 2021

  8. [8]

    Resunet-a: A deep learning framework for semantic segmentation of remotely sensed data,

    F. I. Diakogiannis, F. Waldner, P. Caccetta, and C. Wu, “Resunet-a: A deep learning framework for semantic segmentation of remotely sensed data,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 162, pp. 94–114, 2020

  9. [9]

    Scale-aware detailed matching for few-shot aerial image semantic segmentation,

    X. Yao, Q. Cao, X. Feng, G. Cheng, and J. Han, “Scale-aware detailed matching for few-shot aerial image semantic segmentation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2021

  10. [10]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763

  11. [11]

    Scaling up visual and vision-language representation learning with noisy text supervision,

    C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 4904–4916

  12. [12]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020

  13. [13]

    Prototypical hash encoding for on-the-fly fine-grained category discovery,

    H. Zheng, N. Pu, W. Li, N. Sebe, and Z. Zhong, “Prototypical hash encoding for on-the-fly fine-grained category discovery,” Advances in Neural Information Processing Systems, vol. 37, pp. 101428–101455, 2024

  14. [14]

    Open-world deepfake attribution via confidence-aware asymmetric learning,

    H. Zheng, N. Pu, W. Li, T. Long, N. Sebe, and Z. Zhong, “Open-world deepfake attribution via confidence-aware asymmetric learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 16, 2026, pp. 13378–13386

  15. [15]

    Language-driven semantic segmentation,

    B. Li, K. Q. Weinberger, S. Belongie, V. Koltun, and R. Ranftl, “Language-driven semantic segmentation,” arXiv preprint arXiv:2201.03546, 2022

  16. [16]

    Groupvit: Semantic segmentation emerges from text supervision,

    J. Xu, S. De Mello, S. Liu, W. Byeon, T. Breuel, J. Kautz, and X. Wang, “Groupvit: Semantic segmentation emerges from text supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18134–18144

  17. [17]

    Scaling open-vocabulary image segmentation with image-level labels,

    G. Ghiasi, X. Gu, Y. Cui, and T.-Y. Lin, “Scaling open-vocabulary image segmentation with image-level labels,” in European Conference on Computer Vision. Springer, 2022, pp. 540–557

  18. [18]

    A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model,

    M. Xu, Z. Zhang, F. Wei, Y. Lin, Y. Cao, H. Hu, and X. Bai, “A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model,” in European Conference on Computer Vision. Springer, 2022, pp. 736–753

  19. [19]

    Open-vocabulary semantic segmentation with mask-adapted clip,

    F. Liang, B. Wu, X. Dai, K. Li, Y. Zhao, H. Zhang, P. Zhang, P. Vajda, and D. Marculescu, “Open-vocabulary semantic segmentation with mask-adapted clip,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7061–7070

  20. [20]

    Open-vocabulary panoptic segmentation with text-to-image diffusion models,

    J. Xu, S. Liu, A. Vahdat, W. Byeon, X. Wang, and S. De Mello, “Open-vocabulary panoptic segmentation with text-to-image diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2955–2966

  21. [21]

    Open vocabulary semantic segmentation with patch aligned contrastive learning,

    J. Mukhoti, T.-Y. Lin, O. Poursaeed, R. Wang, A. Shah, P. H. S. Torr, and S.-N. Lim, “Open vocabulary semantic segmentation with patch aligned contrastive learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19413–19423

  22. [22]

    Open-vocabulary semantic segmentation with decoupled one-pass network,

    C. Han, Y. Zhong, D. Li, K. Han, and L. Ma, “Open-vocabulary semantic segmentation with decoupled one-pass network,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 1086–1096

  23. [23]

    Cat-seg: Cost aggregation for open-vocabulary semantic segmentation,

    S. Cho, H. Shin, S. Hong, A. Arnab, P. H. Seo, and S. J. Kim, “Cat-seg: Cost aggregation for open-vocabulary semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4113–4123

  24. [24]

    Exploring efficient open-vocabulary segmentation in the remote sensing,

    B. Li, H. Dong, D. Zhang, Z. Zhao, H. Sun, and J. Gao, “Exploring efficient open-vocabulary segmentation in the remote sensing,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 8, 2026, pp. 5982–5991

  25. [25]

    Open-vocabulary high-resolution remote sensing image semantic segmentation,

    Q. Cao, Y. Chen, C. Ma, and X. Yang, “Open-vocabulary high-resolution remote sensing image semantic segmentation,” IEEE Transactions on Geoscience and Remote Sensing, 2025

  26. [26]

    Towards open-vocabulary remote sensing image semantic segmentation,

    C. Ye, Y. Zhuge, and P. Zhang, “Towards open-vocabulary remote sensing image semantic segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 9, 2025, pp. 9436–9444

  27. [27]

    Toward open-world semantic segmentation of remote sensing images,

    Y. Chen and L. Bruzzone, “Toward open-world semantic segmentation of remote sensing images,” in IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2023, pp. 5045–5048

  28. [28]

    Freemix: Open-vocabulary domain generalization of remote-sensing images for semantic segmentation,

    J. Wu, J. Shi, Z. Zhao, Z. Liu, and R. Zhi, “Freemix: Open-vocabulary domain generalization of remote-sensing images for semantic segmentation,” Remote Sensing, vol. 17, no. 8, p. 1357, 2025

  29. [29]

    Bridging vision foundation and vision–language models for open-vocabulary semantic segmentation of uav imagery,

    F. Li, Z. Zhang, X. Wang, X. Wang, and Y. Xu, “Bridging vision foundation and vision–language models for open-vocabulary semantic segmentation of uav imagery,” Remote Sensing, vol. 17, no. 22, p. 3704, 2025

  30. [30]

    Rsclip for training-free open- vocabulary remote sensing image semantic segmentation,

    S. Wang, X. Sun, J. Han, and X. X. Zhu, “Rsclip for training-free open-vocabulary remote sensing image semantic segmentation,” Authorea Preprints, 2025

  31. [31]

    Open-vocabulary panoptic segmentation with maskclip,

    Z. Ding, J. Wang, and Z. Tu, “Open-vocabulary panoptic segmentation with maskclip,” arXiv preprint arXiv:2208.01343, 2022

  32. [32]

    Extract free dense labels from clip,

    C. Zhou, C. C. Loy, and B. Dai, “Extract free dense labels from clip,” in European Conference on Computer Vision. Springer, 2022, pp. 696–712

  33. [33]

    Open-vocabulary semantic segmentation with image embedding balancing,

    X. Shan, D. Wu, G. Zhu, Y. Shao, N. Sang, and C. Gao, “Open-vocabulary semantic segmentation with image embedding balancing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 28412–28421

  34. [34]

    Segearth-ov: Towards training-free open-vocabulary segmentation for remote sensing images,

    K. Li, R. Liu, X. Cao, X. Bai, F. Zhou, D. Meng, and Z. Wang, “Segearth-ov: Towards training-free open-vocabulary segmentation for remote sensing images,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 10545–10556

  35. [35]

    Aeroseg: Harnessing sam for open-vocabulary segmentation in remote sensing images,

    S. Dutta, A. Vasim, S. Gole, H. Rezatofighi, and B. Banerjee, “Aeroseg: Harnessing sam for open-vocabulary segmentation in remote sensing images,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 2254–2264

  36. [36]

    SAM-MI: A mask-injected framework for enhancing open-vocabulary semantic segmentation with SAM,

    L. Chen, Y. Zhu, Q. Yang, X. Niu, K. Ding, and S. Xiang, “SAM-MI: A mask-injected framework for enhancing open-vocabulary semantic segmentation with SAM,” arXiv preprint arXiv:2511.20027, 2025

  37. [37]

    Soft-guided open-vocabulary semantic segmentation of remote sensing images,

    K. An, Y. Wang, and L. Chen, “Soft-guided open-vocabulary semantic segmentation of remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–16, 2025

  38. [38]

    Text-to-image activation for open-vocabulary semantic segmentation in remote sensing,

    W. Wang, A. Xiao, W. He, H. Zhu, and L. Xiao, “Text-to-image activation for open-vocabulary semantic segmentation in remote sensing,” IEEE Transactions on Geoscience and Remote Sensing, 2025

  39. [39]

    Integrating sam with feature interaction for remote sensing change detection,

    D. Zhang, F. Wang, L. Ning, Z. Zhao, J. Gao, and X. Li, “Integrating sam with feature interaction for remote sensing change detection,” IEEE Transactions on Geoscience and Remote Sensing, 2024

  40. [40]

    Score: Scene context matters in open-vocabulary remote sensing instance segmentation,

    S. Huang, S. He, H. Qin, and B. Wen, “Score: Scene context matters in open-vocabulary remote sensing instance segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 12559–12569

  41. [41]

    Tpov-seg: Textually enhanced prompt tuning of vision-language models for open-vocabulary remote sensing semantic segmentation,

    X. Zhang, C. Zhou, J. Huang, and L. Zhang, “Tpov-seg: Textually enhanced prompt tuning of vision-language models for open-vocabulary remote sensing semantic segmentation,” IEEE Transactions on Geoscience and Remote Sensing, 2025

  42. [42]

    Open-vocabulary object detection for high-resolution remote sensing images,

    H. Li, “Open-vocabulary object detection for high-resolution remote sensing images,” Computer Vision and Image Understanding, p. 104566, 2025

  43. [43]

    Annotation-free open-vocabulary segmentation for remote-sensing images,

    K. Li, X. Cao, R. Liu, S. Wang, Z. Jiang, Z. Wang, and D. Meng, “Annotation-free open-vocabulary segmentation for remote-sensing images,” arXiv preprint arXiv:2508.18067, 2025

  44. [44]

    Open-vocabulary segmentation for remote sensing,

    V. Gunda, J. Klenke, and A. Mehta, “Open-vocabulary segmentation for remote sensing,” in 2024 IEEE MIT Undergraduate Research Technology Conference (URTC). IEEE, 2024, pp. 1–5

  45. [45]

    Uavid: A semantic segmentation dataset for uav imagery,

    Y. Lyu, G. Vosselman, G.-S. Xia, A. Yilmaz, and M. Y. Yang, “Uavid: A semantic segmentation dataset for uav imagery,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 165, pp. 108–119, 2020. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0924271620301295

  46. [46]

    Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation,

    J. Wang, Z. Zheng, A. Ma, X. Lu, and Y. Zhong, “Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation,” in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, J. Vanschoren and S. Yeung, Eds., vol. 1. Curran Associates, Inc., 2021. [Online]. Available: https://datasets-benchmarks-procee...

  47. [47]

    Vdd: Varied drone dataset for semantic segmentation,

    W. Cai, K. Jin, J. Hou, C. Guo, L. Wu, and W. Yang, “Vdd: Varied drone dataset for semantic segmentation,” Journal of Visual Communication and Image Representation, vol. 109, p. 104429, 2025

  48. [48]

    Expanding open-vocabulary understanding for uav aerial imagery: A vision–language framework to semantic segmentation,

    B. Huang, J. Li, W. Luan, J. Tan, C. Li, and L. Huang, “Expanding open-vocabulary understanding for uav aerial imagery: A vision–language framework to semantic segmentation,” Drones, vol. 9, no. 2, p. 155, 2025

  49. [49]

    Aeriaiclip: Lightweight open-vocabulary segmentation for uav-based aerial images,

    P. Jia, Y. Gao, W. Li, Q. Gao, and F. Pan, “Aeriaiclip: Lightweight open-vocabulary segmentation for uav-based aerial images,” in 2025 44th Chinese Control Conference (CCC). IEEE, 2025, pp. 8193–8198

  50. [50]

    Toward high-resolution uav imagery open-vocabulary semantic segmentation,

    Z. Chen, Y. Xie, and Y. Wei, “Toward high-resolution uav imagery open-vocabulary semantic segmentation,” Drones, vol. 9, no. 7, p. 470, 2025

  51. [51]

    Rsvg-zeroov: Exploring a training-free framework for zero-shot open-vocabulary visual grounding in remote sensing images,

    K. Li, D. Wang, T. Wang, F. Dong, Y. Zhang, L. Zhang, X. Wang, S. Li, and Q. Wang, “Rsvg-zeroov: Exploring a training-free framework for zero-shot open-vocabulary visual grounding in remote sensing images,” arXiv preprint arXiv:2509.18711, 2025

  52. [52]

    U3m: Unbiased multiscale modal fusion model for multimodal semantic segmentation,

    B. Li, D. Zhang, Z. Zhao, J. Gao, and X. Li, “U3m: Unbiased multiscale modal fusion model for multimodal semantic segmentation,” arXiv preprint arXiv:2405.15365, 2024

  53. [53]

    Stitchfusion: Weaving any visual modalities to enhance multimodal semantic segmentation,

    ——, “Stitchfusion: Weaving any visual modalities to enhance multimodal semantic segmentation,” arXiv preprint arXiv:2408.01343, 2024

  54. [54]

    Tri-subspaces disentanglement for multimodal sentiment analysis,

    C. Meng, J. Luo, Z. Yan, Z. Yu, R. Fu, Z. Gan, and C. Ouyang, “Tri-subspaces disentanglement for multimodal sentiment analysis,” CVPR 2026, 2026

  55. [55]

    Segformer: Simple and efficient design for semantic segmentation with transformers,

    E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 12077–12090, 2021

  56. [56]

    Per-pixel classification is not all you need for semantic segmentation,

    B. Cheng, A. Schwing, and A. Kirillov, “Per-pixel classification is not all you need for semantic segmentation,” Advances in Neural Information Processing Systems, vol. 34, pp. 17864–17875, 2021

  57. [57]

    Masked-attention mask transformer for universal image segmentation,

    B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2022, pp. 1290–1299

  58. [58]

    Panoptic segmentation,

    A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9404–9413

  59. [59]

    Yolov8-dds: A lightweight model based on pruning and distillation for early detection of root mold in barley seedling,

    H. Junjie, M. Zheng, W. Yuzhu, B. Yujian, W. Yizhe, S. Zhongbin, and G. Lifeng, “Yolov8-dds: A lightweight model based on pruning and distillation for early detection of root mold in barley seedling,” Information Processing in Agriculture, 2025

  60. [60]

    Fine-grained preference optimization improves spatial reasoning in vlms,

    Y. Shen, Y. Liu, J. Zhu, X. Cao, X. Zhang, Y. He, W. Ye, J. M. Rehg, and I. Lourentzou, “Fine-grained preference optimization improves spatial reasoning in vlms,” arXiv preprint arXiv:2506.21656, 2025

  61. [61]

    Toward cognitive supersensing in multimodal large language model,

    B. Li, Y. Shen, Y. Liu, Y. Xu, J. Liu, X. Li, Z. Li, J. Zhu, Y. Zhong, F. Lan et al., “Toward cognitive supersensing in multimodal large language model,” arXiv preprint arXiv:2602.01541, 2026

  62. [62]

    Reasoning in computer vision: Taxonomy, models, tasks, and methodologies,

    A. Sarkar, M. Y. I. Idris, and Z. Yu, “Reasoning in computer vision: Taxonomy, models, tasks, and methodologies,” arXiv preprint arXiv:2508.10523, 2025

  63. [63]

    Open-vocabulary segmentation with semantic-assisted calibration,

    Y. Liu, S. Bai, G. Li, Y. Wang, and Y. Tang, “Open-vocabulary segmentation with semantic-assisted calibration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3491–3500

  64. [64]

    Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip,

    Q. Yu, J. He, X. Deng, X. Shen, and L.-C. Chen, “Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip,” Advances in Neural Information Processing Systems, vol. 36, pp. 32215–32234, 2023

  65. [65]

    Eov-seg: Efficient open-vocabulary panoptic segmentation,

    H. Niu, J. Hu, J. Lin, G. Jiang, and S. Zhang, “Eov-seg: Efficient open-vocabulary panoptic segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 6, 2025, pp. 6254–6262

  66. [66]

    Learning mask-aware clip representations for zero-shot segmentation,

    S. Jiao, Y. Wei, Y. Wang, Y. Zhao, and H. Shi, “Learning mask-aware clip representations for zero-shot segmentation,” Advances in Neural Information Processing Systems, vol. 36, pp. 35631–35653, 2023

  67. [67]

    Collaborative vision-text representation optimizing for open-vocabulary segmentation,

    S. Jiao, H. Zhu, J. Huang, Y. Zhao, Y. Wei, and H. Shi, “Collaborative vision-text representation optimizing for open-vocabulary segmentation,” in Proceedings of the European Conference on Computer Vision, 2025, pp. 399–416

  68. [68]

    Sed: A simple encoder-decoder for open-vocabulary semantic segmentation,

    B. Xie, J. Cao, J. Xie, F. S. Khan, and Y. Pang, “Sed: A simple encoder-decoder for open-vocabulary semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3426–3436

  69. [69]

    Efficient redundancy reduction for open-vocabulary semantic segmentation,

    L. Chen, Q. Yang, K. Ding, Z. Li, G. Shen, F. Li, Q. Cao, and S. Xiang, “Efficient redundancy reduction for open-vocabulary semantic segmentation,” Neurocomputing, vol. 665, p. 132229, 2026. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0925231225029017

  70. [70]

    Proxyclip: Proxy attention improves clip for open-vocabulary segmentation,

    M. Lan, C. Chen, Y. Ke, X. Wang, L. Feng, and W. Zhang, “Proxyclip: Proxy attention improves clip for open-vocabulary segmentation,” in European Conference on Computer Vision. Springer, 2024, pp. 70–88

  71. [71]

    Sclip: Rethinking self-attention for dense vision-language inference,

    F. Wang, J. Mei, and A. Yuille, “Sclip: Rethinking self-attention for dense vision-language inference,” in European Conference on Computer Vision. Springer, 2024, pp. 315–332

  72. [72]

    Clearclip: Decomposing clip representations for dense vision-language inference,

    M. Lan, C. Chen, Y. Ke, X. Wang, L. Feng, and W. Zhang, “Clearclip: Decomposing clip representations for dense vision-language inference,” in European Conference on Computer Vision. Springer, 2024, pp. 143–160

  73. [73]

    Explore the potential of clip for training-free open vocabulary semantic segmentation,

    T. Shao, Z. Tian, H. Zhao, and J. Su, “Explore the potential of clip for training-free open vocabulary semantic segmentation,” in European Conference on Computer Vision. Springer, 2024, pp. 139–156

  74. [74]

    Self-calibrated clip for training-free open-vocabulary segmentation,

    S. Bai, Y. Liu, Y. Han, H. Zhang, Y. Tang, J. Zhou, and J. Lu, “Self-calibrated clip for training-free open-vocabulary segmentation,” IEEE Transactions on Image Processing, 2025

  75. [75]

    Generalization boosted adapter for open-vocabulary segmentation,

    W. Xu, C. Wang, X. Feng, R. Xu, L. Huang, Z. Zhang, L. Guo, and S. Xu, “Generalization boosted adapter for open-vocabulary segmentation,” IEEE Transactions on Circuits and Systems for Video Technology, 2024

  76. [76]

    Talking to dino: Bridging self-supervised vision backbones with language for open-vocabulary segmentation,

    L. Barsellotti, L. Bianchi, N. Messina, F. Carrara, M. Cornia, L. Baraldi, F. Falchi, and R. Cucchiara, “Talking to dino: Bridging self-supervised vision backbones with language for open-vocabulary segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 22025–22035

  77. [77]

    Llmformer: Large language model for open-vocabulary semantic segmentation,

    H. Shi, S. D. Dao, and J. Cai, “Llmformer: Large language model for open-vocabulary semantic segmentation,” International Journal of Computer Vision, vol. 133, no. 2, pp. 742–759, 2025

  78. [78]

    Towards open-vocabulary video semantic segmentation,

    X. Li, Y. Liu, G. Sun, M. Wu, L. Zhang, and C. Zhu, “Towards open-vocabulary video semantic segmentation,” IEEE Transactions on Multimedia, 2025

  79. [79]

    Hierarchical open-vocabulary universal image segmentation,

    X. Wang, S. Li, K. Kallidromitis, Y. Kato, K. Kozuka, and T. Darrell, “Hierarchical open-vocabulary universal image segmentation,” Advances in Neural Information Processing Systems, vol. 36, 2024

  80. [80]

    Large multimodal model for open vocabulary semantic segmentation of remote sensing images,

    B. Liu, X. Chen, A. Yu, F. Feng, J. Yue, and X. Yu, “Large multimodal model for open vocabulary semantic segmentation of remote sensing images,” European Journal of Remote Sensing, vol. 58, no. 1, p. 2447344, 2025

Showing first 80 references.