pith. machine review for the scientific record.

arxiv: 2604.15652 · v1 · submitted 2026-04-17 · 💻 cs.CV

Recognition: unknown

Towards Realistic Open-Vocabulary Remote Sensing Segmentation: Benchmark and Baseline

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords open-vocabulary segmentation · remote sensing · benchmark dataset · positive-incentive noise · image segmentation · OVRSISBenchV2 · transfer learning

The pith

OVRSISBenchV2 provides a more realistic and challenging benchmark for open-vocabulary remote sensing segmentation, on which the Pi-Seg baseline achieves strong performance through positive-incentive noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the fragmented and limited datasets in open-vocabulary remote sensing image segmentation by creating OVRSISBenchV2. The benchmark combines OVRSIS95K, a newly constructed dataset of roughly 95K image-mask pairs, with 10 downstream datasets to reach 170K images and 128 categories, and adds protocols for practical tasks such as building extraction. A sympathetic reader would care because current evaluations may not reflect real geospatial needs, producing methods that fail in deployment. The authors also present Pi-Seg, which uses a positive-incentive noise mechanism to improve how visual and text features align during training. If this holds, it suggests that both better data and targeted perturbations can advance generalization in this domain.
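
To make the assembly step concrete, here is a minimal sketch of how per-dataset label spaces might be merged into a single benchmark taxonomy. The dataset names and category lists are hypothetical placeholders, not the paper's actual manifest, which spans OVRSIS95K plus 10 downstream datasets and 128 categories.

    # Minimal sketch: merging per-dataset label spaces into one taxonomy.
    # Dataset names and categories below are hypothetical placeholders.
    def build_taxonomy(datasets: dict[str, list[str]]) -> dict[str, int]:
        """Union all category names (case-normalized) into stable integer ids."""
        merged: dict[str, int] = {}
        for _dataset, categories in sorted(datasets.items()):  # sort for determinism
            for cat in categories:
                key = cat.strip().lower()
                if key not in merged:
                    merged[key] = len(merged)  # assign the next free id
        return merged

    sources = {
        "ovrsis95k_subset": ["building", "road", "forest", "water"],
        "flood_task": ["water", "flooded area", "background"],
    }
    taxonomy = build_taxonomy(sources)
    print(len(taxonomy), "merged categories")  # duplicates such as "water" collapse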

Core claim

OVRSISBenchV2 substantially expands scene diversity, semantic coverage, and evaluation difficulty beyond prior work, and Pi-Seg delivers strong and consistent results, particularly on the more challenging OVRSISBenchV2, by using positive-incentive noise to broaden the visual-text feature space.

What carries the argument

The positive-incentive noise mechanism in Pi-Seg, consisting of learnable and semantically guided perturbations that broaden the visual-text feature space during training to improve transferability.
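
As a rough illustration of what a learnable, semantics-conditioned perturbation could look like, the sketch below injects Gaussian noise into CLIP text embeddings via the reparameterization trick. The module name, shapes, and the Gaussian form are assumptions made for illustration; the paper's actual Image-SPM and Text-SPM may differ in both form and conditioning.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LearnableTextPerturbation(nn.Module):
        # Hedged sketch of a semantics-guided noise module: each deterministic
        # text embedding is mapped to a Gaussian whose shift and scale are
        # learned functions of the embedding itself.
        def __init__(self, dim: int):
            super().__init__()
            self.mu = nn.Linear(dim, dim)         # learned, embedding-conditioned shift
            self.log_sigma = nn.Linear(dim, dim)  # learned, embedding-conditioned log-scale

        def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
            # text_emb: (num_classes, dim) deterministic CLIP text embeddings
            eps = torch.randn_like(text_emb)  # reparameterization trick
            noisy = text_emb + self.mu(text_emb) + eps * self.log_sigma(text_emb).exp()
            return F.normalize(noisy, dim=-1)  # keep embeddings on the unit sphere

    spm = LearnableTextPerturbation(dim=512)
    perturbed = spm(torch.randn(128, 512))  # e.g. one embedding per benchmark category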

If this is right

  • OVRSISBenchV2 enables more rigorous testing of open-vocabulary methods in remote sensing.
  • Pi-Seg shows consistent improvements across OVRSISBenchV1, V2, and downstream tasks.
  • Realistic benchmarks better reflect complex deployment scenarios like flood detection.
  • Positive-incentive noise enhances cross-dataset generalization in segmentation models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar benchmark expansion approaches could be applied to other specialized domains like medical imaging.
  • The noise perturbation idea might transfer to general open-vocabulary segmentation outside remote sensing.
  • Future work could test if the benchmark's diversity reduces selection bias in model development.
  • Integration with other sensors or modalities could further test the method's robustness.

Load-bearing premise

The manually selected categories and datasets accurately capture the full range of realistic geospatial application demands without introducing selection bias or missing key scene and sensor variations.

What would settle it

A follow-up experiment where Pi-Seg fails to outperform standard baselines on OVRSISBenchV2 or where new datasets reveal that the 128 categories do not cover important real-world cases would challenge the central claims.

Figures

Figures reproduced from arXiv: 2604.15652 by Bingyu Li, Da Zhang, Haocheng Dong, Junyu Gao, Tao Huo, Xuelong Li, Zhiyuan Zhao.

Figure 1
Figure 1: From OVRSISBenchV1/RSKT-Seg (V1 Paper) to OVRSISBenchV2/Pi-Seg (V2 Paper). (a.1) OVRSISBenchV1 adopts single-source cross-dataset evaluation. (a.2) RSKT-Seg improves OVRSIS through remote-sensing-aware cost-map construction and refinement. (b) OVRSISBenchV2 expands the training foundation to 95K pairs across five scene domains, while Pi-Seg provides a noise-aware baseline with stronger generalization under … view at source ↗
Figure 2
Figure 2: Two-stage pipeline for OVRSIS95K Construction. Given an input remote sensing image, Stage 1 first generates a detailed scene caption and parses it into candidate object categories, which are then aligned with the dataset taxonomy through filtering and matching. Stage 2 uses the resulting categories to produce instance-level masks, followed by mask combination and correction to obtain the final semantic-lev… view at source ↗
Figure 3
Figure 3: Overview of OVRSIS95K. The dataset is organized into five representative remote sensing scene domains: wasteland, industrial, town, waterfront, and forest. The upper part provides semantic descriptions and characteristic examples for each domain, while the lower part presents representative image patches with corresponding dense annotations. … view at source ↗
Figure 4
Figure 4: Evolutionary overview of the OVRSISBench series. OVRSISBenchV1 (left) focuses on standard open-vocabulary protocols with a primary emphasis on basic remote sensing datasets. In contrast, OVRSISBenchV2 (right) represents a significant expansion in both scale and diversity, utilizing the comprehensive OVRSIS95K dataset for training and incorporating multiple specialized downstream tasks to evaluate model gen… view at source ↗
Figure 6
Figure 6: Comparison between RSKT-Seg and Pi-Seg. (a) RSKT … view at source ↗
Figure 7
Figure 7: Overview of the proposed Pi-Seg. (a) Main pipeline. CLIP image and text encoders extract visual and textual features, which are perturbed by two semantic-aware perturbation modules (SPMs): Image-SPM and Text-SPM, followed by cost-map construction and upsampling-based decoding. (b) Text perturbation. A learnable noise generator transforms deterministic text embeddings into stochastic ones to improve semantic r… view at source ↗
Figure 8
Figure 8: Qualitative comparison of segmentation results on OVRSISBenchV1 and OVRSISBenchV2. The left panel shows representative results on OVRSISBenchV1, while the right panel presents corresponding examples on OVRSISBenchV2. We compare the proposed Pi-Seg with several strong baselines. Across both benchmarks, Pi-Seg produces segmentation maps that are more spatially complete, semantically coherent, and better alig… view at source ↗
Figure 9
Figure 9: Visualization of Downstream Tasks: Building Extraction, Flood Extraction, Road Extraction. … view at source ↗
Figure 10
Figure 10: Training dynamics of the proposed Pi-Seg measured by four Δ-correlation statistics. We report gt_in_mean, non_gt_mean, gap, and align_ratio throughout training. From left to right, the curves correspond to training on DLRSD, iSAID, and OVRSIS95K. As training progresses, the learned perturbation strengthens correlation responses in ground-truth regions while suppressing those in non-target regions, resulti… view at source ↗
Figure 11
Figure 11: Qualitative examples of fine-grained cost refinement with the proposed Pi-Seg. Compared with the original cost maps, Pi-Noise introduces class-aware corrections that highlight target regions and suppress spurious background responses. The refined cost maps are more spatially concentrated and better aligned with binary ground truth. … view at source ↗
Figure 12
Figure 12: Representative failure cases of Pi-Seg. Pi-Seg may … view at source ↗
read the original abstract

Open-vocabulary remote sensing image segmentation (OVRSIS) remains underexplored due to fragmented datasets, limited training diversity, and the lack of evaluation benchmarks that reflect realistic geospatial application demands. Our previous OVRSISBenchV1 established an initial cross-dataset evaluation protocol, but its limited scope is insufficient for assessing realistic open-world generalization. To address this issue, we propose OVRSISBenchV2, a large-scale and application-oriented benchmark for OVRSIS. We first construct OVRSIS95K, a balanced dataset of about 95K image-mask pairs covering 35 common semantic categories across diverse remote sensing scenes. Built upon OVRSIS95K and 10 downstream datasets, OVRSISBenchV2 contains 170K images and 128 categories, substantially expanding scene diversity, semantic coverage, and evaluation difficulty. Beyond standard open-vocabulary segmentation, it further includes downstream protocols for building extraction, road extraction, and flood detection, thereby better reflecting realistic geospatial application demands and complex deployment scenarios. We also propose Pi-Seg, a baseline for OVRSIS. Pi-Seg improves transferability through a positive-incentive noise mechanism, where learnable and semantically guided perturbations broaden the visual-text feature space during training. Extensive experiments on OVRSISBenchV1, OVRSISBenchV2, and downstream tasks show that Pi-Seg delivers strong and consistent results, particularly on the more challenging OVRSISBenchV2 benchmark. Our results highlight both the importance of realistic benchmark design and the effectiveness of perturbation-based transfer for OVRSIS. The code and datasets are available at https://github.com/LiBingyu01/RSKT-Seg/tree/Pi-Seg.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claims rest on standard computer-vision assumptions about feature alignment between images and text plus the representativeness of the assembled remote-sensing data; the positive-incentive noise is a new training device without independent prior validation.
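
For readers outside this subfield, the sketch below shows the generic CLIP-style cost-map construction that the alignment assumption in this ledger refers to: cosine similarity between dense visual features and per-class text embeddings. Shapes and the temperature value are illustrative choices, not values taken from the paper.

    import torch
    import torch.nn.functional as F

    def build_cost_map(visual: torch.Tensor, text: torch.Tensor,
                       tau: float = 0.07) -> torch.Tensor:
        # visual: (B, C, H, W) dense image features; text: (K, C) class embeddings.
        # Returns a (B, K, H, W) cosine-similarity cost map, sharpened by tau.
        v = F.normalize(visual, dim=1)  # unit norm per pixel
        t = F.normalize(text, dim=1)    # unit norm per class
        return torch.einsum("bchw,kc->bkhw", v, t) / tau

    cost = build_cost_map(torch.randn(2, 512, 32, 32), torch.randn(128, 512))
    print(cost.shape)  # torch.Size([2, 128, 32, 32])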

axioms (1)
  • domain assumption Visual and text encoders can be aligned via contrastive or perturbation-based training to support open-vocabulary generalization.
    Invoked in the description of Pi-Seg broadening the visual-text feature space.
invented entities (1)
  • positive-incentive noise mechanism no independent evidence
    purpose: Learnable semantically guided perturbations to broaden visual-text feature space during training.
    New device introduced in the paper to improve transferability; no external falsifiable evidence provided in the abstract.

pith-pipeline@v0.9.0 · 5662 in / 1288 out tokens · 42348 ms · 2026-05-10T08:16:55.367852+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

104 extracted references · 15 canonical work pages · 2 internal anchors

  1. [1]

    Fully convolutional networks for semantic segmentation,

    J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440

  2. [2]

    U-net: Convolutional networks for biomedical image segmentation,

    O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III. Springer, 2015, pp. 234–241

  3. [3]

    Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,

    L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017

  4. [4]

    Segnet: A deep convolutional encoder-decoder architecture for image segmentation,

    V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017

  5. [5]

    Rethinking Atrous Convolution for Semantic Image Segmentation

    L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017

  6. [6]

    Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,

    S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr et al., “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6881–6890

  7. [7]

    Remote sensing image segmentation advances: A meta-analysis,

    I. Kotaridis and M. Lazaridou, “Remote sensing image segmentation advances: A meta-analysis,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 173, pp. 309–322, 2021

  8. [8]

    Resunet-a: A deep learning framework for semantic segmentation of remotely sensed data,

    F. I. Diakogiannis, F. Waldner, P. Caccetta, and C. Wu, “Resunet-a: A deep learning framework for semantic segmentation of remotely sensed data,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 162, pp. 94–114, 2020

  9. [9]

    Scale-aware detailed matching for few-shot aerial image semantic segmentation,

    X. Yao, Q. Cao, X. Feng, G. Cheng, and J. Han, “Scale-aware detailed matching for few-shot aerial image semantic segmentation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2021

  10. [10]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763

  11. [11]

    Scaling up visual and vision-language representation learning with noisy text supervision,

    C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 4904–4916

  12. [12]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020

  13. [13]

    Prototypical hash encoding for on-the-fly fine-grained category discovery,

    H. Zheng, N. Pu, W. Li, N. Sebe, and Z. Zhong, “Prototypical hash encoding for on-the-fly fine-grained category discovery,” Advances in Neural Information Processing Systems, vol. 37, pp. 101428–101455, 2024

  14. [14]

    Open-world deepfake attribution via confidence-aware asymmetric learning,

    H. Zheng, N. Pu, W. Li, T. Long, N. Sebe, and Z. Zhong, “Open-world deepfake attribution via confidence-aware asymmetric learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 16, 2026, pp. 13378–13386

  15. [15]

    Language-driven semantic segmentation,

    B. Li, K. Q. Weinberger, S. Belongie, V. Koltun, and R. Ranftl, “Language-driven semantic segmentation,” arXiv preprint arXiv:2201.03546, 2022

  16. [16]

    Groupvit: Semantic segmentation emerges from text supervision,

    J. Xu, S. De Mello, S. Liu, W. Byeon, T. Breuel, J. Kautz, and X. Wang, “Groupvit: Semantic segmentation emerges from text supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18134–18144

  17. [17]

    Scaling open-vocabulary image segmentation with image-level labels,

    G. Ghiasi, X. Gu, Y. Cui, and T.-Y. Lin, “Scaling open-vocabulary image segmentation with image-level labels,” in European Conference on Computer Vision. Springer, 2022, pp. 540–557

  18. [18]

    A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model,

    M. Xu, Z. Zhang, F. Wei, Y. Lin, Y. Cao, H. Hu, and X. Bai, “A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model,” in European Conference on Computer Vision. Springer, 2022, pp. 736–753

  19. [19]

    Open-vocabulary semantic segmentation with mask-adapted clip,

    F. Liang, B. Wu, X. Dai, K. Li, Y. Zhao, H. Zhang, P. Zhang, P. Vajda, and D. Marculescu, “Open-vocabulary semantic segmentation with mask-adapted clip,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7061–7070

  20. [20]

    Open-vocabulary panoptic segmentation with text-to-image diffusion models,

    J. Xu, S. Liu, A. Vahdat, W. Byeon, X. Wang, and S. De Mello, “Open-vocabulary panoptic segmentation with text-to-image diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2955–2966

  21. [21]

    Open vocabulary semantic segmentation with patch aligned contrastive learning,

    J. Mukhoti, T.-Y. Lin, O. Poursaeed, R. Wang, A. Shah, P. H. S. Torr, and S.-N. Lim, “Open vocabulary semantic segmentation with patch aligned contrastive learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19413–19423

  22. [22]

    Open-vocabulary semantic segmentation with decoupled one-pass network,

    C. Han, Y. Zhong, D. Li, K. Han, and L. Ma, “Open-vocabulary semantic segmentation with decoupled one-pass network,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 1086–1096

  23. [23]

    Cat-seg: Cost aggregation for open-vocabulary semantic segmentation,

    S. Cho, H. Shin, S. Hong, A. Arnab, P. H. Seo, and S. J. Kim, “Cat-seg: Cost aggregation for open-vocabulary semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4113–4123

  24. [24]

    Exploring efficient open-vocabulary segmentation in the remote sensing,

    B. Li, H. Dong, D. Zhang, Z. Zhao, H. Sun, and J. Gao, “Exploring efficient open-vocabulary segmentation in the remote sensing,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 8, 2026, pp. 5982–5991

  25. [25]

    Open-vocabulary high-resolution remote sensing image semantic segmentation,

    Q. Cao, Y. Chen, C. Ma, and X. Yang, “Open-vocabulary high-resolution remote sensing image semantic segmentation,” IEEE Transactions on Geoscience and Remote Sensing, 2025

  26. [26]

    Towards open-vocabulary remote sensing image semantic segmentation,

    C. Ye, Y. Zhuge, and P. Zhang, “Towards open-vocabulary remote sensing image semantic segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 9, 2025, pp. 9436–9444

  27. [27]

    Toward open-world semantic segmentation of remote sensing images,

    Y. Chen and L. Bruzzone, “Toward open-world semantic segmentation of remote sensing images,” in IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2023, pp. 5045–5048

  28. [28]

    Freemix: Open-vocabulary domain generalization of remote-sensing images for semantic segmentation,

    J. Wu, J. Shi, Z. Zhao, Z. Liu, and R. Zhi, “Freemix: Open-vocabulary domain generalization of remote-sensing images for semantic segmentation,” Remote Sensing, vol. 17, no. 8, p. 1357, 2025

  29. [29]

    Bridging vision foundation and vision–language models for open-vocabulary semantic segmentation of uav imagery,

    F. Li, Z. Zhang, X. Wang, X. Wang, and Y. Xu, “Bridging vision foundation and vision–language models for open-vocabulary semantic segmentation of uav imagery,” Remote Sensing, vol. 17, no. 22, p. 3704, 2025

  30. [30]

    Rsclip for training-free open- vocabulary remote sensing image semantic segmentation,

    S. Wang, X. Sun, J. Han, and X. X. Zhu, “Rsclip for training-free open-vocabulary remote sensing image semantic segmentation,” Authorea Preprints, 2025

  31. [31]

    Open-vocabulary panoptic segmentation with maskclip,

    Z. Ding, J. Wang, and Z. Tu, “Open-vocabulary panoptic segmentation with maskclip,” arXiv preprint arXiv:2208.01343, 2022

  32. [32]

    Extract free dense labels from clip,

    C. Zhou, C. C. Loy, and B. Dai, “Extract free dense labels from clip,” in European Conference on Computer Vision. Springer, 2022, pp. 696–712

  33. [33]

    Open-vocabulary semantic segmentation with image embedding balancing,

    X. Shan, D. Wu, G. Zhu, Y. Shao, N. Sang, and C. Gao, “Open-vocabulary semantic segmentation with image embedding balancing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 28412–28421

  34. [34]

    Segearth-ov: Towards training-free open-vocabulary segmentation for remote sensing images,

    K. Li, R. Liu, X. Cao, X. Bai, F. Zhou, D. Meng, and Z. Wang, “Segearth-ov: Towards training-free open-vocabulary segmentation for remote sensing images,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 10545–10556

  35. [35]

    Aeroseg: Harnessing sam for open-vocabulary segmentation in remote sensing images,

    S. Dutta, A. Vasim, S. Gole, H. Rezatofighi, and B. Banerjee, “Aeroseg: Harnessing sam for open-vocabulary segmentation in remote sensing images,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 2254–2264

  36. [36]

    SAM-MI: A mask-injected framework for enhancing open-vocabulary semantic segmentation with SAM,

    L. Chen, Y. Zhu, Q. Yang, X. Niu, K. Ding, and S. Xiang, “SAM-MI: A mask-injected framework for enhancing open-vocabulary semantic segmentation with SAM,” arXiv preprint arXiv:2511.20027, 2025

  37. [37]

    Soft-guided open-vocabulary semantic segmentation of remote sensing images,

    K. An, Y. Wang, and L. Chen, “Soft-guided open-vocabulary semantic segmentation of remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–16, 2025

  38. [38]

    Text-to-image activation for open-vocabulary semantic segmentation in remote sensing,

    W. Wang, A. Xiao, W. He, H. Zhu, and L. Xiao, “Text-to-image activation for open-vocabulary semantic segmentation in remote sensing,” IEEE Transactions on Geoscience and Remote Sensing, 2025

  39. [39]

    Integrating sam with feature interaction for remote sensing change detection,

    D. Zhang, F. Wang, L. Ning, Z. Zhao, J. Gao, and X. Li, “Integrating sam with feature interaction for remote sensing change detection,” IEEE Transactions on Geoscience and Remote Sensing, 2024

  40. [40]

    Score: Scene context matters in open-vocabulary remote sensing instance segmentation,

    S. Huang, S. He, H. Qin, and B. Wen, “Score: Scene context matters in open-vocabulary remote sensing instance segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 12559–12569

  41. [41]

    Tpov-seg: Textually enhanced prompt tuning of vision-language models for open-vocabulary remote sensing semantic segmentation,

    X. Zhang, C. Zhou, J. Huang, and L. Zhang, “Tpov-seg: Textually enhanced prompt tuning of vision-language models for open-vocabulary remote sensing semantic segmentation,” IEEE Transactions on Geoscience and Remote Sensing, 2025

  42. [42]

    Open-vocabulary object detection for high-resolution remote sensing images,

    H. Li, “Open-vocabulary object detection for high-resolution remote sensing images,” Computer Vision and Image Understanding, p. 104566, 2025

  43. [43]

    Annotation-free open-vocabulary segmentation for remote-sensing images,

    K. Li, X. Cao, R. Liu, S. Wang, Z. Jiang, Z. Wang, and D. Meng, “Annotation-free open-vocabulary segmentation for remote-sensing images,” arXiv preprint arXiv:2508.18067, 2025

  44. [44]

    Open-vocabulary segmentation for remote sensing,

    V. Gunda, J. Klenke, and A. Mehta, “Open-vocabulary segmentation for remote sensing,” in 2024 IEEE MIT Undergraduate Research Technology Conference (URTC). IEEE, 2024, pp. 1–5

  45. [45]

    Uavid: A semantic segmentation dataset for uav imagery,

    Y. Lyu, G. Vosselman, G.-S. Xia, A. Yilmaz, and M. Y. Yang, “Uavid: A semantic segmentation dataset for uav imagery,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 165, pp. 108–119, 2020. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0924271620301295

  46. [46]

    Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation,

    J. Wang, Z. Zheng, A. Ma, X. Lu, and Y. Zhong, “Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation,” in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, J. Vanschoren and S. Yeung, Eds., vol. 1. Curran Associates, Inc., 2021. [Online]. Available: https://datasets-benchmarks-procee...

  47. [47]

    Vdd: Varied drone dataset for semantic segmentation,

    W. Cai, K. Jin, J. Hou, C. Guo, L. Wu, and W. Yang, “Vdd: Varied drone dataset for semantic segmentation,” Journal of Visual Communication and Image Representation, vol. 109, p. 104429, 2025

  48. [48]

    Expanding open-vocabulary understanding for uav aerial imagery: A vision–language framework to semantic segmentation,

    B. Huang, J. Li, W. Luan, J. Tan, C. Li, and L. Huang, “Expanding open-vocabulary understanding for uav aerial imagery: A vision–language framework to semantic segmentation,” Drones, vol. 9, no. 2, p. 155, 2025

  49. [49]

    Aeriaiclip: Lightweight open-vocabulary segmentation for uav-based aerial images,

    P. Jia, Y. Gao, W. Li, Q. Gao, and F. Pan, “Aeriaiclip: Lightweight open-vocabulary segmentation for uav-based aerial images,” in 2025 44th Chinese Control Conference (CCC). IEEE, 2025, pp. 8193–8198

  50. [50]

    Toward high-resolution uav imagery open-vocabulary semantic segmentation,

    Z. Chen, Y. Xie, and Y. Wei, “Toward high-resolution uav imagery open-vocabulary semantic segmentation,” Drones, vol. 9, no. 7, p. 470, 2025

  51. [51]

    Rsvg-zeroov: Exploring a training-free framework for zero-shot open-vocabulary visual grounding in remote sensing images,

    K. Li, D. Wang, T. Wang, F. Dong, Y. Zhang, L. Zhang, X. Wang, S. Li, and Q. Wang, “Rsvg-zeroov: Exploring a training-free framework for zero-shot open-vocabulary visual grounding in remote sensing images,” arXiv preprint arXiv:2509.18711, 2025

  52. [52]

    U3m: Unbiased multiscale modal fusion model for multimodal semantic segmentation,

    B. Li, D. Zhang, Z. Zhao, J. Gao, and X. Li, “U3m: Unbiased multiscale modal fusion model for multimodal semantic segmentation,” arXiv preprint arXiv:2405.15365, 2024

  53. [53]

    Stitchfusion: Weaving any visual modalities to enhance multimodal semantic segmentation,

    ——, “Stitchfusion: Weaving any visual modalities to enhance multimodal semantic segmentation,” arXiv preprint arXiv:2408.01343, 2024

  54. [54]

    Tri-subspaces disentanglement for multimodal sentiment analysis,

    C. Meng, J. Luo, Z. Yan, Z. Yu, R. Fu, Z. Gan, and C. Ouyang, “Tri-subspaces disentanglement for multimodal sentiment analysis,” CVPR 2026, 2026

  55. [55]

    Segformer: Simple and efficient design for semantic segmentation with transformers,

    E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 12077–12090, 2021

  56. [56]

    Per-pixel classification is not all you need for semantic segmentation,

    B. Cheng, A. Schwing, and A. Kirillov, “Per-pixel classification is not all you need for semantic segmentation,” Advances in Neural Information Processing Systems, vol. 34, pp. 17864–17875, 2021

  57. [57]

    Masked-attention mask transformer for universal image segmentation,

    B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2022, pp. 1290–1299

  58. [58]

    Panoptic segmentation,

    A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9404–9413

  59. [59]

    Yolov8-dds: A lightweight model based on pruning and distillation for early detection of root mold in barley seedling,

    H. Junjie, M. Zheng, W. Yuzhu, B. Yujian, W. Yizhe, S. Zhongbin, and G. Lifeng, “Yolov8-dds: A lightweight model based on pruning and distillation for early detection of root mold in barley seedling,” Information Processing in Agriculture, 2025

  60. [60]

    Fine-grained preference optimization improves spatial reasoning in vlms,

    Y. Shen, Y. Liu, J. Zhu, X. Cao, X. Zhang, Y. He, W. Ye, J. M. Rehg, and I. Lourentzou, “Fine-grained preference optimization improves spatial reasoning in vlms,” arXiv preprint arXiv:2506.21656, 2025

  61. [61]

    Toward cognitive supersensing in multimodal large language model,

    B. Li, Y. Shen, Y. Liu, Y. Xu, J. Liu, X. Li, Z. Li, J. Zhu, Y. Zhong, F. Lan et al., “Toward cognitive supersensing in multimodal large language model,” arXiv preprint arXiv:2602.01541, 2026

  62. [62]

    Reasoning in computer vision: Taxonomy, models, tasks, and methodologies,

    A. Sarkar, M. Y. I. Idris, and Z. Yu, “Reasoning in computer vision: Taxonomy, models, tasks, and methodologies,” arXiv preprint arXiv:2508.10523, 2025

  63. [63]

    Open-vocabulary segmentation with semantic-assisted calibration,

    Y. Liu, S. Bai, G. Li, Y. Wang, and Y. Tang, “Open-vocabulary segmentation with semantic-assisted calibration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3491–3500

  64. [64]

    Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip,

    Q. Yu, J. He, X. Deng, X. Shen, and L.-C. Chen, “Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip,” Advances in Neural Information Processing Systems, vol. 36, pp. 32215–32234, 2023

  65. [65]

    Eov-seg: Efficient open-vocabulary panoptic segmentation,

    H. Niu, J. Hu, J. Lin, G. Jiang, and S. Zhang, “Eov-seg: Efficient open-vocabulary panoptic segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 6, 2025, pp. 6254–6262

  66. [66]

    Learning mask-aware clip representations for zero-shot segmentation,

    S. Jiao, Y. Wei, Y. Wang, Y. Zhao, and H. Shi, “Learning mask-aware clip representations for zero-shot segmentation,” Advances in Neural Information Processing Systems, vol. 36, pp. 35631–35653, 2023

  67. [67]

    Collaborative vision-text representation optimizing for open-vocabulary segmentation,

    S. Jiao, H. Zhu, J. Huang, Y. Zhao, Y. Wei, and H. Shi, “Collaborative vision-text representation optimizing for open-vocabulary segmentation,” in Proceedings of the European Conference on Computer Vision, 2025, pp. 399–416

  68. [68]

    Sed: A simple encoder-decoder for open-vocabulary semantic segmentation,

    B. Xie, J. Cao, J. Xie, F. S. Khan, and Y. Pang, “Sed: A simple encoder-decoder for open-vocabulary semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3426–3436

  69. [69]

    Efficient redundancy reduction for open-vocabulary semantic segmentation,

    L. Chen, Q. Yang, K. Ding, Z. Li, G. Shen, F. Li, Q. Cao, and S. Xiang, “Efficient redundancy reduction for open-vocabulary semantic segmentation,” Neurocomputing, vol. 665, p. 132229, 2026. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0925231225029017

  70. [70]

    Proxyclip: Proxy attention improves clip for open-vocabulary segmentation,

    M. Lan, C. Chen, Y. Ke, X. Wang, L. Feng, and W. Zhang, “Proxyclip: Proxy attention improves clip for open-vocabulary segmentation,” in European Conference on Computer Vision. Springer, 2024, pp. 70–88

  71. [71]

    Sclip: Rethinking self-attention for dense vision-language inference,

    F. Wang, J. Mei, and A. Yuille, “Sclip: Rethinking self-attention for dense vision-language inference,” in European Conference on Computer Vision. Springer, 2024, pp. 315–332

  72. [72]

    Clearclip: Decomposing clip representations for dense vision-language inference,

    M. Lan, C. Chen, Y. Ke, X. Wang, L. Feng, and W. Zhang, “Clearclip: Decomposing clip representations for dense vision-language inference,” in European Conference on Computer Vision. Springer, 2024, pp. 143–160

  73. [73]

    Explore the potential of clip for training-free open vocabulary semantic segmentation,

    T. Shao, Z. Tian, H. Zhao, and J. Su, “Explore the potential of clip for training-free open vocabulary semantic segmentation,” in European Conference on Computer Vision. Springer, 2024, pp. 139–156

  74. [74]

    Self-calibrated clip for training-free open-vocabulary segmentation,

    S. Bai, Y. Liu, Y. Han, H. Zhang, Y. Tang, J. Zhou, and J. Lu, “Self-calibrated clip for training-free open-vocabulary segmentation,” IEEE Transactions on Image Processing, 2025

  75. [75]

    Generalization boosted adapter for open-vocabulary segmentation,

    W. Xu, C. Wang, X. Feng, R. Xu, L. Huang, Z. Zhang, L. Guo, and S. Xu, “Generalization boosted adapter for open-vocabulary segmentation,” IEEE Transactions on Circuits and Systems for Video Technology, 2024

  76. [76]

    Talking to dino: Bridging self-supervised vision backbones with language for open-vocabulary segmentation,

    L. Barsellotti, L. Bianchi, N. Messina, F. Carrara, M. Cornia, L. Baraldi, F. Falchi, and R. Cucchiara, “Talking to dino: Bridging self-supervised vision backbones with language for open-vocabulary segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 22025–22035

  77. [77]

    Llmformer: Large language model for open-vocabulary semantic segmentation,

    H. Shi, S. D. Dao, and J. Cai, “Llmformer: Large language model for open-vocabulary semantic segmentation,” International Journal of Computer Vision, vol. 133, no. 2, pp. 742–759, 2025

  78. [78]

    Towards open-vocabulary video semantic segmentation,

    X. Li, Y. Liu, G. Sun, M. Wu, L. Zhang, and C. Zhu, “Towards open-vocabulary video semantic segmentation,” IEEE Transactions on Multimedia, 2025

  79. [79]

    Hierarchical open-vocabulary universal image segmentation,

    X. Wang, S. Li, K. Kallidromitis, Y. Kato, K. Kozuka, and T. Darrell, “Hierarchical open-vocabulary universal image segmentation,” Advances in Neural Information Processing Systems, vol. 36, 2024

  80. [80]

    Large multimodal model for open vocabulary semantic segmentation of remote sensing images,

    B. Liu, X. Chen, A. Yu, F. Feng, J. Yue, and X. Yu, “Large multimodal model for open vocabulary semantic segmentation of remote sensing images,” European Journal of Remote Sensing, vol. 58, no. 1, p. 2447344, 2025

Showing first 80 references.