Towards Realistic Open-Vocabulary Remote Sensing Segmentation: Benchmark and Baseline
Pith reviewed 2026-05-10 08:16 UTC · model grok-4.3
The pith
OVRSISBenchV2 provides a more realistic and challenging benchmark for open-vocabulary remote sensing segmentation, where the Pi-Seg baseline achieves strong performance through positive-incentive noise.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OVRSISBenchV2 substantially expands scene diversity, semantic coverage, and evaluation difficulty beyond prior work, and Pi-Seg delivers strong and consistent results, particularly on the more challenging OVRSISBenchV2 benchmark, by using positive-incentive noise to broaden the visual-text feature space.
What carries the argument
The positive-incentive noise mechanism in Pi-Seg, consisting of learnable and semantically guided perturbations that broaden the visual-text feature space during training to improve transferability.
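The mechanism can be illustrated with a minimal sketch: perturb the text embeddings with Gaussian noise whose magnitude is guided by image-text similarity, so that training sees a broadened visual-text feature space. The function name and the specific scaling rule (larger noise for lower-similarity classes) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def positive_incentive_noise(text_emb, image_emb, scale=0.1, rng=None):
    """Hypothetical sketch of semantically guided perturbation.

    text_emb:  (num_classes, dim) text embeddings to perturb
    image_emb: (num_images, dim) visual features used as guidance
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Cosine similarity between each text embedding and the mean visual feature.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb.mean(axis=0)
    v = v / np.linalg.norm(v)
    sim = t @ v                                # (num_classes,)
    # Guided magnitude: classes less aligned with the imagery get larger noise.
    sigma = scale * (1.0 - sim)[:, None]       # (num_classes, 1)
    noise = rng.normal(size=text_emb.shape) * sigma
    return text_emb + noise
```

In a full system the per-class scale would be learnable rather than fixed; the sketch only shows the shape of the idea, that the perturbation is noise applied in the joint feature space with a semantics-dependent magnitude.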
If this is right
- OVRSISBenchV2 enables more rigorous testing of open-vocabulary methods in remote sensing.
- Pi-Seg shows consistent improvements across OVRSISBenchV1, V2, and downstream tasks.
- Realistic benchmarks better reflect complex deployment scenarios like flood detection.
- Positive-incentive noise enhances cross-dataset generalization in segmentation models.
Where Pith is reading between the lines
- Similar benchmark expansion approaches could be applied to other specialized domains like medical imaging.
- The noise perturbation idea might transfer to general open-vocabulary segmentation outside remote sensing.
- Future work could test if the benchmark's diversity reduces selection bias in model development.
- Integration with other sensors or modalities could further test the method's robustness.
Load-bearing premise
The manually selected categories and datasets accurately capture the full range of realistic geospatial application demands without introducing selection bias or missing key scene and sensor variations.
What would settle it
A follow-up experiment where Pi-Seg fails to outperform standard baselines on OVRSISBenchV2 or where new datasets reveal that the 128 categories do not cover important real-world cases would challenge the central claims.
Original abstract
Open-vocabulary remote sensing image segmentation (OVRSIS) remains underexplored due to fragmented datasets, limited training diversity, and the lack of evaluation benchmarks that reflect realistic geospatial application demands. Our previous OVRSISBenchV1 established an initial cross-dataset evaluation protocol, but its limited scope is insufficient for assessing realistic open-world generalization. To address this issue, we propose OVRSISBenchV2, a large-scale and application-oriented benchmark for OVRSIS. We first construct OVRSIS95K, a balanced dataset of about 95K image-mask pairs covering 35 common semantic categories across diverse remote sensing scenes. Built upon OVRSIS95K and 10 downstream datasets, OVRSISBenchV2 contains 170K images and 128 categories, substantially expanding scene diversity, semantic coverage, and evaluation difficulty. Beyond standard open-vocabulary segmentation, it further includes downstream protocols for building extraction, road extraction, and flood detection, thereby better reflecting realistic geospatial application demands and complex deployment scenarios. We also propose Pi-Seg, a baseline for OVRSIS. Pi-Seg improves transferability through a positive-incentive noise mechanism, where learnable and semantically guided perturbations broaden the visual-text feature space during training. Extensive experiments on OVRSISBenchV1, OVRSISBenchV2, and downstream tasks show that Pi-Seg delivers strong and consistent results, particularly on the more challenging OVRSISBenchV2 benchmark. Our results highlight both the importance of realistic benchmark design and the effectiveness of perturbation-based transfer for OVRSIS. The code and datasets are available at https://github.com/LiBingyu01/RSKT-Seg/tree/Pi-Seg.
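At its simplest, the cross-dataset evaluation protocol the abstract describes reduces to computing per-dataset mIoU and macro-averaging across datasets. A minimal sketch, where the function names and the macro-averaging choice are illustrative assumptions rather than the benchmark's published code:

```python
import numpy as np

def miou(pred, gt, num_classes):
    """Mean IoU over the classes present in prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                     # skip classes absent from both
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

def benchmark_score(per_dataset_preds, per_dataset_gts, num_classes):
    """Macro-average mIoU across datasets (one array pair per dataset)."""
    scores = [miou(p, g, num_classes)
              for p, g in zip(per_dataset_preds, per_dataset_gts)]
    return float(np.mean(scores))
```

Macro-averaging gives each dataset equal weight regardless of size, which matters for a benchmark mixing large land-cover corpora with smaller downstream sets such as flood detection.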
Editorial analysis
A structured set of objections, weighed in public.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: visual and text encoders can be aligned via contrastive or perturbation-based training to support open-vocabulary generalization.
Invented entities (1)
- Positive-incentive noise mechanism (no independent evidence).