pith. machine review for the scientific record.

arxiv: 2604.08461 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.AI

Recognition: unknown

OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:00 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords open-vocabulary segmentation · DINO features · SAM alignment · boundary awareness · structure-aware encoder · weakly-supervised segmentation · image segmentation · language guidance

The pith

Aligning DINO features with SAM structures restores boundary awareness for open-vocabulary segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that DINO's edge sensitivity progressively attenuates as features pass through deeper transformer blocks, which limits precise segmentation of object regions described only by text. It counters this by adding a Structure-Aware Encoder and a Structure-Modulated Decoder that borrow structural cues from the Segment Anything Model and train with SAM-generated pseudo-masks. A sympathetic reader would care because current language-guided segmentation often produces blurry or incomplete outlines in busy scenes. If the claim holds, combining these two models yields sharper boundaries for categories never seen during training, reducing the need for densely labeled data.

Core claim

DINO's inherent boundary awareness attenuates progressively through its transformer blocks. The authors address this attenuation by introducing a Structure-Aware Encoder and Structure-Modulated Decoder that align DINO representations with SAM structural priors, supervised by SAM pseudo-masks, and report state-of-the-art results on weakly-supervised open-vocabulary segmentation benchmarks.
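
As a concrete illustration of what the attenuation claim predicts, the sketch below measures the high-frequency (edge-like) energy fraction of DINOv2 features block by block, in the spirit of the paper's frequency-domain analysis (Figure 2). This is a hypothetical probe, not the authors' code; it assumes PyTorch and the public facebookresearch/dinov2 torch.hub entry point. If the claim holds, the printed fraction should fall with depth.

```python
# Hypothetical probe of the attenuation claim: how much high-frequency
# (edge-like) spectral energy survives in DINOv2 features at each block.
import torch

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
img = torch.randn(1, 3, 518, 518)  # stand-in for a normalized RGB image

with torch.no_grad():
    # One (B, C, H, W) feature map per transformer block (ViT-S has 12).
    feats = model.get_intermediate_layers(img, n=12, reshape=True)

for depth, f in enumerate(feats, start=1):
    power = torch.fft.fftshift(torch.fft.fft2(f.float()), dim=(-2, -1)).abs() ** 2
    h, w = power.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    radius = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2).sqrt()
    # Fraction of spectral energy outside a low-frequency disk.
    high = power[..., radius > min(h, w) / 4].sum()
    print(f"block {depth:2d}: high-frequency fraction = {(high / power.sum()).item():.3f}")
```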

What carries the argument

The Structure-Aware Encoder (SAE) and Structure-Modulated Decoder (SMD) that activate DINO's latent boundary features by alignment with SAM structural priors.
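
The SAE/SMD equations are not reproduced in this review, so the following is only one plausible reading of "alignment with SAM structural priors": SAM encoder features predict a per-channel scale and shift (FiLM-style) that re-amplifies suppressed boundary channels in the DINO features. The module name, the feature dimensions (384 for DINOv2 ViT-S, 256 for SAM's encoder neck), and the modulation form are all assumptions for illustration, not the paper's formulation.

```python
# Minimal sketch of structure-conditioned feature modulation (one plausible
# reading of the SAE/SMD idea, not the authors' exact module).
import torch
import torch.nn as nn

class StructureModulation(nn.Module):
    def __init__(self, dino_dim: int = 384, sam_dim: int = 256):
        super().__init__()
        # Predict a per-channel scale (gamma) and shift (beta) from SAM features.
        self.proj = nn.Conv2d(sam_dim, dino_dim * 2, kernel_size=1)

    def forward(self, dino_feat, sam_feat):
        # Resize SAM features to DINO's spatial grid before modulating.
        sam_feat = nn.functional.interpolate(
            sam_feat, size=dino_feat.shape[-2:], mode="bilinear",
            align_corners=False)
        gamma, beta = self.proj(sam_feat).chunk(2, dim=1)
        return dino_feat * (1 + gamma) + beta  # FiLM-style modulation

mod = StructureModulation()
out = mod(torch.randn(1, 384, 37, 37), torch.randn(1, 256, 64, 64))
print(out.shape)  # torch.Size([1, 384, 37, 37])
```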

If this is right

  • State-of-the-art average performance on weakly-supervised OVS benchmarks with a 2.1 percent gain.
  • Particularly large gains of 6.3 percent on Cityscapes, where scenes are complex and cluttered.
  • Improved handling of fine object boundaries when segmenting regions using only language descriptions.
  • Reduced reliance on pixel-level labels through the use of SAM pseudo-masks for supervision.
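
On the last bullet: a rough sketch of how SAM pseudo-mask supervision could be wired up, assuming the masks come from SAM's prompt-free automatic generator via the public segment-anything API. The paper's appendix criticizes uniform point sampling as slow and prone to over-segmentation, so the authors' actual recipe likely differs; the checkpoint name and the flattening rule below are illustrative only.

```python
# Sketch of class-agnostic pseudo-mask generation with SAM; semantics would
# be assigned later by the language side, not by SAM itself.
import numpy as np
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
generator = SamAutomaticMaskGenerator(sam, points_per_side=32)

image = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in for an RGB image
masks = generator.generate(image)  # list of dicts: 'segmentation', 'area', ...

# Flatten the proposals into one integer pseudo-label map; painting large
# masks first lets smaller, finer masks overwrite them.
pseudo = np.zeros(image.shape[:2], dtype=np.int32)
for idx, m in enumerate(sorted(masks, key=lambda m: m["area"], reverse=True)):
    pseudo[m["segmentation"]] = idx + 1
```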

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same structural alignment pattern could be tested on other vision foundation models that show similar progressive loss of spatial detail.
  • Sharper boundaries produced this way might directly help open-vocabulary object detection or instance segmentation pipelines.
  • Pseudo-mask supervision from SAM suggests a route to lower annotation costs for other dense prediction tasks that need precise edges.

Load-bearing premise

The main bottleneck for DINO in segmentation is the attenuation of boundary awareness in deeper layers, and this can be corrected by SAM alignment without introducing new semantic biases or edge artifacts.

What would settle it

A side-by-side test on a cluttered benchmark dataset that measures boundary precision metrics such as boundary F-score or edge IoU and finds no improvement over baseline DINO would show the alignment does not restore the claimed edge sensitivity.
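
For reference, a boundary F-score of the kind such a test would report can be computed as below: mask contours are extracted by morphological erosion and matched within a small pixel tolerance. This is a simplified sketch of the standard BF-measure, not any specific benchmark's implementation; the tolerance and the toy masks are illustrative.

```python
# Simplified boundary F-score between a predicted and ground-truth mask.
import numpy as np
from scipy import ndimage

def boundary_f1(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """pred, gt: boolean masks for one class. tol: match radius in pixels."""
    def edges(m):
        return m & ~ndimage.binary_erosion(m)
    pe, ge = edges(pred), edges(gt)
    # An edge pixel counts as matched if the other mask's edge lies within tol.
    ge_zone = ndimage.binary_dilation(ge, iterations=tol)
    pe_zone = ndimage.binary_dilation(pe, iterations=tol)
    precision = (pe & ge_zone).sum() / max(pe.sum(), 1)
    recall = (ge & pe_zone).sum() / max(ge.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)

gt = np.zeros((64, 64), bool); gt[16:48, 16:48] = True
pred = np.zeros((64, 64), bool); pred[17:49, 15:47] = True  # 1px misalignment
print(f"boundary F1 = {boundary_f1(pred, gt):.3f}")
```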

Figures

Figures reproduced from arXiv: 2604.08461 by Deqiang Ouyang, Guoqing Wang, Haiyue Zhang, Haoxi Zeng, Heng Tao Shen, Qiankun Liu, Yi Bin, Yujuan Ding.

Figure 1: Overall performance comparison. We evaluate our …
Figure 2: Frequency-domain analysis of DINO layers. (a) The …
Figure 4: CKA heatmap results of DINOv2 and SAM, obtained …
Figure 5: Overall architecture of OVS-DINO. We first encode the multi-level features from DINO using the Structure-Aware …
Figure 6: Visualization results of different models.
Appendix Figure 1: Visualization of pseudo-masks. Currently, many pseudo-mask generation methods rely on the panoptic segmentation mode of the SAM [20] model for automatic mask generation. However, this approach [43] utilizes uniformly sampled points as prompts, which leads to relatively slow inference speed and severe over-segmentation. Such over-segmentation is particularly detrimental to semantic segmentation tasks. …
Appendix Figure 2: Visualization of success cases.
Appendix Figure 3: Visualization of failure cases. … are more difficult to localize and classify accurately in an open-vocabulary setting. Overall, these failure cases indicate that the proposed model performs reliably on dominant semantic regions but remains less robust for small objects, thin structures, and boundary-sensitive areas. In addition, visually or semantically similar categories are more likely to be confused, esp…
Appendix Figure 4: Additional qualitative results on the ADE20K …
Appendix Figure 5: Additional qualitative results on the Cityscapes …
Appendix Figure 6: Additional qualitative results on the COCO-Stuff …
Original abstract

Open-Vocabulary Segmentation (OVS) aims to segment image regions beyond predefined category sets by leveraging semantic descriptions. While CLIP based approaches excel in semantic generalization, they frequently lack the fine-grained spatial awareness required for dense prediction. Recent efforts have incorporated Vision Foundation Models (VFMs) like DINO to alleviate these limitations. However, these methods still struggle with the precise edge perception necessary for high fidelity segmentation. In this paper, we analyze internal representations of DINO and discover that its inherent boundary awareness is not absent but rather undergoes progressive attenuation as features transition into deeper transformer blocks. To address this, we propose OVS-DINO, a novel framework that revitalizes latent edge-sensitivity of DINO through structural alignment with the Segment Anything Model (SAM). Specifically, we introduce a Structure-Aware Encoder (SAE) and a Structure-Modulated Decoder (SMD) to effectively activate boundary features of DINO using SAM's structural priors, complemented by a supervision strategy utilizing SAM generated pseudo-masks. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple weakly-supervised OVS benchmarks, improving the average score by 2.1% (from 44.8% to 46.9%). Notably, our approach significantly enhances segmentation accuracy in complex, cluttered scenarios, with a gain of 6.3% on Cityscapes (from 36.6% to 42.9%).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes OVS-DINO, a framework for weakly-supervised open-vocabulary segmentation that analyzes progressive attenuation of boundary awareness in DINO transformer features and addresses it via a Structure-Aware Encoder (SAE) and Structure-Modulated Decoder (SMD) that align DINO representations with SAM structural priors, using SAM-generated pseudo-masks for supervision. It reports SOTA results on multiple benchmarks, with a 2.1% average mIoU lift (44.8% to 46.9%) and a 6.3% gain on Cityscapes (36.6% to 42.9%).

Significance. If the central claim holds after proper validation, the work offers a practical way to combine DINO's semantic strengths with SAM's edge sensitivity for improved dense OVS in cluttered scenes, without requiring dense labels. This could be useful for real-world applications where boundary precision matters, provided the alignment does not trade off semantic fidelity.

major comments (3)
  1. [Abstract] Abstract: The 6.3% Cityscapes improvement and overall SOTA claim rest on the premise that SAM pseudo-masks supply clean structural supervision without semantic bias, yet no analysis, ablation, or error breakdown is referenced to rule out class-agnostic edge artifacts or misalignment with language-guided categories in cluttered scenes.
  2. [Methods] Methods (SAE/SMD description): The claim that SAE and SMD 'revitalize latent edge-sensitivity' via structural alignment lacks concrete equations, feature-map visualizations, or quantitative metrics showing that boundary features are recovered without attenuating DINO's semantic representations; this is load-bearing for the attenuation hypothesis.
  3. [Experiments] Experiments: No ablations, baseline variants (e.g., DINO without SAM alignment or with random masks), or statistical significance tests are mentioned to support that the reported deltas are attributable to the proposed modules rather than implementation details or dataset specifics.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'progressive attenuation as features transition into deeper transformer blocks' would benefit from a brief reference to the specific layer indices or attention maps used in the analysis.
  2. [Methods] Notation: 'SAE' and 'SMD' are introduced without expanding the acronyms on first use or providing a high-level diagram reference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below and outline revisions that will strengthen the manuscript's clarity and validation of the central claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The 6.3% Cityscapes improvement and overall SOTA claim rest on the premise that SAM pseudo-masks supply clean structural supervision without semantic bias, yet no analysis, ablation, or error breakdown is referenced to rule out class-agnostic edge artifacts or misalignment with language-guided categories in cluttered scenes.

    Authors: We agree that the abstract does not explicitly reference supporting analyses of the SAM pseudo-masks. The manuscript describes SAM masks as providing structural priors while language guidance handles semantic categories, but to directly address potential edge artifacts or misalignment in cluttered scenes, we will add a dedicated error breakdown and ablation study in the revised version. This will include quantitative comparison of mask quality against ground-truth boundaries and semantic consistency metrics on Cityscapes. revision: yes

  2. Referee: [Methods] Methods (SAE/SMD description): The claim that SAE and SMD 'revitalize latent edge-sensitivity' via structural alignment lacks concrete equations, feature-map visualizations, or quantitative metrics showing that boundary features are recovered without attenuating DINO's semantic representations; this is load-bearing for the attenuation hypothesis.

    Authors: The manuscript presents the attenuation hypothesis based on internal DINO feature analysis and describes the SAE/SMD alignment mechanism. However, we acknowledge that additional concrete support would strengthen the presentation. In revision we will insert the explicit alignment equations, before/after feature-map visualizations highlighting boundary activation, and quantitative metrics (such as boundary precision and semantic similarity scores) demonstrating that edge sensitivity is recovered without degrading DINO's semantic content. revision: yes

  3. Referee: [Experiments] Experiments: No ablations, baseline variants (e.g., DINO without SAM alignment or with random masks), or statistical significance tests are mentioned to support that the reported deltas are attributable to the proposed modules rather than implementation details or dataset specifics.

    Authors: We concur that isolating the contribution of SAE and SMD requires explicit ablations. The current results report overall performance gains, yet to confirm attribution we will incorporate additional experiments in the revision: (i) DINO baseline without SAM alignment, (ii) variant using random masks instead of SAM priors, and (iii) statistical significance testing (e.g., paired t-tests) across multiple runs on the reported benchmarks. These will be presented in a new ablation table. revision: yes
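
For the significance testing proposed in (iii), a minimal sketch of a paired t-test across seeds follows; the mIoU values are placeholders for illustration, not results from the paper.

```python
# Paired t-test over per-seed mIoU for baseline vs. proposed variant.
from scipy import stats

baseline_miou = [36.4, 36.9, 36.5, 36.7, 36.6]  # hypothetical seed results
ovsdino_miou  = [42.7, 43.1, 42.8, 43.0, 42.9]

t, p = stats.ttest_rel(ovsdino_miou, baseline_miou)
print(f"paired t = {t:.2f}, p = {p:.4f}")
```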

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external pre-trained models and empirical benchmarks

full rationale

The paper's core contribution is the introduction of SAE and SMD modules for structural alignment between DINO and SAM, plus SAM pseudo-mask supervision. These are presented as novel additions to address observed attenuation in DINO features. Performance gains (e.g., +2.1% average, +6.3% on Cityscapes) are reported as experimental outcomes on external benchmarks, not as quantities derived by fitting parameters to the target metrics or by self-referential definitions. No load-bearing self-citations, uniqueness theorems from prior author work, or ansatzes smuggled via citation appear in the abstract or described chain. The method builds on independent VFMs (DINO, SAM) whose representations are treated as given inputs, with the alignment modules providing independent content. This satisfies the default expectation of a non-circular paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method relies on alignment of existing pre-trained models with new modules whose internal details are not specified here.

pith-pipeline@v0.9.0 · 5581 in / 1117 out tokens · 92546 ms · 2026-05-10T17:00:20.964427+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

59 extracted references · 12 canonical work pages · 6 internal anchors

  [1] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. 2021. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254.
  [2] Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, and Rita Cucchiara. 2025. Talking to DINO: Bridging self-supervised vision backbones with language for open-vocabulary segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 22025–22035.
  [3] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc.
  [4] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. 2018. COCO-Stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1209–1218.
  [5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9650–9660.
  [6] Junbum Cha, Jonghwan Mun, and Byungseok Roh. 2023. Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11165–11174.
  [7–8] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. 2023. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2818–2829.
  [9] MMSegmentation Contributors. 2020. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark.
  [10] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3213–3223.
  [11] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. 2023. Vision transformers need registers. arXiv preprint arXiv:2309.16588.
  [12] Zheng Ding, Jieke Wang, and Zhuowen Tu. 2022. Open-vocabulary universal image segmentation with MaskCLIP. arXiv preprint arXiv:2208.08984.
  [13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  [14] Weijia Dou, Xu Zhang, Yi Bin, Jian Liu, Bo Peng, Guoqing Wang, Yang Yang, and Heng Tao Shen. 2026. GeoPurify: A data-efficient geometric distillation framework for open-vocabulary 3D segmentation. In ICLR.
  [15] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision 88, 2 (2010), 303–338.
  [16–17] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16000–16009.
  [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
  [19] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning. PMLR, 4904–4916.
  [20] Cijo Jose, Théo Moutakanni, Dahyun Kang, Federico Baldassarre, Timothée Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Michaël Ramamonjisoa, Maxime Oquab, et al. 2025. DINOv2 meets text: A unified framework for image- and pixel-level vision-language alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference. 24905–24916.
  [21] Lei Ke, Mingqiao Ye, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu, et al. 2023. Segment anything in high quality. Advances in Neural Information Processing Systems 36 (2023), 29914–29934.
  [22–23] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. 2023. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4015–4026.
  [24] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. Similarity of neural network representations revisited. In International Conference on Machine Learning. PMLR, 3519–3529.
  [25] Minhyeok Lee, Suhwan Cho, Jungho Lee, Sunghun Yang, Heeseung Choi, Ig-Jae Kim, and Sangyoun Lee. 2025. Effective SAM combination for open-vocabulary semantic segmentation. In Proceedings of the Computer Vision and Pattern Recognition Conference. 26081–26090.
  [26] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. 2023. Open-vocabulary semantic segmentation with mask-adapted CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7061–7070.
  [27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer, 740–755.
  [28] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. 2024. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision. Springer, 38–55.
  [29] Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  [30] Huaishao Luo, Junwei Bao, Youzheng Wu, Xiaodong He, and Tianrui Li. 2023. SegCLIP: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In International Conference on Machine Learning. PMLR, 23033–23044.
  [31] Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. 2024. Segment anything in medical images. Nature Communications 15, 1 (2024), 654.
  [32] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. 2014. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 891–898.
  [33] Jishnu Mukhoti, Tsung-Yu Lin, Omid Poursaeed, Rui Wang, Ashish Shah, Philip HS Torr, and Ser-Nam Lim. 2023. Open vocabulary semantic segmentation with patch aligned contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19413–19423.
  [34] Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning. PMLR, 8162–8171.
  [35] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. 2024. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research (2024).
  [36] Keiron O'Shea and Ryan Nash. 2015. An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458.
  [37] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
  [38] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. 2017. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in Neural Information Processing Systems 30 (2017).
  [39] Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. 2019. On the spectral bias of neural networks. In International Conference on Machine Learning. PMLR, 5301–5310.
  [40] Pengzhen Ren, Changlin Li, Hang Xu, Yi Zhu, Guangrun Wang, Jianzhuang Liu, Xiaojun Chang, and Xiaodan Liang. 2023. ViewCo: Discovering text-supervised segmentation masks via multi-view semantic consistency. arXiv preprint arXiv:2302.10307.
  [41] Gyungin Shin, Weidi Xie, and Samuel Albanie. 2022. ReCo: Retrieve and co-segment for zero-shot transfer. Advances in Neural Information Processing Systems 35 (2022), 33754–33767.
  [42] Heeseong Shin, Chaehyun Kim, Sunghwan Hong, Seokju Cho, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. 2024. Towards open-vocabulary semantic segmentation without semantic labels. Advances in Neural Information Processing Systems 37 (2024), 9153–9177.
  [43] Oriane Siméoni, Gilles Puy, Huy V Vo, Simon Roburin, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Renaud Marlet, and Jean Ponce. 2021. Localizing objects with self-supervised transformers and no labels. arXiv preprint arXiv:2109.14279.
  [44] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. 2023. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389.
  [45] Lv Tang, Haoke Xiao, and Bo Li. 2023. Can SAM segment anything? When SAM meets camouflaged object detection. arXiv preprint arXiv:2304.04709.
  [46] Zhengyang Wang, Tingliang Feng, Fan Lyu, Fanhua Shang, Wei Feng, and Liang Wan. 2025. Dual semantic guidance for open vocabulary semantic segmentation. In Proceedings of the Computer Vision and Pattern Recognition Conference. 20212–20222.
  [47] Ji-Jia Wu, Andy Chia-Hao Chang, Chieh-Yu Chuang, Chun-Pei Chen, Yu-Lun Liu, Min-Hung Chen, Hou-Ning Hu, Yung-Yu Chuang, and Yen-Yu Lin. 2024. Image-text co-decomposition for text-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 26794–26803.
  [48] Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, and Chen Change Loy. 2023. CLIPSelf: Vision transformer distills itself for open-vocabulary dense prediction. arXiv preprint arXiv:2310.01403.
  [49] Monika Wysoczańska, Oriane Siméoni, Michaël Ramamonjisoa, Andrei Bursuc, Tomasz Trzciński, and Patrick Pérez. 2024. CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation. In European Conference on Computer Vision. Springer, 320–337.
  [50] Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. 2022. GroupViT: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18134–18144.
  [51] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. 2023. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2955–2966.
  [52] Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. 2023. Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2945–2954.
  [53] Fei Zhang, Tianfei Zhou, Boyang Li, Hao He, Chaofan Ma, Tianjiao Zhang, Jiangchao Yao, Ya Zhang, and Yanfeng Wang. 2023. Uncovering prototypical knowledge for weakly open-vocabulary semantic segmentation. Advances in Neural Information Processing Systems 36 (2023), 73652–73665.
  [54–55] Weichen Zhang, Kebin Liu, Fan Dang, Zhui Zhu, Xikai Sun, and Yunhao Liu. 2025. SynSeg: Feature synergy for multi-category contrastive learning in open-vocabulary semantic segmentation. arXiv e-prints (2025), arXiv–2508.
  [56] Shiyu Zhao, Zhixing Zhang, Samuel Schulter, Long Zhao, BG Vijay Kumar, Anastasis Stathopoulos, Manmohan Chandraker, and Dimitris N Metaxas. 2022. Exploiting unlabeled data with vision and language models for object detection. In European Conference on Computer Vision. Springer, 159–175.
  [57] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. 2022. RegionCLIP: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16793–16803.
  [58] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. 2019. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision 127, 3 (2019), 302–321.
  [59] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. 2021. iBOT: Image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832.