OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance
Pith reviewed 2026-05-10 17:00 UTC · model grok-4.3
The pith
Aligning DINO features with SAM structures restores boundary awareness for open-vocabulary segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DINO's inherent boundary awareness attenuates progressively through its transformer blocks. The authors address this attenuation by introducing a Structure-Aware Encoder and Structure-Modulated Decoder that align DINO representations with SAM structural priors, supervised by SAM pseudo-masks, and report state-of-the-art results on weakly-supervised open-vocabulary segmentation benchmarks.
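The attenuation claim is ultimately a measurable property of per-block feature maps. As an illustrative proxy only (not the paper's actual analysis), the discrete-Laplacian energy of a feature map drops as the map is smoothed; the repeated neighbourhood averaging below is a crude stand-in for the loss of boundary detail attributed to deeper blocks, and every shape, constant, and the periodic boundary handling is an assumption:

```python
import numpy as np

def laplacian_energy(feat):
    """Proxy for edge sensitivity: mean squared discrete Laplacian
    of an (H, W) feature map. Sharp boundaries -> high energy."""
    lap = (np.roll(feat, 1, 0) + np.roll(feat, -1, 0)
           + np.roll(feat, 1, 1) + np.roll(feat, -1, 1) - 4 * feat)
    return float((lap ** 2).mean())

def blur(feat, steps):
    """Crude stand-in for depth: repeated 4-neighbour averaging
    (periodic boundaries via np.roll, for brevity)."""
    for _ in range(steps):
        feat = 0.25 * (np.roll(feat, 1, 0) + np.roll(feat, -1, 0)
                       + np.roll(feat, 1, 1) + np.roll(feat, -1, 1))
    return feat

H = W = 32
step = np.zeros((H, W))
step[:, W // 2:] = 1.0  # a sharp vertical boundary

# Energy decays monotonically as "depth" (blur) increases.
energies = [laplacian_energy(blur(step, k)) for k in (0, 2, 8)]
assert energies[0] > energies[1] > energies[2]
```

On real features one would run the same measurement over DINO's per-block patch embeddings and look for the same monotone decay.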
What carries the argument
The Structure-Aware Encoder (SAE) and Structure-Modulated Decoder (SMD) that activate DINO's latent boundary features by alignment with SAM structural priors.
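The paper's SAE/SMD equations are not reproduced in this review, so the following is only a plausible sketch of structure-conditioned feature modulation, written FiLM-style (a gate and a shift derived from the structural prior). All names, shapes, and the random projections are hypothetical, not the authors' design:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: N patch tokens with C-dim DINO features,
# and an N x S structural descriptor derived from SAM masks.
N, C, S = 16, 8, 4
dino_feats = rng.normal(size=(N, C))
sam_struct = rng.normal(size=(N, S))

# Learned projections (random stand-ins here).
W_gate = rng.normal(size=(S, C)) * 0.1
W_shift = rng.normal(size=(S, C)) * 0.1

def structure_modulate(f, s):
    """FiLM-style modulation: SAM-derived structure scales and
    shifts DINO features. A plausible mechanism, not the paper's."""
    gate = 1.0 + np.tanh(s @ W_gate)   # multiplicative gate near 1
    shift = s @ W_shift                # additive structural bias
    return gate * f + shift

out = structure_modulate(dino_feats, sam_struct)
assert out.shape == (N, C)
```

The point of the sketch is only that the structural signal enters as a modulation of existing features rather than replacing them, which is consistent with the "revitalize latent edge-sensitivity" framing.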
If this is right
- State-of-the-art average performance on weakly-supervised OVS benchmarks, with a 2.1-point gain (44.8% to 46.9%).
- A particularly large 6.3-point gain on Cityscapes (36.6% to 42.9%), where scenes are complex and cluttered.
- Improved handling of fine object boundaries when segmenting regions using only language descriptions.
- Reduced reliance on pixel-level labels through the use of SAM pseudo-masks for supervision.
Where Pith is reading between the lines
- The same structural alignment pattern could be tested on other vision foundation models that show similar progressive loss of spatial detail.
- Sharper boundaries produced this way might directly help open-vocabulary object detection or instance segmentation pipelines.
- Pseudo-mask supervision from SAM suggests a route to lower annotation costs for other dense prediction tasks that need precise edges.
Load-bearing premise
The main bottleneck for DINO in segmentation is the attenuation of boundary awareness in deeper layers, and this can be corrected by SAM alignment without introducing new semantic biases or edge artifacts.
What would settle it
A side-by-side test on a cluttered benchmark that measures boundary-precision metrics such as boundary F-score or edge IoU: if it finds no improvement over baseline DINO, the alignment does not restore the claimed edge sensitivity.
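Boundary F-score is straightforward to compute from two binary masks. A minimal sketch, assuming 4-neighbour boundaries and a Chebyshev matching tolerance (production code would use a distance transform instead of the quadratic pairwise comparison):

```python
import numpy as np

def boundary_pixels(mask):
    """4-neighbour boundary: foreground pixels with a background neighbour."""
    m = mask.astype(bool)
    pad = np.pad(m, 1, constant_values=False)
    interior = (pad[:-2, 1:-1] & pad[2:, 1:-1]
                & pad[1:-1, :-2] & pad[1:-1, 2:])
    return np.argwhere(m & ~interior)

def boundary_fscore(pred, gt, tol=1):
    """F-score over boundary pixels matched within a tol-pixel
    (Chebyshev) tolerance."""
    bp, bg = boundary_pixels(pred), boundary_pixels(gt)
    if len(bp) == 0 or len(bg) == 0:
        return 0.0
    d = np.abs(bp[:, None, :] - bg[None, :, :]).max(-1)  # pairwise Chebyshev
    prec = (d.min(1) <= tol).mean()
    rec = (d.min(0) <= tol).mean()
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

gt = np.zeros((16, 16), int); gt[4:12, 4:12] = 1
pred = np.zeros((16, 16), int); pred[5:12, 4:12] = 1  # top edge one row off
score = boundary_fscore(pred, gt, tol=1)
assert 0.9 < score <= 1.0  # off-by-one boundary still matches at tol=1
```

Running this on OVS-DINO outputs versus a DINO baseline on Cityscapes would directly test the restored-edge-sensitivity claim.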
Original abstract
Open-Vocabulary Segmentation (OVS) aims to segment image regions beyond predefined category sets by leveraging semantic descriptions. While CLIP-based approaches excel in semantic generalization, they frequently lack the fine-grained spatial awareness required for dense prediction. Recent efforts have incorporated Vision Foundation Models (VFMs) like DINO to alleviate these limitations. However, these methods still struggle with the precise edge perception necessary for high-fidelity segmentation. In this paper, we analyze internal representations of DINO and discover that its inherent boundary awareness is not absent but rather undergoes progressive attenuation as features transition into deeper transformer blocks. To address this, we propose OVS-DINO, a novel framework that revitalizes latent edge-sensitivity of DINO through structural alignment with the Segment Anything Model (SAM). Specifically, we introduce a Structure-Aware Encoder (SAE) and a Structure-Modulated Decoder (SMD) to effectively activate boundary features of DINO using SAM's structural priors, complemented by a supervision strategy utilizing SAM-generated pseudo-masks. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple weakly-supervised OVS benchmarks, improving the average score by 2.1% (from 44.8% to 46.9%). Notably, our approach significantly enhances segmentation accuracy in complex, cluttered scenarios, with a gain of 6.3% on Cityscapes (from 36.6% to 42.9%).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes OVS-DINO, a framework for weakly-supervised open-vocabulary segmentation that analyzes progressive attenuation of boundary awareness in DINO transformer features and addresses it via a Structure-Aware Encoder (SAE) and Structure-Modulated Decoder (SMD) that align DINO representations with SAM structural priors, using SAM-generated pseudo-masks for supervision. It reports SOTA results on multiple benchmarks, with a 2.1% average mIoU lift (44.8% to 46.9%) and a 6.3% gain on Cityscapes (36.6% to 42.9%).
Significance. If the central claim holds after proper validation, the work offers a practical way to combine DINO's semantic strengths with SAM's edge sensitivity for improved dense OVS in cluttered scenes, without requiring dense labels. This could be useful for real-world applications where boundary precision matters, provided the alignment does not trade off semantic fidelity.
Major comments (3)
- [Abstract] Abstract: The 6.3% Cityscapes improvement and overall SOTA claim rest on the premise that SAM pseudo-masks supply clean structural supervision without semantic bias, yet no analysis, ablation, or error breakdown is referenced to rule out class-agnostic edge artifacts or misalignment with language-guided categories in cluttered scenes.
- [Methods] Methods (SAE/SMD description): The claim that SAE and SMD 'revitalize latent edge-sensitivity' via structural alignment lacks concrete equations, feature-map visualizations, or quantitative metrics showing that boundary features are recovered without attenuating DINO's semantic representations; this is load-bearing for the attenuation hypothesis.
- [Experiments] Experiments: No ablations, baseline variants (e.g., DINO without SAM alignment or with random masks), or statistical significance tests are mentioned to support that the reported deltas are attributable to the proposed modules rather than implementation details or dataset specifics.
Minor comments (2)
- [Abstract] Abstract: The phrase 'progressive attenuation as features transition into deeper transformer blocks' would benefit from a brief reference to the specific layer indices or attention maps used in the analysis.
- [Methods] Notation: 'SAE' and 'SMD' are introduced without expanding the acronyms on first use or providing a high-level diagram reference.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address each major comment below and outline revisions that will strengthen the manuscript's clarity and validation of the central claims.
Point-by-point responses
Referee: [Abstract] Abstract: The 6.3% Cityscapes improvement and overall SOTA claim rest on the premise that SAM pseudo-masks supply clean structural supervision without semantic bias, yet no analysis, ablation, or error breakdown is referenced to rule out class-agnostic edge artifacts or misalignment with language-guided categories in cluttered scenes.
Authors: We agree that the abstract does not explicitly reference supporting analyses of the SAM pseudo-masks. The manuscript describes SAM masks as providing structural priors while language guidance handles semantic categories, but to directly address potential edge artifacts or misalignment in cluttered scenes, we will add a dedicated error breakdown and ablation study in the revised version. This will include quantitative comparison of mask quality against ground-truth boundaries and semantic consistency metrics on Cityscapes. revision: yes
Referee: [Methods] Methods (SAE/SMD description): The claim that SAE and SMD 'revitalize latent edge-sensitivity' via structural alignment lacks concrete equations, feature-map visualizations, or quantitative metrics showing that boundary features are recovered without attenuating DINO's semantic representations; this is load-bearing for the attenuation hypothesis.
Authors: The manuscript presents the attenuation hypothesis based on internal DINO feature analysis and describes the SAE/SMD alignment mechanism. However, we acknowledge that additional concrete support would strengthen the presentation. In revision we will insert the explicit alignment equations, before/after feature-map visualizations highlighting boundary activation, and quantitative metrics (such as boundary precision and semantic similarity scores) demonstrating that edge sensitivity is recovered without degrading DINO's semantic content. revision: yes
Referee: [Experiments] Experiments: No ablations, baseline variants (e.g., DINO without SAM alignment or with random masks), or statistical significance tests are mentioned to support that the reported deltas are attributable to the proposed modules rather than implementation details or dataset specifics.
Authors: We concur that isolating the contribution of SAE and SMD requires explicit ablations. The current results report overall performance gains, yet to confirm attribution we will incorporate additional experiments in the revision: (i) DINO baseline without SAM alignment, (ii) variant using random masks instead of SAM priors, and (iii) statistical significance testing (e.g., paired t-tests) across multiple runs on the reported benchmarks. These will be presented in a new ablation table. revision: yes
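The paired significance test the rebuttal promises is easy to run once per-seed scores exist. A minimal sketch using a normal approximation to the t distribution; the per-seed mIoU numbers below are invented for illustration, and scipy.stats.ttest_rel would give the exact Student-t p-value:

```python
import math

def paired_t(a, b):
    """Paired t statistic and two-sided p-value (normal approximation)
    for per-seed score pairs, e.g. mIoU with vs. without SAM alignment."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    # Normal approximation to the t CDF; fine for a sketch.
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
    return t, p

# Hypothetical per-seed scores echoing the reported means.
full = [46.9, 47.1, 46.7, 47.0, 46.8]
ablated = [44.8, 44.9, 44.7, 44.8, 44.9]
t, p = paired_t(full, ablated)
assert t > 0 and p < 0.05  # the gain is consistent across seeds
```

A consistent per-seed gap like this is what would separate a real module contribution from run-to-run noise.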
Circularity Check
No significant circularity: the method relies on external pre-trained models and empirical benchmarks.
Full rationale
The paper's core contribution is the introduction of SAE and SMD modules for structural alignment between DINO and SAM, plus SAM pseudo-mask supervision. These are presented as novel additions to address observed attenuation in DINO features. Performance gains (e.g., +2.1% average, +6.3% on Cityscapes) are reported as experimental outcomes on external benchmarks, not as quantities derived by fitting parameters to the target metrics or by self-referential definitions. No load-bearing self-citations, uniqueness theorems from prior author work, or ansatzes smuggled via citation appear in the abstract or described chain. The method builds on independent VFMs (DINO, SAM) whose representations are treated as given inputs, with the alignment modules providing independent content. This satisfies the default expectation of a non-circular paper.