pith. machine review for the scientific record.

arXiv:2604.07723 · v1 · submitted 2026-04-09 · 💻 cs.CV


Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation

Authors: Fangyong Wang, Jiahao Li, Yachao Zhang, Yang Lu, Yanyun Qu, Yuan Xie

Pith reviewed 2026-05-10 16:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords open-vocabulary semantic segmentation · training-free methods · distribution discrepancy · analytic solution · vision-language alignment · semantic maps · logits optimization

The pith

Open-vocabulary semantic segmentation can be achieved by directly using the analytic solution to the distribution discrepancy between visual and linguistic features as the segmentation map.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that the usual process of iteratively optimizing logits to align vision and language features is unnecessary for open-vocabulary semantic segmentation. Instead, the authors start from the observation that the raw discrepancy between those features already varies in a category-specific way, then solve for the exact map that would produce that discrepancy. If this holds, segmentation reduces to a one-step calculation that needs no training loop and no model-specific tuning. Readers would care because it removes a major computational bottleneck while still reaching the highest reported numbers on eight standard benchmark datasets.

Core claim

We posit that the distribution discrepancy between visual and linguistic features encodes semantic information, with consistency across patches of the same category and inconsistency across different categories. Therefore the analytic solution of this discrepancy can be used directly as the semantic segmentation map, reformulating the entire task as a closed-form derivation rather than an optimization problem.

What carries the argument

The analytic solution of the distribution discrepancy between visual and linguistic features, applied directly as the pixel-level semantic map.
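To make the carrier concrete, here is a minimal sketch of what such a pipeline could look like. Nothing below is quoted from the paper: the feature names are hypothetical, and the closed-form step uses the analytic structure of an entropy-regularized transport plan as a stand-in for the paper's own solution (the Sinkhorn scalings are cheap matrix balancing on fixed features, not gradient-based training).

```python
import numpy as np

def direct_semantic_map(patch_feats, text_feats, epsilon=0.1, n_iter=50):
    """Hypothetical one-step OVSS map (illustrative, not the authors' code).

    patch_feats: (P, D) visual patch features from a frozen VLM encoder.
    text_feats:  (K, D) text embeddings of the open-vocabulary prompts.
    Returns a length-P array of per-patch category indices.
    """
    # L2-normalize so dot products are cosine similarities, i.e. the "logits".
    v = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = v @ t.T                                      # (P, K)

    # Stand-in for the paper's analytic solution: the closed-form optimizer
    # of an entropy-regularized transport problem, pi = diag(u) K diag(w),
    # with Gibbs kernel K = exp(-C / eps) and cost C = 1 - cosine similarity.
    C = 1.0 - logits
    K_mat = np.exp(-C / epsilon)
    a = np.full(logits.shape[0], 1.0 / logits.shape[0])   # uniform patch mass
    b = np.full(logits.shape[1], 1.0 / logits.shape[1])   # uniform class mass
    u = np.ones_like(a)
    for _ in range(n_iter):  # Sinkhorn scaling: matrix balancing, no gradients
        u = a / (K_mat @ (b / (K_mat.T @ u)))
    w = b / (K_mat.T @ u)
    plan = u[:, None] * K_mat * w[None, :]                # (P, K) transport plan

    return plan.argmax(axis=1)                            # the semantic map
```

The shape of the computation is the claim: one matrix of cosine similarities in, one argmax out, with no loss, no gradient steps, and no ground truth anywhere in the loop.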

If this is right

  • No iterative training or gradient steps are required at inference time.
  • Model-specific attention modulation becomes unnecessary.
  • State-of-the-art results are obtained on eight standard open-vocabulary segmentation benchmarks.
  • Pixel-level vision-language alignment occurs in a single forward pass.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-time systems could adopt this approach where optimization loops are too slow.
  • The same discrepancy solution might be tested on video sequences to enforce temporal consistency without additional training.
  • Any pre-trained vision-language backbone could be plugged in without architectural changes.

Load-bearing premise

The distribution discrepancy between visual and linguistic features must be consistent for patches of the same category and inconsistent for patches of different categories.

What would settle it

On a benchmark where intra-category visual patches produce highly varying distribution discrepancies, the direct analytic map would yield lower accuracy than methods that still optimize logits.
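That test can be phrased as a direct measurement before running any segmentation benchmark: pool per-patch discrepancy vectors by ground-truth category and compare within-category to between-category dispersion. In the minimal probe below, `discrepancy` and `labels` are hypothetical stand-ins (the paper's exact discrepancy definition is not reproduced here); the premise predicts a ratio well below 1.

```python
import numpy as np

def consistency_ratio(discrepancy, labels):
    """Probe the load-bearing premise (needs at least two categories).

    discrepancy: (P, K) per-patch discrepancy vectors, however defined.
    labels:      (P,) ground-truth category index for each patch.
    Returns within-category dispersion divided by between-category dispersion.
    """
    cats = np.unique(labels)
    centroids = np.stack([discrepancy[labels == c].mean(axis=0) for c in cats])

    # Within: mean distance from each patch's vector to its category centroid.
    within = np.mean([
        np.linalg.norm(discrepancy[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(cats)
    ])
    # Between: mean pairwise distance among the category centroids.
    gaps = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    between = gaps[np.triu_indices(len(cats), k=1)].mean()
    return within / between
```

A dataset on which this ratio drifts toward 1 while the direct analytic map simultaneously loses accuracy to logit-optimizing baselines would be the decisive counterexample; a low ratio with no accuracy gap would support the premise.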

Figures

Figures reproduced from arXiv:2604.07723 by Fangyong Wang, Jiahao Li, Yachao Zhang, Yang Lu, Yanyun Qu, Yuan Xie.

Figure 1. Compared with the logits-optimization methods seeking …
Figure 2. Overview of the proposed method. The pipeline begins by computing the cosine similarity (Cos.) between vision-language …
Figure 3. Reformulation analysis results. (a) Performance compar…
Figure 4. Visualization of outputs from each stage of our pipeline. We present the logits and distribution discrepancies for each category …
Figure 5. Quantitative evaluation of SD2 time step, attention resolution, threshold …
Figure 6. Quantitative evaluation of SD2 time step, attention resolution, and foundation model on standard benchmarks is presented, with …
Figure 7. Visualization of segmentation maps on COCO-Stuff benchmark dataset.
Figure 8. Visualization of segmentation maps on Pascal VOC benchmark dataset.
Figure 9. Visualization of segmentation maps on ADE150k benchmark dataset.
Figure 10. Visualization of segmentation maps on Pascal Context benchmark dataset.
Figure 11. Visualization of maximum velocity maps across different threshold.
Figure 12. Visualization of optimal path maps across different iterations.
Original abstract

Open-vocabulary semantic segmentation (OVSS) aims to segment arbitrary category regions in images using open-vocabulary prompts, necessitating that existing methods possess pixel-level vision-language alignment capability. Typically, this capability involves computing the cosine similarity, i.e., logits, between visual and linguistic features, and minimizing the distribution discrepancy between the logits and the ground truth (GT) to generate optimal logits that are subsequently used to construct segmentation maps, yet it depends on time-consuming iterative training or model-specific attention modulation. In this work, we propose a more direct approach that eschews the logits-optimization process by directly deriving an analytic solution for the segmentation map. We posit a key hypothesis: the distribution discrepancy encodes semantic information; specifically, this discrepancy exhibits consistency across patches belonging to the same category but inconsistency across different categories. Based on this hypothesis, we directly utilize the analytic solution of this distribution discrepancy as the semantic maps. In other words, we reformulate the optimization of the distribution discrepancy as deriving its analytic solution, thereby eliminating time-consuming iterative training, freeing us from model-specific attention modulation, and achieving state-of-the-art performance on eight benchmark datasets.
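The abstract invokes an "analytic solution" without stating which objective it solves, which is the crux of the referee's first major comment below. If the discrepancy is formulated as an entropy-regularized optimal-transport cost between the patch and prompt feature distributions (a plausible reading, since the paper cites Sinkhorn's closed-form transport, but an assumption here rather than a quotation), the optimizer has a textbook closed form:

```latex
% Assumed objective: entropic OT between patch marginal a and class marginal b.
\pi^{*} \;=\; \arg\min_{\pi \,\ge\, 0}\;
  \sum_{i,j} C_{ij}\,\pi_{ij}
  \;+\; \epsilon \sum_{i,j} \pi_{ij}\bigl(\ln \pi_{ij} - 1\bigr)
\quad \text{s.t.} \quad
  \sum_{j} \pi_{ij} = a_i, \qquad \sum_{i} \pi_{ij} = b_j .
```

Introducing Lagrange multipliers alpha_i and beta_j for the two marginal constraints and setting the Lagrangian's gradient to zero yields the analytic form:

```latex
\pi^{*}_{ij}
  \;=\; \exp\!\left(\frac{\alpha_i + \beta_j - C_{ij}}{\epsilon}\right)
  \;=\; u_i \, e^{-C_{ij}/\epsilon} \, v_j ,
\qquad u_i = e^{\alpha_i/\epsilon}, \quad v_j = e^{\beta_j/\epsilon} .
```

On this reading, the cost C would come from the cosine logits and the scalings u, v are fixed by the marginals alone, so no ground truth enters the computation, consistent with the simulated rebuttal's response to the first major comment.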

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom-and-free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce a training-free method for open-vocabulary semantic segmentation that avoids logits optimization by directly deriving an analytic solution to the distribution discrepancy between visual and linguistic features and using it as the semantic segmentation map. The method rests on the hypothesis that this discrepancy is consistent for patches of the same category and inconsistent across categories, and it reports state-of-the-art performance on eight benchmark datasets.

Significance. If the central hypothesis holds and the analytic solution is correctly formulated so as to require neither ground truth nor iterative procedures, this could represent a substantial advance in simplifying OVSS: it would eliminate training time and model-specific adjustments, potentially making such segmentation more accessible and efficient. The claim of SOTA results, if verified, would strengthen its impact.

major comments (2)
  1. Abstract: The distribution discrepancy is defined with respect to the ground truth (GT), but the approach is claimed to be training-free and applicable at inference time without GT. The manuscript must clarify how the discrepancy is computed or its analytic solution derived without GT, as this is load-bearing for the 'direct segmentation without logits optimization' claim.
  2. Abstract: The key hypothesis regarding the consistency of the distribution discrepancy across same-category patches is posited without any derivation, equations, or empirical support referenced. Since the entire method rests on using the analytic solution based on this property, a detailed mathematical justification is required to substantiate that the solution yields valid semantic maps.
minor comments (1)
  1. The abstract mentions 'eight benchmark datasets' but does not specify which ones; listing them would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help us improve the clarity of our presentation. We address each major comment below and will incorporate the requested clarifications in the revised manuscript.

Point-by-point responses
  1. Referee: Abstract: The distribution discrepancy is defined with respect to the ground truth (GT), but the approach is claimed to be training-free and applicable at inference time without GT. The manuscript must clarify how the discrepancy is computed or its analytic solution derived without GT, as this is load-bearing for the 'direct segmentation without logits optimization' claim.

    Authors: We agree the abstract phrasing is potentially ambiguous and will revise it. In the method, the distribution discrepancy is defined between the visual feature distribution of image patches and the corresponding linguistic feature distribution induced by the open-vocabulary prompts; it does not involve ground-truth labels. The analytic solution is obtained by directly solving the consistency equation implied by our hypothesis, which is a closed-form expression that operates solely on the extracted features at inference time. Ground truth appears only in the introductory motivation when contrasting against prior logit-optimization approaches. We will add an explicit sentence in the abstract and a short paragraph in Section 3 stating that no GT is required for the derivation or application of the solution. revision: yes

  2. Referee: Abstract: The key hypothesis regarding the consistency of the distribution discrepancy across same-category patches is posited without any derivation, equations, or empirical support referenced. Since the entire method rests on using the analytic solution based on this property, a detailed mathematical justification is required to substantiate that the solution yields valid semantic maps.

    Authors: The abstract states the hypothesis concisely for brevity, but the full manuscript contains the supporting derivation. We formalize the discrepancy as a quantity that is invariant within each semantic category and derive the segmentation map as the unique analytic solution to the resulting linear system; the derivation appears in Section 3 together with the closed-form expression. We will revise the abstract to include a brief pointer to this derivation and will ensure the equations are highlighted. If the referee considers additional empirical verification of the consistency property useful, we can add a short analysis or figure in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain.

Full rationale

The paper posits the hypothesis that the distribution discrepancy (between the logits and the GT, on the abstract's literal phrasing) encodes semantic information via intra-category consistency, then claims to derive and directly apply its analytic solution as the semantic map without iterative optimization. No equations or steps in the provided text reduce the final segmentation output to the inputs by construction (e.g., there is no redefinition of the map as the discrepancy itself, and no fitted parameter renamed as the prediction). The central claim rests on the posited hypothesis and benchmark validation rather than on a self-citation chain or a tautological reformulation, so the derivation is self-contained and checked against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests entirely on one key hypothesis introduced without supporting derivation or external evidence; no free parameters or new entities are mentioned.

axioms (1)
  • (ad hoc to paper) The distribution discrepancy between visual and linguistic features encodes semantic information, showing consistency across patches of the same category and inconsistency across different categories.
    This hypothesis is explicitly posited in the abstract as the basis for directly using the analytic solution as semantic maps.

pith-pipeline@v0.9.0 · 5512 in / 1313 out tokens · 87580 ms · 2026-05-10T16:54:32.474418+00:00 · methodology

