pith. sign in

arxiv: 2604.08110 · v2 · submitted 2026-04-09 · 💻 cs.CV · cs.AI· cs.LG

OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation

Pith reviewed 2026-05-10 16:48 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG
keywords open-vocabulary semantic segmentationtraining-free methodsglobal attentionsliding windowvision-language modelscontext aggregationfeature stitchingdense prediction
0
0 comments X

The pith

OV-Stitcher stitches sub-image attention maps inside the final encoder block to restore global context for training-free open-vocabulary segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that sliding-window processing of high-resolution images breaks global attention in pretrained vision-language encoders, producing inconsistent segmentation. OV-Stitcher fixes this by reconstructing full-image attention representations directly from the independently processed sub-image features, but only inside the last encoder block. This change yields spatially coherent and semantically aligned output maps without any retraining. The approach raises average performance across eight standard benchmarks. A sympathetic reader would care because it turns a common practical workaround into a source of usable global reasoning while preserving the training-free property.

Core claim

OV-Stitcher enables global attention within the final encoder block by stitching fragmented sub-image features, which produces coherent context aggregation and spatially consistent, semantically aligned segmentation maps for training-free open-vocabulary semantic segmentation.

What carries the argument

The stitching operation that reconstructs attention representations from independently processed sub-image features inside only the final encoder block.

If this is right

  • Segmentation maps become spatially consistent because global attention is restored at the point where final predictions are formed.
  • The method scales to arbitrary image resolutions while remaining training-free.
  • Performance improves from 48.7 to 50.7 mIoU on average across eight benchmarks.
  • Existing pretrained encoders can be used for dense open-vocabulary prediction without architectural changes to early layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same stitching step could be inserted into other dense-prediction heads that rely on sliding windows, such as depth estimation or instance segmentation.
  • If earlier encoder blocks were also stitched, the model might capture even longer-range dependencies, though this would increase compute.
  • The improvement is largest on scenes with distant but semantically related objects, suggesting the technique mainly helps relational reasoning.

Load-bearing premise

Stitching and reconstructing attention inside only the final encoder block is enough to recover accurate global context without changing earlier layers or retraining the model.

What would settle it

A controlled test that measures whether removing the stitching step inside the final block drops mIoU back to the prior sliding-window baseline level on the same eight benchmarks.

Figures

Figures reproduced from arXiv: 2604.08110 by Seunghyun Oh, Seungjae Moon, Youngmin Ro.

Figure 1
Figure 1. Figure 1: Top: Prior works process cropped sub-images indepen￾dently, preventing attention across different sub-image features. Bottom: We introduce a Stitch Attention mechanism that enables global attention across all cropped regions, yielding more coherent and contextually consistent feature integration. ble adaptation across diverse domains. Within this paradigm, training-free OVSS (TF-OVSS) represents a particul… view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of the attention maps and patch-interactions for prior methods and our Stitch Attention. (a) presents prior methods, and [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Visualization of feature representations and segmenta￾tion results. (a) and (b) show the image feature maps and segmen￾tation results obtained from the baseline and the proposed Stitch Attention, respectively. The top row shows the feature maps after applying PCA, and the bottom row presents the corresponding seg￾mentation results. The predicted segmentation result in [PITH_FULL_IMAGE:figures/full_fig_p00… view at source ↗
Figure 4
Figure 4. Figure 4: Overview of our method OV-Stitcher. Our core framework starts from processing each sub-image using a sliding window approach. From the final layer of each sub-image, we extract Q˜, K˜ , and V˜ features, and stitch each type separately across all sub-images to form the global Q, K, and V . Self-attention on these stitched features produces a feature map capturing global correlations. The features resulting … view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison with previous training-free open vocabulary segmentation methods. the highest average score among all methods. As shown in the lower part of Tab. 1 for MetaCLIP, OV-Stitcher again achieves the highest performance across all datasets, and the average score further improves by 1.2% in mIoU compared to the OpenAI CLIP results, reflecting the benefit of stronger visual representations. O… view at source ↗
Figure 7
Figure 7. Figure 7: Ablation on resolution robustness. Post-processing is excluded to clearly show the effect of the proposed framework. The x-axis represents the settings in the format shorter side – window size – stride. ProxyCLIP V21 C60 Obj. Stf. City ADE X 61.3 35.3 37.5 26.5 38.1 20.2 ✓ 62.9 36.3 38.1 26.7 39.8 20.9 [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Ablation on resolution robustness. Post-processing is excluded to clearly show the effect of the proposed framework. The x-axis represents the settings in the format shorter side – window size – stride. introduced by global token interactions can be effectively mitigated using standard efficient attention implementations, supporting the practical applicability of our method. To generate Class-Biased Prompt… view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative comparison showing the effect of CBP. Qualitative comparison showing the effect of CBP. To enable a more explicit comparison, post-processing is removed; while higher feature coherence can cause larger regions to be assigned to the wrong class, CBP reduces class ambiguity and helps maintain correct labeling [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Qualitative comparison without post-processing. By removing post-processing, it becomes clear that our method produces more spatially and semantically feature-coherent results than the baseline CorrCLIP [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Additional qualitative comparison on VOC21 [16]. Image SCLIP ProxyCLIP Trident CorrCLIP OV-Stitcher G.T [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Additional qualitative comparison on COCO Object [5] [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Additional qualitative comparison on Context60 [37]. Image SCLIP ProxyCLIP Trident CorrCLIP OV-Stitcher G.T [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Additional qualitative comparison on Cityscapes [12] [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Additional qualitative comparison on ADE20K [65] Image SCLIP ProxyCLIP Trident CorrCLIP OV-Stitcher G.T [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Additional qualitative comparison on COCO Stuff [5] [PITH_FULL_IMAGE:figures/full_fig_p020_16.png] view at source ↗
read the original abstract

Training-free open-vocabulary semantic segmentation(TF-OVSS) has recently attracted attention for its ability to perform dense prediction by leveraging the pretrained knowledge of large vision and vision-language models, without requiring additional training. However, due to the limited input resolution of these pretrained encoders, existing TF-OVSS methods commonly adopt a sliding-window strategy that processes cropped sub-images independently. While effective for managing high-resolution inputs, this approach prevents global attention over the full image, leading to fragmented feature representations and limited contextual reasoning. We propose OV-Stitcher, a training-free framework that addresses this limitation by stitching fragmented sub-image features directly within the final encoder block. By reconstructing attention representations from fragmented sub-image features, OV-Stitcher enables global attention within the final encoder block, producing coherent context aggregation and spatially consistent, semantically aligned segmentation maps. Extensive evaluations across eight benchmarks demonstrate that OV-Stitcher establishes a scalable and effective solution for open-vocabulary segmentation, achieving a notable improvement in mean Intersection over Union(mIoU) from 48.7 to 50.7 compared with prior training-free baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes OV-Stitcher, a training-free open-vocabulary semantic segmentation framework that processes high-resolution images via sliding windows but stitches fragmented sub-image features inside the final encoder block of a pretrained vision-language model. By reconstructing attention representations from these sub-image tokens, the method claims to enable global attention within that single block, yielding coherent context aggregation and spatially consistent segmentation maps. Extensive experiments on eight benchmarks report an mIoU improvement from 48.7 to 50.7 over prior training-free baselines.

Significance. If the single-block stitching mechanism proves sufficient to recover usable global context, the approach would offer a practical, training-free advance for high-resolution TF-OVSS by mitigating fragmentation without retraining or altering earlier layers. The multi-benchmark evaluation and focus on scalability are strengths that could influence follow-up work on efficient context aggregation in pretrained encoders.

major comments (2)
  1. The core claim (Abstract and Method description) that reconstructing attention representations from independently encoded sub-image features inside only the final encoder block enables global attention and coherent context aggregation is not yet load-bearing supported. Because sub-images are processed independently through all preceding layers, the query/key/value tokens entering the final block encode exclusively local context; post-hoc attention reconstruction can at best perform a mixing of already-localized representations and cannot inject the long-range dependencies that earlier self-attention layers would have captured across sub-images. The reported mIoU gain (48.7 to 50.7) is therefore not yet attributable to “global attention within the final encoder block” without an ablation that compares single-block stitching against a true full-image forward pass or against stitching in K>
  2. Experiments section: the manuscript reports consistent mIoU gains across eight benchmarks yet provides no ablation studies on the stitching operation, no statistical significance tests, and no implementation details (e.g., how attention maps are exactly reconstructed and merged). These omissions are load-bearing for the central claim that the improvement stems from the proposed global-context mechanism rather than from unstated dataset-specific choices or baseline re-implementations.
minor comments (1)
  1. Abstract: the phrase “eight benchmarks” is used without naming the datasets; listing them would improve immediate readability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thorough and constructive review. We address each major comment below with the strongest honest defense of the manuscript while acknowledging where revisions are needed to strengthen the claims and evidence.

read point-by-point responses
  1. Referee: The core claim (Abstract and Method description) that reconstructing attention representations from independently encoded sub-image features inside only the final encoder block enables global attention and coherent context aggregation is not yet load-bearing supported. Because sub-images are processed independently through all preceding layers, the query/key/value tokens entering the final block encode exclusively local context; post-hoc attention reconstruction can at best perform a mixing of already-localized representations and cannot inject the long-range dependencies that earlier self-attention layers would have captured across sub-images. The reported mIoU gain (48.7 to 50.7) is therefore not yet attributable to “global attention within the final encoder block” without an ablation that compares single-block stitching against a true full-image forward pass or against stitching in K>

    Authors: We agree that tokens entering the final block carry only local context from prior independent processing, so the final-block attention performs mixing of localized representations rather than recovering dependencies that earlier layers could have modeled across sub-images. The mechanism still differs from prior TF-OVSS baselines, which perform no cross-sub-image attention at any stage and typically rely on post-hoc averaging or independent decoding. By enabling full-image token attention inside the last block we obtain measurable coherence gains, as reflected in the consistent mIoU lift. We cannot run a true full-image forward pass because the pretrained encoder’s fixed input resolution (and associated memory limits) precludes high-resolution inputs without the sliding-window strategy. We will revise the abstract and method sections to state the claim more precisely as “global attention over stitched tokens in the final block” and add an ablation that applies stitching in K>1 blocks to quantify the incremental benefit of the final-block placement. revision: partial

  2. Referee: Experiments section: the manuscript reports consistent mIoU gains across eight benchmarks yet provides no ablation studies on the stitching operation, no statistical significance tests, and no implementation details (e.g., how attention maps are exactly reconstructed and merged). These omissions are load-bearing for the central claim that the improvement stems from the proposed global-context mechanism rather than from unstated dataset-specific choices or baseline re-implementations.

    Authors: We accept that the current manuscript lacks these elements. In the revised version we will (1) supply complete implementation details on token stitching, Q/K/V reconstruction, and attention-map merging inside the final block; (2) add ablation studies that isolate the stitching operation (e.g., feature concatenation without attention, attention-based stitching vs. simple averaging, and varying the number of blocks in which stitching occurs); and (3) report statistical significance (standard deviation over multiple runs or deterministic baseline comparisons) to confirm the gains are attributable to the proposed mechanism rather than implementation artifacts. revision: yes

standing simulated objections not resolved
  • Direct ablation against a true full-image forward pass, which is infeasible under the pretrained model’s fixed input resolution and memory constraints for the high-resolution images used in the benchmarks.

Circularity Check

0 steps flagged

No circularity; algorithmic stitching procedure with empirical results.

full rationale

The paper describes OV-Stitcher as a training-free algorithmic framework that stitches sub-image attention representations only inside the final encoder block. No equations, derivations, or first-principles results are presented that reduce the reported mIoU gains (48.7 to 50.7) to quantities defined by the method's own inputs or fitted parameters. The contribution is an engineering procedure for handling high-resolution inputs, validated through benchmark evaluations rather than any closed mathematical chain. No self-citations, ansatzes, or uniqueness theorems appear as load-bearing elements that would create circularity. The central claim remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on the effectiveness of the newly introduced stitching procedure and on background assumptions about pretrained models; no free parameters are fitted inside the method itself.

axioms (2)
  • domain assumption Pretrained large vision and vision-language models contain sufficient semantic knowledge to support open-vocabulary segmentation when global context is restored.
    Invoked to justify leveraging existing encoders without training while addressing only the resolution limitation.
  • ad hoc to paper Reconstructing attention representations from independently encoded sub-image features inside the final block approximates the attention that would arise from a full-image forward pass.
    This is the core unproven premise that makes the stitching step work.
invented entities (1)
  • OV-Stitcher stitching mechanism no independent evidence
    purpose: To reconstruct global attention from fragmented sub-image features within the final encoder block
    Newly proposed algorithmic component whose validity is demonstrated only through the paper's own experiments.

pith-pipeline@v0.9.0 · 5505 in / 1501 out tokens · 69153 ms · 2026-05-10T16:48:07.111240+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 2 internal anchors

  1. [1]

    IEEE Transactions on Image Processing34, 8271–8284 (2025) arXiv:2411.15869 [cs.CV]

    Bai andSule, Liu andYong, Han andYifei, Zhang, Haoji, Tang, and Yansong. Self-calibrated clip for training-free open- vocabulary segmentation.arXiv preprint arXiv:2411.15869,

  2. [2]

    Grounding everything: Emerg- ing localization properties in vision-language transformers

    Bousselham andWalid, Petersen andFelix, Ferrari andVitto- rio, and Kuehne andHilde. Grounding everything: Emerg- ing localization properties in vision-language transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3828–3837,

  3. [3]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Assran, Mahmoud, Duval, Quentin, Misra, Ishan, Bo- janowski, Piotr, Vincent, Pascal, Rabbat, Michael, LeCun, Yann, Ballas, and Nicolas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the International Conference on Computer Vision (ICCV), pages 15619–15629, 2023. 3

  4. [4]

    Window attention is bugged: How not to inter- polate position embeddings

    Daniel Bolya, Chaitanya Ryali, Judy Hoffman, and Christoph Feichtenhofer. Window attention is bugged: How not to inter- polate position embeddings. InThe International Conference on Learning Representations (ICLR), 2024. 6

  5. [5]

    Coco- stuff: Thing and stuff classes in context

    Caesar, Holger, Uijlings, Jasper, Ferrari, and Vittorio. Coco- stuff: Thing and stuff classes in context. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1209–1218, 2018. 5, 12, 14, 18, 20

  6. [6]

    Emerging properties in self-supervised vision transformers

    Caron, Mathilde, Touvron, Hugo, Misra, Ishan, J´egou, Herv´e, Mairal, Julien, Bojanowski, Piotr, Joulin, and Armand. Emerging properties in self-supervised vision transformers. InProceedings of the International Conference on Computer Vision (ICCV), 2021. 2, 3, 6, 13

  7. [7]

    Learn- ing to generate text-grounded mask for open-world semantic segmentation from only image-text pairs

    Cha, Junbum, Mun, Jonghwan, Roh, and Byungseok. Learn- ing to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 3

  8. [8]

    Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuh- mann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scal- ing laws for contrastive language-image learning.Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2829, 2023. 3, 13

  9. [9]

    Cat-seg: Cost ag- gregation for open-vocabulary semantic segmentation, 2024

    Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. Cat-seg: Cost ag- gregation for open-vocabulary semantic segmentation, 2024. 3

  10. [10]

    MMSegmenta- tion: Openmmlab semantic segmentation toolbox and benchmark

    MMSegmentation Contributors. MMSegmenta- tion: Openmmlab semantic segmentation toolbox and benchmark. https : / / github . com / open - mmlab/mmsegmentation, 2020. 6

  11. [11]

    MMEngine: Openmmlab foun- dational library for training deep learning models

    MMEngine Contributors. MMEngine: Openmmlab foun- dational library for training deep learning models. https: //github.com/open-mmlab/mmengine, 2022. 6

  12. [12]

    The cityscapes dataset for semantic urban scene understanding

    Cordts, Marius, Omran, Mohamed, Ramos, Sebastian, Re- hfeld, Timo, Enzweiler, Markus, Benenson, Rodrigo, Franke, Uwe, Roth, Stefan, Schiele, and Bernt. The cityscapes dataset for semantic urban scene understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, 2016. 5, 14, 19

  13. [13]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Dao and Tri. FlashAttention-2: Faster attention with better parallelism and work partitioning. InICLR, 2024. 12

  14. [14]

    Vision transformers need registers

    Darcet, Timoth ´ee, Oquab, Maxime, Mairal, Julien, Bo- janowski, and Piotr. Vision transformers need registers. The International Conference on Learning Representations (ICLR), 2023. 3

  15. [15]

    Imagenet: A large-scale hierarchical image database

    Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009. 13

  16. [16]

    The pascal visual object classes (voc) challenge.International journal of computer vision (IJCV), 88(2):303–338, 2010

    Everingham, Mark, Van Gool, Luc, Williams, Christopher KI, Winn, John, Zisserman, and Andrew. The pascal visual object classes (voc) challenge.International journal of computer vision (IJCV), 88(2):303–338, 2010. 5, 12, 14, 18

  17. [17]

    arXiv preprint arXiv:2309.17425 (2023)

    Fang, Alex, Jose, Albin Madappally, Jain, Amit, Schmidt, Ludwig, Toshev, Alexander, Shankar, and Vaishaal. Data filtering networks.arXiv preprint arXiv:2309.17425, 2023. 3, 13

  18. [18]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Ab- hinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, and Akhil Mathur et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 6

  19. [19]

    Pay at- tention to your neighbours: Training-free open-vocabulary semantic segmentation

    Hajimiri, Sina, Ben Ayed, Ismail, Dolz, and Jose. Pay at- tention to your neighbours: Training-free open-vocabulary semantic segmentation. InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision (WACV),

  20. [20]

    Few-shot object detection with foundation models

    Guangxing Han and Ser-Nam Lim. Few-shot object detection with foundation models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 3

  21. [21]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollar, , and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 16000–16009, 2022. 2, 3, 6

  22. [22]

    Jin, Shuo, Yu, Siyue, Zhang, Bingfeng, Sun, Mingjie, Dong, Yi, Xiao, and Jimin. Feature purification matters: Suppressing outlier propagation for training-free open-vocabulary seman- tic segmentation.Proceedings of the International Conference on Computer Vision (ICCV), 2025. 1, 3, 6

  23. [23]

    Distilling self-supervised vision transformers for weakly-supervised few-shot classification & segmentation

    Kang, Dahyun, Koniusz, Piotr, Cho, Minsu, Murray, and Naila. Distilling self-supervised vision transformers for weakly-supervised few-shot classification & segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 3

  24. [24]

    In defense of lazy visual grounding for open-vocabulary semantic segmentation

    Kang, Dahyun, Cho, and Minsu. In defense of lazy visual grounding for open-vocabulary semantic segmentation. In The European Conference on Computer Vision (ECCV), pages 143–164. Springer, 2024. 3

  25. [25]

    Distilling spectral graph for object-context aware open-vocabulary semantic segmentation

    Kim, Chanyoung, Ju, Dayun, Han, Woojung, Yang, Ming- Hsuan, Hwang, and Seong Jae. Distilling spectral graph for object-context aware open-vocabulary semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 6

  26. [26]

    Towards generalizable scene change detection

    Jaewoo Kim and Uehwan Kim. Towards generalizable scene change detection. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR),

  27. [27]

    Sam-r1: Leveraging sam for reward feedback in multimodal segmentation via reinforce- ment learning

    Jaewoo Kim and Uehwan Kim. Sam-r1: Leveraging sam for reward feedback in multimodal segmentation via reinforce- ment learning. InAdvances in Neural Information Processing Systems (NIPS), 2025. 3

  28. [28]

    Segment anything

    Kirillov, Alexander, Mintun, Eric, Ravi, Nikhila, Mao, Hanzi, Rolland, Chloe, Gustafson, Laura, Xiao, Tete, Whitehead, Spencer, Berg, Alexander C., Lo, Wan-Yen, Doll ´ar, Piotr, Girshick, and Ross. Segment anything. InProceedings of the International Conference on Computer Vision (ICCV), pages 3992–4003, 2023. 2, 3

  29. [29]

    The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale

    Kuznetsova, Alina, Rom, Hassan, Alldrin, Neil, Uijlings, Jasper, Krasin, Ivan, Pont-Tuset, Jordi, Kamali, Shahab, Popov, Stefan, Malloci, Matteo, Kolesnikov, Alexander, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision (IJCV), 128(7): 1956...

  30. [30]

    Clearclip: Decompos- ing clip representations for dense vision-language inference

    Lan, Mengcheng, Chen, Chaofeng, Ke, Yiping, Wang, Xin- jiang, Feng, Litong, Zhang, and Wayne. Clearclip: Decompos- ing clip representations for dense vision-language inference. InThe European Conference on Computer Vision (ECCV), pages 143–160. Springer, 2024. 3, 6

  31. [31]

    Proxyclip: Proxy attention improves clip for open-vocabulary segmentation

    Lan, Mengcheng, Chen, Chaofeng, Ke, Yiping, Wang, Xin- jiang, Feng, Litong, Zhang, and Wayne. Proxyclip: Proxy attention improves clip for open-vocabulary segmentation. InThe European Conference on Computer Vision (ECCV), pages 70–88. Springer, 2024. 2, 3, 4, 6, 14

  32. [32]

    A closer look at the explainability of contrastive language-image pre-training.Pattern Recognition (PR), 162: 111409, 2025

    Yi Li, Hualiang Wang, Yiqun Duan, Jiheng Zhang, and Xi- aomeng Li. A closer look at the explainability of contrastive language-image pre-training.Pattern Recognition (PR), 162: 111409, 2025. 1, 3

  33. [33]

    Open-vocabulary semantic segmentation with mask-adapted clip

    Liang, Feng, Wu, Bichen, Dai, Xiaoliang, Li, Kunpeng, Zhao, Yinan, Zhang, Hang, Zhang, Peizhao, Vajda, Peter, Mar- culescu, and Diana. Open-vocabulary semantic segmentation with mask-adapted clip. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), pages 7061–7070, 2023. 3

  34. [34]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, , and C Lawrence Zitnick. Microsoft coco: Common objects in context. InThe European Conference on Computer Vision (ECCV). Springer,

  35. [35]

    Open-vocabulary segmentation with semantic- assisted calibration.Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR),

    Liu, Yong, Bai, Sule, Li, Guanbin, Wang, Yitong, Tang, and Yansong. Open-vocabulary segmentation with semantic- assisted calibration.Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR),

  36. [36]

    SegCLIP: Patch aggregation with learnable centers for open-vocabulary semantic segmentation.The International Conference on Machine Learning (ICML), 2023

    Huaishao Luo, Junwei Bao, Youzheng Wu, Xiaodong He, and Tianrui Li. SegCLIP: Patch aggregation with learnable centers for open-vocabulary semantic segmentation.The International Conference on Machine Learning (ICML), 2023. 3

  37. [37]

    The role of context for object detection and semantic segmentation in the wild

    Mottaghi, Roozbeh, Chen, Xianjie, Liu, Xiaobai, Cho, Nam- Gyu, Lee, Seong-Whan, Fidler, Sanja, Urtasun, Raquel, Yuille, and Alan. The role of context for object detection and semantic segmentation in the wild. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 891–898, 2014. 5, 12, 14, 19

  38. [38]

    Dinov2: Learning robust visual fea- tures without supervision.Transactions on Machine Learning Research (TMLR), 2023

    Oquab, Maxime, Darcet, Timoth´ee, Moutakanni, Theo, V o, Huy V ., Szafraniec, Marc, Khalidov, Vasil, Fernandez, Pierre, Haziza, Daniel, Massa, Francisco, El-Nouby, Alaaeldin, Howes, Russell, Huang, Po-Yao, Xu, Hu, Sharma, Vasu, Li, Shang-Wen, Galuba, Wojciech, Rabbat, Mike, Assran, Mido, Ballas, Nicolas, Synnaeve, Gabriel, Misra, Ishan, Je- gou, Herve, Ma...

  39. [39]

    Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever

    Alec Radford, Jong Wook Kim, Chris Hallacy, A. Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from nat- ural language supervision. InThe International Conference on Machine Learning (ICML), 2021. 1, 3, 6

  40. [40]

    Sam 2: Segment anything in images and videos.The International Conference on Learning Representations (ICLR), 2024

    Ravi, Nikhila, Gabeur, Valentin, Hu, Yuan-Ting, Hu, Rong- hang, Ryali, Chaitanya, Ma, Tengyu, Khedr, Haitham, R¨adle, Roman, Rolland, Chloe, Gustafson, Laura, et al. Sam 2: Segment anything in images and videos.The International Conference on Learning Representations (ICLR), 2024. 2, 3, 4, 6

  41. [41]

    Imagenet-21k pretraining for the masses.Advances in Neural Information Processing Systems (NIPS), 2021

    Ridnik, Tal, Ben-Baruch, Emanuel, Noy, Asaf, Zelnik-Manor, and Lihi. Imagenet-21k pretraining for the masses.Advances in Neural Information Processing Systems (NIPS), 2021. 13

  42. [42]

    Hiera: A hier- archical vision transformer without the bells-and-whistles

    Ryali, Chaitanya, Hu, Yuan-Ting, Bolya, Daniel, Wei, Chen, Fan, Haoqi, Huang, Po-Yao, Aggarwal, Vaibhav, Chowdhury, Arkabandhu, Poursaeed, Omid, Hoffman, Judy, Malik, Jiten- dra, Li, Yanghao, Feichtenhofer, and Christoph. Hiera: A hier- archical vision transformer without the bells-and-whistles. In The International Conference on Machine Learning (ICML),

  43. [43]

    Ex- plore the potential of clip for training-free open vocabulary semantic segmentation

    Tong Shao, Zhuotao Tian, Hang Zhao, and Jingyong Su. Ex- plore the potential of clip for training-free open vocabulary semantic segmentation. InThe European Conference on Com- puter Vision (ECCV). Springer, 2024. 3, 6

  44. [44]

    Harnessing vision foundation models for high-performance, training-free open vocabulary segmentation.Proceedings of the International Conference on Computer Vision (ICCV), 2025

    Yuheng Shi, Minjing Dong, and Chang Xu. Harnessing vision foundation models for high-performance, training-free open vocabulary segmentation.Proceedings of the International Conference on Computer Vision (ICCV), 2025. 4, 6, 14

  45. [45]

    DINOv3

    Sim´eoni, Oriane, V o, Huy V ., Seitzer, Maximilian, Baldas- sarre, Federico, Oquab, Maxime, Jose, Cijo, Khalidov, Vasil, Szafraniec, Marc, Yi, Seungeun, Ramamonjisoa, Micha ¨el, Massa, Francisco, Haziza, Daniel, Wehrstedt, Luca, Wang, Jianyuan, Darcet, Timoth ´ee, Moutakanni, Th ´eo, Sentana, Leonel, Roberts, Claire, Vedaldi, Andrea, Tolan, Jamie, Brandt...

  46. [46]

    Clip as rnn: Segment countless visual concepts without training endeavor

    Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, and Siyang Li. Clip as rnn: Segment countless visual concepts without training endeavor. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR),

  47. [47]

    Sclip: Rethinking self-attention for dense vision-language inference

    Wang, Feng, Mei, Jieru, Yuille, and Alan. Sclip: Rethinking self-attention for dense vision-language inference. InThe European Conference on Computer Vision (ECCV), pages 315–332. Springer, 2024. 1, 3, 6, 14

  48. [48]

    Use: Universal seg- ment embeddings for open-vocabulary image segmentation,

    Xiaoqi Wang, Wenbin He, Xiwei Xuan, Clint Sebastian, Jorge Piazentin Ono, Xin Li, Sima Behpour, Thang Doan, Liang Gou, Han Wei Shen, and Liu Ren. Use: Universal seg- ment embeddings for open-vocabulary image segmentation,

  49. [49]

    Stronger, fewer, & superior: Harnessing vision foundation models for domain generalized semantic segmentation

    Zhixiang Wei, Lin Chen, Yi Jin, Xiaoxiao Ma, Tianle Liu, Pengyang Ling, Ben Wang, Huaian Chen, and Jinjin Zheng. Stronger, fewer, & superior: Harnessing vision foundation models for domain generalized semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), 2024. 3

  50. [50]

    Image-text co- decomposition for text-supervised semantic segmentation

    Wu, Ji-Jia, Chang, Andy Chia-Hao, Chuang, Chieh-Yu, Chen, Chun-Pei, Liu, Yu-Lun, Chen, Min-Hung, Hu, Hou- Ning, Chuang, Yung-Yu, Lin, and Yen-Yu. Image-text co- decomposition for text-supervised semantic segmentation. Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), 2024. 3

  51. [51]

    Clip-dinoiser: Teaching clip a few dino tricks for open-vocabulary semantic segmentation

    Wysocza´nska, Monika, Sim ´eoni, Oriane, Ramamonjisoa, Micha¨el, Bursuc, Andrei, Trzci ´nski, Tomasz, P ´erez, and Patrick. Clip-dinoiser: Teaching clip a few dino tricks for open-vocabulary semantic segmentation. InThe European Conference on Computer Vision (ECCV), pages 320–337. Springer, 2024. 3

  52. [52]

    Textregion: Text-aligned region tokens from frozen image-text models.Transactions on Machine Learning Research (TMLR), 2025

    Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, and Derek Hoiem. Textregion: Text-aligned region tokens from frozen image-text models.Transactions on Machine Learning Research (TMLR), 2025. J2C Certification. 2, 3

  53. [53]

    Simmim: A simple framework for masked image modeling

    Xie, Zhenda, Zhang, Zheng, Cao, Yue, Lin, Yutong, Bao, Jianmin, Yao, Zhuliang, Dai, Qi, Hu, and Han. Simmim: A simple framework for masked image modeling. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9653–9663, 2022. 3

  54. [54]

    Sed: A simple encoder-decoder for open-vocabulary semantic segmentation

    Bin Xie, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, and Yanwei Pang. Sed: A simple encoder-decoder for open-vocabulary semantic segmentation. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR),

  55. [55]

    Rewrite caption semantics: Bridging seman- tic gaps for language-supervised semantic segmentation

    Yun Xing, Jian Kang, Aoran Xiao, Jiahao Nie, Shao Ling, and Shijian Lu. Rewrite caption semantics: Bridging seman- tic gaps for language-supervised semantic segmentation. In Advances in Neural Information Processing Systems (NIPS),

  56. [56]

    Demysti- fying clip data.The International Conference on Learning Representations (ICLR), 2023

    Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demysti- fying clip data.The International Conference on Learning Representations (ICLR), 2023. 3, 6, 13

  57. [57]

    Side adapter network for open-vocabulary semantic segmentation

    Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. Side adapter network for open-vocabulary semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 3

  58. [58]

    Resclip: Residual attention for training-free dense vision- language inference

    Yang, Yuhang, Deng, Jinhong, Li, Wen, Duan, and Lixin. Resclip: Residual attention for training-free dense vision- language inference. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 29968–29978, 2025. 6

  59. [59]

    Convolutions die hard: Open-vocabulary seg- mentation with single frozen convolutional clip

    Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, and Liang- Chieh Chen. Convolutions die hard: Open-vocabulary seg- mentation with single frozen convolutional clip. InAdvances in Neural Information Processing Systems (NIPS), 2023. 3

  60. [60]

    Soma: Singular value decomposed minor components adaptation for domain generalizable representation learning

    Seokju Yun, Seunghye Chae, Dongheon Lee, and Youngmin Ro. Soma: Singular value decomposed minor components adaptation for domain generalizable representation learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3

  61. [61]

    Sigmoid loss for language image pre-training

    Zhai, Xiaohua, Mustafa, Basil, Kolesnikov, Alexander, Beyer, and Lucas. Sigmoid loss for language image pre-training. In Proceedings of the International Conference on Computer Vision (ICCV), pages 11975–11986, 2023. 3

  62. [62]

    Uncovering prototypical knowledge for weakly open-vocabulary semantic segmentation.Advances in Neural Information Processing Systems (NIPS), 2023

    Zhang, Fei, Zhou, Tianfei, Li, Boyang, He, Hao, Ma, Chao- fan, Zhang, Tianjiao, Yao, Jiangchao, Zhang, Ya, Wang, and Yanfeng. Uncovering prototypical knowledge for weakly open-vocabulary semantic segmentation.Advances in Neural Information Processing Systems (NIPS), 2023. 3

  63. [63]

    Corrclip: Re- constructing patch correlations in clip for open-vocabulary semantic segmentation.Proceedings of the International Con- ference on Computer Vision (ICCV), 2025

    Zhang, Dengke, Liu, Fagui, Tang, and Quan. Corrclip: Re- constructing patch correlations in clip for open-vocabulary semantic segmentation.Proceedings of the International Con- ference on Computer Vision (ICCV), 2025. 2, 3, 4, 6, 12, 14

  64. [64]

    Mamba as a bridge: Where vision foundation models meet vision language models for domain-generalized semantic segmentation

    Zhang, Xin, Tan, and Robby T. Mamba as a bridge: Where vision foundation models meet vision language models for domain-generalized semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14527–14537, 2025. 3

  65. [65]

    Scene parsing through ade20k dataset

    Zhou, Bolei, Zhao, Hang, Puig, Xavier, Fidler, Sanja, Bar- riuso, Adela, Torralba, and Antonio. Scene parsing through ade20k dataset. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 633–641, 2017. 5, 12, 14, 20

  66. [66]

    a photo of {class}

    Zhou, Chong, Loy, Chen Change, Dai, and Bo. Extract free dense labels from clip. InThe European Conference on Computer Vision (ECCV). Springer, 2022. 1, 3, 6 OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation Supplementary Material A. Additional Results on Varying Resolutions. To complement the results pr...