OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation
Pith reviewed 2026-05-10 16:48 UTC · model grok-4.3
The pith
OV-Stitcher stitches sub-image attention maps inside the final encoder block to restore global context for training-free open-vocabulary segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OV-Stitcher enables global attention within the final encoder block by stitching fragmented sub-image features, which produces coherent context aggregation and spatially consistent, semantically aligned segmentation maps for training-free open-vocabulary semantic segmentation.
What carries the argument
The stitching operation that reconstructs attention representations from independently processed sub-image features inside only the final encoder block.
If this is right
- Segmentation maps become spatially consistent because global attention is restored at the point where final predictions are formed.
- The method scales to arbitrary image resolutions while remaining training-free.
- Performance improves from 48.7 to 50.7 mIoU on average across eight benchmarks.
- Existing pretrained encoders can be used for dense open-vocabulary prediction without architectural changes to early layers.
Where Pith is reading between the lines
- The same stitching step could be inserted into other dense-prediction heads that rely on sliding windows, such as depth estimation or instance segmentation.
- If earlier encoder blocks were also stitched, the model might capture even longer-range dependencies, though this would increase compute.
- The improvement is largest on scenes with distant but semantically related objects, suggesting the technique mainly helps relational reasoning.
Load-bearing premise
Stitching and reconstructing attention inside only the final encoder block is enough to recover accurate global context without changing earlier layers or retraining the model.
What would settle it
A controlled test that measures whether removing the stitching step inside the final block drops mIoU back to the prior sliding-window baseline level on the same eight benchmarks.
Figures
read the original abstract
Training-free open-vocabulary semantic segmentation(TF-OVSS) has recently attracted attention for its ability to perform dense prediction by leveraging the pretrained knowledge of large vision and vision-language models, without requiring additional training. However, due to the limited input resolution of these pretrained encoders, existing TF-OVSS methods commonly adopt a sliding-window strategy that processes cropped sub-images independently. While effective for managing high-resolution inputs, this approach prevents global attention over the full image, leading to fragmented feature representations and limited contextual reasoning. We propose OV-Stitcher, a training-free framework that addresses this limitation by stitching fragmented sub-image features directly within the final encoder block. By reconstructing attention representations from fragmented sub-image features, OV-Stitcher enables global attention within the final encoder block, producing coherent context aggregation and spatially consistent, semantically aligned segmentation maps. Extensive evaluations across eight benchmarks demonstrate that OV-Stitcher establishes a scalable and effective solution for open-vocabulary segmentation, achieving a notable improvement in mean Intersection over Union(mIoU) from 48.7 to 50.7 compared with prior training-free baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes OV-Stitcher, a training-free open-vocabulary semantic segmentation framework that processes high-resolution images via sliding windows but stitches fragmented sub-image features inside the final encoder block of a pretrained vision-language model. By reconstructing attention representations from these sub-image tokens, the method claims to enable global attention within that single block, yielding coherent context aggregation and spatially consistent segmentation maps. Extensive experiments on eight benchmarks report an mIoU improvement from 48.7 to 50.7 over prior training-free baselines.
Significance. If the single-block stitching mechanism proves sufficient to recover usable global context, the approach would offer a practical, training-free advance for high-resolution TF-OVSS by mitigating fragmentation without retraining or altering earlier layers. The multi-benchmark evaluation and focus on scalability are strengths that could influence follow-up work on efficient context aggregation in pretrained encoders.
major comments (2)
- The core claim (Abstract and Method description) that reconstructing attention representations from independently encoded sub-image features inside only the final encoder block enables global attention and coherent context aggregation is not yet load-bearing supported. Because sub-images are processed independently through all preceding layers, the query/key/value tokens entering the final block encode exclusively local context; post-hoc attention reconstruction can at best perform a mixing of already-localized representations and cannot inject the long-range dependencies that earlier self-attention layers would have captured across sub-images. The reported mIoU gain (48.7 to 50.7) is therefore not yet attributable to “global attention within the final encoder block” without an ablation that compares single-block stitching against a true full-image forward pass or against stitching in K>
- Experiments section: the manuscript reports consistent mIoU gains across eight benchmarks yet provides no ablation studies on the stitching operation, no statistical significance tests, and no implementation details (e.g., how attention maps are exactly reconstructed and merged). These omissions are load-bearing for the central claim that the improvement stems from the proposed global-context mechanism rather than from unstated dataset-specific choices or baseline re-implementations.
minor comments (1)
- Abstract: the phrase “eight benchmarks” is used without naming the datasets; listing them would improve immediate readability.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. We address each major comment below with the strongest honest defense of the manuscript while acknowledging where revisions are needed to strengthen the claims and evidence.
read point-by-point responses
-
Referee: The core claim (Abstract and Method description) that reconstructing attention representations from independently encoded sub-image features inside only the final encoder block enables global attention and coherent context aggregation is not yet load-bearing supported. Because sub-images are processed independently through all preceding layers, the query/key/value tokens entering the final block encode exclusively local context; post-hoc attention reconstruction can at best perform a mixing of already-localized representations and cannot inject the long-range dependencies that earlier self-attention layers would have captured across sub-images. The reported mIoU gain (48.7 to 50.7) is therefore not yet attributable to “global attention within the final encoder block” without an ablation that compares single-block stitching against a true full-image forward pass or against stitching in K>
Authors: We agree that tokens entering the final block carry only local context from prior independent processing, so the final-block attention performs mixing of localized representations rather than recovering dependencies that earlier layers could have modeled across sub-images. The mechanism still differs from prior TF-OVSS baselines, which perform no cross-sub-image attention at any stage and typically rely on post-hoc averaging or independent decoding. By enabling full-image token attention inside the last block we obtain measurable coherence gains, as reflected in the consistent mIoU lift. We cannot run a true full-image forward pass because the pretrained encoder’s fixed input resolution (and associated memory limits) precludes high-resolution inputs without the sliding-window strategy. We will revise the abstract and method sections to state the claim more precisely as “global attention over stitched tokens in the final block” and add an ablation that applies stitching in K>1 blocks to quantify the incremental benefit of the final-block placement. revision: partial
-
Referee: Experiments section: the manuscript reports consistent mIoU gains across eight benchmarks yet provides no ablation studies on the stitching operation, no statistical significance tests, and no implementation details (e.g., how attention maps are exactly reconstructed and merged). These omissions are load-bearing for the central claim that the improvement stems from the proposed global-context mechanism rather than from unstated dataset-specific choices or baseline re-implementations.
Authors: We accept that the current manuscript lacks these elements. In the revised version we will (1) supply complete implementation details on token stitching, Q/K/V reconstruction, and attention-map merging inside the final block; (2) add ablation studies that isolate the stitching operation (e.g., feature concatenation without attention, attention-based stitching vs. simple averaging, and varying the number of blocks in which stitching occurs); and (3) report statistical significance (standard deviation over multiple runs or deterministic baseline comparisons) to confirm the gains are attributable to the proposed mechanism rather than implementation artifacts. revision: yes
- Direct ablation against a true full-image forward pass, which is infeasible under the pretrained model’s fixed input resolution and memory constraints for the high-resolution images used in the benchmarks.
Circularity Check
No circularity; algorithmic stitching procedure with empirical results.
full rationale
The paper describes OV-Stitcher as a training-free algorithmic framework that stitches sub-image attention representations only inside the final encoder block. No equations, derivations, or first-principles results are presented that reduce the reported mIoU gains (48.7 to 50.7) to quantities defined by the method's own inputs or fitted parameters. The contribution is an engineering procedure for handling high-resolution inputs, validated through benchmark evaluations rather than any closed mathematical chain. No self-citations, ansatzes, or uniqueness theorems appear as load-bearing elements that would create circularity. The central claim remains independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Pretrained large vision and vision-language models contain sufficient semantic knowledge to support open-vocabulary segmentation when global context is restored.
- ad hoc to paper Reconstructing attention representations from independently encoded sub-image features inside the final block approximates the attention that would arise from a full-image forward pass.
invented entities (1)
-
OV-Stitcher stitching mechanism
no independent evidence
Reference graph
Works this paper leans on
-
[1]
IEEE Transactions on Image Processing34, 8271–8284 (2025) arXiv:2411.15869 [cs.CV]
Bai andSule, Liu andYong, Han andYifei, Zhang, Haoji, Tang, and Yansong. Self-calibrated clip for training-free open- vocabulary segmentation.arXiv preprint arXiv:2411.15869,
-
[2]
Grounding everything: Emerg- ing localization properties in vision-language transformers
Bousselham andWalid, Petersen andFelix, Ferrari andVitto- rio, and Kuehne andHilde. Grounding everything: Emerg- ing localization properties in vision-language transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3828–3837,
-
[3]
Self-supervised learning from images with a joint-embedding predictive architecture
Assran, Mahmoud, Duval, Quentin, Misra, Ishan, Bo- janowski, Piotr, Vincent, Pascal, Rabbat, Michael, LeCun, Yann, Ballas, and Nicolas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the International Conference on Computer Vision (ICCV), pages 15619–15629, 2023. 3
work page 2023
-
[4]
Window attention is bugged: How not to inter- polate position embeddings
Daniel Bolya, Chaitanya Ryali, Judy Hoffman, and Christoph Feichtenhofer. Window attention is bugged: How not to inter- polate position embeddings. InThe International Conference on Learning Representations (ICLR), 2024. 6
work page 2024
-
[5]
Coco- stuff: Thing and stuff classes in context
Caesar, Holger, Uijlings, Jasper, Ferrari, and Vittorio. Coco- stuff: Thing and stuff classes in context. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1209–1218, 2018. 5, 12, 14, 18, 20
work page 2018
-
[6]
Emerging properties in self-supervised vision transformers
Caron, Mathilde, Touvron, Hugo, Misra, Ishan, J´egou, Herv´e, Mairal, Julien, Bojanowski, Piotr, Joulin, and Armand. Emerging properties in self-supervised vision transformers. InProceedings of the International Conference on Computer Vision (ICCV), 2021. 2, 3, 6, 13
work page 2021
-
[7]
Cha, Junbum, Mun, Jonghwan, Roh, and Byungseok. Learn- ing to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 3
work page 2023
-
[8]
Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuh- mann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scal- ing laws for contrastive language-image learning.Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2829, 2023. 3, 13
work page 2023
-
[9]
Cat-seg: Cost ag- gregation for open-vocabulary semantic segmentation, 2024
Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. Cat-seg: Cost ag- gregation for open-vocabulary semantic segmentation, 2024. 3
work page 2024
-
[10]
MMSegmenta- tion: Openmmlab semantic segmentation toolbox and benchmark
MMSegmentation Contributors. MMSegmenta- tion: Openmmlab semantic segmentation toolbox and benchmark. https : / / github . com / open - mmlab/mmsegmentation, 2020. 6
work page 2020
-
[11]
MMEngine: Openmmlab foun- dational library for training deep learning models
MMEngine Contributors. MMEngine: Openmmlab foun- dational library for training deep learning models. https: //github.com/open-mmlab/mmengine, 2022. 6
work page 2022
-
[12]
The cityscapes dataset for semantic urban scene understanding
Cordts, Marius, Omran, Mohamed, Ramos, Sebastian, Re- hfeld, Timo, Enzweiler, Markus, Benenson, Rodrigo, Franke, Uwe, Roth, Stefan, Schiele, and Bernt. The cityscapes dataset for semantic urban scene understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, 2016. 5, 14, 19
work page 2016
-
[13]
FlashAttention-2: Faster attention with better parallelism and work partitioning
Dao and Tri. FlashAttention-2: Faster attention with better parallelism and work partitioning. InICLR, 2024. 12
work page 2024
-
[14]
Vision transformers need registers
Darcet, Timoth ´ee, Oquab, Maxime, Mairal, Julien, Bo- janowski, and Piotr. Vision transformers need registers. The International Conference on Learning Representations (ICLR), 2023. 3
work page 2023
-
[15]
Imagenet: A large-scale hierarchical image database
Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009. 13
work page 2009
-
[16]
Everingham, Mark, Van Gool, Luc, Williams, Christopher KI, Winn, John, Zisserman, and Andrew. The pascal visual object classes (voc) challenge.International journal of computer vision (IJCV), 88(2):303–338, 2010. 5, 12, 14, 18
work page 2010
-
[17]
arXiv preprint arXiv:2309.17425 (2023)
Fang, Alex, Jose, Albin Madappally, Jain, Amit, Schmidt, Ludwig, Toshev, Alexander, Shankar, and Vaishaal. Data filtering networks.arXiv preprint arXiv:2309.17425, 2023. 3, 13
-
[18]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Ab- hinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, and Akhil Mathur et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
Pay at- tention to your neighbours: Training-free open-vocabulary semantic segmentation
Hajimiri, Sina, Ben Ayed, Ismail, Dolz, and Jose. Pay at- tention to your neighbours: Training-free open-vocabulary semantic segmentation. InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision (WACV),
-
[20]
Few-shot object detection with foundation models
Guangxing Han and Ser-Nam Lim. Few-shot object detection with foundation models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 3
work page 2024
-
[21]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollar, , and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 16000–16009, 2022. 2, 3, 6
work page 2022
-
[22]
Jin, Shuo, Yu, Siyue, Zhang, Bingfeng, Sun, Mingjie, Dong, Yi, Xiao, and Jimin. Feature purification matters: Suppressing outlier propagation for training-free open-vocabulary seman- tic segmentation.Proceedings of the International Conference on Computer Vision (ICCV), 2025. 1, 3, 6
work page 2025
-
[23]
Kang, Dahyun, Koniusz, Piotr, Cho, Minsu, Murray, and Naila. Distilling self-supervised vision transformers for weakly-supervised few-shot classification & segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 3
work page 2023
-
[24]
In defense of lazy visual grounding for open-vocabulary semantic segmentation
Kang, Dahyun, Cho, and Minsu. In defense of lazy visual grounding for open-vocabulary semantic segmentation. In The European Conference on Computer Vision (ECCV), pages 143–164. Springer, 2024. 3
work page 2024
-
[25]
Distilling spectral graph for object-context aware open-vocabulary semantic segmentation
Kim, Chanyoung, Ju, Dayun, Han, Woojung, Yang, Ming- Hsuan, Hwang, and Seong Jae. Distilling spectral graph for object-context aware open-vocabulary semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 6
work page 2025
-
[26]
Towards generalizable scene change detection
Jaewoo Kim and Uehwan Kim. Towards generalizable scene change detection. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR),
-
[27]
Sam-r1: Leveraging sam for reward feedback in multimodal segmentation via reinforce- ment learning
Jaewoo Kim and Uehwan Kim. Sam-r1: Leveraging sam for reward feedback in multimodal segmentation via reinforce- ment learning. InAdvances in Neural Information Processing Systems (NIPS), 2025. 3
work page 2025
-
[28]
Kirillov, Alexander, Mintun, Eric, Ravi, Nikhila, Mao, Hanzi, Rolland, Chloe, Gustafson, Laura, Xiao, Tete, Whitehead, Spencer, Berg, Alexander C., Lo, Wan-Yen, Doll ´ar, Piotr, Girshick, and Ross. Segment anything. InProceedings of the International Conference on Computer Vision (ICCV), pages 3992–4003, 2023. 2, 3
work page 2023
-
[29]
Kuznetsova, Alina, Rom, Hassan, Alldrin, Neil, Uijlings, Jasper, Krasin, Ivan, Pont-Tuset, Jordi, Kamali, Shahab, Popov, Stefan, Malloci, Matteo, Kolesnikov, Alexander, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision (IJCV), 128(7): 1956...
work page 1956
-
[30]
Clearclip: Decompos- ing clip representations for dense vision-language inference
Lan, Mengcheng, Chen, Chaofeng, Ke, Yiping, Wang, Xin- jiang, Feng, Litong, Zhang, and Wayne. Clearclip: Decompos- ing clip representations for dense vision-language inference. InThe European Conference on Computer Vision (ECCV), pages 143–160. Springer, 2024. 3, 6
work page 2024
-
[31]
Proxyclip: Proxy attention improves clip for open-vocabulary segmentation
Lan, Mengcheng, Chen, Chaofeng, Ke, Yiping, Wang, Xin- jiang, Feng, Litong, Zhang, and Wayne. Proxyclip: Proxy attention improves clip for open-vocabulary segmentation. InThe European Conference on Computer Vision (ECCV), pages 70–88. Springer, 2024. 2, 3, 4, 6, 14
work page 2024
-
[32]
Yi Li, Hualiang Wang, Yiqun Duan, Jiheng Zhang, and Xi- aomeng Li. A closer look at the explainability of contrastive language-image pre-training.Pattern Recognition (PR), 162: 111409, 2025. 1, 3
work page 2025
-
[33]
Open-vocabulary semantic segmentation with mask-adapted clip
Liang, Feng, Wu, Bichen, Dai, Xiaoliang, Li, Kunpeng, Zhao, Yinan, Zhang, Hang, Zhang, Peizhao, Vajda, Peter, Mar- culescu, and Diana. Open-vocabulary semantic segmentation with mask-adapted clip. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), pages 7061–7070, 2023. 3
work page 2023
-
[34]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, , and C Lawrence Zitnick. Microsoft coco: Common objects in context. InThe European Conference on Computer Vision (ECCV). Springer,
-
[35]
Liu, Yong, Bai, Sule, Li, Guanbin, Wang, Yitong, Tang, and Yansong. Open-vocabulary segmentation with semantic- assisted calibration.Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR),
-
[36]
Huaishao Luo, Junwei Bao, Youzheng Wu, Xiaodong He, and Tianrui Li. SegCLIP: Patch aggregation with learnable centers for open-vocabulary semantic segmentation.The International Conference on Machine Learning (ICML), 2023. 3
work page 2023
-
[37]
The role of context for object detection and semantic segmentation in the wild
Mottaghi, Roozbeh, Chen, Xianjie, Liu, Xiaobai, Cho, Nam- Gyu, Lee, Seong-Whan, Fidler, Sanja, Urtasun, Raquel, Yuille, and Alan. The role of context for object detection and semantic segmentation in the wild. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 891–898, 2014. 5, 12, 14, 19
work page 2014
-
[38]
Oquab, Maxime, Darcet, Timoth´ee, Moutakanni, Theo, V o, Huy V ., Szafraniec, Marc, Khalidov, Vasil, Fernandez, Pierre, Haziza, Daniel, Massa, Francisco, El-Nouby, Alaaeldin, Howes, Russell, Huang, Po-Yao, Xu, Hu, Sharma, Vasu, Li, Shang-Wen, Galuba, Wojciech, Rabbat, Mike, Assran, Mido, Ballas, Nicolas, Synnaeve, Gabriel, Misra, Ishan, Je- gou, Herve, Ma...
work page 2023
-
[39]
Alec Radford, Jong Wook Kim, Chris Hallacy, A. Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from nat- ural language supervision. InThe International Conference on Machine Learning (ICML), 2021. 1, 3, 6
work page 2021
-
[40]
Ravi, Nikhila, Gabeur, Valentin, Hu, Yuan-Ting, Hu, Rong- hang, Ryali, Chaitanya, Ma, Tengyu, Khedr, Haitham, R¨adle, Roman, Rolland, Chloe, Gustafson, Laura, et al. Sam 2: Segment anything in images and videos.The International Conference on Learning Representations (ICLR), 2024. 2, 3, 4, 6
work page 2024
-
[41]
Ridnik, Tal, Ben-Baruch, Emanuel, Noy, Asaf, Zelnik-Manor, and Lihi. Imagenet-21k pretraining for the masses.Advances in Neural Information Processing Systems (NIPS), 2021. 13
work page 2021
-
[42]
Hiera: A hier- archical vision transformer without the bells-and-whistles
Ryali, Chaitanya, Hu, Yuan-Ting, Bolya, Daniel, Wei, Chen, Fan, Haoqi, Huang, Po-Yao, Aggarwal, Vaibhav, Chowdhury, Arkabandhu, Poursaeed, Omid, Hoffman, Judy, Malik, Jiten- dra, Li, Yanghao, Feichtenhofer, and Christoph. Hiera: A hier- archical vision transformer without the bells-and-whistles. In The International Conference on Machine Learning (ICML),
-
[43]
Ex- plore the potential of clip for training-free open vocabulary semantic segmentation
Tong Shao, Zhuotao Tian, Hang Zhao, and Jingyong Su. Ex- plore the potential of clip for training-free open vocabulary semantic segmentation. InThe European Conference on Com- puter Vision (ECCV). Springer, 2024. 3, 6
work page 2024
-
[44]
Yuheng Shi, Minjing Dong, and Chang Xu. Harnessing vision foundation models for high-performance, training-free open vocabulary segmentation.Proceedings of the International Conference on Computer Vision (ICCV), 2025. 4, 6, 14
work page 2025
-
[45]
Sim´eoni, Oriane, V o, Huy V ., Seitzer, Maximilian, Baldas- sarre, Federico, Oquab, Maxime, Jose, Cijo, Khalidov, Vasil, Szafraniec, Marc, Yi, Seungeun, Ramamonjisoa, Micha ¨el, Massa, Francisco, Haziza, Daniel, Wehrstedt, Luca, Wang, Jianyuan, Darcet, Timoth ´ee, Moutakanni, Th ´eo, Sentana, Leonel, Roberts, Claire, Vedaldi, Andrea, Tolan, Jamie, Brandt...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
Clip as rnn: Segment countless visual concepts without training endeavor
Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, and Siyang Li. Clip as rnn: Segment countless visual concepts without training endeavor. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR),
-
[47]
Sclip: Rethinking self-attention for dense vision-language inference
Wang, Feng, Mei, Jieru, Yuille, and Alan. Sclip: Rethinking self-attention for dense vision-language inference. InThe European Conference on Computer Vision (ECCV), pages 315–332. Springer, 2024. 1, 3, 6, 14
work page 2024
-
[48]
Use: Universal seg- ment embeddings for open-vocabulary image segmentation,
Xiaoqi Wang, Wenbin He, Xiwei Xuan, Clint Sebastian, Jorge Piazentin Ono, Xin Li, Sima Behpour, Thang Doan, Liang Gou, Han Wei Shen, and Liu Ren. Use: Universal seg- ment embeddings for open-vocabulary image segmentation,
-
[49]
Zhixiang Wei, Lin Chen, Yi Jin, Xiaoxiao Ma, Tianle Liu, Pengyang Ling, Ben Wang, Huaian Chen, and Jinjin Zheng. Stronger, fewer, & superior: Harnessing vision foundation models for domain generalized semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), 2024. 3
work page 2024
-
[50]
Image-text co- decomposition for text-supervised semantic segmentation
Wu, Ji-Jia, Chang, Andy Chia-Hao, Chuang, Chieh-Yu, Chen, Chun-Pei, Liu, Yu-Lun, Chen, Min-Hung, Hu, Hou- Ning, Chuang, Yung-Yu, Lin, and Yen-Yu. Image-text co- decomposition for text-supervised semantic segmentation. Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), 2024. 3
work page 2024
-
[51]
Clip-dinoiser: Teaching clip a few dino tricks for open-vocabulary semantic segmentation
Wysocza´nska, Monika, Sim ´eoni, Oriane, Ramamonjisoa, Micha¨el, Bursuc, Andrei, Trzci ´nski, Tomasz, P ´erez, and Patrick. Clip-dinoiser: Teaching clip a few dino tricks for open-vocabulary semantic segmentation. InThe European Conference on Computer Vision (ECCV), pages 320–337. Springer, 2024. 3
work page 2024
-
[52]
Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, and Derek Hoiem. Textregion: Text-aligned region tokens from frozen image-text models.Transactions on Machine Learning Research (TMLR), 2025. J2C Certification. 2, 3
work page 2025
-
[53]
Simmim: A simple framework for masked image modeling
Xie, Zhenda, Zhang, Zheng, Cao, Yue, Lin, Yutong, Bao, Jianmin, Yao, Zhuliang, Dai, Qi, Hu, and Han. Simmim: A simple framework for masked image modeling. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9653–9663, 2022. 3
work page 2022
-
[54]
Sed: A simple encoder-decoder for open-vocabulary semantic segmentation
Bin Xie, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, and Yanwei Pang. Sed: A simple encoder-decoder for open-vocabulary semantic segmentation. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR),
-
[55]
Rewrite caption semantics: Bridging seman- tic gaps for language-supervised semantic segmentation
Yun Xing, Jian Kang, Aoran Xiao, Jiahao Nie, Shao Ling, and Shijian Lu. Rewrite caption semantics: Bridging seman- tic gaps for language-supervised semantic segmentation. In Advances in Neural Information Processing Systems (NIPS),
-
[56]
Demysti- fying clip data.The International Conference on Learning Representations (ICLR), 2023
Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demysti- fying clip data.The International Conference on Learning Representations (ICLR), 2023. 3, 6, 13
work page 2023
-
[57]
Side adapter network for open-vocabulary semantic segmentation
Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. Side adapter network for open-vocabulary semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 3
work page 2023
-
[58]
Resclip: Residual attention for training-free dense vision- language inference
Yang, Yuhang, Deng, Jinhong, Li, Wen, Duan, and Lixin. Resclip: Residual attention for training-free dense vision- language inference. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 29968–29978, 2025. 6
work page 2025
-
[59]
Convolutions die hard: Open-vocabulary seg- mentation with single frozen convolutional clip
Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, and Liang- Chieh Chen. Convolutions die hard: Open-vocabulary seg- mentation with single frozen convolutional clip. InAdvances in Neural Information Processing Systems (NIPS), 2023. 3
work page 2023
-
[60]
Seokju Yun, Seunghye Chae, Dongheon Lee, and Youngmin Ro. Soma: Singular value decomposed minor components adaptation for domain generalizable representation learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3
work page 2025
-
[61]
Sigmoid loss for language image pre-training
Zhai, Xiaohua, Mustafa, Basil, Kolesnikov, Alexander, Beyer, and Lucas. Sigmoid loss for language image pre-training. In Proceedings of the International Conference on Computer Vision (ICCV), pages 11975–11986, 2023. 3
work page 2023
-
[62]
Zhang, Fei, Zhou, Tianfei, Li, Boyang, He, Hao, Ma, Chao- fan, Zhang, Tianjiao, Yao, Jiangchao, Zhang, Ya, Wang, and Yanfeng. Uncovering prototypical knowledge for weakly open-vocabulary semantic segmentation.Advances in Neural Information Processing Systems (NIPS), 2023. 3
work page 2023
-
[63]
Zhang, Dengke, Liu, Fagui, Tang, and Quan. Corrclip: Re- constructing patch correlations in clip for open-vocabulary semantic segmentation.Proceedings of the International Con- ference on Computer Vision (ICCV), 2025. 2, 3, 4, 6, 12, 14
work page 2025
-
[64]
Zhang, Xin, Tan, and Robby T. Mamba as a bridge: Where vision foundation models meet vision language models for domain-generalized semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14527–14537, 2025. 3
work page 2025
-
[65]
Scene parsing through ade20k dataset
Zhou, Bolei, Zhao, Hang, Puig, Xavier, Fidler, Sanja, Bar- riuso, Adela, Torralba, and Antonio. Scene parsing through ade20k dataset. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 633–641, 2017. 5, 12, 14, 20
work page 2017
-
[66]
Zhou, Chong, Loy, Chen Change, Dai, and Bo. Extract free dense labels from clip. InThe European Conference on Computer Vision (ECCV). Springer, 2022. 1, 3, 6 OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation Supplementary Material A. Additional Results on Varying Resolutions. To complement the results pr...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.