PrAda: Few-Shot Visual Adaptation for Text-Prompted Segmentation
Pith reviewed 2026-05-20 06:39 UTC · model grok-4.3
The pith
PrAda learns visual prototypes from few examples to correct misclassifications in text-prompted segmentation models under domain shift.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a frozen text-prompted segmentation model can be adapted parameter-efficiently by generating class-specific prototypes from combined pixel-level and transformer features and blending those prototypes with text predictions via a learned weighting factor. The claimed effect is to correct domain-shift misclassifications across semantic, instance, and panoptic segmentation while retaining the model's ability to handle arbitrary text prompts without retraining.
What carries the argument
Prototype Adaptation (PrAda), which builds class-specific prototypes by merging fine-grained pixel features with high-level transformer representations, then fuses them with text predictions through a learned importance factor to adjust classification under domain shift.
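As a rough illustration of the mechanism described above, the following Python/NumPy sketch builds per-class prototypes from few-shot support features and blends prototype similarities with text logits. It is not the authors' implementation. The function names (`build_prototypes`, `visual_logits`, `fuse_logits`), the use of concatenation to combine pixel and transformer features, the cosine-similarity scoring, and the sigmoid-gated convex combination controlled by `alpha_raw` are all assumptions made for this example.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length (all-zero vectors stay zero)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_prototypes(pixel_feats, transformer_feats, labels, num_classes):
    """Average the combined support features of each class (assumed recipe)."""
    # Assumption: "combining" the two feature types is plain concatenation.
    feats = l2_normalize(np.concatenate([pixel_feats, transformer_feats], axis=1))
    protos = np.zeros((num_classes, feats.shape[1]))
    for c in range(num_classes):
        members = feats[labels == c]
        if len(members):
            protos[c] = members.mean(axis=0)
    return l2_normalize(protos)

def visual_logits(segment_feat, protos):
    """Cosine similarity between one segment feature and every prototype."""
    return l2_normalize(segment_feat) @ protos.T

def fuse_logits(text_logits, vis_logits, alpha_raw):
    """Blend text and prototype scores with a scalar gate (hypothetical form)."""
    alpha = 1.0 / (1.0 + np.exp(-alpha_raw))  # squashes the raw factor into (0, 1)
    return (1.0 - alpha) * text_logits + alpha * vis_logits

# Toy example: two classes, two support samples each.
pixel = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
transformer = pixel.copy()
labels = np.array([0, 0, 1, 1])
protos = build_prototypes(pixel, transformer, labels, num_classes=2)

text = np.array([0.2, 0.8])  # text branch prefers class 1
vis = visual_logits(np.array([0.9, 0.1, 0.9, 0.1]), protos)
fused = fuse_logits(text, vis, alpha_raw=0.0)  # gate at 0.5
```

In this toy setup the text branch alone picks class 1, while the fused scores pick class 0. A strongly negative `alpha_raw` drives the gate toward zero and recovers the text-only prediction, which is one way a design like this could leave zero-shot behaviour intact.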
Load-bearing premise
That misclassification is the dominant error under domain shift and that fusing visual prototypes with text predictions will correct those errors without reducing performance on zero-shot classes or introducing new mistakes.
What would settle it
An evaluation on a held-out specialized domain in which the adapted model shows no gain over the frozen baseline, or in which zero-shot performance on new text prompts drops relative to the original model.
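The check described above could be scripted along the following lines. This is a minimal sketch, assuming mean IoU computed from confusion matrices as the metric; the helper names `mean_iou` and `claim_survives`, the tolerance `tol`, and the decision rule itself are illustrative choices, not something specified by the paper or this review.

```python
import numpy as np

def mean_iou(conf):
    """Mean IoU from a confusion matrix (rows: ground truth, cols: prediction)."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    valid = union > 0  # skip classes that never appear
    return float((tp[valid] / union[valid]).mean())

def claim_survives(frozen_target, adapted_target, frozen_novel, adapted_novel, tol=0.0):
    """True if adaptation helps on the target domain without hurting novel prompts."""
    gain = mean_iou(adapted_target) - mean_iou(frozen_target)
    drop = mean_iou(frozen_novel) - mean_iou(adapted_novel)
    return gain > tol and drop <= tol

# Toy confusion matrices for a two-class problem.
weak = [[5, 5], [5, 5]]
strong = [[8, 2], [2, 8]]
helps_and_keeps = claim_survives(weak, strong, strong, strong)
helps_but_forgets = claim_survives(weak, strong, strong, weak)
```

Here `helps_and_keeps` corresponds to the outcome the paper claims, and `helps_but_forgets` to a result that would count against it. A real study would also need several seeds and a non-zero `tol` to separate noise from a genuine change.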
Original abstract
Segmenting images is critical for visual understanding but demands extensive pixel-level annotations. Foundational models have enabled new paradigms for predicting new classes guided by textual prompts, without annotations from the target domain. Yet, on specialized target domains, far from the original pre-training, their performance degrades. We study the errors of existing methods under such domain-shift, finding that misclassification rather than mask generation is the main culprit. To address this, we introduce the novel problem of Few-Shot Visual Adaptation for text-prompted Segmentation. This kind of adaptation has been largely studied for image classification, but it remains unexplored for segmentation. We tackle this task with Prototype Adaptation (PrAda), a novel, parameter-efficient method that adapts a frozen text-prompted segmentation model. Our approach learns class-specific prototypes by combining fine-grained pixel features and high-level transformer representations, which are then fused with the original text-based predictions through a learned importance factor. This preserves the model's zero-shot potential while enabling strong adaptation to new domains. Experiments across semantic, instance, and panoptic segmentation on five benchmarks demonstrate that PrAda yields significant improvements over state-of-the-art and proposed baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the problem of few-shot visual adaptation for text-prompted segmentation and proposes PrAda, a parameter-efficient method that adapts a frozen foundational model. It claims that error analysis under domain shift reveals misclassification (rather than mask generation) as the dominant failure mode. PrAda learns class-specific prototypes by combining fine-grained pixel features with high-level transformer representations and fuses them with the original text-based predictions through a learned importance factor. This is intended to enable strong adaptation on new domains while preserving zero-shot capability. Experiments across semantic, instance, and panoptic segmentation on five benchmarks are reported to yield significant improvements over state-of-the-art and proposed baselines.
Significance. If the central claims are substantiated, the work would be significant for extending few-shot adaptation techniques from classification to text-prompted segmentation, an area noted as largely unexplored. The emphasis on parameter efficiency, retention of zero-shot performance, and evaluation across multiple segmentation paradigms and benchmarks would provide a practical contribution for deploying foundational models in specialized domains.
major comments (2)
- [Abstract] Abstract: the claim that 'misclassification rather than mask generation is the main culprit' under domain shift is presented as the key motivation, yet no quantitative error breakdown, failure-mode statistics, or before/after comparison is referenced to support this premise, which is load-bearing for the rationale of prototype fusion.
- [Method] The method description (implied in the abstract and method sections): the learned importance factor and class-specific prototypes (pixel features + transformer representations) are introduced as new components, but without an ablation isolating the fusion step or metrics showing specific reduction in misclassifications while preserving zero-shot mask quality on held-out classes/domains, the targeted correction mechanism remains unverified.
minor comments (1)
- [Abstract] The abstract mentions 'proposed baselines' without clarifying their construction or relation to existing few-shot segmentation methods; a dedicated section or table comparing against these would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. The comments identify important opportunities to strengthen the substantiation of our claims, and we address each point below with plans for revision.
- Referee: [Abstract] Abstract: the claim that 'misclassification rather than mask generation is the main culprit' under domain shift is presented as the key motivation, yet no quantitative error breakdown, failure-mode statistics, or before/after comparison is referenced to support this premise, which is load-bearing for the rationale of prototype fusion.
  Authors: We acknowledge that the submitted manuscript does not explicitly reference quantitative error statistics in the abstract or main text to support this claim. Our internal analysis did identify misclassification as the dominant failure mode under domain shift. In the revised manuscript we will add a dedicated subsection (or expanded paragraph in the introduction) presenting the quantitative error breakdown, including failure-mode statistics and before/after comparisons on held-out domains. This will directly support the motivation for prototype fusion.
  Revision: yes
- Referee: [Method] The method description (implied in the abstract and method sections): the learned importance factor and class-specific prototypes (pixel features + transformer representations) are introduced as new components, but without an ablation isolating the fusion step or metrics showing specific reduction in misclassifications while preserving zero-shot mask quality on held-out classes/domains, the targeted correction mechanism remains unverified.
  Authors: We agree that an explicit ablation isolating the fusion step and the learned importance factor would strengthen verification of the correction mechanism. We have conducted such ablations, which show measurable reductions in misclassifications while preserving zero-shot mask quality on held-out classes and domains. We will add these ablation results, along with the corresponding metrics and analysis, to the revised method and experiments sections.
  Revision: yes
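The quantitative error breakdown discussed in this exchange could take roughly the following shape. The sketch below is one possible protocol, not the authors' analysis: each ground-truth segment is matched to its best-overlapping predicted mask, and the outcome is counted as a mask failure if the overlap is below a threshold, otherwise as correct or misclassified depending on the predicted label. The names `mask_iou` and `error_breakdown` and the 0.5 IoU threshold are assumptions for illustration.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def error_breakdown(gt, pred, iou_thr=0.5):
    """gt and pred are lists of (mask, class_id); returns outcome counts per GT segment."""
    counts = {"correct": 0, "misclassified": 0, "mask_failure": 0}
    for gt_mask, gt_cls in gt:
        best_iou, best_cls = 0.0, None
        for p_mask, p_cls in pred:
            iou = mask_iou(gt_mask, p_mask)
            if iou > best_iou:
                best_iou, best_cls = iou, p_cls
        if best_iou < iou_thr:
            counts["mask_failure"] += 1   # no predicted mask covers this segment
        elif best_cls == gt_cls:
            counts["correct"] += 1
        else:
            counts["misclassified"] += 1  # good mask, wrong label
    return counts

def band(rows):
    """Helper: 6x4 mask that is True on the given rows."""
    m = np.zeros((6, 4), dtype=bool)
    m[rows, :] = True
    return m

tiny = np.zeros((6, 4), dtype=bool)
tiny[4, 0] = True  # a single-pixel prediction

gt = [(band([0, 1]), 0), (band([2, 3]), 1), (band([4, 5]), 2)]
pred = [(band([0, 1]), 0), (band([2, 3]), 2), (tiny, 2)]
counts = error_breakdown(gt, pred)
```

Running the same breakdown on the frozen and the adapted model would show whether the `misclassified` count falls while `mask_failure` stays roughly constant, which is the pattern the paper's motivation predicts.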
Circularity Check
No significant circularity; method and claims rest on external empirical validation
The paper proposes PrAda as a new few-shot adaptation technique that learns class-specific prototypes from pixel features and transformer representations, then fuses them with text predictions via a learned importance factor. This construction introduces independent learned components rather than defining outputs in terms of the frozen model's inputs. Central claims of improvement are supported by experiments across semantic/instance/panoptic segmentation on five benchmarks, not by any derivation that reduces to fitted parameters or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the provided text to force the result. The approach is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- learned importance factor
axioms (1)
- domain assumption: Misclassification rather than mask generation is the main culprit in performance degradation under domain shift for text-prompted segmentation models.
invented entities (1)
- class-specific prototypes (no independent evidence)