RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models
Pith reviewed 2026-05-17 05:19 UTC · model grok-4.3
The pith
RADSeg uses enhancements to the RADIO model to achieve better zero-shot open-vocabulary segmentation accuracy with far fewer parameters and lower latency than prior large-model combinations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RADSeg applies self-correlating recursive attention, self-correlating global aggregation, and computationally efficient RADIO-SAM mask refinement to the RADIO agglomerative model, delivering 6-30% mIoU gains in the base ViT class, 3.95x faster inference, and 2.5x fewer parameters; the 106M-parameter RADSeg-base version exceeds the mIoU of prior combinations of 850-1350M-parameter models while using substantially less compute and memory.
What carries the argument
The RADIO agglomerative vision foundation model, augmented by self-correlating recursive attention for local feature refinement, self-correlating global aggregation for broader context, and RADIO-SAM mask refinement for precise boundaries, enabling zero-shot open-vocabulary segmentation.
If this is right
- Zero-shot open-vocabulary segmentation becomes deployable on edge devices with limited memory and compute budgets.
- Robotics perception pipelines gain more generalizable semantic labels without task-specific retraining.
- Model combination strategies for vision tasks can be replaced by single-model refinements when the base model is chosen appropriately.
- Accuracy improvements in open-vocabulary settings can be achieved without scaling model size or adding training data.
Where Pith is reading between the lines
- The same RADIO-based refinements might transfer to related zero-shot tasks such as open-vocabulary detection or instance segmentation.
- This work implies that careful adaptation of a single efficient foundation model can outperform naive ensembles of larger models in efficiency-sensitive domains.
- Further tests on long-tail or out-of-distribution scenes would clarify how far the zero-shot generalization extends beyond the evaluated benchmarks.
Load-bearing premise
The self-correlating recursive attention, global aggregation, and RADIO-SAM refinement steps produce reliable accuracy and efficiency gains across datasets and conditions without extra labeled segmentation data or extensive tuning.
What would settle it
Running RADSeg-base on a new held-out dataset and finding that its mIoU falls below that of prior 850-1350M model combinations while still claiming lower latency would disprove the performance claims.
Figures
read the original abstract
Open-vocabulary semantic segmentation (OVSS) underpins many vision and robotics tasks that require generalizable semantic understanding. Existing approaches either rely on limited segmentation training data, which hinders generalization, or apply zero-shot heuristics to vision-language models (e.g CLIP), while the most competitive approaches combine multiple models to improve performance at the cost of high computational and memory demands. In this work, we leverage an overlooked agglomerative vision foundation model, RADIO, to improve zero-shot OVSS along three key axes simultaneously: mIoU, latency, and parameter efficiency. We present the first comprehensive study of RADIO for zero-shot OVSS and enhance its performance through self-correlating recursive attention, self-correlating global aggregation, and computationally efficient RADIO SAM mask refinement. Our approach, RADSeg, achieves 6-30% mIoU improvement in the base ViT class while being 3.95x faster and using 2.5x fewer parameters. Surprisingly, RADSeg-base (106M) outperforms previous combinations of huge vision models (850-1350M) in mIoU, achieving state-of-the-art accuracy with substantially lower computational and memory cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RADSeg for zero-shot open-vocabulary semantic segmentation, leveraging the RADIO agglomerative vision foundation model. It proposes three enhancements—self-correlating recursive attention, self-correlating global aggregation, and computationally efficient RADIO-SAM mask refinement—to improve mIoU, latency, and parameter efficiency simultaneously. The central claims are 6-30% mIoU gains in the base ViT setting, 3.95x faster inference, 2.5x fewer parameters, and that RADSeg-base (106M parameters) outperforms prior combinations of much larger models (850-1350M parameters) while achieving state-of-the-art accuracy.
Significance. If the reported gains are reproducible and generalize, the work would be significant for efficient OVSS. Demonstrating that a single, modestly sized model can surpass ensembles of much larger vision-language models on accuracy while reducing compute and memory has clear practical value for robotics and edge deployment. The emphasis on an overlooked agglomerative model (RADIO) and the joint optimization of accuracy and efficiency metrics is a timely contribution.
major comments (2)
- [§5, Table 2] §5 (Experimental Results), Table 2: the claim that RADSeg-base outperforms combinations of 850-1350M models rests on a single reported mIoU number per dataset without error bars, multiple random seeds, or statistical tests. This is load-bearing for the central outperformance claim; without variance estimates it is impossible to assess whether the 6-30% gains are reliable or dataset-specific.
- [§4.3] §4.3 (Ablation Studies): the contribution of self-correlating recursive attention and global aggregation is shown only via cumulative additions; an isolated ablation replacing these modules with standard attention or pooling is missing. This weakens the attribution of the efficiency gains (3.95x speed, 2.5x fewer parameters) specifically to the proposed components.
minor comments (3)
- [§3.2] The notation for self-correlating attention is introduced without an explicit equation; adding a compact mathematical definition would improve clarity.
- [Figure 4] Figure 4 (qualitative results) would benefit from consistent color mapping across rows and an additional failure-case example to illustrate remaining limitations.
- [§2] The manuscript cites prior OVSS works but omits recent efficient segmentation baselines that also target parameter reduction; a short related-work paragraph addressing this would strengthen context.
Simulated Author's Rebuttal
We thank the referee for the positive assessment, the recommendation for minor revision, and the constructive comments that help strengthen the empirical support for our claims. We address each major comment below and have incorporated revisions to improve the manuscript.
read point-by-point responses
-
Referee: [§5, Table 2] §5 (Experimental Results), Table 2: the claim that RADSeg-base outperforms combinations of 850-1350M models rests on a single reported mIoU number per dataset without error bars, multiple random seeds, or statistical tests. This is load-bearing for the central outperformance claim; without variance estimates it is impossible to assess whether the 6-30% gains are reliable or dataset-specific.
Authors: We agree that variance estimates and statistical tests would make the central outperformance claims more robust. While single-run reporting is common in the OVSS literature for large foundation models given the high computational expense of repeated full evaluations, we acknowledge this as a limitation. In the revised manuscript we now include results from three independent runs with different random seeds, reporting mean mIoU and standard deviation in Table 2 along with a note on statistical significance. These additional results confirm that the 6-30% gains remain consistent across seeds and datasets. revision: yes
-
Referee: [§4.3] §4.3 (Ablation Studies): the contribution of self-correlating recursive attention and global aggregation is shown only via cumulative additions; an isolated ablation replacing these modules with standard attention or pooling is missing. This weakens the attribution of the efficiency gains (3.95x speed, 2.5x fewer parameters) specifically to the proposed components.
Authors: We concur that isolated ablations would strengthen the attribution of gains to the individual proposed modules. The original cumulative presentation was chosen to illustrate progressive improvement, but we recognize the value of direct replacements. In the revised §4.3 we have added two isolated ablation experiments: (1) replacing self-correlating recursive attention with standard multi-head attention while keeping all other components fixed, and (2) replacing self-correlating global aggregation with standard global average pooling. The new results show that each module contributes measurably to both accuracy and the reported efficiency improvements. revision: yes
Circularity Check
No significant circularity
full rationale
The paper advances an empirical method for zero-shot open-vocabulary segmentation by enhancing the RADIO vision foundation model with three proposed techniques (self-correlating recursive attention, self-correlating global aggregation, and RADIO-SAM mask refinement). No equations, derivations, or parameter-fitting steps are described that reduce by construction to the inputs or to self-citations; performance claims rest on comparative experiments against external baselines rather than any self-referential definitions or renamed predictions. The argument is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
RADSeg-base (106M) outperforms previous combinations of huge vision models (850-1350M) in mIoU
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
RADIO-ViPE performs online open-vocabulary semantic SLAM directly from monocular RGB video in dynamic environments by tightly coupling vision-language embeddings from foundation models with geometric factor-graph opti...
-
FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers
FUS3DMaps fuses voxel- and instance-level open-vocabulary layers inside a shared 3D voxel map to improve both layers and enable scalable accurate semantic mapping.
Reference graph
Works this paper leans on
-
[1]
Omar Alama, Avigyan Bhattacharya, Haoyang He, Se- ungchan Kim, Yuheng Qiu, Wenshan Wang, Cherie Ho, Nikhil Keetha, and Sebastian Scherer. Rayfronts: Open-set semantic ray frontiers for online scene understanding and ex- ploration.arXiv preprint arXiv:2504.06994, 2025. 2, 3, 4, 6, 7, 8, 5
-
[2]
Single-stage seman- tic segmentation from image labels
Nikita Araslanov and Stefan Roth. Single-stage seman- tic segmentation from image labels. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4253–4262, 2020. 5
work page 2020
-
[3]
IEEE Transactions on Image Processing34, 8271–8284 (2025) arXiv:2411.15869 [cs.CV]
Sule Bai, Yong Liu, Yifei Han, Haoji Zhang, and Yansong Tang. Self-calibrated clip for training-free open-vocabulary segmentation.arXiv preprint arXiv:2411.15869, 2024. 3, 4, 6, 8, 2, 5
-
[4]
Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, and Rita Cucchiara. Talking to dino: Bridging self- supervised vision backbones with language for open- vocabulary segmentation.arXiv preprint arXiv:2411.19331,
-
[5]
Coco- stuff: Thing and stuff classes in context
Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco- stuff: Thing and stuff classes in context. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 1209–1218, 2018. 2, 5
work page 2018
-
[6]
Emerg- ing properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 3
work page 2021
-
[7]
The cityscapes dataset for semantic urban scene understanding
Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016. 2, 5
work page 2016
-
[8]
Scannet: Richly-annotated 3d reconstructions of indoor scenes
Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, pages 5828–5839, 2017. 7, 8
work page 2017
-
[9]
Vision Transformers Need Registers
Timoth ´ee Darcet, Maxime Oquab, Julien Mairal, and Pi- otr Bojanowski. Vision transformers need registers.arXiv preprint arXiv:2309.16588, 2023. 4
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[10]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 3
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[11]
Williams, John Winn, and Andrew Zisserman
Mark Everingham, Luc Gool, Christopher K. Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.Int. J. Comput. Vision, 88(2): 303–338, 2010. 2, 5
work page 2010
-
[12]
Conceptgraphs: Open-vocabulary 3d scene graphs for per- ception and planning
Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, et al. Conceptgraphs: Open-vocabulary 3d scene graphs for per- ception and planning. In2024 IEEE International Confer- ence on Robotics and Automation (ICRA), pages 5021–5028. IEEE, 2024. 3, 2
work page 2024
-
[13]
Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation
Sina Hajimiri, Ismail Ben Ayed, and Jose Dolz. Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation. In2025 IEEE/CVF Winter Conference on Ap- plications of Computer Vision (WACV), pages 5061–5071. IEEE, 2025. 2, 3, 4, 5, 6, 8, 1
work page 2025
-
[14]
Radiov2.5: Improved baselines for agglomerative vision foundation models
Greg Heinrich, Mike Ranzinger, Hongxu Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. Radiov2.5: Improved baselines for agglomerative vision foundation models. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 22487–22497, 2025. 3
work page 2025
-
[15]
Tenenbaum, Celso Miguel de Melo, Madhava Krishna, Liam Paull, Florian Shkurti, and Antonio Torralba
Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala, Qiao Gu, Mohd Omama, Tao Chen, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Keetha, Ayush Tewari, Joshua B. Tenenbaum, Celso Miguel de Melo, Madhava Krishna, Liam Paull, Florian Shkurti, and Antonio Torralba. Conceptfusion: Open-set multimodal 3d mapping.Robotics: Science and Systems (RSS), 2023. 2, 3
work page 2023
-
[16]
Kim Jun-Seong, GeonU Kim, Kim Yu-Ji, Yu-Chiang Frank Wang, Jaesung Choe, and Tae-Hyun Oh. Dr. splat: Directly referring 3d gaussian splatting via direct language embed- ding registration. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14137–14146, 2025. 3
work page 2025
-
[17]
Garfield: Group anything with radiance fields
Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Gold- berg, Matthew Tancik, and Angjoo Kanazawa. Garfield: Group anything with radiance fields. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21530–21539, 2024. 3
work page 2024
-
[18]
RA VEN: Resilient Aerial Navigation via Open-Set Semantic Memory and Behavior Adaptation
Seungchan Kim, Omar Alama, Dmytro Kurdydyk, John Keller, Nikhil Keetha, Wenshan Wang, Yonatan Bisk, and Sebastian Scherer. Raven: Resilient aerial navigation via open-set semantic memory and behavior adaptation.arXiv preprint arXiv:2509.23563, 2025. 1
-
[19]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 2, 3
work page 2023
-
[20]
Proxyclip: Proxy attention improves clip for open-vocabulary segmentation
Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Proxyclip: Proxy attention improves clip for open-vocabulary segmentation. InEuropean Conference on Computer Vision, pages 70–88. Springer, 2024. 2, 3, 4, 5, 6, 8
work page 2024
-
[21]
Language-driven Semantic Segmentation
Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Ren ´e Ranftl. Language-driven semantic seg- mentation.arXiv preprint arXiv:2201.03546, 2022. 2
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[22]
Yuqi Lin, Hengjia Li, Wenqi Shao, Zheng Yang, Jun Zhao, Xiaofei He, Ping Luo, and Kaipeng Zhang. Samrefiner: Taming segment anything model for universal mask refine- ment.arXiv preprint arXiv:2502.06756, 2025. 5
-
[23]
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023. 3 9
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Andrew F Luo, Jacob Yeung, Rushikesh Zawar, Shaurya De- wan, Margaret M Henderson, Leila Wehbe, and Michael J Tarr. Brain mapping with dense features: Grounding cortical semantic selectivity in natural images with vision transform- ers.arXiv preprint arXiv:2410.05266, 2024. 1
-
[25]
The role of context for object detection and semantic segmentation in the wild
Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. 2, 5
work page 2014
-
[26]
Maxime Oquab, Timoth ´ee Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Rus- sell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang- Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nico- las Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patri...
work page 2023
-
[27]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 2, 3
work page 2021
-
[28]
Am-radio: Agglomerative vision foundation model reduce all domains into one
Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. Am-radio: Agglomerative vision foundation model reduce all domains into one. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12490–12500, 2024. 2, 3
work page 2024
-
[29]
Denseclip: Language-guided dense prediction with context- aware prompting
Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context- aware prompting. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 18082–18091, 2022. 2
work page 2022
-
[30]
Language embedded radiance fields for zero-shot task-oriented grasping
Adam Rashid, Satvik Sharma, Chung Min Kim, Justin Kerr, Lawrence Yunliang Chen, Angjoo Kanazawa, and Ken Gold- berg. Language embedded radiance fields for zero-shot task-oriented grasping. In7th Annual Conference on Robot Learning, 2023. 1
work page 2023
-
[31]
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[32]
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural in- formation processing systems, 35:25278–25294, 2022. 2
work page 2022
-
[33]
arXiv preprint arXiv:2210.05663 (2022)
Nur Muhammad Mahi Shafiullah, Chris Paxton, Lerrel Pinto, Soumith Chintala, and Arthur Szlam. Clip-fields: Weakly supervised semantic fields for robotic memory. arXiv preprint arXiv:2210.05663, 2022. 3
-
[34]
Jinghuan Shang, Karl Schmeckpeper, Brandon B May, Maria Vittoria Minniti, Tarik Kelestemur, David Watkins, and Laura Herlant. Theia: Distilling diverse vision foundation models for robot learning.arXiv preprint arXiv:2407.20179, 2024. 4
-
[35]
Language embedded 3d gaussians for open- vocabulary scene understanding
Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao- Hua Guan. Language embedded 3d gaussians for open- vocabulary scene understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5333–5343, 2024. 3
work page 2024
-
[36]
Yuheng Shi, Minjing Dong, and Chang Xu. Har- nessing vision foundation models for high-performance, training-free open vocabulary segmentation.arXiv preprint arXiv:2411.09219, 2024. 2, 3, 4, 5, 6, 8, 1
-
[37]
The Replica Dataset: A Digital Replica of Indoor Spaces
Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The replica dataset: A digital replica of indoor spaces.arXiv:1906.05797, 2019. 7, 8
work page internal anchor Pith review Pith/arXiv arXiv 1906
-
[38]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language en- coders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[39]
Sclip: Rethink- ing self-attention for dense vision-language inference
Feng Wang, Jieru Mei, and Alan Yuille. Sclip: Rethink- ing self-attention for dense vision-language inference. In European Conference on Computer Vision, pages 315–332. Springer, 2024. 2, 3, 4, 5, 1
work page 2024
-
[40]
Hierarchical open- vocabulary 3d scene graphs for language-grounded robot navigation
Abdelrhman Werby, Chenguang Huang, Martin B ¨uchner, Abhinav Valada, and Wolfram Burgard. Hierarchical open- vocabulary 3d scene graphs for language-grounded robot navigation. InFirst Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024. 3, 2
work page 2024
-
[41]
Clip-diy: Clip dense infer- ence yields open-vocabulary semantic segmentation for-free
Monika Wysocza ´nska, Micha ¨el Ramamonjisoa, Tomasz Trzci´nski, and Oriane Sim ´eoni. Clip-diy: Clip dense infer- ence yields open-vocabulary semantic segmentation for-free. InProceedings of the IEEE/CVF Winter Conference on Ap- plications of Computer Vision, pages 1403–1413, 2024. 2
work page 2024
-
[42]
Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, and Derek Hoiem. Textregion: Text-aligned region tokens from frozen image-text models.arXiv preprint arXiv:2505.23769,
-
[43]
Side adapter network for open-vocabulary semantic segmentation
Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xi- ang Bai. Side adapter network for open-vocabulary semantic segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2945– 2954, 2023. 2
work page 2023
-
[44]
Resclip: Residual attention for training-free dense vision- language inference
Yuhang Yang, Jinhong Deng, Wen Li, and Lixin Duan. Resclip: Residual attention for training-free dense vision- language inference. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29968–29978,
-
[45]
Scannet++: A high-fidelity dataset of 3d in- door scenes
Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d in- door scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023. 7, 8 10
work page 2023
-
[46]
Vlfm: Vision-language frontier maps for zero-shot semantic navigation
Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In2024 IEEE In- ternational Conference on Robotics and Automation (ICRA), pages 42–48. IEEE, 2024. 1
work page 2024
-
[47]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 2, 3
work page 2023
-
[48]
Scene parsing through ade20k dataset
Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641,
-
[49]
Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. InEuropean conference on computer vision, pages 696–712. Springer, 2022. 2 11 RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models Supplementary Material Limitations WhileRADSeg-base delivers strong mIoU gains with ...
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.