pith. sign in

arxiv: 2511.19704 · v2 · submitted 2025-11-24 · 💻 cs.CV

RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models

Pith reviewed 2026-05-17 05:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords open-vocabulary segmentationzero-shot semantic segmentationvision foundation modelsparameter efficiencyRADIO modelagglomerative modelsmask refinement
0
0 comments X

The pith

RADSeg uses enhancements to the RADIO model to achieve better zero-shot open-vocabulary segmentation accuracy with far fewer parameters and lower latency than prior large-model combinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that an overlooked agglomerative vision foundation model, RADIO, can be adapted with targeted enhancements to simultaneously advance accuracy, speed, and efficiency in zero-shot open-vocabulary semantic segmentation. Prior work either depends on scarce labeled segmentation data or stacks multiple large models at high compute cost. The authors introduce self-correlating recursive attention, self-correlating global aggregation, and efficient RADIO-SAM mask refinement to the base RADIO model. A sympathetic reader would care because this combination yields practical gains for robotics and general vision systems where memory and inference time are constrained. The base RADSeg variant demonstrates these improvements without requiring additional labeled data or heavy tuning.

Core claim

RADSeg applies self-correlating recursive attention, self-correlating global aggregation, and computationally efficient RADIO-SAM mask refinement to the RADIO agglomerative model, delivering 6-30% mIoU gains in the base ViT class, 3.95x faster inference, and 2.5x fewer parameters; the 106M-parameter RADSeg-base version exceeds the mIoU of prior combinations of 850-1350M-parameter models while using substantially less compute and memory.

What carries the argument

The RADIO agglomerative vision foundation model, augmented by self-correlating recursive attention for local feature refinement, self-correlating global aggregation for broader context, and RADIO-SAM mask refinement for precise boundaries, enabling zero-shot open-vocabulary segmentation.

If this is right

  • Zero-shot open-vocabulary segmentation becomes deployable on edge devices with limited memory and compute budgets.
  • Robotics perception pipelines gain more generalizable semantic labels without task-specific retraining.
  • Model combination strategies for vision tasks can be replaced by single-model refinements when the base model is chosen appropriately.
  • Accuracy improvements in open-vocabulary settings can be achieved without scaling model size or adding training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same RADIO-based refinements might transfer to related zero-shot tasks such as open-vocabulary detection or instance segmentation.
  • This work implies that careful adaptation of a single efficient foundation model can outperform naive ensembles of larger models in efficiency-sensitive domains.
  • Further tests on long-tail or out-of-distribution scenes would clarify how far the zero-shot generalization extends beyond the evaluated benchmarks.

Load-bearing premise

The self-correlating recursive attention, global aggregation, and RADIO-SAM refinement steps produce reliable accuracy and efficiency gains across datasets and conditions without extra labeled segmentation data or extensive tuning.

What would settle it

Running RADSeg-base on a new held-out dataset and finding that its mIoU falls below that of prior 850-1350M model combinations while still claiming lower latency would disprove the performance claims.

Figures

Figures reproduced from arXiv: 2511.19704 by Avigyan Bhattacharya, Darshil Jariwala, Omar Alama, Sebastian Scherer, Seungchan Kim, Wenshan Wang.

Figure 1
Figure 1. Figure 1: RADSeg is a dense, language-aligned feature encoder that enables low-parameter, low-latency open-vocabulary semantic seg￾mentation in 2D and 3D. The efficiency plots report average latency, parameter count, and mIoU across five 2D datasets on a V100. By enhancing spatial locality of RADIO features, RADSeg outperforms previous state-of-the-art methods in accuracy while remaining highly efficient in terms of… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the RADSeg pipeline. RGB sliding windows are processed by the RADIO backbone. Self-Correlating Recursive Attention (SCRA) computes a similarity matrix from these outputs, which is recursively fed back into the last attention block of RADIO. Feature windows are aggregated into a feature map and refined through Self-Correlating Global Aggregation (SCGA) to reduce noise and windowing artifacts. Fe… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison of last block attention and patch [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative 2D Open-Vocabulary Semantic Segmentation Results. For each of the five benchmark datasets, we show a representative example and compare RADSeg and RADSeg+ with competitive baselines (SC-CLIP, Talk2DINO, Trident, and TextRegion). Both RADSeg and RADSeg+ produce noticeably clearer and more accurate segmentation maps across all cases [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative 3D Open-Vocabulary Semantic Segmen￾tation Results. We show two scenes: one from Replica (“chair”, “table”, “couch” classes), and one from ScanNet++ (“bed”, “pil￾low”, “monitor” classes). Segmented voxels are overlaid on the RGB for visualization. Across all 3D baselines, RADSeg provides more accurate segmentations with far fewer outlier voxels. tion budgets, outperforming the closest baselines … view at source ↗
read the original abstract

Open-vocabulary semantic segmentation (OVSS) underpins many vision and robotics tasks that require generalizable semantic understanding. Existing approaches either rely on limited segmentation training data, which hinders generalization, or apply zero-shot heuristics to vision-language models (e.g CLIP), while the most competitive approaches combine multiple models to improve performance at the cost of high computational and memory demands. In this work, we leverage an overlooked agglomerative vision foundation model, RADIO, to improve zero-shot OVSS along three key axes simultaneously: mIoU, latency, and parameter efficiency. We present the first comprehensive study of RADIO for zero-shot OVSS and enhance its performance through self-correlating recursive attention, self-correlating global aggregation, and computationally efficient RADIO SAM mask refinement. Our approach, RADSeg, achieves 6-30% mIoU improvement in the base ViT class while being 3.95x faster and using 2.5x fewer parameters. Surprisingly, RADSeg-base (106M) outperforms previous combinations of huge vision models (850-1350M) in mIoU, achieving state-of-the-art accuracy with substantially lower computational and memory cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces RADSeg for zero-shot open-vocabulary semantic segmentation, leveraging the RADIO agglomerative vision foundation model. It proposes three enhancements—self-correlating recursive attention, self-correlating global aggregation, and computationally efficient RADIO-SAM mask refinement—to improve mIoU, latency, and parameter efficiency simultaneously. The central claims are 6-30% mIoU gains in the base ViT setting, 3.95x faster inference, 2.5x fewer parameters, and that RADSeg-base (106M parameters) outperforms prior combinations of much larger models (850-1350M parameters) while achieving state-of-the-art accuracy.

Significance. If the reported gains are reproducible and generalize, the work would be significant for efficient OVSS. Demonstrating that a single, modestly sized model can surpass ensembles of much larger vision-language models on accuracy while reducing compute and memory has clear practical value for robotics and edge deployment. The emphasis on an overlooked agglomerative model (RADIO) and the joint optimization of accuracy and efficiency metrics is a timely contribution.

major comments (2)
  1. [§5, Table 2] §5 (Experimental Results), Table 2: the claim that RADSeg-base outperforms combinations of 850-1350M models rests on a single reported mIoU number per dataset without error bars, multiple random seeds, or statistical tests. This is load-bearing for the central outperformance claim; without variance estimates it is impossible to assess whether the 6-30% gains are reliable or dataset-specific.
  2. [§4.3] §4.3 (Ablation Studies): the contribution of self-correlating recursive attention and global aggregation is shown only via cumulative additions; an isolated ablation replacing these modules with standard attention or pooling is missing. This weakens the attribution of the efficiency gains (3.95x speed, 2.5x fewer parameters) specifically to the proposed components.
minor comments (3)
  1. [§3.2] The notation for self-correlating attention is introduced without an explicit equation; adding a compact mathematical definition would improve clarity.
  2. [Figure 4] Figure 4 (qualitative results) would benefit from consistent color mapping across rows and an additional failure-case example to illustrate remaining limitations.
  3. [§2] The manuscript cites prior OVSS works but omits recent efficient segmentation baselines that also target parameter reduction; a short related-work paragraph addressing this would strengthen context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment, the recommendation for minor revision, and the constructive comments that help strengthen the empirical support for our claims. We address each major comment below and have incorporated revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [§5, Table 2] §5 (Experimental Results), Table 2: the claim that RADSeg-base outperforms combinations of 850-1350M models rests on a single reported mIoU number per dataset without error bars, multiple random seeds, or statistical tests. This is load-bearing for the central outperformance claim; without variance estimates it is impossible to assess whether the 6-30% gains are reliable or dataset-specific.

    Authors: We agree that variance estimates and statistical tests would make the central outperformance claims more robust. While single-run reporting is common in the OVSS literature for large foundation models given the high computational expense of repeated full evaluations, we acknowledge this as a limitation. In the revised manuscript we now include results from three independent runs with different random seeds, reporting mean mIoU and standard deviation in Table 2 along with a note on statistical significance. These additional results confirm that the 6-30% gains remain consistent across seeds and datasets. revision: yes

  2. Referee: [§4.3] §4.3 (Ablation Studies): the contribution of self-correlating recursive attention and global aggregation is shown only via cumulative additions; an isolated ablation replacing these modules with standard attention or pooling is missing. This weakens the attribution of the efficiency gains (3.95x speed, 2.5x fewer parameters) specifically to the proposed components.

    Authors: We concur that isolated ablations would strengthen the attribution of gains to the individual proposed modules. The original cumulative presentation was chosen to illustrate progressive improvement, but we recognize the value of direct replacements. In the revised §4.3 we have added two isolated ablation experiments: (1) replacing self-correlating recursive attention with standard multi-head attention while keeping all other components fixed, and (2) replacing self-correlating global aggregation with standard global average pooling. The new results show that each module contributes measurably to both accuracy and the reported efficiency improvements. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper advances an empirical method for zero-shot open-vocabulary segmentation by enhancing the RADIO vision foundation model with three proposed techniques (self-correlating recursive attention, self-correlating global aggregation, and RADIO-SAM mask refinement). No equations, derivations, or parameter-fitting steps are described that reduce by construction to the inputs or to self-citations; performance claims rest on comparative experiments against external baselines rather than any self-referential definitions or renamed predictions. The argument is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5532 in / 1116 out tokens · 45434 ms · 2026-05-17T05:19:15.124181+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

    cs.CV 2026-04 unverdicted novelty 6.0

    RADIO-ViPE performs online open-vocabulary semantic SLAM directly from monocular RGB video in dynamic environments by tightly coupling vision-language embeddings from foundation models with geometric factor-graph opti...

  2. FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers

    cs.RO 2026-05 unverdicted novelty 5.0

    FUS3DMaps fuses voxel- and instance-level open-vocabulary layers inside a shared 3D voxel map to improve both layers and enable scalable accurate semantic mapping.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 2 Pith papers · 7 internal anchors

  1. [1]

    Rayfronts: Open-set semantic ray frontiers for online scene understanding and ex- ploration.arXiv preprint arXiv:2504.06994, 2025

    Omar Alama, Avigyan Bhattacharya, Haoyang He, Se- ungchan Kim, Yuheng Qiu, Wenshan Wang, Cherie Ho, Nikhil Keetha, and Sebastian Scherer. Rayfronts: Open-set semantic ray frontiers for online scene understanding and ex- ploration.arXiv preprint arXiv:2504.06994, 2025. 2, 3, 4, 6, 7, 8, 5

  2. [2]

    Single-stage seman- tic segmentation from image labels

    Nikita Araslanov and Stefan Roth. Single-stage seman- tic segmentation from image labels. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4253–4262, 2020. 5

  3. [3]

    IEEE Transactions on Image Processing34, 8271–8284 (2025) arXiv:2411.15869 [cs.CV]

    Sule Bai, Yong Liu, Yifei Han, Haoji Zhang, and Yansong Tang. Self-calibrated clip for training-free open-vocabulary segmentation.arXiv preprint arXiv:2411.15869, 2024. 3, 4, 6, 8, 2, 5

  4. [4]

    Talking to dino: Bridging self- supervised vision backbones with language for open- vocabulary segmentation.arXiv preprint arXiv:2411.19331,

    Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, and Rita Cucchiara. Talking to dino: Bridging self- supervised vision backbones with language for open- vocabulary segmentation.arXiv preprint arXiv:2411.19331,

  5. [5]

    Coco- stuff: Thing and stuff classes in context

    Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco- stuff: Thing and stuff classes in context. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 1209–1218, 2018. 2, 5

  6. [6]

    Emerg- ing properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 3

  7. [7]

    The cityscapes dataset for semantic urban scene understanding

    Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016. 2, 5

  8. [8]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, pages 5828–5839, 2017. 7, 8

  9. [9]

    Vision Transformers Need Registers

    Timoth ´ee Darcet, Maxime Oquab, Julien Mairal, and Pi- otr Bojanowski. Vision transformers need registers.arXiv preprint arXiv:2309.16588, 2023. 4

  10. [10]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 3

  11. [11]

    Williams, John Winn, and Andrew Zisserman

    Mark Everingham, Luc Gool, Christopher K. Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.Int. J. Comput. Vision, 88(2): 303–338, 2010. 2, 5

  12. [12]

    Conceptgraphs: Open-vocabulary 3d scene graphs for per- ception and planning

    Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, et al. Conceptgraphs: Open-vocabulary 3d scene graphs for per- ception and planning. In2024 IEEE International Confer- ence on Robotics and Automation (ICRA), pages 5021–5028. IEEE, 2024. 3, 2

  13. [13]

    Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation

    Sina Hajimiri, Ismail Ben Ayed, and Jose Dolz. Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation. In2025 IEEE/CVF Winter Conference on Ap- plications of Computer Vision (WACV), pages 5061–5071. IEEE, 2025. 2, 3, 4, 5, 6, 8, 1

  14. [14]

    Radiov2.5: Improved baselines for agglomerative vision foundation models

    Greg Heinrich, Mike Ranzinger, Hongxu Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. Radiov2.5: Improved baselines for agglomerative vision foundation models. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 22487–22497, 2025. 3

  15. [15]

    Tenenbaum, Celso Miguel de Melo, Madhava Krishna, Liam Paull, Florian Shkurti, and Antonio Torralba

    Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala, Qiao Gu, Mohd Omama, Tao Chen, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Keetha, Ayush Tewari, Joshua B. Tenenbaum, Celso Miguel de Melo, Madhava Krishna, Liam Paull, Florian Shkurti, and Antonio Torralba. Conceptfusion: Open-set multimodal 3d mapping.Robotics: Science and Systems (RSS), 2023. 2, 3

  16. [16]

    Kim Jun-Seong, GeonU Kim, Kim Yu-Ji, Yu-Chiang Frank Wang, Jaesung Choe, and Tae-Hyun Oh. Dr. splat: Directly referring 3d gaussian splatting via direct language embed- ding registration. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14137–14146, 2025. 3

  17. [17]

    Garfield: Group anything with radiance fields

    Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Gold- berg, Matthew Tancik, and Angjoo Kanazawa. Garfield: Group anything with radiance fields. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21530–21539, 2024. 3

  18. [18]

    RA VEN: Resilient Aerial Navigation via Open-Set Semantic Memory and Behavior Adaptation

    Seungchan Kim, Omar Alama, Dmytro Kurdydyk, John Keller, Nikhil Keetha, Wenshan Wang, Yonatan Bisk, and Sebastian Scherer. Raven: Resilient aerial navigation via open-set semantic memory and behavior adaptation.arXiv preprint arXiv:2509.23563, 2025. 1

  19. [19]

    Segment any- thing

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 2, 3

  20. [20]

    Proxyclip: Proxy attention improves clip for open-vocabulary segmentation

    Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Proxyclip: Proxy attention improves clip for open-vocabulary segmentation. InEuropean Conference on Computer Vision, pages 70–88. Springer, 2024. 2, 3, 4, 5, 6, 8

  21. [21]

    Language-driven Semantic Segmentation

    Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Ren ´e Ranftl. Language-driven semantic seg- mentation.arXiv preprint arXiv:2201.03546, 2022. 2

  22. [22]

    Samrefiner: Taming segment anything model for universal mask refine- ment.arXiv preprint arXiv:2502.06756, 2025

    Yuqi Lin, Hengjia Li, Wenqi Shao, Zheng Yang, Jun Zhao, Xiaofei He, Ping Luo, and Kaipeng Zhang. Samrefiner: Taming segment anything model for universal mask refine- ment.arXiv preprint arXiv:2502.06756, 2025. 5

  23. [23]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023. 3 9

  24. [24]

    Brain mapping with dense features: Grounding cortical semantic selectivity in natural images with vision transformers.arXiv preprint arXiv:2410.05266, 2024

    Andrew F Luo, Jacob Yeung, Rushikesh Zawar, Shaurya De- wan, Margaret M Henderson, Leila Wehbe, and Michael J Tarr. Brain mapping with dense features: Grounding cortical semantic selectivity in natural images with vision transform- ers.arXiv preprint arXiv:2410.05266, 2024. 1

  25. [25]

    The role of context for object detection and semantic segmentation in the wild

    Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. 2, 5

  26. [26]

    Maxime Oquab, Timoth ´ee Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Rus- sell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang- Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nico- las Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patri...

  27. [27]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 2, 3

  28. [28]

    Am-radio: Agglomerative vision foundation model reduce all domains into one

    Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. Am-radio: Agglomerative vision foundation model reduce all domains into one. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12490–12500, 2024. 2, 3

  29. [29]

    Denseclip: Language-guided dense prediction with context- aware prompting

    Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context- aware prompting. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 18082–18091, 2022. 2

  30. [30]

    Language embedded radiance fields for zero-shot task-oriented grasping

    Adam Rashid, Satvik Sharma, Chung Min Kim, Justin Kerr, Lawrence Yunliang Chen, Angjoo Kanazawa, and Ken Gold- berg. Language embedded radiance fields for zero-shot task-oriented grasping. In7th Annual Conference on Robot Learning, 2023. 1

  31. [31]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024. 3

  32. [32]

    Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural in- formation processing systems, 35:25278–25294, 2022

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural in- formation processing systems, 35:25278–25294, 2022. 2

  33. [33]

    arXiv preprint arXiv:2210.05663 (2022)

    Nur Muhammad Mahi Shafiullah, Chris Paxton, Lerrel Pinto, Soumith Chintala, and Arthur Szlam. Clip-fields: Weakly supervised semantic fields for robotic memory. arXiv preprint arXiv:2210.05663, 2022. 3

  34. [34]

    Theia: Distilling diverse vision foundation models for robot learning.arXiv preprint arXiv:2407.20179, 2024

    Jinghuan Shang, Karl Schmeckpeper, Brandon B May, Maria Vittoria Minniti, Tarik Kelestemur, David Watkins, and Laura Herlant. Theia: Distilling diverse vision foundation models for robot learning.arXiv preprint arXiv:2407.20179, 2024. 4

  35. [35]

    Language embedded 3d gaussians for open- vocabulary scene understanding

    Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao- Hua Guan. Language embedded 3d gaussians for open- vocabulary scene understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5333–5343, 2024. 3

  36. [36]

    Har- nessing vision foundation models for high-performance, training-free open vocabulary segmentation.arXiv preprint arXiv:2411.09219, 2024

    Yuheng Shi, Minjing Dong, and Chang Xu. Har- nessing vision foundation models for high-performance, training-free open vocabulary segmentation.arXiv preprint arXiv:2411.09219, 2024. 2, 3, 4, 5, 6, 8, 1

  37. [37]

    The Replica Dataset: A Digital Replica of Indoor Spaces

    Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The replica dataset: A digital replica of indoor spaces.arXiv:1906.05797, 2019. 7, 8

  38. [38]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language en- coders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025. 2

  39. [39]

    Sclip: Rethink- ing self-attention for dense vision-language inference

    Feng Wang, Jieru Mei, and Alan Yuille. Sclip: Rethink- ing self-attention for dense vision-language inference. In European Conference on Computer Vision, pages 315–332. Springer, 2024. 2, 3, 4, 5, 1

  40. [40]

    Hierarchical open- vocabulary 3d scene graphs for language-grounded robot navigation

    Abdelrhman Werby, Chenguang Huang, Martin B ¨uchner, Abhinav Valada, and Wolfram Burgard. Hierarchical open- vocabulary 3d scene graphs for language-grounded robot navigation. InFirst Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024. 3, 2

  41. [41]

    Clip-diy: Clip dense infer- ence yields open-vocabulary semantic segmentation for-free

    Monika Wysocza ´nska, Micha ¨el Ramamonjisoa, Tomasz Trzci´nski, and Oriane Sim ´eoni. Clip-diy: Clip dense infer- ence yields open-vocabulary semantic segmentation for-free. InProceedings of the IEEE/CVF Winter Conference on Ap- plications of Computer Vision, pages 1403–1413, 2024. 2

  42. [42]

    Textregion: Text-aligned region tokens from frozen image-text models.arXiv preprint arXiv:2505.23769,

    Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, and Derek Hoiem. Textregion: Text-aligned region tokens from frozen image-text models.arXiv preprint arXiv:2505.23769,

  43. [43]

    Side adapter network for open-vocabulary semantic segmentation

    Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xi- ang Bai. Side adapter network for open-vocabulary semantic segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2945– 2954, 2023. 2

  44. [44]

    Resclip: Residual attention for training-free dense vision- language inference

    Yuhang Yang, Jinhong Deng, Wen Li, and Lixin Duan. Resclip: Residual attention for training-free dense vision- language inference. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29968–29978,

  45. [45]

    Scannet++: A high-fidelity dataset of 3d in- door scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d in- door scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023. 7, 8 10

  46. [46]

    Vlfm: Vision-language frontier maps for zero-shot semantic navigation

    Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In2024 IEEE In- ternational Conference on Robotics and Automation (ICRA), pages 42–48. IEEE, 2024. 1

  47. [47]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 2, 3

  48. [48]

    Scene parsing through ade20k dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641,

  49. [49]

    RGB” and “GT

    Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. InEuropean conference on computer vision, pages 696–712. Springer, 2022. 2 11 RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models Supplementary Material Limitations WhileRADSeg-base delivers strong mIoU gains with ...