RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models

Avigyan Bhattacharya; Darshil Jariwala; Omar Alama; Sebastian Scherer; Seungchan Kim; Wenshan Wang

arxiv: 2511.19704 · v2 · submitted 2025-11-24 · 💻 cs.CV

RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models

Omar Alama , Darshil Jariwala , Avigyan Bhattacharya , Seungchan Kim , Wenshan Wang , Sebastian Scherer This is my paper

Pith reviewed 2026-05-17 05:19 UTC · model grok-4.3

classification 💻 cs.CV

keywords open-vocabulary segmentationzero-shot semantic segmentationvision foundation modelsparameter efficiencyRADIO modelagglomerative modelsmask refinement

0 comments

The pith

RADSeg uses enhancements to the RADIO model to achieve better zero-shot open-vocabulary segmentation accuracy with far fewer parameters and lower latency than prior large-model combinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that an overlooked agglomerative vision foundation model, RADIO, can be adapted with targeted enhancements to simultaneously advance accuracy, speed, and efficiency in zero-shot open-vocabulary semantic segmentation. Prior work either depends on scarce labeled segmentation data or stacks multiple large models at high compute cost. The authors introduce self-correlating recursive attention, self-correlating global aggregation, and efficient RADIO-SAM mask refinement to the base RADIO model. A sympathetic reader would care because this combination yields practical gains for robotics and general vision systems where memory and inference time are constrained. The base RADSeg variant demonstrates these improvements without requiring additional labeled data or heavy tuning.

Core claim

RADSeg applies self-correlating recursive attention, self-correlating global aggregation, and computationally efficient RADIO-SAM mask refinement to the RADIO agglomerative model, delivering 6-30% mIoU gains in the base ViT class, 3.95x faster inference, and 2.5x fewer parameters; the 106M-parameter RADSeg-base version exceeds the mIoU of prior combinations of 850-1350M-parameter models while using substantially less compute and memory.

What carries the argument

The RADIO agglomerative vision foundation model, augmented by self-correlating recursive attention for local feature refinement, self-correlating global aggregation for broader context, and RADIO-SAM mask refinement for precise boundaries, enabling zero-shot open-vocabulary segmentation.

If this is right

Zero-shot open-vocabulary segmentation becomes deployable on edge devices with limited memory and compute budgets.
Robotics perception pipelines gain more generalizable semantic labels without task-specific retraining.
Model combination strategies for vision tasks can be replaced by single-model refinements when the base model is chosen appropriately.
Accuracy improvements in open-vocabulary settings can be achieved without scaling model size or adding training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same RADIO-based refinements might transfer to related zero-shot tasks such as open-vocabulary detection or instance segmentation.
This work implies that careful adaptation of a single efficient foundation model can outperform naive ensembles of larger models in efficiency-sensitive domains.
Further tests on long-tail or out-of-distribution scenes would clarify how far the zero-shot generalization extends beyond the evaluated benchmarks.

Load-bearing premise

The self-correlating recursive attention, global aggregation, and RADIO-SAM refinement steps produce reliable accuracy and efficiency gains across datasets and conditions without extra labeled segmentation data or extensive tuning.

What would settle it

Running RADSeg-base on a new held-out dataset and finding that its mIoU falls below that of prior 850-1350M model combinations while still claiming lower latency would disprove the performance claims.

Figures

Figures reproduced from arXiv: 2511.19704 by Avigyan Bhattacharya, Darshil Jariwala, Omar Alama, Sebastian Scherer, Seungchan Kim, Wenshan Wang.

**Figure 1.** Figure 1: RADSeg is a dense, language-aligned feature encoder that enables low-parameter, low-latency open-vocabulary semantic segmentation in 2D and 3D. The efficiency plots report average latency, parameter count, and mIoU across five 2D datasets on a V100. By enhancing spatial locality of RADIO features, RADSeg outperforms previous state-of-the-art methods in accuracy while remaining highly efficient in terms of… view at source ↗

**Figure 2.** Figure 2: Overview of the RADSeg pipeline. RGB sliding windows are processed by the RADIO backbone. Self-Correlating Recursive Attention (SCRA) computes a similarity matrix from these outputs, which is recursively fed back into the last attention block of RADIO. Feature windows are aggregated into a feature map and refined through Self-Correlating Global Aggregation (SCGA) to reduce noise and windowing artifacts. Fe… view at source ↗

**Figure 3.** Figure 3: Qualitative comparison of last block attention and patch [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative 2D Open-Vocabulary Semantic Segmentation Results. For each of the five benchmark datasets, we show a representative example and compare RADSeg and RADSeg+ with competitive baselines (SC-CLIP, Talk2DINO, Trident, and TextRegion). Both RADSeg and RADSeg+ produce noticeably clearer and more accurate segmentation maps across all cases [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative 3D Open-Vocabulary Semantic Segmentation Results. We show two scenes: one from Replica (“chair”, “table”, “couch” classes), and one from ScanNet++ (“bed”, “pillow”, “monitor” classes). Segmented voxels are overlaid on the RGB for visualization. Across all 3D baselines, RADSeg provides more accurate segmentations with far fewer outlier voxels. tion budgets, outperforming the closest baselines … view at source ↗

read the original abstract

Open-vocabulary semantic segmentation (OVSS) underpins many vision and robotics tasks that require generalizable semantic understanding. Existing approaches either rely on limited segmentation training data, which hinders generalization, or apply zero-shot heuristics to vision-language models (e.g CLIP), while the most competitive approaches combine multiple models to improve performance at the cost of high computational and memory demands. In this work, we leverage an overlooked agglomerative vision foundation model, RADIO, to improve zero-shot OVSS along three key axes simultaneously: mIoU, latency, and parameter efficiency. We present the first comprehensive study of RADIO for zero-shot OVSS and enhance its performance through self-correlating recursive attention, self-correlating global aggregation, and computationally efficient RADIO SAM mask refinement. Our approach, RADSeg, achieves 6-30% mIoU improvement in the base ViT class while being 3.95x faster and using 2.5x fewer parameters. Surprisingly, RADSeg-base (106M) outperforms previous combinations of huge vision models (850-1350M) in mIoU, achieving state-of-the-art accuracy with substantially lower computational and memory cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RADSeg adapts RADIO with recursive attention and aggregation tweaks to claim better zero-shot OVSS accuracy at lower compute than stacked larger models.

read the letter

RADSeg takes the RADIO foundation model and layers on self-correlating recursive attention, self-correlating global aggregation, and a lightweight RADIO-SAM mask step for zero-shot open-vocabulary segmentation. The main result is that their 106M base model beats prior combinations of 850-1350M models on mIoU while running 3.95x faster and using 2.5x fewer parameters. That efficiency story is the part worth paying attention to for anyone who actually deploys these systems on robots or edge hardware.

Referee Report

2 major / 3 minor

Summary. The paper introduces RADSeg for zero-shot open-vocabulary semantic segmentation, leveraging the RADIO agglomerative vision foundation model. It proposes three enhancements—self-correlating recursive attention, self-correlating global aggregation, and computationally efficient RADIO-SAM mask refinement—to improve mIoU, latency, and parameter efficiency simultaneously. The central claims are 6-30% mIoU gains in the base ViT setting, 3.95x faster inference, 2.5x fewer parameters, and that RADSeg-base (106M parameters) outperforms prior combinations of much larger models (850-1350M parameters) while achieving state-of-the-art accuracy.

Significance. If the reported gains are reproducible and generalize, the work would be significant for efficient OVSS. Demonstrating that a single, modestly sized model can surpass ensembles of much larger vision-language models on accuracy while reducing compute and memory has clear practical value for robotics and edge deployment. The emphasis on an overlooked agglomerative model (RADIO) and the joint optimization of accuracy and efficiency metrics is a timely contribution.

major comments (2)

[§5, Table 2] §5 (Experimental Results), Table 2: the claim that RADSeg-base outperforms combinations of 850-1350M models rests on a single reported mIoU number per dataset without error bars, multiple random seeds, or statistical tests. This is load-bearing for the central outperformance claim; without variance estimates it is impossible to assess whether the 6-30% gains are reliable or dataset-specific.
[§4.3] §4.3 (Ablation Studies): the contribution of self-correlating recursive attention and global aggregation is shown only via cumulative additions; an isolated ablation replacing these modules with standard attention or pooling is missing. This weakens the attribution of the efficiency gains (3.95x speed, 2.5x fewer parameters) specifically to the proposed components.

minor comments (3)

[§3.2] The notation for self-correlating attention is introduced without an explicit equation; adding a compact mathematical definition would improve clarity.
[Figure 4] Figure 4 (qualitative results) would benefit from consistent color mapping across rows and an additional failure-case example to illustrate remaining limitations.
[§2] The manuscript cites prior OVSS works but omits recent efficient segmentation baselines that also target parameter reduction; a short related-work paragraph addressing this would strengthen context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment, the recommendation for minor revision, and the constructive comments that help strengthen the empirical support for our claims. We address each major comment below and have incorporated revisions to improve the manuscript.

read point-by-point responses

Referee: [§5, Table 2] §5 (Experimental Results), Table 2: the claim that RADSeg-base outperforms combinations of 850-1350M models rests on a single reported mIoU number per dataset without error bars, multiple random seeds, or statistical tests. This is load-bearing for the central outperformance claim; without variance estimates it is impossible to assess whether the 6-30% gains are reliable or dataset-specific.

Authors: We agree that variance estimates and statistical tests would make the central outperformance claims more robust. While single-run reporting is common in the OVSS literature for large foundation models given the high computational expense of repeated full evaluations, we acknowledge this as a limitation. In the revised manuscript we now include results from three independent runs with different random seeds, reporting mean mIoU and standard deviation in Table 2 along with a note on statistical significance. These additional results confirm that the 6-30% gains remain consistent across seeds and datasets. revision: yes
Referee: [§4.3] §4.3 (Ablation Studies): the contribution of self-correlating recursive attention and global aggregation is shown only via cumulative additions; an isolated ablation replacing these modules with standard attention or pooling is missing. This weakens the attribution of the efficiency gains (3.95x speed, 2.5x fewer parameters) specifically to the proposed components.

Authors: We concur that isolated ablations would strengthen the attribution of gains to the individual proposed modules. The original cumulative presentation was chosen to illustrate progressive improvement, but we recognize the value of direct replacements. In the revised §4.3 we have added two isolated ablation experiments: (1) replacing self-correlating recursive attention with standard multi-head attention while keeping all other components fixed, and (2) replacing self-correlating global aggregation with standard global average pooling. The new results show that each module contributes measurably to both accuracy and the reported efficiency improvements. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper advances an empirical method for zero-shot open-vocabulary segmentation by enhancing the RADIO vision foundation model with three proposed techniques (self-correlating recursive attention, self-correlating global aggregation, and RADIO-SAM mask refinement). No equations, derivations, or parameter-fitting steps are described that reduce by construction to the inputs or to self-citations; performance claims rest on comparative experiments against external baselines rather than any self-referential definitions or renamed predictions. The argument is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5532 in / 1116 out tokens · 45434 ms · 2026-05-17T05:19:15.124181+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

RADSeg-base (106M) outperforms previous combinations of huge vision models (850-1350M) in mIoU

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
cs.CV 2026-04 unverdicted novelty 6.0

RADIO-ViPE performs online open-vocabulary semantic SLAM directly from monocular RGB video in dynamic environments by tightly coupling vision-language embeddings from foundation models with geometric factor-graph opti...
FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers
cs.RO 2026-05 unverdicted novelty 5.0

FUS3DMaps fuses voxel- and instance-level open-vocabulary layers inside a shared 3D voxel map to improve both layers and enable scalable accurate semantic mapping.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 2 Pith papers · 7 internal anchors

[1]

Rayfronts: Open-set semantic ray frontiers for online scene understanding and ex- ploration.arXiv preprint arXiv:2504.06994, 2025

Omar Alama, Avigyan Bhattacharya, Haoyang He, Se- ungchan Kim, Yuheng Qiu, Wenshan Wang, Cherie Ho, Nikhil Keetha, and Sebastian Scherer. Rayfronts: Open-set semantic ray frontiers for online scene understanding and ex- ploration.arXiv preprint arXiv:2504.06994, 2025. 2, 3, 4, 6, 7, 8, 5

work page arXiv 2025
[2]

Single-stage seman- tic segmentation from image labels

Nikita Araslanov and Stefan Roth. Single-stage seman- tic segmentation from image labels. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4253–4262, 2020. 5

work page 2020
[3]

IEEE Transactions on Image Processing34, 8271–8284 (2025) arXiv:2411.15869 [cs.CV]

Sule Bai, Yong Liu, Yifei Han, Haoji Zhang, and Yansong Tang. Self-calibrated clip for training-free open-vocabulary segmentation.arXiv preprint arXiv:2411.15869, 2024. 3, 4, 6, 8, 2, 5

work page arXiv 2024
[4]

Talking to dino: Bridging self- supervised vision backbones with language for open- vocabulary segmentation.arXiv preprint arXiv:2411.19331,

Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, and Rita Cucchiara. Talking to dino: Bridging self- supervised vision backbones with language for open- vocabulary segmentation.arXiv preprint arXiv:2411.19331,

work page arXiv
[5]

Coco- stuff: Thing and stuff classes in context

Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco- stuff: Thing and stuff classes in context. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 1209–1218, 2018. 2, 5

work page 2018
[6]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 3

work page 2021
[7]

The cityscapes dataset for semantic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016. 2, 5

work page 2016
[8]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, pages 5828–5839, 2017. 7, 8

work page 2017
[9]

Vision Transformers Need Registers

Timoth ´ee Darcet, Maxime Oquab, Julien Mairal, and Pi- otr Bojanowski. Vision transformers need registers.arXiv preprint arXiv:2309.16588, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 3

work page internal anchor Pith review Pith/arXiv arXiv 2010
[11]

Williams, John Winn, and Andrew Zisserman

Mark Everingham, Luc Gool, Christopher K. Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.Int. J. Comput. Vision, 88(2): 303–338, 2010. 2, 5

work page 2010
[12]

Conceptgraphs: Open-vocabulary 3d scene graphs for per- ception and planning

Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, et al. Conceptgraphs: Open-vocabulary 3d scene graphs for per- ception and planning. In2024 IEEE International Confer- ence on Robotics and Automation (ICRA), pages 5021–5028. IEEE, 2024. 3, 2

work page 2024
[13]

Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation

Sina Hajimiri, Ismail Ben Ayed, and Jose Dolz. Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation. In2025 IEEE/CVF Winter Conference on Ap- plications of Computer Vision (WACV), pages 5061–5071. IEEE, 2025. 2, 3, 4, 5, 6, 8, 1

work page 2025
[14]

Radiov2.5: Improved baselines for agglomerative vision foundation models

Greg Heinrich, Mike Ranzinger, Hongxu Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. Radiov2.5: Improved baselines for agglomerative vision foundation models. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 22487–22497, 2025. 3

work page 2025
[15]

Tenenbaum, Celso Miguel de Melo, Madhava Krishna, Liam Paull, Florian Shkurti, and Antonio Torralba

Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala, Qiao Gu, Mohd Omama, Tao Chen, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Keetha, Ayush Tewari, Joshua B. Tenenbaum, Celso Miguel de Melo, Madhava Krishna, Liam Paull, Florian Shkurti, and Antonio Torralba. Conceptfusion: Open-set multimodal 3d mapping.Robotics: Science and Systems (RSS), 2023. 2, 3

work page 2023
[16]

Kim Jun-Seong, GeonU Kim, Kim Yu-Ji, Yu-Chiang Frank Wang, Jaesung Choe, and Tae-Hyun Oh. Dr. splat: Directly referring 3d gaussian splatting via direct language embed- ding registration. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14137–14146, 2025. 3

work page 2025
[17]

Garfield: Group anything with radiance fields

Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Gold- berg, Matthew Tancik, and Angjoo Kanazawa. Garfield: Group anything with radiance fields. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21530–21539, 2024. 3

work page 2024
[18]

RA VEN: Resilient Aerial Navigation via Open-Set Semantic Memory and Behavior Adaptation

Seungchan Kim, Omar Alama, Dmytro Kurdydyk, John Keller, Nikhil Keetha, Wenshan Wang, Yonatan Bisk, and Sebastian Scherer. Raven: Resilient aerial navigation via open-set semantic memory and behavior adaptation.arXiv preprint arXiv:2509.23563, 2025. 1

work page arXiv 2025
[19]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 2, 3

work page 2023
[20]

Proxyclip: Proxy attention improves clip for open-vocabulary segmentation

Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Proxyclip: Proxy attention improves clip for open-vocabulary segmentation. InEuropean Conference on Computer Vision, pages 70–88. Springer, 2024. 2, 3, 4, 5, 6, 8

work page 2024
[21]

Language-driven Semantic Segmentation

Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Ren ´e Ranftl. Language-driven semantic seg- mentation.arXiv preprint arXiv:2201.03546, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[22]

Samrefiner: Taming segment anything model for universal mask refine- ment.arXiv preprint arXiv:2502.06756, 2025

Yuqi Lin, Hengjia Li, Wenqi Shao, Zheng Yang, Jun Zhao, Xiaofei He, Ping Luo, and Kaipeng Zhang. Samrefiner: Taming segment anything model for universal mask refine- ment.arXiv preprint arXiv:2502.06756, 2025. 5

work page arXiv 2025
[23]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023. 3 9

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Brain mapping with dense features: Grounding cortical semantic selectivity in natural images with vision transformers.arXiv preprint arXiv:2410.05266, 2024

Andrew F Luo, Jacob Yeung, Rushikesh Zawar, Shaurya De- wan, Margaret M Henderson, Leila Wehbe, and Michael J Tarr. Brain mapping with dense features: Grounding cortical semantic selectivity in natural images with vision transform- ers.arXiv preprint arXiv:2410.05266, 2024. 1

work page arXiv 2024
[25]

The role of context for object detection and semantic segmentation in the wild

Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. 2, 5

work page 2014
[26]

Maxime Oquab, Timoth ´ee Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Rus- sell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang- Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nico- las Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patri...

work page 2023
[27]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 2, 3

work page 2021
[28]

Am-radio: Agglomerative vision foundation model reduce all domains into one

Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. Am-radio: Agglomerative vision foundation model reduce all domains into one. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12490–12500, 2024. 2, 3

work page 2024
[29]

Denseclip: Language-guided dense prediction with context- aware prompting

Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context- aware prompting. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 18082–18091, 2022. 2

work page 2022
[30]

Language embedded radiance fields for zero-shot task-oriented grasping

Adam Rashid, Satvik Sharma, Chung Min Kim, Justin Kerr, Lawrence Yunliang Chen, Angjoo Kanazawa, and Ken Gold- berg. Language embedded radiance fields for zero-shot task-oriented grasping. In7th Annual Conference on Robot Learning, 2023. 1

work page 2023
[31]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural in- formation processing systems, 35:25278–25294, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural in- formation processing systems, 35:25278–25294, 2022. 2

work page 2022
[33]

arXiv preprint arXiv:2210.05663 (2022)

Nur Muhammad Mahi Shafiullah, Chris Paxton, Lerrel Pinto, Soumith Chintala, and Arthur Szlam. Clip-fields: Weakly supervised semantic fields for robotic memory. arXiv preprint arXiv:2210.05663, 2022. 3

work page arXiv 2022
[34]

Theia: Distilling diverse vision foundation models for robot learning.arXiv preprint arXiv:2407.20179, 2024

Jinghuan Shang, Karl Schmeckpeper, Brandon B May, Maria Vittoria Minniti, Tarik Kelestemur, David Watkins, and Laura Herlant. Theia: Distilling diverse vision foundation models for robot learning.arXiv preprint arXiv:2407.20179, 2024. 4

work page arXiv 2024
[35]

Language embedded 3d gaussians for open- vocabulary scene understanding

Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao- Hua Guan. Language embedded 3d gaussians for open- vocabulary scene understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5333–5343, 2024. 3

work page 2024
[36]

Har- nessing vision foundation models for high-performance, training-free open vocabulary segmentation.arXiv preprint arXiv:2411.09219, 2024

Yuheng Shi, Minjing Dong, and Chang Xu. Har- nessing vision foundation models for high-performance, training-free open vocabulary segmentation.arXiv preprint arXiv:2411.09219, 2024. 2, 3, 4, 5, 6, 8, 1

work page arXiv 2024
[37]

The Replica Dataset: A Digital Replica of Indoor Spaces

Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The replica dataset: A digital replica of indoor spaces.arXiv:1906.05797, 2019. 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 1906
[38]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language en- coders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Sclip: Rethink- ing self-attention for dense vision-language inference

Feng Wang, Jieru Mei, and Alan Yuille. Sclip: Rethink- ing self-attention for dense vision-language inference. In European Conference on Computer Vision, pages 315–332. Springer, 2024. 2, 3, 4, 5, 1

work page 2024
[40]

Hierarchical open- vocabulary 3d scene graphs for language-grounded robot navigation

Abdelrhman Werby, Chenguang Huang, Martin B ¨uchner, Abhinav Valada, and Wolfram Burgard. Hierarchical open- vocabulary 3d scene graphs for language-grounded robot navigation. InFirst Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024. 3, 2

work page 2024
[41]

Clip-diy: Clip dense infer- ence yields open-vocabulary semantic segmentation for-free

Monika Wysocza ´nska, Micha ¨el Ramamonjisoa, Tomasz Trzci´nski, and Oriane Sim ´eoni. Clip-diy: Clip dense infer- ence yields open-vocabulary semantic segmentation for-free. InProceedings of the IEEE/CVF Winter Conference on Ap- plications of Computer Vision, pages 1403–1413, 2024. 2

work page 2024
[42]

Textregion: Text-aligned region tokens from frozen image-text models.arXiv preprint arXiv:2505.23769,

Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, and Derek Hoiem. Textregion: Text-aligned region tokens from frozen image-text models.arXiv preprint arXiv:2505.23769,

work page arXiv
[43]

Side adapter network for open-vocabulary semantic segmentation

Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xi- ang Bai. Side adapter network for open-vocabulary semantic segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2945– 2954, 2023. 2

work page 2023
[44]

Resclip: Residual attention for training-free dense vision- language inference

Yuhang Yang, Jinhong Deng, Wen Li, and Lixin Duan. Resclip: Residual attention for training-free dense vision- language inference. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29968–29978,

work page
[45]

Scannet++: A high-fidelity dataset of 3d in- door scenes

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d in- door scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023. 7, 8 10

work page 2023
[46]

Vlfm: Vision-language frontier maps for zero-shot semantic navigation

Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In2024 IEEE In- ternational Conference on Robotics and Automation (ICRA), pages 42–48. IEEE, 2024. 1

work page 2024
[47]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 2, 3

work page 2023
[48]

Scene parsing through ade20k dataset

Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641,

work page
[49]

RGB” and “GT

Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. InEuropean conference on computer vision, pages 696–712. Springer, 2022. 2 11 RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models Supplementary Material Limitations WhileRADSeg-base delivers strong mIoU gains with ...

work page 2022

[1] [1]

Rayfronts: Open-set semantic ray frontiers for online scene understanding and ex- ploration.arXiv preprint arXiv:2504.06994, 2025

Omar Alama, Avigyan Bhattacharya, Haoyang He, Se- ungchan Kim, Yuheng Qiu, Wenshan Wang, Cherie Ho, Nikhil Keetha, and Sebastian Scherer. Rayfronts: Open-set semantic ray frontiers for online scene understanding and ex- ploration.arXiv preprint arXiv:2504.06994, 2025. 2, 3, 4, 6, 7, 8, 5

work page arXiv 2025

[2] [2]

Single-stage seman- tic segmentation from image labels

Nikita Araslanov and Stefan Roth. Single-stage seman- tic segmentation from image labels. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4253–4262, 2020. 5

work page 2020

[3] [3]

IEEE Transactions on Image Processing34, 8271–8284 (2025) arXiv:2411.15869 [cs.CV]

Sule Bai, Yong Liu, Yifei Han, Haoji Zhang, and Yansong Tang. Self-calibrated clip for training-free open-vocabulary segmentation.arXiv preprint arXiv:2411.15869, 2024. 3, 4, 6, 8, 2, 5

work page arXiv 2024

[4] [4]

Talking to dino: Bridging self- supervised vision backbones with language for open- vocabulary segmentation.arXiv preprint arXiv:2411.19331,

Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, and Rita Cucchiara. Talking to dino: Bridging self- supervised vision backbones with language for open- vocabulary segmentation.arXiv preprint arXiv:2411.19331,

work page arXiv

[5] [5]

Coco- stuff: Thing and stuff classes in context

Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco- stuff: Thing and stuff classes in context. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 1209–1218, 2018. 2, 5

work page 2018

[6] [6]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 3

work page 2021

[7] [7]

The cityscapes dataset for semantic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016. 2, 5

work page 2016

[8] [8]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, pages 5828–5839, 2017. 7, 8

work page 2017

[9] [9]

Vision Transformers Need Registers

Timoth ´ee Darcet, Maxime Oquab, Julien Mairal, and Pi- otr Bojanowski. Vision transformers need registers.arXiv preprint arXiv:2309.16588, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[10] [10]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 3

work page internal anchor Pith review Pith/arXiv arXiv 2010

[11] [11]

Williams, John Winn, and Andrew Zisserman

Mark Everingham, Luc Gool, Christopher K. Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.Int. J. Comput. Vision, 88(2): 303–338, 2010. 2, 5

work page 2010

[12] [12]

Conceptgraphs: Open-vocabulary 3d scene graphs for per- ception and planning

Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, et al. Conceptgraphs: Open-vocabulary 3d scene graphs for per- ception and planning. In2024 IEEE International Confer- ence on Robotics and Automation (ICRA), pages 5021–5028. IEEE, 2024. 3, 2

work page 2024

[13] [13]

Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation

Sina Hajimiri, Ismail Ben Ayed, and Jose Dolz. Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation. In2025 IEEE/CVF Winter Conference on Ap- plications of Computer Vision (WACV), pages 5061–5071. IEEE, 2025. 2, 3, 4, 5, 6, 8, 1

work page 2025

[14] [14]

Radiov2.5: Improved baselines for agglomerative vision foundation models

Greg Heinrich, Mike Ranzinger, Hongxu Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. Radiov2.5: Improved baselines for agglomerative vision foundation models. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 22487–22497, 2025. 3

work page 2025

[15] [15]

Tenenbaum, Celso Miguel de Melo, Madhava Krishna, Liam Paull, Florian Shkurti, and Antonio Torralba

Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala, Qiao Gu, Mohd Omama, Tao Chen, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Keetha, Ayush Tewari, Joshua B. Tenenbaum, Celso Miguel de Melo, Madhava Krishna, Liam Paull, Florian Shkurti, and Antonio Torralba. Conceptfusion: Open-set multimodal 3d mapping.Robotics: Science and Systems (RSS), 2023. 2, 3

work page 2023

[16] [16]

Kim Jun-Seong, GeonU Kim, Kim Yu-Ji, Yu-Chiang Frank Wang, Jaesung Choe, and Tae-Hyun Oh. Dr. splat: Directly referring 3d gaussian splatting via direct language embed- ding registration. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14137–14146, 2025. 3

work page 2025

[17] [17]

Garfield: Group anything with radiance fields

Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Gold- berg, Matthew Tancik, and Angjoo Kanazawa. Garfield: Group anything with radiance fields. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21530–21539, 2024. 3

work page 2024

[18] [18]

RA VEN: Resilient Aerial Navigation via Open-Set Semantic Memory and Behavior Adaptation

Seungchan Kim, Omar Alama, Dmytro Kurdydyk, John Keller, Nikhil Keetha, Wenshan Wang, Yonatan Bisk, and Sebastian Scherer. Raven: Resilient aerial navigation via open-set semantic memory and behavior adaptation.arXiv preprint arXiv:2509.23563, 2025. 1

work page arXiv 2025

[19] [19]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 2, 3

work page 2023

[20] [20]

Proxyclip: Proxy attention improves clip for open-vocabulary segmentation

Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Proxyclip: Proxy attention improves clip for open-vocabulary segmentation. InEuropean Conference on Computer Vision, pages 70–88. Springer, 2024. 2, 3, 4, 5, 6, 8

work page 2024

[21] [21]

Language-driven Semantic Segmentation

Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Ren ´e Ranftl. Language-driven semantic seg- mentation.arXiv preprint arXiv:2201.03546, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022

[22] [22]

Samrefiner: Taming segment anything model for universal mask refine- ment.arXiv preprint arXiv:2502.06756, 2025

Yuqi Lin, Hengjia Li, Wenqi Shao, Zheng Yang, Jun Zhao, Xiaofei He, Ping Luo, and Kaipeng Zhang. Samrefiner: Taming segment anything model for universal mask refine- ment.arXiv preprint arXiv:2502.06756, 2025. 5

work page arXiv 2025

[23] [23]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023. 3 9

work page internal anchor Pith review Pith/arXiv arXiv 2023

[24] [24]

Brain mapping with dense features: Grounding cortical semantic selectivity in natural images with vision transformers.arXiv preprint arXiv:2410.05266, 2024

Andrew F Luo, Jacob Yeung, Rushikesh Zawar, Shaurya De- wan, Margaret M Henderson, Leila Wehbe, and Michael J Tarr. Brain mapping with dense features: Grounding cortical semantic selectivity in natural images with vision transform- ers.arXiv preprint arXiv:2410.05266, 2024. 1

work page arXiv 2024

[25] [25]

The role of context for object detection and semantic segmentation in the wild

Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. 2, 5

work page 2014

[26] [26]

Maxime Oquab, Timoth ´ee Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Rus- sell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang- Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nico- las Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patri...

work page 2023

[27] [27]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 2, 3

work page 2021

[28] [28]

Am-radio: Agglomerative vision foundation model reduce all domains into one

Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. Am-radio: Agglomerative vision foundation model reduce all domains into one. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12490–12500, 2024. 2, 3

work page 2024

[29] [29]

Denseclip: Language-guided dense prediction with context- aware prompting

Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context- aware prompting. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 18082–18091, 2022. 2

work page 2022

[30] [30]

Language embedded radiance fields for zero-shot task-oriented grasping

Adam Rashid, Satvik Sharma, Chung Min Kim, Justin Kerr, Lawrence Yunliang Chen, Angjoo Kanazawa, and Ken Gold- berg. Language embedded radiance fields for zero-shot task-oriented grasping. In7th Annual Conference on Robot Learning, 2023. 1

work page 2023

[31] [31]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [32]

Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural in- formation processing systems, 35:25278–25294, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural in- formation processing systems, 35:25278–25294, 2022. 2

work page 2022

[33] [33]

arXiv preprint arXiv:2210.05663 (2022)

Nur Muhammad Mahi Shafiullah, Chris Paxton, Lerrel Pinto, Soumith Chintala, and Arthur Szlam. Clip-fields: Weakly supervised semantic fields for robotic memory. arXiv preprint arXiv:2210.05663, 2022. 3

work page arXiv 2022

[34] [34]

Theia: Distilling diverse vision foundation models for robot learning.arXiv preprint arXiv:2407.20179, 2024

Jinghuan Shang, Karl Schmeckpeper, Brandon B May, Maria Vittoria Minniti, Tarik Kelestemur, David Watkins, and Laura Herlant. Theia: Distilling diverse vision foundation models for robot learning.arXiv preprint arXiv:2407.20179, 2024. 4

work page arXiv 2024

[35] [35]

Language embedded 3d gaussians for open- vocabulary scene understanding

Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao- Hua Guan. Language embedded 3d gaussians for open- vocabulary scene understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5333–5343, 2024. 3

work page 2024

[36] [36]

Har- nessing vision foundation models for high-performance, training-free open vocabulary segmentation.arXiv preprint arXiv:2411.09219, 2024

Yuheng Shi, Minjing Dong, and Chang Xu. Har- nessing vision foundation models for high-performance, training-free open vocabulary segmentation.arXiv preprint arXiv:2411.09219, 2024. 2, 3, 4, 5, 6, 8, 1

work page arXiv 2024

[37] [37]

The Replica Dataset: A Digital Replica of Indoor Spaces

Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The replica dataset: A digital replica of indoor spaces.arXiv:1906.05797, 2019. 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 1906

[38] [38]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language en- coders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

Sclip: Rethink- ing self-attention for dense vision-language inference

Feng Wang, Jieru Mei, and Alan Yuille. Sclip: Rethink- ing self-attention for dense vision-language inference. In European Conference on Computer Vision, pages 315–332. Springer, 2024. 2, 3, 4, 5, 1

work page 2024

[40] [40]

Hierarchical open- vocabulary 3d scene graphs for language-grounded robot navigation

Abdelrhman Werby, Chenguang Huang, Martin B ¨uchner, Abhinav Valada, and Wolfram Burgard. Hierarchical open- vocabulary 3d scene graphs for language-grounded robot navigation. InFirst Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024. 3, 2

work page 2024

[41] [41]

Clip-diy: Clip dense infer- ence yields open-vocabulary semantic segmentation for-free

Monika Wysocza ´nska, Micha ¨el Ramamonjisoa, Tomasz Trzci´nski, and Oriane Sim ´eoni. Clip-diy: Clip dense infer- ence yields open-vocabulary semantic segmentation for-free. InProceedings of the IEEE/CVF Winter Conference on Ap- plications of Computer Vision, pages 1403–1413, 2024. 2

work page 2024

[42] [42]

Textregion: Text-aligned region tokens from frozen image-text models.arXiv preprint arXiv:2505.23769,

Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, and Derek Hoiem. Textregion: Text-aligned region tokens from frozen image-text models.arXiv preprint arXiv:2505.23769,

work page arXiv

[43] [43]

Side adapter network for open-vocabulary semantic segmentation

Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xi- ang Bai. Side adapter network for open-vocabulary semantic segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2945– 2954, 2023. 2

work page 2023

[44] [44]

Resclip: Residual attention for training-free dense vision- language inference

Yuhang Yang, Jinhong Deng, Wen Li, and Lixin Duan. Resclip: Residual attention for training-free dense vision- language inference. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29968–29978,

work page

[45] [45]

Scannet++: A high-fidelity dataset of 3d in- door scenes

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d in- door scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023. 7, 8 10

work page 2023

[46] [46]

Vlfm: Vision-language frontier maps for zero-shot semantic navigation

Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In2024 IEEE In- ternational Conference on Robotics and Automation (ICRA), pages 42–48. IEEE, 2024. 1

work page 2024

[47] [47]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 2, 3

work page 2023

[48] [48]

Scene parsing through ade20k dataset

Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641,

work page

[49] [49]

RGB” and “GT

Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. InEuropean conference on computer vision, pages 696–712. Springer, 2022. 2 11 RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models Supplementary Material Limitations WhileRADSeg-base delivers strong mIoU gains with ...

work page 2022