Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks

Antonio Manuel Lopez Pena; Danna Xue; Jose Luis Gomez Zurita; Yachan Guo; Yi Xiao

arxiv: 2605.16864 · v1 · pith:OVA5G7E5new · submitted 2026-05-16 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-19 20:26 UTC · model grok-4.3

keywords: visual foundation models, feature fusion, metric-guided selection, segmentation, dense prediction, label-free metrics, structural coherence, edge fidelity

The pith

Label-free metrics identify complementary VFM pairs for fusion that boosts dense prediction performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Visual foundation models display distinct representation biases, such as promptable segmentation models emphasizing fine-grained boundaries while self-supervised models highlight object-level structure. The paper introduces a suite of label-free metrics in feature space to score Structural Coherence and Edge Fidelity, which guide the selection of complementary encoder pairs. These pairs are integrated through a master-auxiliary fusion scheme that avoids complex architectural modifications and requires only single-stage training. The resulting model delivers consistent gains on multiple dense prediction tasks with improved object semantics and boundary localization.

Core claim

The paper claims that explicit assessment scores from label-free metrics on Structural Coherence and Edge Fidelity can reliably select complementary edge-strong and structure-strong VFM encoders, and that integrating their features via a master-auxiliary fusion scheme produces consistent performance gains across dense prediction tasks, including better object-level semantics and more accurately localized boundaries.

What carries the argument

A suite of label-free metrics for Structural Coherence and Edge Fidelity that scores VFM encoder features to select and fuse complementary pairs through a master-auxiliary scheme.
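
This page does not reproduce the paper's metric definitions, so the following is only an illustrative sketch of what label-free scores computed in feature space can look like. Both functions are hypothetical stand-ins, not the authors' formulas: `structural_coherence_proxy` averages the cosine similarity of adjacent patch features, and `edge_fidelity_proxy` correlates feature change between neighbouring patches with the image gradient at the same locations. The input shapes are assumptions as well.

```python
import numpy as np

def _unit(feats):
    # L2-normalize each spatial feature vector; feats has shape (H, W, C).
    norm = np.linalg.norm(feats, axis=-1, keepdims=True)
    return feats / np.maximum(norm, 1e-8)

def structural_coherence_proxy(feats):
    """Hypothetical proxy: mean cosine similarity between adjacent patches."""
    f = _unit(feats)
    horiz = np.sum(f[:, :-1] * f[:, 1:], axis=-1)   # left/right neighbours
    vert = np.sum(f[:-1, :] * f[1:, :], axis=-1)    # up/down neighbours
    return float(np.concatenate([horiz.ravel(), vert.ravel()]).mean())

def edge_fidelity_proxy(feats, gray):
    """Hypothetical proxy: Pearson correlation between feature change and
    image gradient along the horizontal axis.

    gray is an (H, W) grayscale image already resized to the feature grid.
    """
    f = _unit(feats)
    feat_edge = 1.0 - np.sum(f[:, :-1] * f[:, 1:], axis=-1)  # (H, W-1)
    img_edge = np.abs(gray[:, 1:] - gray[:, :-1])            # (H, W-1)
    a = feat_edge.ravel() - feat_edge.mean()
    b = img_edge.ravel() - img_edge.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    # A constant map carries no edge information, so report zero correlation.
    return float(a @ b / denom) if denom > 0 else 0.0
```

Neither score needs segmentation labels: the first uses features only, the second uses features plus the raw image. Whether proxies of this kind track the paper's Structural Coherence and Edge Fidelity scores is not something this page can confirm.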

If this is right

Consistent performance gains occur across multiple dense prediction tasks compared with the baselines.
Fused features exhibit better object-level semantics than either encoder alone.
Boundaries are localized more accurately than in the individual models.
The approach requires no complex architectural changes and trains in a single stage.
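
As a rough illustration of the last point, a master-auxiliary injection can be written without touching either encoder. The sketch below is an assumption about the general shape of such a scheme, not the paper's fusion module: the function names, the nearest-neighbour resize, the linear projection `proj`, and the scalar `gate` are all hypothetical.

```python
import numpy as np

def resize_nearest(feats, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) feature map."""
    h, w, _ = feats.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return feats[rows][:, cols]

def inject_auxiliary(master, aux, proj, gate=0.0):
    """Add projected auxiliary features to the master feature map.

    master: (H, W, Cm) features from the master encoder.
    aux:    (h, w, Ca) features from the auxiliary encoder.
    proj:   (Ca, Cm) projection aligning auxiliary channels to the master.
    gate:   scalar weight on the auxiliary branch (hypothetical).
    """
    H, W, _ = master.shape
    aux_resized = resize_nearest(aux, H, W)
    return master + gate * (aux_resized @ proj)
```

With `gate=0.0` the output equals the master features exactly, so a model built this way can start from the master encoder's behaviour and learn how much auxiliary signal to admit during a single training stage.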

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same metric-guided selection could be tested on other dense prediction problems such as depth estimation to check whether the coherence and fidelity scores transfer.
Dynamic re-selection of auxiliary encoders per input image might further improve results without retraining the fusion module.
The metrics offer an interpretable criterion that could reduce trial-and-error when building ensembles from new VFMs.

Load-bearing premise

The label-free metrics for Structural Coherence and Edge Fidelity in feature space can reliably identify which VFM encoders are complementary and worth fusing, without any task-specific labels or supervision.
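
To make the premise concrete, the selection step can be read as a simple rule over a score table: take the structure-strong encoder as master and the most edge-strong remaining encoder as auxiliary. The sketch below assumes that reading; `select_pair` and the score values are hypothetical placeholders, not numbers from the paper.

```python
def select_pair(scores):
    """Pick (master, auxiliary) from label-free scores.

    scores maps encoder name -> (structural_coherence, edge_fidelity).
    The master is the encoder with the highest structural coherence; the
    auxiliary is the remaining encoder with the highest edge fidelity.
    """
    if len(scores) < 2:
        raise ValueError("need at least two encoders to form a pair")
    master = max(scores, key=lambda name: scores[name][0])
    auxiliary = max(
        (name for name in scores if name != master),
        key=lambda name: scores[name][1],
    )
    return master, auxiliary

# Placeholder scores for illustration only; not values reported in the paper.
toy_scores = {
    "encoder_a": (0.82, 0.31),
    "encoder_b": (0.40, 0.77),
    "encoder_c": (0.65, 0.52),
}
print(select_pair(toy_scores))  # ('encoder_a', 'encoder_b')
```

The rule is only as good as the scores feeding it, which is why the premise above carries so much weight.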

What would settle it

If fusion guided by the Structural Coherence and Edge Fidelity scores produces no measurable improvement over single-VFM baselines or random pairing on standard segmentation benchmarks, the utility of the metric-guided selection would be falsified.

Figures

Figures reproduced from arXiv: 2605.16864 by Antonio Manuel Lopez Pena, Danna Xue, Jose Luis Gomez Zurita, Yachan Guo, Yi Xiao.

**Figure 2.** Channel-averaged feature activation maps (warm colors ...

**Figure 3.** The overview of our framework. (a) Given features extracted by multiple VFM encoders, we design two categories of metrics, ...

**Figure 4.** Qualitative results on Cityscapes. (a) Injecting SAM2 ...

**Figure 5.** Qualitative comparison on Cityscapes. All outputs are unfiltered (no confidence thresholding). Red boxes: failure cases of ...

Original abstract

Although large-scale visual foundation models (VFMs) achieve remarkable performance in semantic understanding, they still underperform in instance-aware dense prediction tasks. They exhibit different biases in representation: for instance, promptable segmentation models (e.g., SAM2) focus on fine-grained region boundaries, while self-supervised models (e.g., DINOv3) emphasize object-level structure. This observation highlights the potential of combining complementary features from different VFMs to enhance downstream dense prediction tasks. However, naive multi-VFM fusion seldom leads to reliable gains, and interpretable principles for leveraging their complementary features are still underexplored. In this work, we propose a metric-guided approach that effectively selects and aggregates complementary features from different VFMs based on explicit assessment scores. Specifically, we design a suite of label-free metrics in feature space across two aspects, Structural Coherence and Edge Fidelity, to assess features of VFM encoders. Guided by these scores, we identify complementary edge-strong and structure-strong encoder pairs, and integrate them via a master-auxiliary fusion scheme. This feature fusion requires no complex architectural changes and is trained only in a single stage. Our model shows consistent performance gains across multiple dense prediction tasks compared with the baselines, with better object-level semantics and more accurately localized boundaries. The code is available at https://github.com/gyc-code/metric-guided-fusion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a metric-guided feature fusion approach for combining visual foundation models (VFMs) such as SAM2 and DINOv3 in dense prediction tasks. It introduces two label-free metrics—Structural Coherence and Edge Fidelity—computed in feature space to assess and select complementary encoder pairs (edge-strong and structure-strong), which are then integrated via a master-auxiliary fusion scheme. The method requires no complex architectural modifications and is trained in a single stage. The central claim is that this yields consistent performance gains over baselines across multiple segmentation tasks, with improved object-level semantics and boundary localization.

Significance. If the experimental support holds, the work provides an interpretable, label-free procedure for exploiting complementary biases across VFMs, addressing the unreliability of naive multi-VFM fusion. The public code release at https://github.com/gyc-code/metric-guided-fusion is a positive contribution to reproducibility. The approach could be useful for practitioners seeking lightweight improvements in instance-aware dense prediction without multi-stage training or heavy architectural redesign.

major comments (2)

[§4 and §3.2] §4 (Experiments) and §3.2 (Metric definitions): The central claim that the Structural Coherence and Edge Fidelity metrics reliably identify complementary VFM pairs rests on an unverified assumption. No controlled ablation is presented that compares downstream fusion performance for metric-selected pairs versus random pairs, low-scoring pairs, or exhaustive enumeration of all possible pairs. Without this correlation check, it remains unclear whether higher metric scores predict useful complementarity for object semantics and boundary accuracy or merely reflect incidental feature statistics.
[Table 2] Table 2 and associated text: While performance gains are reported, the manuscript does not include error bars, statistical significance tests, or per-task breakdowns that isolate the contribution of the metric-guided selection from the fusion architecture itself. This weakens the ability to attribute improvements specifically to the proposed metrics.
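
The correlation check requested in the first major comment could be run roughly as follows: fuse several encoder pairs, record a metric-based complementarity score and the downstream gain for each, and compute a rank correlation between the two. The sketch below is a plain Spearman implementation using only the standard library. How the two metrics would be combined into one per-pair score is an assumption left open here, and any data fed to it would have to come from actual runs.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # one side is constant, so the correlation is undefined
    return cov / (vx * vy) ** 0.5
```

A rho near 1 across the evaluated pairs would support the claim that the scores predict useful complementarity; a rho near 0 would suggest they reflect incidental feature statistics.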

minor comments (2)

[§3.3] The notation for the master-auxiliary fusion weights (e.g., how α and β are determined from the metric scores) could be clarified with an explicit equation or pseudocode in §3.3.
[Figure 4] Figure 4 (qualitative results on Cityscapes) would benefit from side-by-side comparison with the individual VFM baselines to better illustrate the claimed improvements in boundary localization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment below and describe the revisions we plan to make.

Referee: [§4 and §3.2] The central claim that the Structural Coherence and Edge Fidelity metrics reliably identify complementary VFM pairs rests on an unverified assumption. No controlled ablation is presented that compares downstream fusion performance for metric-selected pairs versus random pairs, low-scoring pairs, or exhaustive enumeration of all possible pairs. Without this correlation check, it remains unclear whether higher metric scores predict useful complementarity for object semantics and boundary accuracy or merely reflect incidental feature statistics.

Authors: We agree that a direct ablation comparing metric-selected pairs to random and low-scoring pairs would provide stronger evidence for the predictive value of the metrics. In the revised manuscript we will add these controlled experiments on the evaluated tasks, reporting the resulting segmentation performance to demonstrate that higher metric scores correspond to improved complementarity. We will also note that exhaustive enumeration of all VFM pairs quickly becomes computationally prohibitive and is therefore not performed, while the added ablations on representative pairs will still allow readers to assess the correlation between metric scores and downstream gains.
Revision: yes
Referee: [Table 2] While performance gains are reported, the manuscript does not include error bars, statistical significance tests, or per-task breakdowns that isolate the contribution of the metric-guided selection from the fusion architecture itself. This weakens the ability to attribute improvements specifically to the proposed metrics.

Authors: We acknowledge that the current presentation of results would benefit from greater statistical rigor and clearer isolation of the metric-guided component. In the revision we will augment Table 2 with error bars (standard deviation over multiple random seeds), include paired statistical significance tests for the reported gains, and add per-task breakdowns. We will further introduce a control experiment that applies the same fusion architecture without metric-based pair selection, allowing readers to separate the contribution of the proposed metrics from the fusion scheme itself.
Revision: yes
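
The statistics promised in the second response can be computed with very little machinery. The sketch below assumes per-seed scores for a baseline and for the fused model, paired by seed. It reports mean and sample standard deviation for the error bars and runs an exact paired sign-flip permutation test. The choice of test is an assumption made here; the rebuttal does not name one.

```python
from itertools import product
from statistics import mean, stdev

def summarize(runs):
    """Mean and sample standard deviation over seeds (needs >= 2 runs)."""
    return mean(runs), stdev(runs)

def paired_sign_flip_pvalue(baseline, fused):
    """Exact two-sided paired permutation test on per-seed differences.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so every sign assignment is enumerated and the p-value is
    the fraction whose absolute summed difference reaches the observed one.
    """
    if len(baseline) != len(fused) or not baseline:
        raise ValueError("need two non-empty, equally long score lists")
    diffs = [f - b for b, f in zip(baseline, fused)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return hits / total
```

With n seeds the smallest attainable p-value is 2 / 2^n, so three seeds cannot go below 0.25. That floor is worth keeping in mind when deciding how many seeds to run.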

Circularity Check

0 steps flagged

No significant circularity; empirical selection-and-fusion procedure is self-contained

The paper defines two label-free metrics (Structural Coherence and Edge Fidelity) in feature space, uses them to select encoder pairs, and fuses via a master-auxiliary scheme trained in one stage. Performance gains are reported on downstream dense-prediction benchmarks. No equations, fitted parameters, or self-citations are shown that reduce the claimed gains or the metric-guided selection to quantities defined by construction inside the paper. The derivation chain consists of explicit metric definitions followed by empirical validation rather than a closed identity or self-referential premise.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested premise that the two newly designed label-free metrics correctly rank encoder complementarity. No free parameters are mentioned in the abstract. No new physical entities are introduced.

axioms (1)

Domain assumption: Different VFMs exhibit complementary biases in representation (SAM2 for boundaries, DINOv3 for object structure).
Stated in the opening paragraph of the abstract as the motivation for fusion.

pith-pipeline@v0.9.0 · 5787 in / 1238 out tokens · 34230 ms · 2026-05-19T20:26:34.510690+00:00 · methodology

Supplementary Material

A. Metric Details

A.1. Hyperparameter Sensitivity

Our metrics involve several hyperparameters. To verify that the structure–edge characterization is robust to hyperparameter choices, we vary each within a range while keeping others fixed: SFC grid s...

... with a Mask2Former head, measured on a single A40 GPU.

| Backbone | Params | GFLOPs | Throughput (img/s) |
| --- | --- | --- | --- |
| DINOv2 | 108.1 | 1294 | 1.6 |
| DINOv3 | 107.1 | 1083 | 2.2 |
| SAM | 111.1 | 2302 | 1.4 |
| SAM2 | 89.1 | 1132 | 2.3 |
| Ours-D3S2 | 176.2 | 1857 | 1.4 |

... master encoders, injection at OS=16 yields the best average AP (38.7 for DINOv2, 39.1 for DINOv3), with gains most pronounced on boundary-sensitive classes suc...

... because we attach SAM2's backbone as an auxiliary edge provider. Despite the additional encoder, throughput remains comparable to single-encoder baselines such as DINOv2-B (1.4 vs. 1.6 img/s).

C. Additional Qualitative Results

Fig. 5 extends the main-paper comparison (Fig. 1) by including fine-tuned variants and our fused model. The two VFM families...

[1] [1]

Improving vision transformers by revisiting high-frequency components

Jiawang Bai, Li Yuan, Shu-Tao Xia, Shuicheng Yan, Zhifeng Li, and Wei Liu. Improving vision transformers by revisiting high-frequency components. InECCV, 2022. 1, 3

work page 2022

[2] [2]

Do computer vision foundation models learn the low- level characteristics of the human visual system? InCVPR,

Yancheng Cai, Fei Yin, Dounia Hammou, and Rafal Man- tiuk. Do computer vision foundation models learn the low- level characteristics of the human visual system? InCVPR,

work page

[3] [3]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In ICCV, 2021. 1

work page 2021

[4] [4]

Sam-adapter: Adapting segment any- thing in underperformed scenes

Tianrun Chen, Lanyun Zhu, Chaotao Deng, Runlong Cao, Yan Wang, Shangzhan Zhang, Zejian Li, Lingyun Sun, Ying Zang, and Papa Mao. Sam-adapter: Adapting segment any- thing in underperformed scenes. InICCV, 2023. 2

work page 2023

[5] [5]

Vision transformer adapter for dense predictions

Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. InICLR, 2023. 2, 6

work page 2023

[6] [6]

Masked-attention mask transformer for universal image segmentation

Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. InCVPR,

work page

[7] [7]

The cityscapes dataset for semantic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. InCVPR,

work page

[8] [8]

An image is worth 16x16 words: Trans- formers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. InICLR, 2021. 2

work page 2021

[9] [9]

Prob- ing the 3d awareness of visual foundation models

Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Ab- hishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, and Varun Jampani. Prob- ing the 3d awareness of visual foundation models. InCVPR,

work page

[10] [10]

There is no samantics! exploring sam as a backbone for visual understanding tasks

Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, and Elliot J Crowley. There is no samantics! exploring sam as a backbone for visual understanding tasks. arXiv preprint arXiv:2411.15288, 2024. 1

work page arXiv 2024

[11] [11]

Uda4inst: Unsupervised domain adaptation for instance segmentation

Yachan Guo, Yi Xiao, Danna Xue, Jose L G ´omez, and An- tonio M L ´opez. Uda4inst: Unsupervised domain adaptation for instance segmentation. InIEEE Intelligent Vehicles Sym- posium, 2025. 1

work page 2025

[12] [12]

Prompt-sam: A vision- language and sam based hybrid framework for prompt- augmented zero-shot segmentation.Human-Centric Intel- ligent Systems, pages 1–19, 2025

Uma Gurav and Sanket Jadhav. Prompt-sam: A vision- language and sam based hybrid framework for prompt- augmented zero-shot segmentation.Human-Centric Intel- ligent Systems, pages 1–19, 2025. 2

work page 2025

[13] [13]

Greg Heinrich, Mike Ranzinger, Hongxu Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. Radiov2. 5: Improved baselines for agglomerative vision foundation models. InCVPR, 2025. 1, 6

work page 2025

[14] [14]

Open- clip, 2021

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Han- naneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Open- clip, 2021. 1, 2

work page 2021

[15] [15]

Vi- sual prompt tuning

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Vi- sual prompt tuning. InECCV, 2022. 2

work page 2022

[16] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023.

[17] Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. In CVPR, 2023.

[18] Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianwei Yang, Lei Zhang, and Jianfeng Gao. Segment and recognize anything at any granularity. In ECCV, 2024.

[19] Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-sam: Segment and recognize anything at any granularity. In ECCV, 2024.

[20] Xiangtai Li, Haobo Yuan, Wei Li, Henghui Ding, Size Wu, Wenwei Zhang, Yining Li, Kai Chen, and Chen Change Loy. Omg-seg: Is one model good enough for all segmentation? In CVPR, 2024.

[21] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In CVPR, 2023.

[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.

[23] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.

[24] Muhammad Muzammal Naseer, Kanchana Ranasinghe, Salman H Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Intriguing properties of vision transformers. NeurIPS, 2021.

[25] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

[26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

[27] Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. Am-radio: Agglomerative vision foundation model reduce all domains into one. In CVPR, 2024.

[28] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. In ICLR, 2025.

[29] Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65, 1987.

[30] Joan Serrat, Jose L. Gómez, and Antonio M. López. Closing the gap in domain adaptation for semantic segmentation: a time-aware method. Machine Vision and Applications, 36:13, 2024.

[31] Cheng Shi and Sibei Yang. The devil is in the object boundary: Towards annotation-free instance segmentation using foundation models. In ICLR, 2024.

[32] Michal Shlapentokh-Rothman, Ansel Blume, Yao Xiao, Yuqun Wu, Sethuraman TV, Heyi Tao, Jae Yong Lee, Wilfredo Torres, Yu-Xiong Wang, and Derek Hoiem. Region-based representations revisited. In CVPR, 2024.

[33] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025.

[34] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features...

[35] Haiyang Wang, Hao Tang, Li Jiang, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, and Liwei Wang. Git: Towards generalist vision transformer through universal language interface. In ECCV, 2024.

[36] Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, and Hadi Pouransari. Sam-clip: Merging vision foundation models towards semantic and spatial understanding. In CVPR, 2024.

[37] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: Beit pretraining for vision and vision-language tasks. In CVPR, 2023.

[38] Zhixiang Wei, Lin Chen, Yi Jin, Xiaoxiao Ma, Tianle Liu, Pengyang Ling, Ben Wang, Huaian Chen, and Jinjin Zheng. Stronger fewer & superior: Harnessing vision foundation models for domain generalized semantic segmentation. In CVPR, 2024.

[39] Zhaoyang Wei, Pengfei Chen, Xuehui Yu, Guorong Li, Jianbin Jiao, and Zhenjun Han. Semantic-aware sam for point-prompted instance segmentation. In CVPR, 2024.

[40] Monika Wysoczańska, Oriane Siméoni, Michaël Ramamonjisoa, Andrei Bursuc, Tomasz Trzciński, and Patrick Pérez. Clip-dinoiser: Teaching clip a few dino tricks for open-vocabulary semantic segmentation. In ECCV, 2024.

[41] Weihao Yan, Yeqiang Qian, Hanyang Zhuang, Chunxiang Wang, and Ming Yang. Sam4udass: When sam meets unsupervised domain adaptive semantic segmentation in intelligent vehicles. IEEE TIV, 9(2):3396–3408, 2024.

[42] Qihang Yu, Xiaohui Shen, and Liang-Chieh Chen. Towards open-ended visual recognition with large language models. In ECCV, 2024.

[43] Haobo Yuan, Xiangtai Li, Chong Zhou, Yining Li, Kai Chen, and Chen Change Loy. Open-vocabulary sam: Segment and recognize twenty-thousand classes interactively. In ECCV,

[44] Yixin Zhang, Nicholas Konz, Kevin Kramer, and Maciej A Mazurowski. Quantifying the limits of segmentation foundation models: Modeling challenges in segmenting tree-like and low-contrast objects. arXiv preprint arXiv:2412.04243,

Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks
Supplementary Material

A. Metric Details

A.1. Hyperparameter Sensitivity

Our metrics involve several hyperparameters. To verify that the structure–edge characterization is robust to hyperparameter choices, we vary each within a range while keeping the others fixed: SFC grid s...
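The one-at-a-time protocol described above (vary one hyperparameter over a range, hold the rest at their defaults) can be sketched as follows. This is a minimal illustration, not the authors' code: `toy_metric`, the parameter names `grid_size` and `threshold`, and all values are hypothetical placeholders, since the actual hyperparameters and ranges are not given in this excerpt.

```python
from typing import Callable, Dict, List, Tuple


def one_at_a_time_sweep(
    metric: Callable[..., float],
    defaults: Dict[str, float],
    ranges: Dict[str, List[float]],
) -> Dict[str, List[Tuple[float, float]]]:
    """Vary each hyperparameter over its range while the others stay at defaults."""
    results: Dict[str, List[Tuple[float, float]]] = {}
    for name, values in ranges.items():
        rows = []
        for value in values:
            params = dict(defaults)   # start from the default setting
            params[name] = value      # override only the swept hyperparameter
            rows.append((value, metric(**params)))
        results[name] = rows
    return results


def toy_metric(grid_size: float, threshold: float) -> float:
    """Hypothetical stand-in for a label-free feature metric (NOT from the paper)."""
    return 1.0 / (1.0 + abs(grid_size - 8) / 8 + abs(threshold - 0.5))


# Hypothetical defaults and ranges, chosen only for illustration.
DEFAULTS = {"grid_size": 8, "threshold": 0.5}
RANGES = {"grid_size": [4, 8, 16], "threshold": [0.3, 0.5, 0.7]}

sweep = one_at_a_time_sweep(toy_metric, DEFAULTS, RANGES)
for name, rows in sweep.items():
    for value, score in rows:
        print(f"{name}={value}: score={score:.3f}")
```

In a real check, one would replace `toy_metric` with the metric under study and compare whether the ranking of encoders stays the same across each sweep, rather than looking at raw scores alone.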

with a Mask2Former head, measured on a single A40 GPU.

Backbone     Params   GFLOPs   Throughput (img/s)
DINOv2       108.1    1294     1.6
DINOv3       107.1    1083     2.2
SAM          111.1    2302     1.4
SAM2         89.1     1132     2.3
Ours-D3S2    176.2    1857     1.4

master encoders, injection at OS=16 yields the best average AP (38.7 for DINOv2, 39.1 for DINOv3), with gains most pronounced on boundary-sensitive classes suc...
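The efficiency figures in the table above can be turned into relative costs of the fused model against each single-encoder backbone. The snippet below is a small sketch of that arithmetic. The numbers are copied from the table; the dictionary layout, the function name `relative_cost`, and the reading of Params as millions are assumptions made for illustration.

```python
# Values copied from the efficiency table (Mask2Former head, single A40 GPU).
# Assumed units: params in millions (not stated in the excerpt), throughput in img/s.
BACKBONES = {
    "DINOv2":    {"params": 108.1, "gflops": 1294, "throughput": 1.6},
    "DINOv3":    {"params": 107.1, "gflops": 1083, "throughput": 2.2},
    "SAM":       {"params": 111.1, "gflops": 2302, "throughput": 1.4},
    "SAM2":      {"params": 89.1,  "gflops": 1132, "throughput": 2.3},
    "Ours-D3S2": {"params": 176.2, "gflops": 1857, "throughput": 1.4},
}


def relative_cost(model: str, baseline: str, table=BACKBONES) -> dict:
    """Return model/baseline ratios for every column of the table."""
    m, b = table[model], table[baseline]
    return {key: m[key] / b[key] for key in m}


for name in ("DINOv2", "DINOv3", "SAM", "SAM2"):
    r = relative_cost("Ours-D3S2", name)
    print(
        f"Ours-D3S2 vs {name}: "
        f"params x{r['params']:.2f}, "
        f"GFLOPs x{r['gflops']:.2f}, "
        f"throughput x{r['throughput']:.2f}"
    )
```

A ratio above 1 for params or GFLOPs means the fused model is more expensive than that baseline; a throughput ratio below 1 means it processes fewer images per second.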

because we attach SAM2’s backbone as an auxiliary edge provider. Despite the additional encoder, throughput remains comparable to single-encoder baselines such as DINOv2-B (1.4 vs. 1.6 img/s).

C. Additional Qualitative Results

Fig. 5 extends the main-paper comparison (Fig. 1) by including fine-tuned variants and our fused model. The two VFM families...