
arxiv: 2605.16864 · v1 · pith:OVA5G7E5new · submitted 2026-05-16 · 💻 cs.CV · cs.AI

Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks

Pith reviewed 2026-05-19 20:26 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords visual foundation models, feature fusion, metric-guided selection, segmentation, dense prediction, label-free metrics, structural coherence, edge fidelity

The pith

Label-free metrics identify complementary VFM pairs for fusion that boosts dense prediction performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Visual foundation models display distinct representation biases, such as promptable segmentation models emphasizing fine-grained boundaries while self-supervised models highlight object-level structure. The paper introduces a suite of label-free metrics in feature space to score Structural Coherence and Edge Fidelity, which guide the selection of complementary encoder pairs. These pairs are integrated through a master-auxiliary fusion scheme that avoids complex architectural modifications and requires only single-stage training. The resulting model delivers consistent gains on multiple dense prediction tasks with improved object semantics and boundary localization.
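
As a rough illustration of what label-free scoring in feature space can look like, the sketch below computes two toy proxies on a small feature grid. These are assumptions made for illustration, not the paper's definitions of Structural Coherence or Edge Fidelity, which this summary does not reproduce. The names `coherence_proxy` (mean cosine similarity between adjacent cells) and `edge_contrast_proxy` (largest neighbor jump relative to the mean jump) are hypothetical, as is the toy grid.

```python
import math

# A feature map here is grid[row][col] -> feature vector (list of floats).
# Both proxies are hypothetical stand-ins, not the paper's metrics.

def _neighbor_pairs(grid):
    """Yield each cell's feature together with its right and lower neighbor."""
    height, width = len(grid), len(grid[0])
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                yield grid[y][x], grid[y][x + 1]
            if y + 1 < height:
                yield grid[y][x], grid[y + 1][x]

def _cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def coherence_proxy(grid):
    """Mean cosine similarity between adjacent cells (higher = smoother regions)."""
    sims = [_cosine(u, v) for u, v in _neighbor_pairs(grid)]
    return sum(sims) / len(sims)

def edge_contrast_proxy(grid, eps=1e-12):
    """Largest L2 jump between adjacent cells divided by the mean jump."""
    jumps = [math.dist(u, v) for u, v in _neighbor_pairs(grid)]
    return max(jumps) / (sum(jumps) / len(jumps) + eps)

# Toy 2x4 grid with one sharp vertical boundary between two "regions".
left, right = [1.0, 0.0], [0.0, 1.0]
toy_grid = [[left, left, right, right] for _ in range(2)]
print(coherence_proxy(toy_grid), edge_contrast_proxy(toy_grid))
```

Neither function needs labels: both look only at how features change between neighboring locations, which is the property that makes this kind of score usable before any task-specific training.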

Core claim

The paper claims that explicit assessment scores from label-free metrics on Structural Coherence and Edge Fidelity can reliably select complementary edge-strong and structure-strong VFM encoders, and that integrating their features via a master-auxiliary fusion scheme produces consistent performance gains across dense prediction tasks, including better object-level semantics and more accurately localized boundaries.
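
One way to picture a master-auxiliary fusion is a residual add: the master features stay on the main path, and the auxiliary features are linearly projected to the master's channel width and added with a scalar gate. The sketch below assumes that form. The residual-add operator, the scalar `gate`, and the names `project` and `fuse` are illustrative assumptions, since this summary does not specify the paper's actual fusion operator.

```python
def project(vec, weight):
    """Linear map; `weight` is a list of rows, one row per output channel."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def fuse(master, auxiliary, weight, gate):
    """Add gated, projected auxiliary features to the master features.

    master, auxiliary: lists of per-location feature vectors on the same grid
    (assumed already resampled to a shared spatial size).
    """
    if len(master) != len(auxiliary):
        raise ValueError("feature maps must share the same number of locations")
    fused = []
    for m, a in zip(master, auxiliary):
        p = project(a, weight)
        fused.append([mi + gate * pi for mi, pi in zip(m, p)])
    return fused

# Toy example: one location, 2 master channels, 3 auxiliary channels.
master = [[1.0, 2.0]]
auxiliary = [[1.0, 0.0, 1.0]]
weight = [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
print(fuse(master, auxiliary, weight, gate=0.5))
```

With `gate=0.0` the output equals the master features, so a gate initialized at zero would leave the master encoder's behavior untouched at the start of training. That is a common design choice for injected branches, not something the summary attributes to the paper.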

What carries the argument

A suite of label-free metrics for Structural Coherence and Edge Fidelity that scores VFM encoder features to select and fuse complementary pairs through a master-auxiliary scheme.
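
The selection step can be sketched as a simple argmax over per-encoder scores. The sketch assumes that each encoder has already been assigned one structure score and one edge score, that the structure-strongest encoder becomes the master, and that the edge-strongest of the remaining encoders becomes the auxiliary. The function name `select_pair`, the score keys, the encoder names, and the numbers are all hypothetical placeholders.

```python
def select_pair(scores):
    """Pick (master, auxiliary) from {name: {"structure": float, "edge": float}}.

    Assumed rule: master = highest structure score,
    auxiliary = highest edge score among the remaining encoders.
    """
    if len(scores) < 2:
        raise ValueError("need at least two encoders to form a pair")
    master = max(scores, key=lambda name: scores[name]["structure"])
    auxiliary = max(
        (name for name in scores if name != master),
        key=lambda name: scores[name]["edge"],
    )
    return master, auxiliary

# Placeholder scores, invented for the example (not results from the paper).
example_scores = {
    "enc_a": {"structure": 0.9, "edge": 0.2},
    "enc_b": {"structure": 0.4, "edge": 0.8},
    "enc_c": {"structure": 0.5, "edge": 0.5},
}
print(select_pair(example_scores))
```

Excluding the master from the auxiliary search keeps the pair from collapsing onto a single encoder that happens to score highest on both axes.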

If this is right

  • Consistent performance gains occur across multiple dense prediction tasks compared with the baselines.
  • Fused features exhibit better object-level semantics than either encoder alone.
  • Boundaries are localized more accurately than in the individual models.
  • The approach requires no complex architectural changes and trains in a single stage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same metric-guided selection could be tested on other dense prediction problems such as depth estimation to check whether the coherence and fidelity scores transfer.
  • Dynamic re-selection of auxiliary encoders per input image might further improve results without retraining the fusion module.
  • The metrics offer an interpretable criterion that could reduce trial-and-error when building ensembles from new VFMs.

Load-bearing premise

The label-free metrics for Structural Coherence and Edge Fidelity in feature space can reliably identify which VFM encoders are complementary and worth fusing, without any task-specific labels or supervision.

What would settle it

If fusion guided by the Structural Coherence and Edge Fidelity scores produces no measurable improvement over single-VFM baselines or random pairing on standard segmentation benchmarks, the utility of the metric-guided selection would be falsified.
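
The falsification condition above can be written as an explicit check. This is a minimal sketch under stated assumptions: one benchmark score per configuration where higher is better, random pairing summarized by its mean, and an optional `margin` for the smallest difference counted as an improvement. The function name and the example numbers are hypothetical placeholders, not results from the paper.

```python
from statistics import mean

def guided_beats_controls(guided, single_baselines, random_pairs, margin=0.0):
    """True only if the metric-guided pair beats both control groups.

    guided: score of the metric-selected fused model.
    single_baselines: scores of single-encoder models.
    random_pairs: scores of fused models built from randomly chosen pairs.
    """
    best_single = max(single_baselines)
    random_mean = mean(random_pairs)
    return guided > best_single + margin and guided > random_mean + margin

# Arbitrary placeholder numbers, chosen only to exercise the function.
print(guided_beats_controls(40.0, [38.0, 39.0], [38.5, 39.5]))
print(guided_beats_controls(40.0, [38.0, 39.0], [38.5, 39.5], margin=2.0))
```

A `False` result on standard benchmarks would correspond to the outcome described above, in which the selection metrics add nothing beyond what fusion with an arbitrary partner already provides.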

Figures

Figures reproduced from arXiv: 2605.16864 by Antonio Manuel Lopez Pena, Danna Xue, Jose Luis Gomez Zurita, Yachan Guo, Yi Xiao.

Figure 1. Failures of the frozen VFMs’ encoders applied to in...
Figure 2. Channel-averaged feature activation maps (warm colors...
Figure 3. The overview of our framework. (a) Given features extracted by multiple VFM encoders, we design two categories of metrics, ...
Figure 4. Qualitative results on Cityscapes. (a) Injecting SAM2...
Figure 5. Qualitative comparison on Cityscapes. All outputs are unfiltered (no confidence thresholding). Red boxes: failure cases of...
Original abstract

Although large-scale visual foundation models (VFMs) achieve remarkable performance in semantic understanding, they still underperform in instance-aware dense prediction tasks. They exhibit different biases in representation: for instance, promptable segmentation models (e.g., SAM2) focus on fine-grained region boundaries, while self-supervised models (e.g., DINOv3) emphasize object-level structure. This observation highlights the potential of combining complementary features from different VFMs to enhance downstream dense prediction tasks. However, naive multi-VFM fusion seldom leads to reliable gains, and interpretable principles for leveraging their complementary features are still underexplored. In this work, we propose a metric-guided approach that effectively selects and aggregates complementary features from different VFMs based on explicit assessment scores. Specifically, we design a suite of label-free metrics in feature space across two aspects, Structural Coherence and Edge Fidelity, to assess features of VFM encoders. Guided by these scores, we identify complementary edge-strong and structure-strong encoder pairs, and integrate them via a master-auxiliary fusion scheme. This feature fusion requires no complex architectural changes and is trained only in a single stage. Our model shows consistent performance gains across multiple dense prediction tasks compared with the baselines, with better object-level semantics and more accurately localized boundaries. The code is available at https://github.com/gyc-code/metric-guided-fusion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a metric-guided feature fusion approach for combining visual foundation models (VFMs) such as SAM2 and DINOv3 in dense prediction tasks. It introduces two label-free metrics—Structural Coherence and Edge Fidelity—computed in feature space to assess and select complementary encoder pairs (edge-strong and structure-strong), which are then integrated via a master-auxiliary fusion scheme. The method requires no complex architectural modifications and is trained in a single stage. The central claim is that this yields consistent performance gains over baselines across multiple segmentation tasks, with improved object-level semantics and boundary localization.

Significance. If the experimental support holds, the work provides an interpretable, label-free procedure for exploiting complementary biases across VFMs, addressing the unreliability of naive multi-VFM fusion. The public code release at https://github.com/gyc-code/metric-guided-fusion is a positive contribution to reproducibility. The approach could be useful for practitioners seeking lightweight improvements in instance-aware dense prediction without multi-stage training or heavy architectural redesign.

major comments (2)
  1. [§4 and §3.2] §4 (Experiments) and §3.2 (Metric definitions): The central claim that the Structural Coherence and Edge Fidelity metrics reliably identify complementary VFM pairs rests on an unverified assumption. No controlled ablation is presented that compares downstream fusion performance for metric-selected pairs versus random pairs, low-scoring pairs, or exhaustive enumeration of all possible pairs. Without this correlation check, it remains unclear whether higher metric scores predict useful complementarity for object semantics and boundary accuracy or merely reflect incidental feature statistics.
  2. [Table 2] Table 2 and associated text: While performance gains are reported, the manuscript does not include error bars, statistical significance tests, or per-task breakdowns that isolate the contribution of the metric-guided selection from the fusion architecture itself. This weakens the ability to attribute improvements specifically to the proposed metrics.
minor comments (2)
  1. [§3.3] The notation for the master-auxiliary fusion weights (e.g., how α and β are determined from the metric scores) could be clarified with an explicit equation or pseudocode in §3.3.
  2. [Figure 4] Figure 4 (qualitative results on Cityscapes) would benefit from side-by-side comparison with the individual VFM baselines to better illustrate the claimed improvements in boundary localization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We respond to each major comment below and describe the revisions we will make.

  1. Referee: [§4 and §3.2] The central claim that the Structural Coherence and Edge Fidelity metrics reliably identify complementary VFM pairs rests on an unverified assumption. No controlled ablation is presented that compares downstream fusion performance for metric-selected pairs versus random pairs, low-scoring pairs, or exhaustive enumeration of all possible pairs. Without this correlation check, it remains unclear whether higher metric scores predict useful complementarity for object semantics and boundary accuracy or merely reflect incidental feature statistics.

    Authors: We agree that a direct ablation comparing metric-selected pairs to random and low-scoring pairs would provide stronger evidence for the predictive value of the metrics. In the revised manuscript we will add these controlled experiments on the evaluated tasks, reporting the resulting segmentation performance to demonstrate that higher metric scores correspond to improved complementarity. We will also note that exhaustive enumeration of all VFM pairs quickly becomes computationally prohibitive and is therefore not performed, while the added ablations on representative pairs will still allow readers to assess the correlation between metric scores and downstream gains. revision: yes

  2. Referee: [Table 2] While performance gains are reported, the manuscript does not include error bars, statistical significance tests, or per-task breakdowns that isolate the contribution of the metric-guided selection from the fusion architecture itself. This weakens the ability to attribute improvements specifically to the proposed metrics.

    Authors: We acknowledge that the current presentation of results would benefit from greater statistical rigor and clearer isolation of the metric-guided component. In the revision we will augment Table 2 with error bars (standard deviation over multiple random seeds), include paired statistical significance tests for the reported gains, and add per-task breakdowns. We will further introduce a control experiment that applies the same fusion architecture without metric-based pair selection, allowing readers to separate the contribution of the proposed metrics from the fusion scheme itself. revision: yes
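
The statistics promised in the second response (standard deviation over seeds and a paired test) can be sketched with the standard library. The sketch assumes that the fused model and the baseline are trained with the same seeds and that scores are listed in matching seed order. The function names and the numbers are hypothetical placeholders, not values from Table 2.

```python
import math
from statistics import mean, stdev

def summarize(scores):
    """Mean and sample standard deviation over seeds (the error bar)."""
    return mean(scores), stdev(scores)

def paired_t(fused, baseline):
    """Paired t statistic and degrees of freedom for per-seed differences."""
    if len(fused) != len(baseline) or len(fused) < 2:
        raise ValueError("need matching score lists from at least 2 seeds")
    diffs = [f - b for f, b in zip(fused, baseline)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all differences are identical; t statistic undefined")
    t_value = mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t_value, len(diffs) - 1

# Placeholder scores for three seeds, invented for the example.
fused_scores = [2.0, 3.0, 4.0]
baseline_scores = [1.0, 1.0, 1.0]
print(summarize(fused_scores))
print(paired_t(fused_scores, baseline_scores))
```

The t statistic still has to be converted to a p-value using a t distribution with the returned degrees of freedom. `scipy.stats.ttest_rel` computes the same statistic together with the p-value if SciPy is available.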

Circularity Check

0 steps flagged

No significant circularity; empirical selection-and-fusion procedure is self-contained


The paper defines two label-free metrics (Structural Coherence and Edge Fidelity) in feature space, uses them to select encoder pairs, and fuses via a master-auxiliary scheme trained in one stage. Performance gains are reported on downstream dense-prediction benchmarks. No equations, fitted parameters, or self-citations are shown that reduce the claimed gains or the metric-guided selection to quantities defined by construction inside the paper. The derivation chain consists of explicit metric definitions followed by empirical validation rather than a closed identity or self-referential premise.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested premise that the two newly designed label-free metrics correctly rank encoder complementarity. No free parameters are mentioned in the abstract. No new physical entities are introduced.

axioms (1)
  • domain assumption: Different VFMs exhibit complementary biases in representation (SAM2 for boundaries, DINOv3 for object structure).
    Stated in the opening paragraph of the abstract as the motivation for fusion.




Supplementary material (excerpts)

A. Metric Details

A.1. Hyperparameter Sensitivity

Our metrics involve several hyperparameters. To verify that the structure–edge characterization is robust to hyperparameter choices, we vary each within a range while keeping others fixed: SFC grid s...

... with a Mask2Former head, measured on a single A40 GPU.

Backbone     Params   GFLOPs   Throughput (img/s)
DINOv2       108.1    1294     1.6
DINOv3       107.1    1083     2.2
SAM          111.1    2302     1.4
SAM2         89.1     1132     2.3
Ours-D3S2    176.2    1857     1.4

... master encoders, injection at OS=16 yields the best average AP (38.7 for DINOv2, 39.1 for DINOv3), with gains most pronounced on boundary-sensitive classes suc...

... because we attach SAM2’s backbone as an auxiliary edge provider. Despite the additional encoder, throughput remains comparable to single-encoder baselines such as DINOv2-B (1.4 vs. 1.6 img/s).

C. Additional Qualitative Results

Fig. 5 extends the main-paper comparison (Fig. 1) by including fine-tuned variants and our fused model. The two VFM families...