Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks
Pith reviewed 2026-05-19 20:26 UTC · model grok-4.3
The pith
Label-free metrics identify complementary VFM pairs for fusion that boosts dense prediction performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that explicit assessment scores from label-free metrics on Structural Coherence and Edge Fidelity can reliably select complementary edge-strong and structure-strong VFM encoders, and that integrating their features via a master-auxiliary fusion scheme produces consistent performance gains across dense prediction tasks, including better object-level semantics and more accurately localized boundaries.
What carries the argument
A suite of label-free metrics for Structural Coherence and Edge Fidelity that scores VFM encoder features to select and fuse complementary pairs through a master-auxiliary scheme.
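To make the selection step concrete, here is a minimal sketch, assuming each encoder has already been given one Structural Coherence score and one Edge Fidelity score. The `EncoderScore` type, the `select_pair` function, and the rule of taking the top encoder on each axis are illustrative assumptions; the summary above does not say how the paper turns scores into a pair.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class EncoderScore:
    """Hypothetical container for one encoder's label-free scores."""
    name: str
    structural_coherence: float  # assumed: higher means more coherent structure
    edge_fidelity: float         # assumed: higher means sharper boundaries


def select_pair(scores: Sequence[EncoderScore]) -> Tuple[str, str]:
    """Return (structure_strong, edge_strong) encoder names.

    Assumed rule: take the best encoder on each axis. If the same encoder
    wins both, fall back to the runner-up on edge fidelity so that the pair
    always contains two distinct encoders.
    """
    if len(scores) < 2:
        raise ValueError("need at least two encoders to form a pair")
    structure = max(scores, key=lambda s: s.structural_coherence)
    by_edge = sorted(scores, key=lambda s: s.edge_fidelity, reverse=True)
    edge = by_edge[0] if by_edge[0].name != structure.name else by_edge[1]
    return structure.name, edge.name
```

The fallback to the runner-up is a design choice of this sketch only; a real pipeline might instead reject the candidate set when no distinct edge-strong encoder exists.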
If this is right
- Consistent performance gains occur across multiple dense prediction tasks compared with the baselines.
- Fused features exhibit better object-level semantics than either encoder alone.
- Boundaries are localized more accurately than in the individual models.
- The approach requires no complex architectural changes and trains in a single stage.
Where Pith is reading between the lines
- The same metric-guided selection could be tested on other dense prediction problems such as depth estimation to check whether the coherence and fidelity scores transfer.
- Dynamic re-selection of auxiliary encoders per input image might further improve results without retraining the fusion module.
- The metrics offer an interpretable criterion that could reduce trial-and-error when building ensembles from new VFMs.
Load-bearing premise
The label-free metrics for Structural Coherence and Edge Fidelity in feature space can reliably identify which VFM encoders are complementary and worth fusing, without any task-specific labels or supervision.
What would settle it
If fusion guided by the Structural Coherence and Edge Fidelity scores yields no measurable improvement over single-VFM baselines or over randomly paired encoders on standard segmentation benchmarks, the claim that metric-guided selection is useful would be falsified.
Original abstract
Although large-scale visual foundation models (VFMs) achieve remarkable performance in semantic understanding, they still underperform in instance-aware dense prediction tasks. They exhibit different biases in representation: for instance, promptable segmentation models (e.g., SAM2) focus on fine-grained region boundaries, while self-supervised models (e.g., DINOv3) emphasize object-level structure. This observation highlights the potential of combining complementary features from different VFMs to enhance downstream dense prediction tasks. However, naive multi-VFM fusion seldom leads to reliable gains, and interpretable principles for leveraging their complementary features are still underexplored. In this work, we propose a metric-guided approach that effectively selects and aggregates complementary features from different VFMs based on explicit assessment scores. Specifically, we design a suite of label-free metrics in feature space across two aspects, Structural Coherence and Edge Fidelity, to assess features of VFM encoders. Guided by these scores, we identify complementary edge-strong and structure-strong encoder pairs, and integrate them via a master-auxiliary fusion scheme. This feature fusion requires no complex architectural changes and is trained only in a single stage. Our model shows consistent performance gains across multiple dense prediction tasks compared with the baselines, with better object-level semantics and more accurately localized boundaries. The code is available at https://github.com/gyc-code/metric-guided-fusion.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a metric-guided feature fusion approach for combining visual foundation models (VFMs) such as SAM2 and DINOv3 in dense prediction tasks. It introduces two label-free metrics—Structural Coherence and Edge Fidelity—computed in feature space to assess and select complementary encoder pairs (edge-strong and structure-strong), which are then integrated via a master-auxiliary fusion scheme. The method requires no complex architectural modifications and is trained in a single stage. The central claim is that this yields consistent performance gains over baselines across multiple segmentation tasks, with improved object-level semantics and boundary localization.
Significance. If the experimental support holds, the work provides an interpretable, label-free procedure for exploiting complementary biases across VFMs, addressing the unreliability of naive multi-VFM fusion. The public code release at https://github.com/gyc-code/metric-guided-fusion is a positive contribution to reproducibility. The approach could be useful for practitioners seeking lightweight improvements in instance-aware dense prediction without multi-stage training or heavy architectural redesign.
major comments (2)
- [§4 and §3.2] §4 (Experiments) and §3.2 (Metric definitions): The central claim that the Structural Coherence and Edge Fidelity metrics reliably identify complementary VFM pairs rests on an unverified assumption. No controlled ablation is presented that compares downstream fusion performance for metric-selected pairs versus random pairs, low-scoring pairs, or exhaustive enumeration of all possible pairs. Without this correlation check, it remains unclear whether higher metric scores predict useful complementarity for object semantics and boundary accuracy or merely reflect incidental feature statistics.
- [Table 2] Table 2 and associated text: While performance gains are reported, the manuscript does not include error bars, statistical significance tests, or per-task breakdowns that isolate the contribution of the metric-guided selection from the fusion architecture itself. This weakens the ability to attribute improvements specifically to the proposed metrics.
minor comments (2)
- [§3.3] The notation for the master-auxiliary fusion weights (e.g., how α and β are determined from the metric scores) could be clarified with an explicit equation or pseudocode in §3.3.
- [Figure 3] Figure 3 (qualitative results) would benefit from side-by-side comparison with the individual VFM baselines to better illustrate the claimed improvements in boundary localization.
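As an illustration of the kind of pseudocode the first minor comment asks for, the sketch below shows one hypothetical way that weights α and β could be derived from metric scores and applied to features. Both the proportional weighting rule and the element-wise weighted sum are assumptions made here for concreteness; neither is taken from the paper, whose actual master-auxiliary scheme is not specified in this report.

```python
from typing import List, Sequence, Tuple


def fusion_weights(master_score: float, aux_score: float) -> Tuple[float, float]:
    """Hypothetical rule: alpha and beta proportional to the two scores."""
    if master_score < 0 or aux_score < 0:
        raise ValueError("scores are assumed to be non-negative")
    total = master_score + aux_score
    if total == 0:
        return 0.5, 0.5  # no information: weight both encoders equally
    return master_score / total, aux_score / total


def fuse(master: Sequence[float], aux: Sequence[float],
         alpha: float, beta: float) -> List[float]:
    """Element-wise alpha * master + beta * aux for same-length feature vectors."""
    if len(master) != len(aux):
        raise ValueError("features must be projected to the same size first")
    return [alpha * m + beta * a for m, a in zip(master, aux)]
```

Plain lists stand in for feature maps here; a real implementation would operate on tensors and would typically learn a projection for the auxiliary features before mixing.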
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major comment below and describe the revisions we will make to address the concerns.
Point-by-point responses
- Referee: [§4 and §3.2] The central claim that the Structural Coherence and Edge Fidelity metrics reliably identify complementary VFM pairs rests on an unverified assumption. No controlled ablation is presented that compares downstream fusion performance for metric-selected pairs versus random pairs, low-scoring pairs, or exhaustive enumeration of all possible pairs. Without this correlation check, it remains unclear whether higher metric scores predict useful complementarity for object semantics and boundary accuracy or merely reflect incidental feature statistics.
  Authors: We agree that a direct ablation comparing metric-selected pairs to random and low-scoring pairs would provide stronger evidence for the predictive value of the metrics. In the revised manuscript we will add these controlled experiments on the evaluated tasks, reporting the resulting segmentation performance to demonstrate that higher metric scores correspond to improved complementarity. We will also note that exhaustive enumeration of all VFM pairs quickly becomes computationally prohibitive and is therefore not performed, while the added ablations on representative pairs will still allow readers to assess the correlation between metric scores and downstream gains.
  revision: yes
- Referee: [Table 2] While performance gains are reported, the manuscript does not include error bars, statistical significance tests, or per-task breakdowns that isolate the contribution of the metric-guided selection from the fusion architecture itself. This weakens the ability to attribute improvements specifically to the proposed metrics.
  Authors: We acknowledge that the current presentation of results would benefit from greater statistical rigor and clearer isolation of the metric-guided component. In the revision we will augment Table 2 with error bars (standard deviation over multiple random seeds), include paired statistical significance tests for the reported gains, and add per-task breakdowns. We will further introduce a control experiment that applies the same fusion architecture without metric-based pair selection, allowing readers to separate the contribution of the proposed metrics from the fusion scheme itself.
  revision: yes
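The statistical additions promised in the second response could look like the sketch below: mean and sample standard deviation over seeds for the error bars, plus an exact paired sign-flip permutation test on per-seed differences. The function names and the choice of a sign-flip test are assumptions; the rebuttal only commits to "paired statistical significance tests" without naming one.

```python
from itertools import product
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_and_std(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over seeds (needs >= 2 values)."""
    return mean(values), stdev(values)


def paired_sign_flip_pvalue(fused: Sequence[float],
                            baseline: Sequence[float]) -> float:
    """Two-sided exact p-value for the summed paired difference.

    Under the null hypothesis each per-seed difference is equally likely to
    carry either sign, so all 2**n sign assignments are enumerated. This is
    only practical for a small number of seeds.
    """
    if len(fused) != len(baseline) or not fused:
        raise ValueError("need two equally long, non-empty sequences")
    diffs = [f - b for f, b in zip(fused, baseline)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point rounding.
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total
```

With n seeds the smallest attainable two-sided p-value is 2 / 2**n, so a handful of seeds cannot reach conventional significance thresholds no matter how large the gain is; that limit is worth keeping in mind when choosing the number of runs.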
Circularity Check
No significant circularity; empirical selection-and-fusion procedure is self-contained
full rationale
The paper defines two label-free metrics (Structural Coherence and Edge Fidelity) in feature space, uses them to select encoder pairs, and fuses via a master-auxiliary scheme trained in one stage. Performance gains are reported on downstream dense-prediction benchmarks. No equations, fitted parameters, or self-citations are shown that reduce the claimed gains or the metric-guided selection to quantities defined by construction inside the paper. The derivation chain consists of explicit metric definitions followed by empirical validation rather than a closed identity or self-referential premise.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Different VFMs exhibit complementary biases in representation (SAM2 for boundaries, DINOv3 for object structure).
[45]
3 Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks Supplementary Material A. Metric Details A.1. Hyperparameter Sensitivity Our metrics involve several hyperparameters. To verify that the structure–edge characterization is robust to hyper- parameters choices, we vary each within a range while keeping others fixed: SFC grid s...
-
[46]
with a Mask2Former head, measured on a single A40 GPU. Backbone Params GFLOPs Throughput DINOv2 108.1 1294 1.6 DINOv3 107.1 1083 2.2 SAM 111.1 2302 1.4 SAM2 89.1 1132 2.3 Ours-D3S2 176.2 1857 1.4 master encoders, injection at OS=16 yields the best aver- age AP (38.7 for DINOv2, 39.1 for DINOv3), with gains most pronounced on boundary-sensitive classes suc...
-
[47]
because we attach SAM2’s backbone as an auxil- iary edge provider. Despite the additional encoder, through- put remains comparable to single-encoder baselines such as DINOv2-B (1.4 vs. 1.6 img/s). C. Additional Qualitative Results Fig. 5 extends the main-paper comparison (Fig. 1) by in- cluding fine-tuned variants and our fused model. The two VFM families...