pith. machine review for the scientific record.

arxiv: 2603.25744 · v2 · submitted 2026-03-26 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 23:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-resolution fusion · vision foundation models · feature fusion · multi-scale · training-free · DINOv2 · inference enhancement

The pith

Fusing features from multiple image resolutions creates stronger representations in frozen vision foundation models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that vision foundation models are typically run at a single fixed input resolution at inference, even though different resolutions capture different useful information: low resolutions favor overall meaning, high resolutions favor fine detail. MuRF runs the model several times on the same image at different scales, then fuses the resulting features into one representation. This requires no training and no changes to the model, and it applies to a range of VFMs. Readers should care because it offers a straightforward way to get better results from models they already have, improving tasks such as object recognition and image segmentation.

Core claim

MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. This strategy harnesses the complementary inductive biases of varying resolutions and serves as a training-free enhancement applicable to a broad spectrum of VFMs including DINOv2 and SigLIP2.

What carries the argument

Multi-Resolution Fusion (MuRF), the process of generating and combining feature sets from several input resolutions of the same image using an unchanged vision foundation model.
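
A minimal sketch of that process, assuming a generic frozen encoder that maps an image to a [1, C, h, w] patch-feature map; the encoder call, the 0.5×/1.0×/1.5× scale set, the patch size of 14, and mean fusion are illustrative stand-ins, not the paper's exact configuration.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def murf_features(image, encoder, scales=(0.5, 1.0, 1.5), patch=14):
        """Run a frozen encoder at several input resolutions and fuse the features.

        image:   float tensor [1, 3, H, W] with H and W multiples of `patch`
        encoder: frozen module mapping an image to a patch-feature map [1, C, h, w]
                 (a stand-in for a VFM such as DINOv2; the exact API is assumed)
        """
        _, _, H, W = image.shape
        target = (H // patch, W // patch)          # shared grid = the 1.0x patch grid
        per_scale = []
        for s in scales:
            # Resize the input, snapping each side to a multiple of the patch size.
            h = max(patch, int(round(H * s / patch)) * patch)
            w = max(patch, int(round(W * s / patch)) * patch)
            view = F.interpolate(image, size=(h, w), mode="bilinear", align_corners=False)
            feats = encoder(view)                  # [1, C, h/patch, w/patch]
            # Bring every scale's features onto the shared spatial grid.
            per_scale.append(F.interpolate(feats, size=target,
                                           mode="bilinear", align_corners=False))
        # Fuse across scales; a simple mean is one possible training-free choice.
        return torch.stack(per_scale, dim=0).mean(dim=0)   # [1, C, H/patch, W/patch]

The fused map can then feed whatever lightweight task head (segmentation, depth, retrieval) would otherwise consume single-scale features.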

Load-bearing premise

Features extracted at different resolutions contain complementary information that can be fused into a single, superior representation without losing that information and without retraining the model.

What would settle it

A controlled experiment on a benchmark dataset in which the multi-resolution fused features fail to outperform the best single-resolution baseline would disprove the central claim.
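
Concretely, such a test could reuse the murf_features sketch above and score the fused features against every single-resolution baseline under the same frozen-feature evaluation protocol; probe_fit and probe_score below are assumed placeholder helpers (for example a linear probe), not anything specified in the paper.

    def compare_fused_vs_single(images, labels, encoder, probe_fit, probe_score,
                                scales=(0.5, 1.0, 1.5)):
        """Falsification check: does the fused representation beat the best single scale?

        Builds on the murf_features sketch above; probe_fit / probe_score stand in
        for any frozen-feature evaluation protocol.
        """
        results = {}
        for s in scales:                                   # single-resolution baselines
            feats = [murf_features(x, encoder, scales=(s,)) for x in images]
            results[f"single_{s}x"] = probe_score(probe_fit(feats, labels), feats, labels)
        fused = [murf_features(x, encoder, scales=scales) for x in images]
        results["murf"] = probe_score(probe_fit(fused, labels), fused, labels)
        # The central claim is falsified if results["murf"] fails to exceed
        # the best single-scale score on the same benchmark.
        return results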

Figures

Figures reproduced from arXiv: 2603.25744 by Bocheng Zou, Dingfu Lu, Mark Stanley, Mu Cai, Yong Jae Lee.

Figure 1
Figure 1: The “Recognition vs. Refinement” Dynamic. Feature maps obtained when the input is resized to 266, 518, and 784. At lower resolutions, the representation is globally coherent, enabling robust recognition. At higher resolutions, boundary details are sharper, enabling precise refinement, but the object’s interior becomes noisy, risking incomplete segmentation. Our work is motivated by synergizing these two rol… view at source ↗
Figure 2
Figure 2: Overview of Multi-Resolution Fusion (MuRF). An input image is resized to multiple resolutions and each view is processed by a frozen DINOv2 encoder to produce separate feature maps. These features are upsampled to a shared spatial resolution and fused into a single multi-resolution representation, which can then be used by lightweight task-specific heads for semantic segmentation, depth estimation, visual … view at source ↗
Figure 3
Figure 3: Qualitative comparison of semantic segmentation results on ADE20K (top) and PASCAL VOC (bottom) with different input resolutions. All images are resized to a square shape before being fed into DINOv2, and the subtitle above each image indicates the corresponding input resolution (side length in pixels). view at source ↗
Figure 4
Figure 4: Qualitative depth estimation results on NYUd (left) and SUN RGB-D (right). We compare single-scale DINOv2 predictions at 0.5×, 1.0×, and 1.5× input resolutions with our MuRF fusion. By aggregating multi-resolution features, MuRF better preserves global scene structure while sharpening local geometry, producing smoother and more accurate depth maps. Labels 0.X× indicate that the image fed into DINOv2 is res… view at source ↗
Figure 5
Figure 5: The visualization of anomaly detection on MVTec AD 2 TESTpub dataset. Our merged result (MuRF) successfully combines the robust detection from low-resolution views (e.g., 0.3× correctly identifies the anomaly’s presence but with a coarse mask) and the sharp boundaries from high-resolution views (e.g., 0.7×). view at source ↗
Figure 7
Figure 7: Additional segmentation visualizations from ADE20K and PASCAL VOC. view at source ↗
Figure 8
Figure 8: Additional visualizations of depth estimation results on NYUd and SUN RGB-D. view at source ↗
Figure 9
Figure 9: Additional PCA visualization. view at source ↗
read the original abstract

Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute. It is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families - primarily DINOv2, but also demonstrating successful generalization to contrastive models like SigLIP2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Multi-Resolution Fusion (MuRF), a training-free inference-time method for vision foundation models. It processes an input image at multiple resolutions through a frozen VFM (primarily DINOv2, with extension to SigLIP2), extracts features at each scale, and fuses them into a single unified representation. The central claim is that this exploits complementary inductive biases—global semantics from low resolution and fine details from high resolution—yielding strictly superior performance across tasks without any retraining or architecture-specific changes.

Significance. If the empirical results hold, MuRF would constitute a broadly applicable, zero-cost enhancement to existing VFMs. Its claimed universality across model families and tasks is a notable strength, as it avoids the need for task-specific tuning or additional parameters. This could meaningfully improve representation quality in settings where both coarse and fine visual information matter.

major comments (2)
  1. [Abstract] Abstract: the claim that multi-resolution fusion 'reliably' combines complementary inductive biases into a superior representation is load-bearing, yet the abstract (and the provided manuscript excerpt) gives no concrete description of the fusion operator. Without specifying how spatial-size mismatch is resolved (resizing, interpolation, or aggregation), it is impossible to verify that high-frequency detail is preserved rather than discarded, which bears directly on the stress-test concern raised above.
  2. [Empirical validation] Empirical validation section: the assertion of successful generalization to contrastive models such as SigLIP2 is stated without quantitative results, ablations on the fusion step, or comparison against single-scale baselines. This leaves the universality argument unsupported and prevents assessment of whether the fusion step actually improves over the best single-resolution input.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'unified representation' is used without clarifying whether fusion occurs before or after global pooling, which affects downstream compatibility.
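
To make that minor point concrete, the two possible orderings differ roughly as follows; this is a hedged illustration with mean pooling and mean fusion as stand-ins, not the paper's stated design.

    import torch

    def fuse_then_pool(per_scale_maps):
        """Fuse dense feature maps across scales first, then pool to one vector.
        per_scale_maps: list of [1, C, h, w] tensors already on a shared grid.
        Keeps dense features available for segmentation- or depth-style heads."""
        fused_map = torch.stack(per_scale_maps, dim=0).mean(dim=0)   # [1, C, h, w]
        return fused_map, fused_map.mean(dim=(2, 3))                 # dense map + [1, C]

    def pool_then_fuse(per_scale_maps):
        """Pool each scale to a global vector first, then fuse the vectors.
        Spatial detail is discarded before fusion, so only image-level heads apply."""
        pooled = [m.mean(dim=(2, 3)) for m in per_scale_maps]        # each [1, C]
        return torch.stack(pooled, dim=0).mean(dim=0)                # [1, C]

Dense-prediction heads need the first ordering; image-level tasks could use either, which is why the referee asks the manuscript to state which one MuRF adopts.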

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight opportunities to improve the clarity of our method description and the strength of our empirical claims. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that multi-resolution fusion 'reliably' combines complementary inductive biases into a superior representation is load-bearing, yet the abstract (and the provided manuscript excerpt) gives no concrete description of the fusion operator. Without specifying how spatial-size mismatch is resolved (resizing, interpolation, or aggregation), it is impossible to verify that high-frequency detail is preserved rather than discarded, which bears directly on the stress-test concern raised above.

    Authors: We agree that the abstract would benefit from a concise description of the fusion operator. In the revised manuscript, we will update the abstract to specify that features from different resolutions are aligned via bilinear interpolation to a common spatial size, followed by concatenation and a lightweight aggregation (averaging across scales) that preserves high-frequency details from higher-resolution inputs while incorporating global context from lower resolutions. revision: yes

  2. Referee: [Empirical validation] Empirical validation section: the assertion of successful generalization to contrastive models such as SigLIP2 is stated without quantitative results, ablations on the fusion step, or comparison against single-scale baselines. This leaves the universality argument unsupported and prevents assessment of whether the fusion step actually improves over the best single-resolution input.

    Authors: The manuscript states successful generalization to SigLIP2 and provides supporting examples, but we acknowledge that explicit quantitative results, fusion ablations, and single-scale baseline comparisons for SigLIP2 are not presented in the main empirical section. We will add these in the revised version, including tables comparing MuRF against the best single-resolution input for SigLIP2 across representative tasks to directly substantiate the universality claim. revision: yes

Circularity Check

0 steps flagged

No circularity: MuRF is a direct procedural method with empirical support

full rationale

The paper introduces MuRF as a training-free procedure that processes an image at multiple resolutions through a frozen VFM and fuses the resulting features. No equations, derivations, or fitted parameters are presented that reduce a claimed prediction back to its inputs by construction. The universality claim is supported by empirical application across tasks and VFM families rather than by self-citation chains or uniqueness theorems. The method is therefore self-contained as a straightforward enhancement without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the stated domain assumption that different resolutions supply complementary biases; no free parameters or invented entities are mentioned.

axioms (1)
  • domain assumption Varying resolutions offer complementary inductive biases where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement.
    Explicitly stated in the abstract as a fundamental property of visual perception.

pith-pipeline@v0.9.0 · 5508 in / 1071 out tokens · 28854 ms · 2026-05-14T23:56:28.485745+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

  1. [1]

    Paul Couairon, Loick Chambon, Louis Serrano, Jean-Emmanuel Haugeard, Matthieu Cord, and Nicolas Thome. JAFAR: Jack up any feature at any resolution. arXiv preprint arXiv:2506.11136, 2025.

  2. [2]

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

  3. [3]

    Mark Everingham, Luc Van Gool, Christopher K. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010. doi: 10.1007/s11263-009-0275-4.

  4. [4]

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji, Caifeng Shan, and Ran He. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint, 2023.

  5. [5]

    Lars Heckler-Kram, Jan-Hendrik Neudeck, Ulla Scheler, Rebecca König, and Carsten Steger. The MVTec AD 2 dataset: Advanced scenarios for unsupervised anomaly detection. arXiv preprint arXiv:2503.21622, 2025.

  6. [6]

    Xurui Li, Zhonesheng Jiang, Tingxuan Ai, and Yu Zhou. RoBiS: Robust binary segmentation for high-resolution industrial images. arXiv preprint arXiv:2505.21152, 2025.

  7. [7]

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.20.

  8. [8]

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

  9. [9]

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. See also Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.

  10. [10]

    Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, and Trevor Darrell. When do we not need larger vision models? In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision – ECCV 2024, pages 444–462, Cham, 2024. Springer Nature Switzerland. ISBN 978-3-031-73242-3.

  11. [11]

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, and Cordelia Schmid, editors, Computer Vision – ECCV 2012, pages 746–760, Berlin, Heidelberg, 2012. Springer.

  12. [12]

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

  13. [13]

    Huaiyuan Zhang, Hang Chen, Yu Cheng, Shunyi Wu, Linghao Sun, Linao Han, Zeyu Shi, and Lei Qi. SuperAD: A training-free anomaly classification and segmentation method for CVPR 2025 VAND 3.0 workshop challenge track 1: Adapt & detect. arXiv preprint arXiv:2505.19750, 2025.