pith. machine review for the scientific record.

arxiv: 2603.25744 · v2 · submitted 2026-03-26 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 23:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-resolution fusion · vision foundation models · feature fusion · multi-scale · training-free · DINOv2 · inference enhancement

The pith

Fusing features from multiple image resolutions creates stronger representations in frozen vision foundation models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that vision foundation models are typically run at a single fixed input resolution at inference, even though different resolutions capture different useful information: low resolutions favor overall meaning, high resolutions favor fine detail. MuRF runs the model several times on the same image at different scales, then fuses the resulting features into one representation. This requires no training and no changes to the model, and it applies to a range of VFMs. Readers should care because it offers a straightforward way to get better results from models they already have, improving tasks such as object recognition and image segmentation.

Core claim

MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. This strategy harnesses the complementary inductive biases of varying resolutions and serves as a training-free enhancement applicable to a broad spectrum of VFMs including DINOv2 and SigLIP2.

What carries the argument

Multi-Resolution Fusion (MuRF), the process of generating and combining feature sets from several input resolutions of the same image using an unchanged vision foundation model.
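
A minimal sketch of that process, assuming a generic frozen encoder that maps an image to a [1, C, h, w] patch-feature map; the encoder call, the 0.5×/1.0×/1.5× scale set, the patch size of 14, and mean fusion are illustrative stand-ins, not the paper's exact configuration.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def murf_features(image, encoder, scales=(0.5, 1.0, 1.5), patch=14):
        """Run a frozen encoder at several input resolutions and fuse the features.

        image:   float tensor [1, 3, H, W] with H and W multiples of `patch`
        encoder: frozen module mapping an image to a patch-feature map [1, C, h, w]
                 (a stand-in for a VFM such as DINOv2; the exact API is assumed)
        """
        _, _, H, W = image.shape
        target = (H // patch, W // patch)          # shared grid = the 1.0x patch grid
        per_scale = []
        for s in scales:
            # Resize the input, snapping each side to a multiple of the patch size.
            h = max(patch, int(round(H * s / patch)) * patch)
            w = max(patch, int(round(W * s / patch)) * patch)
            view = F.interpolate(image, size=(h, w), mode="bilinear", align_corners=False)
            feats = encoder(view)                  # [1, C, h/patch, w/patch]
            # Bring every scale's features onto the shared spatial grid.
            per_scale.append(F.interpolate(feats, size=target,
                                           mode="bilinear", align_corners=False))
        # Fuse across scales; a simple mean is one possible training-free choice.
        return torch.stack(per_scale, dim=0).mean(dim=0)   # [1, C, H/patch, W/patch]

The fused map can then feed whatever lightweight task head (segmentation, depth, retrieval) would otherwise consume single-scale features.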

Load-bearing premise

Features extracted at different resolutions contain complementary information that can be fused into a single, superior representation without losing that information and without retraining the model.

What would settle it

A controlled experiment on a benchmark dataset in which the multi-resolution fused features fail to outperform the best single-resolution baseline would disprove the central claim.
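
Concretely, such a test could reuse the murf_features sketch above and score the fused features against every single-resolution baseline under the same frozen-feature evaluation protocol; probe_fit and probe_score below are assumed placeholder helpers (for example a linear probe), not anything specified in the paper.

    def compare_fused_vs_single(images, labels, encoder, probe_fit, probe_score,
                                scales=(0.5, 1.0, 1.5)):
        """Falsification check: does the fused representation beat the best single scale?

        Builds on the murf_features sketch above; probe_fit / probe_score stand in
        for any frozen-feature evaluation protocol.
        """
        results = {}
        for s in scales:                                   # single-resolution baselines
            feats = [murf_features(x, encoder, scales=(s,)) for x in images]
            results[f"single_{s}x"] = probe_score(probe_fit(feats, labels), feats, labels)
        fused = [murf_features(x, encoder, scales=scales) for x in images]
        results["murf"] = probe_score(probe_fit(fused, labels), fused, labels)
        # The central claim is falsified if results["murf"] fails to exceed
        # the best single-scale score on the same benchmark.
        return results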

Figures

Figures reproduced from arXiv: 2603.25744 by Bocheng Zou, Dingfu Lu, Mark Stanley, Mu Cai, Yong Jae Lee.

Figure 1
Figure 1: The “Recognition vs. Refinement” Dynamic. Feature maps obtained when the input is resized to 266, 518, and 784. At lower resolutions, the representation is globally coherent, enabling robust recognition. At higher resolutions, boundary details are sharper, enabling precise refinement, but the object’s interior becomes noisy, risking incomplete segmentation. Our work is motivated by synergizing these two rol… view at source ↗
Figure 2
Figure 2: Overview of Multi-Resolution Fusion (MuRF). An input image is resized to multiple resolutions and each view is processed by a frozen DINOv2 encoder to produce separate feature maps. These features are upsampled to a shared spatial resolution and fused into a single multi-resolution representation, which can then be used by lightweight task-specific heads for semantic segmentation, depth estimation, visual … view at source ↗
Figure 3
Figure 3: Qualitative comparison of semantic segmentation results on ADE20K (top) and PASCAL VOC (bottom) with different input resolutions. All images are resized to a square shape before being fed into DINOv2, and the subtitle above each image indicates the corresponding input resolution (side length in pixels). view at source ↗
Figure 4
Figure 4: Qualitative depth estimation results on NYUd (left) and SUN RGB-D (right). We compare single-scale DINOv2 predictions at 0.5×, 1.0×, and 1.5× input resolutions with our MuRF fusion. By aggregating multi-resolution features, MuRF better preserves global scene structure while sharpening local geometry, producing smoother and more accurate depth maps. Labels 0.X× indicate that the image fed into DINOv2 is res… view at source ↗
Figure 5
Figure 5: The visualization of anomaly detection on MVTec AD 2 TESTpub dataset. Our merged result (MuRF) successfully combines the robust detection from low-resolution views (e.g., 0.3× correctly identifies the anomaly’s presence but with a coarse mask) and the sharp boundaries from high-resolution views (e.g., 0.7×). view at source ↗
Figure 7
Figure 7: Additional segmentation visualizations from ADE20K and PASCAL VOC. view at source ↗
Figure 8
Figure 8: Additional visualizations of depth estimation results on NYUd and SUN RGB-D. view at source ↗
Figure 9
Figure 9: Additional PCA visualization. view at source ↗
read the original abstract

Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute. It is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families - primarily DINOv2, but also demonstrating successful generalization to contrastive models like SigLIP2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Multi-Resolution Fusion (MuRF), a training-free inference-time method for vision foundation models. It processes an input image at multiple resolutions through a frozen VFM (primarily DINOv2, with extension to SigLIP2), extracts features at each scale, and fuses them into a single unified representation. The central claim is that this exploits complementary inductive biases—global semantics from low resolution and fine details from high resolution—yielding strictly superior performance across tasks without any retraining or architecture-specific changes.

Significance. If the empirical results hold, MuRF would constitute a broadly applicable, zero-cost enhancement to existing VFMs. Its claimed universality across model families and tasks is a notable strength, as it avoids the need for task-specific tuning or additional parameters. This could meaningfully improve representation quality in settings where both coarse and fine visual information matter.

major comments (2)
  1. [Abstract] Abstract: the claim that multi-resolution fusion 'reliably' combines complementary inductive biases into a superior representation is load-bearing, yet the abstract (and the provided manuscript excerpt) gives no concrete description of the fusion operator. Without specifying how spatial-size mismatch is resolved (resizing, interpolation, or aggregation), it is impossible to verify that high-frequency detail is preserved rather than discarded, which bears directly on the stress-test concern raised above.
  2. [Empirical validation] Empirical validation section: the assertion of successful generalization to contrastive models such as SigLIP2 is stated without quantitative results, ablations on the fusion step, or comparison against single-scale baselines. This leaves the universality argument unsupported and prevents assessment of whether the fusion step actually improves over the best single-resolution input.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'unified representation' is used without clarifying whether fusion occurs before or after global pooling, which affects downstream compatibility.
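
To make that minor point concrete, the two possible orderings differ roughly as follows; this is a hedged illustration with mean pooling and mean fusion as stand-ins, not the paper's stated design.

    import torch

    def fuse_then_pool(per_scale_maps):
        """Fuse dense feature maps across scales first, then pool to one vector.
        per_scale_maps: list of [1, C, h, w] tensors already on a shared grid.
        Keeps dense features available for segmentation- or depth-style heads."""
        fused_map = torch.stack(per_scale_maps, dim=0).mean(dim=0)   # [1, C, h, w]
        return fused_map, fused_map.mean(dim=(2, 3))                 # dense map + [1, C]

    def pool_then_fuse(per_scale_maps):
        """Pool each scale to a global vector first, then fuse the vectors.
        Spatial detail is discarded before fusion, so only image-level heads apply."""
        pooled = [m.mean(dim=(2, 3)) for m in per_scale_maps]        # each [1, C]
        return torch.stack(pooled, dim=0).mean(dim=0)                # [1, C]

Dense-prediction heads need the first ordering; image-level tasks could use either, which is why the referee asks the manuscript to state which one MuRF adopts.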

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight opportunities to improve the clarity of our method description and the strength of our empirical claims. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that multi-resolution fusion 'reliably' combines complementary inductive biases into a superior representation is load-bearing, yet the abstract (and the provided manuscript excerpt) gives no concrete description of the fusion operator. Without specifying how spatial-size mismatch is resolved (resizing, interpolation, or aggregation), it is impossible to verify that high-frequency detail is preserved rather than discarded, which bears directly on the stress-test concern raised above.

    Authors: We agree that the abstract would benefit from a concise description of the fusion operator. In the revised manuscript, we will update the abstract to specify that features from different resolutions are aligned via bilinear interpolation to a common spatial size, followed by concatenation and a lightweight aggregation (averaging across scales) that preserves high-frequency details from higher-resolution inputs while incorporating global context from lower resolutions. revision: yes

  2. Referee: [Empirical validation] Empirical validation section: the assertion of successful generalization to contrastive models such as SigLIP2 is stated without quantitative results, ablations on the fusion step, or comparison against single-scale baselines. This leaves the universality argument unsupported and prevents assessment of whether the fusion step actually improves over the best single-resolution input.

    Authors: The manuscript states successful generalization to SigLIP2 and provides supporting examples, but we acknowledge that explicit quantitative results, fusion ablations, and single-scale baseline comparisons for SigLIP2 are not presented in the main empirical section. We will add these in the revised version, including tables comparing MuRF against the best single-resolution input for SigLIP2 across representative tasks to directly substantiate the universality claim. revision: yes

Circularity Check

0 steps flagged

No circularity: MuRF is a direct procedural method with empirical support

full rationale

The paper introduces MuRF as a training-free procedure that processes an image at multiple resolutions through a frozen VFM and fuses the resulting features. No equations, derivations, or fitted parameters are presented that reduce a claimed prediction back to its inputs by construction. The universality claim is supported by empirical application across tasks and VFM families rather than by self-citation chains or uniqueness theorems. The method is therefore self-contained as a straightforward enhancement without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the stated domain assumption that different resolutions supply complementary biases; no free parameters or invented entities are mentioned.

axioms (1)
  • domain assumption Varying resolutions offer complementary inductive biases where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement.
    Explicitly stated in the abstract as a fundamental property of visual perception.

pith-pipeline@v0.9.0 · 5508 in / 1071 out tokens · 28854 ms · 2026-05-14T23:56:28.485745+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

  1. [1]

    Paul Couairon, Loick Chambon, Louis Serrano, Jean-Emmanuel Haugeard, Matthieu Cord, and Nicolas Thome. JAFAR: Jack up any feature at any resolution. arXiv preprint arXiv:2506.11136, 2025.

  2. [2]

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

  3. [3]

    Mark Everingham, Luc Van Gool, Christopher K. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010. doi: 10.1007/s11263-009-0275-4.

  4. [4]

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji, Caifeng Shan, and Ran He. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint, 2023.

  5. [5]

    Lars Heckler-Kram, Jan-Hendrik Neudeck, Ulla Scheler, Rebecca König, and Carsten Steger. The MVTec AD 2 dataset: Advanced scenarios for unsupervised anomaly detection. arXiv preprint arXiv:2503.21622, 2025.

  6. [6]

    Xurui Li, Zhonesheng Jiang, Tingxuan Ai, and Yu Zhou. RoBiS: Robust binary segmentation for high-resolution industrial images. arXiv preprint arXiv:2505.21152, 2025.

  7. [7]

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.20.

  8. [8]

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

  9. [9]

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. See also Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.

  10. [10]

    Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, and Trevor Darrell. When do we not need larger vision models? In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision – ECCV 2024, pages 444–462, Cham, 2024. Springer Nature Switzerland. ISBN 978-3-031-73242-3.

  11. [11]

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, and Cordelia Schmid, editors, Computer Vision – ECCV 2012, pages 746–760, Berlin, Heidelberg, 2012. Springer.

  12. [12]

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

  13. [13]

    Huaiyuan Zhang, Hang Chen, Yu Cheng, Shunyi Wu, Linghao Sun, Linao Han, Zeyu Shi, and Lei Qi. SuperAD: A training-free anomaly classification and segmentation method for CVPR 2025 VAND 3.0 workshop challenge track 1: Adapt & detect. arXiv preprint arXiv:2505.19750, 2025.