
arXiv: 2605.07221 · v1 · submitted 2026-05-08 · 💻 cs.CV

DINO-MVR: Multi-View Readout of Frozen DINOv3 for Annotation-Efficient Medical Segmentation

Pith reviewed 2026-05-11 02:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords medical image segmentation · frozen backbone · self-supervised learning · annotation-efficient · multi-view readout · DINOv3 · entropy-weighted fusion

The pith

Frozen DINOv3 features enable accurate medical segmentation with lightweight multi-view readout.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that structural and boundary information useful for medical segmentation is already present in the features of a frozen DINOv3 backbone. The bottleneck is therefore not the backbone itself but the way its features are interpreted. The proposed method attaches lightweight MLP probes only to the final three transformer blocks and combines outputs from multiple image resolutions and test-time augmentations through entropy-weighted fusion, with added Gaussian smoothing for volumetric data. This produces strong results on endoscopy, dermoscopy, and MRI tasks while requiring very few labeled examples, demonstrating that annotation-efficient segmentation is possible without updating the large pretrained model.

Core claim

The central claim is that a frozen DINOv3 backbone, read out via lightweight MLP probes on its final three transformer blocks and entropy-weighted fusion of multi-resolution and augmented predictions, delivers accurate medical segmentations. Reported Dice scores are 0.895 on Kvasir-SEG, 0.897 on ISIC 2018, and 0.908 on BraTS FLAIR whole-tumor segmentation. With only five annotated BraTS cases, the method recovers 98.4% of the performance of the 40-patient reference run.
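
For readers less familiar with the metric, the following is a minimal sketch of the Dice coefficient on binary masks, together with one plausible reading of "recovery" as the ratio of the low-data Dice to the reference Dice. The function names, the empty-mask convention, and the toy masks are assumptions made for illustration; the paper's exact evaluation code is not reproduced here.

```python
def dice(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flat binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def recovery(dice_low_data, dice_reference):
    """Assumed definition: fraction of the reference Dice retained."""
    return dice_low_data / dice_reference


# Toy masks, not data from the paper.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice(pred, target)  # overlap of 2 pixels, mask sizes 3 and 3
```

Under this reading, a recovery of 98.4% means the five-patient Dice divided by the 40-patient Dice equals 0.984.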

What carries the argument

The argument rests on the multi-view readout mechanism, which trains lightweight MLP probes on features from the final three transformer blocks of the frozen backbone and fuses predictions from complementary resolutions and augmentations by entropy-weighted averaging.
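
Entropy-weighted averaging can be sketched as follows. The paper's precise weighting formula is not given in this summary, so the per-pixel weight used here (one minus the normalized binary entropy, plus a small epsilon) is an assumption, as are the function names and the toy probabilities.

```python
import math


def binary_entropy(p, eps=1e-7):
    """Entropy (in nats) of a Bernoulli foreground probability."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def entropy_weighted_fusion(prob_maps, eps=1e-6):
    """Fuse views given as flat lists of per-pixel foreground probabilities."""
    n_pixels = len(prob_maps[0])
    fused = []
    for i in range(n_pixels):
        probs = [view[i] for view in prob_maps]
        # Assumed weighting: confident (low-entropy) views count more.
        weights = [1.0 - binary_entropy(p) / math.log(2.0) + eps for p in probs]
        fused.append(sum(w * p for w, p in zip(weights, probs)) / sum(weights))
    return fused


# Two hypothetical views of a two-pixel image.
views = [[0.95, 0.50], [0.60, 0.90]]
fused = entropy_weighted_fusion(views)
```

In this toy case the fused value at each pixel is pulled toward the more confident view rather than toward the plain mean, which is the intended effect of weighting by entropy.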

If this is right

  • Medical segmentation becomes feasible in settings with very scarce annotations without the cost of fine-tuning large backbones.
  • Volumetric consistency on MRI improves through simple Gaussian smoothing along the z-axis.
  • The same frozen backbone can be applied across different modalities including endoscopy, dermoscopy, and MRI.
  • Lightweight probes enable fast task adaptation when new labeled data arrive.
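
The z-axis smoothing mentioned in the list above can be illustrated with a small pure-Python sketch. The kernel width, the truncation radius, and the edge handling (replicating the first and last slice) are assumptions; the summary does not state the settings the authors used.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_along_z(volume, sigma=1.0):
    """volume[z][i] is the probability of pixel i in slice z; returns a smoothed copy."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    depth = len(volume)
    out = []
    for z in range(depth):
        row = []
        for i in range(len(volume[z])):
            acc = 0.0
            for k, w in zip(range(-radius, radius + 1), kernel):
                zz = min(max(z + k, 0), depth - 1)  # replicate edge slices
                acc += w * volume[zz][i]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical one-pixel volume with an isolated low-probability slice.
volume = [[0.9], [0.1], [0.9], [0.9]]
smoothed = smooth_along_z(volume, sigma=1.0)
```

The isolated dip in the second slice is raised toward its neighbours, which is the kind of inter-slice consistency the smoothing is meant to encourage.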

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same lightweight readout pattern could be tested on other frozen self-supervised models to check whether the multi-view benefit is general.
  • This style of readout might lower the barrier to deploying foundation models in clinical workflows where annotation budgets are limited.
  • Extending the approach to native 3D backbones could further reduce slice-to-slice inconsistencies in CT or MRI.
  • The transfer of natural-image structural cues to medical domains suggests similar readouts may help in other data-scarce scientific imaging tasks.

Load-bearing premise

Frozen DINOv3 features already contain useful structural and boundary cues for medical segmentation.

What would settle it

If single-view readout from only the final block without multi-view fusion or entropy weighting yields Dice scores below 0.75 on Kvasir-SEG or ISIC 2018 under the same fixed protocols, the necessity of the proposed multi-view strategy would be challenged.

Figures

Figures reproduced from arXiv: 2605.07221 by Feng Liu, Hongfu Sun, Nan Ye, Wei Jiang.

Figure 1: DINO-MVR method overview. Frozen DINOv3 features are read out by scale-specific MLP probes and fused ...
Figure 2: Qualitative examples from held-out BraTS FLAIR, ISIC 2018, and Kvasir-SEG samples.
Figure 3: BraTS FLAIR K-patient learning curve. Points show mean performance over patient ...
Figure 4: Additional BraTS FLAIR qualitative examples. Rows are ordered from high-scoring to medium-low cases.
Figure 5: Additional ISIC 2018 qualitative examples. Rows are ordered from high-scoring to ...
Figure 6: Additional Kvasir-SEG qualitative examples. Rows are ordered from high-scoring to ...

Original abstract

Adapting foundation models to medical segmentation typically requires either backbone fine-tuning or high-capacity task-specific decoders, both of which are difficult to fit reliably when annotations are scarce. We show that frozen DINOv3 features already contain useful structural and boundary cues for medical segmentation, and that the main bottleneck lies in how these features are read out. We propose DINO-MVR, a Multi-View Readout framework for annotation-efficient medical segmentation. DINO-MVR trains only lightweight MLP probes on features from the final three transformer blocks of a frozen DINOv3 backbone, without updating the backbone itself. At inference, each input is interpreted through complementary resolutions and test-time augmentations, whose probability maps are combined by entropy-weighted fusion and refined with simple spatial regularization. For volumetric inputs, Gaussian z-axis smoothing further improves inter-slice consistency. Under fixed evaluation protocols on endoscopy, dermoscopy, and MRI benchmarks, DINO-MVR achieves strong readout-only performance, including 0.895 Dice on Kvasir-SEG, 0.897 Dice on ISIC 2018, and 0.908 Dice on BraTS FLAIR whole-tumor segmentation. With only five annotated BraTS patients, it recovers 98.4% of the performance obtained by the 40-patient BraTS reference run. These results suggest that frozen self-supervised vision backbones can support accurate medical segmentation when paired with an effective multi-view readout.
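
The test-time augmentation step described in the abstract can be sketched with a single horizontal flip. Here `toy_predict` is a hypothetical stand-in for a trained probe, and the two views are combined by a plain average for brevity, whereas the paper combines views by entropy-weighted fusion.

```python
def hflip(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def predict_with_flip_tta(predict, image):
    """Average the prediction on the image with the one on its mirror, mapped back."""
    direct = predict(image)
    mirrored = hflip(predict(hflip(image)))  # undo the flip on the output
    return [[(a + b) / 2.0 for a, b in zip(r1, r2)]
            for r1, r2 in zip(direct, mirrored)]


def toy_predict(image):
    """Hypothetical model: position-dependent, so it is not flip-equivariant."""
    return [[min(1.0, px * (0.5 + 0.25 * j)) for j, px in enumerate(row)]
            for row in image]


image = [[0.2, 0.8, 0.4]]
tta = predict_with_flip_tta(toy_predict, image)
```

The key detail is that the augmentation is inverted on the output before averaging, so that every view refers to the same pixel grid.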

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DINO-MVR, which freezes a DINOv3 vision transformer and trains only lightweight MLP probes on features from its final three blocks for medical segmentation. At inference it applies complementary resolutions, test-time augmentations, entropy-weighted fusion of probability maps, and (for volumes) Gaussian z-smoothing, reporting Dice scores of 0.895 on Kvasir-SEG, 0.897 on ISIC 2018, and 0.908 on BraTS FLAIR whole-tumor segmentation while recovering 98.4% of a 40-patient reference performance using only five annotated BraTS cases.

Significance. If the central claim holds, the work demonstrates that frozen self-supervised vision backbones can support accurate medical segmentation with minimal annotation and no backbone updates, which would be valuable for data-scarce clinical settings. The concrete low-data recovery result (98.4% with five patients) is a clear strength that merits attention.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experimental Results): the central claim that 'frozen DINOv3 features already contain useful structural and boundary cues' and that 'the main bottleneck lies in how these features are read out' is load-bearing, yet all reported Dice scores are obtained under the full multi-view inference pipeline (complementary resolutions, test-time augmentations, entropy-weighted fusion, and z-smoothing). No ablation is presented that applies the identical trained MLP probes under single-view, non-augmented inference, leaving open whether the performance is attributable to the frozen features or to the inference-time ensemble.
  2. [§4] §4 (Experimental Results): the manuscript reports absolute Dice numbers and a low-data recovery percentage but provides no baseline comparisons to other frozen-backbone or limited-data segmentation methods, nor any statistical details (standard deviations, significance tests) across runs. This weakens the ability to judge whether the readout-only performance is competitive.
minor comments (2)
  1. [§3] §3 (Method): the description of entropy-weighted fusion and Gaussian z-smoothing is clear, but the precise weighting formula and the choice of which three blocks are used could be stated more explicitly for reproducibility.
  2. [Abstract] Abstract: 'fixed evaluation protocols' is mentioned without reference to the specific train/val/test splits or preprocessing steps used on each benchmark.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the potential value of our low-data recovery results. We address each major comment below and will incorporate the suggested revisions to clarify the contribution of the frozen features.

  1. Referee: [Abstract and §4] Abstract and §4 (Experimental Results): the central claim that 'frozen DINOv3 features already contain useful structural and boundary cues' and that 'the main bottleneck lies in how these features are read out' is load-bearing, yet all reported Dice scores are obtained under the full multi-view inference pipeline (complementary resolutions, test-time augmentations, entropy-weighted fusion, and z-smoothing). No ablation is presented that applies the identical trained MLP probes under single-view, non-augmented inference, leaving open whether the performance is attributable to the frozen features or to the inference-time ensemble.

    Authors: We agree that an explicit single-view ablation would better isolate the contribution of the frozen DINOv3 features. The multi-view components (complementary resolutions, TTA, entropy fusion, and z-smoothing) are an integral part of our proposed readout strategy rather than post-hoc enhancements, but they are applied only at inference after the MLP probes have been trained on single-view features. In the revised manuscript we will add a dedicated ablation table reporting Dice scores for the identical trained probes under single-view, non-augmented inference on all three benchmarks. This will demonstrate that the frozen features already yield competitive structural cues (e.g., >0.82 Dice on Kvasir-SEG and ISIC) while the full pipeline provides the additional robustness reported in the main results.

    revision: yes

  2. Referee: [§4] §4 (Experimental Results): the manuscript reports absolute Dice numbers and a low-data recovery percentage but provides no baseline comparisons to other frozen-backbone or limited-data segmentation methods, nor any statistical details (standard deviations, significance tests) across runs. This weakens the ability to judge whether the readout-only performance is competitive.

    Authors: We acknowledge that the current evaluation lacks external baselines and statistical reporting. In the revision we will expand §4 with (i) comparisons against representative frozen-backbone methods (e.g., linear probing or lightweight decoder heads on DINOv2/SAM features) and limited-data segmentation approaches (e.g., few-shot or semi-supervised baselines) under the same evaluation protocols, and (ii) standard deviations computed over five independent training runs together with paired statistical significance tests (Wilcoxon signed-rank) against the 40-patient reference. These additions will allow readers to assess competitiveness more rigorously while preserving the paper’s focus on readout-only adaptation.

    revision: yes
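
The statistical reporting promised in the second response could look roughly like the sketch below, which summarizes Dice over five runs as a mean and a sample standard deviation. The scores are placeholders invented for illustration and are not results from the paper; the Wilcoxon signed-rank test is omitted because it needs a statistics library outside the standard one.

```python
import statistics

# Placeholder values for illustration only; these are NOT results from the paper.
placeholder_dice_runs = [0.90, 0.91, 0.89, 0.90, 0.90]


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over independent runs."""
    return statistics.mean(scores), statistics.stdev(scores)


mean_dice, std_dice = summarize_runs(placeholder_dice_runs)
print(f"Dice: {mean_dice:.3f} +/- {std_dice:.3f}")
```

Reporting the spread alongside the mean is what would let a reader judge whether a gap between two methods exceeds run-to-run variation.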

Circularity Check

0 steps flagged

No circularity: empirical results on independent benchmarks

The paper contains no equations, derivations, or first-principles predictions. It reports experimental Dice scores on fixed public benchmarks (Kvasir-SEG, ISIC 2018, BraTS) after training only lightweight MLPs on frozen DINOv3 features. No parameter is fitted to the reported metrics themselves, no self-citation chain supports a load-bearing uniqueness claim, and no ansatz or renaming reduces the central result to its inputs by construction. The method description and performance numbers are therefore self-contained against external data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the premise that DINOv3 features transfer to medical images and that the described readout operations are sufficient; the MLPs introduce trainable parameters fitted to the target tasks.

free parameters (1)
  • MLP probe weights
    Lightweight MLPs are trained on extracted features; their parameters are fitted to the segmentation task.
axioms (1)
  • domain assumption: Frozen DINOv3 features already contain useful structural and boundary cues for medical segmentation
    Explicitly stated as the main premise in the abstract.


