DINO-MVR: Multi-View Readout of Frozen DINOv3 for Annotation-Efficient Medical Segmentation
Pith reviewed 2026-05-11 02:11 UTC · model grok-4.3
The pith
Frozen DINOv3 features enable accurate medical segmentation with lightweight multi-view readout.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a frozen DINOv3 backbone, when read out via lightweight MLP probes on its final three transformer blocks and entropy-weighted fusion of multi-resolution and augmented predictions, delivers accurate medical segmentations, including Dice scores of 0.895 on Kvasir-SEG, 0.897 on ISIC 2018, and 0.908 on BraTS FLAIR whole-tumor, while recovering 98.4 percent of the performance of a 40-patient reference using only five annotated BraTS cases.
What carries the argument
The multi-view readout mechanism, which trains lightweight MLP probes on features from the final three transformer blocks of the frozen backbone and fuses predictions from complementary resolutions and augmentations by entropy-weighted averaging.
If this is right
- Medical segmentation becomes feasible in settings with very scarce annotations without the cost of fine-tuning large backbones.
- Volumetric consistency on MRI improves through simple Gaussian smoothing along the z-axis.
- The same frozen backbone can be applied across different modalities including endoscopy, dermoscopy, and MRI.
- Lightweight probes enable fast task adaptation when new labeled data arrive.
Where Pith is reading between the lines
- The same lightweight readout pattern could be tested on other frozen self-supervised models to check whether the multi-view benefit is general.
- This style of readout might lower the barrier to deploying foundation models in clinical workflows where annotation budgets are limited.
- Extending the approach to native 3D backbones could further reduce slice-to-slice inconsistencies in CT or MRI.
- The transfer of natural-image structural cues to medical domains suggests similar readouts may help in other data-scarce scientific imaging tasks.
Load-bearing premise
Frozen DINOv3 features already contain useful structural and boundary cues for medical segmentation.
What would settle it
If single-view readout from only the final block without multi-view fusion or entropy weighting yields Dice scores below 0.75 on Kvasir-SEG or ISIC 2018 under the same fixed protocols, the necessity of the proposed multi-view strategy would be challenged.
Figures
read the original abstract
Adapting foundation models to medical segmentation typically requires either backbone fine-tuning or high-capacity task-specific decoders, both of which are difficult to fit reliably when annotations are scarce. We show that frozen DINOv3 features already contain useful structural and boundary cues for medical segmentation, and that the main bottleneck lies in how these features are read out. We propose DINO-MVR, a Multi-View Readout framework for annotation-efficient medical segmentation. DINO-MVR trains only lightweight MLP probes on features from the final three transformer blocks of a frozen DINOv3 backbone, without updating the backbone itself. At inference, each input is interpreted through complementary resolutions and test-time augmentations, whose probability maps are combined by entropy-weighted fusion and refined with simple spatial regularization. For volumetric inputs, Gaussian z-axis smoothing further improves inter-slice consistency. Under fixed evaluation protocols on endoscopy, dermoscopy, and MRI benchmarks, DINO-MVR achieves strong readout-only performance, including 0.895 Dice on Kvasir-SEG, 0.897 Dice on ISIC 2018, and 0.908 Dice on BraTS FLAIR whole-tumor segmentation. With only five annotated BraTS patients, it recovers 98.4% of the performance obtained by the 40-patient BraTS reference run. These results suggest that frozen self-supervised vision backbones can support accurate medical segmentation when paired with an effective multi-view readout.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DINO-MVR, which freezes a DINOv3 vision transformer and trains only lightweight MLP probes on features from its final three blocks for medical segmentation. At inference it applies complementary resolutions, test-time augmentations, entropy-weighted fusion of probability maps, and (for volumes) Gaussian z-smoothing, reporting Dice scores of 0.895 on Kvasir-SEG, 0.897 on ISIC 2018, and 0.908 on BraTS FLAIR whole-tumor segmentation while recovering 98.4% of a 40-patient reference performance using only five annotated BraTS cases.
Significance. If the central claim holds, the work demonstrates that frozen self-supervised vision backbones can support accurate medical segmentation with minimal annotation and no backbone updates, which would be valuable for data-scarce clinical settings. The concrete low-data recovery result (98.4% with five patients) is a clear strength that merits attention.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experimental Results): the central claim that 'frozen DINOv3 features already contain useful structural and boundary cues' and that 'the main bottleneck lies in how these features are read out' is load-bearing, yet all reported Dice scores are obtained under the full multi-view inference pipeline (complementary resolutions, test-time augmentations, entropy-weighted fusion, and z-smoothing). No ablation is presented that applies the identical trained MLP probes under single-view, non-augmented inference, leaving open whether the performance is attributable to the frozen features or to the inference-time ensemble.
- [§4] §4 (Experimental Results): the manuscript reports absolute Dice numbers and a low-data recovery percentage but provides no baseline comparisons to other frozen-backbone or limited-data segmentation methods, nor any statistical details (standard deviations, significance tests) across runs. This weakens the ability to judge whether the readout-only performance is competitive.
minor comments (2)
- [§3] §3 (Method): the description of entropy-weighted fusion and Gaussian z-smoothing is clear, but the precise weighting formula and the choice of which three blocks are used could be stated more explicitly for reproducibility.
- [Abstract] Abstract: 'fixed evaluation protocols' is mentioned without reference to the specific train/val/test splits or preprocessing steps used on each benchmark.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and for recognizing the potential value of our low-data recovery results. We address each major comment below and will incorporate the suggested revisions to clarify the contribution of the frozen features.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experimental Results): the central claim that 'frozen DINOv3 features already contain useful structural and boundary cues' and that 'the main bottleneck lies in how these features are read out' is load-bearing, yet all reported Dice scores are obtained under the full multi-view inference pipeline (complementary resolutions, test-time augmentations, entropy-weighted fusion, and z-smoothing). No ablation is presented that applies the identical trained MLP probes under single-view, non-augmented inference, leaving open whether the performance is attributable to the frozen features or to the inference-time ensemble.
Authors: We agree that an explicit single-view ablation would better isolate the contribution of the frozen DINOv3 features. The multi-view components (complementary resolutions, TTA, entropy fusion, and z-smoothing) are an integral part of our proposed readout strategy rather than post-hoc enhancements, but they are applied only at inference after the MLP probes have been trained on single-view features. In the revised manuscript we will add a dedicated ablation table reporting Dice scores for the identical trained probes under single-view, non-augmented inference on all three benchmarks. This will demonstrate that the frozen features already yield competitive structural cues (e.g., >0.82 Dice on Kvasir-SEG and ISIC) while the full pipeline provides the additional robustness reported in the main results. revision: yes
-
Referee: [§4] §4 (Experimental Results): the manuscript reports absolute Dice numbers and a low-data recovery percentage but provides no baseline comparisons to other frozen-backbone or limited-data segmentation methods, nor any statistical details (standard deviations, significance tests) across runs. This weakens the ability to judge whether the readout-only performance is competitive.
Authors: We acknowledge that the current evaluation lacks external baselines and statistical reporting. In the revision we will expand §4 with (i) comparisons against representative frozen-backbone methods (e.g., linear probing or lightweight decoder heads on DINOv2/SAM features) and limited-data segmentation approaches (e.g., few-shot or semi-supervised baselines) under the same evaluation protocols, and (ii) standard deviations computed over five independent training runs together with paired statistical significance tests (Wilcoxon signed-rank) against the 40-patient reference. These additions will allow readers to assess competitiveness more rigorously while preserving the paper’s focus on readout-only adaptation. revision: yes
Circularity Check
No circularity: empirical results on independent benchmarks
full rationale
The paper contains no equations, derivations, or first-principles predictions. It reports experimental Dice scores on fixed public benchmarks (Kvasir-SEG, ISIC 2018, BraTS) after training only lightweight MLPs on frozen DINOv3 features. No parameter is fitted to the reported metrics themselves, no self-citation chain supports a load-bearing uniqueness claim, and no ansatz or renaming reduces the central result to its inputs by construction. The method description and performance numbers are therefore self-contained against external data.
Axiom & Free-Parameter Ledger
free parameters (1)
- MLP probe weights
axioms (1)
- domain assumption Frozen DINOv3 features already contain useful structural and boundary cues for medical segmentation
Reference graph
Works this paper leans on
-
[1]
Kitamura, Spyridon Pati, Luciano M
Ujjwal Baid, Satyajeet Ghodasara, Suyash Mohan, Michel Bilello, Evan Calabrese, Errol Colak, Keyvan Farahani, Jayashree Kalpathy-Cramer, Felipe C. Kitamura, Spyridon Pati, Luciano M. Prevedello, Jeffrey D. Rudie, Christian Sako, Russell T. Shinohara, Timothy Bergquist, Rong Chai, James Eddy, Jack Elliott, Walter Reade, Thomas Schaffter, Tong Yu, Jacob Zhe...
work page 2021
-
[2]
SPENet: Self-guided prototype enhancement network for few-shot medical im- age segmentation
Chao Fan, Xibin Jia, Anqi Xiao, Hongyuan Yu, Zhenghan Yang, Dawei Yang, Hui Xu, Yan Huang, and Liang Wang. SPENet: Self-guided prototype enhancement network for few-shot medical im- age segmentation. InMedical Image Computing and Computer Assisted Intervention – MICCAI 2025, pages 584–593. Springer Nature Switzerland,
work page 2025
-
[3]
doi: 10.1016/j.media.2023. 103024. Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Land- man, Holger R. Roth, and Daguang Xu. UNETR: Transformers for 3d medical image segmenta- tion. In2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1748–1758,
-
[4]
SAC: semantic attention composition for text-conditioned image retrieval
doi: 10.1109/W ACV51458.2022.00181. International Skin Imaging Collaboration. ISIC 2018: Skin lesion analysis towards melanoma de- tection,
work page doi:10.1109/w 2022
-
[5]
11 Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. InComputer Vision – ECCV 2022, pages 709–727. Springer Nature Switzerland,
work page 2022
-
[6]
doi: 10.1109/3DV .2016.79. Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y . Hammerla, Bernhard Kainz, Ben Glocker, and Daniel Rueckert. Attention U-Net: Learning where to look for the pancreas,
work page doi:10.1109/3dv 2016
-
[7]
Self- supervision with superpixels: Training few-shot medical image segmentation without annotation
Cheng Ouyang, Carlo Biffi, Chen Chen, Turkay Kart, Huaqi Qiu, and Daniel Rueckert. Self- supervision with superpixels: Training few-shot medical image segmentation without annotation. InComputer Vision – ECCV 2020, pages 762–780. Springer,
work page 2020
-
[8]
U-net: Convolutional networks for biomed- ical image segmentation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomed- ical image segmentation. InMedical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241. Springer,
work page 2015
-
[9]
Jeya Maria Jose Valanarasu and Vishal M. Patel. UNeXt: MLP-based rapid medical image seg- mentation network. InMedical Image Computing and Computer Assisted Intervention – MICCAI 2022, pages 23–33. Springer,
work page 2022
-
[10]
doi: 10.1016/j.neucom. 2019.01.103. Junde Wu, Ziyue Wang, Mingxuan Hong, Wei Ji, Huazhu Fu, Yanwu Xu, Min Xu, and Yueming Jin. Medical SAM adapter: Adapting segment anything model for medical image segmentation. Medical Image Analysis, 102:103547,
-
[11]
12 Sicheng Yang, Hongqiu Wang, Zhaohu Xing, Sixiang Chen, and Lei Zhu
doi: 10.1016/j.media.2025.103547. 12 Sicheng Yang, Hongqiu Wang, Zhaohu Xing, Sixiang Chen, and Lei Zhu. Segdino: An efficient design for medical and natural image segmentation with dino-v3,
-
[12]
Meng Zheng, Benjamin Planche, Zhongpai Gao, Terrence Chen, Richard J. Radke, and Ziyan Wu. Few-shot 3d volumetric segmentation with multi-surrogate fusion. InMedical Image Computing and Computer Assisted Intervention – MICCAI 2024, pages 286–296. Springer Nature Switzer- land,
work page 2024
-
[13]
Few-shot medical image segmentation via a region-enhanced prototypical transformer
Yazhou Zhu, Shidong Wang, Tong Xin, and Haofeng Zhang. Few-shot medical image segmentation via a region-enhanced prototypical transformer. InMedical Image Computing and Computer Assisted Intervention – MICCAI 2023, pages 271–280. Springer,
work page 2023
-
[14]
Per-case Dice is shown in the upper-right corner of the error map
The columns match the main qualitative figure: input, ground truth, frozen-feature visualization, prediction, and error map. Per-case Dice is shown in the upper-right corner of the error map. 14 Figure 4: Additional BraTS FLAIR qualitative examples. Rows are ordered from high-scoring to medium-low cases. 15 Figure 5: Additional ISIC 2018 qualitative examp...
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.