Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation
Pith reviewed 2026-05-18 20:18 UTC · model grok-4.3
The pith
Dino U-Net leverages dense features from a frozen DINOv3 foundation model to set new benchmarks in medical image segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Dino U-Net achieves state-of-the-art performance by exploiting the high-fidelity dense features of the DINOv3 vision foundation model in an encoder-decoder architecture. The encoder uses a frozen DINOv3 backbone with a specialized adapter to fuse rich semantic features with low-level spatial details, and the fidelity-aware projection module refines and projects these features for the decoder. This approach is highly scalable, with segmentation accuracy improving as the backbone model size increases up to the 7-billion-parameter variant, and it works across various imaging modalities on seven diverse datasets.
What carries the argument
The fidelity-aware projection module (FAPM), which refines and projects the high-fidelity dense features from the DINOv3 backbone to the decoder while preserving their quality during dimensionality reduction.
If this is right
- Segmentation performance scales positively with increasing backbone size up to 7 billion parameters.
- The method outperforms previous approaches on seven diverse medical image datasets across modalities.
- It provides a parameter-efficient solution by keeping the foundation model frozen.
- Transfer of natural image features to medical segmentation is effective with the proposed adapter and projection.
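The parameter-efficiency point above comes down to simple arithmetic: only the modules outside the frozen backbone receive gradient updates. The sketch below computes that trainable share. The 7-billion figure is the backbone size named in the text; the trainable count is a hypothetical placeholder, not a number from the paper.

```python
def trainable_fraction(frozen_params: int, trainable_params: int) -> float:
    """Return the share of all parameters that receive gradient updates."""
    total = frozen_params + trainable_params
    if total <= 0:
        raise ValueError("total parameter count must be positive")
    return trainable_params / total


# The backbone size comes from the text (7-billion-parameter variant).
FROZEN_BACKBONE = 7_000_000_000
# Hypothetical size of adapter + projection module + decoder; an assumption
# made purely for illustration, not a figure reported by the authors.
TRAINABLE_MODULES = 50_000_000

fraction = trainable_fraction(FROZEN_BACKBONE, TRAINABLE_MODULES)
print(f"trainable share of all parameters: {fraction:.2%}")
```

Under these assumed counts the trainable share stays below one percent, which is the sense in which a frozen-backbone design can be called parameter-efficient even when the backbone itself is very large.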
Where Pith is reading between the lines
- If the transfer works well here, similar adapters might allow foundation models to boost other medical imaging tasks like classification or detection.
- Testing on private clinical datasets would check if the gains hold in real-world hospital settings.
- Exploring even larger or differently trained foundation models could reveal further accuracy improvements.
Load-bearing premise
The high-fidelity features from DINOv3 pre-trained on natural images transfer effectively to medical images without major loss of clinical relevance or introduction of artifacts.
What would settle it
Running Dino U-Net on an additional medical segmentation dataset where it fails to match or exceed the accuracy of the current best method, or where larger models do not yield better results, would challenge the central claim.
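The two falsification conditions above can be written as a small check. This is a minimal sketch, assuming one scalar score per backbone size on a single dataset; the function name, the input layout, and the choice to compare the best Dino U-Net score against the prior best are all illustrative assumptions.

```python
def claim_is_challenged(scores_by_size, best_prior_score):
    """Check the two conditions that would challenge the central claim.

    scores_by_size: list of (parameter_count, score) pairs for Dino U-Net
        backbones evaluated on one dataset (hypothetical layout).
    best_prior_score: score of the current best competing method.
    """
    ordered = sorted(scores_by_size)              # smallest backbone first
    scores = [score for _, score in ordered]

    # Condition 1: even the best variant fails to match the prior best.
    fails_to_match = max(scores) < best_prior_score

    # Condition 2: some larger backbone does not improve on a smaller one.
    no_scaling_gain = any(
        later <= earlier for earlier, later in zip(scores, scores[1:])
    )
    return fails_to_match or no_scaling_gain
```

A call such as `claim_is_challenged([(1, 0.80), (2, 0.85), (3, 0.90)], 0.88)` returns `False`, because the largest variant exceeds the prior best and every step up in size improves the score. The input values here are made up for the example.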
Original abstract
Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. However, effectively transferring their learned representations for precise clinical applications remains a challenge. In this work, we propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model's rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM) that effectively refines and projects the features for the decoder. We conducted extensive experiments on seven diverse public medical image segmentation datasets. Our results show that Dino U-Net achieves state-of-the-art performance, consistently outperforming previous methods across various imaging modalities. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases up to the 7-billion-parameter variant. The findings demonstrate that leveraging the superior, dense-pretrained features from a general-purpose foundation model provides a highly effective and parameter-efficient approach to advance the accuracy of medical image segmentation. The code is available at https://github.com/yifangao112/DinoUNet.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Dino U-Net, an encoder-decoder architecture that employs a frozen DINOv3 vision foundation model as backbone for medical image segmentation. It adds a specialized adapter to fuse rich semantic features with low-level spatial details and a fidelity-aware projection module (FAPM) to refine features during dimensionality reduction for the decoder. Experiments on seven public datasets across modalities report state-of-the-art Dice/IoU scores that consistently exceed prior methods, with monotonic accuracy gains as backbone size scales to the 7-billion-parameter variant. Code is released publicly.
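For readers unfamiliar with the two metrics named in the summary, the sketch below computes Dice and IoU for a pair of binary masks. It uses the standard textbook definitions; the paper's exact evaluation code (averaging across classes, handling of empty masks) is not described here, so the empty-mask convention in the code is an assumption.

```python
def dice_and_iou(pred, target):
    """Compute Dice and IoU for two equally sized 2D binary masks (lists of 0/1)."""
    p = [value for row in pred for value in row]
    t = [value for row in target for value in row]
    if len(p) != len(t):
        raise ValueError("masks must contain the same number of pixels")

    intersection = sum(1 for a, b in zip(p, t) if a and b)
    pred_area = sum(1 for a in p if a)
    target_area = sum(1 for b in t if b)
    union = pred_area + target_area - intersection

    # Assumed convention: two empty masks count as a perfect match.
    if pred_area + target_area == 0:
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy masks for illustration only.
prediction = [[1, 1, 0],
              [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
dice, iou = dice_and_iou(prediction, ground_truth)
print(f"Dice = {dice:.3f}, IoU = {iou:.3f}")
```

The two masks overlap in two pixels and each covers three, so Dice is 4/6 and IoU is 2/4. Dice is never lower than IoU for the same pair of masks, which is worth keeping in mind when comparing numbers across papers that report different metrics.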
Significance. If the empirical results hold under rigorous controls, the work is significant because it supplies concrete evidence that dense features from large-scale natural-image foundation models can be transferred to medical segmentation in a parameter-efficient way, with clear scaling behavior. This reduces reliance on domain-specific pretraining and offers a practical route to leverage future foundation-model advances in clinical imaging. Public code release further strengthens reproducibility and downstream impact.
major comments (2)
- [§4.3 and Table 3] §4.3 (Ablation Studies) and Table 3: the reported gains from the adapter + FAPM are shown only against smaller or non-foundation baselines; no control experiment compares against a same-size randomly initialized or medical-pretrained encoder of comparable capacity. Without this, the central claim that performance stems from preserved high-fidelity DINOv3 features rather than raw capacity cannot be isolated.
- [§3.3 and §4.4] §3.3 (FAPM description) and §4.4 (feature analysis): the paper asserts that FAPM 'preserves the quality of these representations' yet supplies no quantitative diagnostic (cosine similarity, reconstruction error, or modality-specific feature distance) between pre- and post-projection DINO features on grayscale medical inputs. This diagnostic is load-bearing for the domain-shift argument.
minor comments (2)
- [Figure 2] Figure 2: the architecture diagram would be clearer if the adapter and FAPM blocks were annotated with exact tensor dimensions and the fusion operation (concatenation, addition, or attention) were labeled explicitly.
- [§2] §2 (Related Work): several recent medical foundation-model papers (e.g., MedSAM, SAM-Med2D) are cited only briefly; a short paragraph contrasting the frozen-backbone + adapter strategy with full fine-tuning approaches would sharpen novelty.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below and indicate where revisions will be made to strengthen the paper.
- Referee: [§4.3 and Table 3] §4.3 (Ablation Studies) and Table 3: the reported gains from the adapter + FAPM are shown only against smaller or non-foundation baselines; no control experiment compares against a same-size randomly initialized or medical-pretrained encoder of comparable capacity. Without this, the central claim that performance stems from preserved high-fidelity DINOv3 features rather than raw capacity cannot be isolated.
  Authors: We agree that a control experiment using a randomly initialized encoder (or medical-pretrained encoder) of comparable parameter count would more cleanly isolate the benefit of the frozen DINOv3 features from raw model capacity. Our existing ablations in §4.3 demonstrate the incremental value of the adapter and FAPM, and the scaling results show monotonic gains as DINOv3 size increases to 7B parameters. In the revised manuscript we will add a new ablation row comparing Dino U-Net against an otherwise identical architecture with a randomly initialized ViT backbone of matching size, using the same training protocol. We will also briefly discuss available medical-pretrained baselines of similar scale.
  Revision: yes
- Referee: [§3.3 and §4.4] §3.3 (FAPM description) and §4.4 (feature analysis): the paper asserts that FAPM 'preserves the quality of these representations' yet supplies no quantitative diagnostic (cosine similarity, reconstruction error, or modality-specific feature distance) between pre- and post-projection DINO features on grayscale medical inputs. This diagnostic is load-bearing for the domain-shift argument.
  Authors: We acknowledge that the current §4.4 provides only qualitative visualizations. To directly support the claim that FAPM preserves high-fidelity DINOv3 representations under domain shift, we will add quantitative diagnostics in the revision: cosine similarity and feature reconstruction error computed between the original DINOv3 dense features and the FAPM-projected features on held-out samples from the grayscale medical datasets. These metrics will be reported alongside the existing visualizations.
  Revision: yes
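One way the requested diagnostic could look is sketched below. Because the projection reduces dimensionality, pre- and post-projection vectors cannot be compared with a direct cosine similarity without an extra mapping. This sketch therefore swaps in a related technique: it compares the patch-to-patch cosine similarity structure before and after projection. It is an illustrative assumption, not the authors' planned metric, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_cosine(features):
    """Cosine similarity for every unordered pair of patch feature vectors."""
    n = len(features)
    return [cosine(features[i], features[j])
            for i in range(n) for j in range(i + 1, n)]


def structure_gap(pre, post):
    """Mean absolute gap between patch-to-patch similarities before and after projection.

    pre and post hold one vector per patch, in the same patch order; their
    feature dimensions may differ. A gap of 0 means the projection left the
    similarity structure between patches unchanged.
    """
    before = pairwise_cosine(pre)
    after = pairwise_cosine(post)
    if len(before) != len(after) or not before:
        raise ValueError("need the same number of patches (at least two)")
    return sum(abs(x - y) for x, y in zip(before, after)) / len(before)


# Toy features for illustration only: three patches, 4-D before, 2-D after.
pre_features = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
good_projection = [[1, 0], [0, 1], [1, 1]]   # keeps relative geometry
collapsed_projection = [[1, 0], [1, 0], [1, 0]]  # maps every patch to one direction

print(structure_gap(pre_features, good_projection))
print(structure_gap(pre_features, collapsed_projection))
```

A low gap on held-out medical images would support the claim that the projection preserves feature quality, while a high gap would indicate that distinctions between patches are being lost. A reconstruction-error diagnostic, which the authors also mention, would need a learned or fitted inverse map and is not covered by this sketch.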
Circularity Check
No circularity: empirical architecture validated on external public datasets
full rationale
The paper proposes Dino U-Net as an encoder-decoder using a frozen DINOv3 backbone plus adapter and FAPM, then reports measured Dice/IoU gains on seven independent public medical segmentation datasets plus monotonic scaling with backbone size up to 7B parameters. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted inputs, self-definitions, or self-citation chains. All central claims rest on direct experimental outcomes against external benchmarks rather than any internal equivalence or renaming of known results.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model's rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM)
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Dino U-Net achieves state-of-the-art performance... with accuracy improving as backbone size increases to 7 billion parameters
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Dino-NestedUNet: Unlocking Foundation Vision Encoders for Pathology Tumor Bulk Segmentation via Dense Decoding
Dino-NestedUNet improves pathology tumor segmentation by coupling DINOv3 encoders with dense nested decoding, showing gains over UNet++ and Dino-UNet baselines across multiple cohorts including zero-shot tests.