
arxiv: 2508.20909 · v2 · submitted 2025-08-28 · 💻 cs.CV · eess.IV

Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation

Pith reviewed 2026-05-18 20:18 UTC · model grok-4.3

classification 💻 cs.CV eess.IV
keywords medical image segmentation · foundation models · DINO · U-Net · feature adaptation · dense features · transfer learning

The pith

Dino U-Net uses dense features from a frozen DINOv3 foundation model to report state-of-the-art results on seven public medical image segmentation datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Dino U-Net, which builds a U-Net around a frozen DINOv3 vision model to use its high-quality dense features for segmenting medical images. It adds an adapter to combine semantic and spatial details and a fidelity-aware projection module to keep those features intact when reducing dimensions for the decoder. Experiments on seven public datasets across different modalities show it outperforms earlier approaches, and performance gets better as the backbone grows to seven billion parameters. A sympathetic reader would care because this offers a way to improve clinical imaging tools by reusing powerful general-purpose models without retraining everything from scratch.

Core claim

Dino U-Net achieves state-of-the-art performance by exploiting the high-fidelity dense features of the DINOv3 vision foundation model in an encoder-decoder architecture. The encoder uses a frozen DINOv3 backbone with a specialized adapter to fuse rich semantic features with low-level spatial details, and the fidelity-aware projection module refines and projects these features for the decoder. This approach is highly scalable, with segmentation accuracy improving as the backbone model size increases up to the 7-billion-parameter variant, and it works across various imaging modalities on seven diverse datasets.

What carries the argument

The argument rests on the fidelity-aware projection module (FAPM), which refines the high-fidelity dense features from the DINOv3 backbone and projects them to the decoder while preserving their quality during dimensionality reduction.

If this is right

  • Segmentation performance scales positively with increasing backbone size up to 7 billion parameters.
  • The method outperforms previous approaches on seven diverse medical image datasets across modalities.
  • It provides a parameter-efficient solution by keeping the foundation model frozen.
  • Transfer of natural image features to medical segmentation is effective with the proposed adapter and projection.
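The parameter-efficiency point can be made concrete with simple arithmetic. The sketch below computes the share of parameters that would receive gradient updates when the backbone is frozen. The 7-billion backbone size comes from the paper; the 50-million figure for the adapter, FAPM, and decoder is a made-up placeholder, since this page does not state their sizes, and `trainable_fraction` is a hypothetical helper name.

```python
def trainable_fraction(frozen_params: int, trainable_params: int) -> float:
    """Share of all parameters that receive gradient updates."""
    total = frozen_params + trainable_params
    if total == 0:
        raise ValueError("model has no parameters")
    return trainable_params / total


# Backbone scale as named in the paper (largest variant).
backbone_params = 7_000_000_000
# ASSUMPTION: placeholder budget for adapter + FAPM + decoder, not a reported number.
added_module_params = 50_000_000

fraction = trainable_fraction(backbone_params, added_module_params)
print(f"trainable share: {fraction:.4%}")
```

Under these assumed sizes, well under one percent of the weights would be trained; the real ratio depends on the module sizes reported in the paper.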

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the transfer works well here, similar adapters might allow foundation models to boost other medical imaging tasks like classification or detection.
  • Testing on private clinical datasets would check if the gains hold in real-world hospital settings.
  • Exploring even larger or differently trained foundation models could reveal further accuracy improvements.

Load-bearing premise

The high-fidelity features from DINOv3 pre-trained on natural images transfer effectively to medical images without major loss of clinical relevance or introduction of artifacts.

What would settle it

Running Dino U-Net on an additional medical segmentation dataset where it fails to match or exceed the accuracy of the current best method, or where larger models do not yield better results, would challenge the central claim.
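The falsification test above can be written down as a small check. This is only a sketch of the logic, assuming one scalar score per backbone scale (ordered from smallest to largest) and one score for the best prior method on the same dataset. The function names and the example numbers are hypothetical placeholders, not results from the paper.

```python
from typing import Sequence


def is_non_decreasing(scores: Sequence[float], tol: float = 0.0) -> bool:
    """True if every score is at least the previous one, within a tolerance."""
    return all(b >= a - tol for a, b in zip(scores, scores[1:]))


def challenges_claim(scores_by_scale: Sequence[float],
                     best_prior: float,
                     tol: float = 0.0) -> bool:
    """True if scaling is not monotonic, or the largest variant
    falls below the best prior method on this dataset."""
    if not scores_by_scale:
        raise ValueError("need at least one score")
    not_monotonic = not is_non_decreasing(scores_by_scale, tol)
    below_prior = scores_by_scale[-1] < best_prior - tol
    return not_monotonic or below_prior


# PLACEHOLDER numbers for a hypothetical new dataset (scales S, B, L, 7B).
print(challenges_claim([0.80, 0.82, 0.83, 0.85], best_prior=0.84))
```

A nonzero `tol` would let small run-to-run noise pass without counting as a failure; how large it should be is a judgment call the paper's variance estimates would have to inform.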

Figures

Figures reproduced from arXiv: 2508.20909 by Feng Yuan, Haoyue Li, Xiaosong Wang, Xin Gao, Yifan Gao.

Figure 2: The detailed architecture of our proposed FAPM. The module operates …
Figure 3: Qualitative comparison of segmentation results on representative samples from the seven evaluated datasets. Each column displays a different method, …
Figure 4: Ablation study on the proposed FAPM. (a) Parameter comparison between models with and without FAPM across different scales (S, B, L, 7B). …
Original abstract

Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. However, effectively transferring their learned representations for precise clinical applications remains a challenge. In this work, we propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model's rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM) that effectively refines and projects the features for the decoder. We conducted extensive experiments on seven diverse public medical image segmentation datasets. Our results show that Dino U-Net achieves state-of-the-art performance, consistently outperforming previous methods across various imaging modalities. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases up to the 7-billion-parameter variant. The findings demonstrate that leveraging the superior, dense-pretrained features from a general-purpose foundation model provides a highly effective and parameter-efficient approach to advance the accuracy of medical image segmentation. The code is available at https://github.com/yifangao112/DinoUNet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Dino U-Net, an encoder-decoder architecture that employs a frozen DINOv3 vision foundation model as backbone for medical image segmentation. It adds a specialized adapter to fuse rich semantic features with low-level spatial details and a fidelity-aware projection module (FAPM) to refine features during dimensionality reduction for the decoder. Experiments on seven public datasets across modalities report state-of-the-art Dice/IoU scores that consistently exceed prior methods, with monotonic accuracy gains as backbone size scales to the 7-billion-parameter variant. Code is released publicly.

Significance. If the empirical results hold under rigorous controls, the work is significant because it supplies concrete evidence that dense features from large-scale natural-image foundation models can be transferred to medical segmentation in a parameter-efficient way, with clear scaling behavior. This reduces reliance on domain-specific pretraining and offers a practical route to leverage future foundation-model advances in clinical imaging. Public code release further strengthens reproducibility and downstream impact.

major comments (2)
  1. [§4.3 and Table 3] §4.3 (Ablation Studies) and Table 3: the reported gains from the adapter + FAPM are shown only against smaller or non-foundation baselines; no control experiment compares against a same-size randomly initialized or medical-pretrained encoder of comparable capacity. Without this, the central claim that performance stems from preserved high-fidelity DINOv3 features rather than raw capacity cannot be isolated.
  2. [§3.3 and §4.4] §3.3 (FAPM description) and §4.4 (feature analysis): the paper asserts that FAPM 'preserves the quality of these representations' yet supplies no quantitative diagnostic (cosine similarity, reconstruction error, or modality-specific feature distance) between pre- and post-projection DINO features on grayscale medical inputs. This diagnostic is load-bearing for the domain-shift argument.
minor comments (2)
  1. [Figure 2] Figure 2: the architecture diagram would be clearer if the adapter and FAPM blocks were annotated with exact tensor dimensions and the fusion operation (concatenation, addition, or attention) were labeled explicitly.
  2. [§2] §2 (Related Work): several recent medical foundation-model papers (e.g., MedSAM, SAM-Med2D) are cited only briefly; a short paragraph contrasting the frozen-backbone + adapter strategy with full fine-tuning approaches would sharpen novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below and indicate where revisions will be made to strengthen the paper.

Point-by-point responses
  1. Referee: [§4.3 and Table 3] §4.3 (Ablation Studies) and Table 3: the reported gains from the adapter + FAPM are shown only against smaller or non-foundation baselines; no control experiment compares against a same-size randomly initialized or medical-pretrained encoder of comparable capacity. Without this, the central claim that performance stems from preserved high-fidelity DINOv3 features rather than raw capacity cannot be isolated.

    Authors: We agree that a control experiment using a randomly initialized encoder (or medical-pretrained encoder) of comparable parameter count would more cleanly isolate the benefit of the frozen DINOv3 features from raw model capacity. Our existing ablations in §4.3 demonstrate the incremental value of the adapter and FAPM, and the scaling results show monotonic gains as DINOv3 size increases to 7B parameters. In the revised manuscript we will add a new ablation row comparing Dino U-Net against an otherwise identical architecture with a randomly initialized ViT backbone of matching size, using the same training protocol. We will also briefly discuss available medical-pretrained baselines of similar scale. revision: yes

  2. Referee: [§3.3 and §4.4] §3.3 (FAPM description) and §4.4 (feature analysis): the paper asserts that FAPM 'preserves the quality of these representations' yet supplies no quantitative diagnostic (cosine similarity, reconstruction error, or modality-specific feature distance) between pre- and post-projection DINO features on grayscale medical inputs. This diagnostic is load-bearing for the domain-shift argument.

    Authors: We acknowledge that the current §4.4 provides only qualitative visualizations. To directly support the claim that FAPM preserves high-fidelity DINOv3 representations under domain shift, we will add quantitative diagnostics in the revision: cosine similarity and feature reconstruction error computed between the original DINOv3 dense features and the FAPM-projected features on held-out samples from the grayscale medical datasets. These metrics will be reported alongside the existing visualizations. revision: yes
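One way such a diagnostic could be set up is sketched below. Because the projection changes the feature width, the pre- and post-projection vectors cannot be compared directly, so this sketch compares the pairwise cosine-similarity structure among patches before and after projection. This is an assumption about how the comparison might be done, not the procedure the authors describe; `structure_gap` and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def structure_gap(pre: List[Sequence[float]], post: List[Sequence[float]]) -> float:
    """Mean absolute difference between pairwise patch similarities
    computed before and after projection. 0.0 means the similarity
    structure is fully preserved; larger values mean more distortion."""
    if len(pre) != len(post):
        raise ValueError("pre and post must describe the same patches")
    n = len(pre)
    if n < 2:
        raise ValueError("need at least two patches")
    diffs = [
        abs(cosine(pre[i], pre[j]) - cosine(post[i], post[j]))
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return sum(diffs) / len(diffs)


# Toy patch features: three patches, width 3 before and width 2 after projection.
pre_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
post_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(structure_gap(pre_features, post_features))
```

A reconstruction-error variant would additionally need a learned map from the projected space back to the original width, which this sketch deliberately avoids.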

Circularity Check

0 steps flagged

No circularity: empirical architecture validated on external public datasets

full rationale

The paper proposes Dino U-Net as an encoder-decoder using a frozen DINOv3 backbone plus adapter and FAPM, then reports measured Dice/IoU gains on seven independent public medical segmentation datasets plus monotonic scaling with backbone size up to 7B parameters. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted inputs, self-definitions, or self-citation chains. All central claims rest on direct experimental outcomes against external benchmarks rather than any internal equivalence or renaming of known results.
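For readers less familiar with the two reported metrics, the sketch below computes Dice and IoU for a pair of flattened binary masks. It is a minimal illustration of the standard definitions, not the paper's evaluation code; returning 1.0 when both masks are empty is one common convention, chosen here as an assumption.

```python
from typing import Sequence, Tuple


def dice_and_iou(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """Dice coefficient and intersection-over-union for flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # ASSUMPTION: two empty masks count as a perfect match.
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Four-pixel toy masks: one overlapping pixel, two foreground pixels each.
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
print(dice, iou)
```

The two metrics are monotonically related (Dice = 2·IoU / (1 + IoU)), so rankings on a single image agree, though averages across a dataset can differ.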

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces two new modules (adapter and FAPM) whose internal design choices and any hyperparameters are not detailed in the abstract; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5774 in / 1085 out tokens · 30891 ms · 2026-05-18T20:18:38.717507+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Dino-NestedUNet: Unlocking Foundation Vision Encoders for Pathology Tumor Bulk Segmentation via Dense Decoding

    cs.CV 2026-04 unverdicted novelty 6.0

    Dino-NestedUNet improves pathology tumor segmentation by coupling DINOv3 encoders with dense nested decoding, showing gains over UNet++ and Dino-UNet baselines across multiple cohorts including zero-shot tests.
