Adapting Foundation Models for Annotation-Efficient Adnexal Mass Segmentation in Cine Images
Pith reviewed 2026-05-10 18:00 UTC · model grok-4.3
The pith
Adapting a pretrained vision transformer backbone yields accurate adnexal mass segmentation in ultrasound cine images even with limited annotations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A vision transformer backbone that carries semantic priors from pretraining, when combined with a decoder that hierarchically reassembles multi-scale features, produces state-of-the-art segmentation of adnexal masses on clinical ultrasound cine data and maintains high performance under substantial reductions in labeled training examples.
What carries the argument
A pretrained vision transformer backbone supplying global semantic representations, integrated with a decoder that reassembles multi-scale features into dense pixel predictions.
Load-bearing premise
The semantic understanding acquired from pretraining on ordinary images transfers sufficiently well to the distinct visual statistics of ultrasound frames.
What would settle it
A new test collection drawn from different ultrasound machines or patient populations yielding Dice scores or boundary errors markedly worse than those on the original 112-patient set, or a steep performance collapse when the training data is reduced below 25 percent.
Original abstract
Adnexal mass evaluation via ultrasound is a challenging clinical task, often hindered by subjective interpretation and significant inter-observer variability. While automated segmentation is a foundational step for quantitative risk assessment, traditional fully supervised convolutional architectures frequently require large amounts of pixel-level annotations and struggle with domain shifts common in medical imaging. In this work, we propose a label-efficient segmentation framework that leverages the robust semantic priors of a pretrained DINOv3 foundational vision transformer backbone. By integrating this backbone with a Dense Prediction Transformer (DPT)-style decoder, our model hierarchically reassembles multi-scale features to combine global semantic representations with fine-grained spatial details. Evaluated on a clinical dataset of 7,777 annotated frames from 112 patients, our method achieves state-of-the-art performance compared to established fully supervised baselines, including U-Net, U-Net++, DeepLabV3, and MAnet. Specifically, we obtain a Dice score of 0.945 and improved boundary adherence, reducing the 95th-percentile Hausdorff Distance by 11.4% relative to the strongest convolutional baseline. Furthermore, we conduct an extensive efficiency analysis demonstrating that our DINOv3-based approach retains significantly higher performance under data starvation regimes, maintaining strong results even when trained on only 25% of the data. These results suggest that leveraging large-scale self-supervised foundations provides a promising and data-efficient solution for medical image segmentation in data-constrained clinical environments. Project Repository: https://github.com/FrancescaFati/MESA
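The two headline metrics in the abstract, the Dice score and the 95th-percentile Hausdorff distance (HD95), can be illustrated with a minimal pure-Python sketch. This is not the authors' evaluation code. The function names, the 4-neighbour boundary extraction, and the nearest-rank percentile are assumptions made for illustration; libraries differ in how they define the percentile and in whether they pool the two directed distance sets.

```python
import math


def dice_score(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for binary masks given as lists of 0/1 rows."""
    inter = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 if total == 0 else 2.0 * inter / total


def boundary(mask):
    """Foreground pixels with at least one 4-neighbour that is background or off-image."""
    h, w = len(mask), len(mask[0])
    pts = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            nbrs = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            if any(not (0 <= y < h and 0 <= x < w) or not mask[y][x] for y, x in nbrs):
                pts.append((r, c))
    return pts


def percentile_nearest_rank(values, q):
    """Nearest-rank percentile (an assumed convention; others interpolate)."""
    ordered = sorted(values)
    k = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[k - 1]


def hd95(pred, target):
    """Max over both directions of the 95th-percentile boundary-to-boundary distance (pixels)."""
    a, b = boundary(pred), boundary(target)
    if not a or not b:
        return math.inf  # undefined when one mask is empty; inf is an assumed choice
    d_ab = [min(math.dist(p, q) for q in b) for p in a]
    d_ba = [min(math.dist(q, p) for p in a) for q in b]
    return max(percentile_nearest_rank(d_ab, 95), percentile_nearest_rank(d_ba, 95))


def relative_reduction(baseline, ours):
    """Relative improvement of a lower-is-better metric, e.g. 0.114 means 11.4%."""
    return (baseline - ours) / baseline
```

A relative HD95 reduction such as the reported 11.4% is a ratio of the kind computed by `relative_reduction`, so its absolute size in pixels or millimetres depends on the baseline value, which the abstract does not state.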
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a label-efficient segmentation framework for adnexal masses in ultrasound cine images. It adapts a pretrained DINOv3 vision transformer backbone with a Dense Prediction Transformer (DPT)-style decoder to hierarchically reassemble multi-scale features. Evaluated on a clinical dataset of 7,777 annotated frames from 112 patients, the method claims state-of-the-art performance over fully supervised baselines (U-Net, U-Net++, DeepLabV3, MAnet), with a Dice score of 0.945, an 11.4% reduction in 95th-percentile Hausdorff distance relative to the strongest convolutional baseline, and retained strong performance when trained on only 25% of the data. A public GitHub repository is provided.
Significance. If the reported results hold under proper controls, this work is significant for demonstrating that large-scale self-supervised foundation models can substantially reduce annotation requirements for medical image segmentation in data-scarce clinical domains like ultrasound. The empirical comparisons to established baselines and the explicit focus on data-starvation regimes provide concrete evidence of practical utility. Credit is given for releasing reproducible code via the linked repository, which enables verification of the DINOv3 + DPT integration and efficiency experiments.
Major comments (2)
- [Experimental setup and results (low-data regime)] The data-efficiency claims (abstract and §4) are load-bearing for the paper's central contribution, yet the protocol for forming the 25% training subsets is not described. With 7,777 frames from only 112 patients, frame-level random subsampling would likely introduce patient-level and temporal leakage (adjacent cine frames from the same patient appearing in both train and test), artificially inflating robustness metrics. The manuscript must explicitly state whether splits are patient-wise, report the number of random seeds, and include performance variance or statistical tests to substantiate the 25%-data retention results.
- [Results section, Table 2] Table 2 (or equivalent results table) reports aggregate Dice and HD95 without per-patient breakdowns or cross-validation folds. Given the modest patient count (112), it is unclear whether the 0.945 Dice and 11.4% HD improvement generalize across patients or are driven by a few easy cases; patient-wise metrics and a proper k-fold patient-split protocol are needed to support the generalization claims.
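The leakage concern in the first major comment can be made concrete with a minimal sketch of a patient-wise split, assuming each frame carries a patient identifier. The function names are hypothetical, and the sketch assumes that the 25% regime keeps a quarter of the training patients with all of their frames; the manuscript does not say whether the reduction is by patient or by frame.

```python
import random


def patient_wise_split(frame_patient_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) frame indices so that no patient spans both sets."""
    patients = sorted(set(frame_patient_ids))
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, round(test_fraction * len(patients)))
    test_patients = set(patients[:n_test])
    train_idx = [i for i, p in enumerate(frame_patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(frame_patient_ids) if p in test_patients]
    return train_idx, test_idx


def subsample_training_patients(frame_patient_ids, train_idx, keep_fraction=0.25, seed=0):
    """Keep every frame of a random subset of training patients (assumed protocol)."""
    train_patients = sorted({frame_patient_ids[i] for i in train_idx})
    rng = random.Random(seed)
    n_keep = max(1, round(keep_fraction * len(train_patients)))
    kept = set(rng.sample(train_patients, n_keep))
    return [i for i in train_idx if frame_patient_ids[i] in kept]
```

Splitting on patients rather than frames also removes temporal leakage, because adjacent cine frames always share a patient identifier and therefore land in the same partition.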
Minor comments (3)
- [Abstract] The abstract states concrete metrics (Dice 0.945, 11.4% HD reduction) without accompanying standard deviations or p-values from statistical tests against baselines; adding these would improve clarity.
- [Method section 3.2] Notation for the DPT decoder integration (e.g., how multi-scale features from DINOv3 are reassembled) could be clarified with a diagram or explicit equations in §3.2 to aid reproducibility.
- [Reproducibility statement] The GitHub link is provided but the manuscript does not reference specific commit hashes or exact training hyperparameters used for the reported numbers.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The concerns regarding data splitting protocols and generalization metrics are well-taken, and we address each point below with plans for revision.
Point-by-point responses
-
Referee: [Experimental setup and results (low-data regime)] The data-efficiency claims (abstract and §4) are load-bearing for the paper's central contribution, yet the protocol for forming the 25% training subsets is not described. With 7,777 frames from only 112 patients, frame-level random subsampling would likely introduce patient-level and temporal leakage (adjacent cine frames from the same patient appearing in both train and test), artificially inflating robustness metrics. The manuscript must explicitly state whether splits are patient-wise, report the number of random seeds, and include performance variance or statistical tests to substantiate the 25%-data retention results.
Authors: We agree that explicit documentation of the splitting protocol is essential to substantiate the low-data regime claims and rule out leakage. All splits, including the 25% subsets, were performed patient-wise to ensure no frames from the same patient (or temporally adjacent cine frames) appear across train/test partitions. We employed 5 random seeds for subset selection and will report mean ± standard deviation for the efficiency experiments. We will add a detailed description of this protocol to the Methods section and include variance metrics plus basic statistical comparisons in the revised §4. revision: yes
-
Referee: [Results section, Table 2] Table 2 (or equivalent results table) reports aggregate Dice and HD95 without per-patient breakdowns or cross-validation folds. Given the modest patient count (112), it is unclear whether the 0.945 Dice and 11.4% HD improvement generalize across patients or are driven by a few easy cases; patient-wise metrics and a proper k-fold patient-split protocol are needed to support the generalization claims.
Authors: We acknowledge that aggregate-only reporting limits assessment of per-patient variability. We will add a supplementary table with per-patient Dice and HD95 values and explicitly describe the patient-stratified 5-fold cross-validation protocol used for all experiments. The main Table 2 will retain aggregate results for readability, with a reference to the supplementary per-patient analysis. revision: yes
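The reporting promised in the rebuttal (patient-level folds, per-patient metrics, mean ± standard deviation across runs) could look roughly like the sketch below. It is an illustration under assumptions, not the authors' protocol: the helper names are hypothetical, and the folds are grouped by patient only, since the rebuttal does not say which variable the "patient-stratified" folds are stratified on.

```python
import random
import statistics


def patient_kfold(patient_ids, k=5, seed=0):
    """Yield (train_patients, test_patients); each patient is tested exactly once."""
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    folds = [patients[i::k] for i in range(k)]
    for fold in folds:
        test = set(fold)
        yield set(patients) - test, test


def per_patient_mean(frame_scores, frame_patient_ids):
    """Average a frame-level metric (e.g. Dice) within each patient."""
    by_patient = {}
    for score, pid in zip(frame_scores, frame_patient_ids):
        by_patient.setdefault(pid, []).append(score)
    return {pid: statistics.fmean(vals) for pid, vals in by_patient.items()}


def mean_and_std(values):
    """Mean and sample standard deviation, e.g. across seeds or folds."""
    values = list(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return statistics.fmean(values), std
```

Averaging within patients before aggregating keeps patients with long cine clips from dominating the headline number, which speaks directly to the referee's worry about a few easy cases driving the result.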
Circularity Check
No circularity: purely empirical model evaluation
The paper describes an architecture (DINOv3 backbone + DPT decoder) and reports measured performance metrics (Dice 0.945, Hausdorff improvements) on a fixed clinical dataset against published baselines. No equations, first-principles derivations, or 'predictions' appear that reduce to fitted parameters or self-citations by construction. All claims rest on standard supervised training and held-out evaluation, which are externally verifiable and independent of the reported numbers.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Pretrained DINOv3 weights contain transferable semantic representations for medical ultrasound images.
- Domain assumption: The 7,777-frame dataset from 112 patients is representative of the target clinical distribution.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "integrates the DINOv3 foundational backbone with a high-performance dense prediction head based on the DPT architecture... trained using a combination of Binary Cross-Entropy and Dice loss"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, theorem J_uniquely_calibrated_via_higher_derivative (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "DINOv3-based approach retains significantly higher performance under data starvation regimes, maintaining strong results even when trained on only 25% of the data"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.