arxiv: 2604.08045 · v1 · submitted 2026-04-09 · 💻 cs.CV

Adapting Foundation Models for Annotation-Efficient Adnexal Mass Segmentation in Cine Images

Francesca Fati, Alberto Rota, Adriana V. Gregory, Anna Catozzo, Maria C. Giuliano, Mrinal Dhar, Luigi De Vitis, Annie T. Packard, Francesco Multinu, Elena De Momi, Carrie L. Langstraat, Timothy L. Kline

Pith reviewed 2026-05-10 18:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords adnexal mass segmentation, ultrasound cine imaging, label-efficient learning, foundation model adaptation, medical image segmentation, transformer backbone, data scarcity robustness

The pith

Adapting a pretrained vision transformer backbone yields accurate adnexal mass segmentation in ultrasound cine images even with limited annotations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how semantic knowledge from large-scale pretraining on natural scenes can be repurposed for segmenting adnexal masses in ultrasound video frames. Conventional convolutional networks demand extensive pixel-level labels and degrade under the data scarcity and appearance variations typical of clinical scans. The proposed setup pairs the backbone's broad contextual representations with a decoder that reassembles features at multiple scales to recover fine boundary details. On a set of 7,777 annotated frames from 112 patients, the method exceeds standard fully supervised baselines in overlap (Dice) and boundary (95th-percentile Hausdorff distance) metrics, and it maintains strong results when trained on only one-quarter of the available data. This pattern points to a workable path toward reliable automated tools in annotation-constrained medical environments.

Core claim

A vision transformer backbone that carries semantic priors from pretraining, when combined with a decoder that hierarchically reassembles multi-scale features, produces state-of-the-art segmentation of adnexal masses on clinical ultrasound cine data and maintains high performance under substantial reductions in labeled training examples.

What carries the argument

A pretrained vision transformer backbone supplying global semantic representations, integrated with a decoder that reassembles multi-scale features into dense pixel predictions.
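
To make "reassembling multi-scale features" concrete, the toy sketch below takes four same-sized feature grids (as a ViT backbone produces at different depths), resamples them into a pyramid, and fuses them coarse to fine. It is an illustration under stated assumptions, not the authors' decoder: each grid is a single-channel Python list, the scale factors (4, 2, 1, 0.5) mirror the usual DPT layout, and fixed nearest-neighbour upsampling, mean pooling, and plain addition stand in for the learned resampling and fusion blocks. All function names are hypothetical.

```python
from typing import List, Sequence

Grid = List[List[float]]  # one single-channel feature map


def upsample_nearest(grid: Grid, k: int) -> Grid:
    """Repeat every cell k times along both axes."""
    return [[v for v in row for _ in range(k)] for row in grid for _ in range(k)]


def downsample_mean(grid: Grid, k: int) -> Grid:
    """Average non-overlapping k x k blocks."""
    h, w = len(grid) // k, len(grid[0]) // k
    return [
        [sum(grid[i * k + a][j * k + b] for a in range(k) for b in range(k)) / (k * k)
         for j in range(w)]
        for i in range(h)
    ]


def resample(grid: Grid, scale: float) -> Grid:
    """Fixed stand-in for a learned resampling operator (assumption)."""
    if scale > 1:
        return upsample_nearest(grid, int(scale))
    if scale < 1:
        return downsample_mean(grid, round(1 / scale))
    return [row[:] for row in grid]


def reassemble_and_fuse(features: Sequence[Grid], scales=(4, 2, 1, 0.5)) -> Grid:
    """Build a pyramid from same-sized maps, then fuse from coarsest to finest."""
    pyramid = [resample(f, s) for f, s in zip(features, scales)]
    fused = pyramid[-1]
    for finer in reversed(pyramid[:-1]):
        up = upsample_nearest(fused, 2)
        # Plain addition stands in for a learned fusion block.
        fused = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(up, finer)]
    return fused


# Four 4x4 maps of ones, standing in for features taken at four encoder depths.
features = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
fused = reassemble_and_fuse(features)
```

The fused map ends up at four times the token-grid resolution; a real prediction head would upsample further to the input size and apply learned convolutions along the way.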

Load-bearing premise

The semantic understanding acquired from pretraining on ordinary images transfers sufficiently well to the distinct visual statistics of ultrasound frames.

What would settle it

A new test collection drawn from different ultrasound machines or patient populations yielding Dice scores or boundary errors markedly worse than those on the original 112-patient set, or a steep performance collapse when the training data is reduced below 25 percent.

Figures

Figures reproduced from arXiv: 2604.08045 by Adriana V. Gregory, Alberto Rota, Anna Catozzo, Annie T. Packard, Carrie L. Langstraat, Elena De Momi, Francesca Fati, Francesco Multinu, Luigi De Vitis, Maria C. Giuliano, Mrinal Dhar, Timothy L. Kline.

**Figure 1.** Adnexal mass segmentation from cine images: (a) Qualitative results demonstrating high boundary fidelity between the predicted masks and the ground truth. (b) Performance comparison showing that DINOv3-based architectures outperform convolutional state-of-the-art baselines. (c) Improved performance retention under data-starved training regimes.

**Figure 2.** Model Architecture Overview: The input image $I$ is processed by the DINOv3 encoder $E$, which extracts a set of hierarchical feature maps from which we retain only the subset $\{F_{\ell_0}, F_{\ell_1}, F_{\ell_2}, F_{\ell_3}\}$. The feature maps are then passed to a learned resampling operator $\psi$ that resizes them to a higher or lower resolution depending on their rank in the hierarchy, yielding $\{G_{\ell_0}, G_{\ell_1}, G_{\ell_2}, G_{\ell_3}\}$. Later, the Upsample and re…

**Figure 3.** Qualitative Segmentation Comparison: Visual comparison of adnexal mass segmentation results across different architectures. Our method demonstrates superior boundary fidelity and structural consistency.

**Figure 4.** Efficiency Analysis Results: (a) Pareto efficiency analysis comparing segmentation performance against model capacity. We fit a logarithmic curve to the methods based on DINOv3 and to the convolutional state of the art separately, and we report the log slope $a$ and the fit quality $R^2$. (b) Data Starvation Curves report the performance losses at progressively larger starvation fractions.
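
The logarithmic fit mentioned in the Figure 4 caption can be sketched as an ordinary least-squares fit of score against the logarithm of model size. This is a minimal sketch, assuming the form y = a·ln(x) + b and the natural logarithm; the caption does not state the base, and the data points below are synthetic values chosen only to exercise the code, not results from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_log_curve(sizes: Sequence[float], scores: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit of score = a * ln(size) + b; returns (a, b, R^2)."""
    xs = [math.log(s) for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    a = sxy / sxx                      # log slope
    b = mean_y - a * mean_x            # intercept
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, scores))
    ss_tot = sum((y - mean_y) ** 2 for y in scores)
    r2 = 1.0 - ss_res / ss_tot         # fit quality
    return a, b, r2


# Synthetic points lying exactly on 0.02 * ln(x) + 0.8 (illustrative only).
sizes = [1e6, 1e7, 1e8]
scores = [0.02 * math.log(s) + 0.8 for s in sizes]
a, b, r2 = fit_log_curve(sizes, scores)
```

A larger slope `a` means performance improves more per unit of log capacity, which is what a comparison between two model families on such a plot would read off.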

**Figure 5.** Convergence efficiency across architectures. Area under the Learning Curve (ALC) for different models and backbone sizes at two input resolutions. Each pair of markers shows the same architecture at different resolutions.
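
One common way to compute an Area under the Learning Curve is trapezoidal integration of a validation metric over training steps, divided by the step range so the result stays on the metric's scale. The sketch below assumes that definition; the page does not give the paper's exact formula, and the step and score values are made up for illustration.

```python
from typing import Sequence


def area_under_learning_curve(
    steps: Sequence[float], scores: Sequence[float], normalize: bool = True
) -> float:
    """Trapezoidal area under (step, score) points, optionally divided by the step range."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(zip(steps, scores), zip(steps[1:], scores[1:])):
        area += (x1 - x0) * (y0 + y1) / 2.0
    if normalize:
        area /= steps[-1] - steps[0]
    return area


# Hypothetical curve: the score rises quickly, then plateaus.
alc = area_under_learning_curve([0, 10, 20], [0.0, 0.8, 0.9])
```

Under this definition, a model that reaches a given score earlier gets a higher ALC than one that reaches the same final score later.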

Original abstract

Adnexal mass evaluation via ultrasound is a challenging clinical task, often hindered by subjective interpretation and significant inter-observer variability. While automated segmentation is a foundational step for quantitative risk assessment, traditional fully supervised convolutional architectures frequently require large amounts of pixel-level annotations and struggle with domain shifts common in medical imaging. In this work, we propose a label-efficient segmentation framework that leverages the robust semantic priors of a pretrained DINOv3 foundational vision transformer backbone. By integrating this backbone with a Dense Prediction Transformer (DPT)-style decoder, our model hierarchically reassembles multi-scale features to combine global semantic representations with fine-grained spatial details. Evaluated on a clinical dataset of 7,777 annotated frames from 112 patients, our method achieves state-of-the-art performance compared to established fully supervised baselines, including U-Net, U-Net++, DeepLabV3, and MAnet. Specifically, we obtain a Dice score of 0.945 and improved boundary adherence, reducing the 95th-percentile Hausdorff Distance by 11.4% relative to the strongest convolutional baseline. Furthermore, we conduct an extensive efficiency analysis demonstrating that our DINOv3-based approach retains significantly higher performance under data starvation regimes, maintaining strong results even when trained on only 25% of the data. These results suggest that leveraging large-scale self-supervised foundations provides a promising and data-efficient solution for medical image segmentation in data-constrained clinical environments. Project Repository: https://github.com/FrancescaFati/MESA
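
For readers unfamiliar with the two metrics quoted in the abstract, the following is a small pure-Python sketch of the Dice score and a 95th-percentile Hausdorff distance on binary masks. It makes several assumptions that may differ from the paper's evaluation code: boundaries are taken as foreground pixels with a 4-connected background neighbour, distances are in pixels, and HD95 is the larger of the two directed 95th percentiles (other conventions exist). The masks are toy examples.

```python
import math
from typing import List, Tuple

Mask = List[List[int]]  # binary mask, 1 = foreground


def dice(pred: Mask, gt: Mask) -> float:
    """Dice = 2 * |A and B| / (|A| + |B|)."""
    inter = sum(p * g for rp, rg in zip(pred, gt) for p, g in zip(rp, rg))
    total = sum(map(sum, pred)) + sum(map(sum, gt))
    return 1.0 if total == 0 else 2.0 * inter / total


def boundary(mask: Mask) -> List[Tuple[int, int]]:
    """Foreground pixels with at least one background or out-of-image 4-neighbour."""
    h, w = len(mask), len(mask[0])
    points = []
    for i in range(h):
        for j in range(w):
            if not mask[i][j]:
                continue
            for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                if not (0 <= a < h and 0 <= b < w) or not mask[a][b]:
                    points.append((i, j))
                    break
    return points


def percentile(values: List[float], q: float) -> float:
    """Percentile with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def hd95(pred: Mask, gt: Mask) -> float:
    """Max of the two directed 95th-percentile boundary distances (one convention)."""
    a, b = boundary(pred), boundary(gt)
    if not a or not b:
        return math.inf
    d_ab = [min(math.dist(p, q) for q in b) for p in a]
    d_ba = [min(math.dist(q, p) for p in a) for q in b]
    return max(percentile(d_ab, 95), percentile(d_ba, 95))


# Toy masks: a 4x4 square and the same square shifted one pixel to the right.
gt = [[1 if 1 <= i <= 4 and 1 <= j <= 4 else 0 for j in range(6)] for i in range(6)]
pred = [[1 if 1 <= i <= 4 and 2 <= j <= 5 else 0 for j in range(6)] for i in range(6)]
dice_value = dice(pred, gt)
hd95_value = hd95(pred, gt)
```

On these toy masks the one-pixel shift gives a Dice of 0.75 and an HD95 of one pixel, which shows how the overlap metric and the boundary metric respond differently to the same error.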

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes a label-efficient segmentation framework for adnexal masses in ultrasound cine images. It adapts a pretrained DINOv3 vision transformer backbone with a Dense Prediction Transformer (DPT)-style decoder to hierarchically reassemble multi-scale features. Evaluated on a clinical dataset of 7,777 annotated frames from 112 patients, the method claims state-of-the-art performance over fully supervised baselines (U-Net, U-Net++, DeepLabV3, MAnet), with a Dice score of 0.945, an 11.4% reduction in 95th-percentile Hausdorff distance relative to the strongest convolutional baseline, and retained strong performance when trained on only 25% of the data. A public GitHub repository is provided.

Significance. If the reported results hold under proper controls, this work is significant for demonstrating that large-scale self-supervised foundation models can substantially reduce annotation requirements for medical image segmentation in data-scarce clinical domains like ultrasound. The empirical comparisons to established baselines and the explicit focus on data-starvation regimes provide concrete evidence of practical utility. Credit is given for releasing reproducible code via the linked repository, which enables verification of the DINOv3 + DPT integration and efficiency experiments.

major comments (2)

[Experimental setup and results (low-data regime)] The data-efficiency claims (abstract and §4) are load-bearing for the paper's central contribution, yet the protocol for forming the 25% training subsets is not described. With 7,777 frames from only 112 patients, frame-level random subsampling would likely introduce patient-level and temporal leakage (adjacent cine frames from the same patient appearing in both train and test), artificially inflating robustness metrics. The manuscript must explicitly state whether splits are patient-wise, report the number of random seeds, and include performance variance or statistical tests to substantiate the 25%-data retention results.
[Results section, Table 2] Table 2 (or equivalent results table) reports aggregate Dice and HD95 without per-patient breakdowns or cross-validation folds. Given the modest patient count (112), it is unclear whether the 0.945 Dice and 11.4% HD improvement generalize across patients or are driven by a few easy cases; patient-wise metrics and a proper k-fold patient-split protocol are needed to support the generalization claims.
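
The leakage concern in the first major comment can be made concrete with a patient-wise split, where whole patients (and therefore all of their cine frames) land on one side only. The sketch below is one possible protocol, not the one used in the paper: the 20% test fraction, the choice to subsample training data by patient instead of by frame, and all names are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def patient_wise_split(
    frame_patient_ids: Sequence[str],
    test_fraction: float = 0.2,
    train_fraction: float = 1.0,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Return (train_idx, test_idx) frame indices with no patient on both sides."""
    rng = random.Random(seed)
    frames_by_patient: Dict[str, List[int]] = defaultdict(list)
    for idx, pid in enumerate(frame_patient_ids):
        frames_by_patient[pid].append(idx)

    patients = sorted(frames_by_patient)
    rng.shuffle(patients)
    n_test = max(1, int(len(patients) * test_fraction))
    test_patients, train_patients = patients[:n_test], patients[n_test:]

    # Data starvation: keep only a fraction of the training patients.
    n_keep = max(1, int(len(train_patients) * train_fraction))
    train_patients = train_patients[:n_keep]

    train_idx = [i for p in train_patients for i in frames_by_patient[p]]
    test_idx = [i for p in test_patients for i in frames_by_patient[p]]
    return train_idx, test_idx


# Toy data: 112 hypothetical patients with 3 frames each.
frame_patient_ids = [f"patient_{p:03d}" for p in range(112) for _ in range(3)]
train_idx, test_idx = patient_wise_split(frame_patient_ids, train_fraction=0.25, seed=0)
train_patients = {frame_patient_ids[i] for i in train_idx}
test_patients = {frame_patient_ids[i] for i in test_idx}
```

Because the shuffle depends only on the seed, the test patients stay the same across training fractions, so starvation curves built this way are compared against a fixed held-out set. Repeating the call with several seeds gives the variance the referee asks for.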

minor comments (3)

[Abstract] The abstract states concrete metrics (Dice 0.945, 11.4% HD reduction) without accompanying standard deviations or p-values from statistical tests against baselines; adding these would improve clarity.
[Method section 3.2] Notation for the DPT decoder integration (e.g., how multi-scale features from DINOv3 are reassembled) could be clarified with a diagram or explicit equations in §3.2 to aid reproducibility.
[Reproducibility statement] The GitHub link is provided but the manuscript does not reference specific commit hashes or exact training hyperparameters used for the reported numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The concerns regarding data splitting protocols and generalization metrics are well-taken, and we address each point below with plans for revision.

Referee: [Experimental setup and results (low-data regime)] The data-efficiency claims (abstract and §4) are load-bearing for the paper's central contribution, yet the protocol for forming the 25% training subsets is not described. With 7,777 frames from only 112 patients, frame-level random subsampling would likely introduce patient-level and temporal leakage (adjacent cine frames from the same patient appearing in both train and test), artificially inflating robustness metrics. The manuscript must explicitly state whether splits are patient-wise, report the number of random seeds, and include performance variance or statistical tests to substantiate the 25%-data retention results.

Authors: We agree that explicit documentation of the splitting protocol is essential to substantiate the low-data regime claims and rule out leakage. All splits, including the 25% subsets, were performed patient-wise to ensure no frames from the same patient (or temporally adjacent cine frames) appear across train/test partitions. We employed 5 random seeds for subset selection and will report mean ± standard deviation for the efficiency experiments. We will add a detailed description of this protocol to the Methods section and include variance metrics plus basic statistical comparisons in the revised §4. revision: yes
Referee: [Results section, Table 2] Table 2 (or equivalent results table) reports aggregate Dice and HD95 without per-patient breakdowns or cross-validation folds. Given the modest patient count (112), it is unclear whether the 0.945 Dice and 11.4% HD improvement generalize across patients or are driven by a few easy cases; patient-wise metrics and a proper k-fold patient-split protocol are needed to support the generalization claims.

Authors: We acknowledge that aggregate-only reporting limits assessment of per-patient variability. We will add a supplementary table with per-patient Dice and HD95 values and explicitly describe the patient-stratified 5-fold cross-validation protocol used for all experiments. The main Table 2 will retain aggregate results for readability, with a reference to the supplementary per-patient analysis. revision: yes
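
A per-patient analysis of the kind promised in the responses could look like the sketch below: patients are assigned to folds as whole units, and frame-level scores are averaged within each patient before the mean and standard deviation are taken across patients. This is a hedged illustration only; the fold assignment here is a simple shuffled round-robin grouped by patient and does not stratify on any clinical variable, and the scores are invented for the example.

```python
import random
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, List, Sequence, Tuple


def assign_patient_folds(patient_ids: Sequence[str], k: int = 5, seed: int = 0) -> Dict[str, int]:
    """Map each unique patient to one of k folds (shuffled round-robin)."""
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    return {pid: i % k for i, pid in enumerate(patients)}


def per_patient_summary(
    frame_scores: Sequence[float], frame_patient_ids: Sequence[str]
) -> Tuple[Dict[str, float], float, float]:
    """Average frame scores within each patient, then summarise across patients."""
    by_patient: Dict[str, List[float]] = defaultdict(list)
    for score, pid in zip(frame_scores, frame_patient_ids):
        by_patient[pid].append(score)
    per_patient = {pid: mean(values) for pid, values in by_patient.items()}
    values = list(per_patient.values())
    spread = stdev(values) if len(values) > 1 else 0.0
    return per_patient, mean(values), spread


folds = assign_patient_folds([f"p{i}" for i in range(112)], k=5, seed=0)

# Hypothetical Dice scores: patient A has two frames, patient B has one.
per_patient, patient_mean, patient_std = per_patient_summary([0.9, 1.0, 0.8], ["A", "A", "B"])
```

In the toy example the frame-level mean would be 0.9, while the patient-level mean is 0.875: patients with many frames no longer dominate the aggregate, which is the point of reporting patient-wise metrics.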

Circularity Check

0 steps flagged

No circularity: purely empirical model evaluation

The paper describes an architecture (DINOv3 backbone + DPT decoder) and reports measured performance metrics (Dice 0.945, Hausdorff improvements) on a fixed clinical dataset against published baselines. No equations, first-principles derivations, or 'predictions' appear that reduce to fitted parameters or self-citations by construction. All claims rest on standard supervised training and held-out evaluation, which are externally verifiable and independent of the reported numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that DINOv3 features pretrained on natural images provide useful semantic priors for ultrasound without explicit domain adaptation. No new entities are postulated. Standard supervised learning assumptions (i.i.d. frames, pixel-wise cross-entropy or Dice loss) are implicit but not enumerated.

axioms (2)

domain assumption: Pretrained DINOv3 weights contain transferable semantic representations for medical ultrasound images.
Invoked when the backbone is frozen or lightly fine-tuned and used directly for the downstream segmentation task.
domain assumption: The 7,777-frame dataset from 112 patients is representative of the target clinical distribution.
Required to interpret the reported Dice and Hausdorff numbers as generalizable performance.
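
The loss assumptions mentioned in this ledger can be illustrated with a combined Binary Cross-Entropy and Dice objective on per-pixel probabilities. The sketch below is a minimal pure-Python version under assumed settings: the equal weighting, the smoothing constant, and the clamping epsilon are illustrative choices, not values reported by the paper.

```python
import math
from typing import Sequence


def bce_dice_loss(
    probs: Sequence[float],
    targets: Sequence[int],
    bce_weight: float = 0.5,   # assumed weighting between the two terms
    smooth: float = 1.0,       # assumed Dice smoothing constant
    eps: float = 1e-7,
) -> float:
    """Weighted sum of binary cross-entropy and soft Dice loss over flattened pixels."""
    # Clamp probabilities so log() stays finite.
    p = [min(max(x, eps), 1.0 - eps) for x in probs]
    bce = -sum(t * math.log(q) + (1 - t) * math.log(1.0 - q) for q, t in zip(p, targets)) / len(p)
    inter = sum(q * t for q, t in zip(p, targets))
    soft_dice = (2.0 * inter + smooth) / (sum(p) + sum(targets) + smooth)
    return bce_weight * bce + (1.0 - bce_weight) * (1.0 - soft_dice)


targets = [1, 0, 1, 0]
perfect = bce_dice_loss([1.0, 0.0, 1.0, 0.0], targets)
uncertain = bce_dice_loss([0.5, 0.5, 0.5, 0.5], targets)
```

The cross-entropy term penalises each pixel independently, while the Dice term rewards overlap of the whole mask, which is why the two are often combined when the foreground is small relative to the image.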

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "integrates the DINOv3 foundational backbone with a high-performance dense prediction head based on the DPT architecture... trained using a combination of Binary Cross-Entropy and Dice loss"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: "DINOv3-based approach retains significantly higher performance under data starvation regimes, maintaining strong results even when trained on only 25% of the data"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
