arxiv: 2606.11106 · v1 · pith:WYLI3XWInew · submitted 2026-06-09 · 💻 cs.CV · cs.AI

FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model

Pith reviewed 2026-06-27 13:06 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords fetal ultrasound · vision-language model · knowledge distillation · medical image segmentation · object detection · clinical interpretation · edge deployment

The pith

FADA builds a single vision-language model that unifies fetal ultrasound interpretation, detection, segmentation, and classification through selective distillation from four domain models without external labels at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FADA, a model based on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation for fetal ultrasound in one pipeline to address sonographer shortages in low-resource settings. It distills knowledge selectively from FetalCLIP, UltraSAM, USF-MAE, and UltraFedFM using offline pre-computed feature caching, with feature alignment applied only to annotation tasks while interpretation uses standard fine-tuning. This selective approach outperforms full distillation and yields 0.8820 mean Dice for segmentation, 0.7671 mAP@0.50 for detection, and 100% structured interpretation compliance. Expert sonographer review of 237 images confirms clinical acceptability in both autonomous and human-in-the-loop modes, and the compressed model runs the full pipeline offline on a smartphone in roughly 60 seconds.

Core claim

Selective distillation from the four domain-specific foundation models into Qwen3.5-VL via offline feature caching produces a unified vision-language model that executes a complete five-phase fetal ultrasound pipeline without requiring external labels or separate models at inference, with the recommended FADA-SKD variant reaching 0.8820 mean Dice, 0.7671 mAP@0.50, and 100% structured interpretation compliance while remaining trainable on one consumer GPU and deployable on edge devices.
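
The two headline metrics in this claim can be made concrete with a small sketch. It assumes binary masks stored as nested lists of 0/1 and boxes in (x1, y1, x2, y2) form; the function names and toy inputs are illustrative and are not taken from the FADA codebase. Only the IoU >= 0.50 matching rule behind mAP@0.50 is shown; the per-class precision-recall integration that produces the final mAP is omitted.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks (nested lists of 0/1)."""
    inter = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.50):
    """A detection matches at the @0.50 operating point when IoU >= 0.50."""
    return box_iou(pred_box, gt_box) >= threshold


# Toy inputs, invented for illustration only.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
print(dice_coefficient(pred, target))
print(is_true_positive((0, 0, 10, 10), (0, 0, 10, 8)))
```

A reported mean Dice such as 0.8820 would be the average of this per-mask score over the evaluation set, under whatever averaging convention the paper uses.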

What carries the argument

Selective distillation with offline pre-computed feature caching from four domain-specific foundation models, restricting feature alignment to annotation tasks only.
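
A minimal sketch of that mechanism follows, assuming a plain mean-squared-error alignment term and an in-memory cache. Everything named here is hypothetical: `FeatureCache`, `selective_loss`, the `weight` value, and the choice of which tasks count as annotation tasks are illustrative, and the sketch assumes the student features have already been projected to each teacher's dimensionality. The paper's actual loss and cache format may differ.

```python
import hashlib

# Assumption: which tasks count as "annotation" is a guess for illustration.
ANNOTATION_TASKS = {"classification", "detection", "segmentation"}


class FeatureCache:
    """Stores teacher features computed once, offline, keyed by image content."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(image_bytes):
        return hashlib.sha256(image_bytes).hexdigest()

    def put(self, image_bytes, teacher_name, features):
        self._store[(self.key(image_bytes), teacher_name)] = list(features)

    def get(self, image_bytes, teacher_name):
        return self._store[(self.key(image_bytes), teacher_name)]


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def selective_loss(task, task_loss, student_feats, image_bytes, cache, teachers, weight=0.5):
    """Task loss plus a feature-alignment term, applied to annotation tasks only."""
    if task not in ANNOTATION_TASKS:
        return task_loss  # interpretation: plain fine-tuning, no alignment term
    align = sum(mse(student_feats, cache.get(image_bytes, t)) for t in teachers) / len(teachers)
    return task_loss + weight * align


# Toy usage with invented two-dimensional features.
cache = FeatureCache()
img = b"fake-ultrasound-bytes"
cache.put(img, "FetalCLIP", [1.0, 0.0])
cache.put(img, "UltraSAM", [0.0, 1.0])
student = [0.5, 0.5]
teachers = ["FetalCLIP", "UltraSAM"]
print(selective_loss("segmentation", 1.0, student, img, cache, teachers))
print(selective_loss("interpretation", 1.0, student, img, cache, teachers))
```

Because the teacher features are read from the cache, the teacher models never need to be loaded during training, which is consistent with the single-consumer-GPU claim.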

If this is right

  • A single model replaces the need for separate task-specific networks for fetal ultrasound analysis.
  • No expert-specified labels or external models are required at inference for any task.
  • Clinically acceptable outputs are produced in both fully autonomous and human-guided modes.
  • Full offline execution on commodity smartphones enables deployment in settings without internet or cloud access.
  • Training fits on a single consumer GPU, lowering the barrier to local adaptation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The selective restriction of distillation to annotation tasks may help preserve the base model's interpretive strengths compared with uniform alignment.
  • The same caching-plus-selective-distillation pattern could be tested on other ultrasound domains such as cardiac or abdominal imaging.
  • Direct integration with portable probe hardware would create an end-to-end offline prenatal screening workflow.
  • Reducing the number of source models while monitoring performance could further simplify the pipeline.

Load-bearing premise

The pre-computed features from the four domain-specific models combined with selective distillation will produce a unified model that generalizes reliably to new clinical data without external labels or additional models at inference.

What would settle it

Running the model on a new, unseen dataset from different ultrasound machines or patient populations and observing Dice scores below 0.80 or interpretation compliance below 90 percent would falsify reliable generalization.
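
The test above reduces to two threshold comparisons, sketched below. The function name, argument names, and toy values are hypothetical; only the 0.80 Dice floor and the 90 percent compliance floor come from the text.

```python
def generalization_check(dice_scores, compliant_flags, dice_floor=0.80, compliance_floor=0.90):
    """Flag the generalization claim as falsified if either floor is breached."""
    mean_dice = sum(dice_scores) / len(dice_scores)
    compliance = sum(compliant_flags) / len(compliant_flags)
    return {
        "mean_dice": mean_dice,
        "compliance": compliance,
        "falsified": mean_dice < dice_floor or compliance < compliance_floor,
    }


# Invented numbers for an imagined external dataset.
result = generalization_check([0.85, 0.78, 0.90], [True, True, True, False])
print(result)
```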

Original abstract

A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of pregnant women receive no skilled sonography. Current deep learning approaches address detection, segmentation, or classification in isolation, each demanding a separate model and expert-specified labels at inference. We present FADA, a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation through a single interpretation-first pipeline without external labels. FADA distills knowledge from four domain-specific foundation models (FetalCLIP, UltraSAM, USF-MAE, UltraFedFM) via offline pre-computed feature caching. Selective distillation, which applies feature alignment only to annotation tasks while interpretation relies on standard fine-tuning, consistently outperforms full distillation across most evaluation axes. The recommended variant, FADA-SKD, achieves 0.8820 mean Dice for segmentation, 0.7671 mAP@0.50 for detection, and 100% structured interpretation compliance. Expert sonographer validation across 237 images confirms clinically acceptable outputs in both autonomous and human-in-the-loop modes, with 73.5% of interpretations scoring perfectly under clinician guidance. The system is trainable on a single consumer GPU and deployable without cloud connectivity. We validate edge deployment by running the compressed 0.8B model on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1, 12 GB RAM) using llama.cpp with GGUF quantization, completing the full 5-phase pipeline in approximately 60 seconds entirely offline. This establishes a practical pathway for integrating AI-assisted fetal assessment with portable ultrasound devices, directly addressing diagnostic access gaps in resource-constrained settings. Code, models, and data are available at https://github.com/mahmoodphd/FADA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper introduces FADA, a unified vision-language model based on Qwen3.5-VL for fetal ultrasound that performs clinical interpretation, classification, detection, and segmentation via a single interpretation-first pipeline. It employs selective knowledge distillation from four offline domain-specific models (FetalCLIP, UltraSAM, USF-MAE, UltraFedFM) with feature caching, claiming that the FADA-SKD variant achieves 0.8820 mean Dice for segmentation, 0.7671 mAP@0.50 for detection, and 100% structured interpretation compliance. Expert sonographer review on 237 images confirms clinical acceptability in autonomous and human-in-the-loop modes, with the compressed model runnable offline on a smartphone in ~60 seconds. The work targets accessibility in low-resource settings without requiring external labels or source models at inference.

Significance. If the performance and generalization claims hold, the work has substantial significance for prenatal care in LMICs by unifying multiple ultrasound tasks into one deployable model that eliminates per-task labeling at inference and supports edge deployment on commodity hardware. The selective distillation approach and expert validation on real images are notable strengths if supported by fuller experimental detail.

major comments (4)
  1. [Methods] Methods/Results: The manuscript provides no description of the training dataset (size, sources, acquisition parameters, or train/val/test splits) or the composition of the 237-image expert validation set, which is load-bearing for interpreting the headline metrics of 0.8820 Dice and 0.7671 mAP.
  2. [Results] Results: No ablation tables or quantitative comparisons between selective distillation (SKD) and full distillation are shown, despite the explicit claim that SKD 'consistently outperforms full distillation across most evaluation axes'; this omission weakens the justification for the recommended variant.
  3. [Evaluation] Evaluation: Generalization is asserted for 'new clinical data' and 'unseen clinical distributions,' yet the only external check is expert review on a single 237-image internal set, with no cross-site, multi-scanner, or geographic-shift experiments reported; this leaves the central claim of reliable out-of-distribution performance without source models at inference untested.
  4. [Results] Results: No statistical tests, confidence intervals, or inter-rater agreement metrics accompany the performance numbers or the 73.5% perfect-score clinician guidance result, limiting assessment of whether the reported figures reliably support the clinical-acceptability conclusion.
minor comments (2)
  1. [Abstract] The abstract states that the system is 'trainable on a single consumer GPU' but provides no training protocol details (optimizer, learning rate schedule, epochs, or hardware specifications) that would allow reproduction.
  2. Consider adding a summary table comparing all FADA variants on the key metrics (Dice, mAP, compliance) to improve readability of the selective-distillation advantage.

Simulated Authors' Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help improve the clarity and rigor of the manuscript. We address each major comment point by point below, indicating where revisions will be made.

Point-by-point responses
  1. Referee: [Methods] Methods/Results: The manuscript provides no description of the training dataset (size, sources, acquisition parameters, or train/val/test splits) or the composition of the 237-image expert validation set, which is load-bearing for interpreting the headline metrics of 0.8820 Dice and 0.7671 mAP.

    Authors: We agree that a detailed description of the datasets is essential. In the revised manuscript, we will add a dedicated subsection in Methods providing the training dataset size, sources, acquisition parameters, and train/val/test splits, along with the composition, demographics, and selection criteria for the 237-image expert validation set. revision: yes

  2. Referee: [Results] Results: No ablation tables or quantitative comparisons between selective distillation (SKD) and full distillation are shown, despite the explicit claim that SKD 'consistently outperforms full distillation across most evaluation axes'; this omission weakens the justification for the recommended variant.

    Authors: We acknowledge this gap. We will include new ablation tables in the revised Results section with quantitative comparisons between FADA-SKD and full distillation variants across segmentation, detection, classification, and interpretation metrics to support the stated performance advantages. revision: yes

  3. Referee: [Evaluation] Evaluation: Generalization is asserted for 'new clinical data' and 'unseen clinical distributions,' yet the only external check is expert review on a single 237-image internal set, with no cross-site, multi-scanner, or geographic-shift experiments reported; this leaves the central claim of reliable out-of-distribution performance without source models at inference untested.

    Authors: The 237-image set consists of images from new clinical acquisitions not used in training. We will revise the Evaluation section to clarify this and to explicitly note the limitations regarding multi-site and geographic generalization. Broader cross-site experiments are beyond the scope of the current resources and will be listed as future work. revision: partial

  4. Referee: [Results] Results: No statistical tests, confidence intervals, or inter-rater agreement metrics accompany the performance numbers or the 73.5% perfect-score clinician guidance result, limiting assessment of whether the reported figures reliably support the clinical-acceptability conclusion.

    Authors: We agree that statistical support is needed. In the revised manuscript, we will add statistical tests, 95% confidence intervals for the key metrics (Dice, mAP), and inter-rater agreement metrics (e.g., Cohen's kappa) for the expert sonographer evaluations. revision: yes
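
For readers unfamiliar with the statistics requested in point 4, the sketch below shows one common way to compute them: Cohen's kappa for two raters and a Wilson score interval for a proportion. These are standard textbook formulas, not the authors' analysis; the ratings and counts are invented, and the paper does not state the number of raters or the denominator behind the 73.5% figure.

```python
import math
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    # Degenerate case: both raters used one identical category throughout.
    return 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Invented ratings from two hypothetical sonographers.
rater_a = ["ok", "ok", "bad", "ok"]
rater_b = ["ok", "bad", "bad", "ok"]
print(cohens_kappa(rater_a, rater_b))

# Invented counts: 15 perfect scores out of 20 reviewed images.
low, high = wilson_interval(15, 20)
print(low, high)
```

The Wilson interval is used here instead of the simpler normal approximation because it behaves better for small samples and for proportions near 0 or 1, such as a 100% compliance rate.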

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of inputs

Full rationale

The paper describes an empirical pipeline: offline feature caching from four external foundation models, selective distillation into a Qwen3.5-VL backbone, and standard fine-tuning for interpretation. Reported metrics (0.8820 Dice, 0.7671 mAP, 100% compliance) and clinician review on 237 held-out images are obtained via conventional train/test splits and external validation, not by algebraic reduction to the training inputs or by re-labeling fitted parameters as predictions. No equations, self-definitions, or load-bearing self-citations appear in the method; the derivation chain consists of standard supervised training followed by independent evaluation and therefore remains self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the transferability of features from the four cited foundation models and on standard assumptions that fine-tuning plus selective alignment will yield a generalizable unified model; no new entities are postulated.

free parameters (1)
  • selective distillation hyperparameters
    Choices of which tasks receive feature alignment and the strength of that alignment are tuned during training and not derived from first principles.
axioms (1)
  • [domain assumption] Pre-computed features from FetalCLIP, UltraSAM, USF-MAE, and UltraFedFM are sufficiently rich and aligned for the target fetal ultrasound tasks.
    The offline caching and selective distillation step rests on this transfer assumption.

pith-pipeline@v0.9.1-grok · 5900 in / 1297 out tokens · 32208 ms · 2026-06-27T13:06:08.796287+00:00 · methodology
