
arxiv: 2606.29586 · v1 · pith:BT6WD5MNnew · submitted 2026-06-28 · 💻 cs.CV · cs.AI

SonoCLIP: Mask-Guided Region-Aware Vision-Language Pretraining for Fetal Ultrasound Analysis

Pith reviewed 2026-06-30 06:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: fetal ultrasound, vision-language model, mask-guided pretraining, zero-shot transfer, region-aware learning, contrastive loss, foundation model, medical image analysis

The pith

Integrating segmentation masks as visual prompts enables region-controllable vision-language pretraining for fetal ultrasound.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SonoCLIP as a vision-language foundation model that adds segmentation masks as extra channels to the vision encoder, supporting joint global and local contrastive learning. It introduces a sigmoid-based pairwise contrastive loss to handle large-scale training and pretrains on a curated 1.44 million image dataset spanning 24 standard planes. This setup targets the challenges of speckle noise and variable anatomy in ultrasound by moving beyond purely global image-text alignment. A sympathetic reader would care because the resulting model supports mask-guided inference for more targeted analysis of local structures in zero-shot settings across centers.

Core claim

SonoCLIP integrates segmentation masks as mask-channel visual prompts within the vision encoder to enable joint global-local contrastive representation learning. It employs a sigmoid-based pairwise contrastive loss for stable supervision at scale and pretrains on a 1.44M-image multimodal fetal ultrasound dataset. The result is superior zero-shot transfer performance under both global and mask-guided inference in cross-center evaluations.
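The sigmoid-based pairwise loss can be illustrated with a small sketch. This page does not give the exact formulation used in SonoCLIP, so the code below assumes the SigLIP-style form from Zhai et al. (the sigmoid loss paper in the reference list): every image-text pair in a batch is scored independently, with matched pairs labeled +1 and all other pairs labeled -1. The function names and the `temperature` and `bias` values are illustrative assumptions, not the authors' code or settings.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) for both signs of x.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(vec):
    # Scale a non-zero vector to unit length.
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]


def sigmoid_pairwise_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """SigLIP-style loss: sum over all pairs, averaged over the batch size.

    image_embs[i] and text_embs[i] are assumed to be a matched pair.
    temperature and bias are illustrative values (assumption).
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0  # +1 on the diagonal, -1 elsewhere
            total -= log_sigmoid(label * logit)
    return total / len(imgs)


# Toy batch: correctly paired embeddings should score a lower loss
# than the same embeddings with the texts swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = sigmoid_pairwise_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = sigmoid_pairwise_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Because each pair contributes its own binary term, the loss needs no normalization across the whole batch, which is the property usually credited for the stability of sigmoid losses at large scale.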

What carries the argument

Mask-channel visual prompts that feed segmentation masks into the vision encoder to support region-text alignment alongside global alignment.
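As a rough illustration of what a mask-channel prompt looks like at the input level, the sketch below appends a segmentation mask to an RGB image as a fourth channel. How SonoCLIP's encoder actually consumes that channel is not described on this page; the helper name `add_mask_channel` and the choice of an all-ones mask to represent global (unmasked) inference are assumptions made for this example.

```python
def add_mask_channel(image, mask=None):
    """Append a mask as a fourth channel to an H x W x 3 image (nested lists).

    mask=None stands in for global inference: an all-ones mask marks the
    whole image as the region of interest (an assumption of this sketch).
    """
    height, width = len(image), len(image[0])
    if mask is None:
        mask = [[1.0] * width for _ in range(height)]
    if len(mask) != height or any(len(row) != width for row in mask):
        raise ValueError("mask must match the image's height and width")
    return [
        [list(pixel) + [float(mask[r][c])] for c, pixel in enumerate(row)]
        for r, row in enumerate(image)
    ]


# A 2 x 2 toy image with a mask selecting only the top-left pixel.
toy_image = [
    [[0.2, 0.2, 0.2], [0.5, 0.5, 0.5]],
    [[0.1, 0.1, 0.1], [0.9, 0.9, 0.9]],
]
region_input = add_mask_channel(toy_image, [[1, 0], [0, 0]])
global_input = add_mask_channel(toy_image)
```

The same encoder input shape is used in both modes, so switching between global and region-focused inference only changes the contents of the extra channel.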

If this is right

  • The model allows clinicians to supply masks at inference time to focus contrastive alignment on specific anatomical regions.
  • Zero-shot transfer improves on both global and local tasks without task-specific fine-tuning.
  • The sigmoid-based loss supports stable training when scaling to million-image multimodal supervision.
  • Coverage of 24 standard planes provides broad applicability within fetal ultrasound analysis.
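Zero-shot transfer in CLIP-style models is typically done by comparing an image embedding against text-prompt embeddings and ranking the labels by cosine similarity. The sketch below shows that ranking step with hand-written toy vectors; the label names, the embeddings, and the idea that a mask shifts the image embedding toward one structure are all hypothetical stand-ins for real encoder outputs.

```python
import math


def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_rank(image_emb, text_embs_by_label):
    """Return labels sorted from most to least similar to the image."""
    scores = {
        label: cosine(image_emb, emb)
        for label, emb in text_embs_by_label.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def top_k_hit(ranking, true_label, k):
    # True when the correct label is among the k best-ranked labels.
    return true_label in ranking[:k]


# Hypothetical prompt embeddings for three labels.
prompts = {
    "fetal head": [1.0, 0.0, 0.0],
    "abdomen": [0.0, 1.0, 0.0],
    "femur": [0.0, 0.0, 1.0],
}
global_ranking = zero_shot_rank([0.6, 0.7, 0.1], prompts)  # whole-image embedding
masked_ranking = zero_shot_rank([0.9, 0.2, 0.1], prompts)  # mask-guided embedding
```

In this toy case the global embedding ranks "abdomen" first, while the mask-guided embedding ranks "fetal head" first, which is the kind of shift region control is meant to allow.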

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mask-channel approach could be tested on other ultrasound applications such as cardiac or abdominal imaging to check transfer of the region-control benefit.
  • If masks are available from automated segmenters, the model might reduce dependence on expert annotations during deployment.
  • The controllable inference mode could be combined with existing clinical workflows that already produce segmentation outputs.

Load-bearing premise

The curated dataset and mask integration produce representations that generalize across centers without masks introducing bias or the loss causing training instability.

What would settle it

A new-center evaluation in which mask-guided inference shows no improvement over, or performs worse than, global-only inference on the same test images would falsify the central performance claim.
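One way to set up that check is a paired comparison of the two inference modes on identical test images. The sketch below is a minimal version under stated assumptions: the function name `compare_modes` and the toy labels are hypothetical, and a real evaluation would add a significance test over the discordant counts.

```python
def compare_modes(labels, global_preds, masked_preds):
    """Paired top-1 comparison of global and mask-guided predictions."""
    n = len(labels)
    if n == 0 or len(global_preds) != n or len(masked_preds) != n:
        raise ValueError("all three lists must be non-empty and equally long")
    global_ok = [pred == y for pred, y in zip(global_preds, labels)]
    masked_ok = [pred == y for pred, y in zip(masked_preds, labels)]
    return {
        "global_acc": sum(global_ok) / n,
        "masked_acc": sum(masked_ok) / n,
        # Discordant cases: images that only one of the two modes gets right.
        "masked_only_correct": sum(m and not g for g, m in zip(global_ok, masked_ok)),
        "global_only_correct": sum(g and not m for g, m in zip(global_ok, masked_ok)),
    }


# Toy example with four test images from a hypothetical new center.
report = compare_modes(
    labels=["a", "b", "a", "b"],
    global_preds=["a", "a", "a", "a"],
    masked_preds=["a", "b", "b", "b"],
)
```

If `masked_acc` does not exceed `global_acc`, or the discordant counts favor the global mode, the result would count against the claim as stated above.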

Figures

Figures reproduced from arXiv: 2606.29586 by Bo Du, Chao Sun, Hang Su, Juhua Liu, Wei Hu, Zhaofan Li.

Figure 1: Overview of SonoCLIP. (a) Data curation pipeline, including quality control, manual and AI-assisted mask annotation, and standardized caption generation from clinical records. (b) Dataset distribution across gestational weeks, hospitals, and anatomical planes. (c) Pre-training framework: ultrasound images and masks are jointly encoded via a mask-channel visual pathway, and aligned with paired text using a…
Figure 2 presents ablation results for SigLoss and mask-guided learning in zero-shot classification. When mask-guided learning is disabled, incorporating SigLoss alone yields a modest improvement in model performance. However, enabling both mask-guided learning and SigLoss simultaneously results in a significant increase in both Top-1 and Top-5 performance. This indicates that SigLoss’s advantages become more prono…
Figure 3: (a) Zero-shot classification results of each model on the FetalP6 dataset. (b) Qualitative comparison of segmentation results on the FetalP5 dataset.
Abstract

Vision-language foundation models have shown strong potential in medical image analysis. Although foundation models for ultrasound imaging have recently emerged, the domain remains particularly challenging due to severe speckle noise, acquisition variability, and subtle anatomical boundaries, leading to high inter-observer variability. Existing CLIP-based models rely primarily on global image-text alignment, limiting their sensitivity to clinically decisive local structures. We propose SonoCLIP, the first million-scale region-controllable fetal ultrasound vision-language foundation model that integrates segmentation masks as mask-channel visual prompts within the vision encoder, enabling joint global-local contrastive representation learning. To support scalable region-text alignment, we introduce a sigmoid-based pairwise contrastive loss that improves stability under large-scale supervision. We further curate a 1.44M-image multimodal fetal ultrasound dataset spanning 24 standard planes for large-scale pretraining. Extensive cross-center evaluations demonstrate that SonoCLIP achieves superior zero-shot transfer performance under both global and mask-guided inference, establishing a controllable and clinically oriented foundation model for fetal ultrasound analysis. Our code and data are available at https://github.com/Harrison-one/SonoCLIP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. SonoCLIP is proposed as a mask-guided region-aware vision-language pretraining model for fetal ultrasound analysis. It integrates segmentation masks as additional channels in the vision encoder to enable controllable global-local contrastive learning on a curated 1.44 million image dataset spanning 24 standard planes. A sigmoid-based pairwise contrastive loss is introduced to improve training stability at scale. The model is claimed to achieve superior zero-shot transfer performance on cross-center evaluations using both global and mask-guided inference.

Significance. If the reported performance advantages hold under rigorous evaluation, this work could significantly advance the development of foundation models for ultrasound imaging by providing region-controllable representations that are better suited to the clinical needs of fetal analysis. The public availability of the code and dataset is a strength that supports further research in the field.

minor comments (1)
  1. [Abstract] The abstract asserts superior zero-shot transfer performance without providing any quantitative metrics, specific baselines, or statistical details. Including key results would strengthen the summary of the contributions.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of SonoCLIP and the recommendation for minor revision. We appreciate the acknowledgment of the model's potential impact on foundation models for ultrasound imaging and the value of releasing code and data.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's core contribution is empirical: curation of a 1.44M-image fetal ultrasound dataset, integration of mask-channel prompts into a CLIP-style vision encoder, and introduction of a sigmoid pairwise contrastive loss for region-text alignment. All claims of superior zero-shot performance rest on cross-center evaluations rather than any derivation that reduces to fitted parameters or self-citations by construction. No equations, uniqueness theorems, or ansatzes are shown to be self-referential; the argument is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach relies on standard assumptions from contrastive learning and on the utility of mask prompts. No new entities are postulated, and no free parameters are explicitly fitted beyond typical training hyperparameters, which are not detailed here.

axioms (2)
  • domain assumption: Vision-language contrastive learning can produce useful representations for downstream tasks.
    Core assumption of CLIP-style models, invoked implicitly.
  • domain assumption: Segmentation masks can be effectively integrated as additional channels in the vision encoder without degrading performance.
    Key to the mask-guided approach described.


