SonoCLIP: Mask-Guided Region-Aware Vision-Language Pretraining for Fetal Ultrasound Analysis

Bo Du; Chao Sun; Hang Su; Juhua Liu; Wei Hu; Zhaofan Li

arXiv: 2606.29586 · v1 · pith:BT6WD5MN · submitted 2026-06-28 · cs.CV · cs.AI

Pith reviewed 2026-06-30 06:59 UTC · model grok-4.3

keywords: fetal ultrasound, vision-language model, mask-guided pretraining, zero-shot transfer, region-aware learning, contrastive loss, foundation model, medical image analysis

The pith

Integrating segmentation masks as visual prompts enables region-controllable vision-language pretraining for fetal ultrasound.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SonoCLIP as a vision-language foundation model that adds segmentation masks as extra channels to the vision encoder, supporting joint global and local contrastive learning. It introduces a sigmoid-based pairwise contrastive loss to handle large-scale training and pretrains on a curated 1.44 million image dataset spanning 24 standard planes. This setup targets the challenges of speckle noise and variable anatomy in ultrasound by moving beyond purely global image-text alignment. A sympathetic reader would care because the resulting model supports mask-guided inference for more targeted analysis of local structures in zero-shot settings across centers.

Core claim

SonoCLIP integrates segmentation masks as mask-channel visual prompts within the vision encoder to enable joint global-local contrastive representation learning. It employs a sigmoid-based pairwise contrastive loss for stable supervision at scale and pretrains on a 1.44M-image multimodal fetal ultrasound dataset. The result is superior zero-shot transfer performance under both global and mask-guided inference in cross-center evaluations.
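
To make the mask-channel idea concrete, here is a minimal sketch in plain Python. It assumes the mask is stacked onto the image as one extra input channel and that patches are embedded by a single linear projection. The abstract does not say how the mask channel is fused inside the encoder, so `add_mask_channel`, `patch_embed`, and the toy weights are hypothetical illustrations, not the authors' implementation.

```python
def add_mask_channel(image, mask):
    """Stack a binary mask onto an image as one extra channel.

    image: list of C channels, each an H x W list of floats.
    mask:  H x W list of 0/1 values marking the region of interest.
    Returns C + 1 channels. An all-ones mask would stand for whole-image input
    (an assumption of this sketch, not something the source states).
    """
    h, w = len(mask), len(mask[0])
    for ch in image:
        if len(ch) != h or any(len(row) != w for row in ch):
            raise ValueError("mask and image must share spatial size")
    image_copy = [[row[:] for row in ch] for ch in image]
    mask_channel = [[float(v) for v in row] for row in mask]
    return image_copy + [mask_channel]


def patch_embed(channels, patch, weights):
    """Flatten non-overlapping patch x patch tiles and project each linearly.

    weights: D rows, each of length num_channels * patch * patch.
    Returns one D-dimensional token per tile, in row-major tile order.
    """
    h, w = len(channels[0]), len(channels[0][0])
    tokens = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            # Channel-major flattening: all pixels of channel 0, then channel 1, ...
            flat = [ch[top + i][left + j]
                    for ch in channels
                    for i in range(patch)
                    for j in range(patch)]
            tokens.append([sum(wk * xk for wk, xk in zip(row, flat))
                           for row in weights])
    return tokens
```

With a one-channel image, the stacked input has two channels, so each projection row sees both pixel intensities and mask values for the same tile. That is the property that lets a supplied mask steer the representation toward a region.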

What carries the argument

Mask-channel visual prompts that feed segmentation masks into the vision encoder to support region-text alignment alongside global alignment.

If this is right

The model allows clinicians to supply masks at inference time to focus contrastive alignment on specific anatomical regions.
Zero-shot transfer improves on both global and local tasks without task-specific fine-tuning.
The sigmoid-based loss supports stable training when scaling to million-image multimodal supervision.
Coverage of 24 standard planes provides broad applicability within fetal ultrasound analysis.
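
The abstract names a sigmoid-based pairwise contrastive loss but gives no formula. The sketch below follows the general form of the sigmoid loss for language-image pre-training by Zhai et al. (2023), in which every image-text pair in a batch is scored as an independent binary problem. Whether SonoCLIP uses exactly this form is an assumption, and the `temperature` and `bias` defaults are illustrative values only.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sigmoid_pairwise_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of matched image/text embeddings.

    Pair (i, j) is a positive when i == j and a negative otherwise.
    No softmax over the batch is needed, so each pair contributes on its own.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * logit)
    return total / len(imgs)
```

One plausible way to obtain joint global-local supervision is to evaluate the same function once on whole-image embeddings and once on mask-prompted embeddings, then add the two terms. The abstract does not state how the terms are combined or weighted.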

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same mask-channel approach could be tested on other ultrasound applications such as cardiac or abdominal imaging to check transfer of the region-control benefit.
If masks are available from automated segmenters, the model might reduce dependence on expert annotations during deployment.
The controllable inference mode could be combined with existing clinical workflows that already produce segmentation outputs.

Load-bearing premise

The curated dataset and mask integration produce representations that generalize across centers without masks introducing bias or the loss causing training instability.

What would settle it

A new-center evaluation in which mask-guided inference shows no improvement or worse performance than global-only inference on the same test images would falsify the central performance claim.
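
The check described above is a paired comparison, because both inference modes are scored on the same test images. A minimal sketch of how it could be run, assuming per-image correctness flags are available for each mode. The function name and the choice of an exact McNemar test are assumptions of this sketch, not a protocol taken from the paper.

```python
from math import comb


def paired_mode_comparison(global_correct, masked_correct):
    """Compare two inference modes on the same images with an exact McNemar test.

    Both arguments are equal-length lists of booleans, one entry per test image.
    Returns (accuracy gain of mask-guided over global-only, two-sided p-value).
    """
    if len(global_correct) != len(masked_correct):
        raise ValueError("both modes must be scored on the same images")
    n = len(global_correct)
    pairs = list(zip(global_correct, masked_correct))
    # Only images where the two modes disagree carry information.
    only_masked = sum(1 for g, m in pairs if m and not g)
    only_global = sum(1 for g, m in pairs if g and not m)
    gain = (only_masked - only_global) / n
    discordant = only_masked + only_global
    if discordant == 0:
        return gain, 1.0
    k = min(only_masked, only_global)
    # Binomial(discordant, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
    return gain, min(1.0, 2 * tail)
```

A gain at or below zero on a new center, with enough discordant images for the test to be informative, is the outcome that would count against the central performance claim.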

Figures

Figures reproduced from arXiv: 2606.29586 by Bo Du, Chao Sun, Hang Su, Juhua Liu, Wei Hu, Zhaofan Li.

**Figure 1.** Overview of SonoCLIP. (a) Data curation pipeline, including quality control, manual and AI-assisted mask annotation, and standardized caption generation from clinical records. (b) Dataset distribution across gestational weeks, hospitals, and anatomical planes. (c) Pre-training framework: ultrasound images and masks are jointly encoded via a mask-channel visual pathway, and aligned with paired text using a…

**Figure 2.** Ablation results for SigLoss and mask-guided learning in zero-shot classification. When mask-guided learning is disabled, incorporating SigLoss alone yields a modest improvement in model performance. However, enabling both mask-guided learning and SigLoss simultaneously results in a significant increase in both Top-1 and Top-5 performance. This indicates that SigLoss’s advantages become more prono…

**Figure 3.** (a) Zero-shot classification results of each model on the FetalP6 dataset. (b) Qualitative comparison of segmentation results on the FetalP5 dataset.
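
The figure captions report zero-shot Top-1 and Top-5 performance. For readers unfamiliar with the metric, the sketch below shows how such scores are commonly computed for CLIP-style models: each class is represented by a text embedding, and an image counts as a hit if its true class is among the k most similar classes. The class names and embedding values in any usage are made up for illustration, and the paper's exact evaluation protocol is not given in the abstract.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_topk(image_emb, class_text_embs, k):
    """Return the k class names whose text embedding is closest to the image."""
    ranked = sorted(class_text_embs,
                    key=lambda name: cosine(image_emb, class_text_embs[name]),
                    reverse=True)
    return ranked[:k]


def topk_accuracy(image_embs, labels, class_text_embs, k):
    """Fraction of images whose true label is among the top-k predictions."""
    hits = sum(1 for emb, label in zip(image_embs, labels)
               if label in zero_shot_topk(emb, class_text_embs, k))
    return hits / len(labels)
```

Under mask-guided inference the only change in this sketch would be where `image_emb` comes from: the encoder would receive the image together with a region mask instead of the image alone.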

Original abstract

Vision-language foundation models have shown strong potential in medical image analysis. Although foundation models for ultrasound imaging have recently emerged, the domain remains particularly challenging due to severe speckle noise, acquisition variability, and subtle anatomical boundaries, leading to high inter-observer variability. Existing CLIP-based models rely primarily on global image-text alignment, limiting their sensitivity to clinically decisive local structures. We propose SonoCLIP, the first million-scale region-controllable fetal ultrasound vision-language foundation model that integrates segmentation masks as mask-channel visual prompts within the vision encoder, enabling joint global-local contrastive representation learning. To support scalable region-text alignment, we introduce a sigmoid-based pairwise contrastive loss that improves stability under large-scale supervision. We further curate a 1.44M-image multimodal fetal ultrasound dataset spanning 24 standard planes for large-scale pretraining. Extensive cross-center evaluations demonstrate that SonoCLIP achieves superior zero-shot transfer performance under both global and mask-guided inference, establishing a controllable and clinically oriented foundation model for fetal ultrasound analysis. Our code and data are available at https://github.com/Harrison-one/SonoCLIP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SonoCLIP adds mask channels to the vision encoder plus a sigmoid pairwise loss on 1.44M fetal US images, but the abstract shows no numbers to support the superiority claim.

The main thing to know is that SonoCLIP feeds segmentation masks as an extra channel into the vision encoder and trains with a sigmoid-based pairwise contrastive loss on a new 1.44M-image fetal ultrasound collection spanning 24 planes. That specific combination for region-controllable pretraining does not appear in the prior work referenced.

What the paper does reasonably is target the actual pain points in fetal ultrasound: speckle noise, high acquisition variability, and the need to focus on local anatomy rather than global image-text alignment. Using masks to drive joint global-local contrastive learning is a straightforward way to make the model more clinically usable. The claim that the new loss improves stability at scale is plausible on its face, and the cross-center evaluation setup avoids obvious circularity.

The clear soft spot is the complete absence of any metrics, baselines, or statistical details in the abstract. It asserts superior zero-shot transfer under both global and mask-guided inference but gives nothing to check that against. Without those numbers the central claim stays untestable from the provided text.

This is for groups already working on medical vision-language models or ultrasound-specific foundation models. A reader who wants concrete ideas for mask prompting or large-scale curation in this domain could extract value from the method once the full results are visible.

It is worth sending to peer review. The data scale and the architectural choices are substantive enough that referees should see the full experiments and decide whether the performance gains hold up.

Referee Report

0 major / 1 minor

Summary. SonoCLIP is proposed as a mask-guided region-aware vision-language pretraining model for fetal ultrasound analysis. It integrates segmentation masks as additional channels in the vision encoder to enable controllable global-local contrastive learning on a curated 1.44 million image dataset spanning 24 standard planes. A sigmoid-based pairwise contrastive loss is introduced to improve training stability at scale. The model is claimed to achieve superior zero-shot transfer performance on cross-center evaluations using both global and mask-guided inference.

Significance. If the reported performance advantages hold under rigorous evaluation, this work could significantly advance the development of foundation models for ultrasound imaging by providing region-controllable representations that are better suited to the clinical needs of fetal analysis. The public availability of the code and dataset is a strength that supports further research in the field.

minor comments (1)

[Abstract] The abstract asserts superior zero-shot transfer performance without providing any quantitative metrics, specific baselines, or statistical details. Including key results would strengthen the summary of the contributions.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of SonoCLIP and the recommendation for minor revision. We appreciate the acknowledgment of the model's potential impact on foundation models for ultrasound imaging and the value of releasing code and data.

Circularity Check

0 steps flagged

No significant circularity

The paper's core contribution is empirical: curation of a 1.44M-image fetal ultrasound dataset, integration of mask-channel prompts into a CLIP-style vision encoder, and introduction of a sigmoid pairwise contrastive loss for region-text alignment. All claims of superior zero-shot performance rest on cross-center evaluations rather than any derivation that reduces to fitted parameters or self-citations by construction. No equations, uniqueness theorems, or ansatzes are shown to be self-referential; the argument is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach relies on standard assumptions from contrastive learning and the utility of mask prompts; no new entities postulated, and no free parameters explicitly fitted beyond typical training hyperparameters not detailed here.

axioms (2)

domain assumption: Vision-language contrastive learning can produce useful representations for downstream tasks.
Core assumption of CLIP-style models, invoked implicitly.
domain assumption: Segmentation masks can be effectively integrated as additional channels in the vision encoder without degrading performance.
Key to the mask-guided approach described.
