Ultrasound Vision-Language Alignment via Contrastive Learning

Ruirui Lan; Tongxin Wang; Yiyang Zhang; Zhuoyang Lyu

arxiv: 2605.02126 · v1 · submitted 2026-05-04 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-09 17:02 UTC · model grok-4.3

keywords: ultrasound · vision-language alignment · contrastive learning · CLIP · medical imaging · zero-shot classification · image-text pairs · domain adaptation

The pith

Ultrasound images align with clinical text reports via contrastive learning on public data alone, though full fine-tuning harms transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a dual-encoder contrastive model can map ultrasound images from breast, liver, lung, and thyroid to matching clinical text in a shared space, using a corpus of over 16,000 pairs drawn entirely from public sources. This matters for medical imaging because most ultrasound foundation models today are vision-only and cannot handle new tasks without fresh labeled data. The authors demonstrate higher alignment scores than OpenAI CLIP and BiomedCLIP baselines, yet also find that stronger alignment does not reliably improve zero-shot classification on external datasets and that simpler template captions work as well as LLM-generated ones.

Core claim

EchoCare-CLIP is a CLIP-style dual-encoder trained with contrastive loss on a multi-organ ultrasound corpus of more than 16K image-text pairs, over 78 percent drawn from expert-annotated reports. The model lifts paired alignment to 0.682, outperforming baselines, but zero-shot classification on held-out BUSI and AULI datasets peaks at 0.709 and 0.626 respectively only with partial fine-tuning of CLIP-based text encoders; full end-to-end training degrades results through overfitting. Template-based captions prove equal or superior to LLM-generated ones, indicating that lexical diversity is not required for effective supervision.

What carries the argument

The dual-encoder contrastive framework that pulls matched ultrasound-image and text embeddings together while repelling unmatched pairs in a shared space.
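The pull-together, push-apart objective described above can be illustrated with a symmetric InfoNCE-style loss. The following is a minimal standard-library sketch, not the authors' implementation: the function names are hypothetical, the embeddings are plain lists of floats, and the temperature of 0.07 is an assumption borrowed from the original CLIP initialisation rather than a value reported for EchoCare-CLIP.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    top = max(row)
    log_sum = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss; image_emb[i] and text_emb[i] are a matched pair.

    temperature=0.07 is an assumed default, not a value taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j] = cosine(image i, text j) / temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same check applied to the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)
```

With perfectly matched unit vectors the loss approaches zero, and with every pair swapped it becomes large, which is the behaviour the contrastive framing relies on.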

If this is right

Partial fine-tuning of CLIP-based text encoders delivers the strongest zero-shot performance on the external breast (BUSI) and liver (AULI) ultrasound sets.
Template-generated captions support alignment and transfer at least as well as LLM-generated captions.
Full end-to-end fine-tuning causes overfitting that reduces generalization on held-out clinical data.
Alignment scores and downstream transfer performance are imperfectly correlated, with rankings varying by dataset on linear probing and few-shot tasks.
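The zero-shot results in the list above rest on a simple mechanism: each class is described by a text prompt, and an image is assigned to the class whose prompt embedding is closest in the shared space. Here is a minimal sketch of that procedure, assuming the embeddings have already been produced by some image and text encoder. The function names and the label strings used to exercise it are hypothetical, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(image_emb, prompt_embs):
    """Return the label whose prompt embedding is most similar to the image.

    prompt_embs maps a class label to the text embedding of its prompt.
    """
    scores = {label: cosine(image_emb, emb) for label, emb in prompt_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores


def zero_shot_accuracy(image_embs, labels, prompt_embs):
    """Fraction of images whose nearest prompt matches the reference label."""
    correct = sum(
        zero_shot_predict(emb, prompt_embs)[0] == label
        for emb, label in zip(image_embs, labels)
    )
    return correct / len(labels)
```

No classifier is trained in this setup, so accuracy depends entirely on how well the two encoders agree on held-out data. That is why a model can score well on alignment over its training distribution and still transfer poorly.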

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Simpler captions may better preserve precise medical terminology than more varied LLM text, reducing noise in supervision.
The same contrastive alignment recipe could be tested on other public medical imaging modalities that already have report text.
Balancing domain adaptation against encoder capacity may become a standard hyper-parameter choice when building medical vision-language models.

Load-bearing premise

The 16K public image-text pairs and the two specific external test sets capture enough real-world clinical variability in scanners, protocols, and reporting styles for the observed trade-offs to hold more broadly.

What would settle it

Zero-shot classification accuracy on a fresh ultrasound dataset collected at a different hospital or with different equipment falls well below the 0.626–0.709 range reported on BUSI and AULI.

Figures

Figures reproduced from arXiv: 2605.02126 by Ruirui Lan, Tongxin Wang, Yiyang Zhang, Zhuoyang Lyu.

**Figure 1.** Zero-shot classification accuracy across anatomical domains on the internal held-out test …

**Figure 2.** Qualitative comparison of zero-shot predictions from EchoCare-CLIP (BioClinicalBERT …

**Figure 3.** Zero-shot classification accuracy on the BUSI (a) and AULI (b) held-out test sets, with all …

**Figure 4.** Zero-shot classification accuracy on the BUSI (a) and AULI (b) held-out test sets, grouped …

Original abstract

Ultrasound foundation models have achieved strong performance on structured prediction tasks but remain exclusively vision-based, limiting zero-shot and few-shot transfer to novel tasks where task-specific annotation is scarce. We address this gap with EchoCare-CLIP, a CLIP-style dual-encoder contrastive framework that aligns ultrasound images with clinical text in a shared embedding space. We curate a multi-organ corpus of over 16K image-text pairs spanning breast, liver, lung, and thyroid, with over 78% of captions derived from expert-annotated reports, and complement the remainder with a three-tier template-based and LLM-based caption generation pipeline. We evaluate model configurations spanning two text encoder families (CLIP, BioClinicalBERT) and two caption strategies (template-based, LLM-generated) against OpenAI CLIP and BiomedCLIP baselines. Our trained models consistently improve cross-modal alignment over baselines, with the best configuration achieving a paired alignment score of 0.682. However, stronger alignment does not guarantee better downstream performance: CLIP-based variants with partial fine-tuning achieve the strongest zero-shot classification on external held-out datasets (0.709 on BUSI; 0.626 on AULI), while full end-to-end fine-tuning degrades transfer due to overfitting. On linear probing and few-shot adaptation, model rankings are dataset-dependent, reflecting a trade-off between domain adaptation and representational generalizability. We further show that template-based captions match or outperform LLM-generated captions, suggesting lexical diversity is not a proxy for caption quality. Taken together, our results demonstrate that ultrasound vision-language alignment is achievable from public data alone, but robust clinical transfer requires careful balancing of domain adaptation, encoder capacity, and caption supervision quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows you can train a decent ultrasound CLIP model from public data with a new 16k multi-organ corpus, but the real contribution is the empirical evidence that stronger alignment does not reliably improve downstream transfer.

The main takeaway is that EchoCare-CLIP gets solid cross-modal alignment on ultrasound using public data alone, yet the experiments reveal a clear trade-off: better alignment scores do not guarantee stronger zero-shot or few-shot performance on external sets. They built a 16k image-text corpus spanning breast, liver, lung, and thyroid, with over 78% of captions derived from expert reports and the rest from templates or LLMs, then trained dual-encoder models with contrastive loss. Configurations using CLIP and BioClinicalBERT text encoders beat the OpenAI CLIP and BiomedCLIP baselines on alignment (best at 0.682), and partial fine-tuning variants reached 0.709 on BUSI and 0.626 on AULI for zero-shot classification. Full end-to-end fine-tuning hurt transfer due to overfitting, and template captions performed as well as LLM ones. Linear probing and few-shot results varied by dataset, which the authors flag as a domain-adaptation versus generalizability issue.

This is useful incremental work. The dataset curation and the explicit demonstration of the alignment-transfer mismatch are new for ultrasound, and the comparisons across encoder families and caption strategies are cleanly executed with held-out metrics. The observation that lexical diversity from LLMs does not help is a practical note worth having.

The main limitation is external validity. The 16k corpus and the BUSI/AULI test sets may not span the full range of clinical ultrasound variability in equipment, protocols, demographics, or pathologies, so the reported trade-offs could be specific to these collections rather than general guidance. The paper is honest about dataset-dependent rankings, but that does not remove the need for broader testing.

This is for people working on medical vision-language models or label-efficient ultrasound interpretation. Readers who care about practical transfer in medical AI will find the numbers and the cautionary findings useful. It is coherent and engages the literature without circular claims, so it deserves a serious referee. I would send it for review with a request for more discussion of data representativeness and perhaps one additional external validation set.

Referee Report

2 major / 3 minor

Summary. The manuscript describes EchoCare-CLIP, a CLIP-style dual-encoder contrastive learning framework designed to align ultrasound images with corresponding clinical text descriptions. The authors curate a dataset of over 16,000 multi-organ (breast, liver, lung, thyroid) image-text pairs, of which over 78% have captions derived from expert reports and the remainder are generated via templates and LLMs. They experiment with different text encoders (CLIP and BioClinicalBERT) and caption strategies, evaluating against OpenAI CLIP and BiomedCLIP. Key findings include improved alignment scores (up to 0.682) that are nonetheless decoupled from downstream performance: partial fine-tuning yields the best zero-shot results on BUSI (0.709) and AULI (0.626), full fine-tuning causes overfitting, and template captions perform on par with LLM-generated ones. The work concludes that ultrasound vision-language alignment can be achieved with public data, yet robust transfer necessitates balancing domain adaptation, model capacity, and supervision quality.

Significance. If these empirical results hold under broader validation, the paper makes a meaningful contribution by establishing the feasibility of vision-language pretraining for ultrasound using readily available public data. It provides concrete evidence of trade-offs between alignment strength and transferability, which is valuable for practitioners building clinical AI systems with limited labeled data. The evaluation on independent external datasets (BUSI, AULI) and the systematic comparison of multiple model configurations add credibility. Additionally, the observation that lexical diversity from LLMs does not necessarily improve caption quality for this task is a useful negative result. This could guide the development of more reliable multimodal models in medical imaging domains.

major comments (2)

[Results and Discussion] The central observation that stronger cross-modal alignment (0.682) does not guarantee superior zero-shot transfer (comparing to 0.709 BUSI and 0.626 AULI) is key to the balancing claim. However, these metrics appear to be from single runs without reported standard deviations or statistical significance tests across multiple seeds, which could affect the reliability of the ranking between configurations.
[Dataset Curation and Limitations] The recommendation for balancing domain adaptation, encoder capacity, and caption quality to achieve robust clinical transfer assumes that the 16K-pair corpus and the BUSI/AULI held-out sets adequately represent clinical variability. The manuscript would benefit from a dedicated limitations paragraph quantifying potential shifts (e.g., scanner types, patient populations) or including sensitivity analysis to support generalizability of the trade-offs.

minor comments (3)

[Abstract] Please define the 'paired alignment score' explicitly, including the formula or retrieval protocol used to compute the 0.682 value.
[Methods] The description of the three-tier caption generation pipeline lacks sufficient detail for exact reproduction; consider adding an algorithm box or pseudocode.
[Experiments] Ensure that all baseline implementations (OpenAI CLIP, BiomedCLIP) are described with the same fine-tuning protocols as the proposed models for fair comparison.
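The first minor comment asks how the paired alignment score is computed, and the summary shown here does not say. Purely as an illustration of what such a definition could look like, the sketch below computes two common candidates over matched pairs: the mean cosine similarity of each image with its own caption, and top-1 image-to-text retrieval recall. Both are assumptions, and neither is claimed to be the metric behind the reported 0.682. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_paired_cosine(image_embs, text_embs):
    """Candidate 1 (assumed): average similarity of each image with its own caption."""
    sims = [cosine(img, txt) for img, txt in zip(image_embs, text_embs)]
    return sum(sims) / len(sims)


def recall_at_1(image_embs, text_embs):
    """Candidate 2 (assumed): share of images whose nearest caption is the matched one."""
    hits = 0
    for idx, img in enumerate(image_embs):
        sims = [cosine(img, txt) for txt in text_embs]
        nearest = max(range(len(sims)), key=sims.__getitem__)
        if nearest == idx:
            hits += 1
    return hits / len(image_embs)
```

The two candidates can disagree. A model may retrieve the right caption every time while the absolute similarity stays modest, so stating the exact protocol matters for interpreting the number.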

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and insightful comments, which help clarify the contributions and limitations of our work. We address each major comment below and indicate planned revisions to the manuscript.

Referee: [Results and Discussion] The central observation that stronger cross-modal alignment (0.682) does not guarantee superior zero-shot transfer (comparing to 0.709 BUSI and 0.626 AULI) is key to the balancing claim. However, these metrics appear to be from single runs without reported standard deviations or statistical significance tests across multiple seeds, which could affect the reliability of the ranking between configurations.

Authors: We agree that reporting variability is essential to substantiate the central claim regarding the decoupling of alignment strength and transfer performance. In the revised manuscript, we will rerun key experiments across multiple random seeds (at least three) and report mean values with standard deviations for both the paired alignment score and zero-shot classification accuracies on BUSI and AULI. We will also include appropriate statistical significance tests (e.g., paired t-tests with p-values) to support the observed rankings and trade-offs between configurations such as partial versus full fine-tuning.

Revision: yes

Referee: [Dataset Curation and Limitations] The recommendation for balancing domain adaptation, encoder capacity, and caption quality to achieve robust clinical transfer assumes that the 16K-pair corpus and the BUSI/AULI held-out sets adequately represent clinical variability. The manuscript would benefit from a dedicated limitations paragraph quantifying potential shifts (e.g., scanner types, patient populations) or including sensitivity analysis to support generalizability of the trade-offs.

Authors: We appreciate the emphasis on generalizability. We will add a dedicated Limitations section that quantifies known aspects of our 16K-pair corpus (organ distribution, proportion of expert reports versus templates/LLMs) and the external datasets (BUSI, AULI), while explicitly discussing potential domain shifts in scanner types, imaging protocols, and patient populations based on available metadata. We will note that a full sensitivity analysis across all possible clinical variabilities is not feasible with the current data but will frame our balancing recommendations within these acknowledged constraints and suggest it as future work.

Revision: partial
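The multi-seed reporting promised in the first response (mean, standard deviation, and a paired t-test between configurations) can be sketched with the Python standard library. This is only an illustration under stated assumptions: the function names are hypothetical, and the accuracy values in the docstring example are made-up placeholders, not results from the paper.

```python
import math
import statistics


def summarize(values):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for two configurations.

    scores_a[k] and scores_b[k] must come from the same seed k.
    Example with placeholder values (not from the paper):
        paired_t_statistic([0.70, 0.72, 0.71], [0.65, 0.66, 0.69])
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)
    # Standard error of the mean difference across seeds.
    std_err = sd_diff / math.sqrt(n)
    return mean_diff / std_err, n - 1
```

With only three seeds the test has two degrees of freedom, so the two-sided 5% critical value is about 4.30 and fairly large differences can remain non-significant. A p-value can be obtained from the t distribution, for example with `scipy.stats.ttest_rel`, if SciPy is available.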

Circularity Check

0 steps flagged

No circularity; empirical training and held-out evaluation are independent

The paper trains a standard CLIP-style contrastive model on a curated 16K multi-organ image-text corpus and reports alignment scores plus downstream metrics on two external held-out datasets (BUSI, AULI). No equations or claims reduce reported alignment, zero-shot, linear-probing, or few-shot numbers to quantities fitted on the same data used for the central claim. Baselines are external (OpenAI CLIP, BiomedCLIP); caption generation is described as a pipeline but not used to tautologically define success. No self-citation chains, uniqueness theorems, or ansatzes imported from the authors' prior work appear as load-bearing steps. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are introduced beyond standard contrastive learning assumptions and the use of pre-trained encoders. The work relies on the standard InfoNCE-style loss and the assumption that image-text pairs from public sources are sufficient for alignment.

