
arxiv: 2605.02126 · v1 · submitted 2026-05-04 · 💻 cs.CV · cs.LG

Ultrasound Vision-Language Alignment via Contrastive Learning

Pith reviewed 2026-05-09 17:02 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords: ultrasound, vision-language alignment, contrastive learning, CLIP, medical imaging, zero-shot classification, image-text pairs, domain adaptation

The pith

Ultrasound images align with clinical text reports via contrastive learning on public data alone, though full fine-tuning harms transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a dual-encoder contrastive model can map ultrasound images of the breast, liver, lung, and thyroid to matching clinical text in a shared embedding space, using a corpus of over 16,000 pairs drawn entirely from public sources. This matters for medical imaging because ultrasound foundation models have so far been vision-only, which limits zero-shot and few-shot transfer to new tasks where labeled data is scarce. The authors report higher alignment scores than the OpenAI CLIP and BiomedCLIP baselines, yet also find that stronger alignment does not reliably improve zero-shot classification on external datasets and that template captions match or outperform LLM-generated ones.

Core claim

EchoCare-CLIP is a CLIP-style dual-encoder trained with a contrastive loss on a multi-organ ultrasound corpus of more than 16K image-text pairs, with over 78% of captions derived from expert-annotated reports. The best configuration reaches a paired alignment score of 0.682, above both baselines. Zero-shot classification on the external held-out BUSI and AULI datasets peaks at 0.709 and 0.626 respectively, and only for CLIP-based variants with partial fine-tuning; full end-to-end fine-tuning degrades transfer through overfitting. Template-based captions match or outperform LLM-generated ones, suggesting that lexical diversity is not a proxy for caption quality.
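Zero-shot classification with a dual-encoder usually means embedding one text prompt per class and assigning each image to the class whose prompt embedding is most similar. The following is a minimal sketch of that idea, not the authors' implementation: the function names, the two-dimensional toy embeddings, and the class labels are all hypothetical, and a real pipeline would obtain the embeddings from the trained image and text encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(image_emb, prompt_embs):
    """Return the label whose prompt embedding is closest to the image.

    prompt_embs maps a class label to the embedding of its text prompt.
    """
    scores = {label: cosine(image_emb, emb) for label, emb in prompt_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores


def accuracy(predictions, labels):
    """Fraction of predictions that equal the reference labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Toy, hand-written embeddings (assumption: stand-ins for encoder outputs).
prompt_embs = {"benign": [1.0, 0.0], "malignant": [0.0, 1.0]}
predicted_label, similarity_scores = zero_shot_predict([0.9, 0.2], prompt_embs)
```

Because no classifier head is trained, the result depends entirely on how well the two encoders agree in the shared space, which is why the page treats zero-shot accuracy as a test of transfer rather than of fit.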

What carries the argument

The dual-encoder contrastive framework that pulls matched ultrasound-image and text embeddings together while repelling unmatched pairs in a shared space.
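The pull-together, push-apart behaviour described above is commonly implemented as a symmetric cross-entropy over a batch similarity matrix, as in CLIP. The sketch below illustrates that generic loss in plain Python. It is an assumption that the paper uses this exact form; the function names and the temperature of 0.07 are illustrative choices, not values taken from the source.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax_at(logits, target):
    """Numerically stable log-softmax evaluated at index `target`."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[target] - log_z


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """CLIP-style loss: row i of each list is assumed to be a matched pair."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Temperature-scaled cosine similarity between every image and every text.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax_at(sim[k], k) for k in range(n)) / n
    # Text-to-image direction: each column should peak on its diagonal entry.
    t2i = -sum(log_softmax_at([sim[r][k] for r in range(n)], k) for k in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy batch: matched pairs give a near-zero loss, swapped pairs a large one.
aligned_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Every other item in the batch acts as a negative, so the loss needs no explicit negative mining.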

If this is right

  • Partial fine-tuning of CLIP-based variants delivers the strongest zero-shot performance on the external breast (BUSI) and liver (AULI) ultrasound sets.
  • Template-generated captions support alignment and transfer at least as well as LLM-generated captions.
  • Full end-to-end fine-tuning causes overfitting that reduces generalization on held-out clinical data.
  • Alignment scores and downstream transfer performance are imperfectly correlated, with rankings varying by dataset on linear probing and few-shot tasks.
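The caption comparison in the list above turns on lexical diversity. One simple way to quantify it is the type-token ratio: distinct words divided by total words across a caption set. The sketch below assumes that measure for illustration only; this page does not say which diversity statistic the paper uses, and the captions and function name are hypothetical.

```python
import re


def type_token_ratio(captions):
    """Distinct lowercase word tokens divided by total tokens across all captions."""
    tokens = [tok for cap in captions for tok in re.findall(r"[a-z0-9]+", cap.lower())]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Hypothetical captions, written for this example and not drawn from the corpus.
template_like = ["ultrasound image of the breast", "ultrasound image of the liver"]
varied = ["breast sonogram showing a hypoechoic mass", "liver scan with increased echogenicity"]

template_ttr = type_token_ratio(template_like)
varied_ttr = type_token_ratio(varied)
```

A higher ratio only indicates a more varied vocabulary. The finding reported on this page is that such variety did not translate into better supervision.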

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Simpler captions may better preserve precise medical terminology than more varied LLM text, reducing noise in supervision.
  • The same contrastive alignment recipe could be tested on other public medical imaging modalities that already have report text.
  • Balancing domain adaptation against encoder capacity may become a standard hyper-parameter choice when building medical vision-language models.

Load-bearing premise

The 16K public image-text pairs and the two specific external test sets capture enough real-world clinical variability in scanners, protocols, and reporting styles for the observed trade-offs to hold more broadly.

What would settle it

Zero-shot classification accuracy on a fresh ultrasound dataset collected at a different hospital or with different equipment falls well below the 0.626–0.709 range reported on BUSI and AULI.

Figures

Figures reproduced from arXiv: 2605.02126 by Ruirui Lan, Tongxin Wang, Yiyang Zhang, Zhuoyang Lyu.

Figure 1: Zero-shot classification accuracy across anatomical domains on the internal held-out test ...
Figure 2: Qualitative comparison of zero-shot predictions from EchoCare-CLIP (BioClinicalBERT ...
Figure 3: Zero-shot classification accuracy on the BUSI (a) and AULI (b) held-out test sets, with all ...
Figure 4: Zero-shot classification accuracy on the BUSI (a) and AULI (b) held-out test sets, grouped ...
Original abstract

Ultrasound foundation models have achieved strong performance on structured prediction tasks but remain exclusively vision-based, limiting zero-shot and few-shot transfer to novel tasks where task-specific annotation is scarce. We address this gap with EchoCare-CLIP, a CLIP-style dual-encoder contrastive framework that aligns ultrasound images with clinical text in a shared embedding space. We curate a multi-organ corpus of over 16K image-text pairs spanning breast, liver, lung, and thyroid, with over 78% of captions derived from expert-annotated reports, and complement the remainder with a three-tier template-based and LLM-based caption generation pipeline. We evaluate model configurations spanning two text encoder families (CLIP, BioClinicalBERT) and two caption strategies (template-based, LLM-generated) against OpenAI CLIP and BiomedCLIP baselines. Our trained models consistently improve cross-modal alignment over baselines, with the best configuration achieving a paired alignment score of 0.682. However, stronger alignment does not guarantee better downstream performance: CLIP-based variants with partial fine-tuning achieve the strongest zero-shot classification on external held-out datasets (0.709 on BUSI; 0.626 on AULI), while full end-to-end fine-tuning degrades transfer due to overfitting. On linear probing and few-shot adaptation, model rankings are dataset-dependent, reflecting a trade-off between domain adaptation and representational generalizability. We further show that template-based captions match or outperform LLM-generated captions, suggesting lexical diversity is not a proxy for caption quality. Taken together, our results demonstrate that ultrasound vision-language alignment is achievable from public data alone, but robust clinical transfer requires careful balancing of domain adaptation, encoder capacity, and caption supervision quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript describes EchoCare-CLIP, a CLIP-style dual-encoder contrastive learning framework designed to align ultrasound images with corresponding clinical text descriptions. The authors curate a dataset of over 16,000 multi-organ (breast, liver, lung, thyroid) image-text pairs, of which over 78% come from expert-annotated reports and the rest are generated via templates and LLMs. They experiment with two text encoder families (CLIP and BioClinicalBERT) and two caption strategies, evaluating against OpenAI CLIP and BiomedCLIP. Key findings are that alignment scores improve (up to 0.682) but are decoupled from downstream performance: partial fine-tuning yields the best zero-shot results on BUSI (0.709) and AULI (0.626), full fine-tuning overfits, and template captions perform on par with LLM-generated ones. The work concludes that ultrasound vision-language alignment can be achieved with public data, yet robust transfer requires balancing domain adaptation, model capacity, and supervision quality.

Significance. If these empirical results hold under broader validation, the paper makes a meaningful contribution by establishing the feasibility of vision-language pretraining for ultrasound using readily available public data. It provides concrete evidence of trade-offs between alignment strength and transferability, which is valuable for practitioners building clinical AI systems with limited labeled data. The evaluation on independent external datasets (BUSI, AULI) and the systematic comparison of multiple model configurations add credibility. Additionally, the observation that lexical diversity from LLMs does not necessarily improve caption quality for this task is a useful negative result. This could guide the development of more reliable multimodal models in medical imaging domains.

major comments (2)
  1. [Results and Discussion] The central observation that stronger cross-modal alignment (0.682) does not guarantee superior zero-shot transfer (comparing to 0.709 BUSI and 0.626 AULI) is key to the balancing claim. However, these metrics appear to be from single runs without reported standard deviations or statistical significance tests across multiple seeds, which could affect the reliability of the ranking between configurations.
  2. [Dataset Curation and Limitations] The recommendation for balancing domain adaptation, encoder capacity, and caption quality to achieve robust clinical transfer assumes that the 16K-pair corpus and the BUSI/AULI held-out sets adequately represent clinical variability. The manuscript would benefit from a dedicated limitations paragraph quantifying potential shifts (e.g., scanner types, patient populations) or including sensitivity analysis to support generalizability of the trade-offs.
minor comments (3)
  1. [Abstract] Please define the 'paired alignment score' explicitly, including the formula or retrieval protocol used to compute the 0.682 value.
  2. [Methods] The description of the three-tier caption generation pipeline lacks sufficient detail for exact reproduction; consider adding an algorithm box or pseudocode.
  3. [Experiments] Ensure that all baseline implementations (OpenAI CLIP, BiomedCLIP) are described with the same fine-tuning protocols as the proposed models for fair comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and insightful comments, which help clarify the contributions and limitations of our work. We address each major comment below and indicate planned revisions to the manuscript.

  1. Referee: [Results and Discussion] The central observation that stronger cross-modal alignment (0.682) does not guarantee superior zero-shot transfer (comparing to 0.709 BUSI and 0.626 AULI) is key to the balancing claim. However, these metrics appear to be from single runs without reported standard deviations or statistical significance tests across multiple seeds, which could affect the reliability of the ranking between configurations.

    Authors: We agree that reporting variability is essential to substantiate the central claim regarding the decoupling of alignment strength and transfer performance. In the revised manuscript, we will rerun key experiments across multiple random seeds (at least three) and report mean values with standard deviations for both the paired alignment score and zero-shot classification accuracies on BUSI and AULI. We will also include appropriate statistical significance tests (e.g., paired t-tests with p-values) to support the observed rankings and trade-offs between configurations such as partial versus full fine-tuning.
    revision: yes

  2. Referee: [Dataset Curation and Limitations] The recommendation for balancing domain adaptation, encoder capacity, and caption quality to achieve robust clinical transfer assumes that the 16K-pair corpus and the BUSI/AULI held-out sets adequately represent clinical variability. The manuscript would benefit from a dedicated limitations paragraph quantifying potential shifts (e.g., scanner types, patient populations) or including sensitivity analysis to support generalizability of the trade-offs.

    Authors: We appreciate the emphasis on generalizability. We will add a dedicated Limitations section that quantifies known aspects of our 16K-pair corpus (organ distribution, proportion of expert reports versus templates/LLMs) and the external datasets (BUSI, AULI), while explicitly discussing potential domain shifts in scanner types, imaging protocols, and patient populations based on available metadata. We will note that a full sensitivity analysis across all possible clinical variabilities is not feasible with the current data but will frame our balancing recommendations within these acknowledged constraints and suggest it as future work.
    revision: partial
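The multi-seed reporting promised in the first response could be computed roughly as follows. This is a minimal sketch using only the Python standard library: the function names are hypothetical, and the per-seed accuracies are made-up placeholders, not results from the paper. It returns the paired t statistic and degrees of freedom; a p-value would then come from a t distribution, for example via `scipy.stats.ttest_rel`.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation over per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for runs that share seeds."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Placeholder accuracies for three seeds (assumption: invented for illustration).
partial_ft = [0.70, 0.72, 0.71]
full_ft = [0.65, 0.66, 0.68]

partial_mean, partial_sd = summarize(partial_ft)
t_stat, dof = paired_t_statistic(partial_ft, full_ft)
```

With only three seeds there are two degrees of freedom, so a significance test of this kind has little power and the reported standard deviations carry most of the information.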

Circularity Check

0 steps flagged

No circularity; empirical training and held-out evaluation are independent

The paper trains a standard CLIP-style contrastive model on a curated 16K multi-organ image-text corpus and reports alignment scores plus downstream metrics on two external held-out datasets (BUSI, AULI). No equations or claims reduce reported alignment, zero-shot, linear-probing, or few-shot numbers to quantities fitted on the same data used for the central claim. Baselines are external (OpenAI CLIP, BiomedCLIP); caption generation is described as a pipeline but not used to tautologically define success. No self-citation chains, uniqueness theorems, or ansatzes imported from the authors' prior work appear as load-bearing steps. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are introduced beyond standard contrastive learning assumptions and the use of pre-trained encoders. The work relies on the standard InfoNCE-style loss and the assumption that image-text pairs from public sources are sufficient for alignment.
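For reference, a symmetric InfoNCE-style objective over a batch of N matched pairs is usually written as below. The notation is an assumption made for this sketch and is not taken from the paper: u_i and v_i denote the image and text embeddings of pair i, s_ij is their cosine similarity, and τ is a temperature.

```latex
\[
\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[
  \log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ij}/\tau)}
  + \log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ji}/\tau)}
\right],
\qquad
s_{ij} = \frac{u_i^{\top} v_j}{\lVert u_i\rVert\,\lVert v_j\rVert}
\]
```

The first term scores each image against all texts in the batch and the second scores each text against all images. If τ is learned or tuned, it is the kind of quantity a free-parameter ledger would normally count.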

