Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis

arxiv: 2506.08849 · v4 · submitted 2025-06-10 · 💻 cs.CV

Jingguo Qu, Xinyang Han, Jia Ai, Juan Wu, Tong Zhao, Tonghuan Xiao, Sheng Ning, Yuqi Yang, Jing Qin, Ann Dorothy King, Winnie Chiu-Wing Chu, Jing Cai, Michael Tin-Cheung Ying

Pith reviewed 2026-05-19 10:30 UTC · model grok-4.3

classification 💻 cs.CV

keywords hybrid tuning · vision language models · ultrasound image analysis · frequency filtering · noise estimation · parameter efficient adaptation · medical imaging · few-shot learning

The pith

Freezing CLIP's visual backbone and adding a frequency-filtering noise-estimating adapter bridges the ultrasound modality gap for better medical image analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language foundation models struggle with ultrasound images due to unique acoustic effects like speckle noise and shadowing that differ from natural images. The paper introduces Hybrid Tuning, which keeps the pre-trained visual encoder frozen to retain its semantic knowledge and attaches a lightweight adapter equipped with a frequency filtering module to remove periodic artifacts and a noise estimation module to adjust features dynamically. Tests across six datasets show this method beats current adaptation techniques in segmentation and classification, performing well even with few training examples and generalizing across different data sources. This indicates that directly addressing ultrasound physics on top of existing models can enable their use in automated diagnosis without full retraining.

Core claim

The Hybrid Tuning strategy freezes the pre-trained visual backbone of CLIP-based models and integrates a specialized lightweight adapter with two parts: a Frequency Filtering module that suppresses domain-specific periodic artifacts and a Noise Estimation module that dynamically calibrates feature representations. The paper claims this bridges the modality gap in medical ultrasound imaging, yielding significant performance gains over state-of-the-art adapters in segmentation and classification tasks while demonstrating strong few-shot efficiency and cross-dataset generalization.

What carries the argument

The Hybrid Tuning (HT) adapter consisting of a Frequency Filtering module and a Noise Estimation module applied to the frozen visual backbone of a vision-language foundation model.
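
To make the two-module idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the 1-D low-pass mask, the residual-energy noise estimate, the `1 / (1 + sigma)` gate, and all function names are assumptions chosen for illustration. The only points taken from the text are that a frequency filter and a noise estimate act on features from a frozen backbone, which is why the input features are never modified in place.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

def low_pass(x, cutoff):
    """Hypothetical frequency filter: zero every bin above `cutoff`."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        if min(k, n - k) > cutoff:  # distance of bin k from the DC bin
            spectrum[k] = 0
    return idft(spectrum)

def noise_gate(x, filtered):
    """Hypothetical noise estimate: RMS of the removed component, as a gate."""
    residual = [a - b for a, b in zip(x, filtered)]
    sigma = math.sqrt(sum(r * r for r in residual) / len(residual))
    return 1.0 / (1.0 + sigma)

def adapter(features, cutoff=1):
    """Add a gated, filtered copy to the (unchanged) backbone features."""
    filtered = low_pass(features, cutoff)
    gate = noise_gate(features, filtered)
    return [f + gate * g for f, g in zip(features, filtered)]

# Toy feature row: a constant level plus a period-2 artifact (made-up values).
row = [1.5, 0.5] * 4
smoothed = low_pass(row, cutoff=1)
```

On the toy row, the period-2 component sits in the highest frequency bin, so the low-pass step recovers the constant level and the gate shrinks as the removed energy grows.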

If this is right

HT-enhanced models outperform existing state-of-the-art adapters and medical VLFMs in segmentation and classification across six multi-center datasets.
HT shows exceptional data efficiency in few-shot learning scenarios.
HT provides robust cross-dataset generalization capabilities.
Preserving pre-trained semantic priors while explicitly modeling ultrasound-specific noise unlocks the use of foundational models in automated ultrasound diagnosis.
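
The few-shot claim is easier to evaluate with a concrete picture of how a k-shot training subset is usually drawn. The sketch below assumes a class-balanced sampler with a fixed seed. The function name, the seed handling, and the example labels are hypothetical, and the paper's actual sampling protocol is not described in this summary.

```python
import random
from collections import defaultdict

def sample_k_shot(labels, k, seed=0):
    """Return sorted indices containing exactly k examples per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    chosen = []
    for label in sorted(by_class):
        pool = by_class[label]
        if len(pool) < k:
            raise ValueError(f"class {label!r} has fewer than {k} examples")
        chosen.extend(rng.sample(pool, k))
    return sorted(chosen)

# Hypothetical label list, for illustration only.
labels = ["benign"] * 5 + ["malignant"] * 3
subset = sample_k_shot(labels, k=2)
```

Reporting the seed and the exact k alongside results is what lets a reader reproduce a few-shot comparison.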

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method suggests that similar lightweight adapters could adapt foundation models to other medical imaging modalities with distinct noise characteristics.
The parameter-efficient nature may facilitate deployment in resource-limited clinical settings.
Further work could explore combining HT with other modalities or extending to video ultrasound sequences.
Success here implies that modality-specific noise modeling is a general principle for adapting vision-language models beyond natural images.

Load-bearing premise

The assumption that freezing the pre-trained visual backbone and using a lightweight adapter with frequency filtering and noise estimation is enough to overcome the modality gap from ultrasound acoustic physics without updating the original weights.

What would settle it

A direct comparison where full fine-tuning of the backbone or alternative adapters without frequency filtering and noise estimation achieve equal or better results on the same ultrasound datasets would challenge the necessity of this specific HT approach.

Figures

Figures reproduced from arXiv: 2506.08849 by Ann Dorothy King, Jia Ai, Jing Cai, Jingguo Qu, Jing Qin, Juan Wu, Michael Tin-Cheung Ying, Sheng Ning, Tonghuan Xiao, Tong Zhao, Winnie Chiu-Wing Chu, Xinyang Han, Yuqi Yang.

**Figure 1.** Overview of proposed workflow. (a) Fine-tuning stage. Introduce trainable adapters into frozen CLIP to bridge the domain gap between natural images and ultrasound scans. (b) Downstream tasks. Apply trainable heads for ultrasound image segmentation and classification in a supervised manner (solid arrows), and assess zero-shot ultrasound diagnosis capability of CLIP by using unified prompt-image pairs (das…

**Figure 2.** Structure overview of the original LoRA, Mona and proposed ViT with …

**Figure 3.** Segmentation and classification heads. Feature Map Up-sampling. Transformer feature maps typically exhibit low spatial resolution (e.g., 14×14 for a 224×224 input with 16×16 patch size). However, ROIs can vary significantly in size across different diseases and anatomical sites, making it challenging for CLIP to capture the fine-grained details required for dense prediction tasks. To address this, we integ…

**Figure 4.** Visualization of the proposed method and SOTAs on LN-1, LN-2, BUSI […
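
The resolution figures quoted in the Figure 3 caption follow from simple arithmetic, sketched below. The nearest-neighbour up-sampling function is a generic stand-in, included only to show what raising the resolution of a patch grid means. The caption is truncated, so the up-sampling method the paper actually uses is not known here, and both function names are hypothetical.

```python
def patch_grid(image_size, patch_size):
    """Side lengths of the token grid a ViT produces for a square input."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_size // patch_size
    return side, side

def upsample_nearest(feature_map, factor):
    """Nearest-neighbour up-sampling of a 2-D list by an integer factor."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

grid = patch_grid(224, 16)          # the caption's example input
tokens = grid[0] * grid[1]          # number of patch tokens
bigger = upsample_nearest([[1, 2], [3, 4]], 2)
```

A 224×224 input with 16×16 patches gives a 14×14 grid, i.e. 196 patch tokens, which is the coarse map the caption describes as a problem for small regions of interest.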

Original abstract

Vision-Language Foundation Models (VLFMs) exhibit remarkable generalization, yet their direct application to medical ultrasound is severely hindered by a profound modality gap. The unique acoustic physics of ultrasound, characterized by speckle noise, shadowing, and heterogeneous textures, often degrades the performance of off-the-shelf VLFMs. To bridge this gap, we propose a novel Hybrid Tuning (HT) strategy for the parameter-efficient adaptation of CLIP-based models to ultrasound analysis. Instead of updating the pre-trained weights, HT freezes the visual backbone and integrates a specialized lightweight adapter. This adapter features a Frequency Filtering module to suppress domain-specific periodic artifacts and a Noise Estimation module to dynamically calibrate feature representations. Extensive evaluations across six multi-center datasets demonstrate that our HT-enhanced models significantly outperform existing state-of-the-art adapters and medical VLFMs in both segmentation and classification tasks. Notably, HT exhibits exceptional data efficiency in few-shot scenarios and robust cross-dataset generalization. Our findings prove that preserving pre-trained semantic priors while explicitly modeling ultrasound-specific noise is key to unlocking foundational intelligence in automated ultrasound diagnosis. The source code is available at https://github.com/jinggqu/NextGen-UIA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a lightweight adapter with frequency filtering and noise estimation to frozen CLIP backbones for ultrasound and reports gains on six datasets, but the evidence that this closes the full acoustic-physics gap is still thin.

The main point is that this work freezes the CLIP visual backbone and plugs in a small adapter containing a Frequency Filtering module to cut periodic artifacts plus a Noise Estimation module to adjust features on the fly. They test the resulting models on segmentation and classification across six multi-center ultrasound datasets and highlight better few-shot performance and cross-dataset stability than prior adapters or medical VLFMs. Code is released, which helps reproducibility.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Hybrid Tuning (HT), a parameter-efficient adaptation strategy for CLIP-based vision-language foundation models to medical ultrasound. The visual backbone is frozen while a lightweight adapter is added, consisting of a Frequency Filtering module to suppress domain-specific periodic artifacts and a Noise Estimation module to dynamically calibrate features. Extensive experiments on six multi-center datasets are reported to show that HT-enhanced models outperform existing state-of-the-art adapters and medical VLFMs on both segmentation and classification, with strong data efficiency in few-shot regimes and robust cross-dataset generalization. The core claim is that preserving pre-trained semantic priors while explicitly modeling ultrasound-specific noise unlocks effective foundational intelligence for automated ultrasound diagnosis. Source code is released.

Significance. If the results are substantiated, the work would offer a practical route for deploying large vision-language models on ultrasound without costly full fine-tuning, addressing a real modality gap in a data-scarce clinical domain. The open-source code release supports reproducibility and community follow-up. This could accelerate adoption of foundation-model techniques in medical ultrasound analysis, where acoustic artifacts and limited annotations are persistent challenges.

major comments (2)

§3 (HT adapter design): The central claim that freezing the backbone and applying only frequency filtering plus noise estimation suffices to bridge the modality gap rests on the assumption that ultrasound-specific effects (multiplicative speckle, depth-dependent shadowing, tissue-specific scattering) can be corrected in the adapter without altering backbone embeddings. This is load-bearing for all reported gains; the manuscript should include targeted ablations or embedding visualizations demonstrating that the adapter restores semantic utility rather than merely fitting to the six centers.
§4–5 (experimental results): The abstract and results sections assert statistically significant outperformance and few-shot robustness across six datasets, yet the provided description lacks concrete metrics, baseline details, p-values or confidence intervals, and module-level ablations. Without these, the superiority and generalization claims cannot be verified as load-bearing evidence.

minor comments (2)

Abstract: Key quantitative results (e.g., Dice scores, accuracy deltas, few-shot sample counts) should be included to allow readers to gauge the magnitude of improvement without reading the full experiments section.
Notation: Ensure consistent naming of the Frequency Filtering and Noise Estimation modules across text, figures, and equations to avoid reader confusion.
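
Since the comments above ask for Dice scores, a short reference computation may help readers interpret them. This is the standard Dice coefficient on binary masks, written as a minimal sketch. Returning 1.0 when both masks are empty is a convention assumed here, not something stated in the paper, and the example masks are made up.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for flat binary masks (0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total

# Made-up masks: the prediction covers the target plus one extra pixel.
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

Here one pixel overlaps, the prediction has two foreground pixels and the target has one, so the score is 2·1 / (2 + 1), roughly 0.667.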

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the Hybrid Tuning adapter design and the presentation of experimental results. We address each major comment below and have revised the manuscript to incorporate additional evidence and clarifications.

Referee: §3 (HT adapter design): The central claim that freezing the backbone and applying only frequency filtering plus noise estimation suffices to bridge the modality gap rests on the assumption that ultrasound-specific effects (multiplicative speckle, depth-dependent shadowing, tissue-specific scattering) can be corrected in the adapter without altering backbone embeddings. This is load-bearing for all reported gains; the manuscript should include targeted ablations or embedding visualizations demonstrating that the adapter restores semantic utility rather than merely fitting to the six centers.

Authors: We appreciate this observation, as validating the adapter's mechanism is indeed central to the contribution. The original manuscript includes module ablations in Section 4.3 that isolate the contributions of frequency filtering and noise estimation to overall performance. To directly address the request for evidence on semantic restoration, we have added t-SNE embedding visualizations in the revised Section 3.4 comparing backbone features before and after the adapter across multiple datasets. These show improved alignment with semantic clusters from the pre-trained CLIP space while reducing ultrasound-specific artifact clusters, supporting that the corrections generalize beyond the six centers rather than overfitting. revision: yes

Referee: §4–5 (experimental results): The abstract and results sections assert statistically significant outperformance and few-shot robustness across six datasets, yet the provided description lacks concrete metrics, baseline details, p-values or confidence intervals, and module-level ablations. Without these, the superiority and generalization claims cannot be verified as load-bearing evidence.

Authors: We agree that clear numerical reporting is essential for verifying the claims. The full manuscript contains detailed results in Tables 1–4 and Figures 3–5, including Dice scores, accuracy, and F1 metrics for all tasks, along with comparisons to multiple baselines. In the revision, we have expanded these sections to explicitly report p-values from paired statistical tests (e.g., Wilcoxon signed-rank), 95% confidence intervals, and a dedicated module-level ablation table quantifying each component's impact. Few-shot and cross-dataset results are now presented with exact sample sizes and variance measures to strengthen the evidence for data efficiency and generalization. revision: yes
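
For readers unfamiliar with the paired test named in this response, the following is a minimal sketch of a two-sided Wilcoxon signed-rank test with an exact p-value obtained by enumerating sign assignments. It is only practical for small samples, it drops zero differences, and it averages tied ranks. The function names and the per-dataset scores are hypothetical and are not results from the paper.

```python
from itertools import product

def signed_ranks(a, b):
    """Non-zero paired differences and the average ranks of their magnitudes."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = average
        i = j + 1
    return diffs, ranks

def wilcoxon_exact(a, b):
    """Return (W, p) where W = min(W+, W-) and p is the exact two-sided p-value."""
    diffs, ranks = signed_ranks(a, b)
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    total = sum(ranks)
    w = min(w_plus, total - w_plus)
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):  # every sign pattern under H0
        candidate = sum(r for s, r in zip(signs, ranks) if s)
        if min(candidate, total - candidate) <= w:
            hits += 1
    return w, hits / 2 ** len(ranks)

# Hypothetical scores (percent) for two methods on six datasets.
method_a = [85, 88, 90, 79, 83, 91]
method_b = [80, 84, 89, 75, 80, 85]
w_stat, p_value = wilcoxon_exact(method_a, method_b)
```

With six pairs that all favour the same method, only 2 of the 64 sign patterns are at least as extreme, so the smallest achievable two-sided p-value is 2/64 = 0.03125. This illustrates why significance claims over six datasets need the exact sample sizes reported alongside them.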

Circularity Check

0 steps flagged

No significant circularity; empirical adaptation evaluated externally

The paper describes a Hybrid Tuning (HT) strategy that freezes the CLIP visual backbone and inserts a lightweight adapter containing a Frequency Filtering module and a Noise Estimation module. Performance claims rest on direct empirical comparisons against existing adapters and medical VLFMs across six external multi-center datasets, including few-shot and cross-dataset tests. No equations, fitted parameters renamed as predictions, self-citation load-bearing premises, or uniqueness theorems appear in the abstract or described method. The central result is therefore an externally benchmarked empirical outcome rather than a reduction to its own inputs or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on the effectiveness of two newly introduced adapter modules and the assumption that freezing the backbone preserves useful priors; no explicit free parameters are mentioned.

axioms (1)

domain assumption: Pre-trained CLIP visual backbones capture semantic priors worth preserving for downstream medical tasks.
The Hybrid Tuning strategy explicitly relies on freezing the backbone rather than updating its weights.

invented entities (2)

Frequency Filtering module (no independent evidence)
purpose: Suppress domain-specific periodic artifacts in ultrasound images
New component added to the adapter to address ultrasound physics.

Noise Estimation module (no independent evidence)
purpose: Dynamically calibrate feature representations to handle ultrasound noise
New component added to the adapter to address ultrasound physics.

pith-pipeline@v0.9.0 · 5774 in / 1227 out tokens · 69989 ms · 2026-05-19T10:30:44.379029+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

HT freezes the visual backbone and integrates a specialized lightweight adapter. This adapter features a Frequency Filtering module to suppress domain-specific periodic artifacts and a Noise Estimation module to dynamically calibrate feature representations.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

CLIP (Mona) without fine-tuning exhibits optimal performance across all datasets... outperforms the second-best method CLIPSeg by 4.91%... in Dice score

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Cancer Imaging8(1), 48–56 (2008)

Ahuja, A.T., Ying, M., Ho, S.Y., Antonio, G., Lee, Y.P., King, A.D., Wong, K.T.: Ultrasound of malignant cervical lymph nodes. Cancer Imaging8(1), 48–56 (2008). https://doi.org/10.1102/1470-7330.2008.0006

work page doi:10.1102/1470-7330.2008.0006 2008

[2] [2]

American Journal of Roentgenology184(5), 1691–1699 (2005)

Ahuja, A.T., Ying, M.: Sonographic evaluation of cervical lymph nodes. American Journal of Roentgenology184(5), 1691–1699 (2005). https://doi.org/10.2214/ajr. 184.5.01841691

work page doi:10.2214/ajr 2005

[3] [3]

Data in Brief28, 104863 (2020)

Al-Dhabyani, W., Gomaa, M., Khaled, H., Fahmy, A.: Dataset of breast ultrasound images. Data in Brief28, 104863 (2020). https://doi.org/10.1016/j.dib.2019.104863 14 J. Qu et al

work page doi:10.1016/j.dib.2019.104863 2020

[4] [4]

On the Opportunities and Risks of Foundation Models

Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv:2108.07258 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[5] [5]

American Journal of Roentgenology204(2), 234–240 (2015)

Brem, R.F., Lenihan, M.J., Lieberman, J., Torrente, J.: Screening breast ultrasound: Past, present, and future. American Journal of Roentgenology204(2), 234–240 (2015). https://doi.org/10.2214/AJR.13.12072

work page doi:10.2214/ajr.13.12072 2015

[6] [6]

In: 2009 IEEE conference on computer vision and pattern recognition

Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. Ieee (2009)

work page 2009

[7] [7]

In: North American Chapter of the Association for Computational Linguistics (2019)

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics (2019)

work page 2019

[8] [8]

Eslami, S., de Melo, G., Meinel, C.: Does clip benefit visual question answering in the medical domain as much as it does in the general domain? ArXivabs/2112.13906 (2021)

work page arXiv 2021

[9] [9]

Computers in Biology and Medicine 155, 106389 (2023)

Gong, H., Chen, J., Chen, G., Li, H., Li, G., Chen, F.: Thyroid region prior guided attention for ultrasound segmentation of thyroid nodules. Computers in Biology and Medicine 155, 106389 (2023). https://doi.org/10.1016/j.compbiomed.2022.106389

work page doi:10.1016/j.compbiomed.2022.106389 2023

[10] Han, M.X., Ying, M.T.C., Qu, M.J., Chen, Z., Gunda, M.S.T., Cai, J., Qin, J., Chu, W.C.W., King, A.D.: Differentiation of benign and malignant lymph nodes using ultrasound-based radiomics and machine learning. Journal of Medical Imaging and Radiation Sciences 55(3), 101544 (2024)

[11] Han, X., Qu, J., Chui, M.L., Gunda, S.T., Chen, Z., Qin, J., King, A.D., Chu, W.C.W., Cai, J., Ying, M.T.C.: Artificial intelligence performance in ultrasound-based lymph node diagnosis: a systematic review and meta-analysis. BMC Cancer 25(1), 73 (2025)

[12] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

[13] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)

[14] Huix, J.P., Ganeshan, A.R., Haslum, J.F., Söderberg, M., Matsoukas, C., Smith, K.: Are natural domain foundation models useful for medical image classification? In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 7634–7643 (2024)

[15]

Computerized medical imaging and graphics: the official journal of the Computerized Medical Imaging Society 112, 102326 (2023)

Jiang, H., Imran, M., Muralidharan, P., Patel, A., Pensa, J., Liang, M., Benidir, T., Grajo, J.R., Joseph, J.P., Terry, R.S., DiBianco, J.M., Su, L., Zhou, Y., Brisbane, W., Shao, W.: Microsegnet: A deep learning approach for prostate segmentation on micro-ultrasound images. Computerized medical imaging and graphics : the official journal of the Computeri...


[16] Jiao, J., Zhou, J., Li, X., Xia, M., Huang, Y., Huang, L., Wang, N., Zhang, X., Zhou, S., Wang, Y., et al.: Usfm: A universal ultrasound foundation model generalized to tasks and organs towards label efficient image analysis. Medical Image Analysis 96, 103202 (2024)

[17] Khattak, M.U., Kunhimon, S., Naseer, M., Khan, S., Khan, F.S.: Unimed-clip: Towards a unified image-text pretraining paradigm for diverse medical imaging modalities. arXiv preprint arXiv:2412.10372 (2024)

[18] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. arXiv:2304.02643 (2023)

[19] Li, J., Su, T., Zhao, B., Lv, F., Wang, Q., Navab, N., Hu, Y., Jiang, Z.: Ultrasound report generation with cross-modality feature alignment via unsupervised guidance. IEEE Transactions on Medical Imaging (2024)

[20] Lin, W., Zhao, Z., Zhang, X., Wu, C., Zhang, Y., Wang, Y., Xie, W.: Pmc-clip: Contrastive language-image pre-training using biomedical documents. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 525–536. Springer (2023)

[21] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)

[22] Lüddecke, T., Ecker, A.: Image segmentation using text and image prompts. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7086–7096 (2022)

[23] Meyer, A., Murali, A., Mutter, D., Padoy, N.: Ultrasam: A foundation model for ultrasound using large open-access segmentation datasets. arXiv:2411.16222 (2024)

[24] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv:2304.07193 (2023)

[25] Pedraza, L., Vargas, C., Narváez, F., Durán, O., Muñoz, E., Romero, E.: An open access thyroid ultrasound image database, Tenth International Symposium on Medical Information Processing and Analysis, vol. 9287. SPIE (2015). https://doi.org/10.1117/12.2073532

[26] Poudel, K., Dhakal, M., Bhandari, P., Adhikari, R., Thapaliya, S., Khanal, B.: Exploring transfer learning in medical image segmentation using vision-language models. In: International Conference on Medical Imaging with Deep Learning (2023)

[27] Qu, J., Han, X., Chui, M.L., Pu, Y., Gunda, S.T., Chen, Z., Qin, J., King, A.D., Chu, W.C.W., Cai, J., Ying, M.T.C.: The application of deep learning for lymph node segmentation: A systematic review. IEEE Access 13, 97208–97227 (2025). https://doi.org/10.1109/ACCESS.2025.3575454

[28] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

[29] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)

[30] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. pp. 234–241. Springer (2015)

[31] Takashima, S., Fukuda, H., Nomura, N., Kishimoto, H., Kim, T., Kobayashi, T.: Thyroid nodules: Re-evaluation with ultrasound. Journal of Clinical Ultrasound 23(3), 179–184 (1995). https://doi.org/10.1002/jcu.1870230306

[32] Wang, J., Chen, D., Wu, Z., Luo, C., Zhou, L., Zhao, Y., Xie, Y., Liu, C., Jiang, Y.G., Yuan, L.: Omnivl: One foundation model for image-language and video-language tasks. Advances in neural information processing systems 35, 5696–5710 (2022)

[33] Wang, Z., Wu, Z., Agarwal, D., Sun, J.: Medclip: Contrastive learning from unpaired medical images and text. Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing 2022, 3876–3887 (2022)

[34] Xu, H., Xie, S., Tan, X.E., Huang, P.Y., Howes, R., Sharma, V., Li, S.W., Ghosh, G., Zettlemoyer, L., Feichtenhofer, C.: Demystifying clip data. arXiv preprint arXiv:2309.16671 (2023)

[35] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

[36] Yin, D., Hu, L., Li, B., Zhang, Y., Yang, X.: 5% > 100%: Breaking performance shackles of full fine-tuning on visual recognition tasks. arXiv preprint arXiv:2408.08345 (2024)

[37] Ying, M., Ahuja, A.: Sonography of neck lymph nodes. part i: Normal lymph nodes. Clinical Radiology 58(5), 351–358 (2003). https://doi.org/10.1016/S0009-9260(02)00584-6

[38] Zhang, S., Xu, Y., Usuyama, N., Xu, H., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., Valluri, N., et al.: Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915 (2023)