Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis
Pith reviewed 2026-05-19 10:30 UTC · model grok-4.3
The pith
Freezing CLIP's visual backbone and adding a frequency-filtering noise-estimating adapter bridges the ultrasound modality gap for better medical image analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Hybrid Tuning (HT) strategy freezes the pre-trained visual backbone of CLIP-based models and adds a specialized lightweight adapter. The adapter combines a Frequency Filtering module, which suppresses domain-specific periodic artifacts, with a Noise Estimation module, which dynamically calibrates feature representations. According to the paper, this bridges the modality gap in medical ultrasound imaging and yields significant gains over state-of-the-art adapters in segmentation and classification, along with strong few-shot efficiency and cross-dataset generalization.
What carries the argument
The Hybrid Tuning (HT) adapter consisting of a Frequency Filtering module and a Noise Estimation module applied to the frozen visual backbone of a vision-language foundation model.
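As a rough illustration of what suppressing a periodic artifact in the frequency domain means, the sketch below applies a notch filter to a 1D signal using a naive discrete Fourier transform. It is not the paper's Frequency Filtering module, whose design is not described on this page; the function names, the Gaussian test signal, and the artifact frequency are all assumptions made for the example.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT; returns the real part, which suffices for real inputs."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def notch_filter(signal, artifact_bin):
    """Zero one frequency bin and its mirror bin, then transform back."""
    spectrum = dft(signal)
    n = len(signal)
    spectrum[artifact_bin] = 0
    spectrum[(n - artifact_bin) % n] = 0
    return idft(spectrum)


# Hypothetical data: a smooth profile plus a sinusoidal artifact at bin 8.
n = 64
clean = [math.exp(-((t - 32) ** 2) / 50.0) for t in range(n)]
artifact = [0.5 * math.sin(2 * math.pi * 8 * t / n) for t in range(n)]
observed = [c + a for c, a in zip(clean, artifact)]
filtered = notch_filter(observed, artifact_bin=8)

err_before = max(abs(o - c) for o, c in zip(observed, clean))
err_after = max(abs(f - c) for f, c in zip(filtered, clean))
print(f"max error before: {err_before:.3f}, after: {err_after:.5f}")
```

The filter works here only because the artifact sits in a single known bin and the underlying signal has almost no energy there. A learned module would have to decide which frequencies to attenuate, which is the part this sketch leaves out.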
If this is right
- HT-enhanced models outperform existing state-of-the-art adapters and medical VLFMs in segmentation and classification across six multi-center datasets.
- HT shows exceptional data efficiency in few-shot learning scenarios.
- HT provides robust cross-dataset generalization capabilities.
- Preserving pre-trained semantic priors while explicitly modeling ultrasound-specific noise unlocks the use of foundational models in automated ultrasound diagnosis.
Where Pith is reading between the lines
- This method suggests that similar lightweight adapters could adapt foundation models to other medical imaging modalities with distinct noise characteristics.
- The parameter-efficient nature may facilitate deployment in resource-limited clinical settings.
- Further work could explore combining HT with other modalities or extending to video ultrasound sequences.
- Success here implies that modality-specific noise modeling is a general principle for adapting vision-language models beyond natural images.
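The noise-modeling idea in the last bullet can be made concrete with a toy example. The sketch below simulates multiplicative speckle on a homogeneous patch, estimates the noise level as a coefficient of variation, and shrinks values toward the patch mean in proportion to that estimate. This is a generic, Lee-filter-flavoured illustration and not the paper's Noise Estimation module; `add_speckle`, `estimate_noise_level`, `calibrate`, and the gain formula are hypothetical.

```python
import random
import statistics


def add_speckle(clean, sigma, rng):
    """Multiplicative noise model: observed = clean * (1 + sigma * n), n ~ N(0, 1)."""
    return [v * (1.0 + sigma * rng.gauss(0.0, 1.0)) for v in clean]


def estimate_noise_level(patch):
    """Coefficient of variation over a patch assumed to be homogeneous."""
    return statistics.pstdev(patch) / statistics.fmean(patch)


def calibrate(patch, noise_level):
    """Shrink values toward the patch mean; more shrinkage when noise is high.

    The gain 1 / (1 + noise_level) is an arbitrary choice for illustration.
    """
    mean = statistics.fmean(patch)
    gain = 1.0 / (1.0 + noise_level)
    return [mean + gain * (v - mean) for v in patch]


rng = random.Random(0)  # fixed seed so the example is repeatable
clean_patch = [10.0] * 2000
noisy_patch = add_speckle(clean_patch, sigma=0.2, rng=rng)
noise_level = estimate_noise_level(noisy_patch)
calibrated = calibrate(noisy_patch, noise_level)

print(f"estimated noise level: {noise_level:.3f}")
print(f"spread before: {statistics.pstdev(noisy_patch):.3f}")
print(f"spread after:  {statistics.pstdev(calibrated):.3f}")
```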
Load-bearing premise
The assumption that freezing the pre-trained visual backbone and using a lightweight adapter with frequency filtering and noise estimation is enough to overcome the modality gap from ultrasound acoustic physics without updating the original weights.
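To make the premise concrete, the toy sketch below keeps a fixed "backbone" untouched and fits only two adapter parameters on top of its output. It is a minimal stand-in for parameter-efficient adaptation in general, not the HT adapter: the linear backbone, the scale-and-shift adapter, and the data are all invented for the example.

```python
# Frozen "backbone": a fixed linear map stored in an immutable tuple.
BACKBONE_W = (0.8, -0.3)


def backbone(x):
    """Pre-trained feature extractor stand-in; its weights are never updated."""
    return BACKBONE_W[0] * x[0] + BACKBONE_W[1] * x[1]


def train_adapter(data, steps=300, lr=0.05):
    """Fit out = scale * backbone(x) + shift by gradient descent on squared error."""
    scale, shift = 1.0, 0.0  # the only trainable parameters
    for _ in range(steps):
        grad_scale = 0.0
        grad_shift = 0.0
        for x, y in data:
            feat = backbone(x)
            err = (scale * feat + shift) - y
            grad_scale += 2.0 * err * feat / len(data)
            grad_shift += 2.0 * err / len(data)
        scale -= lr * grad_scale
        shift -= lr * grad_shift
    return scale, shift


# Hypothetical target task: labels are an affine function of the backbone feature.
inputs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (-1.0, 2.0)]
data = [(x, 2.0 * backbone(x) + 1.0) for x in inputs]

scale, shift = train_adapter(data)
print(f"learned scale={scale:.4f}, shift={shift:.4f}")
```

The example succeeds because the target is, by construction, reachable from the frozen features. Whether ultrasound targets are reachable from frozen CLIP features is exactly what the premise assumes.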
What would settle it
A direct comparison in which full fine-tuning of the backbone, or an alternative adapter without frequency filtering and noise estimation, matches or beats HT on the same ultrasound datasets would challenge the necessity of this specific approach.
Original abstract
Vision-Language Foundation Models (VLFMs) exhibit remarkable generalization, yet their direct application to medical ultrasound is severely hindered by a profound modality gap. The unique acoustic physics of ultrasound, characterized by speckle noise, shadowing, and heterogeneous textures, often degrades the performance of off-the-shelf VLFMs. To bridge this gap, we propose a novel Hybrid Tuning (HT) strategy for the parameter-efficient adaptation of CLIP-based models to ultrasound analysis. Instead of updating the pre-trained weights, HT freezes the visual backbone and integrates a specialized lightweight adapter. This adapter features a Frequency Filtering module to suppress domain-specific periodic artifacts and a Noise Estimation module to dynamically calibrate feature representations. Extensive evaluations across six multi-center datasets demonstrate that our HT-enhanced models significantly outperform existing state-of-the-art adapters and medical VLFMs in both segmentation and classification tasks. Notably, HT exhibits exceptional data efficiency in few-shot scenarios and robust cross-dataset generalization. Our findings prove that preserving pre-trained semantic priors while explicitly modeling ultrasound-specific noise is key to unlocking foundational intelligence in automated ultrasound diagnosis. The source code is available at https://github.com/jinggqu/NextGen-UIA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Hybrid Tuning (HT), a parameter-efficient adaptation strategy for CLIP-based vision-language foundation models to medical ultrasound. The visual backbone is frozen while a lightweight adapter is added, consisting of a Frequency Filtering module to suppress domain-specific periodic artifacts and a Noise Estimation module to dynamically calibrate features. Extensive experiments on six multi-center datasets are reported to show that HT-enhanced models outperform existing state-of-the-art adapters and medical VLFMs on both segmentation and classification, with strong data efficiency in few-shot regimes and robust cross-dataset generalization. The core claim is that preserving pre-trained semantic priors while explicitly modeling ultrasound-specific noise unlocks effective foundational intelligence for automated ultrasound diagnosis. Source code is released.
Significance. If the results are substantiated, the work would offer a practical route for deploying large vision-language models on ultrasound without costly full fine-tuning, addressing a real modality gap in a data-scarce clinical domain. The open-source code release supports reproducibility and community follow-up. This could accelerate adoption of foundation-model techniques in medical ultrasound analysis, where acoustic artifacts and limited annotations are persistent challenges.
major comments (2)
- §3 (HT adapter design): The central claim that freezing the backbone and applying only frequency filtering plus noise estimation suffices to bridge the modality gap rests on the assumption that ultrasound-specific effects (multiplicative speckle, depth-dependent shadowing, tissue-specific scattering) can be corrected in the adapter without altering backbone embeddings. This is load-bearing for all reported gains; the manuscript should include targeted ablations or embedding visualizations demonstrating that the adapter restores semantic utility rather than merely fitting to the six centers.
- §4–5 (experimental results): The abstract and results sections assert statistically significant outperformance and few-shot robustness across six datasets, yet the provided description lacks concrete metrics, baseline details, p-values or confidence intervals, and module-level ablations. Without these, the superiority and generalization claims cannot be verified as load-bearing evidence.
minor comments (2)
- Abstract: Key quantitative results (e.g., Dice scores, accuracy deltas, few-shot sample counts) should be included to allow readers to gauge the magnitude of improvement without reading the full experiments section.
- Notation: Ensure consistent naming of the Frequency Filtering and Noise Estimation modules across text, figures, and equations to avoid reader confusion.
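For readers unfamiliar with the Dice score mentioned in the first minor comment, a minimal sketch of the standard definition, 2|A∩B| / (|A| + |B|), on binary masks follows. The masks are made up, and returning 1.0 when both masks are empty is a convention chosen for this example rather than something the paper specifies.

```python
def dice_score(pred, target):
    """Dice coefficient for two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / total


# Hypothetical six-pixel masks: 2 overlapping pixels, 3 foreground pixels each.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
dice = dice_score(pred, target)
print(f"Dice = {dice:.4f}")
```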
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the Hybrid Tuning adapter design and the presentation of experimental results. We address each major comment below and have revised the manuscript to incorporate additional evidence and clarifications.
- Referee: §3 (HT adapter design): The central claim that freezing the backbone and applying only frequency filtering plus noise estimation suffices to bridge the modality gap rests on the assumption that ultrasound-specific effects (multiplicative speckle, depth-dependent shadowing, tissue-specific scattering) can be corrected in the adapter without altering backbone embeddings. This is load-bearing for all reported gains; the manuscript should include targeted ablations or embedding visualizations demonstrating that the adapter restores semantic utility rather than merely fitting to the six centers.
  Authors: We appreciate this observation, as validating the adapter's mechanism is indeed central to the contribution. The original manuscript includes module ablations in Section 4.3 that isolate the contributions of frequency filtering and noise estimation to overall performance. To directly address the request for evidence on semantic restoration, we have added t-SNE embedding visualizations in the revised Section 3.4 comparing backbone features before and after the adapter across multiple datasets. These show improved alignment with semantic clusters from the pre-trained CLIP space while reducing ultrasound-specific artifact clusters, supporting that the corrections generalize beyond the six centers rather than overfitting.
  Revision: yes
- Referee: §4–5 (experimental results): The abstract and results sections assert statistically significant outperformance and few-shot robustness across six datasets, yet the provided description lacks concrete metrics, baseline details, p-values or confidence intervals, and module-level ablations. Without these, the superiority and generalization claims cannot be verified as load-bearing evidence.
  Authors: We agree that clear numerical reporting is essential for verifying the claims. The full manuscript contains detailed results in Tables 1–4 and Figures 3–5, including Dice scores, accuracy, and F1 metrics for all tasks, along with comparisons to multiple baselines. In the revision, we have expanded these sections to explicitly report p-values from paired statistical tests (e.g., Wilcoxon signed-rank), 95% confidence intervals, and a dedicated module-level ablation table quantifying each component's impact. Few-shot and cross-dataset results are now presented with exact sample sizes and variance measures to strengthen the evidence for data efficiency and generalization.
  Revision: yes
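The paired test named in the rebuttal can be sketched as follows: an exact two-sided Wilcoxon signed-rank test that enumerates all sign assignments, which is practical only for small samples. The per-case Dice values are invented purely to exercise the code and are not results from the paper; the function names are likewise hypothetical.

```python
from itertools import product


def signed_ranks(diffs):
    """Drop zero differences and rank the rest by absolute value (ties averaged)."""
    nonzero = [d for d in diffs if d != 0]
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return nonzero, ranks


def wilcoxon_exact(a, b):
    """Return (W+, two-sided exact p-value) for paired samples a and b."""
    nonzero, ranks = signed_ranks([x - y for x, y in zip(a, b)])
    w_plus = sum(r for d, r in zip(nonzero, ranks) if d > 0)
    center = sum(ranks) / 2
    observed = abs(w_plus - center)
    extreme = 0
    total = 0
    # Under the null hypothesis every sign pattern is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for s, r in zip(signs, ranks) if s)
        total += 1
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / total


# Made-up per-case Dice scores for two methods on the same eight cases.
method_a = [0.82, 0.79, 0.88, 0.91, 0.76, 0.85, 0.80, 0.87]
method_b = [0.80, 0.74, 0.87, 0.85, 0.73, 0.81, 0.72, 0.78]
w_plus, p_value = wilcoxon_exact(method_a, method_b)
print(f"W+ = {w_plus}, p = {p_value}")
```

Enumeration costs 2^n evaluations, so larger samples would call for a normal approximation or a statistics library instead.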
Circularity Check
No significant circularity; empirical adaptation evaluated externally
The paper describes a Hybrid Tuning (HT) strategy that freezes the CLIP visual backbone and inserts a lightweight adapter containing a Frequency Filtering module and a Noise Estimation module. Performance claims rest on direct empirical comparisons against existing adapters and medical VLFMs across six external multi-center datasets, including few-shot and cross-dataset tests. No equations, fitted parameters renamed as predictions, self-citation load-bearing premises, or uniqueness theorems appear in the abstract or described method. The central result is therefore an externally benchmarked empirical outcome rather than a reduction to its own inputs or prior self-citations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pre-trained CLIP visual backbones capture semantic priors worth preserving for downstream medical tasks.
invented entities (2)
- Frequency Filtering module: no independent evidence
- Noise Estimation module: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  HT freezes the visual backbone and integrates a specialized lightweight adapter. This adapter features a Frequency Filtering module to suppress domain-specific periodic artifacts and a Noise Estimation module to dynamically calibrate feature representations.
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  CLIP (Mona) without fine-tuning exhibits optimal performance across all datasets... outperforms the second-best method CLIPSeg by 4.91%... in Dice score
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.