arxiv: 2506.08849 · v4 · submitted 2025-06-10 · 💻 cs.CV

Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis

Pith reviewed 2026-05-19 10:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords hybrid tuning · vision language models · ultrasound image analysis · frequency filtering · noise estimation · parameter efficient adaptation · medical imaging · few-shot learning

The pith

Freezing CLIP's visual backbone and adding a frequency-filtering noise-estimating adapter bridges the ultrasound modality gap for better medical image analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language foundation models struggle with ultrasound images due to unique acoustic effects like speckle noise and shadowing that differ from natural images. The paper introduces Hybrid Tuning, which keeps the pre-trained visual encoder frozen to retain its semantic knowledge and attaches a lightweight adapter equipped with a frequency filtering module to remove periodic artifacts and a noise estimation module to adjust features dynamically. Tests across six datasets show this method beats current adaptation techniques in segmentation and classification, performing well even with few training examples and generalizing across different data sources. This indicates that directly addressing ultrasound physics on top of existing models can enable their use in automated diagnosis without full retraining.
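
To make the idea of suppressing a periodic artifact in the frequency domain concrete, here is a minimal sketch. It is not the paper's Frequency Filtering module, whose internals are not described on this page. It assumes a fixed (non-learned) notch filter applied to a single 1-D scan line; the names `dft`, `idft`, `notch_filter` and the chosen bin numbers are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def notch_filter(signal, artifact_bin, width=0):
    """Zero the DFT bins within `width` of `artifact_bin` and of its mirror bin."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        distance = min(abs(k - artifact_bin), abs(k - (n - artifact_bin)))
        if distance <= width:
            spectrum[k] = 0j
    # The input is real and the mask is symmetric, so the output is real too.
    return [value.real for value in idft(spectrum)]


# Illustrative data only: slow "anatomy" at bin 2 plus a periodic artifact at bin 12.
N = 64
clean = [math.sin(2 * math.pi * 2 * t / N) for t in range(N)]
artifact = [0.5 * math.sin(2 * math.pi * 12 * t / N) for t in range(N)]
noisy = [c + a for c, a in zip(clean, artifact)]

filtered = notch_filter(noisy, artifact_bin=12)
error_before = max(abs(v - c) for v, c in zip(noisy, clean))
error_after = max(abs(v - c) for v, c in zip(filtered, clean))
```

In this toy case the artifact sits exactly on one DFT bin, so the notch removes it almost perfectly. Real ultrasound artifacts are broader and spatially varying, which is presumably why the paper places the filtering inside a trainable adapter rather than fixing it by hand.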

Core claim

By freezing the pre-trained visual backbone of CLIP-based models and integrating a specialized lightweight adapter that includes a Frequency Filtering module to suppress domain-specific periodic artifacts and a Noise Estimation module to dynamically calibrate feature representations, the Hybrid Tuning strategy bridges the profound modality gap in medical ultrasound imaging, resulting in significant performance gains over state-of-the-art adapters in segmentation and classification tasks while demonstrating strong few-shot efficiency and cross-dataset generalization.
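
The phrase "dynamically calibrate feature representations" can be illustrated with a small sketch. This is an assumption about what such calibration might look like, not the paper's Noise Estimation module: it estimates a robust noise scale from the features themselves (median absolute deviation) and rescales them by that estimate. The function names and the `eps` parameter are hypothetical.

```python
import statistics


def estimate_noise(features):
    """Robust noise scale: median absolute deviation, scaled by 1.4826
    so that it matches the standard deviation for Gaussian data."""
    center = statistics.median(features)
    mad = statistics.median([abs(f - center) for f in features])
    return 1.4826 * mad


def calibrate(features, eps=1e-6):
    """Center the features and divide by the estimated noise scale."""
    center = statistics.median(features)
    sigma = estimate_noise(features)
    return [(f - center) / (sigma + eps) for f in features]


# Illustrative values only: four ordinary responses and one speckle-like outlier.
features = [1.0, 2.0, 3.0, 4.0, 100.0]
sigma = estimate_noise(features)
calibrated = calibrate(features)
```

The point of the sketch is that the scale is computed per input, so noisier inputs are damped more strongly. A learned module would replace the fixed MAD statistic with trainable parameters.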

What carries the argument

The Hybrid Tuning (HT) adapter consisting of a Frequency Filtering module and a Noise Estimation module applied to the frozen visual backbone of a vision-language foundation model.

If this is right

  • HT-enhanced models outperform existing state-of-the-art adapters and medical VLFMs in segmentation and classification across six multi-center datasets.
  • HT shows exceptional data efficiency in few-shot learning scenarios.
  • HT provides robust cross-dataset generalization capabilities.
  • Preserving pre-trained semantic priors while explicitly modeling ultrasound-specific noise unlocks the use of foundational models in automated ultrasound diagnosis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method suggests that similar lightweight adapters could adapt foundation models to other medical imaging modalities with distinct noise characteristics.
  • The parameter-efficient nature may facilitate deployment in resource-limited clinical settings.
  • Further work could explore combining HT with other modalities or extending to video ultrasound sequences.
  • Success here implies that modality-specific noise modeling is a general principle for adapting vision-language models beyond natural images.

Load-bearing premise

The assumption that freezing the pre-trained visual backbone and using a lightweight adapter with frequency filtering and noise estimation is enough to overcome the modality gap from ultrasound acoustic physics without updating the original weights.
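
A rough sense of what "lightweight" means can be had from back-of-the-envelope parameter counting. The sketch below assumes a generic bottleneck adapter (down-projection plus up-projection) in every transformer block of a ViT-B-sized backbone. The width 768, depth 12, bottleneck size 64, and the approximate backbone size of 86 million parameters are illustrative assumptions, not figures taken from the paper.

```python
def linear_params(d_in, d_out, bias=True):
    """Parameter count of a dense layer."""
    return d_in * d_out + (d_out if bias else 0)


def bottleneck_adapter_params(d_model, d_bottleneck):
    """Down-projection followed by up-projection, both with bias."""
    return linear_params(d_model, d_bottleneck) + linear_params(d_bottleneck, d_model)


# All values below are assumptions for illustration.
D_MODEL = 768                 # assumed ViT-B hidden width
N_LAYERS = 12                 # assumed number of transformer blocks
BOTTLENECK = 64               # hypothetical adapter bottleneck size
BACKBONE_PARAMS = 86_000_000  # approximate size of a ViT-B/16 encoder

per_layer = bottleneck_adapter_params(D_MODEL, BOTTLENECK)
trainable = per_layer * N_LAYERS
fraction = trainable / BACKBONE_PARAMS

print(f"adapter parameters per layer: {per_layer}")
print(f"trainable parameters in total: {trainable}")
print(f"share of backbone size: {fraction:.2%}")
```

Under these assumptions the trainable share is a little over one percent of the frozen backbone. The premise above is that this small budget is enough to absorb the ultrasound-specific effects.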

What would settle it

A direct comparison where full fine-tuning of the backbone or alternative adapters without frequency filtering and noise estimation achieve equal or better results on the same ultrasound datasets would challenge the necessity of this specific HT approach.

Figures

Figures reproduced from arXiv: 2506.08849 by Ann Dorothy King, Jia Ai, Jing Cai, Jingguo Qu, Jing Qin, Juan Wu, Michael Tin-Cheung Ying, Sheng Ning, Tonghuan Xiao, Tong Zhao, Winnie Chiu-Wing Chu, Xinyang Han, Yuqi Yang.

Figure 1: Overview of proposed workflow. (a) Fine-tuning stage. Introduce trainable adapters into frozen CLIP to bridge the domain gap between natural images and ultrasound scans. (b) Downstream tasks. Apply trainable heads for ultrasound image segmentation and classification in a supervised manner (solid arrows), and assess zero-shot ultrasound diagnosis capability of CLIP by using unified prompt-image pairs (das…
Figure 2: Structure overview of the original LoRA, Mona and proposed ViT with …
Figure 3: Segmentation and classification heads. Feature Map Up-sampling. Transformer feature maps typically exhibit low spatial resolution (e.g., 14×14 for a 224×224 input with 16×16 patch size). However, ROIs can vary significantly in size across different diseases and anatomical sites, making it challenging for CLIP to capture the fine-grained details required for dense prediction tasks. To address this, we integ…
Figure 4: Visualization of the proposed method and SOTAs on LN-1, LN-2, BUSI […
Original abstract

Vision-Language Foundation Models (VLFMs) exhibit remarkable generalization, yet their direct application to medical ultrasound is severely hindered by a profound modality gap. The unique acoustic physics of ultrasound, characterized by speckle noise, shadowing, and heterogeneous textures, often degrades the performance of off-the-shelf VLFMs. To bridge this gap, we propose a novel Hybrid Tuning (HT) strategy for the parameter-efficient adaptation of CLIP-based models to ultrasound analysis. Instead of updating the pre-trained weights, HT freezes the visual backbone and integrates a specialized lightweight adapter. This adapter features a Frequency Filtering module to suppress domain-specific periodic artifacts and a Noise Estimation module to dynamically calibrate feature representations. Extensive evaluations across six multi-center datasets demonstrate that our HT-enhanced models significantly outperform existing state-of-the-art adapters and medical VLFMs in both segmentation and classification tasks. Notably, HT exhibits exceptional data efficiency in few-shot scenarios and robust cross-dataset generalization. Our findings prove that preserving pre-trained semantic priors while explicitly modeling ultrasound-specific noise is key to unlocking foundational intelligence in automated ultrasound diagnosis. The source code is available at https://github.com/jinggqu/NextGen-UIA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Hybrid Tuning (HT), a parameter-efficient adaptation strategy for CLIP-based vision-language foundation models to medical ultrasound. The visual backbone is frozen while a lightweight adapter is added, consisting of a Frequency Filtering module to suppress domain-specific periodic artifacts and a Noise Estimation module to dynamically calibrate features. Extensive experiments on six multi-center datasets are reported to show that HT-enhanced models outperform existing state-of-the-art adapters and medical VLFMs on both segmentation and classification, with strong data efficiency in few-shot regimes and robust cross-dataset generalization. The core claim is that preserving pre-trained semantic priors while explicitly modeling ultrasound-specific noise unlocks effective foundational intelligence for automated ultrasound diagnosis. Source code is released.

Significance. If the results are substantiated, the work would offer a practical route for deploying large vision-language models on ultrasound without costly full fine-tuning, addressing a real modality gap in a data-scarce clinical domain. The open-source code release supports reproducibility and community follow-up. This could accelerate adoption of foundation-model techniques in medical ultrasound analysis, where acoustic artifacts and limited annotations are persistent challenges.

major comments (2)
  1. §3 (HT adapter design): The central claim that freezing the backbone and applying only frequency filtering plus noise estimation suffices to bridge the modality gap rests on the assumption that ultrasound-specific effects (multiplicative speckle, depth-dependent shadowing, tissue-specific scattering) can be corrected in the adapter without altering backbone embeddings. This is load-bearing for all reported gains; the manuscript should include targeted ablations or embedding visualizations demonstrating that the adapter restores semantic utility rather than merely fitting to the six centers.
  2. §4–5 (experimental results): The abstract and results sections assert statistically significant outperformance and few-shot robustness across six datasets, yet the provided description lacks concrete metrics, baseline details, p-values or confidence intervals, and module-level ablations. Without these, the superiority and generalization claims cannot be verified as load-bearing evidence.
minor comments (2)
  1. Abstract: Key quantitative results (e.g., Dice scores, accuracy deltas, few-shot sample counts) should be included to allow readers to gauge the magnitude of improvement without reading the full experiments section.
  2. Notation: Ensure consistent naming of the Frequency Filtering and Noise Estimation modules across text, figures, and equations to avoid reader confusion.
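
Since the report asks for Dice scores, a short reminder of how that metric is computed may help. The sketch below uses the standard definition, Dice = 2|P ∩ T| / (|P| + |T|), on binary masks stored as nested lists. The function name, the toy masks, and the convention for two empty masks are assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient for two binary masks given as nested lists of 0/1."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    if len(p) != len(t):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    denominator = sum(1 for a in p if a) + sum(1 for b in t if b)
    # Convention (an assumption): two empty masks count as a perfect match.
    if denominator == 0:
        return 1.0
    return 2.0 * intersection / denominator


# Toy masks: three foreground pixels each, two of them shared.
pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]
score = dice_score(pred, target)
```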

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the Hybrid Tuning adapter design and the presentation of experimental results. We address each major comment below and have revised the manuscript to incorporate additional evidence and clarifications.

Point-by-point responses
  1. Referee: §3 (HT adapter design): The central claim that freezing the backbone and applying only frequency filtering plus noise estimation suffices to bridge the modality gap rests on the assumption that ultrasound-specific effects (multiplicative speckle, depth-dependent shadowing, tissue-specific scattering) can be corrected in the adapter without altering backbone embeddings. This is load-bearing for all reported gains; the manuscript should include targeted ablations or embedding visualizations demonstrating that the adapter restores semantic utility rather than merely fitting to the six centers.

    Authors: We appreciate this observation, as validating the adapter's mechanism is indeed central to the contribution. The original manuscript includes module ablations in Section 4.3 that isolate the contributions of frequency filtering and noise estimation to overall performance. To directly address the request for evidence on semantic restoration, we have added t-SNE embedding visualizations in the revised Section 3.4 comparing backbone features before and after the adapter across multiple datasets. These show improved alignment with semantic clusters from the pre-trained CLIP space while reducing ultrasound-specific artifact clusters, supporting that the corrections generalize beyond the six centers rather than overfitting. revision: yes

  2. Referee: §4–5 (experimental results): The abstract and results sections assert statistically significant outperformance and few-shot robustness across six datasets, yet the provided description lacks concrete metrics, baseline details, p-values or confidence intervals, and module-level ablations. Without these, the superiority and generalization claims cannot be verified as load-bearing evidence.

    Authors: We agree that clear numerical reporting is essential for verifying the claims. The full manuscript contains detailed results in Tables 1–4 and Figures 3–5, including Dice scores, accuracy, and F1 metrics for all tasks, along with comparisons to multiple baselines. In the revision, we have expanded these sections to explicitly report p-values from paired statistical tests (e.g., Wilcoxon signed-rank), 95% confidence intervals, and a dedicated module-level ablation table quantifying each component's impact. Few-shot and cross-dataset results are now presented with exact sample sizes and variance measures to strengthen the evidence for data efficiency and generalization. revision: yes
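
The kind of paired analysis promised in the second response can be sketched with the standard library alone. Two substitutions are made here and should be read as such: a sign-flip permutation test stands in for the Wilcoxon signed-rank test named in the rebuttal (in practice `scipy.stats.wilcoxon` would be the usual choice), and the confidence interval is a percentile bootstrap. The per-case Dice values are invented purely for illustration and are not results from the paper.

```python
import random
import statistics


def bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def sign_flip_p_value(diffs, n_permutations=2000, seed=0):
    """Two-sided paired permutation test: randomly flip the sign of each difference."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(diffs))
    hits = 0
    for _ in range(n_permutations):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= observed:
            hits += 1
    # The +1 terms keep the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up per-case Dice scores for two methods on the same eight cases.
method_a = [0.81, 0.79, 0.85, 0.77, 0.83, 0.80, 0.82, 0.78]
method_b = [0.78, 0.77, 0.80, 0.76, 0.79, 0.78, 0.80, 0.75]
diffs = [a - b for a, b in zip(method_a, method_b)]

mean_diff = statistics.fmean(diffs)
ci_low, ci_high = bootstrap_ci(diffs)
p_value = sign_flip_p_value(diffs)
```

Pairing matters because both methods are scored on the same cases; an unpaired test would ignore that correlation and usually lose power.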

Circularity Check

0 steps flagged

No significant circularity; empirical adaptation evaluated externally

full rationale

The paper describes a Hybrid Tuning (HT) strategy that freezes the CLIP visual backbone and inserts a lightweight adapter containing a Frequency Filtering module and a Noise Estimation module. Performance claims rest on direct empirical comparisons against existing adapters and medical VLFMs across six external multi-center datasets, including few-shot and cross-dataset tests. No equations, fitted parameters renamed as predictions, self-citation load-bearing premises, or uniqueness theorems appear in the abstract or described method. The central result is therefore an externally benchmarked empirical outcome rather than a reduction to its own inputs or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on the effectiveness of two newly introduced adapter modules and the assumption that freezing the backbone preserves useful priors; no explicit free parameters are mentioned.

axioms (1)
  • [domain assumption] Pre-trained CLIP visual backbones capture semantic priors worth preserving for downstream medical tasks.
    The Hybrid Tuning strategy explicitly relies on freezing the backbone rather than updating its weights.
invented entities (2)
  • Frequency Filtering module [no independent evidence]
    purpose: Suppress domain-specific periodic artifacts in ultrasound images
    New component added to the adapter to address ultrasound physics.
  • Noise Estimation module [no independent evidence]
    purpose: Dynamically calibrate feature representations to handle ultrasound noise
    New component added to the adapter to address ultrasound physics.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 6 internal anchors

  1. [1]

    Cancer Imaging 8(1), 48–56 (2008)

    Ahuja, A.T., Ying, M., Ho, S.Y., Antonio, G., Lee, Y.P., King, A.D., Wong, K.T.: Ultrasound of malignant cervical lymph nodes. Cancer Imaging 8(1), 48–56 (2008). https://doi.org/10.1102/1470-7330.2008.0006

  2. [2]

    American Journal of Roentgenology 184(5), 1691–1699 (2005)

    Ahuja, A.T., Ying, M.: Sonographic evaluation of cervical lymph nodes. American Journal of Roentgenology 184(5), 1691–1699 (2005). https://doi.org/10.2214/ajr.184.5.01841691

  3. [3]

    Data in Brief 28, 104863 (2020)

    Al-Dhabyani, W., Gomaa, M., Khaled, H., Fahmy, A.: Dataset of breast ultrasound images. Data in Brief 28, 104863 (2020). https://doi.org/10.1016/j.dib.2019.104863

  4. [4]

    On the Opportunities and Risks of Foundation Models

    Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv:2108.07258 (2021)

  5. [5]

    American Journal of Roentgenology 204(2), 234–240 (2015)

    Brem, R.F., Lenihan, M.J., Lieberman, J., Torrente, J.: Screening breast ultrasound: Past, present, and future. American Journal of Roentgenology 204(2), 234–240 (2015). https://doi.org/10.2214/AJR.13.12072

  6. [6]

    In: 2009 IEEE conference on computer vision and pattern recognition

    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. IEEE (2009)

  7. [7]

    In: North American Chapter of the Association for Computational Linguistics (2019)

    Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics (2019)

  8. [8]

    Eslami, S., de Melo, G., Meinel, C.: Does clip benefit visual question answering in the medical domain as much as it does in the general domain? ArXiv abs/2112.13906 (2021)

  9. [9]

    Computers in Biology and Medicine 155, 106389 (2023)

    Gong, H., Chen, J., Chen, G., Li, H., Li, G., Chen, F.: Thyroid region prior guided attention for ultrasound segmentation of thyroid nodules. Computers in Biology and Medicine 155, 106389 (2023). https://doi.org/10.1016/j.compbiomed.2022.106389

  10. [10]

    Journal of Medical Imaging and Radiation Sciences 55(3), 101544 (2024)

    Han, M.X., Ying, M.T.C., Qu, M.J., Chen, Z., Gunda, M.S.T., Cai, J., Qin, J., Chu, W.C.W., King, A.D.: Differentiation of benign and malignant lymph nodes using ultrasound-based radiomics and machine learning. Journal of Medical Imaging and Radiation Sciences 55(3), 101544 (2024)

  11. [11]

    BMC cancer 25(1), 73 (2025)

    Han, X., Qu, J., Chui, M.L., Gunda, S.T., Chen, Z., Qin, J., King, A.D., Chu, W.C.W., Cai, J., Ying, M.T.C.: Artificial intelligence performance in ultrasound-based lymph node diagnosis: a systematic review and meta-analysis. BMC cancer 25(1), 73 (2025)

  12. [12]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

  13. [13]

    ICLR 1(2), 3 (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)

  14. [14]

    Huix, J.P., Ganeshan, A.R., Haslum, J.F., Söderberg, M., Matsoukas, C., Smith, K.: Are natural domain foundation models useful for medical image classification? In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 7634–7643 (2024)

  15. [15]

    Computerized medical imaging and graphics: the official journal of the Computerized Medical Imaging Society 112, 102326 (2023)

    Jiang, H., Imran, M., Muralidharan, P., Patel, A., Pensa, J., Liang, M., Benidir, T., Grajo, J.R., Joseph, J.P., Terry, R.S., DiBianco, J.M., Su, L., Zhou, Y., Brisbane, W., Shao, W.: Microsegnet: A deep learning approach for prostate segmentation on micro-ultrasound images. Computerized medical imaging and graphics: the official journal of the Computeri...

  16. [16]

    Medical Image Analysis 96, 103202 (2024)

    Jiao, J., Zhou, J., Li, X., Xia, M., Huang, Y., Huang, L., Wang, N., Zhang, X., Zhou, S., Wang, Y., et al.: Usfm: A universal ultrasound foundation model generalized to tasks and organs towards label efficient image analysis. Medical Image Analysis 96, 103202 (2024)

  17. [17]

    arXiv preprint arXiv:2412.10372 (2024)

    Khattak, M.U., Kunhimon, S., Naseer, M., Khan, S., Khan, F.S.: Unimed-clip: Towards a unified image-text pretraining paradigm for diverse medical imaging modalities. arXiv preprint arXiv:2412.10372 (2024)

  18. [18]

    Segment Anything

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. arXiv:2304.02643 (2023)

  19. [19]

    IEEE Transactions on Medical Imaging (2024)

    Li, J., Su, T., Zhao, B., Lv, F., Wang, Q., Navab, N., Hu, Y., Jiang, Z.: Ultrasound report generation with cross-modality feature alignment via unsupervised guidance. IEEE Transactions on Medical Imaging (2024)

  20. [20]

    In: International Conference on Medical Image Computing and Computer-Assisted Intervention

    Lin, W., Zhao, Z., Zhang, X., Wu, C., Zhang, Y., Wang, Y., Xie, W.: Pmc-clip: Contrastive language-image pre-training using biomedical documents. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 525–536. Springer (2023)

  21. [21]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)

  22. [22]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Lüddecke, T., Ecker, A.: Image segmentation using text and image prompts. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7086–7096 (2022)

  23. [23]

    arXiv:2411.16222 (2024)

    Meyer, A., Murali, A., Mutter, D., Padoy, N.: Ultrasam: A foundation model for ultrasound using large open-access segmentation datasets. arXiv:2411.16222 (2024)

  24. [24]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv:2304.07193 (2023)

  25. [25]

    Pedraza, L., Vargas, C., Narváez, F., Durán, O., Muñoz, E., Romero, E.: An open access thyroid ultrasound image database, Tenth International Symposium on Medical Information Processing and Analysis, vol. 9287. SPIE (2015). https://doi.org/10.1117/12.2073532

  26. [26]

    In: International Conference on Medical Imaging with Deep Learning (2023)

    Poudel, K., Dhakal, M., Bhandari, P., Adhikari, R., Thapaliya, S., Khanal, B.: Exploring transfer learning in medical image segmentation using vision-language models. In: International Conference on Medical Imaging with Deep Learning (2023)

  27. [27]

    IEEE Access 13, 97208–97227 (2025)

    Qu, J., Han, X., Chui, M.L., Pu, Y., Gunda, S.T., Chen, Z., Qin, J., King, A.D., Chu, W.C.W., Cai, J., Ying, M.T.C.: The application of deep learning for lymph node segmentation: A systematic review. IEEE Access 13, 97208–97227 (2025). https://doi.org/10.1109/ACCESS.2025.3575454

  28. [28]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

  29. [29]

    Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)

  30. [30]

    In: Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18

    Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. pp. 234–241. Springer (2015)

  31. [31]

    Journal of Clinical Ultrasound 23(3), 179–184 (1995)

    Takashima, S., Fukuda, H., Nomura, N., Kishimoto, H., Kim, T., Kobayashi, T.: Thyroid nodules: Re-evaluation with ultrasound. Journal of Clinical Ultrasound 23(3), 179–184 (1995). https://doi.org/10.1002/jcu.1870230306

  32. [32]

    Advances in neural information processing systems 35, 5696–5710 (2022)

    Wang, J., Chen, D., Wu, Z., Luo, C., Zhou, L., Zhao, Y., Xie, Y., Liu, C., Jiang, Y.G., Yuan, L.: Omnivl: One foundation model for image-language and video-language tasks. Advances in neural information processing systems 35, 5696–5710 (2022)

  33. [33]

    Proceedings of the Conference on Empirical Methods in Natural Language Processing

    Wang, Z., Wu, Z., Agarwal, D., Sun, J.: Medclip: Contrastive learning from unpaired medical images and text. Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing 2022, 3876–3887 (2022)

  34. [34]

    Demystifying CLIP Data

    Xu, H., Xie, S., Tan, X.E., Huang, P.Y., Howes, R., Sharma, V., Li, S.W., Ghosh, G., Zettlemoyer, L., Feichtenhofer, C.: Demystifying clip data. arXiv preprint arXiv:2309.16671 (2023)

  35. [35]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  36. [36]

    arXiv preprint arXiv:2408.08345 (2024)

    Yin, D., Hu, L., Li, B., Zhang, Y., Yang, X.: 5%> 100%: Breaking performance shackles of full fine-tuning on visual recognition tasks. arXiv preprint arXiv:2408.08345 (2024)

  37. [37]

    part i: Normal lymph nodes

    Ying, M., Ahuja, A.: Sonography of neck lymph nodes. part i: Normal lymph nodes. Clinical Radiology 58(5), 351–358 (2003). https://doi.org/10.1016/S0009-9260(02)00584-6

  38. [38]

    BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

    Zhang, S., Xu, Y., Usuyama, N., Xu, H., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., Valluri, N., et al.: Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915 (2023)