arxiv: 2604.23977 · v1 · submitted 2026-04-27 · 💻 cs.CV

Multi-View Synergistic Learning with Vision-Language Adaption for Low-Resource Biomedical Image Classification

Pith reviewed 2026-05-08 04:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords biomedical image classification · vision-language models · few-shot learning · zero-shot learning · multi-granularity contrastive learning · parameter-efficient fine-tuning · cross-modal alignment · semantic supervision

The pith

MVSL improves low-resource biomedical image classification by decoupling vision-language adaptation, using multi-granularity contrastive learning, and adding LLM-based semantic supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Multi-View Synergistic Learning to address accurate classification of biomedical images when only limited annotations are available. It separates the tuning of visual and textual encoders to suit their different properties, applies contrastive learning both to whole images and to localized lesion details, and incorporates class-level semantic constraints derived from large language models. These elements together aim to create steadier alignment between images and text while sharpening distinctions among similar disease categories. Such work matters because biomedical tasks often face data scarcity yet demand fine discrimination across varied scan types and body regions. A reader would care about whether these steps allow existing vision-language models to transfer more reliably without massive new labeling efforts.

Core claim

The authors claim that three components, used jointly, produce more stable cross-modal alignment and better discrimination: decoupled visual-textual encoder adaptation for parameter-efficient tuning, multi-granularity contrastive learning over global and local evidence, and structured supervision from large language models that preserves class-level semantics. They report consistent gains over prior methods in few-shot and zero-shot settings on eleven public biomedical datasets spanning nine imaging modalities and ten anatomical regions.

What carries the argument

Multi-View Synergistic Learning (MVSL), which decouples adaptation of visual and textual encoders, applies multi-granularity contrastive learning, and incorporates LLM-derived structured supervision to constrain textual representations and regularize visual embeddings via cross-modal alignment.

If this is right

  • Decoupled adaptation enables more stable and effective parameter-efficient fine-tuning by respecting the distinct representational characteristics of visual and textual encoders.
  • Multi-granularity contrastive learning explicitly models both global image semantics and localized lesion-level evidence to improve discrimination among visually similar disease categories.
  • LLM-derived structured supervision preserves disease semantic structure at the class level and indirectly regularizes visual embeddings through cross-modal alignment.
  • The combined components deliver consistent outperformance in both few-shot and zero-shot classification across diverse biomedical datasets.
  • The approach supports effective use of vision-language models under limited supervision spanning multiple modalities and anatomical regions.
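
The interplay between global and lesion-level evidence can be made concrete with a toy example. The sketch below is not the paper's implementation. It assumes that a single coefficient `alpha` weights a local feature against a global one (the direction of the weighting is a guess), and that classification is a softmax over cosine similarities to class text embeddings. The names `fuse_features` and `class_probabilities`, the temperature of 0.07, and the toy vectors are all illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def fuse_features(global_feat, local_feat, alpha):
    """Blend two views: alpha weights the local view, 1 - alpha the global view.

    Assumption: this linear blend is a stand-in for whatever fusion the paper uses.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    glob = l2_normalize(global_feat)
    loc = l2_normalize(local_feat)
    return l2_normalize([alpha * lv + (1.0 - alpha) * gv for lv, gv in zip(loc, glob)])


def class_probabilities(image_feat, text_feats, temperature=0.07):
    """Softmax over cosine similarities between one image and each class text embedding."""
    img = l2_normalize(image_feat)
    logits = [
        sum(a * b for a, b in zip(img, l2_normalize(text))) / temperature
        for text in text_feats
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-D setup: the global view leans toward class 0, the local view toward class 1.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
global_feat = [0.9, 0.1]
local_feat = [0.2, 0.8]

for alpha in (0.0, 0.5, 1.0):
    fused = fuse_features(global_feat, local_feat, alpha)
    print(alpha, [round(p, 3) for p in class_probabilities(fused, text_feats)])
```

With these toy vectors, `alpha = 0` follows the global view and `alpha = 1` follows the local view; intermediate values trade the two off.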

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The decoupling and semantic regularization steps could transfer to other data-scarce specialized imaging domains such as industrial inspection or remote sensing.
  • Ablation studies that isolate each component on additional held-out datasets would clarify which elements most drive the observed gains.
  • Integration with active learning or synthetic data methods might further lower annotation requirements in clinical AI pipelines.
  • Deployment tests on private clinical data with equipment-specific shifts would reveal whether the semantic preservation holds outside public benchmarks.

Load-bearing premise

Structured supervision derived from large language models accurately captures and preserves disease-level semantic relationships without introducing domain-specific biases or inaccuracies that degrade cross-modal alignment.
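
One way to picture what "preserving disease-level semantic relationships" could mean in practice is a penalty that pulls the pairwise similarities of class text embeddings toward a target similarity matrix. The sketch below assumes the LLM output has already been turned into such a matrix; that representation, the function `semantic_structure_loss`, and the numbers are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def semantic_structure_loss(text_embeddings, target_similarity):
    """Mean squared gap between embedding similarities and target similarities.

    Only class pairs with i < j are compared, so the diagonal is ignored.
    """
    n = len(text_embeddings)
    gaps = [
        (cosine(text_embeddings[i], text_embeddings[j]) - target_similarity[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return sum(gaps) / len(gaps) if gaps else 0.0


# Hypothetical target: classes 0 and 1 are closely related, class 2 is not.
target = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
# Toy text embeddings that do not yet respect that structure.
text_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(round(semantic_structure_loss(text_embeddings, target), 4))
```

If the target matrix itself is wrong, a term like this pushes the text encoder toward the wrong structure, which is why the premise above carries so much weight.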

What would settle it

If MVSL shows no performance advantage over simpler adaptation baselines on the same eleven datasets once the LLM-derived structured supervision is removed or replaced by random class constraints, the central contribution of that component would be falsified.
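
The random-constraint control described above can be sketched as follows, again assuming the LLM-derived supervision is expressed as a symmetric class-similarity matrix (an assumption made here, not something the paper states). The helper `shuffle_class_constraints` is hypothetical: it keeps the same set of similarity values but assigns them to class pairs at random, so any remaining gain cannot come from correct disease semantics.

```python
import random


def shuffle_class_constraints(similarity, seed=0):
    """Return a symmetric matrix whose off-diagonal values are a random
    permutation of the input's upper-triangle values. The diagonal is kept."""
    n = len(similarity)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    values = [similarity[i][j] for i, j in pairs]
    random.Random(seed).shuffle(values)  # seeded so the control is reproducible

    shuffled = [[similarity[i][i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (i, j), value in zip(pairs, values):
        shuffled[i][j] = value
        shuffled[j][i] = value
    return shuffled


# Toy similarity matrix standing in for LLM-derived class relationships.
llm_similarity = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
random_control = shuffle_class_constraints(llm_similarity, seed=1)
for row in random_control:
    print(row)
```

Training once with `llm_similarity` and once with `random_control`, under otherwise identical settings, would give the comparison the falsification test calls for.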

Figures

Figures reproduced from arXiv: 2604.23977 by Huiqing Qi, Joey Tianyi Zhou, Mengzhu Wang, Minxue Xiao, Taiping Zhang, Ting Xie, Xiaoliu Luo, Xu Wang.

Figure 1: Illustration of single- and cross-paradigm fine-tuning
Figure 2: Overview of the proposed MVSL framework. MVSL unifies fine-tuning paradigms, representation granularity, and…
Figure 3: Detailed architecture of the VRA and MGCL mod…
Figure 4: Average classification accuracy (%) of various few-shot…
Figure 5: Impact of inserting visual residual adapters (VRA) into the last three Transformer blocks of the ViT backbone on Base…
Figure 6: Effect of the fusion coefficient α on Base-to-Novel and Few-shot performance. The curves show Base-to-Novel metrics (Base, Novel, HM) and Few-shot accuracies (1-, 2-, 4-, 8-, and 16-shot) as α varies from 0.0 to 1.0. Peak performance is observed at intermediate α values (around 0.5), indicating that a balanced fusion of local and global features optimally enhances classification performance. This figure hi…
Figure 7: Visual saliency maps across various medical imag…
Figure 8: t-SNE visualization of image embeddings generated by MVSL compared with pretrained BiomedCLIP. The embeddings…
Original abstract

Accurate biomedical image classification under low-resource conditions remains challenging due to limited annotations, subtle inter-class visual differences, and complex disease semantics. While vision-language models offer a promising foundation for mitigating data scarcity, their effective adaptation in biomedical settings is constrained by the need for parameter-efficient tuning alongside fine-grained and semantically consistent representation learning. In this work, we propose Multi-View Synergistic Learning (MVSL), a unified framework that addresses these challenges by jointly considering adaptation paradigms, representation granularity, and disease semantic relationships. MVSL decouples the adaptation of visual and textual encoders to respect their distinct representational characteristics, enabling more stable and effective parameter-efficient fine-tuning. It further introduces multi-granularity contrastive learning to explicitly model both global image semantics and localized lesion-level evidence, improving fine-grained discrimination for visually similar disease categories. In addition, MVSL preserves disease-level semantic structure by incorporating structured supervision derived from large language models, which constrains textual representations at the class level and indirectly regularizes visual embeddings through cross-modal alignment. Together, these components enable more stable cross-modal alignment and improved discrimination under limited supervision. Extensive experiments on 11 public biomedical datasets spanning 9 imaging modalities and 10 anatomical regions demonstrate that MVSL consistently outperforms state-of-the-art methods in few-shot and zero-shot classification settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims to introduce Multi-View Synergistic Learning (MVSL), a framework for low-resource biomedical image classification that decouples adaptation of visual and textual encoders, introduces multi-granularity contrastive learning to capture global and lesion-level semantics, and incorporates structured supervision from large language models to preserve disease-level semantic relationships and regularize cross-modal alignment. Extensive experiments on 11 public datasets spanning 9 imaging modalities and 10 anatomical regions are reported to show consistent outperformance over state-of-the-art methods in few-shot and zero-shot settings.

Significance. If the empirical claims hold after verification, the work could be significant for parameter-efficient adaptation of vision-language models in data-scarce biomedical settings. The integration of decoupled tuning, multi-granularity contrastive objectives, and LLM-derived semantic constraints offers a structured approach to handling subtle inter-class differences across diverse modalities, with potential to improve generalization in clinical applications where annotations are limited.

major comments (2)
  1. [Abstract] The abstract claims consistent outperformance on 11 datasets but provides no details on baselines, ablation studies, statistical tests, or hyperparameter selection. This omission is load-bearing for the central empirical claim, as it prevents assessment of whether gains are attributable to the proposed components or post-hoc tuning.
  2. [Abstract] The structured supervision from LLMs is presented as preserving disease-level semantic structure and constraining textual representations to regularize visual embeddings. However, the abstract (and by extension the framework description) provides no information on prompt design, LLM choice, post-processing, expert validation, or ablations isolating this component, raising concerns that unverified biases or inaccuracies could undermine the cross-modal alignment on visually similar categories across 9 modalities.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of clarity in the abstract. We have revised the abstract to incorporate additional details on the experimental validation and the LLM supervision component. Our point-by-point responses follow.

  1. Referee: [Abstract] The abstract claims consistent outperformance on 11 datasets but provides no details on baselines, ablation studies, statistical tests, or hyperparameter selection. This omission is load-bearing for the central empirical claim, as it prevents assessment of whether gains are attributable to the proposed components or post-hoc tuning.

    Authors: We agree that the abstract can be strengthened by briefly referencing the experimental context. In the revised manuscript, the abstract now notes the use of established baselines including CLIP-based adaptations and recent biomedical VLMs, confirms that ablation studies were performed to isolate each MVSL component, and states that statistical significance was evaluated with paired t-tests across multiple random seeds. Hyperparameter selection followed a standard validation protocol on a small held-out portion of the few-shot data, with full specifications in Section 4.2. These changes make the source of the reported gains more transparent while keeping the abstract concise.
    Revision: yes

  2. Referee: [Abstract] The structured supervision from LLMs is presented as preserving disease-level semantic structure and constraining textual representations to regularize visual embeddings. However, the abstract (and by extension the framework description) provides no information on prompt design, LLM choice, post-processing, expert validation, or ablations isolating this component, raising concerns that unverified biases or inaccuracies could undermine the cross-modal alignment on visually similar categories across 9 modalities.

    Authors: We acknowledge the value of greater transparency on the LLM-derived supervision. The revised abstract now specifies the use of GPT-4 with structured prompts to generate disease-level semantic hierarchies, followed by embedding-based regularization. Section 3.3 of the manuscript already details the prompt templates, post-processing steps for relation extraction, and an ablation study (Table 6) that isolates the contribution of this term to cross-modal alignment. To address potential biases, we performed manual inspection on a representative sample of generated semantics and aligned outputs with standard medical taxonomies; while a full expert review was not feasible given the breadth of 9 modalities, these steps provide reasonable safeguards. The updated abstract references the ablation results to directly respond to the concern.
    Revision: partial
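
The simulated rebuttal mentions paired t-tests across random seeds. As a reminder of what that check computes, here is a minimal standard-library sketch of the paired t statistic. The function name `paired_t_statistic` is illustrative, and the accuracy values are placeholder numbers invented for this example, not results from the paper.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples, e.g. one accuracy per seed."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long sequences with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (divides by n - 1)
    if std_diff == 0:
        raise ValueError("differences are constant; the t statistic is undefined")
    n = len(diffs)
    return mean_diff / (std_diff / math.sqrt(n)), n - 1


# Placeholder accuracies (%), one per random seed, for two hypothetical methods.
method_acc = [72.0, 74.0, 73.0, 75.0, 71.0]
baseline_acc = [70.0, 71.0, 72.0, 72.0, 70.0]

t_value, dof = paired_t_statistic(method_acc, baseline_acc)
print(round(t_value, 3), dof)
```

The statistic is then compared against Student's t distribution with `dof` degrees of freedom. Pairing by seed matters because it removes seed-to-seed variance that both methods share.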

Circularity Check

0 steps flagged

No significant circularity; empirical framework with independent experimental validation

The paper proposes MVSL as a composite empirical method (decoupled encoders, multi-granularity contrastive loss, LLM-derived class-level supervision) and validates it via direct performance comparisons on 11 held-out biomedical datasets. No equations, predictions, or first-principles derivations are presented that reduce to fitted inputs or self-citations by construction. The central outperformance claim is statistical and externally falsifiable rather than tautological. Self-citations, if present, are not load-bearing for any closed-form result.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Framework rests on standard deep-learning assumptions plus the domain-specific premise that LLM outputs provide reliable disease semantics; no new physical entities or closed-form derivations are introduced.

free parameters (1)
  • contrastive loss weights and adaptation learning rates
    Typical tunable hyperparameters in contrastive and parameter-efficient fine-tuning setups; values chosen to optimize validation performance.
axioms (1)
  • domain assumption: LLM-generated structured supervision accurately reflects disease-level semantic relationships
    Invoked when the method incorporates this supervision to constrain textual representations and regularize visual embeddings.
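
To show where such loss weights would enter, one hypothetical form of a combined training objective is written out below. Every symbol is an assumption made for illustration; this page does not give the paper's actual loss.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{global}}
  + \lambda_{\text{local}}\,\mathcal{L}_{\text{local}}
  + \lambda_{\text{sem}}\,\mathcal{L}_{\text{sem}}
```

Here $\mathcal{L}_{\text{global}}$ and $\mathcal{L}_{\text{local}}$ stand for image-level and region-level contrastive terms, $\mathcal{L}_{\text{sem}}$ for the LLM-derived semantic constraint, and $\lambda_{\text{local}}$, $\lambda_{\text{sem}}$ for the tunable weights counted in the ledger.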

pith-pipeline@v0.9.0 · 5560 in / 1228 out tokens · 69373 ms · 2026-05-08T04:49:23.672921+00:00 · methodology

