pith. sign in

arxiv: 2606.04922 · v1 · pith:D3G5BJ5Lnew · submitted 2026-06-03 · 💻 cs.CV · cs.AI· cs.LG

Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models

Pith reviewed 2026-06-28 06:31 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG
keywords prompt tuningvision-language modelsknowledge distillationmedical imagingclass relationsfew-shot adaptationbiomedical VLMsgeometry-aware distillation
0
0 comments X

The pith

Directional geometry targets from class relations improve prompt tuning of biomedical vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard prompt tuning for vision-language models in medicine optimizes only the correct class label and treats every other class as equally wrong. This approach ignores relations that exist between classes, such as similar pathologies, and produces unstable decision boundaries when training data is limited. The paper introduces a distillation method that builds directional targets inside the teacher model; these targets keep the ground-truth label intact while adding geometric structure among classes. Two losses then transfer the structure: one on the global image token and one on attentive patch tokens. Experiments across eleven medical datasets show consistent accuracy gains and better handling of unseen classes under base-to-novel and few-shot protocols.

Core claim

We propose Omni-Geometry Knowledge Distillation (OGKD) that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop Global Geometry-Aware Distillation (GAD) on the global image token and Label-Guided Geometry Distillation (LGD) on attentive patch tokens. Across comprehensive experiments on 11 medical datasets for base-to-novel and few-shot evaluations, OGKD achieves average absolute accuracy gains of 1.7%-2.8% over prior state-of-the-art VLM adaptation methods, with robust generalization to unseen classes and more reliable predictions.

What carries the argument

Omni-Geometry Knowledge Distillation (OGKD) framework that generates directional targets encoding inter-class geometry while preserving ground truth, applied via Global Geometry-Aware Distillation (GAD) and Label-Guided Geometry Distillation (LGD) losses.

If this is right

  • Prompt-tuned models reach higher accuracy on medical image classification under limited supervision.
  • Gains appear in both base-to-novel class splits and few-shot regimes.
  • Generalization to unseen classes improves compared with prior adaptation techniques.
  • Prediction reliability increases as measured across the evaluated datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same geometry-injection idea could be tested on natural-image datasets that contain clear semantic hierarchies between classes.
  • If directional targets can be constructed from class labels alone without a separate teacher, dependence on large pre-trained VLMs might decrease.
  • Applying the method to tasks with conflicting or noisy class relations would reveal when geometry injection helps versus harms performance.

Load-bearing premise

Clinically meaningful relations between classes can be captured as directional geometry targets in the teacher model without distorting the ground-truth signal.

What would settle it

Measure whether the reported accuracy gains disappear when the same method is applied to a medical dataset in which all inter-class relations have been randomly permuted or removed.

Figures

Figures reproduced from arXiv: 2606.04922 by Tran Dinh Tien, Zhiqiang Shen.

Figure 1
Figure 1. Figure 1: Comparison of prompt-learning pipelines: (a) CoOp (Zhou et al., 2022b), (b) BiomedCoOp (Koleilat [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Risk–coverage across four biomedical datasets. Our method yields consistently lower curves (smaller AURC) than BiomedCoOp (Koleilat et al., 2025), indicating fewer high-confidence errors at the same coverage and improved selective reliability. Biomedical imaging poses additional challenges due to complex visual features, diverse modalities, and heterogeneous acquisition devices. BiomedCoOp (Koleilat et al.… view at source ↗
Figure 3
Figure 3. Figure 3: From biomedical text prototypes, we build a class graph [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Wins analysis via KDE curves on the RETINA dataset. We compare BiomedCoOp and our method on challenging test samples that our method classifies correctly while BiomedCoOp is wrong (“wins”). Left: KDE of the logit margin z(y) − maxj̸=y z(j). Right: KDE of the probability assigned to the ground-truth class’s top-k neighbors in W (here k=3). The legend in each panel also reports Σ, the sum over all win sample… view at source ↗
Figure 5
Figure 5. Figure 5: Probability mass on disease classes (excluded GT) on Retina dataset. Each curve is a kernel density estimate over per-image quantity s(x) = P c∈D pθ(y = c | x), where D are the disease classes (Normal excluded). Legends report P x s(x) (Σ): total probability mass placed on disease classes across all images in the subset. Left: subset where both methods misclassify and the ground-truth class is a disease. R… view at source ↗
Figure 6
Figure 6. Figure 6: Few-shot evaluation averaged over 11 datasets under different teacher budgets (number of LLM prompt templates per class) with shots K ∈ {1, 2, 4, 8, 16}. The left y-axis shows absolute accuracies (%) for BiomedCoOp (Koleilat et al., 2025) and our method; the right y-axis and the green strip show the per-budget improvement ∆ = Ours − BiomedCoOp (%), annotated at each budget. Analysis on misclassified sample… view at source ↗
Figure 7
Figure 7. Figure 7: Effect of geometry strength γ on accuracy. As shown in [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
read the original abstract

Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating all other classes as equally incorrect, ignoring clinically meaningful class relations and yielding unstable decision boundaries in limited-supervision settings. We propose Omni-Geometry Knowledge Distillation (OGKD), a new framework that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop two distillation losses: Global Geometry-Aware Distillation (GAD) operates on the global image token, and Label-Guided Geometry Distillation (LGD) applies the same geometry to attentive patch tokens to improve fine-grained alignment. Across comprehensive experiments and analyses on 11 widely-used medical datasets for base-to-novel and few-shot evaluations, our OGKD achieves substantially better performance, consistently improving accuracy by an average absolute gain of 1.7%-2.8% over all prior state-of-the-art VLM adaptation counterparts. It also robustly generalizes to unseen classes and yields more reliable predictions than other approaches. Our code is available at https://github.com/tientrandinh/OGKD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Omni-Geometry Knowledge Distillation (OGKD) for prompt tuning of biomedical vision-language models. It constructs directional geometry targets in the teacher that are claimed to preserve ground-truth labels while encoding inter-class relations, then applies two losses—Global Geometry-Aware Distillation (GAD) on the global token and Label-Guided Geometry Distillation (LGD) on attentive patches—to improve adaptation under limited supervision. Experiments across 11 medical datasets report average absolute accuracy gains of 1.7–2.8% over prior SOTA VLM adaptation methods in base-to-novel and few-shot settings, plus improved generalization to unseen classes.

Significance. If the geometry targets can be shown to preserve ground truth without distortion, the framework would address a genuine limitation of standard distillation (equal treatment of incorrect classes) in data-scarce medical domains where class relations carry clinical meaning. The public code release supports reproducibility and is a clear strength.

major comments (3)
  1. [§3] §3 (Method): The construction of the directional geometry targets and the teacher model itself is not described in sufficient detail to verify the central claim that these targets keep the ground-truth class as the unique optimum while only modulating incorrect-class distances. Without the explicit encoding procedure or any analytic guarantee, it is impossible to assess whether the GAD and LGD losses optimize toward a faithful or a corrupted signal.
  2. [§4] §4 (Experiments): The reported 1.7–2.8% average gains are presented without error bars, without an ablation that isolates the geometry component from other design choices, and without any counter-example check that the injected geometry never elevates an incorrect class above the ground-truth class for any input. These omissions make the performance claim load-bearing on unreported experimental decisions.
  3. [Abstract, §3.2] Abstract and §3.2: The premise that clinically meaningful class relations exist and can be captured directionally without distorting ground truth is invoked to motivate the method, yet no verification (analytic or empirical) is supplied that the targets remain faithful across the 11 datasets.
minor comments (2)
  1. [§3] Notation for the geometry targets and the two losses (GAD, LGD) should be introduced with explicit equations rather than descriptive text only.
  2. [§4] Table captions and axis labels in the result figures should explicitly state whether the reported numbers are means over multiple runs or single-run values.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, clarifying aspects of the method and experiments while committing to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [§3] §3 (Method): The construction of the directional geometry targets and the teacher model itself is not described in sufficient detail to verify the central claim that these targets keep the ground-truth class as the unique optimum while only modulating incorrect-class distances. Without the explicit encoding procedure or any analytic guarantee, it is impossible to assess whether the GAD and LGD losses optimize toward a faithful or a corrupted signal.

    Authors: We agree that the current description in §3 would benefit from greater explicitness to allow full verification. The directional geometry targets are formed from the frozen teacher's class-level embeddings by computing a similarity-based geometry matrix (via normalized dot products) and then shifting the target logits such that the ground-truth class is fixed at the maximum value while incorrect-class entries are modulated proportionally to their embedding distances. The teacher is the unmodified pre-trained VLM. We will insert explicit pseudocode, a step-by-step encoding procedure, and a short argument showing that the ground-truth class remains the unique optimum under this construction. revision: yes

  2. Referee: [§4] §4 (Experiments): The reported 1.7–2.8% average gains are presented without error bars, without an ablation that isolates the geometry component from other design choices, and without any counter-example check that the injected geometry never elevates an incorrect class above the ground-truth class for any input. These omissions make the performance claim load-bearing on unreported experimental decisions.

    Authors: We concur that these elements would make the experimental claims more robust. In the revision we will report mean accuracy with standard error bars over multiple random seeds, add an ablation that removes only the geometry injection (keeping all other loss terms and hyperparameters fixed), and include a counter-example audit confirming that the constructed targets never assign a higher value to an incorrect class than to the ground-truth class on any sample from the 11 datasets. revision: yes

  3. Referee: [Abstract, §3.2] Abstract and §3.2: The premise that clinically meaningful class relations exist and can be captured directionally without distorting ground truth is invoked to motivate the method, yet no verification (analytic or empirical) is supplied that the targets remain faithful across the 11 datasets.

    Authors: The consistent gains and improved generalization to unseen classes across the 11 datasets provide supporting empirical evidence that the targets remain faithful in practice. To make this explicit, we will add a dedicated verification paragraph (with a table) showing that, for every test image in all evaluated datasets, the ground-truth class retains the highest target value after geometry injection. This directly addresses the request for verification. revision: yes

Circularity Check

0 steps flagged

No circularity; new loss construction presented as independent of fitted inputs

full rationale

The provided abstract and description introduce OGKD, GAD, and LGD as novel distillation losses that construct directional geometry targets from class relations. No equations, self-citations, or derivation steps are exhibited that reduce the targets, losses, or reported gains (1.7-2.8%) to quantities defined by the paper's own fitted parameters or prior self-referential results. The central claim rests on empirical evaluation across datasets rather than any self-definitional or fitted-input reduction. This is the expected non-finding for a methods paper whose contribution is a new loss formulation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the existence of extractable inter-class geometric structure that can be turned into directional teacher targets. No free parameters, axioms, or invented physical entities are explicitly introduced in the provided text.

pith-pipeline@v0.9.1-grok · 5763 in / 1233 out tokens · 30388 ms · 2026-06-28T06:31:46.240363+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

49 extracted references · 7 canonical work pages · 3 internal anchors

  1. [1]

    Dataset of breast ultrasound images

    Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy. Dataset of breast ultrasound images. Data in brief, 28: 0 104863, 2020

  2. [2]

    Learning to exploit temporal structure for biomedical vision-language processing

    Shruthi Bannur, Stephanie Hyland, Qianchu Liu, Fernando Perez-Garcia, Maximilian Ilse, Daniel C Castro, Benedikt Boecking, Harshita Sharma, Kenza Bouzid, Anja Thieme, et al. Learning to exploit temporal structure for biomedical vision-language processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15016--15027, 2023

  3. [3]

    Making the most of text semantics to improve biomedical vision--language processing

    Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. Making the most of text semantics to improve biomedical vision--language processing. In European conference on computer vision, pp.\ 1--21. Springer, 2022

  4. [4]

    Lung and colon cancer histopathological image dataset (LC25000),

    Andrew A Borkowski, Marilyn M Bui, L Brannon Thomas, Catherine P Wilson, Lauren A DeLand, and Stephen M Mastorides. Lung and colon cancer histopathological image dataset (lc25000). arXiv preprint arXiv:1912.12142, 2019

  5. [5]

    Knee osteoarthritis severity grading dataset

    Pingjun Chen. Knee osteoarthritis severity grading dataset. Mendeley Data, 1 0 (10.17632): 0 30784984, 2018

  6. [6]

    Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)

    Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:1902.03368, 2019

  7. [7]

    Does clip benefit visual question answering in the medical domain as much as it does in the general domain? arXiv preprint arXiv:2112.13906, 2021

    Sedigheh Eslami, Gerard De Melo, and Christoph Meinel. Does clip benefit visual question answering in the medical domain as much as it does in the general domain? arXiv preprint arXiv:2112.13906, 2021

  8. [8]

    Seeing unseen: Discover novel biomedical concepts via geometry-constrained probabilistic modeling

    Jianan Fan, Dongnan Liu, Hang Chang, Heng Huang, Mei Chen, and Weidong Cai. Seeing unseen: Discover novel biomedical concepts via geometry-constrained probabilistic modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 11524--11534, 2024

  9. [9]

    Model-agnostic meta-learning for fast adaptation of deep networks

    Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pp.\ 1126--1135. PMLR, 2017

  10. [10]

    Clip-adapter: Better vision-language models with feature adapters

    Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 132 0 (2): 0 581--595, 2024

  11. [11]

    Selective classification for deep neural networks

    Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. Advances in neural information processing systems, 30, 2017

  12. [12]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  13. [13]

    Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition

    Shih-Cheng Huang, Liyue Shen, Matthew P Lungren, and Serena Yeung. Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 3942--3951, 2021

  14. [14]

    Lp++: A surprisingly strong linear probe for few-shot clip

    Yunshi Huang, Fereshteh Shakeri, Jose Dolz, Malik Boudiaf, Houda Bahig, and Ismail Ben Ayed. Lp++: A surprisingly strong linear probe for few-shot clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 23773--23782, 2024

  15. [15]

    A visual--language foundation model for pathology image analysis using medical twitter

    Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. A visual--language foundation model for pathology image analysis using medical twitter. Nature medicine, 29 0 (9): 0 2307--2316, 2023

  16. [16]

    Vision transformer and explainable transfer learning models for auto detection of kidney cyst, stone and tumor from ct-radiography

    Md Nazmul Islam, Mehedi Hasan, Md Kabir Hossain, Md Golam Rabiul Alam, Md Zia Uddin, and Ahmet Soylu. Vision transformer and explainable transfer learning models for auto detection of kidney cyst, stone and tumor from ct-radiography. Scientific Reports, 12 0 (1): 0 11440, 2022

  17. [17]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pp.\ 4904--4916. PMLR, 2021

  18. [18]

    Learning generalizable prompt for clip with class similarity knowledge

    Sehun Jung and Hyang-won Lee. Learning generalizable prompt for clip with class similarity knowledge. arXiv preprint arXiv:2502.11969, 2025

  19. [19]

    Multi-class texture analysis in colorectal cancer histology

    Jakob Nikolas Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne M Melchers, Lothar R Schad, Timo Gaiser, Alexander Marx, and Frank Gerrit Z \"o llner. Multi-class texture analysis in colorectal cancer histology. Scientific reports, 6 0 (1): 0 1--11, 2016

  20. [20]

    Identifying medical diagnoses and treatable diseases by image-based deep learning

    Daniel S Kermany, Michael Goldbaum, Wenjia Cai, Carolina CS Valentim, Huiying Liang, Sally L Baxter, Alex McKeown, Ge Yang, Xiaokang Wu, Fangbing Yan, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. cell, 172 0 (5): 0 1122--1131, 2018

  21. [21]

    Maple: Multi-modal prompt learning

    Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 19113--19122, 2023 a

  22. [22]

    Self-regulating prompts: Foundational model adaptation without forgetting

    Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 15190--15200, 2023 b

  23. [23]

    Automatic no-reference quality assessment for retinal fundus images using vessel segmentation

    Thomas K \"o hler, Attila Budai, Martin F Kraus, Jan Odstr c ilik, Georg Michelson, and Joachim Hornegger. Automatic no-reference quality assessment for retinal fundus images using vessel segmentation. In Proceedings of the 26th IEEE international symposium on computer-based medical systems, pp.\ 95--100. IEEE, 2013

  24. [24]

    Biomedcoop: Learning to prompt for biomedical vision-language models

    Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, and Yiming Xiao. Biomedcoop: Learning to prompt for biomedical vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp.\ 14766--14776, 2025

  25. [25]

    Current perspectives in medical image perception

    Elizabeth A Krupinski. Current perspectives in medical image perception. Attention, Perception, & Psychophysics, 72 0 (5): 0 1205--1217, 2010

  26. [26]

    Graphadapter: Tuning vision-language models with dual knowledge graph

    Xin Li, Dongze Lian, Zhihe Lu, Jiawang Bai, Zhibo Chen, and Xinchao Wang. Graphadapter: Tuning vision-language models with dual knowledge graph. Advances in Neural Information Processing Systems, 36: 0 13448--13466, 2023

  27. [27]

    Promptkd: Unsupervised prompt distillation for vision-language models

    Zheng Li, Xiang Li, Xinyi Fu, Xin Zhang, Weiqiang Wang, Shuo Chen, and Jian Yang. Promptkd: Unsupervised prompt distillation for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 26617--26626, 2024

  28. [28]

    A survey on deep learning in medical image analysis

    Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I S \'a nchez. A survey on deep learning in medical image analysis. Medical image analysis, 42: 0 60--88, 2017

  29. [29]

    A visual-language foundation model for computational pathology

    Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, et al. A visual-language foundation model for computational pathology. Nature medicine, 30 0 (3): 0 863--874, 2024

  30. [30]

    Joint learning of localized representations from medical images and reports

    Philip M \"u ller, Georgios Kaissis, Congyu Zou, and Daniel Rueckert. Joint learning of localized representations from medical images and reports. In European conference on computer vision, pp.\ 685--701. Springer, 2022

  31. [31]

    Brain tumor mri dataset

    Msoud Nickparvar. Brain tumor mri dataset. Kaggle, 2021

  32. [32]

    Quilt-1m: One million image-text pairs for histopathology

    Wisdom Oluchi Ikezogwo, Mehmet Saygin Seyfioglu, Fatemeh Ghezloo, Dylan Stefan Chan Geva, Fatwir Sheikh Mohammed, Pavan Kumar Anand, Ranjay Krishna, and Linda Shapiro. Quilt-1m: One million image-text pairs for histopathology. arXiv e-prints, pp.\ arXiv--2306, 2023

  33. [33]

    Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection

    Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, et al. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM on Multimedia Systems Conference,...

  34. [34]

    Indian diabetic retinopathy image dataset (idrid): a database for diabetic retinopathy screening research

    Prasanna Porwal, Samiksha Pachade, Ravi Kamble, Manesh Kokare, Girish Deshmukh, Vivek Sahasrabuddhe, and Fabrice Meriaudeau. Indian diabetic retinopathy image dataset (idrid): a database for diabetic retinopathy screening research. Data, 3 0 (3): 0 25, 2018

  35. [35]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PmLR, 2021

  36. [36]

    A foundation language-image model of the retina (flair): Encoding expert knowledge in text supervision

    Julio Silva-Rodriguez, Hadi Chakor, Riadh Kobbi, Jose Dolz, and Ismail Ben Ayed. A foundation language-image model of the retina (flair): Encoding expert knowledge in text supervision. Medical Image Analysis, 99: 0 103357, 2025

  37. [37]

    Prototypical networks for few-shot learning

    Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017

  38. [38]

    Covid-19 infection localization and severity grading from chest x-ray images

    Anas M Tahir, Muhammad EH Chowdhury, Amith Khandakar, Tawsifur Rahman, Yazan Qiblawey, Uzair Khurshid, Serkan Kiranyaz, Nabil Ibtehaz, M Sohel Rahman, Somaya Al-Maadeed, et al. Covid-19 infection localization and severity grading from chest x-ray images. Computers in biology and medicine, 139: 0 105002, 2021

  39. [39]

    The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions

    Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data, 5 0 (1): 0 1--9, 2018

  40. [40]

    Matching networks for one shot learning

    Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016

  41. [41]

    Medclip: Contrastive learning from unpaired medical images and text

    Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, volume 2022, pp.\ 3876, 2022

  42. [42]

    Clip-kd: An empirical study of clip model distillation

    Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, and Yongjun Xu. Clip-kd: An empirical study of clip model distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15952--15962, 2024

  43. [43]

    Visual-language prompt tuning with knowledge-guided context optimization

    Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 6757--6767, 2023

  44. [44]

    Lit: Zero-shot transfer with locked-image text tuning

    Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 18123--18133, 2022

  45. [45]

    Tip-adapter: Training-free clip-adapter for better vision-language modeling

    Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021

  46. [46]

    BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

    Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915, 2023

  47. [47]

    Conditional prompt learning for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 16816--16825, 2022 a

  48. [48]

    Learning to prompt for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130 0 (9): 0 2337--2348, 2022 b

  49. [49]

    Prompt-aligned gradient for prompt tuning

    Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. Prompt-aligned gradient for prompt tuning. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 15659--15669, 2023