CLIP-SVD: Efficient and Interpretable Vision-Language Adaptation via Singular Values

arxiv: 2509.03740 · v3 · submitted 2025-09-03 · 💻 cs.CV · cs.CL

Taha Koleilat, Hassan Rivaz, Yiming Xiao

Pith reviewed 2026-05-18 18:52 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords CLIP adaptation, singular value decomposition, parameter-efficient fine-tuning, few-shot learning, vision-language models, biomedical image classification, model interpretability, SVD-based tuning

The pith

Updating only the singular values of CLIP weight matrices adapts the model to new domains using 0.04% of total parameters while preserving generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that CLIP can be adapted to fine-grained domains by decomposing its weight matrices with SVD and then tuning only the singular values. This approach avoids prompt engineering or added adapter modules that might overwrite pretrained knowledge or destabilize the model. A sympathetic reader would care because full fine-tuning is costly and current methods often trade off stability or performance for adaptation. The method delivers state-of-the-art few-shot classification on 11 natural and 10 biomedical datasets while using far fewer parameters and enabling natural-language analysis of the changes.

Core claim

CLIP-SVD introduces Singular Value Fine-tuning (SVF) that decomposes each pretrained weight matrix via SVD and then optimizes only the singular values to rescale the existing basis vectors for the target domain. The singular vectors remain fixed and no new modules are introduced, so adaptation uses just 0.04% of the model's parameters. This yields higher accuracy and better generalization than prior adaptation techniques on 21 datasets under few-shot conditions and supports interpretability by tracing adaptation dynamics through language queries.

What carries the argument

Singular Value Fine-tuning (SVF), the operation of adjusting only the diagonal singular values after SVD decomposition to rescale pretrained basis vectors without altering their directions or adding parameters.
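
A minimal sketch of the singular-value-only update described above, written in NumPy on a toy regression problem rather than on CLIP. The matrix shape, the toy data, the squared-error loss, the learning rate, and the plain gradient-descent loop are all assumptions made for illustration; the paper's actual training setup is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one pretrained weight matrix (hypothetical 8x6 shape).
W = rng.normal(size=(8, 6))

# Thin SVD: W = U @ diag(s) @ Vt. U and Vt are frozen from here on.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Toy task (assumption): targets come from the same basis with rescaled
# singular values, so a singular-value-only update can fit them exactly.
s_target = s * rng.uniform(0.5, 1.5, size=s.shape)
X = rng.normal(size=(64, 6))
Y = X @ ((U * s_target) @ Vt).T


def loss_and_grad(s_cur):
    """Squared-error loss and its gradient with respect to s only."""
    W_cur = (U * s_cur) @ Vt                  # U @ diag(s_cur) @ Vt
    R = X @ W_cur.T - Y                       # residuals, shape (64, 8)
    loss = 0.5 * np.mean(np.sum(R ** 2, axis=1))
    G = R.T @ X / X.shape[0]                  # dLoss/dW, shape (8, 6)
    grad_s = np.einsum("ik,ij,kj->k", U, G, Vt)  # diagonal of U^T G V
    return loss, grad_s


s_tuned = s.copy()
initial_loss, _ = loss_and_grad(s_tuned)
for _ in range(300):
    _, grad_s = loss_and_grad(s_tuned)
    s_tuned -= 0.5 * grad_s                   # only 6 numbers are updated
final_loss, _ = loss_and_grad(s_tuned)

W_adapted = (U * s_tuned) @ Vt                # adapted weight, same basis
print(f"trainable: {s_tuned.size} of {W.size} entries")
print(f"loss: {initial_loss:.4f} -> {final_loss:.2e}")
```

The toy target is deliberately reachable by rescaling alone. On a real domain shift, whether rescaling is enough is exactly the premise the paper tests empirically.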

If this is right

The adapted model retains more of the original CLIP generalization than methods that insert new components.
Natural-language probing becomes a practical tool for inspecting what changes during domain adaptation.
The same singular-value mechanism works on both everyday images and biomedical scans without custom redesign.
Adaptation becomes feasible on hardware with limited memory or compute because only a tiny parameter subset is updated.
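
To make the parameter saving concrete, the arithmetic can be sketched as follows. An m x n matrix has min(m, n) singular values, so tuning only those replaces m * n trainable weights with min(m, n). The layer shapes below are hypothetical placeholders, not CLIP's actual configuration, and the helper name `svf_trainable_fraction` is invented for this example. The paper's 0.04% figure is measured against the whole model and depends on which matrices are selected, so this sketch does not attempt to reproduce it.

```python
def svf_trainable_fraction(shapes):
    """Return (trainable, total) when only singular values are tuned.

    Each (rows, cols) matrix contributes min(rows, cols) singular values.
    """
    trainable = sum(min(rows, cols) for rows, cols in shapes)
    total = sum(rows * cols for rows, cols in shapes)
    return trainable, total


# Hypothetical transformer block: four 512x512 attention projections
# plus two MLP matrices. These shapes are illustrative assumptions.
shapes = [(512, 512)] * 4 + [(2048, 512), (512, 2048)]
trainable, total = svf_trainable_fraction(shapes)
print(f"{trainable} of {total} weights ({100 * trainable / total:.3f}%)")
```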

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same SVF approach could be tested on other vision-language or vision-only transformers to see whether singular values alone suffice for domain shift in those architectures.
In medical imaging pipelines, keeping the original basis vectors fixed might reduce the risk of losing rare but critical features learned during large-scale pretraining.
Tracking which singular values change most during adaptation could offer a lightweight way to quantify how much a new domain differs from the pretraining distribution.
Combining SVF with a small number of prompt tokens might produce further gains if the paper's claim that singular values capture the bulk of domain knowledge holds.
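
The idea of tracking which singular values change most could be prototyped along these lines. This is a sketch of the editorial suggestion above, not an analysis from the paper: the relative-change metric, the function name `rank_singular_value_shifts`, and the example values are all assumptions.

```python
def rank_singular_value_shifts(before, after, top_k=3):
    """Rank singular-value indices by relative change |after - before| / before."""
    shifts = [
        (index, abs(new - old) / old)
        for index, (old, new) in enumerate(zip(before, after))
        if old > 0  # skip zero singular values to avoid division by zero
    ]
    shifts.sort(key=lambda pair: pair[1], reverse=True)
    return shifts[:top_k]


# Hypothetical singular values of one matrix before and after adaptation.
before = [4.0, 2.0, 1.0, 0.5]
after = [4.2, 1.0, 1.0, 0.6]
top = rank_singular_value_shifts(before, after, top_k=2)
print([index for index, _ in top])
```

Averaging the relative shifts over all adapted matrices would give one crude scalar per target domain, which is the kind of lightweight domain-distance proxy the bullet above speculates about.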

Load-bearing premise

Domain-specific knowledge needed for adaptation lives mainly in the scaling factors of the existing basis vectors rather than in their directions or in entirely new features.

What would settle it

A held-out dataset where full fine-tuning or adapter methods produce clearly higher accuracy or better generalization than singular-value-only tuning on the same CLIP backbone.

Figures

Figures reproduced from arXiv: 2509.03740 by Hassan Rivaz, Taha Koleilat, Yiming Xiao.

**Figure 2.** 4-shot performance by freezing certain layers during finetuning.

Original abstract

Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a multi-modal and parameter-efficient adaptation framework that applies Singular Value Fine-tuning (SVF) to CLIP, leveraging Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model's total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. Overall, this work provides the first extensive empirical evaluation of SVD-based finetuning in the vision-language model setting. The code and biomedical corpus are publicly available at https://github.com/HealthX-Lab/CLIP-SVD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

CLIP-SVD gets decent few-shot gains on natural and medical datasets by updating only singular values, but the key claim needs a direct control against updating vectors too.

The core finding is that fine-tuning just the singular values of CLIP's weight matrices, while freezing the singular vectors, lets the authors adapt the model to new domains with only 0.04% of the parameters. They report better accuracy and less forgetting than prompt tuning or adapters across 11 natural and 10 biomedical datasets in few-shot settings, plus a natural-language probe for interpretability. The code and biomedical corpus are released, which helps reproducibility.

Referee Report

3 major / 1 minor

Summary. The paper introduces CLIP-SVD, a parameter-efficient adaptation method for vision-language models like CLIP. It applies Singular Value Fine-tuning (SVF) by decomposing pretrained weight matrices via SVD and updating only the singular values to rescale the basis vectors for domain adaptation while keeping singular vectors fixed and adding no new modules. The approach is reported to use 0.04% of total parameters, achieve state-of-the-art few-shot classification accuracy on 11 natural and 10 biomedical datasets, improve generalization over prior prompt- and adapter-based methods, and provide interpretability through natural-language analysis of adaptation dynamics. Code and a biomedical corpus are released publicly.

Significance. If the empirical results hold under more rigorous validation, the work would establish SVD-based fine-tuning as a lightweight, module-free alternative for adapting large VLMs that preserves the pretrained singular-vector basis. The public code release and biomedical corpus constitute reproducible assets that could support follow-up studies on efficient adaptation and interpretability in vision-language settings.

major comments (3)

[Abstract] Abstract: the SOTA claim on 21 datasets provides no error bars, standard deviations, or explicit details on the number of shots and train/test splits employed in the few-shot protocol. These omissions prevent assessment of whether the reported gains over baselines are statistically reliable or sensitive to experimental choices.
[SVF design paragraph] SVF design (Abstract and method description): the central assertion that rescaling only the singular values while retaining pretrained singular vectors suffices for domain adaptation (natural or biomedical) is not supported by a controlled ablation that perturbs singular vectors at matched parameter budget. Without this comparison, it remains possible that performance differences arise from training hyperparameters, implicit regularization, or dataset selection rather than the SVF mechanism itself.
[Method] Method section on matrix selection: no ablation or justification is given for the specific choice of which CLIP weight matrices receive SVF updates. This choice directly affects both the 0.04% parameter count and the adaptation quality, yet its impact is not quantified.

minor comments (1)

[Abstract] Abstract: the phrase 'multi-modal and parameter-efficient adaptation framework' could be clarified to specify how the language and vision branches are jointly handled during SVF, as the description focuses primarily on weight-matrix updates.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate the revisions we will make to improve the manuscript.

Referee: [Abstract] Abstract: the SOTA claim on 21 datasets provides no error bars, standard deviations, or explicit details on the number of shots and train/test splits employed in the few-shot protocol. These omissions prevent assessment of whether the reported gains over baselines are statistically reliable or sensitive to experimental choices.

Authors: We agree that reporting variability and protocol details is essential for assessing reliability. In the revised manuscript we will add mean accuracies with standard deviations over three random seeds for all 21 datasets. We will also explicitly state the few-shot protocol (shots per class, train/test split ratios, and sampling procedure) in both the abstract and experimental section so readers can evaluate statistical robustness and sensitivity to choices. revision: yes
Referee: [SVF design paragraph] SVF design (Abstract and method description): the central assertion that rescaling only the singular values while retaining pretrained singular vectors suffices for domain adaptation (natural or biomedical) is not supported by a controlled ablation that perturbs singular vectors at matched parameter budget. Without this comparison, it remains possible that performance differences arise from training hyperparameters, implicit regularization, or dataset selection rather than the SVF mechanism itself.

Authors: We appreciate the request for a controlled ablation. The design intentionally keeps singular vectors fixed to preserve the pretrained basis directions while only rescaling magnitudes; this is the core hypothesis. A matched-budget ablation that perturbs vectors would require a fundamentally different update rule and additional experiments outside the current scope. We will add a paragraph in the method section providing theoretical motivation for preserving the vectors and will note that future work could explore vector perturbation under equivalent budgets. Existing comparisons to prompt- and adapter-based methods already isolate the benefit of the SVF mechanism under the same training protocol. revision: partial
Referee: [Method] Method section on matrix selection: no ablation or justification is given for the specific choice of which CLIP weight matrices receive SVF updates. This choice directly affects both the 0.04% parameter count and the adaptation quality, yet its impact is not quantified.

Authors: We agree that explicit justification and quantification are needed. In the revised method section we will explain the selection of weight matrices in the attention and MLP blocks of both vision and text encoders, as these layers dominate parameter count and feature transformation. We will also include a small ablation table comparing SVF applied to different matrix subsets, reporting resulting parameter counts and accuracy on a representative subset of datasets to quantify the trade-off. revision: yes

Circularity Check

0 steps flagged

Empirical adaptation method validated on external datasets with no circular reduction

The paper introduces CLIP-SVD as a practical parameter-efficient fine-tuning approach that updates only the singular values of pretrained CLIP weight matrices while keeping singular vectors fixed. All reported results consist of accuracy and generalization metrics on held-out natural and biomedical classification datasets, compared against prior adaptation baselines. No equations, derivations, or first-principles claims appear in the provided text that would make the performance numbers equivalent to the method's own inputs or fitted hyperparameters by construction. The design choice of SVF is presented as an explicit modeling decision rather than a derived necessity, and success is assessed through standard empirical protocols independent of any self-referential loop.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the empirical observation that singular values alone can carry domain adaptation signal; no new mathematical axioms or invented physical entities are introduced.

free parameters (1)

learning rate and number of SVF iterations
Standard optimizer hyperparameters that must be chosen or tuned for each dataset; their values are not reported in the abstract.

axioms (1)

domain assumption: Pretrained CLIP weight matrices admit a stable SVD that can be recomputed and updated without numerical instability.
Invoked when the method decomposes and then modifies only the singular values.
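
One way to probe this assumption for a given matrix is a decompose-and-reconstruct round trip. The sketch below uses a random float32 matrix as a stand-in for a pretrained weight; the helper name `svd_roundtrip_report`, the shape, and the choice of relative Frobenius error and condition number as diagnostics are assumptions for illustration, not checks described in the paper.

```python
import numpy as np


def svd_roundtrip_report(W):
    """Return (relative reconstruction error, condition number) for W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W_rec = (U * s) @ Vt                      # U @ diag(s) @ Vt
    rel_err = float(np.linalg.norm(W - W_rec) / np.linalg.norm(W))
    cond = float(s[0] / s[-1]) if s[-1] > 0 else float("inf")
    return rel_err, cond


rng = np.random.default_rng(0)
# Hypothetical 64x32 float32 matrix standing in for a pretrained weight.
W32 = rng.normal(size=(64, 32)).astype(np.float32)
rel_err, cond = svd_roundtrip_report(W32)
print(f"relative error: {rel_err:.2e}, condition number: {cond:.2f}")
```

A small relative error indicates that the factorization reproduces the original weight at working precision, so an unadapted model behaves the same before and after decomposition.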

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: "CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets... using only 0.04% of the model's total parameters"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

89 extracted references · 89 canonical work pages · 8 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 9, 19

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Intrinsic dimensionality explains the effective- ness of language model fine-tuning

Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effective- ness of language model fine-tuning. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7319–7328, 2021. 3

work page 2021
[3]

Dataset of breast ultrasound images.Data in brief, 28:104863, 2020

Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy. Dataset of breast ultrasound images.Data in brief, 28:104863, 2020. 6, 16, 18

work page 2020
[4]

Proker: A kernel perspective on few-shot adaptation of large vision-language models

Yassir Bendou, Amine Ouasfi, Vincent Gripon, and Adnane Boukhayma. Proker: A kernel perspective on few-shot adaptation of large vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 25092–25102, 2025. 7, 8

work page 2025
[5]

Xcoop: Explainable prompt learning for computer- aided diagnosis via concept-guided context optimization

Yequan Bie, Luyang Luo, Zhixuan Chen, and Hao Chen. Xcoop: Explainable prompt learning for computer- aided diagnosis via concept-guided context optimization. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 773–783. Springer, 2024. 2, 3, 8

work page 2024
[6]

Making the most of text semantics to improve biomedical vision–language processing

Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. Making the most of text semantics to improve biomedical vision–language processing. InEuropean conference on computer vision, pages 1–21. Springer, 2022. 3

work page 2022
[7]

Borkowski, Marilyn M

Andrew A. Borkowski, Marilyn M. Bui, L. Brannon Thomas, Catherine P. Wilson, Lauren A. DeLand, and Stephen M. Mastorides. Lung and colon cancer histopathological image dataset (lc25000), 2019. 6, 16, 18

work page 2019
[8]

Food-101–mining discriminative components with random forests

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. InECCV, pages 446–461. Springer, 2014. 6, 16

work page 2014
[9]

Domain-controlled prompt learning

Qinglong Cao, Zhengqin Xu, Yuntian Chen, Chao Ma, and Xiaokang Yang. Domain-controlled prompt learning. InProceedings of the AAAI Conference on Artificial Intelligence, pages 936–944, 2024. 3, 8

work page 2024
[10]

Knee osteoarthritis severity grading dataset, 2018

Pingjun Chen. Knee osteoarthritis severity grading dataset, 2018. 6, 16, 18

work page 2018
[11]

gscorecam: What objects is clip looking at? InProceedings of the Asian Conference on Computer Vision, pages 1959–1975, 2022

Peijie Chen, Qi Li, Saad Biaz, Trung Bui, and Anh Nguyen. gscorecam: What objects is clip looking at? InProceedings of the Asian Conference on Computer Vision, pages 1959–1975, 2022. 21

work page 1959
[12]

Adapt- former: Adapting vision transformers for scalable visual recognition.Advances in Neural Information Processing Systems, 35:16664–16678, 2022

Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adapt- former: Adapting vision transformers for scalable visual recognition.Advances in Neural Information Processing Systems, 35:16664–16678, 2022. 3

work page 2022
[13]

Describing textures in the wild

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. InCVPR, pages 3606–3613, 2014. 6, 16

work page 2014
[14]

Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)

Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic).arXiv preprint arXiv:1902.03368, 2019. 16

work page internal anchor Pith review Pith/arXiv arXiv 2018
[15]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, pages 248–255. Ieee, 2009. 6, 16, 18

work page 2009
[16]

Does clip benefit visual question answering in the medical domain as much as it does in the general domain?, 2021

Sedigheh Eslami, Gerard de Melo, and Christoph Meinel. Does clip benefit visual question answering in the medical domain as much as it does in the general domain?, 2021. 3

work page 2021
[17]

Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. InCVPR Workshop, pages 178–178. IEEE, 2004. 6, 16 11

work page 2004
[18]

Interpreting clip’s image representation via text-based decomposition.arXiv preprint arXiv:2310.05916, 2023

Yossi Gandelsman, Alexei A Efros, and Jacob Steinhardt. Interpreting clip’s image representation via text-based decomposition.arXiv preprint arXiv:2310.05916, 2023. 2, 5, 9, 18, 19, 21, 22, 23, 24, 25

work page arXiv 2023
[19]

Clip-adapter: Better vision-language models with feature adapters.International Journal of Computer Vision, 132(2):581–595, 2024

Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters.International Journal of Computer Vision, 132(2):581–595, 2024. 1, 3, 7, 8, 19

work page 2024
[20]

Imagenet auto-annotation with segmentation propagation.International Journal of Computer Vision, 110(3):328–348, 2014

Matthieu Guillaumin, Daniel Küttel, and Vittorio Ferrari. Imagenet auto-annotation with segmentation propagation.International Journal of Computer Vision, 110(3):328–348, 2014. 21

work page 2014
[21]

Parameter-efficient transfer learning with diff pruning

Demi Guo, Alexander M Rush, and Yoon Kim. Parameter-efficient transfer learning with diff pruning. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4884–4896, 2021. 3

work page 2021
[22]

Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.J-STARS, 12(7):2217–2226, 2019

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.J-STARS, 12(7):2217–2226, 2019. 6, 16

work page 2019
[23]

The many faces of robustness: A critical analysis of out-of-distribution generalization

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. InICCV, pages 8340–8349, 2021. 16

work page 2021
[24]

Natural adversarial examples

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. InCVPR, pages 15262–15271, 2021. 16

work page 2021
[25]

Nxmtransformer: semi-structured sparsification for natural language understanding via admm.Advances in neural information processing systems, 34: 1818–1830, 2021

Connor Holmes, Minjia Zhang, Yuxiong He, and Bo Wu. Nxmtransformer: semi-structured sparsification for natural language understanding via admm.Advances in neural information processing systems, 34: 1818–1830, 2021. 3

work page 2021
[26]

Parameter-efficient transfer learning for nlp

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Ges- mundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. InInternational conference on machine learning, pages 2790–2799. PMLR, 2019. 3

work page 2019
[27]

Lora: Low-rank adaptation of large language models

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2021. 3

work page 2021
[28]

Lp++: A surprisingly strong linear probe for few-shot clip

Yunshi Huang, Fereshteh Shakeri, Jose Dolz, Malik Boudiaf, Houda Bahig, and Ismail Ben Ayed. Lp++: A surprisingly strong linear probe for few-shot clip. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23773–23782, 2024. 3, 7, 8

work page 2024
[29]

Vision transformer and explainable transfer learning models for auto detection of kidney cyst, stone and tumor from ct-radiography.Scientific Reports, 12(1):1–14, 2022

Md Nazmul Islam, Mehedi Hasan, Md Kabir Hossain, Md Golam Rabiul Alam, Md Zia Uddin, and Ahmet Soylu. Vision transformer and explainable transfer learning models for auto detection of kidney cyst, stone and tumor from ct-radiography.Scientific Reports, 12(1):1–14, 2022. 6, 16, 18

work page 2022
[30]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR, 2021. 3

work page 2021
[31]

Visual prompt tuning

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. InEuropean Conference on Computer Vision, pages 709–727. Springer, 2022. 3

work page 2022
[32]

Compacter: Efficient low-rank hypercomplex adapter layers

Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient low-rank hypercomplex adapter layers. InAdvances in Neural Information Processing Systems, pages 1022–1035. Curran Associates, Inc., 2021. 3

work page 2021
[33]

Multi-class texture analysis in colorectal cancer histology.Scientific reports, 6(1):1–11, 2016

Jakob Nikolas Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne M Melchers, Lothar R Schad, Timo Gaiser, Alexander Marx, and Frank Gerrit Zöllner. Multi-class texture analysis in colorectal cancer histology.Scientific reports, 6(1):1–11, 2016. 6, 16, 18

work page 2016
[34]

Kermany, Michael Goldbaum, et al

Daniel S. Kermany, Michael Goldbaum, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning.Cell, 172(5):1122 – 1131.e9, 2018. 6, 16, 18

work page 2018
[35]

Maple: Multi-modal prompt learning

Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19113–19122, 2023. 1, 2, 3, 7, 8, 19, 32, 33 12

work page 2023
[36]

Self-regulating prompts: Foundational model adaptation without forgetting

Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15190–15200, 2023. 1, 3

work page 2023
[37]

Medclip-sam: Bridging text and image towards universal medical image segmentation

Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, and Yiming Xiao. Medclip-sam: Bridging text and image towards universal medical image segmentation. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 643–653. Springer, 2024. 3

work page 2024
[38]

Medclip-samv2: Towards universal text-driven medical image segmentation.Medical Image Analysis, page 103749, 2025

Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, and Yiming Xiao. Medclip-samv2: Towards universal text-driven medical image segmentation.Medical Image Analysis, page 103749, 2025. 3

work page 2025
[39]

Biomedcoop: Learning to prompt for biomedical vision-language models

Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, and Yiming Xiao. Biomedcoop: Learning to prompt for biomedical vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14766–14776, 2025. 2, 3, 6, 8, 16, 33

work page 2025
[40]

3d object representations for fine-grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. InICCV, pages 554–561, 2013. 6, 16

work page 2013
[41]

Automatic no-reference quality assessment for retinal fundus images using vessel segmentation, 2013

Thomas Köhler, Attila Budai, Martin Kraus, Jan Odstrcilik, Georg Michelson, and Joachim Hornegger. Automatic no-reference quality assessment for retinal fundus images using vessel segmentation, 2013. 6, 16, 18

work page 2013
[42]

The Power of Scale for Parameter-Efficient Prompt Tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021. 3

work page internal anchor Pith review Pith/arXiv arXiv 2021
[43]

Measuring the intrinsic dimension of objective landscapes

Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. InInternational Conference on Learning Representations, 2018. 3

work page 2018
[44]

Prefix-tuning: Optimizing continuous prompts for generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582– 4597, 2021. 3

work page 2021
[45]

Scaling down to scale up: A guide to parameter- efficient fine-tuning.arXiv preprint arXiv:2303.15647, 2023

Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. Scaling down to scale up: A guide to parameter- efficient fine-tuning.arXiv preprint arXiv:2303.15647, 2023. 2

work page arXiv 2023
[46]

Scaling & shifting your features: A new baseline for efficient model tuning.Advances in Neural Information Processing Systems, 35:109–123,

Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning.Advances in Neural Information Processing Systems, 35:109–123,

work page
[47]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 6

work page internal anchor Pith review Pith/arXiv arXiv 2017
[48]

Fine-Grained Visual Classification of Aircraft

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft.arXiv preprint arXiv:1306.5151, 2013. 6, 16

work page internal anchor Pith review Pith/arXiv arXiv 2013
[49]

Pissa: Principal singular values and singular vectors adaptation of large language models

Fanxu Meng, Zhaohui Wang, and Muhan Zhang. Pissa: Principal singular values and singular vectors adaptation of large language models. Advances in Neural Information Processing Systems, 37:121038–121072, 2024. 1, 3

[50]

Brain tumor mri dataset

Msoud Nickparvar. Brain tumor mri dataset, 2021. 6, 16, 18

[51]

Automated flower classification over a large number of classes

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In ICVGIP, pages 722–729. IEEE, 2008. 6, 16

[52]

Cats and dogs

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In CVPR, pages 3498–3505. IEEE, 2012. 6, 16

[53]

Sam-parser: Fine-tuning sam efficiently by parameter space reconstruction

Zelin Peng, Zhengqin Xu, Zhilin Zeng, Xiaokang Yang, and Wei Shen. Sam-parser: Fine-tuning sam efficiently by parameter space reconstruction. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4515–4523, 2024. 3

[54]

Adapterfusion: Non-destructive task composition for transfer learning

Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. Adapterfusion: Non-destructive task composition for transfer learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 487–503, 2021. 3

[55]

Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection

Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM on Mu...

[56]

Indian diabetic retinopathy image dataset (idrid)

Prasanna Porwal, Samiksha Pachade, Ravi Kamble, Manesh Kokare, Girish Deshmukh, Vivek Sahasrabuddhe, and Fabrice Meriaudeau. Indian diabetic retinopathy image dataset (idrid), 2018. 6, 16, 18

[57]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 1, 3, 7, 8, 19, 32, 33

[59]

Groundingdino-us-sam: Text-prompted multi-organ segmentation in ultrasound with lora-tuned vision-language models

Hamza Rasaee, Taha Koleilat, and Hassan Rivaz. Groundingdino-us-sam: Text-prompted multi-organ segmentation in ultrasound with lora-tuned vision-language models. arXiv preprint arXiv:2506.23903, 2025.

[60]

Learning multiple visual domains with residual adapters

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. Advances in neural information processing systems, 30, 2017. 3

[61]

Do imagenet classifiers generalize to imagenet?

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In ICML, pages 5389–5400. PMLR, 2019. 16

[62]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 6, 16

[63]

Textsam-eus: Text prompt learning for sam to accurately segment pancreatic tumor in endoscopic ultrasound

Pascal Spiegler, Taha Koleilat, Arash Harirpoush, Corey S Miller, Hassan Rivaz, Marta Kersten-Oertel, and Yiming Xiao. Textsam-eus: Text prompt learning for sam to accurately segment pancreatic tumor in endoscopic ultrasound. arXiv preprint arXiv:2507.18082, 2025. 3

[64]

Singular value fine-tuning: Few-shot segmentation requires few-parameters fine-tuning

Yanpeng Sun, Qiang Chen, Xiangyu He, Jian Wang, Haocheng Feng, Junyu Han, Errui Ding, Jian Cheng, Zechao Li, and Jingdong Wang. Singular value fine-tuning: Few-shot segmentation requires few-parameters fine-tuning. Advances in neural information processing systems, 35:37484–37496, 2022. 1, 2, 3

[65]

Covid-19 infection localization and severity grading from chest x-ray images

Anas M. Tahir, Muhammad E.H. Chowdhury, Amith Khandakar, Tawsifur Rahman, Yazan Qiblawey, Uzair Khurshid, Serkan Kiranyaz, Nabil Ibtehaz, M. Sohel Rahman, Somaya Al-Maadeed, Sakib Mahmud, Maymouna Ezeddin, Khaled Hameed, and Tahir Hamid. Covid-19 infection localization and severity grading from chest x-ray images. Computers in Biology and Medicine, 139:105...

[66]

The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions

Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data, page 180161, 2018. 16

[67]

Visualizing data using t-sne

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86):2579–2605, 2008. 19

[68]

Learning robust global representations by penalizing local predictive power

Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In NeurIPS, 2019. 16

[69]

A hard-to-beat baseline for training-free clip-based adaptation

Zhengbo Wang, Jian Liang, Lijun Sheng, Ran He, Zilei Wang, and Tieniu Tan. A hard-to-beat baseline for training-free clip-based adaptation. arXiv preprint arXiv:2402.04087, 2024. 7, 8

[70]

Sun database: Large-scale scene recognition from abbey to zoo

Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, pages 3485–3492. IEEE, 2010. 6, 16

[71]

Advances in medical image segmentation: A comprehensive review of traditional, deep learning and hybrid approaches

Yan Xu, Rixiang Quan, Weiting Xu, Yi Huang, Xiaolong Chen, and Fengyuan Liu. Advances in medical image segmentation: A comprehensive review of traditional, deep learning and hybrid approaches. Bioengineering, 11(10):1034, 2024. 3

[72]

Visual-language prompt tuning with knowledge-guided context optimization

Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization, 2023. 7, 19, 32, 33

[73]

Visual-language prompt tuning with knowledge-guided context optimization

Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6757–6767, 2023. 8

[74]

Tcp: Textual-based class-aware prompt tuning for visual-language model

Hantao Yao, Rui Zhang, and Changsheng Xu. Tcp: Textual-based class-aware prompt tuning for visual-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23438–23448, 2024. 7

[75]

Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models

Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–9, 2022. 2

[76]

Low-rank few-shot adaptation of vision-language models

Maxime Zanella and Ismail Ben Ayed. Low-rank few-shot adaptation of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1593–1603, 2024.

[77]

AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adalora: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512, 2023. 7

[78]

Tip-adapter: Training-free clip-adapter for better vision-language modeling

Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021. 3, 7, 8, 19

[79]

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023. 3

[80]

Biomedclip: a multimodal biomedical foundation model pretrained from f...

Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Andrea Tupini, Yu Wang, Matt Mazzola, Swadheen Shukla, Lars Liden, Jianfeng Gao, Matthew P. Lungren, Tristan Naumann, Sheng Wang, and Hoifung Poon. Biomedclip: a multimodal biomedical foundation model pretrained from f...


Showing first 80 references.

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 9, 19

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Intrinsic dimensionality explains the effective- ness of language model fine-tuning

Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effective- ness of language model fine-tuning. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7319–7328, 2021. 3

work page 2021

[3] [3]

Dataset of breast ultrasound images.Data in brief, 28:104863, 2020

Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy. Dataset of breast ultrasound images.Data in brief, 28:104863, 2020. 6, 16, 18

work page 2020

[4] [4]

Proker: A kernel perspective on few-shot adaptation of large vision-language models

Yassir Bendou, Amine Ouasfi, Vincent Gripon, and Adnane Boukhayma. Proker: A kernel perspective on few-shot adaptation of large vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 25092–25102, 2025. 7, 8

work page 2025

[5] [5]

Xcoop: Explainable prompt learning for computer- aided diagnosis via concept-guided context optimization

Yequan Bie, Luyang Luo, Zhixuan Chen, and Hao Chen. Xcoop: Explainable prompt learning for computer- aided diagnosis via concept-guided context optimization. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 773–783. Springer, 2024. 2, 3, 8

work page 2024

[6] [6]

Making the most of text semantics to improve biomedical vision–language processing

Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. Making the most of text semantics to improve biomedical vision–language processing. InEuropean conference on computer vision, pages 1–21. Springer, 2022. 3

work page 2022

[7] [7]

Borkowski, Marilyn M

Andrew A. Borkowski, Marilyn M. Bui, L. Brannon Thomas, Catherine P. Wilson, Lauren A. DeLand, and Stephen M. Mastorides. Lung and colon cancer histopathological image dataset (lc25000), 2019. 6, 16, 18

work page 2019

[8] [8]

Food-101–mining discriminative components with random forests

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. InECCV, pages 446–461. Springer, 2014. 6, 16

work page 2014

[9] [9]

Domain-controlled prompt learning

Qinglong Cao, Zhengqin Xu, Yuntian Chen, Chao Ma, and Xiaokang Yang. Domain-controlled prompt learning. InProceedings of the AAAI Conference on Artificial Intelligence, pages 936–944, 2024. 3, 8

work page 2024

[10] [10]

Knee osteoarthritis severity grading dataset, 2018

Pingjun Chen. Knee osteoarthritis severity grading dataset, 2018. 6, 16, 18

work page 2018

[11] [11]

gscorecam: What objects is clip looking at? InProceedings of the Asian Conference on Computer Vision, pages 1959–1975, 2022

Peijie Chen, Qi Li, Saad Biaz, Trung Bui, and Anh Nguyen. gscorecam: What objects is clip looking at? InProceedings of the Asian Conference on Computer Vision, pages 1959–1975, 2022. 21

work page 1959

[12] [12]

Adapt- former: Adapting vision transformers for scalable visual recognition.Advances in Neural Information Processing Systems, 35:16664–16678, 2022

Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adapt- former: Adapting vision transformers for scalable visual recognition.Advances in Neural Information Processing Systems, 35:16664–16678, 2022. 3

work page 2022

[13] [13]

Describing textures in the wild

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. InCVPR, pages 3606–3613, 2014. 6, 16

work page 2014

[14] [14]

Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)

Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic).arXiv preprint arXiv:1902.03368, 2019. 16

work page internal anchor Pith review Pith/arXiv arXiv 2018

[15] [15]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, pages 248–255. Ieee, 2009. 6, 16, 18

work page 2009

[16] [16]

Does clip benefit visual question answering in the medical domain as much as it does in the general domain?, 2021

Sedigheh Eslami, Gerard de Melo, and Christoph Meinel. Does clip benefit visual question answering in the medical domain as much as it does in the general domain?, 2021. 3

work page 2021

[17] [17]

Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. InCVPR Workshop, pages 178–178. IEEE, 2004. 6, 16 11

work page 2004

[18] [18]

Interpreting clip’s image representation via text-based decomposition.arXiv preprint arXiv:2310.05916, 2023

Yossi Gandelsman, Alexei A Efros, and Jacob Steinhardt. Interpreting clip’s image representation via text-based decomposition.arXiv preprint arXiv:2310.05916, 2023. 2, 5, 9, 18, 19, 21, 22, 23, 24, 25

work page arXiv 2023

[19] [19]

Clip-adapter: Better vision-language models with feature adapters.International Journal of Computer Vision, 132(2):581–595, 2024

Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters.International Journal of Computer Vision, 132(2):581–595, 2024. 1, 3, 7, 8, 19

work page 2024

[20] [20]

Imagenet auto-annotation with segmentation propagation.International Journal of Computer Vision, 110(3):328–348, 2014

Matthieu Guillaumin, Daniel Küttel, and Vittorio Ferrari. Imagenet auto-annotation with segmentation propagation.International Journal of Computer Vision, 110(3):328–348, 2014. 21

work page 2014

[21] [21]

Parameter-efficient transfer learning with diff pruning

Demi Guo, Alexander M Rush, and Yoon Kim. Parameter-efficient transfer learning with diff pruning. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4884–4896, 2021. 3

work page 2021

[22] [22]

Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.J-STARS, 12(7):2217–2226, 2019

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.J-STARS, 12(7):2217–2226, 2019. 6, 16

work page 2019

[23] [23]

The many faces of robustness: A critical analysis of out-of-distribution generalization

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. InICCV, pages 8340–8349, 2021. 16

work page 2021

[24] [24]

Natural adversarial examples

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. InCVPR, pages 15262–15271, 2021. 16

work page 2021

[25] [25]

Nxmtransformer: semi-structured sparsification for natural language understanding via admm.Advances in neural information processing systems, 34: 1818–1830, 2021

Connor Holmes, Minjia Zhang, Yuxiong He, and Bo Wu. Nxmtransformer: semi-structured sparsification for natural language understanding via admm.Advances in neural information processing systems, 34: 1818–1830, 2021. 3

work page 2021

[26] [26]

Parameter-efficient transfer learning for nlp

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Ges- mundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. InInternational conference on machine learning, pages 2790–2799. PMLR, 2019. 3

work page 2019

[27] [27]

Lora: Low-rank adaptation of large language models

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2021. 3

work page 2021

[28] [28]

Lp++: A surprisingly strong linear probe for few-shot clip

Yunshi Huang, Fereshteh Shakeri, Jose Dolz, Malik Boudiaf, Houda Bahig, and Ismail Ben Ayed. Lp++: A surprisingly strong linear probe for few-shot clip. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23773–23782, 2024. 3, 7, 8

work page 2024

[29] [29]

Vision transformer and explainable transfer learning models for auto detection of kidney cyst, stone and tumor from ct-radiography.Scientific Reports, 12(1):1–14, 2022

Md Nazmul Islam, Mehedi Hasan, Md Kabir Hossain, Md Golam Rabiul Alam, Md Zia Uddin, and Ahmet Soylu. Vision transformer and explainable transfer learning models for auto detection of kidney cyst, stone and tumor from ct-radiography.Scientific Reports, 12(1):1–14, 2022. 6, 16, 18

work page 2022

[30] [30]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR, 2021. 3

work page 2021

[31] [31]

Visual prompt tuning

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. InEuropean Conference on Computer Vision, pages 709–727. Springer, 2022. 3

work page 2022

[32] [32]

Compacter: Efficient low-rank hypercomplex adapter layers

Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient low-rank hypercomplex adapter layers. InAdvances in Neural Information Processing Systems, pages 1022–1035. Curran Associates, Inc., 2021. 3

work page 2021

[33] [33]

Multi-class texture analysis in colorectal cancer histology.Scientific reports, 6(1):1–11, 2016

Jakob Nikolas Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne M Melchers, Lothar R Schad, Timo Gaiser, Alexander Marx, and Frank Gerrit Zöllner. Multi-class texture analysis in colorectal cancer histology.Scientific reports, 6(1):1–11, 2016. 6, 16, 18

work page 2016

[34] [34]

Kermany, Michael Goldbaum, et al

Daniel S. Kermany, Michael Goldbaum, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning.Cell, 172(5):1122 – 1131.e9, 2018. 6, 16, 18

work page 2018

[35] [35]

Maple: Multi-modal prompt learning

Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19113–19122, 2023. 1, 2, 3, 7, 8, 19, 32, 33 12

work page 2023

[36] [36]

Self-regulating prompts: Foundational model adaptation without forgetting

Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15190–15200, 2023. 1, 3

work page 2023

[37] [37]

Medclip-sam: Bridging text and image towards universal medical image segmentation

Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, and Yiming Xiao. Medclip-sam: Bridging text and image towards universal medical image segmentation. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 643–653. Springer, 2024. 3

work page 2024

[38] [38]

Medclip-samv2: Towards universal text-driven medical image segmentation.Medical Image Analysis, page 103749, 2025

Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, and Yiming Xiao. Medclip-samv2: Towards universal text-driven medical image segmentation.Medical Image Analysis, page 103749, 2025. 3

work page 2025

[39] [39]

Biomedcoop: Learning to prompt for biomedical vision-language models

Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, and Yiming Xiao. Biomedcoop: Learning to prompt for biomedical vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14766–14776, 2025. 2, 3, 6, 8, 16, 33

work page 2025

[40] [40]

3d object representations for fine-grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. InICCV, pages 554–561, 2013. 6, 16

work page 2013

[41] [41]

Automatic no-reference quality assessment for retinal fundus images using vessel segmentation, 2013

Thomas Köhler, Attila Budai, Martin Kraus, Jan Odstrcilik, Georg Michelson, and Joachim Hornegger. Automatic no-reference quality assessment for retinal fundus images using vessel segmentation, 2013. 6, 16, 18

work page 2013

[42] [42]

The Power of Scale for Parameter-Efficient Prompt Tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021. 3

work page internal anchor Pith review Pith/arXiv arXiv 2021

[43] [43]

Measuring the intrinsic dimension of objective landscapes

Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. InInternational Conference on Learning Representations, 2018. 3

work page 2018

[44] [44]

Prefix-tuning: Optimizing continuous prompts for generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582– 4597, 2021. 3

work page 2021

[45] [45]

Scaling down to scale up: A guide to parameter- efficient fine-tuning.arXiv preprint arXiv:2303.15647, 2023

Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. Scaling down to scale up: A guide to parameter- efficient fine-tuning.arXiv preprint arXiv:2303.15647, 2023. 2

work page arXiv 2023

[46] [46]

Scaling & shifting your features: A new baseline for efficient model tuning.Advances in Neural Information Processing Systems, 35:109–123,

Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning.Advances in Neural Information Processing Systems, 35:109–123,

work page

[47] [47]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 6

work page internal anchor Pith review Pith/arXiv arXiv 2017

[48] [48]

Fine-Grained Visual Classification of Aircraft

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft.arXiv preprint arXiv:1306.5151, 2013. 6, 16

work page internal anchor Pith review Pith/arXiv arXiv 2013

[49] [49]

Pissa: Principal singular values and singular vectors adaptation of large language models.Advances in Neural Information Processing Systems, 37:121038– 121072, 2024

Fanxu Meng, Zhaohui Wang, and Muhan Zhang. Pissa: Principal singular values and singular vectors adaptation of large language models.Advances in Neural Information Processing Systems, 37:121038– 121072, 2024. 1, 3

work page 2024

[50] [50]

Brain tumor mri dataset, 2021

Msoud Nickparvar. Brain tumor mri dataset, 2021. 6, 16, 18

work page 2021

[51] [51]

Automated flower classification over a large number of classes

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. InICVGIP, pages 722–729. IEEE, 2008. 6, 16

work page 2008

[52] [52]

Cats and dogs

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. InCVPR, pages 3498–3505. IEEE, 2012. 6, 16

work page 2012

[53] [53]

Sam-parser: Fine-tuning sam efficiently by parameter space reconstruction

Zelin Peng, Zhengqin Xu, Zhilin Zeng, Xiaokang Yang, and Wei Shen. Sam-parser: Fine-tuning sam efficiently by parameter space reconstruction. InProceedings of the AAAI Conference on Artificial Intelligence, pages 4515–4523, 2024. 3

work page 2024

[54] [54]

Adapterfusion: Non-destructive task composition for transfer learning

Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. Adapterfusion: Non-destructive task composition for transfer learning. InProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 487–503, 2021. 3 13

work page 2021

[55] [55]

Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection

Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. InProceedings of the 8th ACM on Mu...

work page 2017

[56] [56]

Indian diabetic retinopathy image dataset (idrid), 2018

Prasanna Porwal, Samiksha Pachade, Ravi Kamble, Manesh Kokare, Girish Deshmukh, Vivek Sahasrabud- dhe, and Fabrice Meriaudeau. Indian diabetic retinopathy image dataset (idrid), 2018. 6, 16, 18

work page 2018

[57] [57]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PMLR,

work page

[58] [58]

1, 3, 7, 8, 19, 32, 33

work page

[59] [59]

Groundingdino-us-sam: Text-prompted multi-organ segmentation in ultrasound with lora-tuned vision-language models.arXiv preprint arXiv:2506.23903,

Hamza Rasaee, Taha Koleilat, and Hassan Rivaz. Groundingdino-us-sam: Text-prompted multi-organ segmentation in ultrasound with lora-tuned vision-language models.arXiv preprint arXiv:2506.23903,

work page arXiv

[60] [60]

Learning multiple visual domains with residual adapters.Advances in neural information processing systems, 30, 2017

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters.Advances in neural information processing systems, 30, 2017. 3

work page 2017

[61] [61]

Do imagenet classifiers generalize to imagenet? InICML, pages 5389–5400

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? InICML, pages 5389–5400. PMLR, 2019. 16

work page 2019

[62] [62]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild.arXiv preprint arXiv:1212.0402, 2012. 6, 16

work page internal anchor Pith review Pith/arXiv arXiv 2012

[63] [63]

Textsam-eus: Text prompt learning for sam to accurately segment pancreatic tumor in endoscopic ultrasound.arXiv preprint arXiv:2507.18082, 2025

Pascal Spiegler, Taha Koleilat, Arash Harirpoush, Corey S Miller, Hassan Rivaz, Marta Kersten-Oertel, and Yiming Xiao. Textsam-eus: Text prompt learning for sam to accurately segment pancreatic tumor in endoscopic ultrasound.arXiv preprint arXiv:2507.18082, 2025. 3

work page arXiv 2025

[64] [64]

Singular value fine-tuning: Few-shot segmentation requires few-parameters fine-tuning.Advances in neural information processing systems, 35:37484–37496, 2022

Yanpeng Sun, Qiang Chen, Xiangyu He, Jian Wang, Haocheng Feng, Junyu Han, Errui Ding, Jian Cheng, Zechao Li, and Jingdong Wang. Singular value fine-tuning: Few-shot segmentation requires few-parameters fine-tuning.Advances in neural information processing systems, 35:37484–37496, 2022. 1, 2, 3

work page 2022

[65] [65]

Tahir, Muhammad E.H

Anas M. Tahir, Muhammad E.H. Chowdhury, Amith Khandakar, Tawsifur Rahman, Yazan Qiblawey, Uzair Khurshid, Serkan Kiranyaz, Nabil Ibtehaz, M. Sohel Rahman, Somaya Al-Maadeed, Sakib Mahmud, Maymouna Ezeddin, Khaled Hameed, and Tahir Hamid. Covid-19 infection localization and severity grading from chest x-ray images.Computers in Biology and Medicine, 139:105...

work page 2021

[66] [66]

The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions.Scientific data, page 180161, 2018

Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions.Scientific data, page 180161, 2018. 16

work page 2018

[67] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86):2579–2605, 2008.

[68] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In NeurIPS, 2019.

[69] Zhengbo Wang, Jian Liang, Lijun Sheng, Ran He, Zilei Wang, and Tieniu Tan. A hard-to-beat baseline for training-free clip-based adaptation. arXiv preprint arXiv:2402.04087, 2024.

[70] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, pages 3485–3492. IEEE, 2010.

[71] Yan Xu, Rixiang Quan, Weiting Xu, Yi Huang, Xiaolong Chen, and Fengyuan Liu. Advances in medical image segmentation: A comprehensive review of traditional, deep learning and hybrid approaches. Bioengineering, 11(10):1034, 2024.

[72] Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization, 2023.

[73] Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6757–6767, 2023.

[74] Hantao Yao, Rui Zhang, and Changsheng Xu. Tcp: Textual-based class-aware prompt tuning for visual-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23438–23448, 2024.

[75] Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–9, 2022.

[76] Maxime Zanella and Ismail Ben Ayed. Low-rank few-shot adaptation of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1593–1603,

[77] Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512, 2023.

[78] Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021.

[79] Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.

[80] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Andrea Tupini, Yu Wang, Matt Mazzola, Swadheen Shukla, Lars Liden, Jianfeng Gao, Matthew P. Lungren, Tristan Naumann, Sheng Wang, and Hoifung Poon. Biomedclip: a multimodal biomedical foundation model pretrained from f..., 2024.