
arxiv: 2509.03740 · v3 · submitted 2025-09-03 · 💻 cs.CV · cs.CL

CLIP-SVD: Efficient and Interpretable Vision-Language Adaptation via Singular Values

Pith reviewed 2026-05-18 18:52 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords CLIP adaptation · singular value decomposition · parameter-efficient fine-tuning · few-shot learning · vision-language models · biomedical image classification · model interpretability · SVD-based tuning

The pith

Updating only the singular values of CLIP weight matrices adapts the model to new domains using 0.04% of total parameters while preserving generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that CLIP can be adapted to fine-grained domains by decomposing its weight matrices with SVD and then tuning only the singular values. This approach avoids prompt engineering or added adapter modules that might overwrite pretrained knowledge or destabilize the model. A sympathetic reader would care because full fine-tuning is costly and current methods often trade off stability or performance for adaptation. The method delivers state-of-the-art few-shot classification on 11 natural and 10 biomedical datasets while using far fewer parameters and enabling natural-language analysis of the changes.

Core claim

CLIP-SVD introduces Singular Value Fine-tuning (SVF) that decomposes each pretrained weight matrix via SVD and then optimizes only the singular values to rescale the existing basis vectors for the target domain. The singular vectors remain fixed and no new modules are introduced, so adaptation uses just 0.04% of the model's parameters. This yields higher accuracy and better generalization than prior adaptation techniques on 21 datasets under few-shot conditions and supports interpretability by tracing adaptation dynamics through language queries.
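As a rough sanity check on the 0.04% figure, the number of trainable singular values can be estimated by hand. The sketch below assumes a ViT-B/16 CLIP backbone (12 layers per encoder, widths 768 and 512, roughly 150M parameters in total) and SVF applied only to the Query, Key, Value, and Output projections named in the Figure 1 caption. The backbone, the total, and the exclusion of other matrices and biases are assumptions that this page does not confirm.

```python
# Back-of-envelope count of trainable singular values.
# ASSUMPTIONS (not stated on this page): ViT-B/16 CLIP backbone,
# 12 transformer layers per encoder, widths 768 (vision) and 512 (text),
# and roughly 150M total parameters.

def n_singular_values(rows: int, cols: int) -> int:
    """An m x n matrix has min(m, n) singular values."""
    return min(rows, cols)

def svf_trainable(width: int, layers: int, matrices_per_layer: int = 4) -> int:
    """Trainable scalars if Q, K, V and Output (each width x width) receive SVF."""
    return layers * matrices_per_layer * n_singular_values(width, width)

vision = svf_trainable(width=768, layers=12)
text = svf_trainable(width=512, layers=12)
total_trainable = vision + text

ASSUMED_TOTAL_PARAMS = 150_000_000  # approximate and assumed
fraction = total_trainable / ASSUMED_TOTAL_PARAMS
print(f"{total_trainable} trainable scalars, {100 * fraction:.3f}% of the assumed total")
```

Under these assumptions the trainable set is 61,440 scalars, about 0.04% of 150M. A different backbone or matrix selection would change the figure, so this is a consistency check rather than a reproduction.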

What carries the argument

Singular Value Fine-tuning (SVF): adjusting only the diagonal matrix of singular values obtained from the SVD of each weight matrix, which rescales the pretrained basis vectors without altering their directions or adding parameters.
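The mechanism can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the random matrix stands in for a pretrained weight, the Frobenius-norm target is a toy objective chosen so the gradient has a closed form, and the names `svf_weight`, `s_train`, and `W_target` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))  # stand-in for a pretrained weight matrix

# One-time decomposition: W = U @ diag(s) @ Vt. U and Vt stay frozen afterwards.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

def svf_weight(U, s, Vt):
    """Rebuild the weight from frozen singular vectors and current singular values."""
    return (U * s) @ Vt  # broadcasting scales column i of U by s[i]

# Toy target (assumption): a matrix reachable by rescaling singular values only.
scale_true = np.array([1.5, 0.5, 1.0, 2.0])
W_target = svf_weight(U, s * scale_true, Vt)

s_train = s.copy()  # the only trainable quantity: 4 scalars instead of 24 entries
lr = 0.1
for _ in range(200):
    residual = svf_weight(U, s_train, Vt) - W_target
    # Gradient of 0.5 * ||U diag(s) Vt - W_target||_F^2 w.r.t. s[i] is u_i^T residual v_i
    grad = np.einsum("ji,jk,ik->i", U, residual, Vt)
    s_train -= lr * grad

print(np.allclose(s_train, s * scale_true, atol=1e-6))
```

In an actual fine-tuning run the gradient would come from the task loss through automatic differentiation; the point here is only that `U` and `Vt` are never updated.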

If this is right

  • The adapted model retains more of the original CLIP generalization than methods that insert new components.
  • Natural-language probing becomes a practical tool for inspecting what changes during domain adaptation.
  • The same singular-value mechanism works on both everyday images and biomedical scans without custom redesign.
  • Adaptation becomes feasible on hardware with limited memory or compute because only a tiny parameter subset is updated.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same SVF approach could be tested on other vision-language or vision-only transformers to see whether singular values alone suffice for domain shift in those architectures.
  • In medical imaging pipelines, keeping the original basis vectors fixed might reduce the risk of losing rare but critical features learned during large-scale pretraining.
  • Tracking which singular values change most during adaptation could offer a lightweight way to quantify how much a new domain differs from the pretraining distribution.
  • Combining SVF with a small number of prompt tokens might produce further gains if the paper's claim that singular values capture the bulk of domain knowledge holds.
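The singular-value tracking idea in the third bullet could be prototyped along the following lines. This is a sketch of the editorial extension, not something the paper reports: the function names are hypothetical, the input values are made up for illustration, and all pre-adaptation singular values are assumed to be nonzero.

```python
def relative_shift(before, after):
    """Per-index relative change |after - before| / |before| of singular values."""
    return [abs(a - b) / abs(b) for b, a in zip(before, after)]

def top_changed(before, after, k=3):
    """Indices of the k singular values with the largest relative change."""
    shifts = relative_shift(before, after)
    return sorted(range(len(shifts)), key=lambda i: shifts[i], reverse=True)[:k]

def drift_score(before, after):
    """Mean relative change: one scalar summary per weight matrix."""
    shifts = relative_shift(before, after)
    return sum(shifts) / len(shifts)

# Made-up singular values, for illustration only.
sigma_pre = [10.0, 5.0, 2.0, 1.0]
sigma_post = [10.5, 5.0, 3.0, 0.6]

print(top_changed(sigma_pre, sigma_post, k=2))
print(drift_score(sigma_pre, sigma_post))
```

Averaging `drift_score` over all adapted matrices would give one number per target dataset; whether that number tracks domain distance is an open question.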

Load-bearing premise

Domain-specific knowledge needed for adaptation lives mainly in the scaling factors of the existing basis vectors rather than in their directions or in entirely new features.
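Written out, the premise constrains the form of every weight update. The notation below is an assumption for illustration (the page gives no equations): W is a pretrained m × n weight matrix with thin SVD factors U, σ, V, and σ' denotes the tuned singular values.

```latex
\begin{align}
W &= U \,\mathrm{diag}(\sigma)\, V^{\top} = \sum_{i=1}^{r} \sigma_i\, u_i v_i^{\top}, \\
W' &= U \,\mathrm{diag}(\sigma')\, V^{\top} = \sum_{i=1}^{r} \sigma'_i\, u_i v_i^{\top}, \\
\Delta W &= W' - W = \sum_{i=1}^{r} (\sigma'_i - \sigma_i)\, u_i v_i^{\top}.
\end{align}
```

An unconstrained update of an m × n matrix has mn degrees of freedom; this one has r = min(m, n), all confined to the span of the fixed rank-one matrices u_i v_i^T. The premise is under pressure whenever the update a target domain needs has a large component outside that span.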

What would settle it

A held-out dataset where full fine-tuning or adapter methods produce clearly higher accuracy or better generalization than singular-value-only tuning on the same CLIP backbone.

Figures

Figures reproduced from arXiv: 2509.03740 by Hassan Rivaz, Taha Koleilat, Yiming Xiao.

Figure 1: The overall framework of CLIP-SVD. We decompose the Query, Key, Value, and Output ...
Figure 2: 4-shot performance by freezing certain layers during finetuning ( ...
Original abstract

Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a multi-modal and parameter-efficient adaptation framework that applies Singular Value Fine-tuning (SVF) to CLIP, leveraging Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model's total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. Overall, this work provides the first extensive empirical evaluation of SVD-based finetuning in the vision-language model setting. The code and biomedical corpus are publicly available at https://github.com/HealthX-Lab/CLIP-SVD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces CLIP-SVD, a parameter-efficient adaptation method for vision-language models like CLIP. It applies Singular Value Fine-tuning (SVF) by decomposing pretrained weight matrices via SVD and updating only the singular values to rescale the basis vectors for domain adaptation while keeping singular vectors fixed and adding no new modules. The approach is reported to use 0.04% of total parameters, achieve state-of-the-art few-shot classification accuracy on 11 natural and 10 biomedical datasets, improve generalization over prior prompt- and adapter-based methods, and provide interpretability through natural-language analysis of adaptation dynamics. Code and a biomedical corpus are released publicly.

Significance. If the empirical results hold under more rigorous validation, the work would establish SVD-based fine-tuning as a lightweight, module-free alternative for adapting large VLMs that preserves the pretrained singular-vector basis. The public code release and biomedical corpus constitute reproducible assets that could support follow-up studies on efficient adaptation and interpretability in vision-language settings.

major comments (3)
  1. [Abstract] Abstract: the SOTA claim on 21 datasets provides no error bars, standard deviations, or explicit details on the number of shots and train/test splits employed in the few-shot protocol. These omissions prevent assessment of whether the reported gains over baselines are statistically reliable or sensitive to experimental choices.
  2. [SVF design paragraph] SVF design (Abstract and method description): the central assertion that rescaling only the singular values while retaining pretrained singular vectors suffices for domain adaptation (natural or biomedical) is not supported by a controlled ablation that perturbs singular vectors at matched parameter budget. Without this comparison, it remains possible that performance differences arise from training hyperparameters, implicit regularization, or dataset selection rather than the SVF mechanism itself.
  3. [Method] Method section on matrix selection: no ablation or justification is given for the specific choice of which CLIP weight matrices receive SVF updates. This choice directly affects both the 0.04% parameter count and the adaptation quality, yet its impact is not quantified.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'multi-modal and parameter-efficient adaptation framework' could be clarified to specify how the language and vision branches are jointly handled during SVF, as the description focuses primarily on weight-matrix updates.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate the revisions we will make to improve the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the SOTA claim on 21 datasets provides no error bars, standard deviations, or explicit details on the number of shots and train/test splits employed in the few-shot protocol. These omissions prevent assessment of whether the reported gains over baselines are statistically reliable or sensitive to experimental choices.

    Authors: We agree that reporting variability and protocol details is essential for assessing reliability. In the revised manuscript we will add mean accuracies with standard deviations over three random seeds for all 21 datasets. We will also explicitly state the few-shot protocol (shots per class, train/test split ratios, and sampling procedure) in both the abstract and experimental section so readers can evaluate statistical robustness and sensitivity to choices. revision: yes

  2. Referee: [SVF design paragraph] SVF design (Abstract and method description): the central assertion that rescaling only the singular values while retaining pretrained singular vectors suffices for domain adaptation (natural or biomedical) is not supported by a controlled ablation that perturbs singular vectors at matched parameter budget. Without this comparison, it remains possible that performance differences arise from training hyperparameters, implicit regularization, or dataset selection rather than the SVF mechanism itself.

    Authors: We appreciate the request for a controlled ablation. The design intentionally keeps singular vectors fixed to preserve the pretrained basis directions while only rescaling magnitudes; this is the core hypothesis. A matched-budget ablation that perturbs vectors would require a fundamentally different update rule and additional experiments outside the current scope. We will add a paragraph in the method section providing theoretical motivation for preserving the vectors and will note that future work could explore vector perturbation under equivalent budgets. Existing comparisons to prompt- and adapter-based methods already isolate the benefit of the SVF mechanism under the same training protocol. revision: partial

  3. Referee: [Method] Method section on matrix selection: no ablation or justification is given for the specific choice of which CLIP weight matrices receive SVF updates. This choice directly affects both the 0.04% parameter count and the adaptation quality, yet its impact is not quantified.

    Authors: We agree that explicit justification and quantification are needed. In the revised method section we will explain the selection of weight matrices in the attention and MLP blocks of both vision and text encoders, as these layers dominate parameter count and feature transformation. We will also include a small ablation table comparing SVF applied to different matrix subsets, reporting resulting parameter counts and accuracy on a representative subset of datasets to quantify the trade-off. revision: yes
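The reporting promised in the first response (mean accuracy with standard deviation over three seeds) amounts to very little code. The sketch below uses only the Python standard library; the function name and the accuracy values are placeholders invented for illustration and are not results from the paper.

```python
from statistics import mean, stdev

def summarize(accuracies):
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    return mean(accuracies), stdev(accuracies)

# PLACEHOLDER per-seed accuracies (percent), not taken from the paper.
per_seed = {
    "dataset_a": [70.0, 71.0, 72.0],
    "dataset_b": [55.0, 55.0, 58.0],
}

for name, accs in per_seed.items():
    m, s = summarize(accs)
    print(f"{name}: {m:.2f} ± {s:.2f} (n={len(accs)} seeds)")
```

With only three seeds the sample standard deviation is itself a noisy estimate, so a gap between two methods that is smaller than it should be read cautiously.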

Circularity Check

0 steps flagged

Empirical adaptation method validated on external datasets with no circular reduction

full rationale

The paper introduces CLIP-SVD as a practical parameter-efficient fine-tuning approach that updates only the singular values of pretrained CLIP weight matrices while keeping singular vectors fixed. All reported results consist of accuracy and generalization metrics on held-out natural and biomedical classification datasets, compared against prior adaptation baselines. No equations, derivations, or first-principles claims appear in the provided text that would make the performance numbers equivalent to the method's own inputs or fitted hyperparameters by construction. The design choice of SVF is presented as an explicit modeling decision rather than a derived necessity, and success is assessed through standard empirical protocols independent of any self-referential loop.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the empirical observation that singular values alone can carry domain adaptation signal; no new mathematical axioms or invented physical entities are introduced.

free parameters (1)
  • learning rate and number of SVF iterations
    Standard optimizer hyperparameters that must be chosen or tuned for each dataset; their values are not reported in the abstract.
axioms (1)
  • [domain assumption] Pretrained CLIP weight matrices admit a stable SVD that can be recomputed and updated without numerical instability
    Invoked when the method decomposes and then modifies only the singular values.
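The stability assumption can be probed empirically with a round-trip check. The sketch below is illustrative only: it uses a random float32 matrix as a stand-in for a CLIP weight, and the 768 × 768 shape is an assumption about the layer size.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random float32 stand-in for a pretrained weight matrix (shape is an assumption).
W = rng.standard_normal((768, 768)).astype(np.float32)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_rebuilt = (U * s) @ Vt

# Relative Frobenius reconstruction error of the SVD round trip.
rel_err = np.linalg.norm(W - W_rebuilt) / np.linalg.norm(W)
# Departure of U from orthonormality, measured entrywise.
ortho_err = np.abs(U.T @ U - np.eye(U.shape[1], dtype=np.float32)).max()

print(f"relative reconstruction error: {rel_err:.2e}")
print(f"max orthonormality deviation:  {ortho_err:.2e}")
```

Running the same check on the actual pretrained matrices, in the precision used for training, would show whether the axiom holds for the model in question.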

pith-pipeline@v0.9.0 · 5818 in / 1350 out tokens · 36410 ms · 2026-05-18T18:52:04.759876+00:00 · methodology



Reference graph

Works this paper leans on

89 extracted references · 89 canonical work pages · 8 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 9, 19

  2. [2]

    Intrinsic dimensionality explains the effective- ness of language model fine-tuning

    Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effective- ness of language model fine-tuning. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7319–7328, 2021. 3

  3. [3]

    Dataset of breast ultrasound images.Data in brief, 28:104863, 2020

    Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy. Dataset of breast ultrasound images.Data in brief, 28:104863, 2020. 6, 16, 18

  4. [4]

    Proker: A kernel perspective on few-shot adaptation of large vision-language models

    Yassir Bendou, Amine Ouasfi, Vincent Gripon, and Adnane Boukhayma. Proker: A kernel perspective on few-shot adaptation of large vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 25092–25102, 2025. 7, 8

  5. [5]

    Xcoop: Explainable prompt learning for computer- aided diagnosis via concept-guided context optimization

    Yequan Bie, Luyang Luo, Zhixuan Chen, and Hao Chen. Xcoop: Explainable prompt learning for computer- aided diagnosis via concept-guided context optimization. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 773–783. Springer, 2024. 2, 3, 8

  6. [6]

    Making the most of text semantics to improve biomedical vision–language processing

    Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. Making the most of text semantics to improve biomedical vision–language processing. InEuropean conference on computer vision, pages 1–21. Springer, 2022. 3

  7. [7]

    Borkowski, Marilyn M

    Andrew A. Borkowski, Marilyn M. Bui, L. Brannon Thomas, Catherine P. Wilson, Lauren A. DeLand, and Stephen M. Mastorides. Lung and colon cancer histopathological image dataset (lc25000), 2019. 6, 16, 18

  8. [8]

    Food-101–mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. InECCV, pages 446–461. Springer, 2014. 6, 16

  9. [9]

    Domain-controlled prompt learning

    Qinglong Cao, Zhengqin Xu, Yuntian Chen, Chao Ma, and Xiaokang Yang. Domain-controlled prompt learning. InProceedings of the AAAI Conference on Artificial Intelligence, pages 936–944, 2024. 3, 8

  10. [10]

    Knee osteoarthritis severity grading dataset, 2018

    Pingjun Chen. Knee osteoarthritis severity grading dataset, 2018. 6, 16, 18

  11. [11]

    gscorecam: What objects is clip looking at? InProceedings of the Asian Conference on Computer Vision, pages 1959–1975, 2022

    Peijie Chen, Qi Li, Saad Biaz, Trung Bui, and Anh Nguyen. gscorecam: What objects is clip looking at? InProceedings of the Asian Conference on Computer Vision, pages 1959–1975, 2022. 21

  12. [12]

    Adapt- former: Adapting vision transformers for scalable visual recognition.Advances in Neural Information Processing Systems, 35:16664–16678, 2022

    Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adapt- former: Adapting vision transformers for scalable visual recognition.Advances in Neural Information Processing Systems, 35:16664–16678, 2022. 3

  13. [13]

    Describing textures in the wild

    Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. InCVPR, pages 3606–3613, 2014. 6, 16

  14. [14]

    Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)

    Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic).arXiv preprint arXiv:1902.03368, 2019. 16

  15. [15]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, pages 248–255. Ieee, 2009. 6, 16, 18

  16. [16]

    Does clip benefit visual question answering in the medical domain as much as it does in the general domain?, 2021

    Sedigheh Eslami, Gerard de Melo, and Christoph Meinel. Does clip benefit visual question answering in the medical domain as much as it does in the general domain?, 2021. 3

  17. [17]

    Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

    Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. InCVPR Workshop, pages 178–178. IEEE, 2004. 6, 16 11

  18. [18]

    Interpreting clip’s image representation via text-based decomposition.arXiv preprint arXiv:2310.05916, 2023

    Yossi Gandelsman, Alexei A Efros, and Jacob Steinhardt. Interpreting clip’s image representation via text-based decomposition.arXiv preprint arXiv:2310.05916, 2023. 2, 5, 9, 18, 19, 21, 22, 23, 24, 25

  19. [19]

    Clip-adapter: Better vision-language models with feature adapters.International Journal of Computer Vision, 132(2):581–595, 2024

    Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters.International Journal of Computer Vision, 132(2):581–595, 2024. 1, 3, 7, 8, 19

  20. [20]

    Imagenet auto-annotation with segmentation propagation.International Journal of Computer Vision, 110(3):328–348, 2014

    Matthieu Guillaumin, Daniel Küttel, and Vittorio Ferrari. Imagenet auto-annotation with segmentation propagation.International Journal of Computer Vision, 110(3):328–348, 2014. 21

  21. [21]

    Parameter-efficient transfer learning with diff pruning

    Demi Guo, Alexander M Rush, and Yoon Kim. Parameter-efficient transfer learning with diff pruning. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4884–4896, 2021. 3

  22. [22]

    Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.J-STARS, 12(7):2217–2226, 2019

    Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.J-STARS, 12(7):2217–2226, 2019. 6, 16

  23. [23]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. InICCV, pages 8340–8349, 2021. 16

  24. [24]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. InCVPR, pages 15262–15271, 2021. 16

  25. [25]

    Nxmtransformer: semi-structured sparsification for natural language understanding via admm.Advances in neural information processing systems, 34: 1818–1830, 2021

    Connor Holmes, Minjia Zhang, Yuxiong He, and Bo Wu. Nxmtransformer: semi-structured sparsification for natural language understanding via admm.Advances in neural information processing systems, 34: 1818–1830, 2021. 3

  26. [26]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Ges- mundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. InInternational conference on machine learning, pages 2790–2799. PMLR, 2019. 3

  27. [27]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2021. 3

  28. [28]

    Lp++: A surprisingly strong linear probe for few-shot clip

    Yunshi Huang, Fereshteh Shakeri, Jose Dolz, Malik Boudiaf, Houda Bahig, and Ismail Ben Ayed. Lp++: A surprisingly strong linear probe for few-shot clip. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23773–23782, 2024. 3, 7, 8

  29. [29]

    Vision transformer and explainable transfer learning models for auto detection of kidney cyst, stone and tumor from ct-radiography.Scientific Reports, 12(1):1–14, 2022

    Md Nazmul Islam, Mehedi Hasan, Md Kabir Hossain, Md Golam Rabiul Alam, Md Zia Uddin, and Ahmet Soylu. Vision transformer and explainable transfer learning models for auto detection of kidney cyst, stone and tumor from ct-radiography.Scientific Reports, 12(1):1–14, 2022. 6, 16, 18

  30. [30]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR, 2021. 3

  31. [31]

    Visual prompt tuning

    Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. InEuropean Conference on Computer Vision, pages 709–727. Springer, 2022. 3

  32. [32]

    Compacter: Efficient low-rank hypercomplex adapter layers

    Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient low-rank hypercomplex adapter layers. InAdvances in Neural Information Processing Systems, pages 1022–1035. Curran Associates, Inc., 2021. 3

  33. [33]

    Multi-class texture analysis in colorectal cancer histology.Scientific reports, 6(1):1–11, 2016

    Jakob Nikolas Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne M Melchers, Lothar R Schad, Timo Gaiser, Alexander Marx, and Frank Gerrit Zöllner. Multi-class texture analysis in colorectal cancer histology.Scientific reports, 6(1):1–11, 2016. 6, 16, 18

  34. [34]

    Kermany, Michael Goldbaum, et al

    Daniel S. Kermany, Michael Goldbaum, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning.Cell, 172(5):1122 – 1131.e9, 2018. 6, 16, 18

  35. [35]

    Maple: Multi-modal prompt learning

    Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19113–19122, 2023. 1, 2, 3, 7, 8, 19, 32, 33 12

  36. [36]

    Self-regulating prompts: Foundational model adaptation without forgetting

    Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15190–15200, 2023. 1, 3

  37. [37]

    Medclip-sam: Bridging text and image towards universal medical image segmentation

    Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, and Yiming Xiao. Medclip-sam: Bridging text and image towards universal medical image segmentation. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 643–653. Springer, 2024. 3

  38. [38]

    Medclip-samv2: Towards universal text-driven medical image segmentation.Medical Image Analysis, page 103749, 2025

    Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, and Yiming Xiao. Medclip-samv2: Towards universal text-driven medical image segmentation.Medical Image Analysis, page 103749, 2025. 3

  39. [39]

    Biomedcoop: Learning to prompt for biomedical vision-language models

    Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, and Yiming Xiao. Biomedcoop: Learning to prompt for biomedical vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14766–14776, 2025. 2, 3, 6, 8, 16, 33

  40. [40]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. InICCV, pages 554–561, 2013. 6, 16

  41. [41]

    Automatic no-reference quality assessment for retinal fundus images using vessel segmentation, 2013

    Thomas Köhler, Attila Budai, Martin Kraus, Jan Odstrcilik, Georg Michelson, and Joachim Hornegger. Automatic no-reference quality assessment for retinal fundus images using vessel segmentation, 2013. 6, 16, 18

  42. [42]

    The Power of Scale for Parameter-Efficient Prompt Tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021. 3

  43. [43]

    Measuring the intrinsic dimension of objective landscapes

    Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. InInternational Conference on Learning Representations, 2018. 3

  44. [44]

    Prefix-tuning: Optimizing continuous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582– 4597, 2021. 3

  45. [45]

    Scaling down to scale up: A guide to parameter- efficient fine-tuning.arXiv preprint arXiv:2303.15647, 2023

    Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. Scaling down to scale up: A guide to parameter- efficient fine-tuning.arXiv preprint arXiv:2303.15647, 2023. 2

  46. [46]

    Scaling & shifting your features: A new baseline for efficient model tuning.Advances in Neural Information Processing Systems, 35:109–123,

    Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning.Advances in Neural Information Processing Systems, 35:109–123,

  47. [47]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 6

  48. [48]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft.arXiv preprint arXiv:1306.5151, 2013. 6, 16

  49. [49]

    Pissa: Principal singular values and singular vectors adaptation of large language models.Advances in Neural Information Processing Systems, 37:121038– 121072, 2024

    Fanxu Meng, Zhaohui Wang, and Muhan Zhang. Pissa: Principal singular values and singular vectors adaptation of large language models.Advances in Neural Information Processing Systems, 37:121038– 121072, 2024. 1, 3

  50. [50]

    Brain tumor mri dataset, 2021

    Msoud Nickparvar. Brain tumor mri dataset, 2021. 6, 16, 18

  51. [51]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. InICVGIP, pages 722–729. IEEE, 2008. 6, 16

  52. [52]

    Cats and dogs

    Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. InCVPR, pages 3498–3505. IEEE, 2012. 6, 16

  53. [53]

    Sam-parser: Fine-tuning sam efficiently by parameter space reconstruction

    Zelin Peng, Zhengqin Xu, Zhilin Zeng, Xiaokang Yang, and Wei Shen. Sam-parser: Fine-tuning sam efficiently by parameter space reconstruction. InProceedings of the AAAI Conference on Artificial Intelligence, pages 4515–4523, 2024. 3

  54. [54]

    Adapterfusion: Non-destructive task composition for transfer learning

    Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. Adapterfusion: Non-destructive task composition for transfer learning. InProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 487–503, 2021. 3 13

  55. [55]

    Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection

    Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. InProceedings of the 8th ACM on Mu...

  56. [56]

    Indian diabetic retinopathy image dataset (idrid), 2018

    Prasanna Porwal, Samiksha Pachade, Ravi Kamble, Manesh Kokare, Girish Deshmukh, Vivek Sahasrabuddhe, and Fabrice Meriaudeau. Indian diabetic retinopathy image dataset (idrid), 2018. 6, 16, 18

  57. [57]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 1, 3, 7, 8, 19, 32, 33

  59. [59]

    Groundingdino-us-sam: Text-prompted multi-organ segmentation in ultrasound with lora-tuned vision-language models

    Hamza Rasaee, Taha Koleilat, and Hassan Rivaz. Groundingdino-us-sam: Text-prompted multi-organ segmentation in ultrasound with lora-tuned vision-language models. arXiv preprint arXiv:2506.23903, 2025.

  60. [60]

    Learning multiple visual domains with residual adapters

    Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. Advances in neural information processing systems, 30, 2017. 3

  61. [61]

    Do imagenet classifiers generalize to imagenet?

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In ICML, pages 5389–5400. PMLR, 2019. 16

  62. [62]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild.arXiv preprint arXiv:1212.0402, 2012. 6, 16

  63. [63]

    Textsam-eus: Text prompt learning for sam to accurately segment pancreatic tumor in endoscopic ultrasound

    Pascal Spiegler, Taha Koleilat, Arash Harirpoush, Corey S Miller, Hassan Rivaz, Marta Kersten-Oertel, and Yiming Xiao. Textsam-eus: Text prompt learning for sam to accurately segment pancreatic tumor in endoscopic ultrasound. arXiv preprint arXiv:2507.18082, 2025. 3

  64. [64]

    Singular value fine-tuning: Few-shot segmentation requires few-parameters fine-tuning

    Yanpeng Sun, Qiang Chen, Xiangyu He, Jian Wang, Haocheng Feng, Junyu Han, Errui Ding, Jian Cheng, Zechao Li, and Jingdong Wang. Singular value fine-tuning: Few-shot segmentation requires few-parameters fine-tuning. Advances in neural information processing systems, 35:37484–37496, 2022. 1, 2, 3

  65. [65]

    Covid-19 infection localization and severity grading from chest x-ray images

    Anas M. Tahir, Muhammad E.H. Chowdhury, Amith Khandakar, Tawsifur Rahman, Yazan Qiblawey, Uzair Khurshid, Serkan Kiranyaz, Nabil Ibtehaz, M. Sohel Rahman, Somaya Al-Maadeed, Sakib Mahmud, Maymouna Ezeddin, Khaled Hameed, and Tahir Hamid. Covid-19 infection localization and severity grading from chest x-ray images. Computers in Biology and Medicine, 139:105...

  66. [66]

    The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions

    Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data, page 180161, 2018. 16

  67. [67]

    Visualizing data using t-sne

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86):2579–2605, 2008. 19

  68. [68]

    Learning robust global representations by penalizing local predictive power

    Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In NeurIPS, 2019. 16

  69. [69]

    A hard-to-beat baseline for training-free clip-based adaptation

    Zhengbo Wang, Jian Liang, Lijun Sheng, Ran He, Zilei Wang, and Tieniu Tan. A hard-to-beat baseline for training-free clip-based adaptation. arXiv preprint arXiv:2402.04087, 2024. 7, 8

  70. [70]

    Sun database: Large-scale scene recognition from abbey to zoo

    Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, pages 3485–3492. IEEE, 2010. 6, 16

  71. [71]

    Advances in medical image segmentation: A comprehensive review of traditional, deep learning and hybrid approaches

    Yan Xu, Rixiang Quan, Weiting Xu, Yi Huang, Xiaolong Chen, and Fengyuan Liu. Advances in medical image segmentation: A comprehensive review of traditional, deep learning and hybrid approaches. Bioengineering, 11(10):1034, 2024. 3

  72. [72]

    Visual-language prompt tuning with knowledge-guided context optimization, 2023

    Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization, 2023. 7, 19, 32, 33

  73. [73]

    Visual-language prompt tuning with knowledge-guided context optimization

    Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6757–6767, 2023. 8

  74. [74]

    Tcp: Textual-based class-aware prompt tuning for visual-language model

    Hantao Yao, Rui Zhang, and Changsheng Xu. Tcp: Textual-based class-aware prompt tuning for visual-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23438–23448, 2024. 7

  75. [75]

    Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models

    Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–9, 2022. 2

  76. [76]

    Low-rank few-shot adaptation of vision-language models

    Maxime Zanella and Ismail Ben Ayed. Low-rank few-shot adaptation of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1593–1603, 2024.

  77. [77]

    AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

    Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adalora: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512, 2023. 7

  78. [78]

    Tip-adapter: Training-free clip-adapter for better vision-language modeling

    Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021. 3, 7, 8, 19

  79. [79]

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023. 3

  80. [80]

    Biomedclip: a multimodal biomedical foundation model pretrained from f...

    Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Andrea Tupini, Yu Wang, Matt Mazzola, Swadheen Shukla, Lars Liden, Jianfeng Gao, Matthew P. Lungren, Tristan Naumann, Sheng Wang, and Hoifung Poon. Biomedclip: a multimodal biomedical foundation model pretrained from f...
