CLIP-SVD: Efficient and Interpretable Vision-Language Adaptation via Singular Values
Pith reviewed 2026-05-18 18:52 UTC · model grok-4.3
The pith
Updating only the singular values of CLIP weight matrices adapts the model to new domains using 0.04% of total parameters while preserving generalization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CLIP-SVD introduces Singular Value Fine-tuning (SVF) that decomposes each pretrained weight matrix via SVD and then optimizes only the singular values to rescale the existing basis vectors for the target domain. The singular vectors remain fixed and no new modules are introduced, so adaptation uses just 0.04% of the model's parameters. This yields higher accuracy and better generalization than prior adaptation techniques on 21 datasets under few-shot conditions and supports interpretability by tracing adaptation dynamics through language queries.
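The operation described above can be written in standard SVD notation. The symbols below ($W$, $U$, $\Sigma$, $V$, $\sigma_i$, $\tilde{\sigma}_i$, $r$) are notation chosen here for illustration. The paper's own parameterization of the trainable values may differ, for example by learning a scale or an offset rather than the values directly.

```latex
W = U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top},
\qquad W \in \mathbb{R}^{m \times n}, \quad r = \min(m, n)

\tilde{W} = U \, \mathrm{diag}(\tilde{\sigma}_1, \dots, \tilde{\sigma}_r) \, V^{\top}
          = \sum_{i=1}^{r} \tilde{\sigma}_i \, u_i v_i^{\top}
```

Only the $r$ values $\tilde{\sigma}_i$ are trained per matrix, against $mn$ entries under full fine-tuning. By construction $U^{\top}(\tilde{W} - W)V = \mathrm{diag}(\tilde{\sigma} - \sigma)$, which is the sense in which the basis directions stay fixed.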
What carries the argument
Singular Value Fine-tuning (SVF): after SVD, only the diagonal singular values are adjusted, which rescales the pretrained basis vectors without altering their directions or adding parameters.
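A minimal NumPy sketch of this operation on a single toy matrix, assuming a least-squares objective and plain gradient descent. Both are chosen here for illustration; the paper's loss, optimizer, and exact parameterization are not given in this summary. The matrix shape, the toy data `X` and `Y`, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one pretrained weight matrix (hypothetical shape).
W = rng.standard_normal((6, 4))

# Thin SVD: W = U @ diag(s) @ Vt. U and Vt are frozen from here on.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Toy target: the same basis with rescaled singular values, so a
# singular-value-only update can reach it exactly.
s_target = s * np.array([1.5, 0.5, 1.2, 0.8])
X = rng.standard_normal((4, 64))          # 64 toy inputs
Y = (U * s_target) @ Vt @ X               # toy targets


def loss_and_grad(s_train):
    """Mean squared error and its gradient w.r.t. the singular values only."""
    W_adapted = (U * s_train) @ Vt        # equals U @ diag(s_train) @ Vt
    residual = W_adapted @ X - Y
    n = X.shape[1]
    loss = 0.5 * np.sum(residual ** 2) / n
    grad_W = residual @ X.T / n           # gradient w.r.t. W_adapted
    # dL/ds_i = u_i^T grad_W v_i, i.e. the diagonal of U^T grad_W V.
    grad_s = np.sum((U.T @ grad_W) * Vt, axis=1)
    return loss, grad_s


s_train = s.copy()                        # the only trainable parameters
initial_loss, _ = loss_and_grad(s_train)
for _ in range(300):
    _, g = loss_and_grad(s_train)
    s_train -= 0.5 * g                    # plain gradient descent step
final_loss, _ = loss_and_grad(s_train)

trainable = s_train.size                  # min(m, n) singular values
total = W.size                            # m * n matrix entries
```

After the loop, `s_train` matches `s_target` while `U` and `Vt` are untouched. Four values were trained for a matrix with 24 entries.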
If this is right
- The adapted model retains more of the original CLIP generalization than methods that insert new components.
- Natural-language probing becomes a practical tool for inspecting what changes during domain adaptation.
- The same singular-value mechanism works on both everyday images and biomedical scans without custom redesign.
- Adaptation becomes feasible on hardware with limited memory or compute because only a tiny parameter subset is updated.
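The small trainable footprint follows from counting: a matrix of shape (m, n) has min(m, n) singular values but m * n entries. The sketch below works through that arithmetic for one hypothetical transformer block. The layer shapes and function names are illustrative assumptions, not taken from the paper, and the result does not attempt to reproduce the 0.04% figure, which is measured against the whole model and depends on which matrices are selected.

```python
def svf_trainable_count(shapes):
    """Number of singular values across 2-D weight matrices of the given shapes."""
    return sum(min(rows, cols) for rows, cols in shapes)


def full_count(shapes):
    """Number of entries in the same matrices (what full fine-tuning updates)."""
    return sum(rows * cols for rows, cols in shapes)


# Hypothetical block of width 512 with a 4x MLP expansion:
# four attention projections plus two MLP matrices.
block_shapes = [(512, 512)] * 4 + [(2048, 512), (512, 2048)]

svf = svf_trainable_count(block_shapes)
full = full_count(block_shapes)
fraction_percent = 100.0 * svf / full

print(f"singular values: {svf}, matrix entries: {full}, share: {fraction_percent:.3f}%")
```

For these shapes the block has 3,072 singular values against 3,145,728 matrix entries, roughly 0.098% of the entries in those matrices.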
Where Pith is reading between the lines
- The same SVF approach could be tested on other vision-language or vision-only transformers to see whether singular values alone suffice for domain shift in those architectures.
- In medical imaging pipelines, keeping the original basis vectors fixed might reduce the risk of losing rare but critical features learned during large-scale pretraining.
- Tracking which singular values change most during adaptation could offer a lightweight way to quantify how much a new domain differs from the pretraining distribution.
- Combining SVF with a small number of prompt tokens might produce further gains if the paper's claim that singular values capture the bulk of domain knowledge holds.
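The idea of tracking which singular values change most can be sketched in a few lines. The function names, the relative-change measure, and the drift score below are assumptions made here for illustration. The paper does not define such a metric in this summary, and the input values are invented.

```python
def rank_singular_value_changes(before, after, top_k=3):
    """Return (index, relative_change) pairs sorted by largest |relative change|."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    changes = []
    for i, (b, a) in enumerate(zip(before, after)):
        # Guard against division by zero for vanishing singular values.
        rel = (a - b) / b if b != 0 else float("inf")
        changes.append((i, rel))
    changes.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return changes[:top_k]


def drift_score(before, after):
    """Relative L2 distance between two singular-value vectors."""
    num = sum((a - b) ** 2 for b, a in zip(before, after)) ** 0.5
    den = sum(b ** 2 for b in before) ** 0.5
    return num / den


# Invented singular values for one matrix before and after adaptation.
before = [4.0, 2.0, 1.0, 0.5]
after = [4.0, 3.0, 0.75, 0.5]

top = rank_singular_value_changes(before, after, top_k=2)
score = drift_score(before, after)
```

Here `top` lists index 1 (grown by 50%) followed by index 2 (shrunk by 25%). Comparing `score` across target domains would be one possible proxy for domain distance, if the speculation above holds.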
Load-bearing premise
Domain-specific knowledge needed for adaptation lives mainly in the scaling factors of the existing basis vectors rather than in their directions or in entirely new features.
What would settle it
A held-out dataset where full fine-tuning or adapter methods produce clearly higher accuracy or better generalization than singular-value-only tuning on the same CLIP backbone.
Original abstract
Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a multi-modal and parameter-efficient adaptation framework that applies Singular Value Fine-tuning (SVF) to CLIP, leveraging Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model's total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. Overall, this work provides the first extensive empirical evaluation of SVD-based finetuning in the vision-language model setting. The code and biomedical corpus are publicly available at https://github.com/HealthX-Lab/CLIP-SVD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CLIP-SVD, a parameter-efficient adaptation method for vision-language models like CLIP. It applies Singular Value Fine-tuning (SVF) by decomposing pretrained weight matrices via SVD and updating only the singular values to rescale the basis vectors for domain adaptation while keeping singular vectors fixed and adding no new modules. The approach is reported to use 0.04% of total parameters, achieve state-of-the-art few-shot classification accuracy on 11 natural and 10 biomedical datasets, improve generalization over prior prompt- and adapter-based methods, and provide interpretability through natural-language analysis of adaptation dynamics. Code and a biomedical corpus are released publicly.
Significance. If the empirical results hold under more rigorous validation, the work would establish SVD-based fine-tuning as a lightweight, module-free alternative for adapting large VLMs that preserves the pretrained singular-vector basis. The public code release and biomedical corpus constitute reproducible assets that could support follow-up studies on efficient adaptation and interpretability in vision-language settings.
major comments (3)
- [Abstract] Abstract: the SOTA claim on 21 datasets provides no error bars, standard deviations, or explicit details on the number of shots and train/test splits employed in the few-shot protocol. These omissions prevent assessment of whether the reported gains over baselines are statistically reliable or sensitive to experimental choices.
- [SVF design paragraph] SVF design (Abstract and method description): the central assertion that rescaling only the singular values while retaining pretrained singular vectors suffices for domain adaptation (natural or biomedical) is not supported by a controlled ablation that perturbs singular vectors at matched parameter budget. Without this comparison, it remains possible that performance differences arise from training hyperparameters, implicit regularization, or dataset selection rather than the SVF mechanism itself.
- [Method] Method section on matrix selection: no ablation or justification is given for the specific choice of which CLIP weight matrices receive SVF updates. This choice directly affects both the 0.04% parameter count and the adaptation quality, yet its impact is not quantified.
minor comments (1)
- [Abstract] Abstract: the phrase 'multi-modal and parameter-efficient adaptation framework' could be clarified to specify how the language and vision branches are jointly handled during SVF, as the description focuses primarily on weight-matrix updates.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and indicate the revisions we will make to improve the manuscript.
point-by-point responses
- Referee: [Abstract] Abstract: the SOTA claim on 21 datasets provides no error bars, standard deviations, or explicit details on the number of shots and train/test splits employed in the few-shot protocol. These omissions prevent assessment of whether the reported gains over baselines are statistically reliable or sensitive to experimental choices.
  Authors: We agree that reporting variability and protocol details is essential for assessing reliability. In the revised manuscript we will add mean accuracies with standard deviations over three random seeds for all 21 datasets. We will also explicitly state the few-shot protocol (shots per class, train/test split ratios, and sampling procedure) in both the abstract and experimental section so readers can evaluate statistical robustness and sensitivity to choices.
  Revision: yes
- Referee: [SVF design paragraph] SVF design (Abstract and method description): the central assertion that rescaling only the singular values while retaining pretrained singular vectors suffices for domain adaptation (natural or biomedical) is not supported by a controlled ablation that perturbs singular vectors at matched parameter budget. Without this comparison, it remains possible that performance differences arise from training hyperparameters, implicit regularization, or dataset selection rather than the SVF mechanism itself.
  Authors: We appreciate the request for a controlled ablation. The design intentionally keeps singular vectors fixed to preserve the pretrained basis directions while only rescaling magnitudes; this is the core hypothesis. A matched-budget ablation that perturbs vectors would require a fundamentally different update rule and additional experiments outside the current scope. We will add a paragraph in the method section providing theoretical motivation for preserving the vectors and will note that future work could explore vector perturbation under equivalent budgets. Existing comparisons to prompt- and adapter-based methods already isolate the benefit of the SVF mechanism under the same training protocol.
  Revision: partial
- Referee: [Method] Method section on matrix selection: no ablation or justification is given for the specific choice of which CLIP weight matrices receive SVF updates. This choice directly affects both the 0.04% parameter count and the adaptation quality, yet its impact is not quantified.
  Authors: We agree that explicit justification and quantification are needed. In the revised method section we will explain the selection of weight matrices in the attention and MLP blocks of both vision and text encoders, as these layers dominate parameter count and feature transformation. We will also include a small ablation table comparing SVF applied to different matrix subsets, reporting resulting parameter counts and accuracy on a representative subset of datasets to quantify the trade-off.
  Revision: yes
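The reporting promised in the first response (mean accuracy with standard deviation over three seeds) could look like the sketch below. It uses the sample standard deviation from Python's `statistics` module. The method names and per-seed accuracies are placeholders invented for illustration and are not results from the paper.

```python
from statistics import mean, stdev


def summarize(accuracies):
    """Return (mean, sample standard deviation) for per-seed accuracies."""
    return mean(accuracies), stdev(accuracies)


# Placeholder per-seed accuracies (percent) for two hypothetical methods.
runs = {
    "method_a": [70.0, 71.0, 72.0],
    "method_b": [69.0, 71.0, 73.0],
}

for name, accs in runs.items():
    m, s = summarize(accs)
    print(f"{name}: {m:.2f} +/- {s:.2f} (n={len(accs)})")
```

With only three seeds the standard deviation is itself a rough estimate, so overlapping intervals between two methods would leave their ranking open.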
Circularity Check
Empirical adaptation method validated on external datasets with no circular reduction
full rationale
The paper introduces CLIP-SVD as a practical parameter-efficient fine-tuning approach that updates only the singular values of pretrained CLIP weight matrices while keeping singular vectors fixed. All reported results consist of accuracy and generalization metrics on held-out natural and biomedical classification datasets, compared against prior adaptation baselines. No equations, derivations, or first-principles claims appear in the provided text that would make the performance numbers equivalent to the method's own inputs or fitted hyperparameters by construction. The design choice of SVF is presented as an explicit modeling decision rather than a derived necessity, and success is assessed through standard empirical protocols independent of any self-referential loop.
Axiom & Free-Parameter Ledger
free parameters (1)
- learning rate and number of SVF iterations
axioms (1)
- domain assumption: Pretrained CLIP weight matrices admit a stable SVD that can be recomputed and updated without numerical instability
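One way to probe this assumption on a given matrix is to check how well the computed factors reconstruct it and how close they are to orthonormal. The sketch below is an illustrative check, not a procedure from the paper. The function name `svd_health`, the matrix shape, and the random float32 stand-in for a pretrained weight are all assumptions.

```python
import numpy as np


def svd_health(W):
    """Return relative reconstruction error, orthogonality error, and condition number."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # How closely U @ diag(s) @ Vt reproduces W, relative to the norm of W.
    recon_err = np.linalg.norm(W - (U * s) @ Vt) / np.linalg.norm(W)
    # How far the computed factors are from having orthonormal columns/rows.
    ortho_err = max(
        np.linalg.norm(U.T @ U - np.eye(U.shape[1])),
        np.linalg.norm(Vt @ Vt.T - np.eye(Vt.shape[0])),
    )
    # Ratio of largest to smallest singular value.
    cond = s[0] / s[-1] if s[-1] > 0 else np.inf
    return recon_err, ortho_err, cond


# Random float32 matrix standing in for a pretrained weight (hypothetical shape).
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128)).astype(np.float32)

recon_err, ortho_err, cond = svd_health(W)
```

Small reconstruction and orthogonality errors together with a moderate condition number would be consistent with the assumption for that matrix. Real pretrained weights may behave differently from this random stand-in.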
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets... using only 0.04% of the model's total parameters"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.