arxiv: 2603.27383 · v2 · submitted 2026-03-28 · 💻 cs.CV

Decompose, Mix, Adapt: A Unified Framework for Parameter-Efficient Neural Network Recombination and Compression

Pith reviewed 2026-05-14 22:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords CRISP · parameter recombination · model compression · parameter-efficient fine-tuning · weight factorization · unified framework · neural network adaptation · computer vision

The pith

CRISP factorizes pretrained weights into shared bases and small mixers to support both model compression and parameter-efficient fine-tuning in a single framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CRISP as a unified parameter recombination method that factorizes neural network weights into basis matrices shared across layers and small component mixing projections. Sharing and resizing the bases enables compression while the tiny mixer weights, often under 200 parameters, support quick adaptation to new tasks. This integration addresses the challenge of applying separate techniques for compression and fine-tuning, which becomes costly in edge deployments where even reduced parameter counts matter. A sympathetic reader would care because it promises models that are both smaller and more adaptable without composing multiple prior methods.

Core claim

CRISP factorizes pretrained weights into basis matrices and their component mixing projections. Sharing basis matrices across layers and adjusting their size enables model compression, whereas the small size of the mixer weights enables parameter-efficient fine-tuning. Experiments show CRISP outperforms prior methods capable of dual-task applications by 4-5%, the state of the art in PEFT by 1.5%, and PEFT+MC combinations by 1%.
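
As a rough illustration of the factorization described here, the sketch below treats each layer's weight as a coefficient-weighted sum of basis matrices shared across layers. The linear-combination form, the names `recombine` and `param_counts`, and the example shapes are assumptions made for illustration. The paper's actual mixer parameterization may differ.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def recombine(bases: List[Matrix], coeffs: List[float]) -> Matrix:
    """Return sum_k coeffs[k] * bases[k]; all bases must share one shape.

    Assumption: a layer weight is a plain linear mix of shared bases.
    """
    rows, cols = len(bases[0]), len(bases[0][0])
    out = [[0.0] * cols for _ in range(rows)]
    for basis, c in zip(bases, coeffs):
        for i in range(rows):
            for j in range(cols):
                out[i][j] += c * basis[i][j]
    return out


def param_counts(num_layers: int, rows: int, cols: int,
                 num_bases: int) -> Tuple[int, int, int]:
    """Return (dense, shared_basis, mixer) parameter counts.

    dense:        one full matrix per layer
    shared_basis: num_bases full matrices stored once for all layers
    mixer:        one scalar coefficient per (layer, basis) pair
    """
    dense = num_layers * rows * cols
    shared = num_bases * rows * cols
    mixers = num_layers * num_bases
    return dense, shared, mixers


if __name__ == "__main__":
    # Hypothetical sizes: 12 layers of 384x384 weights, 4 shared bases.
    dense, shared, mixers = param_counts(12, 384, 384, 4)
    print(dense, shared, mixers)  # 1769472 589824 48
```

Under these hypothetical sizes, compression comes from storing 4 bases instead of 12 dense matrices, and only the 48 mixing coefficients would need to be trained per task.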

What carries the argument

Coefficient-gated weight Recombination by Interpolated Shared basis Projections (CRISP), which decomposes weights into shared basis matrices for compression and small mixing projections for adaptation.

If this is right

  • Models can be compressed by reducing basis size while still adapting to new tasks with fewer than 200 additional parameters.
  • Dual-task performance exceeds prior recombination methods by 4-5% on relevant benchmarks.
  • The approach outperforms standalone state-of-the-art PEFT methods by 1.5% and combined PEFT plus model compression baselines by 1%.
  • A single factorization replaces the need to compose separate parameter recombination techniques for compression and fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The shared-basis approach might extend naturally to transformer-based models outside computer vision tasks.
  • Further reductions in basis size could be tested to determine the exact compression limits before performance degrades.
  • Combining CRISP with quantization or pruning might produce additive efficiency gains not explored in the current work.
  • Deployment on edge hardware could be measured directly to quantify the practical memory and latency savings.

Load-bearing premise

Factorizing pretrained weights into shared basis matrices and small component mixing projections preserves sufficient model capacity and performance when bases are shared across layers for compression.

What would settle it

A performance drop exceeding 5% on standard vision benchmarks when using the compressed CRISP model compared to uncompressed fine-tuned baselines would falsify the capacity preservation claim.
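
The criterion above can be written as a one-line check. This sketch assumes the 5% threshold means absolute accuracy points, which the text does not specify; the function name and default are hypothetical.

```python
def capacity_claim_falsified(baseline_acc: float, crisp_acc: float,
                             max_drop_points: float = 5.0) -> bool:
    """True if the compressed model loses more than max_drop_points.

    Assumption: accuracies are percentages and the drop is measured in
    absolute percentage points, not relative to the baseline.
    """
    return (baseline_acc - crisp_acc) > max_drop_points
```

A relative-drop reading would instead divide the difference by `baseline_acc`, so any real test should state which reading it uses.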

Figures

Figures reproduced from arXiv: 2603.27383 by Bryan A. Plummer, Nazia Tasnim, Shrimai Prabhumoye.

Figure 1: PR approach comparison. (a) Prior work in PR typically focuses on PEFT or MC alone [3, 7, 15, 19, 26, 28, 40, 47, 54, 60, 68, 69, 77, 78, 86, 90, 94, 96, 102], which can result in efficient combinations when deployed together. (b) Our unified PR approach CRISP decomposes a pretrained model's weights that support both MC and PEFT, enabling us to more effectively use parameter budgets even as tasks scale. PE…
Figure 2: CRISP decomposes a pretrained weight matrix into a frozen shared basis and small, learnable mixer matrices, then retrofits these…
Figure 3: PEFT performance using a ViT-S/16 across a range of…
Figure 4: Comparing ImageNet [11] performance with and without 8-bit PTQ [83] compression. We find CRISP accurately reproduces the original model's performance while also demonstrating effective compositionality with other compression techniques. […] demonstrating that we accurately replicated the original model's performance. Second, CRISP can be effectively combined with methods like PTQ for additional memor…
Figure 5: Impact of mixer matrix dimensions on model capac…
Figure 6: Impact of regularization constraint placement across…
Figure 7: Effect of reconstruction loss functions during neural mimicry. We compare four loss functions (Huber, Smooth-L1, MSE, L1) used in the neural mimicry stage (Equation 5 of main paper) for retrofitting pretrained weights into CRISP's basis-mixer decomposition.
Figure 8: Robustness to initialization methods. We evaluate four standard initialization schemes (Uniform, Kaiming, Xavier, Orthogonal) for the mixer matrices A ′rs during both neural mimicry retrofitting and subsequent task adaptation. […] weight-space reconstruction without data is insufficient for aggressive compression. In contrast, distillation boosts accuracy by 31%, validating our two-stage approach. W…
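
For readers unfamiliar with the four reconstruction losses named in the Figure 7 caption, the sketch below gives their standard element-wise definitions and averages them over a flattened weight vector. It does not reproduce Equation 5 of the paper; the helper `reconstruction_loss` and the default thresholds `delta` and `beta` are assumptions.

```python
from typing import Callable, Sequence


def l1(r: float) -> float:
    """Absolute error."""
    return abs(r)


def mse(r: float) -> float:
    """Squared error."""
    return r * r


def huber(r: float, delta: float = 1.0) -> float:
    """Quadratic for |r| <= delta, linear beyond it."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)


def smooth_l1(r: float, beta: float = 1.0) -> float:
    """Quadratic (scaled by 1/beta) for |r| < beta, linear beyond it."""
    a = abs(r)
    return 0.5 * r * r / beta if a < beta else a - 0.5 * beta


def reconstruction_loss(original: Sequence[float],
                        rebuilt: Sequence[float],
                        per_element: Callable[[float], float]) -> float:
    """Mean element-wise loss between original and rebuilt weights."""
    residuals = [w - v for w, v in zip(original, rebuilt)]
    return sum(per_element(r) for r in residuals) / len(residuals)


if __name__ == "__main__":
    w = [1.0, -2.0, 0.5]       # hypothetical pretrained weights
    w_hat = [0.5, 1.0, 0.5]    # hypothetical basis-mixer reconstruction
    for fn in (l1, mse, huber, smooth_l1):
        print(fn.__name__, reconstruction_loss(w, w_hat, fn))
```

With `delta = beta = 1` the Huber and Smooth-L1 losses coincide, so any difference between them in practice comes from the choice of threshold.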
Original abstract

Parameter Recombination (PR) methods aim to efficiently compose the weights of a neural network for applications like Parameter-Efficient Fine-Tuning (PEFT) and Model Compression (MC), among others. Most methods focus on one application of PR, which can make composing them challenging. For example, when deploying a large model you may wish to compress the model and also quickly adapt it to new settings. However, PEFT methods can often still contain millions of parameters. This may be small compared to the original model size, but it can be problematic in resource-constrained deployments like edge devices, where these parameters take up a larger portion of the compressed model. To address this, we present Coefficient-gated weight Recombination by Interpolated Shared basis Projections (CRISP), a general approach that seamlessly integrates multiple PR tasks within the same framework. CRISP accomplishes this by factorizing pretrained weights into basis matrices and their component mixing projections. Sharing basis matrices across layers and adjusting their size enables us to perform MC, whereas the mixer weights' small size (fewer than 200 parameters in some experiments) enables CRISP to support PEFT. Experiments show CRISP outperforms methods from prior work capable of dual-task applications by 4-5% while also outperforming the state of the art in PEFT by 1.5% and PEFT+MC combinations by 1%. Our code is available at https://github.com/appledora/CRISP-CVPR26.
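
The abstract's point about adapters taking a larger share of a compressed model can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and every number in the example are hypothetical and are not taken from the paper.

```python
def adapter_share(base_params: int, keep_ratio: float,
                  adapter_params: int) -> float:
    """Fraction of deployed parameters occupied by the adapter.

    keep_ratio is the fraction of base parameters kept after
    compression (1.0 means no compression).
    """
    compressed = base_params * keep_ratio
    return adapter_params / (compressed + adapter_params)


if __name__ == "__main__":
    base = 20_000_000  # hypothetical backbone size
    # A 1M-parameter adapter on the uncompressed vs. 4x-compressed model.
    print(adapter_share(base, 1.0, 1_000_000))
    print(adapter_share(base, 0.25, 1_000_000))
    # A 200-parameter mixer on the same compressed model.
    print(adapter_share(base, 0.25, 200))
```

In this hypothetical setting, the same one-million-parameter adapter grows from under 5% of the deployed parameters to one sixth of them once the backbone is compressed fourfold, while a 200-parameter mixer stays negligible.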

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces CRISP (Coefficient-gated weight Recombination by Interpolated Shared basis Projections), a unified framework for parameter recombination. Pretrained weights are factorized into shared basis matrices (reduced in size for model compression) and small per-component mixing projections (under 200 parameters for PEFT). This enables simultaneous PEFT and MC within one model. Experiments claim 4-5% gains over prior dual-task methods, 1.5% over SOTA PEFT, and 1% over PEFT+MC combinations.

Significance. If the performance claims are robustly validated, the work provides a flexible, parameter-efficient way to combine adaptation and compression, addressing practical constraints on edge devices where PEFT overhead can dominate compressed models. The code release is a positive factor for reproducibility.

major comments (3)
  1. [Experimental Results] Experimental Results section: Performance claims (4-5% over dual-task baselines, 1.5% over SOTA PEFT) are stated without error bars, explicit data splits, or controls for hyperparameter selection of basis dimensions and mixer sizes (listed as free parameters). This makes the reported gains difficult to interpret as generalizable rather than post-hoc.
  2. [Method] Method section on shared bases: The unification claim rests on the assumption that sharing basis matrices across layers (while keeping mixers small) preserves sufficient capacity for dual-task gains. No ablation comparing shared vs. layer-specific bases is provided, leaving open the risk that layer-wise variation is lost and the small mixers cannot compensate without increasing rank (defeating compression).
  3. [Results tables] Results tables: Comparisons to prior dual-task methods require clearer specification of which baselines support both PEFT and MC simultaneously, along with exact parameter counts and training protocols for CRISP in each regime.
minor comments (2)
  1. [Abstract] Abstract: The repository link points to a future CVPR26 location; replace with a stable, permanent link or include a snapshot.
  2. [Method] Notation: Define the interpolation operation and coefficient-gating mechanism more explicitly, including how the mixing projections are applied during recombination.
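
Major comment 2 asks for a shared versus layer-specific comparison. One way to keep such an ablation fair is to match total parameter budgets, as sketched below. The sketch assumes full-size shared bases and a per-layer low-rank factorization W_l ≈ U_l V_l for the layer-specific variant; both choices, and all names and sizes, are hypothetical rather than the authors' protocol.

```python
def shared_budget(num_layers: int, d_out: int, d_in: int,
                  num_bases: int, mixer_params_per_layer: int) -> int:
    """Parameters for shared bases plus per-layer mixers."""
    return num_bases * d_out * d_in + num_layers * mixer_params_per_layer


def matched_layer_rank(num_layers: int, d_out: int, d_in: int,
                       budget: int) -> int:
    """Largest rank r with num_layers * r * (d_out + d_in) <= budget.

    Each layer stores U (d_out x r) and V (r x d_in).
    """
    return budget // (num_layers * (d_out + d_in))


if __name__ == "__main__":
    # Hypothetical: 12 layers of 384x384 weights, 4 bases, 4 mixer
    # coefficients per layer.
    budget = shared_budget(12, 384, 384, 4, 4)
    print(budget, matched_layer_rank(12, 384, 384, budget))  # 589872 64
```

Under these assumptions, the layer-specific baseline would get rank 64 per layer for the same storage as four shared bases.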

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment point-by-point below and describe the revisions we will make.

  1. Referee: [Experimental Results] Experimental Results section: Performance claims (4-5% over dual-task baselines, 1.5% over SOTA PEFT) are stated without error bars, explicit data splits, or controls for hyperparameter selection of basis dimensions and mixer sizes (listed as free parameters). This makes the reported gains difficult to interpret as generalizable rather than post-hoc.

    Authors: We agree that error bars, explicit data splits, and hyperparameter controls are necessary for robust interpretation. In the revised manuscript we will report mean performance with standard deviations over at least three random seeds, state the precise train/validation/test splits for every dataset, and add a dedicated paragraph describing how basis dimensions and mixer sizes were chosen via validation performance (including the search ranges and selection criterion). revision: yes

  2. Referee: [Method] Method section on shared bases: The unification claim rests on the assumption that sharing basis matrices across layers (while keeping mixers small) preserves sufficient capacity for dual-task gains. No ablation comparing shared vs. layer-specific bases is provided, leaving open the risk that layer-wise variation is lost and the small mixers cannot compensate without increasing rank (defeating compression).

    Authors: Sharing bases across layers is fundamental to the compression objective; layer-specific bases would multiply the basis storage cost and defeat the MC goal. We will add an ablation in the revision that compares the shared-basis CRISP model against a layer-specific variant whose per-layer ranks are reduced so that total parameter count remains comparable. The results will be reported together with a short discussion of whether the small mixers suffice to recover layer-wise capacity. revision: yes

  3. Referee: [Results tables] Results tables: Comparisons to prior dual-task methods require clearer specification of which baselines support both PEFT and MC simultaneously, along with exact parameter counts and training protocols for CRISP in each regime.

    Authors: We will update the tables and text to explicitly mark which baselines support simultaneous PEFT and MC, list exact trainable-parameter counts for CRISP and every baseline in each regime (PEFT-only, MC-only, dual-task), and append a supplementary table or paragraph detailing the optimizer, learning-rate schedule, batch size, and number of epochs/steps used for CRISP under each setting. revision: yes
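
The reporting promised in the first response (mean with standard deviation over at least three seeds) could look like the following minimal sketch. The function names, the minimum-seed check, and the example accuracies are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(accs: Sequence[float], min_seeds: int = 3) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    if len(accs) < min_seeds:
        raise ValueError(f"need at least {min_seeds} seeds, got {len(accs)}")
    return mean(accs), stdev(accs)


def format_result(accs: Sequence[float]) -> str:
    """Format per-seed accuracies as 'mean ± std' for a results table."""
    m, s = summarize(accs)
    return f"{m:.2f} ± {s:.2f}"


if __name__ == "__main__":
    print(format_result([81.0, 82.0, 83.0]))  # 82.00 ± 1.00
```

`statistics.stdev` computes the sample standard deviation (n - 1 in the denominator), which is the usual choice when the seeds are a small sample of possible runs.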

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The CRISP framework is constructed by factorizing pretrained weights into shared basis matrices (for compression via size adjustment) and small per-component mixing projections (for PEFT via low parameter count). This decomposition follows standard low-rank ideas and directly enables the dual-task unification by varying basis rank and mixer size, without any equation or claim reducing the outputs to the inputs by definition. Performance numbers (4-5% gains over dual-task priors, 1.5% over SOTA PEFT) are presented as separate empirical results rather than predictions forced by the factorization itself. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps in the abstract or described method; the approach remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The approach rests on the assumption that pretrained weights admit a useful low-rank or basis decomposition that supports both sharing for compression and small adjustments for adaptation; no new entities are postulated.

free parameters (2)
  • basis matrix dimensions
    Size of shared basis matrices is chosen to achieve desired compression ratio while maintaining performance.
  • mixer projection size
    Number of parameters in the mixing projections is kept small (under 200 in experiments) to enable parameter-efficient fine-tuning.
axioms (1)
  • domain assumption: Pretrained neural network weights can be factorized into basis matrices and mixing projections without substantial loss of expressivity
    Invoked in the decomposition step to enable sharing and recombination.


Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · 1 internal anchor

  1. [1]

    P. Agand. Knowledge distillation from single-task teachers to multi-task student for end-to-end autonomous driving. Proceedings of the AAAI Conference on Artificial Intelligence, 38:23375–23376, 2024.

  2. [2]

    Prune efficiently by soft pruning

    Parakh Agarwal, Manu Mathew, Kunal Ranjan Patel, Varun Tripathi, and Pramod Swami. Prune efficiently by soft pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 2210–2217, 2024.

  3. [3]

    Deepcompress-vit: Rethinking model compression to enhance efficiency of vision transformers at the edge

    Sabbir Ahmed, Abdullah Al Arafat, Deniz Najafi, Akhlak Mahmood, Mamshad Nayeem Rizve, Mohaiminul Al Nahian, Ranyang Zhou, Shaahin Angizi, and Adnan Siraj Rakin. Deepcompress-vit: Rethinking model compression to enhance efficiency of vision transformers at the edge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR...

  4. [4]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7432–7439, 2020.

  5. [5]

    Fcp dis vit: Efficient vision transformer with neural network pruning

    Yuhao Cao. Fcp dis vit: Efficient vision transformer with neural network pruning. In 2024 IEEE 4th International Conference on Power, Electronics and Computer Applications (ICPECA), pages 1216–1221, 2024.

  6. [6]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.

  7. [7]

    Adaptformer: adapting vision transformers for scalable visual recognition

    Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adaptformer: adapting vision transformers for scalable visual recognition. In Proceedings of the 36th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2022. Curran Associates Inc.

  8. [8]

    Lorashear: Efficient large language model structured pruning and knowledge recovery

    Tianyi Chen, Tianyu Ding, Badal Yadav, Ilya Zharkov, and Luming Liang. Lorashear: Efficient large language model structured pruning and knowledge recovery, 2023.

  9. [9]

    BoolQ: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)...

  10. [10]

    Think you have solved question answering? try arc, the ai2 reasoning challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018.

  11. [11]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.

  12. [12]

    Comedian: Self-supervised…

    Julien Denize, Mykola Liashuha, Jaonary Rabarisoa, Astrid Orcesi, and Romain Hérault. Comedian: Self-supervised…

    Table 10. Comparison of PEFT methods on commonsense reasoning benchmarks. Results from LoRA and DoRA are taken from Liu et al. [47], HiRA results are from Huang et al. [30]. We find that CRISP is on par or better than custom PEFT methods while...

  13. [13]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. CoRR, abs/2010.11929, 2020.

  14. [14]

    Sigmoid-weighted linear units for neural network function approximation in reinforcement learning

    Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018. Special issue on deep reinforcement learning.

  15. [15]

    Hyperdiffusion: Generating implicit neural fields with weight-space diffusion

    Ziya Erkoç, Fangchang Ma, Qi Shan, Matthias Nießner, and Angela Dai. Hyperdiffusion: Generating implicit neural fields with weight-space diffusion. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 14254–14264, 2023.

  16. [16]

    Isomorphic pruning for vision models

    Gongfan Fang, Xinyin Ma, Michael Bi Mi, and Xinchao Wang. Isomorphic pruning for vision models. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXX, pages 232–250, Berlin, Heidelberg, 2024. Springer-Verlag.

  17. [17]

    The language model evaluation harness

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The l...

  18. [18]

    Fast r-cnn

    Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.

  19. [19]

    Pruning by block benefit: Exploring the properties of vision transformer blocks during domain adaptation

    Patrick Glandorf and Bodo Rosenhahn. Pruning by block benefit: Exploring the properties of vision transformer blocks during domain adaptation. In International Conference on Computer Vision Workshop, 2025.

  20. [20]

    Deep sparse rectifier neural networks

    Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, Fort Lauderdale, FL, USA,

  21. [21]

    Balf: Budgeted activation-aware low-rank factorization for fine-tuning-free model compression

    David González-Martínez. Balf: Budgeted activation-aware low-rank factorization for fine-tuning-free model compression, 2025.

  22. [22]

    The llama 3 herd of models

    Aaron Grattafiori et al. The llama 3 herd of models,

  23. [23]

    Weight copy and low-rank adaptation for few-shot distillation of vision transformers

    Diana-Nicoleta Grigore, Mariana-Iuliana Georgescu, Jon Alvarez Justo, Tor Johansen, Andreea Iuliana Ionescu, and Radu Tudor Ionescu. Weight copy and low-rank adaptation for few-shot distillation of vision transformers. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), pages 7368–7378, 2025.

  24. [24]

    Hypernetworks

    David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.

  25. [25]

    Svdiff: Compact parameter space for diffusion fine-tuning

    Ligong Han, Yinxiao Li, Han Zhang, Peyman Milanfar, Dimitris Metaxas, and Feng Yang. Svdiff: Compact parameter space for diffusion fine-tuning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

  26. [26]

    Learning efficient vision transformers via fine-grained manifold distillation

    Zhiwei Hao, Jianyuan Guo, Ding Jia, Kai Han, Yehui Tang, Chao Zhang, Han Hu, and Yunhe Wang. Learning efficient vision transformers via fine-grained manifold distillation. In Advances in Neural Information Processing Systems, 2022.

  27. [27]

    Multi-dimensional model compression of vision transformer

    Zejiang Hou and Sun-Yuan Kung. Multi-dimensional model compression of vision transformer. In 2022 IEEE International Conference on Multimedia and Expo (ICME), pages 01–06, 2022.

  28. [28]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

  29. [29]

    LLM-adapters: An adapter family for parameter-efficient fine-tuning of large language models

    Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, and Roy Lee. LLM-adapters: An adapter family for parameter-efficient fine-tuning of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5254–5276, Singapore, 2023. Association for Computatio...

  30. [30]

    HiRA: Parameter-efficient hadamard high-rank adaptation for large language models

    Qiushi Huang, Tom Ko, Zhan Zhuang, Lilian Tang, and Yu Zhang. HiRA: Parameter-efficient hadamard high-rank adaptation for large language models. In The Thirteenth International Conference on Learning Representations, 2025.

  31. [31]

    Finding lottery tickets in vision models via data-driven spectral foresight pruning

    Leonardo Iurada, Marco Ciccone, and Tatiana Tommasi. Finding lottery tickets in vision models via data-driven spectral foresight pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16142–16151, 2024.

  32. [32]

    Adaptive class token knowledge distillation for efficient vision transformer

    Minchan Kang, Sanghyeok Son, and Daeshik Kim. Adaptive class token knowledge distillation for efficient vision transformer. Knowledge-Based Systems, 304:112531,

  33. [33]

    The need for speed: Pruning transformers with one recipe

    Samir Khaki and Konstantinos N Plataniotis. The need for speed: Pruning transformers with one recipe. In The Twelfth International Conference on Learning Representations, 2024.

  34. [34]

    Learning multiple layers of features from tiny images

    A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.

  35. [35]

    Dct-vit: High-frequency pruned vision transformer with discrete cosine transform

    Jongho Lee and Hyun Kim. Dct-vit: High-frequency pruned vision transformer with discrete cosine transform. IEEE Access, 12:80386–80396, 2024.

  36. [36]

    Discovering sparsity allocation for layer-wise pruning of large language models

    Lujun Li, Peijie Dong, Zhenheng Tang, Xiang Liu, Qiang Wang, Wenhan Luo, Wei Xue, Qifeng Liu, Xiaowen Chu, and Yike Guo. Discovering sparsity allocation for layer-wise pruning of large language models. In Proceedings of the 38th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2024. Curran Associates Inc.

  37. [37]

    Vb-lora: Extreme parameter efficient fine-tuning with vector banks

    Yang Li, Shaobo Han, and Shihao Ji. Vb-lora: Extreme parameter efficient fine-tuning with vector banks. In The 38th Conference on Neural Information Processing Systems (NeurIPS), 2024.

  38. [38]

    LLaMA-VID: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. LLaMA-VID: An image is worth 2 tokens in large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

  39. [39]

    Scaling & shifting your features: A new baseline for efficient model tuning

    Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

  40. [40]

    Inflora: Interference-free low-rank adaptation for continual learning

    Yan-Shuo Liang and Wu-Jun Li. Inflora: Interference-free low-rank adaptation for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23638–23647, 2024.

  41. [41]

    3-in-1: 2d rotary adaptation for efficient finetuning, efficient batching and composability

    Baohao Liao and Christof Monz. 3-in-1: 2d rotary adaptation for efficient finetuning, efficient batching and composability. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

  42. [42]

    Slimgpt: Layer-wise structured pruning for large language models

    Gui Ling, Ziyang Wang, Yuliang Yan, and Qingwen Liu. Slimgpt: Layer-wise structured pruning for large language models. In Advances in Neural Information Processing Systems, pages 107112–107137. Curran Associates, Inc., 2024.

  43. [43]

    SVFT: Parameter-efficient fine-tuning with singular vectors

    Vijay Lingam, Atula Tejaswi Neerkaje, Aditya Vavre, Aneesh Shetty, Gautham Krishna Gudur, Joydeep Ghosh, Eunsol Choi, Alex Dimakis, Aleksandar Bojchevski, and Sujay Sanghavi. SVFT: Parameter-efficient fine-tuning with singular vectors. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

  44. [44]

    Pruning via merging: Compressing LLMs via manifold alignment based layer merging

    Deyuan Liu, Zhanyue Qin, Hairu Wang, Zhao Yang, Zecheng Wang, Fangying Rong, Qingbin Liu, Yanchao Hao, Bo Li, Xi Chen, Cunhang Fan, Zhao Lv, Dianhui Chu, Zhiying Tu, and Dianbo Sui. Pruning via merging: Compressing LLMs via manifold alignment based layer merging. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,...

  45. [45]

    Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning

    Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. In Proceedings of the 36th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2022. Curran Associates Inc.

  46. [46]

    Small scale data-free knowledge distillation

    He Liu, Yikai Wang, Huaping Liu, Fuchun Sun, and Anbang Yao. Small scale data-free knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6008–6016,

  47. [47]

    Dora: weight-decomposed low-rank adaptation

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: weight-decomposed low-rank adaptation. In Proceedings of the 41st International Conference on Machine Learning. JMLR.org, 2024.

  48. [48]

    Parameter-efficient orthogonal finetuning via butterfly factorization

    Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng, Yandong Wen, Michael J. Black, Adrian Weller, and Bernhard Schölkopf. Parameter-efficient orthogonal finetuning via butterfly factorization. In ICLR,

  49. [49]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

  50. [50]

    S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft, 2013.

  51. [51]

    Prune and merge: Efficient token compression for vision transformer with spatial information preserved

    Junzhu Mao, Yang Shen, Jinyang Guo, Yazhou Yao, Xiansheng Hua, and Hengtao Shen. Prune and merge: Efficient token compression for vision transformer with spatial information preserved. IEEE Transactions on Multimedia, 27:4670–4683, 2025.

  52. [52]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium, 2018. Association for Computational Linguistics. 4

  53. [53]

    RoSA: Accurate parameter-efficient fine-tuning via robust adaptation

    Mahdi Nikdan, Soroush Tabesh, Elvir Crnčević, and Dan Alistarh. RoSA: Accurate parameter-efficient fine-tuning via robust adaptation. In Forty-first International Conference on Machine Learning, 2024. 2, 5

  54. [54]

    RoSA: Accurate parameter-efficient fine-tuning via robust adaptation

    Mahdi Nikdan, Soroush Tabesh, Elvir Crnčević, and Dan Alistarh. RoSA: Accurate parameter-efficient fine-tuning via robust adaptation. In Proceedings of the 41st International Conference on Machine Learning, pages 38187–38206. PMLR, 2024. 1, 2

  55. [55]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing, 2008. 4, 3

  56. [56]

    Fair-vpt: Fair visual prompt tuning for image classification

    Sungho Park and Hyeran Byun. Fair-vpt: Fair visual prompt tuning for image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12268–12278, 2024. 2

  57. [57]

    Data-free weight compress and denoise for large language models

    Runyu Peng, Yunhua Zhou, Qipeng Guo, Yang Gao, Hang Yan, Xipeng Qiu, and Dahua Lin. Data-free weight compress and denoise for large language models. CoRR, abs/2402.16319, 2024. 6, 4

  58. [58]

    Neural parameter allocation search

    Bryan A. Plummer, Nikoli Dryden, Julius Frost, Torsten Hoefler, and Kate Saenko. Neural parameter allocation search. In International Conference on Learning Representations, 2022. 2, 3, 4

  59. [59]

    Recognizing indoor scenes

    Ariadna Quattoni and Antonio Torralba. Recognizing indoor scenes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 413–420, 2009. 4, 3

  60. [60]

    Deit-lt: Distillation strikes back for vision transformer training on long-tailed datasets

    Harsh Rangwani, Pradipto Mondal, Mayank Mishra, Ashish Ramayee Asokan, and R. Venkatesh Babu. Deit-lt: Distillation strikes back for vision transformer training on long-tailed datasets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23396–23406, 2024. 1, 2

  61. [61]

    MELoRA: Mini-ensemble low-rank adapters for parameter-efficient fine-tuning

    Pengjie Ren, Chengshun Shi, Shiguang Wu, Mengqi Zhang, Zhaochun Ren, Maarten Rijke, Zhumin Chen, and Jiahuan Pei. MELoRA: Mini-ensemble low-rank adapters for parameter-efficient fine-tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3052–3064, Bangkok, Thailand, 2024. Associati...

  62. [62]

    Winogrande: an adversarial winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: an adversarial winograd schema challenge at scale. Commun. ACM, 64(9):99–106,

  63. [63]

    Social IQa: Commonsense reasoning about social interactions

    Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463–4473, Hong Kong, China, 2019...

  64. [64]

    Learning implicitly recurrent CNNs through parameter sharing

    Pedro Savarese and Michael Maire. Learning implicitly recurrent CNNs through parameter sharing. In International Conference on Learning Representations, 2019. 2

  65. [65]

    You only prune once: Designing calibration-free model compression with policy learning

    Ayan Sengupta, Siddhant Chaudhary, and Tanmoy Chakraborty. You only prune once: Designing calibration-free model compression with policy learning. In The Thirteenth International Conference on Learning Representations, 2025. 2, 6, 4

  66. [66]

    Pro-vpt: Distribution-adaptive visual prompt tuning via prompt relocation

    Chikai Shang, Mengke Li, Yiqun Zhang, Zhen Chen, Jinlin Wu, Fangqing Gu, Yang Lu, and Yiu-Ming Cheung. Pro-vpt: Distribution-adaptive visual prompt tuning via prompt relocation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1558–1568,

  67. [67]

    Diversity-guided mlp reduction for efficient large vision transformers

    Chengchao Shen, Hourun Zhu, Gongfan Fang, Jianxin Wang, and Xinchao Wang. Diversity-guided mlp reduction for efficient large vision transformers. arXiv preprint arXiv:2506.07138, 2025. 5, 6

  68. [68]

    UPop: Unified and progressive pruning for compressing vision-language transformers

    Dachuan Shi, Chaofan Tao, Ying Jin, Zhendong Yang, Chun Yuan, and Jiaqi Wang. UPop: Unified and progressive pruning for compressing vision-language transformers. In Proceedings of the 40th International Conference on Machine Learning, pages 31292–31311. PMLR, 2023. 1

  69. [69]

    See further for parameter efficient fine-tuning by standing on the shoulders of decomposition

    Chongjie Si, Xiaokang Yang, and Wei Shen. See further for parameter efficient fine-tuning by standing on the shoulders of decomposition. arXiv preprint arXiv:2407.05417, 2024. 1

  70. [70]

    Sparse low rank factorization for deep neural network compression

    Sridhar Swaminathan, Deepak Garg, Rajkumar Kannan, and Frederic Andres. Sparse low rank factorization for deep neural network compression. Neurocomputing, 398:185–196, 2020. 2

  71. [71]

    Nazia Tasnim and Bryan A. Plummer. Recast: Reparameterized, compact weight adaptation for sequential tasks. In International Conference on Learning Representations (ICLR), 2025. 2, 3, 4, 5, 6, 7, 8

  72. [72]

    No fine-tuning, no cry: robust svd for compressing deep networks

    M. Tukan, A. Maalouf, M. Weksler, and D. Feldman. No fine-tuning, no cry: robust svd for compressing deep networks. Sensors, 21:5599, 2021. 2

  73. [73]

    DyLoRA: Parameter-efficient tuning of pre-trained models using dynamic search-free low-rank adaptation

    Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. DyLoRA: Parameter-efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3274–3287, Dubrovnik, Croatia, 2023. Association for Computational...

  74. [74]

    C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011. 4, 5, 3

  75. [75]

    Ao Wang, Hui Chen, Zijia Lin, Sicheng Zhao, Jungong Han, and Guiguang Ding. Cait: Triple-win compression towards high accuracy, fast inference, and favorable transferability for vits. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–17, 2025. 2

  76. [76]

    H. Wang, J. Chang, Y. Zhai, X. Luo, J. Sun, Z. Lin, and Q. Tian. Lion: implicit vision prompt tuning. Proceedings of the AAAI Conference on Artificial Intelligence, 38:5372–5380, 2024. 2

  77. [77]

    Basis sharing: Cross-layer parameter sharing for large language model compression

    Jingcun Wang, Yu-Guang Chen, Ing-Chao Lin, Bing Li, and Grace Li Zhang. Basis sharing: Cross-layer parameter sharing for large language model compression. In The Thirteenth International Conference on Learning Representations, 2025. 1, 2, 3, 4, 5, 6

  78. [78]

    Neural network parameter diffusion

    Kaili Wang, Zhaopan Xu, Yukun Zhou, Zelin Zang, Trevor Darrell, Zhuang Liu, and Yang You. Neural network parameter diffusion, 2024. 1

  79. [79]

    SVD-LLM: Truncation-aware singular value decomposition for large language model compression

    Xin Wang, Yu Zheng, Zhongwei Wan, and Mi Zhang. SVD-LLM: Truncation-aware singular value decomposition for large language model compression. In The Thirteenth International Conference on Learning Representations, 2025. 2, 3

  80. [80]

    Revisiting the power of prompt for visual tuning

    Yuzhu Wang, Lechao Cheng, Chaowei Fang, Dingwen Zhang, Manni Duan, and Meng Wang. Revisiting the power of prompt for visual tuning. In Proceedings of the 41st International Conference on Machine Learning. JMLR.org,

Showing first 80 references.