AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
Pith reviewed 2026-05-12 21:05 UTC · model grok-4.3
The pith
AdaLoRA allocates the fine-tuning budget across weight matrices by ranking the importance of their low-rank updates via singular value decomposition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AdaLoRA represents each incremental update to a pre-trained weight matrix as a low-rank SVD and uses the magnitudes of the singular values to decide how many of those values to retain for that matrix. By dynamically pruning singular values whose magnitudes fall below a threshold, the method reduces the effective rank of unimportant updates while preserving the full budget for important ones, all without performing expensive exact SVDs at every step.
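A minimal sketch of the mechanism described above, assuming nothing beyond the text: the update is stored as left factor × diagonal × right factor, and pruning zeroes small diagonal entries. All shapes, names, and the threshold are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8           # hypothetical shapes and initial rank

P = rng.normal(size=(d_out, r))      # left factor (stands in for U)
lam = rng.normal(size=r)             # learned "singular values" (diagonal of Sigma)
Q = rng.normal(size=(r, d_in))       # right factor (stands in for V^T)

def pruned_update(P, lam, Q, threshold):
    """Zero out singular values whose magnitude falls below `threshold`.
    The factors P and Q are kept, so a pruned direction can recover
    later if its singular value grows back during training."""
    mask = np.abs(lam) >= threshold
    return P @ np.diag(lam * mask) @ Q, int(mask.sum())

delta_w, effective_rank = pruned_update(P, lam, Q, threshold=0.5)
print(delta_w.shape, effective_rank)  # effective rank is at most r
```

Because the parameterization keeps the diagonal explicit, the magnitudes are readable at every step without ever running an exact SVD on delta_w.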
What carries the argument
SVD parameterization of the low-rank incremental updates, which turns singular-value magnitudes into an importance score that directly controls how many parameters each matrix receives.
If this is right
- Fine-tuning remains effective even when the total number of trainable parameters is cut to a few percent of the model size.
- The same SVD-based importance scoring can be applied on top of other low-rank adapters without changing their training loops.
- Training time per step stays comparable to standard LoRA because the pruning decision reuses the already-computed singular values.
- The method produces different final ranks for different layers, automatically allocating more capacity to attention or feed-forward blocks that the task needs.
Where Pith is reading between the lines
- The same importance-driven pruning idea could be tested on vision transformers or multimodal models where some layers are known to be more task-specific than others.
- If singular-value magnitudes continue to track importance after the first few epochs, early pruning could further cut memory use during fine-tuning.
- The approach suggests a general principle: any low-rank adapter whose factors admit a cheap importance metric can replace uniform budget allocation.
Load-bearing premise
The importance of a weight matrix for the downstream task can be reliably read from the sizes of the singular values in its current low-rank update.
What would settle it
On a standard GLUE or SQuAD benchmark, run AdaLoRA with its adaptive pruning alongside a control that splits the same total budget uniformly across matrices; if the uniform version matches or beats AdaLoRA at every budget level, the adaptive-allocation claim fails.
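The control experiment can be sketched as two allocators fed the same total rank budget, one uniform and one importance-weighted. The proportional rule below is illustrative, not the paper's exact scheduler.

```python
def uniform_allocation(n_matrices, total_budget):
    """Split a total rank budget evenly, distributing any remainder."""
    base, rem = divmod(total_budget, n_matrices)
    return [base + (1 if i < rem else 0) for i in range(n_matrices)]

def adaptive_allocation(importances, total_budget):
    """Split the same budget in proportion to per-matrix importance scores
    (an illustrative stand-in for AdaLoRA's allocator)."""
    total = sum(importances)
    ranks = [int(total_budget * s / total) for s in importances]
    # hand out any rounding remainder to the most important matrices
    for i in sorted(range(len(ranks)), key=lambda j: -importances[j]):
        if sum(ranks) == total_budget:
            break
        ranks[i] += 1
    return ranks

scores = [4.0, 1.0, 0.5, 0.5]        # hypothetical importance scores
print(uniform_allocation(4, 12))     # -> [3, 3, 3, 3]
print(adaptive_allocation(scores, 12))  # -> [8, 2, 1, 1]
```

Both calls spend exactly twelve ranks; only the distribution differs, which is the comparison the falsification test above requires.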
Original abstract
Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP. However, common practice fine-tunes all of the parameters in a pre-trained model, which becomes prohibitive when a large number of downstream tasks are present. Therefore, many fine-tuning methods are proposed to learn incremental updates of pre-trained weights in a parameter efficient way, e.g., low-rank increments. These methods often evenly distribute the budget of incremental updates across all pre-trained weight matrices, and overlook the varying importance of different weight parameters. As a consequence, the fine-tuning performance is suboptimal. To bridge this gap, we propose AdaLoRA, which adaptively allocates the parameter budget among weight matrices according to their importance score. In particular, AdaLoRA parameterizes the incremental updates in the form of singular value decomposition. Such a novel approach allows us to effectively prune the singular values of unimportant updates, which is essentially to reduce their parameter budget but circumvent intensive exact SVD computations. We conduct extensive experiments with several pre-trained models on natural language processing, question answering, and natural language generation to validate the effectiveness of AdaLoRA. Results demonstrate that AdaLoRA manifests notable improvement over baselines, especially in the low budget settings. Our code is publicly available at https://github.com/QingruZhang/AdaLoRA .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AdaLoRA for parameter-efficient fine-tuning of large pre-trained language models. It addresses the uniform budget allocation in methods like LoRA by adaptively distributing the parameter budget across weight matrices according to an importance score. The core mechanism parameterizes incremental updates ΔW as low-rank SVD forms UΣV^T and prunes singular values below a dynamic threshold derived from the importance score, thereby reducing the effective parameter count without repeated exact SVD computations. Experiments across NLP, QA, and NLG tasks with several pre-trained models are reported to show improvements over baselines, particularly under low parameter budgets.
Significance. If the SVD-based importance scoring reliably identifies task-relevant directions and the pruning preserves performance, AdaLoRA would offer a practical advance in PEFT by concentrating limited parameters where they contribute most. This could be especially valuable for low-resource fine-tuning scenarios and might inspire further adaptive allocation techniques that avoid uniform distribution across layers or matrices.
major comments (3)
- [§3] §3 (Method), the importance score definition and pruning rule: the claim that singular-value magnitudes serve as a reliable proxy for per-matrix contribution to downstream loss reduction lacks supporting analysis or ablation. Replacing the SVD-derived score with a random or gradient-norm baseline while holding total parameter count fixed would be required to isolate whether adaptivity, rather than the SVD parameterization itself, drives the reported gains.
- [Experiments] Experiments section and Table results: the abstract asserts 'notable improvement... especially in the low budget settings' but supplies no numerical deltas, standard deviations, number of runs, or direct comparison tables against LoRA with identical total budget. Without these, it is impossible to determine whether the gains exceed what could be obtained by simple hyperparameter tuning of uniform LoRA.
- [§3.2] §3.2, the dynamic threshold and pruning schedule: the description does not specify how the importance score is updated during training (e.g., running average vs. per-step recomputation) or whether pruning is performed once or iteratively. If the bases U and V are still evolving early in training, early pruning decisions may discard directions that later become important, undermining the 'parameter-free' aspect of the allocation.
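The ablation the first major comment asks for can be sketched as three interchangeable scoring rules feeding one shared allocator, so only the score varies while the retained parameter count stays fixed. All names and values are illustrative.

```python
import numpy as np

def score_svd(lam, grad):
    """Magnitude of the learned singular values (the method's proxy)."""
    return np.abs(lam)

def score_grad_norm(lam, grad):
    """Gradient-norm control suggested by the referee."""
    return np.abs(grad)

def score_random(lam, grad, rng=np.random.default_rng(2)):
    """Random control: same budget, no information in the score."""
    return rng.random(len(lam))

def keep_top_k(score, k):
    """Shared allocator: retain the k highest-scoring singular values,
    so every scoring rule spends an identical parameter budget."""
    idx = np.argsort(score)[::-1][:k]
    mask = np.zeros(len(score), dtype=bool)
    mask[idx] = True
    return mask

lam = np.array([2.0, 0.1, 1.5, 0.05])
grad = np.array([0.3, 0.9, 0.2, 0.8])
for score_fn in (score_svd, score_grad_norm, score_random):
    print(score_fn.__name__, keep_top_k(score_fn(lam, grad), k=2))
```

If downstream accuracy ranks the SVD score above both controls at matched budgets, the adaptivity claim is isolated from the parameterization.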
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a concise statement of the exact total parameter budget used in the low-budget regime (e.g., 0.1% or 1M parameters) to allow direct replication.
- [§3] Notation for the SVD parameterization (U, Σ, V) should be introduced with an explicit equation early in §3 rather than described only in prose.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We have addressed each major point below and revised the manuscript to provide additional analysis, quantitative details, and clarifications as suggested.
Point-by-point responses
Referee: [§3] §3 (Method), the importance score definition and pruning rule: the claim that singular-value magnitudes serve as a reliable proxy for per-matrix contribution to downstream loss reduction lacks supporting analysis or ablation. Replacing the SVD-derived score with a random or gradient-norm baseline while holding total parameter count fixed would be required to isolate whether adaptivity, rather than the SVD parameterization itself, drives the reported gains.
Authors: We agree that an explicit ablation is needed to isolate the contribution of the importance scoring. In the revised manuscript we have added ablation experiments in Section 4 that replace the SVD-derived importance scores with both random singular-value pruning and a gradient-norm baseline while keeping the total parameter budget identical across methods. The new results show that the SVD-based adaptive allocation outperforms these controls, especially under tight budgets, indicating that the gains are driven by the adaptive mechanism rather than the SVD parameterization alone. revision: yes
Referee: [Experiments] Experiments section and Table results: the abstract asserts 'notable improvement... especially in the low budget settings' but supplies no numerical deltas, standard deviations, number of runs, or direct comparison tables against LoRA with identical total budget. Without these, it is impossible to determine whether the gains exceed what could be obtained by simple hyperparameter tuning of uniform LoRA.
Authors: We acknowledge that the reporting of quantitative details can be strengthened. The original tables already present head-to-head comparisons under matched total budgets. In the revision we have updated the abstract to include concrete example deltas drawn from the existing results, added standard deviations computed over three independent runs to all tables, and explicitly stated the number of runs and the budget-equivalence protocol in the experimental setup section. revision: yes
Referee: [§3.2] §3.2, the dynamic threshold and pruning schedule: the description does not specify how the importance score is updated during training (e.g., running average vs. per-step recomputation) or whether pruning is performed once or iteratively. If the bases U and V are still evolving early in training, early pruning decisions may discard directions that later become important, undermining the 'parameter-free' aspect of the allocation.
Authors: We thank the referee for noting this lack of detail. The importance scores are recomputed periodically (every 100 steps after a short warm-up) using an exponential moving average of the current singular values; pruning is applied iteratively at these intervals rather than in a single step. This design allows the low-rank factors to continue evolving before final pruning decisions. We have expanded §3.2 with a precise description of the update rule, the pruning schedule, pseudocode, and a short discussion addressing the concern about early pruning. revision: yes
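The schedule described in this response can be sketched as follows; the decay factor, warm-up length, interval, and keep ratio are illustrative values standing in for the hyperparameters the rebuttal mentions.

```python
import numpy as np

def training_loop_schedule(lam_stream, warmup=500, interval=100, beta=0.85,
                           keep_ratio=0.5):
    """Track importance as an exponential moving average of singular-value
    magnitudes and prune iteratively at fixed intervals after a warm-up,
    rather than once. lam_stream yields one singular-value vector per step."""
    ema = None
    mask = None
    for step, lam in enumerate(lam_stream):
        mag = np.abs(lam)
        ema = mag if ema is None else beta * ema + (1 - beta) * mag
        if step >= warmup and step % interval == 0:
            # iterative pruning: keep the top fraction by EMA importance
            k = max(1, int(keep_ratio * len(ema)))
            thresh = np.sort(ema)[::-1][k - 1]
            mask = ema >= thresh
    return mask

rng = np.random.default_rng(1)
stream = (rng.normal(size=8) for _ in range(1000))
final_mask = training_loop_schedule(stream)
print(final_mask.sum())  # number of retained singular values
```

Because the mask is recomputed at every interval, a direction pruned early can re-enter if its EMA importance recovers, which is the property the response invokes against the early-pruning concern.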
Circularity Check
No significant circularity: AdaLoRA's SVD parameterization and importance-based pruning form an independent algorithmic proposal.
full rationale
The paper introduces AdaLoRA as a novel method that parameterizes incremental updates ΔW via SVD and uses singular-value magnitudes to adaptively prune and reallocate budget across matrices. This is presented as an empirical algorithmic change over uniform LoRA baselines, not as a derivation or prediction that reduces to its own fitted inputs by construction. No equations in the abstract or described claims exhibit self-definition (e.g., importance defined circularly from the pruning outcome), fitted-input-as-prediction, or load-bearing self-citation chains. The importance scoring is an explicit design choice within the method rather than a tautological renaming or imported uniqueness theorem. The central claim of improved low-budget performance therefore rests on external validation through experiments, not internal reduction.
Forward citations
Cited by 30 Pith papers
- MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning · MatryoshkaLoRA inserts a crafted diagonal matrix P into LoRA to learn accurate nested low-rank adapters that support dynamic rank selection with minimal performance drop.
- Beyond Factor Aggregation: Gauge-Aware Low-Rank Server Representations for Federated LoRA · GLoRA replaces raw factor averaging with gauge-aware aggregation in a consensus subspace estimated from client projectors, enabling consistent low-rank federated LoRA under heterogeneity.
- Continuous Expert Assembly: Instance-Conditioned Low-Rank Residuals for All-in-One Image Restoration · CEA assembles per-token low-rank residual updates via dense affinities over hyper-adapter-generated components to improve all-in-one image restoration on spatially non-uniform degradations.
- BoostLoRA: Growing Effective Rank by Boosting Adapters · BoostLoRA grows effective adapter rank linearly via iterative boosting on hard examples with orthogonal low-rank updates, outperforming both single-shot ultra-low-rank adapters and full fine-tuning on math and code ta...
- Adaptive and Fine-grained Module-wise Expert Pruning for Efficient LoRA-MoE Fine-Tuning · DMEP prunes experts module-by-module in LoRA-MoE and removes load balancing after pruning, cutting trainable parameters 35-43% and raising throughput ~10% while matching or exceeding uniform baselines on reasoning tasks.
- BioVLM: Routing Prompts, Not Parameters, for Cross-Modality Generalization in Biomedical VLMs · BioVLM achieves state-of-the-art cross-modality generalization on biomedical VLMs by learning a prompt bank and routing inputs to the most discriminative prompts via low-entropy selection plus LLM distillation.
- SpectralLoRA: Is Low-Frequency Structure Sufficient for LoRA Adaptation? A Spectral Analysis of Weight Updates · LoRA weight updates are spectrally sparse, with 33% of DCT coefficients capturing 90% of energy on average, enabling 10x storage reduction and occasional gains by masking high frequencies.
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention · LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
- PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts · PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
- MinT: Managed Infrastructure for Training and Serving Millions of LLMs · MinT enables efficient management of million-scale LoRA-adapted LLM policies over shared 1T-parameter base models by moving only small adapters through training and serving pipelines.
- Pretraining Induces a Reusable Spectral Basis for Downstream Task Adaptation · Pretraining induces stable leading singular vectors that form a reusable spectral basis inherited by downstream tasks, enabling competitive performance with 0.2% trainable parameters on GLUE.
- Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces · JACTUS unifies low-rank compression and task adaptation via a task-aware union of subspaces and global rank allocation by marginal gain, outperforming 100% PEFT methods like DoRA on ViT-Base (89.2% avg) and Llama2-7B ...
- The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation · Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accura...
- TLoRA: Task-aware Low Rank Adaptation of Large Language Models · TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
- MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning · MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...
- Sensitivity-Positional Co-Localization in GQA Transformers · In Llama 3.1 8B, task-sensitive layers cluster late while RoPE adaptation is strongest early, yet applying both adaptations only to sensitivity-identified layers outperforms other layer choices by 4-16 points on MMLU,...
- The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training · ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.
- Constraint-Driven Warm-Freeze for Efficient Transfer Learning in Photovoltaic Systems · CDWF achieves 90-99% of full fine-tuning performance with up to 120x fewer trainable parameters by dynamically allocating full trainability to gradient-important blocks and LoRA to others for PV cyberattack transfer learning.
- Aletheia: Gradient-Guided Layer Selection for Efficient LoRA Fine-Tuning Across Architectures · Gradient-guided layer selection for LoRA yields 15-28% training speedup with matched downstream results on MMLU, GSM8K, and HumanEval across 14 models from 0.5B to 72B parameters.
- Scalable Variational Bayesian Fine-Tuning of LLMs via Orthogonalized Low-Rank Adapters · PoLAR-VBLL combines orthogonalized low-rank adapters with variational Bayesian last-layer inference to enable scalable, well-calibrated uncertainty quantification in fine-tuned LLMs.
- CLIPer: Tailoring Diverse User Preference via Classifier-Guided Inference-Time Personalization · CLIPer uses classifier guidance during inference to personalize LLM generations across single and multi-dimensional user preferences without extensive fine-tuning.
- Text-Guided Multi-Scale Frequency Representation Adaptation · FreqAdapter adapts multimodal models by text-guided multi-scale fine-tuning in the frequency domain, claiming better performance and efficiency than signal-space PEFT methods.
- Dynamic Scaled Gradient Descent for Stable Fine-Tuning for Classifications · Dynamic scaled gradient descent prevents fine-tuning collapse by dynamically down-weighting gradients of correct examples, yielding lower performance variance and higher accuracy than standard methods on classificatio...
- ChipLingo: A Systematic Training Framework for Large Language Models in EDA · ChipLingo trains LLMs on EDA data via corpus construction, domain-adaptive pretraining, and RAG scenario alignment, reaching 59.7% accuracy with an 8B model and 70.02% with a 32B model on a new internal EDA benchmark.
- Dual-LoRA: Parameter-Efficient Adversarial Disentanglement for Cross-Lingual Speaker Verification · Dual-LoRA with a language-anchored adversary achieves 0.91% EER on the TidyVoice benchmark for cross-lingual speaker verification by targeting true linguistic cues while preserving speaker discriminability.
- The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation · Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering close the gap.
- HiP-LoRA: Budgeted Spectral Plasticity for Robust Low-Rank Adaptation · HiP-LoRA decomposes LoRA updates into principal and residual spectral channels with a singular-value-weighted stability budget to reduce forgetting and interference during foundation model adaptation.
- A Benchmark Study of Segmentation Models and Adaptation Strategies for Landslide Detection from Satellite Imagery · Transformer-based models deliver strong landslide segmentation on satellite images, and parameter-efficient fine-tuning matches full fine-tuning accuracy while cutting trainable parameters by up to 95%.
- Efficient Handwriting-Based Alzheimer's Disease Diagnosis Using a Low-Rank Mixture of Experts Deep Learning Framework · A low-rank mixture of experts model trained on handwriting data delivers strong Alzheimer's diagnosis performance with substantially reduced parameter activation during inference.
- Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey · A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.