AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
Pith reviewed 2026-05-12 21:05 UTC · model grok-4.3
The pith
AdaLoRA allocates the fine-tuning budget across weight matrices by ranking the importance of their low-rank updates via singular value decomposition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AdaLoRA represents each incremental update to a pre-trained weight matrix as a low-rank SVD and uses the magnitudes of the singular values to decide how many of those values to retain for that matrix. By dynamically pruning singular values whose magnitudes fall below a threshold, the method reduces the effective rank of unimportant updates while preserving the full budget for important ones, all without performing expensive exact SVDs at every step.
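A minimal sketch of the mechanism described above, assuming nothing beyond the text: the update is stored as left factor × diagonal × right factor, and pruning zeroes small diagonal entries. All shapes, names, and the threshold are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8           # hypothetical shapes and initial rank

P = rng.normal(size=(d_out, r))      # left factor (stands in for U)
lam = rng.normal(size=r)             # learned "singular values" (diagonal of Sigma)
Q = rng.normal(size=(r, d_in))       # right factor (stands in for V^T)

def pruned_update(P, lam, Q, threshold):
    """Zero out singular values whose magnitude falls below `threshold`.
    The factors P and Q are kept, so a pruned direction can recover
    later if its singular value grows back during training."""
    mask = np.abs(lam) >= threshold
    return P @ np.diag(lam * mask) @ Q, int(mask.sum())

delta_w, effective_rank = pruned_update(P, lam, Q, threshold=0.5)
print(delta_w.shape, effective_rank)  # effective rank is at most r
```

Because the parameterization keeps the diagonal explicit, the magnitudes are readable at every step without ever running an exact SVD on delta_w.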
What carries the argument
SVD parameterization of the low-rank incremental updates, which turns singular-value magnitudes into an importance score that directly controls how many parameters each matrix receives.
If this is right
- Fine-tuning remains effective even when the total number of trainable parameters is cut to a few percent of the model size.
- The same SVD-based importance scoring can be applied on top of other low-rank adapters without changing their training loops.
- Training time per step stays comparable to standard LoRA because the pruning decision reuses the already-computed singular values.
- The method produces different final ranks for different layers, automatically allocating more capacity to attention or feed-forward blocks that the task needs.
Where Pith is reading between the lines
- The same importance-driven pruning idea could be tested on vision transformers or multimodal models where some layers are known to be more task-specific than others.
- If singular-value magnitudes continue to track importance after the first few epochs, early pruning could further cut memory use during fine-tuning.
- The approach suggests a general principle: any low-rank adapter whose factors admit a cheap importance metric can replace uniform budget allocation.
Load-bearing premise
The importance of a weight matrix for the downstream task can be reliably read from the sizes of the singular values in its current low-rank update.
What would settle it
On a standard GLUE or SQuAD benchmark, run AdaLoRA with its adaptive pruning alongside a control that splits the same total budget uniformly across matrices; if the uniform version matches or beats AdaLoRA at every budget level, the adaptive-allocation claim fails.
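The control experiment can be sketched as two allocators fed the same total rank budget, one uniform and one importance-weighted. The proportional rule below is illustrative, not the paper's exact scheduler.

```python
def uniform_allocation(n_matrices, total_budget):
    """Split a total rank budget evenly, distributing any remainder."""
    base, rem = divmod(total_budget, n_matrices)
    return [base + (1 if i < rem else 0) for i in range(n_matrices)]

def adaptive_allocation(importances, total_budget):
    """Split the same budget in proportion to per-matrix importance scores
    (an illustrative stand-in for AdaLoRA's allocator)."""
    total = sum(importances)
    ranks = [int(total_budget * s / total) for s in importances]
    # hand out any rounding remainder to the most important matrices
    for i in sorted(range(len(ranks)), key=lambda j: -importances[j]):
        if sum(ranks) == total_budget:
            break
        ranks[i] += 1
    return ranks

scores = [4.0, 1.0, 0.5, 0.5]        # hypothetical importance scores
print(uniform_allocation(4, 12))     # -> [3, 3, 3, 3]
print(adaptive_allocation(scores, 12))  # -> [8, 2, 1, 1]
```

Both calls spend exactly twelve ranks; only the distribution differs, which is the comparison the falsification test above requires.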
Original abstract
Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP. However, common practice fine-tunes all of the parameters in a pre-trained model, which becomes prohibitive when a large number of downstream tasks are present. Therefore, many fine-tuning methods are proposed to learn incremental updates of pre-trained weights in a parameter efficient way, e.g., low-rank increments. These methods often evenly distribute the budget of incremental updates across all pre-trained weight matrices, and overlook the varying importance of different weight parameters. As a consequence, the fine-tuning performance is suboptimal. To bridge this gap, we propose AdaLoRA, which adaptively allocates the parameter budget among weight matrices according to their importance score. In particular, AdaLoRA parameterizes the incremental updates in the form of singular value decomposition. Such a novel approach allows us to effectively prune the singular values of unimportant updates, which is essentially to reduce their parameter budget but circumvent intensive exact SVD computations. We conduct extensive experiments with several pre-trained models on natural language processing, question answering, and natural language generation to validate the effectiveness of AdaLoRA. Results demonstrate that AdaLoRA manifests notable improvement over baselines, especially in the low budget settings. Our code is publicly available at https://github.com/QingruZhang/AdaLoRA .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AdaLoRA for parameter-efficient fine-tuning of large pre-trained language models. It addresses the uniform budget allocation in methods like LoRA by adaptively distributing the parameter budget across weight matrices according to an importance score. The core mechanism parameterizes incremental updates ΔW as low-rank SVD forms UΣV^T and prunes singular values below a dynamic threshold derived from the importance score, thereby reducing the effective parameter count without repeated exact SVD computations. Experiments across NLP, QA, and NLG tasks with several pre-trained models are reported to show improvements over baselines, particularly under low parameter budgets.
Significance. If the SVD-based importance scoring reliably identifies task-relevant directions and the pruning preserves performance, AdaLoRA would offer a practical advance in PEFT by concentrating limited parameters where they contribute most. This could be especially valuable for low-resource fine-tuning scenarios and might inspire further adaptive allocation techniques that avoid uniform distribution across layers or matrices.
major comments (3)
- [§3] §3 (Method), the importance score definition and pruning rule: the claim that singular-value magnitudes serve as a reliable proxy for per-matrix contribution to downstream loss reduction lacks supporting analysis or ablation. Replacing the SVD-derived score with a random or gradient-norm baseline while holding total parameter count fixed would be required to isolate whether adaptivity, rather than the SVD parameterization itself, drives the reported gains.
- [Experiments] Experiments section and Table results: the abstract asserts 'notable improvement... especially in the low budget settings' but supplies no numerical deltas, standard deviations, number of runs, or direct comparison tables against LoRA with identical total budget. Without these, it is impossible to determine whether the gains exceed what could be obtained by simple hyperparameter tuning of uniform LoRA.
- [§3.2] §3.2, the dynamic threshold and pruning schedule: the description does not specify how the importance score is updated during training (e.g., running average vs. per-step recomputation) or whether pruning is performed once or iteratively. If the bases U and V are still evolving early in training, early pruning decisions may discard directions that later become important, undermining the 'parameter-free' aspect of the allocation.
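The ablation the first major comment asks for can be sketched as three interchangeable scoring rules feeding one shared allocator, so only the score varies while the retained parameter count stays fixed. All names and values are illustrative.

```python
import numpy as np

def score_svd(lam, grad):
    """Magnitude of the learned singular values (the method's proxy)."""
    return np.abs(lam)

def score_grad_norm(lam, grad):
    """Gradient-norm control suggested by the referee."""
    return np.abs(grad)

def score_random(lam, grad, rng=np.random.default_rng(2)):
    """Random control: same budget, no information in the score."""
    return rng.random(len(lam))

def keep_top_k(score, k):
    """Shared allocator: retain the k highest-scoring singular values,
    so every scoring rule spends an identical parameter budget."""
    idx = np.argsort(score)[::-1][:k]
    mask = np.zeros(len(score), dtype=bool)
    mask[idx] = True
    return mask

lam = np.array([2.0, 0.1, 1.5, 0.05])
grad = np.array([0.3, 0.9, 0.2, 0.8])
for score_fn in (score_svd, score_grad_norm, score_random):
    print(score_fn.__name__, keep_top_k(score_fn(lam, grad), k=2))
```

If downstream accuracy ranks the SVD score above both controls at matched budgets, the adaptivity claim is isolated from the parameterization.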
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a concise statement of the exact total parameter budget used in the low-budget regime (e.g., 0.1% or 1M parameters) to allow direct replication.
- [§3] Notation for the SVD parameterization (U, Σ, V) should be introduced with an explicit equation early in §3 rather than described only in prose.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We have addressed each major point below and revised the manuscript to provide additional analysis, quantitative details, and clarifications as suggested.
Point-by-point responses
Referee: [§3] §3 (Method), the importance score definition and pruning rule: the claim that singular-value magnitudes serve as a reliable proxy for per-matrix contribution to downstream loss reduction lacks supporting analysis or ablation. Replacing the SVD-derived score with a random or gradient-norm baseline while holding total parameter count fixed would be required to isolate whether adaptivity, rather than the SVD parameterization itself, drives the reported gains.
Authors: We agree that an explicit ablation is needed to isolate the contribution of the importance scoring. In the revised manuscript we have added ablation experiments in Section 4 that replace the SVD-derived importance scores with both random singular-value pruning and a gradient-norm baseline while keeping the total parameter budget identical across methods. The new results show that the SVD-based adaptive allocation outperforms these controls, especially under tight budgets, indicating that the gains are driven by the adaptive mechanism rather than the SVD parameterization alone. revision: yes
Referee: [Experiments] Experiments section and Table results: the abstract asserts 'notable improvement... especially in the low budget settings' but supplies no numerical deltas, standard deviations, number of runs, or direct comparison tables against LoRA with identical total budget. Without these, it is impossible to determine whether the gains exceed what could be obtained by simple hyperparameter tuning of uniform LoRA.
Authors: We acknowledge that the reporting of quantitative details can be strengthened. The original tables already present head-to-head comparisons under matched total budgets. In the revision we have updated the abstract to include concrete example deltas drawn from the existing results, added standard deviations computed over three independent runs to all tables, and explicitly stated the number of runs and the budget-equivalence protocol in the experimental setup section. revision: yes
Referee: [§3.2] §3.2, the dynamic threshold and pruning schedule: the description does not specify how the importance score is updated during training (e.g., running average vs. per-step recomputation) or whether pruning is performed once or iteratively. If the bases U and V are still evolving early in training, early pruning decisions may discard directions that later become important, undermining the 'parameter-free' aspect of the allocation.
Authors: We thank the referee for noting this lack of detail. The importance scores are recomputed periodically (every 100 steps after a short warm-up) using an exponential moving average of the current singular values; pruning is applied iteratively at these intervals rather than in a single step. This design allows the low-rank factors to continue evolving before final pruning decisions. We have expanded §3.2 with a precise description of the update rule, the pruning schedule, pseudocode, and a short discussion addressing the concern about early pruning. revision: yes
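The schedule described in this response can be sketched as follows; the decay factor, warm-up length, interval, and keep ratio are illustrative values standing in for the hyperparameters the rebuttal mentions.

```python
import numpy as np

def training_loop_schedule(lam_stream, warmup=500, interval=100, beta=0.85,
                           keep_ratio=0.5):
    """Track importance as an exponential moving average of singular-value
    magnitudes and prune iteratively at fixed intervals after a warm-up,
    rather than once. lam_stream yields one singular-value vector per step."""
    ema = None
    mask = None
    for step, lam in enumerate(lam_stream):
        mag = np.abs(lam)
        ema = mag if ema is None else beta * ema + (1 - beta) * mag
        if step >= warmup and step % interval == 0:
            # iterative pruning: keep the top fraction by EMA importance
            k = max(1, int(keep_ratio * len(ema)))
            thresh = np.sort(ema)[::-1][k - 1]
            mask = ema >= thresh
    return mask

rng = np.random.default_rng(1)
stream = (rng.normal(size=8) for _ in range(1000))
final_mask = training_loop_schedule(stream)
print(final_mask.sum())  # number of retained singular values
```

Because the mask is recomputed at every interval, a direction pruned early can re-enter if its EMA importance recovers, which is the property the response invokes against the early-pruning concern.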
Circularity Check
No significant circularity: AdaLoRA's SVD parameterization and importance-based pruning form an independent algorithmic proposal.
full rationale
The paper introduces AdaLoRA as a novel method that parameterizes incremental updates ΔW via SVD and uses singular-value magnitudes to adaptively prune and reallocate budget across matrices. This is presented as an empirical algorithmic change over uniform LoRA baselines, not as a derivation or prediction that reduces to its own fitted inputs by construction. No equations in the abstract or described claims exhibit self-definition (e.g., importance defined circularly from the pruning outcome), fitted-input-as-prediction, or load-bearing self-citation chains. The importance scoring is an explicit design choice within the method rather than a tautological renaming or imported uniqueness theorem. The central claim of improved low-budget performance therefore rests on external validation through experiments, not internal reduction.
Forward citations
Cited by 30 Pith papers
- MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning · MatryoshkaLoRA inserts a crafted diagonal matrix P into LoRA to learn accurate nested low-rank adapters that support dynamic rank selection with minimal performance drop.
- Beyond Factor Aggregation: Gauge-Aware Low-Rank Server Representations for Federated LoRA · GLoRA replaces raw factor averaging with gauge-aware aggregation in a consensus subspace estimated from client projectors, enabling consistent low-rank federated LoRA under heterogeneity.
- Continuous Expert Assembly: Instance-Conditioned Low-Rank Residuals for All-in-One Image Restoration · CEA assembles per-token low-rank residual updates via dense affinities over hyper-adapter-generated components to improve all-in-one image restoration on spatially non-uniform degradations.
- BoostLoRA: Growing Effective Rank by Boosting Adapters · BoostLoRA grows effective adapter rank linearly via iterative boosting on hard examples with orthogonal low-rank updates, outperforming both single-shot ultra-low-rank adapters and full fine-tuning on math and code ta...
- Adaptive and Fine-grained Module-wise Expert Pruning for Efficient LoRA-MoE Fine-Tuning · DMEP prunes experts module-by-module in LoRA-MoE and removes load balancing after pruning, cutting trainable parameters 35-43% and raising throughput ~10% while matching or exceeding uniform baselines on reasoning tasks.
- BioVLM: Routing Prompts, Not Parameters, for Cross-Modality Generalization in Biomedical VLMs · BioVLM achieves state-of-the-art cross-modality generalization on biomedical VLMs by learning a prompt bank and routing inputs to the most discriminative prompts via low-entropy selection plus LLM distillation.
- SpectralLoRA: Is Low-Frequency Structure Sufficient for LoRA Adaptation? A Spectral Analysis of Weight Updates · LoRA weight updates are spectrally sparse, with 33% of DCT coefficients capturing 90% of energy on average, enabling 10x storage reduction and occasional gains by masking high frequencies.
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention · LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
- PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts · PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
- MinT: Managed Infrastructure for Training and Serving Millions of LLMs · MinT enables efficient management of million-scale LoRA-adapted LLM policies over shared 1T-parameter base models by moving only small adapters through training and serving pipelines.
- Pretraining Induces a Reusable Spectral Basis for Downstream Task Adaptation · Pretraining induces stable leading singular vectors that form a reusable spectral basis inherited by downstream tasks, enabling competitive performance with 0.2% trainable parameters on GLUE.
- Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces · JACTUS unifies low-rank compression and task adaptation via a task-aware union of subspaces and global rank allocation by marginal gain, outperforming 100% PEFT methods like DoRA on ViT-Base (89.2% avg) and Llama2-7B ...
- The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation · Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accura...
- TLoRA: Task-aware Low Rank Adaptation of Large Language Models · TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
- MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning · MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...
- Sensitivity-Positional Co-Localization in GQA Transformers · In Llama 3.1 8B, task-sensitive layers cluster late while RoPE adaptation is strongest early, yet applying both adaptations only to sensitivity-identified layers outperforms other layer choices by 4-16 points on MMLU,...
- The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training · ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.
- Constraint-Driven Warm-Freeze for Efficient Transfer Learning in Photovoltaic Systems · CDWF achieves 90-99% of full fine-tuning performance with up to 120x fewer trainable parameters by dynamically allocating full trainability to gradient-important blocks and LoRA to others for PV cyberattack transfer learning.
- Aletheia: Gradient-Guided Layer Selection for Efficient LoRA Fine-Tuning Across Architectures · Gradient-guided layer selection for LoRA yields 15-28% training speedup with matched downstream results on MMLU, GSM8K, and HumanEval across 14 models from 0.5B to 72B parameters.
- Scalable Variational Bayesian Fine-Tuning of LLMs via Orthogonalized Low-Rank Adapters · PoLAR-VBLL combines orthogonalized low-rank adapters with variational Bayesian last-layer inference to enable scalable, well-calibrated uncertainty quantification in fine-tuned LLMs.
- CLIPer: Tailoring Diverse User Preference via Classifier-Guided Inference-Time Personalization · CLIPer uses classifier guidance during inference to personalize LLM generations across single and multi-dimensional user preferences without extensive fine-tuning.
- Text-Guided Multi-Scale Frequency Representation Adaptation · FreqAdapter adapts multimodal models by text-guided multi-scale fine-tuning in the frequency domain, claiming better performance and efficiency than signal-space PEFT methods.
- Dynamic Scaled Gradient Descent for Stable Fine-Tuning for Classifications · Dynamic scaled gradient descent prevents fine-tuning collapse by dynamically down-weighting gradients of correct examples, yielding lower performance variance and higher accuracy than standard methods on classificatio...
- ChipLingo: A Systematic Training Framework for Large Language Models in EDA · ChipLingo trains LLMs on EDA data via corpus construction, domain-adaptive pretraining, and RAG scenario alignment, reaching 59.7% accuracy with an 8B model and 70.02% with a 32B model on a new internal EDA benchmark.
- Dual-LoRA: Parameter-Efficient Adversarial Disentanglement for Cross-Lingual Speaker Verification · Dual-LoRA with a language-anchored adversary achieves 0.91% EER on the TidyVoice benchmark for cross-lingual speaker verification by targeting true linguistic cues while preserving speaker discriminability.
- The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation · Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering close the gap.
- HiP-LoRA: Budgeted Spectral Plasticity for Robust Low-Rank Adaptation · HiP-LoRA decomposes LoRA updates into principal and residual spectral channels with a singular-value-weighted stability budget to reduce forgetting and interference during foundation model adaptation.
- A Benchmark Study of Segmentation Models and Adaptation Strategies for Landslide Detection from Satellite Imagery · Transformer-based models deliver strong landslide segmentation on satellite images, and parameter-efficient fine-tuning matches full fine-tuning accuracy while cutting trainable parameters by up to 95%.
- Efficient Handwriting-Based Alzheimer's Disease Diagnosis Using a Low-Rank Mixture of Experts Deep Learning Framework · A low-rank mixture of experts model trained on handwriting data delivers strong Alzheimer's diagnosis performance with substantially reduced parameter activation during inference.
- Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey · A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.