LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis

Chang Chu; Qi Li; Qingyue Zhang; Shao-Lun Huang; Tianren Peng; Xiangyang Luo; Zhihao Jiang

arxiv: 2510.24561 · v2 · submitted 2025-10-28 · 💻 cs.LG · cs.AI

Qingyue Zhang, Chang Chu, Tianren Peng, Qi Li, Xiangyang Luo, Zhihao Jiang, Shao-Lun Huang

Pith reviewed 2026-05-18 02:58 UTC · model grok-4.3

keywords: LoRA, PEFT, initialization, data-aware, asymptotic analysis, Fisher information, fine-tuning, low-rank adaptation

The pith

Minimizing the expected parameter discrepancy between fine-tuned and target models yields an optimal data-aware initialization for LoRA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a theoretical framework for LoRA initialization that uses target-domain data. It minimizes the expected difference between the parameters of the fine-tuned model and the target model, splitting the objective into a bias term and a variance term. The bias term is approximated with a Fisher-gradient formulation to retain directional structure, while the variance term uses Fisher information to account for sampling uncertainty. Solving this optimization produces an initialization rule implemented as the LoRA-DA algorithm. Tests on several benchmarks show higher final accuracy than prior methods, along with quicker and steadier convergence and only minor added cost.

Core claim

Starting from minimizing the expectation of the parameter discrepancy between the fine-tuned and target models, the paper derives an optimization problem with a bias term approximated using a Fisher-gradient formulation to preserve anisotropy and a variance term that accounts for sampling stochasticity through the Fisher information. Solving this problem yields an optimal initialization strategy for LoRA, from which the efficient LoRA-DA algorithm is developed.
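
Written out, the decomposition behind this claim takes the following form, as it appears in the extracted proof fragment (equation 23) in the reference list. The notation follows that fragment: $\hat W$ is the fine-tuned weight matrix and $W_{\mathrm{tgt}}^{\mathrm{proj}}$ is the projection of the target onto the LoRA subspace. The labels under the braces are added here for readability and are not the paper's typesetting.

```latex
\mathbb{E}\,\bigl\|\hat W - W_{\mathrm{tgt}}\bigr\|_F^2
  = \underbrace{\mathbb{E}\,\bigl\|\hat W - W_{\mathrm{tgt}}^{\mathrm{proj}}\bigr\|_F^2}_{\text{variance within the LoRA subspace}}
  + \underbrace{\bigl\|W_{\mathrm{tgt}}^{\mathrm{proj}} - W_{\mathrm{tgt}}\bigr\|_F^2}_{\text{bias: distance from } W_{\mathrm{tgt}} \text{ to the subspace}}
```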

What carries the argument

The optimization problem that minimizes expected parameter discrepancy by balancing a Fisher-gradient-approximated bias term and a Fisher-information variance term to determine the best LoRA initialization.
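
A minimal sketch of how such an optimization could be solved numerically, under stated assumptions. It assumes the objective reduces to minimizing tr(AᵀΩA) over matrices A with orthonormal columns, where Ω is the Initialization Guidance Matrix quoted further down this page; under that assumption the minimizer is spanned by the eigenvectors of Ω with the smallest eigenvalues. The function names, the toy dimensions, and the random stand-ins for W_tgt − W_0 and the inverse-Fisher blocks are hypothetical and are not taken from the paper, which approximates the target direction with its Fisher-gradient formulation instead.

```python
import numpy as np

def guidance_matrix(fisher_inv_blocks, delta, n_samples):
    """Omega = sum_i ( J^{-1}_[i] / N - delta_i delta_i^T ), one term per column i of delta."""
    d1, d2 = delta.shape
    omega = np.zeros((d1, d1))
    for i in range(d2):
        col = delta[:, i:i + 1]                      # column i as a (d1, 1) matrix
        omega += fisher_inv_blocks[i] / n_samples - col @ col.T
    return omega

def init_A(omega, rank):
    """Return an orthonormal (d1 x rank) matrix minimizing tr(A^T Omega A)."""
    omega = 0.5 * (omega + omega.T)                  # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(omega)         # eigenvalues in ascending order
    return eigvecs[:, :rank], eigvals[:rank]

# Toy data (hypothetical): random stand-ins, not quantities from the paper.
rng = np.random.default_rng(0)
d1, d2, rank, n_samples = 6, 4, 2, 256
delta = rng.normal(size=(d1, d2))                    # stand-in for W_tgt - W_0
fisher_inv_blocks = []
for _ in range(d2):
    m = rng.normal(size=(d1, d1))
    fisher_inv_blocks.append(m @ m.T + np.eye(d1))   # symmetric positive-definite block

omega = guidance_matrix(fisher_inv_blocks, delta, n_samples)
A, smallest = init_A(omega, rank)
objective = np.trace(A.T @ omega @ A)                # equals the sum of the rank smallest eigenvalues
```

Directions where the bias term (the outer product of the target displacement) dominates the variance term get strongly negative eigenvalues in Ω, so they are the ones selected first.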

If this is right

LoRA-DA reaches higher final accuracy than existing initialization methods on multiple benchmarks.
It produces faster and more stable convergence during fine-tuning.
Performance gains hold across different ranks.
The method adds only a small overhead at initialization time.
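
The experimental-setup excerpt in the reference list notes that 256 examples are pre-sampled to estimate both the gradient and the Fisher information during initialization, which is presumably the main source of the overhead. The sketch below shows one common estimator, the empirical Fisher (the average outer product of per-example gradients). Whether the paper uses this exact estimator is an assumption, and the function name and the random toy gradients are hypothetical.

```python
import numpy as np

def estimate_grad_and_fisher(per_example_grads):
    """per_example_grads: (n, p) array; each row is one flattened per-example gradient."""
    g = np.asarray(per_example_grads, dtype=float)
    n = g.shape[0]
    mean_grad = g.mean(axis=0)          # averaged gradient over the sample
    fisher = g.T @ g / n                # empirical Fisher: mean of g_i g_i^T
    return mean_grad, fisher

# Toy stand-in (hypothetical): 256 random "gradients" of dimension 5.
rng = np.random.default_rng(0)
grads = rng.normal(size=(256, 5))
mean_grad, fisher = estimate_grad_and_fisher(grads)
```

For a real weight matrix the full Fisher is far too large to form densely, so a practical implementation would need some block or low-dimensional structure; the sketch ignores that.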

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same asymptotic discrepancy minimization could be adapted to design initializations for other parameter-efficient fine-tuning modules.
Data-aware initializations may reduce the need for extensive hyperparameter search when moving to new domains or tasks.
The separation of bias and variance through Fisher quantities highlights why one-step gradient methods miss longer-term fine-tuning behavior.

Load-bearing premise

The Fisher-gradient approximation for the bias term and the Fisher information for the variance term must correctly capture the parameter discrepancy that arises during actual fine-tuning.

What would settle it

Applying LoRA-DA on the reported benchmarks and finding no gain or a loss in final accuracy relative to existing initialization methods would show the derived strategy is not optimal.

Figures

Figures reproduced from arXiv: 2510.24561 by Chang Chu, Qi Li, Qingyue Zhang, Shao-Lun Huang, Tianren Peng, Xiangyang Luo, Zhihao Jiang.

**Figure 1.** The yellow circle illustrates the estimation variance induced by the stochasticity of training samples in the unconstrained setting. The red variance term represents its projection onto the LoRA subspace under the fixed-A constraint, while the red bias term corresponds to the approximation error due to the distance between W_tgt and the LoRA subspace. In ...

**Figure 2.** The loss, grad norm, and evaluation accuracy on GSM8K over the training steps of LoRA ...

**Figure 3.** Loss landscape

**Figure 4.** Accuracy of LoRA-DA across different ranks on the GSM8K task.

Original abstract

LoRA has become a widely adopted method for PEFT, and its initialization methods have attracted increasing attention. However, existing methods have notable limitations: many methods do not incorporate target-domain data, while gradient-based methods exploit data only at a shallow level by relying on one-step gradient decomposition. In this paper, we establish a theoretical framework for data-aware LoRA initialization. Starting from minimizing the expectation of the parameter discrepancy between the fine-tuned and target models, we derive an optimization problem with two components: a bias term, which is related to the parameter distance between the fine-tuned and target models, and is approximated using a Fisher-gradient formulation to preserve anisotropy; and a variance term, which accounts for the uncertainty introduced by sampling stochasticity through the Fisher information. Solving this problem yields an optimal initialization strategy for LoRA, based on which we develop an efficient algorithm, LoRA-DA. Empirical results across multiple benchmarks demonstrate that LoRA-DA consistently improves final accuracy over existing initialization methods. Additional studies show faster, more stable convergence, robustness across ranks, and only a small initialization overhead for LoRA-DA. The source code will be released upon publication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes a data-aware initialization method for LoRA called LoRA-DA. It starts from the objective of minimizing the expected parameter discrepancy between a fine-tuned model and a target model, decomposes this into bias and variance terms, approximates the bias via a Fisher-gradient formulation (to retain anisotropy) and the variance via the Fisher information matrix (to capture sampling uncertainty), solves the resulting optimization problem to obtain a closed-form initializer, and develops an efficient algorithm implementing it. Empirical evaluations on multiple benchmarks are reported to show consistent accuracy gains over prior LoRA initializers, together with faster convergence and robustness to rank.

Significance. If the Fisher-based approximations are shown to be sufficiently accurate proxies for the true expectation in the fine-tuning regime, the work would supply the first asymptotically derived, data-aware LoRA initializer that goes beyond one-step gradient heuristics. This could improve initialization quality with only modest overhead and would be a useful addition to the PEFT literature, especially given the growing use of LoRA in large-model adaptation.

major comments (3)

[§3] §3 (theoretical framework): the central optimality claim for the derived initializer rests on replacing the true bias term in the expectation with a Fisher-gradient approximation and the variance term with the Fisher information; the manuscript provides neither analytic error bounds on these substitutions nor regime-specific conditions (e.g., distance from initialization or number of fine-tuning steps) under which the approximations remain faithful to the original objective.
[§3.2] §3.2 (bias/variance decomposition): the statement that the Fisher-gradient formulation 'preserves anisotropy' while the Fisher-information term captures 'sampling uncertainty' is presented as sufficient to guarantee optimality of the resulting closed-form solution, yet no verification is given that the approximated objective coincides with the exact minimizer outside the invoked asymptotic regime.
[§4] §4 (algorithm and implementation): the efficient algorithm LoRA-DA is obtained directly from the approximated optimization problem; because the approximation error is unquantified, it is unclear whether the computed initialization remains near-optimal once the first few gradient steps are taken during fine-tuning.

minor comments (3)

[Abstract] The abstract states empirical improvements without any quantitative numbers or baseline comparisons; these should be summarized with effect sizes in the abstract.
[§3] Notation for the Fisher information matrix and the gradient approximation is introduced without an explicit reference to the standard definitions used (e.g., which expectation is taken over).
[§5] Figure captions and axis labels in the convergence plots could be expanded to indicate the precise metric (e.g., validation accuracy vs. training steps) and the number of random seeds.

Simulated Author's Rebuttal

3 responses · 1 unresolved

Thank you for the opportunity to respond to the referee's report. We address the major comments point by point below, providing clarifications on our theoretical framework and proposing revisions where appropriate.

Point-by-point responses

Referee: [§3] §3 (theoretical framework): the central optimality claim for the derived initializer rests on replacing the true bias term in the expectation with a Fisher-gradient approximation and the variance term with the Fisher information; the manuscript provides neither analytic error bounds on these substitutions nor regime-specific conditions (e.g., distance from initialization or number of fine-tuning steps) under which the approximations remain faithful to the original objective.

Authors: We thank the referee for highlighting this important point. Our derivations rely on asymptotic analysis as the number of samples or steps grows, which is standard in such statistical approximations. While we do not provide explicit analytic error bounds, the approximations are motivated by the properties of the Fisher information in the context of fine-tuning large models. In the revision, we will add a dedicated subsection discussing the validity of these approximations under typical fine-tuning regimes, including when the fine-tuned model remains relatively close to the pre-trained one and for a moderate number of steps. We also plan to include additional experiments quantifying the approximation error empirically.

revision: partial

Referee: [§3.2] §3.2 (bias/variance decomposition): the statement that the Fisher-gradient formulation 'preserves anisotropy' while the Fisher-information term captures 'sampling uncertainty' is presented as sufficient to guarantee optimality of the resulting closed-form solution, yet no verification is given that the approximated objective coincides with the exact minimizer outside the invoked asymptotic regime.

Authors: The closed-form initializer is optimal with respect to the approximated objective function obtained after applying the asymptotic approximations. We agree that outside the asymptotic regime, it may not coincide exactly with the minimizer of the original expectation. However, the asymptotic regime is relevant for the practical fine-tuning setting we consider. In the revised manuscript, we will explicitly state that the optimality holds for the approximated problem and provide a brief discussion on the implications for finite regimes, supported by our empirical results showing improved performance.

revision: partial

Referee: [§4] §4 (algorithm and implementation): the efficient algorithm LoRA-DA is obtained directly from the approximated optimization problem; because the approximation error is unquantified, it is unclear whether the computed initialization remains near-optimal once the first few gradient steps are taken during fine-tuning.

Authors: The initialization is designed to minimize the expected discrepancy at the start of fine-tuning. As fine-tuning proceeds, the parameters evolve, but our experiments demonstrate faster convergence and better final accuracy, indicating that the benefits persist beyond the initial steps. To address this concern, we will add a new experiment or analysis in the revision showing the evolution of the parameter discrepancy or performance over the first few steps for LoRA-DA compared to baselines.

revision: partial

standing simulated objections not resolved

Providing analytic error bounds on the Fisher-gradient and Fisher-information approximations.
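
For context, the extracted proof fragment in the reference list (equations 37 and 38) does bound one of the substitutions, replacing J(W_tgt)⁻¹ with J(W_0)⁻¹, through the resolvent identity. The snippet below is a small numeric sanity check of that identity and the resulting norm inequality on random matrices. It is an illustration only: the matrices are hypothetical toy stand-ins for the two Fisher matrices, and the check says nothing about how large the gap is for a real model.

```python
import numpy as np

def random_spd(rng, dim):
    """Random symmetric positive-definite matrix (toy stand-in for a Fisher matrix)."""
    m = rng.normal(size=(dim, dim))
    return m @ m.T + dim * np.eye(dim)

rng = np.random.default_rng(0)
dim = 5
J0 = random_spd(rng, dim)                        # plays the role of J(W_0)
perturb = rng.normal(size=(dim, dim))
Jtgt = J0 + 0.1 * (perturb + perturb.T)          # plays the role of J(W_tgt), a nearby symmetric matrix

J0_inv = np.linalg.inv(J0)
Jtgt_inv = np.linalg.inv(Jtgt)

# Resolvent identity: Jtgt^{-1} - J0^{-1} = Jtgt^{-1} (J0 - Jtgt) J0^{-1}
lhs = Jtgt_inv - J0_inv
rhs = Jtgt_inv @ (J0 - Jtgt) @ J0_inv
identity_gap = np.linalg.norm(lhs - rhs, 2)

# Submultiplicativity of the spectral norm gives the bound in equation 38.
lhs_norm = np.linalg.norm(lhs, 2)
bound = (np.linalg.norm(Jtgt_inv, 2)
         * np.linalg.norm(Jtgt - J0, 2)
         * np.linalg.norm(J0_inv, 2))
```

The bound is only useful when ‖J(W_tgt) − J(W_0)‖ is small, which is the "fine-tuned model stays close to the pre-trained one" regime the rebuttal refers to.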

Circularity Check

0 steps flagged

Derivation starts from external objective and applies standard Fisher approximations without self-referential reduction

full rationale

The paper begins from the external objective of minimizing E[parameter discrepancy] between fine-tuned and target models. It decomposes this into bias and variance terms, then applies Fisher-gradient and Fisher-information approximations that are conventional statistical tools rather than quantities fitted from the paper's own outputs or defined circularly in terms of the final initializer. The resulting closed-form LoRA-DA strategy is obtained by solving the approximated optimization problem; empirical benchmarks are presented separately as validation. No equation or step reduces the claimed optimality to a tautological restatement of the inputs by construction, and no load-bearing self-citation or uniqueness theorem imported from the authors' prior work is invoked.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the validity of the asymptotic decomposition and the Fisher approximations; no new entities are postulated and the only free parameter is the user-chosen LoRA rank.

free parameters (1)

LoRA rank r
User-selected hyperparameter that controls the dimensionality of the low-rank update; the derivation treats it as given.

axioms (1)

domain assumption: The expected parameter discrepancy between fine-tuned and target models can be decomposed into bias and variance components that are well-approximated by Fisher-gradient and Fisher-information quantities.
This decomposition is the starting point of the optimization problem stated in the abstract.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: variance term ... through Fisher information ... bias term ... approximated using a Fisher–gradient formulation ... Initialization Guidance Matrix Ω = ∑ (J(W_0)^{-1}/N − (W_tgt − W_0)(W_tgt − W_0)^⊤)

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: asymptotic normality of the MLE ... √N (θ̂_MLE − θ*) → N(0, J(θ*)^{-1})
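
The asymptotic-normality statement quoted above can be illustrated with a small simulation. The sketch uses a Bernoulli model, chosen here purely for illustration and not taken from the paper: the MLE of the success probability p is the sample mean, the Fisher information is J(p) = 1/(p(1 − p)), and the variance of √N (p̂ − p) should therefore be close to J(p)⁻¹ = p(1 − p). All names and constants are hypothetical.

```python
import random
import statistics

def scaled_mle_errors(p_true, n, trials, seed=0):
    """Simulate sqrt(N) * (p_hat - p_true) for the Bernoulli MLE over many trials."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        successes = sum(1 for _ in range(n) if rng.random() < p_true)
        p_hat = successes / n                      # MLE of p is the sample mean
        errors.append((n ** 0.5) * (p_hat - p_true))
    return errors

p_true, n, trials = 0.3, 400, 2000
errors = scaled_mle_errors(p_true, n, trials)
empirical_var = statistics.pvariance(errors)       # spread of the scaled estimation error
fisher_inverse = p_true * (1 - p_true)             # J(p)^{-1} for a Bernoulli model
```

This is the scalar version of the variance term: dividing J⁻¹ by N gives the variance of the unscaled estimate, which matches the tr(J(W_tgt)⁻¹)/N expression in the extracted proof fragment.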

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 4 internal anchors

[1] Recall and learn: Fine-tuning deep pretrained language models with less forgetting

Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, and Xiangzhan Yu. Recall and learn: Fine-tuning deep pretrained language models with less forgetting. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7870–7881, 2020.

[2] BoolQ: Exploring the surprising difficulty of natural yes/no questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), ...

[3] Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.

[4] Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[5] On the asymptotics of constrained M-estimation

Charles J. Geyer. On the asymptotics of constrained M-estimation. The Annals of Statistics, pp. 1993–2010.

[6] Anisotropy is inherent to self-attention in transformers

Nathan Godey, Éric de la Clergerie, and Benoît Sagot. Anisotropy is inherent to self-attention in transformers. arXiv preprint arXiv:2401.12143, 2024.

[7] Towards a unified view of parameter-efficient transfer learning

Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366, 2021.

[8] LLM-Adapters: An adapter family for parameter-efficient fine-tuning of large language models

Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, and Roy Ka-Wei Lee. LLM-Adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint arXiv:2304.01933, 2023.

[9] The power of scale for parameter-efficient prompt tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2021.

[10] Can a suit of armor conduct electricity? A new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391, 2018.

[11] Residual prompt tuning: improving prompt tuning with residual reparameterization

Anastasiia Razdaibiedina, Yuning Mao, Madian Khabsa, Mike Lewis, Rui Hou, Jimmy Ba, and Amjad Almahairi. Residual prompt tuning: improving prompt tuning with residual reparameterization. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 6740–6757, 2023.

[12] Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[13] MiLoRA: Harnessing minor singular components for parameter-efficient LLM finetuning

Hanqing Wang, Yixia Li, Shuo Wang, Guanhua Chen, and Yun Chen. MiLoRA: Harnessing minor singular components for parameter-efficient LLM finetuning. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 4823–4836, 2025.

[14] LoRA-GA: Low-rank adaptation with gradient approximation

Shaowen Wang, Linxi Yu, and Jian Li. LoRA-GA: Low-rank adaptation with gradient approximation. Advances in Neural Information Processing Systems, 37:54905–54931, 2024a.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, and Ed Chi. Self-consistency improves chain of thought reasoning in language models. In ICLR, 2023a. URL https://openreview.net/forum?id=1P...

[15] LoRA-FA: Efficient and Effective Low Rank Representation Fine-tuning

Longteng Zhang, Lin Zhang, Shaohuai Shi, Xiaowen Chu, and Bo Li. LoRA-FA: Memory-efficient low-rank adaptation for large language models fine-tuning. arXiv preprint arXiv:2308.03303, 2023a.

Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In 11t...

[16] Appendix contents

A Notations
B Extended Related Work
B.1 Parameter-Efficient Fine-Tuning (PEFT)
B.2 Low-Rank Adaptation (LoRA) and its Variants and Initialization
C Proofs
C.1 Proof of Theorem 1 ...

[17] However, LoRA and many of its variants typically adopt random initialization for the low-rank matrix A and zero initialization for B

re-parameterizes LoRA by introducing shared low-rank vectors across layers, thereby reducing the number of trainable parameters while maintaining competitive performance. However, LoRA and many of its variants typically adopt random initialization for the low-rank matrix A and zero initialization for B. Such uninformed initialization introduces two main dr...

[18] For the first term of equation 23, by Lemma 3.3, we know that if $\hat W$ is not restricted to the LoRA form, the training variance is given by $\operatorname{tr}(J(W_{\mathrm{tgt}})^{-1})/N$

Under this assumption, the training objective can be decomposed as

$\mathbb{E}\|\hat W - W_{\mathrm{tgt}}\|_F^2 = \mathbb{E}\|\hat W - W_{\mathrm{tgt}}^{\mathrm{proj}}\|_F^2 + \|W_{\mathrm{tgt}}^{\mathrm{proj}} - W_{\mathrm{tgt}}\|_F^2$, (23)

where the first term represents the expected estimation variance within the subspace, and the second term corresponds to the bias arising from the distance between $W_{\mathrm{tgt}}$ and the LoRA subspace. For the first term of equation 2...

[19] Taking norms on both sides yields $\|J(W_{\mathrm{tgt}})^{-1} - J(W_0)^{-1}\| \le \|J(W_{\mathrm{tgt}})^{-1}\| \, \|J(W_{\mathrm{tgt}}) - J(W_0)\| \, \|J(W_0)^{-1}\|$. (38) We analyze the right-hand side of the inequality

$+\, O(1/\sqrt{N})$. (36)

By the resolvent identity (Horn & Johnson, 2012), we have

$J(W_{\mathrm{tgt}})^{-1} - J(W_0)^{-1} = J(W_{\mathrm{tgt}})^{-1} \bigl(J(W_0) - J(W_{\mathrm{tgt}})\bigr) J(W_0)^{-1}$, (37)

which can be easily proved by multiplying both sides of the equation on the left by $J(W_{\mathrm{tgt}})$. Taking norms on both sides yields

$\|J(W_{\mathrm{tgt}})^{-1} - J(W_0)^{-1}\| \le \|J(W_{\mathrm{tgt}})^{-1}\| \, \|J(W_{\mathrm{tgt}}) - J(W_0)\| \, \|J(W_0)^{-1}\|$. (38)

We analyze the right-hand si...

[20] For the second term of equation 44, similar to equation 42, we know that

$\|W_{\mathrm{tgt}}^{\mathrm{proj}} - W_{\mathrm{tgt}}\|_F^2 = \sum_{i=1}^{d_2} \|(W_{\mathrm{tgt}}^{\mathrm{proj}} - W_{\mathrm{tgt}})_{(:,i)}\|_F^2 = \sum_{i=1}^{d_2} \Bigl( \|(W_{\mathrm{tgt}} - W_0)_{(:,i)}\|_F^2 - \|A^\top (W_{\mathrm{tgt}} - W_0)_{(:,i)}\|_F^2 \Bigr)$ (46)

Combining equation 44, equation 45 and equation 46, we have

$\mathbb{E}\|\hat W - W_{\mathrm{tgt}}\|_F^2 \le \sum_{i=1}^{d_2} \|(W_{\mathrm{tgt}} - W_0)_{(:,i)}\|_F^2 + \operatorname{tr}\Bigl( A^\top \sum_{i=1}^{d_2} \Bigl( \frac{J(W_0)^{-1}_{[i]}}{N} - (W_{\mathrm{tgt}} - W_0)_{(:,i)} (W_{\mathrm{tgt}}$ ...

[21] Notably, for LoRA-DA and LoRA-One, we pre-sampled 256 examples to estimate both the gradient and the Fisher information matrix during initialization

We evaluated the models using the LLM-adapters framework (Hu et al., 2023). Notably, for LoRA-DA and LoRA-One, we pre-sampled 256 examples to estimate both the gradient and the Fisher information matrix during initialization.

Table 7: Hyperparameter settings for LoRA-DA and baseline methods. Hyperparameter: Value. LoRA Rank (r): 8. L...
