LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis
Pith reviewed 2026-05-18 02:58 UTC · model grok-4.3
The pith
Minimizing the expected parameter discrepancy between fine-tuned and target models yields an optimal data-aware initialization for LoRA.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Starting from minimizing the expectation of the parameter discrepancy between the fine-tuned and target models, the paper derives an optimization problem with a bias term approximated using a Fisher-gradient formulation to preserve anisotropy and a variance term that accounts for sampling stochasticity through the Fisher information. Solving this problem yields an optimal initialization strategy for LoRA, from which the efficient LoRA-DA algorithm is developed.
What carries the argument
The optimization problem that minimizes expected parameter discrepancy by balancing a Fisher-gradient-approximated bias term and a Fisher-information variance term to determine the best LoRA initialization.
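As a rough illustration of how such an optimization problem can turn into an initializer, the sketch below builds the "Initialization Guidance Matrix" Ω quoted further down this page, Ω = ∑ (J(W0)^{-1}/N − (W_tgt−W0)(W_tgt−W0)⊤), and picks the subspace that minimizes tr(A⊤ Ω A). It assumes A has orthonormal columns, in which case the minimizer is spanned by the eigenvectors of Ω with the smallest eigenvalues. The helper names `guidance_matrix` and `init_basis`, the toy dimensions, and the random stand-ins for the Fisher blocks and for W_tgt − W0 are all hypothetical; the paper's actual algorithm may differ.

```python
import numpy as np

def guidance_matrix(fisher_inv_blocks, delta, n_samples):
    """Omega = sum_i ( J_i^{-1} / N - delta_i delta_i^T ), one term per column i."""
    d = delta.shape[0]
    omega = np.zeros((d, d))
    for i, j_inv in enumerate(fisher_inv_blocks):
        col = delta[:, i:i + 1]                      # column i of W_tgt - W_0
        omega += j_inv / n_samples - col @ col.T     # variance part minus bias part
    return 0.5 * (omega + omega.T)                   # symmetrize against round-off

def init_basis(omega, rank):
    """Orthonormal A minimizing tr(A^T Omega A): eigenvectors of the smallest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(omega)         # eigenvalues in ascending order
    return eigvecs[:, :rank], eigvals[:rank]

# Toy inputs (assumptions for illustration only).
rng = np.random.default_rng(0)
d1, d2, rank, n_samples = 6, 4, 2, 256
delta = rng.normal(size=(d1, d2))                    # stand-in for W_tgt - W_0
fisher_inv_blocks = []
for _ in range(d2):
    m = rng.normal(size=(d1, d1))
    fisher_inv_blocks.append(np.linalg.inv(m @ m.T + np.eye(d1)))

omega = guidance_matrix(fisher_inv_blocks, delta, n_samples)
A, vals = init_basis(omega, rank)
print("objective tr(A^T Omega A):", np.trace(A.T @ omega @ A))
```

In practice W_tgt is unknown; per the core claim above, the paper replaces the bias part with a Fisher-gradient approximation rather than using the true difference as this toy does.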
If this is right
- LoRA-DA reaches higher final accuracy than existing initialization methods on multiple benchmarks.
- It produces faster and more stable convergence during fine-tuning.
- Performance gains hold across different ranks.
- The method adds only a small overhead at initialization time.
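To make the overhead claim concrete, here is a minimal sketch of the kind of one-off estimation the experimental excerpt on this page describes (256 pre-sampled examples used to estimate both the gradient and the Fisher information). It uses a toy logistic-regression model and the empirical Fisher (the average outer product of per-sample gradients), which is an assumption; the paper may use a different estimator, and `per_sample_grads` is a hypothetical helper.

```python
import numpy as np

def per_sample_grads(w, X, y):
    """Per-sample gradient of the logistic-regression negative log-likelihood."""
    p = 1.0 / (1.0 + np.exp(-X @ w))       # predicted probabilities
    return (p - y)[:, None] * X            # shape (n_samples, n_params)

rng = np.random.default_rng(0)
n, d = 256, 5                              # 256 pre-sampled examples, toy parameter count
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

w0 = np.zeros(d)                           # stand-in for the pre-trained weights
G = per_sample_grads(w0, X, y)
mean_grad = G.mean(axis=0)                 # gradient estimate
fisher = G.T @ G / n                       # empirical Fisher estimate (assumption)
print(mean_grad.shape, fisher.shape)
```

Both quantities come from a single pass over the sampled batch, which is consistent with the initialization cost being small relative to fine-tuning itself.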
Where Pith is reading between the lines
- The same asymptotic discrepancy minimization could be adapted to design initializations for other parameter-efficient fine-tuning modules.
- Data-aware initializations may reduce the need for extensive hyperparameter search when moving to new domains or tasks.
- The separation of bias and variance through Fisher quantities highlights why one-step gradient methods miss longer-term fine-tuning behavior.
Load-bearing premise
The Fisher-gradient approximation for the bias term and the Fisher information for the variance term must correctly capture the parameter discrepancy that arises during actual fine-tuning.
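The variance half of this premise rests on the asymptotic normality of the MLE quoted further down this page, √N (θ̂MLE − θ*) → N(0, J(θ*)^{-1}). The sketch below checks that statement in the simplest setting, a Bernoulli model, where the MLE is the sample mean and J(θ) = 1/(θ(1−θ)). The model, sample sizes, and variable names are illustrative assumptions and say nothing about how well the approximation holds for a fine-tuned network.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = 0.3                       # true Bernoulli parameter (toy choice)
n_samples, n_trials = 500, 20000

# MLE of a Bernoulli parameter is the sample mean of N draws.
theta_hat = rng.binomial(n_samples, theta_star, size=n_trials) / n_samples

# Rescaled estimation error, whose variance should approach J(theta*)^{-1}.
scaled_error = np.sqrt(n_samples) * (theta_hat - theta_star)
empirical_var = scaled_error.var()

fisher_inverse = theta_star * (1.0 - theta_star)   # J(theta)^{-1} for Bernoulli
print("empirical variance:", empirical_var)
print("inverse Fisher:    ", fisher_inverse)
```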
What would settle it
Applying LoRA-DA on the reported benchmarks and finding no gain or a loss in final accuracy relative to existing initialization methods would show the derived strategy is not optimal.
Original abstract
LoRA has become a widely adopted method for PEFT, and its initialization methods have attracted increasing attention. However, existing methods have notable limitations: many methods do not incorporate target-domain data, while gradient-based methods exploit data only at a shallow level by relying on one-step gradient decomposition. In this paper, we establish a theoretical framework for data-aware LoRA initialization. Starting from minimizing the expectation of the parameter discrepancy between the fine-tuned and target models, we derive an optimization problem with two components: a bias term, which is related to the parameter distance between the fine-tuned and target models, and is approximated using a Fisher-gradient formulation to preserve anisotropy; and a variance term, which accounts for the uncertainty introduced by sampling stochasticity through the Fisher information. Solving this problem yields an optimal initialization strategy for LoRA, based on which we develop an efficient algorithm, LoRA-DA. Empirical results across multiple benchmarks demonstrate that LoRA-DA consistently improves final accuracy over existing initialization methods. Additional studies show faster, more stable convergence, robustness across ranks, and only a small initialization overhead for LoRA-DA. The source code will be released upon publication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a data-aware initialization method for LoRA called LoRA-DA. It starts from the objective of minimizing the expected parameter discrepancy between a fine-tuned model and a target model, decomposes this into bias and variance terms, approximates the bias via a Fisher-gradient formulation (to retain anisotropy) and the variance via the Fisher information matrix (to capture sampling uncertainty), solves the resulting optimization problem to obtain a closed-form initializer, and develops an efficient algorithm implementing it. Empirical evaluations on multiple benchmarks are reported to show consistent accuracy gains over prior LoRA initializers, together with faster convergence and robustness to rank.
Significance. If the Fisher-based approximations are shown to be sufficiently accurate proxies for the true expectation in the fine-tuning regime, the work would supply the first asymptotically derived, data-aware LoRA initializer that goes beyond one-step gradient heuristics. This could improve initialization quality with only modest overhead and would be a useful addition to the PEFT literature, especially given the growing use of LoRA in large-model adaptation.
major comments (3)
- [§3] §3 (theoretical framework): the central optimality claim for the derived initializer rests on replacing the true bias term in the expectation with a Fisher-gradient approximation and the variance term with the Fisher information; the manuscript provides neither analytic error bounds on these substitutions nor regime-specific conditions (e.g., distance from initialization or number of fine-tuning steps) under which the approximations remain faithful to the original objective.
- [§3.2] §3.2 (bias/variance decomposition): the statement that the Fisher-gradient formulation 'preserves anisotropy' while the Fisher-information term captures 'sampling uncertainty' is presented as sufficient to guarantee optimality of the resulting closed-form solution, yet no verification is given that the minimizer of the approximated objective coincides with the minimizer of the exact objective outside the invoked asymptotic regime.
- [§4] §4 (algorithm and implementation): the efficient algorithm LoRA-DA is obtained directly from the approximated optimization problem; because the approximation error is unquantified, it is unclear whether the computed initialization remains near-optimal once the first few gradient steps are taken during fine-tuning.
minor comments (3)
- [Abstract] The abstract states empirical improvements without any quantitative numbers or baseline comparisons; these should be summarized with effect sizes in the abstract.
- [§3] Notation for the Fisher information matrix and the gradient approximation is introduced without an explicit reference to the standard definitions used (e.g., which expectation is taken over).
- [§5] Figure captions and axis labels in the convergence plots could be expanded to indicate the precise metric (e.g., validation accuracy vs. training steps) and the number of random seeds.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's report. We address the major comments point by point below, providing clarifications on our theoretical framework and proposing revisions where appropriate.
- Referee: [§3] §3 (theoretical framework): the central optimality claim for the derived initializer rests on replacing the true bias term in the expectation with a Fisher-gradient approximation and the variance term with the Fisher information; the manuscript provides neither analytic error bounds on these substitutions nor regime-specific conditions (e.g., distance from initialization or number of fine-tuning steps) under which the approximations remain faithful to the original objective.
  Authors: We thank the referee for highlighting this important point. Our derivations rely on asymptotic analysis as the number of samples or steps grows, which is standard in such statistical approximations. While we do not provide explicit analytic error bounds, the approximations are motivated by the properties of the Fisher information in the context of fine-tuning large models. In the revision, we will add a dedicated subsection discussing the validity of these approximations under typical fine-tuning regimes, including when the fine-tuned model remains relatively close to the pre-trained one and for a moderate number of steps. We also plan to include additional experiments quantifying the approximation error empirically.
  Revision: partial
- Referee: [§3.2] §3.2 (bias/variance decomposition): the statement that the Fisher-gradient formulation 'preserves anisotropy' while the Fisher-information term captures 'sampling uncertainty' is presented as sufficient to guarantee optimality of the resulting closed-form solution, yet no verification is given that the minimizer of the approximated objective coincides with the minimizer of the exact objective outside the invoked asymptotic regime.
  Authors: The closed-form initializer is optimal with respect to the approximated objective function obtained after applying the asymptotic approximations. We agree that outside the asymptotic regime, it may not coincide exactly with the minimizer of the original expectation. However, the asymptotic regime is relevant for the practical fine-tuning setting we consider. In the revised manuscript, we will explicitly state that the optimality holds for the approximated problem and provide a brief discussion on the implications for finite regimes, supported by our empirical results showing improved performance.
  Revision: partial
- Referee: [§4] §4 (algorithm and implementation): the efficient algorithm LoRA-DA is obtained directly from the approximated optimization problem; because the approximation error is unquantified, it is unclear whether the computed initialization remains near-optimal once the first few gradient steps are taken during fine-tuning.
  Authors: The initialization is designed to minimize the expected discrepancy at the start of fine-tuning. As fine-tuning proceeds, the parameters evolve, but our experiments demonstrate faster convergence and better final accuracy, indicating that the benefits persist beyond the initial steps. To address this concern, we will add a new experiment or analysis in the revision showing the evolution of the parameter discrepancy or performance over the first few steps for LoRA-DA compared to baselines.
  Revision: partial
- Providing analytic error bounds on the Fisher-gradient and Fisher-information approximations.
Circularity Check
Derivation starts from external objective and applies standard Fisher approximations without self-referential reduction
The paper begins from the external objective of minimizing E[parameter discrepancy] between fine-tuned and target models. It decomposes this into bias and variance terms, then applies Fisher-gradient and Fisher-information approximations that are conventional statistical tools rather than quantities fitted from the paper's own outputs or defined circularly in terms of the final initializer. The resulting closed-form LoRA-DA strategy is obtained by solving the approximated optimization problem; empirical benchmarks are presented separately as validation. No equation or step reduces the claimed optimality to a tautological restatement of the inputs by construction, and no load-bearing self-citation or uniqueness theorem imported from the authors' prior work is invoked.
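For readers who want the decomposition spelled out, the following LaTeX block restates it from the appendix excerpts quoted in the reference section of this page (equations 23 and 46 there). The shorthand \Delta = W_{\mathrm{tgt}} - W_0 is introduced here for brevity, and the final line assumes A has orthonormal columns and that the variance term is bounded by the Fisher part of the trace, which is how the truncated bound in the excerpt appears to read; treat it as a sketch rather than the paper's exact statement.

```latex
\begin{align}
\mathbb{E}\,\|\hat{W} - W_{\mathrm{tgt}}\|_F^2
  &= \underbrace{\mathbb{E}\,\|\hat{W} - W_{\mathrm{tgt}}^{\mathrm{proj}}\|_F^2}_{\text{variance within the LoRA subspace}}
   + \underbrace{\|W_{\mathrm{tgt}}^{\mathrm{proj}} - W_{\mathrm{tgt}}\|_F^2}_{\text{bias}}, \\
\|W_{\mathrm{tgt}}^{\mathrm{proj}} - W_{\mathrm{tgt}}\|_F^2
  &= \sum_{i=1}^{d_2} \Big( \|\Delta_{(:,i)}\|^2 - \|A^\top \Delta_{(:,i)}\|^2 \Big),
  \qquad \Delta = W_{\mathrm{tgt}} - W_0, \\
\mathbb{E}\,\|\hat{W} - W_{\mathrm{tgt}}\|_F^2
  &\le \sum_{i=1}^{d_2} \|\Delta_{(:,i)}\|^2
   + \operatorname{tr}\!\big( A^\top \Omega\, A \big),
  \qquad
  \Omega = \sum_{i=1}^{d_2} \Big( \tfrac{1}{N}\, J(W_0)^{-1}_{[i]} - \Delta_{(:,i)} \Delta_{(:,i)}^\top \Big).
\end{align}
```

Since the first sum does not depend on A, minimizing the bound reduces to minimizing \operatorname{tr}(A^\top \Omega A), which is the sense in which Ω guides the initialization.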
Axiom & Free-Parameter Ledger
free parameters (1)
- LoRA rank r
axioms (1)
- domain assumption: The expected parameter discrepancy between fine-tuned and target models can be decomposed into bias and variance components that are well-approximated by Fisher-gradient and Fisher-information quantities.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: variance term ... through Fisher information ... bias term ... approximated using a Fisher-gradient formulation ... Initialization Guidance Matrix Ω = ∑ (J(W0)^{-1}/N − (W_tgt−W0)(W_tgt−W0)⊤)
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: asymptotic normality of the MLE ... √N (θ̂MLE − θ*) → N(0, J(θ*)^{-1})
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, and Xiangzhan Yu. Recall and learn: Fine-tuning deep pretrained language models with less forgetting. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7870–7881, 2020.
- [2] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), ...
- [3] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.
- [4] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- [5] Charles J Geyer. On the asymptotics of constrained M-estimation. The Annals of Statistics, pp. 1993–2010.
- [6] Nathan Godey, Éric de la Clergerie, and Benoît Sagot. Anisotropy is inherent to self-attention in transformers. arXiv preprint arXiv:2401.12143.
- [7] Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366.
- [8] Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, and Roy Ka-Wei Lee. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint arXiv:2304.01933.
- [9] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2021.
- [10] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391, 2018.
- [11] Anastasiia Razdaibiedina, Yuning Mao, Madian Khabsa, Mike Lewis, Rui Hou, Jimmy Ba, and Amjad Almahairi. Residual prompt tuning: improving prompt tuning with residual reparameterization. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 6740–6757, 2023.
- [12] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- [13] Hanqing Wang, Yixia Li, Shuo Wang, Guanhua Chen, and Yun Chen. Milora: Harnessing minor singular components for parameter-efficient llm finetuning. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 4823–4836, 2025.
- [14] Shaowen Wang, Linxi Yu, and Jian Li. Lora-ga: Low-rank adaptation with gradient approximation. Advances in Neural Information Processing Systems, 37:54905–54931, 2024a.
  Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, and Ed Chi. Self-consistency improves chain of thought reasoning in language models. In ICLR, 2023a. URL https://openreview.net/forum?id=1P...
- [15] Longteng Zhang, Lin Zhang, Shaohuai Shi, Xiaowen Chu, and Bo Li. Lora-fa: Memory-efficient low-rank adaptation for large language models fine-tuning. arXiv preprint arXiv:2308.03303, 2023a.
  Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In 11t...
- [16] Appendix contents (excerpt): A Notations; B Extended Related Work (B.1 Parameter-Efficient Fine-Tuning (PEFT); B.2 Low-Rank Adaptation (LoRA) and its Variants and Initialization); C Proofs (C.1 Proof of Theorem 1; ...)
- [17] re-parameterizes LoRA by introducing shared low-rank vectors across layers, thereby reducing the number of trainable parameters while maintaining competitive performance. However, LoRA and many of its variants typically adopt random initialization for the low-rank matrix A and zero initialization for B. Such uninformed initialization introduces two main dr...
- [18] Under this assumption, the training objective can be decomposed as
  \[ \mathbb{E}\,\|\hat{W} - W_{\mathrm{tgt}}\|_F^2 = \mathbb{E}\,\|\hat{W} - W_{\mathrm{tgt}}^{\mathrm{proj}}\|_F^2 + \|W_{\mathrm{tgt}}^{\mathrm{proj}} - W_{\mathrm{tgt}}\|_F^2, \tag{23} \]
  where the first term represents the expected estimation variance within the subspace, and the second term corresponds to the bias arising from the distance between $W_{\mathrm{tgt}}$ and the LoRA subspace. For the first term of equation 2...
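- [19] $+\,O\!\left(1/\sqrt{N}\right)$. (36)
  By the resolvent identity (Horn & Johnson, 2012), we have
  \[ J(W_{\mathrm{tgt}})^{-1} - J(W_0)^{-1} = J(W_{\mathrm{tgt}})^{-1} \big( J(W_0) - J(W_{\mathrm{tgt}}) \big) J(W_0)^{-1}, \tag{37} \]
  which can be easily proved by multiplying both sides of the equation on the left by $J(W_{\mathrm{tgt}})$. Taking norms on both sides yields
  \[ \big\| J(W_{\mathrm{tgt}})^{-1} - J(W_0)^{-1} \big\| \le \big\| J(W_{\mathrm{tgt}})^{-1} \big\| \, \big\| J(W_{\mathrm{tgt}}) - J(W_0) \big\| \, \big\| J(W_0)^{-1} \big\|. \tag{38} \]
  We analyze the right-hand si...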
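- [20] For the second term of equation 44, similar to equation 42, we know that
  \[ \|W_{\mathrm{tgt}}^{\mathrm{proj}} - W_{\mathrm{tgt}}\|_F^2 = \sum_{i=1}^{d_2} \big\| (W_{\mathrm{tgt}}^{\mathrm{proj}} - W_{\mathrm{tgt}})_{(:,i)} \big\|_F^2 = \sum_{i=1}^{d_2} \Big( \big\| (W_{\mathrm{tgt}} - W_0)_{(:,i)} \big\|_F^2 - \big\| A^\top (W_{\mathrm{tgt}} - W_0)_{(:,i)} \big\|_F^2 \Big). \tag{46} \]
  Combining equation 44, equation 45 and equation 46, we have
  $\mathbb{E}\,\|\hat{W} - W_{\mathrm{tgt}}\|_F^2 \le \sum_{i=1}^{d_2} \big\| (W_{\mathrm{tgt}} - W_0)_{(:,i)} \big\|_F^2 + \operatorname{tr}\big( A^\top \sum_{i=1}^{d_2} \big[ J(W_0)^{-1}_{[i]} / N - (W_{\mathrm{tgt}} - W_0)_{(:,i)} \, (W_{\mathrm{tgt}}$...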
- [21] We evaluated the models using the LLM-adapters framework (Hu et al., 2023). Notably, for LoRA-DA and LoRA-One, we pre-sampled 256 examples to estimate both the gradient and the Fisher information matrix during initialization.
  Table 7: Hyperparameter settings for LoRA-DA and baseline methods.
  Hyperparameter    Value
  LoRA Rank (r)     8
  L...
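The resolvent identity and norm bound quoted in excerpt [19] (equations 37 and 38) are easy to sanity-check numerically. The sketch below does so with random symmetric positive-definite matrices standing in for J(W_tgt) and J(W_0); the matrices, their size, and the helper `random_spd` are hypothetical, and the spectral norm is used as one valid choice of submultiplicative norm.

```python
import numpy as np

def random_spd(rng, d):
    """Random symmetric positive-definite matrix (stand-in for a Fisher matrix)."""
    m = rng.normal(size=(d, d))
    return m @ m.T + d * np.eye(d)

rng = np.random.default_rng(0)
d = 5
J_tgt = random_spd(rng, d)
J_0 = J_tgt + 0.1 * random_spd(rng, d)     # a nearby perturbed matrix

inv = np.linalg.inv
lhs = inv(J_tgt) - inv(J_0)
rhs = inv(J_tgt) @ (J_0 - J_tgt) @ inv(J_0)
identity_gap = np.max(np.abs(lhs - rhs))    # should be at round-off level

def spec(M):
    return np.linalg.norm(M, 2)             # spectral norm

bound = spec(inv(J_tgt)) * spec(J_tgt - J_0) * spec(inv(J_0))
print("identity gap:", identity_gap)
print("norm of difference:", spec(lhs), "bound:", bound)
```

The bound shows why the inverse Fisher at W_0 can stand in for the one at W_tgt only when the two Fisher matrices are close and both are well conditioned.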