On the Convergence Rate of LoRA Gradient Descent
Pith reviewed 2026-05-16 20:23 UTC · model grok-4.3
The pith
LoRA gradient descent converges to a stationary point at rate O(1/log T) without artificial smoothness assumptions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The original LoRA gradient descent algorithm is shown to converge to a stationary point at rate O(1/log T), where T is the number of iterations. The proof rests on three steps: reformulating the LoRA updates in terms of the outer product of the stacked adapter matrices, establishing a modified descent lemma for the resulting Lipschitz-like reparametrized objective, and controlling the step size.
What carries the argument
The outer-product reformulation of the stacked adapter matrices, which produces a Lipschitz-like reparametrized function to which a modified descent lemma can be applied.
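As a minimal sketch of what an outer-product view of stacked adapters can look like, the snippet below stacks B and the transpose of A into one tall matrix Z and checks that Z Z^T contains the LoRA update B A as a block. The stacking order, the shapes, and the names `Z` and `X` are assumptions made for illustration; the paper's exact construction may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 5, 4, 2                      # hypothetical layer sizes and adapter rank
B = rng.standard_normal((m, r))        # adapter B, shape (m, r)
A = rng.standard_normal((r, n))        # adapter A, shape (r, n)

# Stack the adapters into one tall matrix Z = [B; A^T], shape (m + n, r).
# This stacking convention is an assumption, not taken from the paper.
Z = np.vstack([B, A.T])

# The outer product Z Z^T is symmetric and has rank at most r.
X = Z @ Z.T

# Its top-right (m x n) block is exactly the LoRA update B A.
lora_update = X[:m, m:]
print(np.allclose(lora_update, B @ A))
print(np.linalg.matrix_rank(X) <= r)
```

Working with X instead of the pair (B, A) turns the bilinear dependence on the adapters into a linear dependence on one symmetric matrix, which is one plausible reason such a reformulation helps the analysis.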
If this is right
- The unmodified LoRA algorithm used in practice converges without extra boundedness assumptions.
- The rate is sublinear and decays only logarithmically in the iteration count T, which is slower than the usual 1/T or 1/sqrt(T) rates.
- Step-size selection must be tuned to the reparametrized Lipschitz-like constant to preserve the guarantee.
- The same outer-product view may extend directly to other low-rank adaptation schemes that preserve the same structure.
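To give a feel for how slow a 1/log T rate is next to 1/T and 1/sqrt(T), the following sketch computes the smallest T at which each rate falls below a target 1/k. All hidden constants are set to 1, which is an assumption; the function name `iterations_needed` is hypothetical.

```python
import math

def iterations_needed(inv_eps: int) -> dict:
    """Smallest T with rate(T) <= 1 / inv_eps, with all constants set to 1."""
    return {
        "1/T": inv_eps,                            # 1/T <= eps        <=>  T >= 1/eps
        "1/sqrt(T)": inv_eps ** 2,                 # 1/sqrt(T) <= eps  <=>  T >= 1/eps^2
        "1/log(T)": math.ceil(math.exp(inv_eps)),  # 1/log(T) <= eps   <=>  T >= e^(1/eps)
    }

for inv_eps in (2, 5, 10):
    print(f"eps = 1/{inv_eps}:", iterations_needed(inv_eps))
```

For a target of 0.1 the three rates need 10, 100, and 22027 iterations respectively, so the logarithmic rate requires exponentially many iterations in 1/eps.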
Where Pith is reading between the lines
- The logarithmic rate is only an upper bound, but it suggests that LoRA may require substantially more iterations than full-parameter fine-tuning to reach comparable stationarity, which could explain observed trade-offs in wall-clock time.
- The outer-product reparametrization isolates the low-rank constraint as the source of the non-Lipschitz behavior, opening a route to analyze other rank-constrained optimization problems with similar techniques.
- Because the analysis avoids strong boundedness, it may apply to fine-tuning regimes where adapter norms grow during training, a setting excluded by earlier results.
Load-bearing premise
The modified descent lemma holds for the Lipschitz-like reparametrized function obtained from the outer-product reformulation of the stacked adapter matrices.
What would settle it
A numerical run of standard LoRA gradient descent on a simple quadratic or linear model where the observed distance to stationarity fails to decay at least as fast as 1/log T after sufficiently many iterations.
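A run of this kind could be set up roughly as follows. This is a sketch under stated assumptions, not the paper's experiment: the quadratic loss, the rank-r target `D`, the zero base weight, the constant step size, and the use of the running-minimum gradient norm as the stationarity measure are all illustrative choices.

```python
import numpy as np

def lora_gd(T=3000, step=0.02, m=6, n=6, r=2, seed=0):
    """Plain gradient descent on f(B, A) = 0.5 * ||B A - D||_F^2.

    D is a hypothetical rank-r target and the frozen base weight is taken as 0.
    Returns the loss and the running minimum of the gradient norm per iteration.
    """
    rng = np.random.default_rng(seed)
    D = 0.25 * rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
    B = np.zeros((m, r))                       # common LoRA initialization: B = 0
    A = 0.3 * rng.standard_normal((r, n))      # small random A
    losses, best_grad = [], []
    best = np.inf
    for _ in range(T):
        G = B @ A - D                          # gradient with respect to the product B A
        grad_B, grad_A = G @ A.T, B.T @ G      # chain rule through B A
        losses.append(0.5 * np.sum(G ** 2))
        best = min(best, np.sqrt(np.sum(grad_B ** 2) + np.sum(grad_A ** 2)))
        best_grad.append(best)
        B, A = B - step * grad_B, A - step * grad_A   # simultaneous update of both adapters
    return np.array(losses), np.array(best_grad)

losses, best_grad = lora_gd()
for T in (10, 100, 1000, 3000):
    # Compare the running-minimum gradient norm with the 1/log T yardstick.
    print(T, best_grad[T - 1], 1.0 / np.log(T))
```

A single benign instance like this cannot confirm or refute the bound. A counterexample would need the stationarity measure to stay above a constant multiple of 1/log T for arbitrarily large T under the step sizes the theorem prescribes.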
Original abstract
The low-rank adaptation (LoRA) algorithm for fine-tuning large models has grown popular in recent years due to its remarkable performance and low computational requirements. LoRA trains two “adapter” matrices that form a low-rank representation of the model parameters, thereby massively reducing the number of parameters that need to be updated at every step. Although LoRA is simple, its convergence is poorly understood due to the lack of Lipschitz smoothness, a key condition for classic convergence analyses. As a result, current theoretical results only consider asymptotic behavior or assume strong boundedness conditions which artificially enforce Lipschitz smoothness. In this work, we provide for the first time a non-asymptotic convergence analysis of the original LoRA gradient descent algorithm, which reflects widespread practice, without such assumptions. Our work relies on three key steps: i) reformulating the problem in terms of the outer product of the stacked adapter matrices, ii) a modified descent lemma for the “Lipschitz-like” reparametrized function, and iii) controlling the step size. With this approach, we prove that LoRA gradient descent converges to a stationary point at rate $O(\frac{1}{\log T})$, where $T$ is the number of iterations. We conduct numerical experiments to validate our theoretical findings.
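The lack of Lipschitz smoothness mentioned in the abstract can be seen already in a scalar toy problem. The sketch below uses the hypothetical loss f(a, b) = 0.5 (ab - 1)^2, which is not taken from the paper, and estimates the local Lipschitz ratio of its gradient at points (s, s) of growing scale.

```python
import numpy as np

def grad(a, b):
    """Gradient of the scalar toy loss f(a, b) = 0.5 * (a * b - 1)**2."""
    res = a * b - 1.0
    return np.array([res * b, res * a])

def local_lipschitz_ratio(s, h=1e-6):
    """||grad(x + d) - grad(x)|| / ||d|| at x = (s, s), with d = (h, h)."""
    d = np.array([h, h])
    diff = grad(s + h, s + h) - grad(s, s)
    return np.linalg.norm(diff) / np.linalg.norm(d)

for s in (1.0, 10.0, 100.0):
    # Along the diagonal each gradient entry is s**3 - s, so the ratio is about 3*s**2 - 1.
    print(s, local_lipschitz_ratio(s))
```

Because the ratio grows quadratically with the scale of the parameters, no single global constant L bounds it, which is why classical descent-lemma arguments need a bound on the adapter norms.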
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to deliver the first non-asymptotic convergence guarantee for the original (unmodified) LoRA gradient-descent algorithm. By (i) stacking the adapter matrices into an outer-product reparametrization, (ii) establishing a modified descent lemma for the resulting Lipschitz-like loss, and (iii) suitably controlling the step size, the authors prove that the iterates converge to a stationary point at rate O(1/log T) without invoking global Lipschitz smoothness or a-priori boundedness assumptions on the adapters or gradients. Numerical experiments are reported to corroborate the rate.
Significance. A non-asymptotic rate for practical LoRA without artificial boundedness assumptions would be a notable theoretical advance for parameter-efficient fine-tuning. The three-step approach (outer-product reformulation, modified descent lemma, step-size control) is conceptually clean and, if the lemma can be shown to close without hidden regularity, would remove a common obstacle in existing analyses.
major comments (2)
- [Key steps / modified descent lemma] The modified descent lemma (key step ii) is load-bearing for the O(1/log T) claim. The abstract states that the lemma applies to the “Lipschitz-like” reparametrized function obtained from the outer-product map, yet it is not shown whether the derivation of this inequality still requires an a-priori bound on ||BA|| or on the Hessian of the loss to close the inequality. If such a bound is implicitly used, the result reduces to the very boundedness assumptions the paper claims to avoid. A complete, self-contained derivation of the lemma, including all regularity conditions, must be supplied.
- [Key steps / step-size control] The step-size control argument (key step iii) inherits the same gap. The abstract indicates that step sizes are chosen to make the modified descent lemma yield the 1/log T rate, but without an explicit statement of the admissible step-size schedule and the precise conditions under which it remains valid, it is impossible to verify that the schedule does not re-introduce boundedness or Lipschitz assumptions.
minor comments (1)
- [Numerical experiments] The numerical experiments section would benefit from an explicit statement of the step-size schedule used in practice and a direct comparison against the theoretically prescribed schedule.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for identifying areas where the presentation of our key technical steps can be strengthened. We address each major comment below and will incorporate revisions to enhance clarity and completeness.
Point-by-point responses
Referee: [Key steps / modified descent lemma] The modified descent lemma (key step ii) is load-bearing for the O(1/log T) claim. The abstract states that the lemma applies to the “Lipschitz-like” reparametrized function obtained from the outer-product map, yet it is not shown whether the derivation of this inequality still requires an a-priori bound on ||BA|| or on the Hessian of the loss to close the inequality. If such a bound is implicitly used, the result reduces to the very boundedness assumptions the paper claims to avoid. A complete, self-contained derivation of the lemma, including all regularity conditions, must be supplied.
Authors: We thank the referee for this important observation. The modified descent lemma is indeed central to our analysis. In the manuscript, we derive it in Section 3.2 by leveraging the outer-product reparametrization of the stacked adapter matrices, which allows us to express the update in a form where the loss exhibits a Lipschitz-like property along the optimization path without requiring global Lipschitz continuity or a-priori bounds on the adapters. The derivation relies on the chain rule applied to the reparametrized function and local properties of the gradient, without invoking bounds on ||BA|| or the Hessian. However, we acknowledge that the current write-up may omit some intermediate steps. In the revised version, we will provide a fully self-contained proof of the lemma, explicitly listing all regularity conditions used and confirming that no boundedness assumptions are employed.
Revision: yes
Referee: [Key steps / step-size control] The step-size control argument (key step iii) inherits the same gap. The abstract indicates that step sizes are chosen to make the modified descent lemma yield the 1/log T rate, but without an explicit statement of the admissible step-size schedule and the precise conditions under which it remains valid, it is impossible to verify that the schedule does not re-introduce boundedness or Lipschitz assumptions.
Authors: We agree that the step-size schedule needs to be stated more explicitly. Our choice of step sizes, specifically η_t = Θ(1 / log(t + 2)), is designed to ensure the modified descent lemma produces the desired O(1/log T) rate. The validity of this schedule follows directly from the conditions established in the modified descent lemma, which hold without boundedness assumptions. In the revision, we will include an explicit statement of the admissible step-size schedule along with a detailed proof that it remains valid under the lemma's conditions, without re-introducing any Lipschitz or boundedness assumptions.
Revision: yes
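For concreteness, one instance of a schedule of the form η_t = Θ(1 / log(t + 2)) could look like the sketch below. The constant `c` and the function name `step_size` are assumptions; the simulated rebuttal fixes only the order of growth, not the constant.

```python
import math

def step_size(t: int, c: float = 0.1) -> float:
    """Hypothetical instance of eta_t = c / log(t + 2); c is an assumed constant."""
    return c / math.log(t + 2)

schedule = [step_size(t) for t in range(10_000)]

# The +2 shift keeps log(t + 2) > 0 at t = 0, so the first step is finite.
print(schedule[0], schedule[9], schedule[-1])

# The steps shrink, but so slowly that their sum still grows roughly like c * T / log(T).
print(sum(schedule))
```

Whether such a schedule is admissible depends on the conditions of the modified descent lemma, which this page does not state.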
Circularity Check
No circularity: rate derived from reformulation and modified lemma
The paper's central claim of O(1/log T) convergence is obtained by (i) outer-product reformulation of stacked adapters, (ii) a modified descent lemma for the resulting Lipschitz-like function, and (iii) step-size control. None of these steps reduces by construction to the target rate, to a fitted parameter chosen to match the result, or to a self-citation chain. The modified lemma is introduced as a weaker replacement for classical Lipschitz smoothness and is not defined in terms of the convergence bound it is used to prove. No load-bearing self-citation or ansatz smuggling is visible in the provided derivation outline.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the reparametrized function obtained from the outer product of the stacked adapter matrices satisfies a modified descent lemma.
A. Proofs
A.1. Helper Lemmas
Lemma A.1 (Weighted AM-GM inequality). Given the nonnegative values $x_1, x_2, w_1, w_2$, we have
$$\frac{w_1 x_1 + w_2 x_2}{w_1 + w_2} \ge \left(x_1^{w_1} x_2^{w_2}\right)^{\frac{1}{w_1 + w_2}}. \tag{13}$$
In particular, for $w_1 = w_2 = 1$, we have
$$x_1 + x_2 \ge 2\sqrt{x_1 x_2}. \tag{14}$$
For the proof of Lemm...
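The weighted AM-GM inequality of Lemma A.1 can be spot-checked numerically. The sketch below draws random nonnegative values and strictly positive weights (an assumption that avoids a zero denominator) and compares both sides of inequalities (13) and (14), with a small tolerance for floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.uniform(0.0, 10.0, size=(2, 1000))   # nonnegative values
w1, w2 = rng.uniform(0.1, 5.0, size=(2, 1000))    # positive weights, so w1 + w2 > 0

lhs = (w1 * x1 + w2 * x2) / (w1 + w2)             # weighted arithmetic mean, left side of (13)
rhs = (x1 ** w1 * x2 ** w2) ** (1.0 / (w1 + w2))  # weighted geometric mean, right side of (13)
print(np.all(lhs >= rhs - 1e-12))

# Unweighted special case (14): x1 + x2 >= 2 * sqrt(x1 * x2).
print(np.all(x1 + x2 >= 2.0 * np.sqrt(x1 * x2) - 1e-12))
```

A random check like this is no substitute for the proof; it only guards against a transcription error in the reconstructed formulas.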