arxiv: 2512.18248 · v3 · submitted 2025-12-20 · 💻 cs.LG

On the Convergence Rate of LoRA Gradient Descent

Pith reviewed 2026-05-16 20:23 UTC · model grok-4.3

classification 💻 cs.LG
keywords LoRA · gradient descent · convergence rate · low-rank adaptation · fine-tuning · stationary point · non-asymptotic analysis

The pith

LoRA gradient descent converges to a stationary point at rate O(1/log T) without artificial smoothness assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves a non-asymptotic convergence guarantee for the original LoRA gradient descent algorithm as used in practice for fine-tuning large models. It shows that the iterates reach a stationary point at a rate of one over the logarithm of the iteration count. Prior work either gave only asymptotic statements or imposed strong boundedness conditions that force Lipschitz smoothness and do not match real deployments. The argument rests on rewriting the adapter matrices as an outer product, deriving a modified descent inequality for the resulting reparametrized function, and carefully choosing the step size to keep the analysis valid.

Core claim

By reformulating the LoRA updates in terms of the outer product of the stacked adapter matrices, establishing a modified descent lemma for the resulting Lipschitz-like reparametrized objective, and controlling the step size, the original LoRA gradient descent algorithm is shown to converge to a stationary point at the rate O(1/log T).

What carries the argument

The outer-product reformulation of the stacked adapter matrices, which produces a Lipschitz-like reparametrized function to which a modified descent lemma can be applied.
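
The outer-product view can be illustrated with a minimal NumPy sketch. The stacking convention below (B on top of the transpose of A), the dimensions, and all variable names are assumptions made for illustration; the paper's exact construction may differ. The sketch only checks the linear-algebra fact that the LoRA update BA appears as an off-diagonal block of the outer product of the stacked matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 5, 4, 2  # hypothetical layer dimensions and adapter rank

B = rng.standard_normal((m, r))  # adapter B, shape (m, r)
A = rng.standard_normal((r, n))  # adapter A, shape (r, n)

# Stack the adapters into one tall matrix V of shape (m + n, r).
V = np.vstack([B, A.T])

# Outer product of the stacked matrix: symmetric, rank at most r.
X = V @ V.T

# The top-right (m x n) block of X is exactly the low-rank update BA.
top_right = X[:m, m:]
print(np.allclose(top_right, B @ A))
```

Under this convention the loss depends on the adapters only through X, which is why an analysis can be phrased in terms of the outer product rather than B and A separately.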

If this is right

  • The unmodified LoRA algorithm used in practice converges without extra boundedness assumptions.
  • The convergence rate is sublinear: the guarantee decays like 1/log T in the iteration count T, which is slower than the usual 1/T or 1/sqrt(T) rates.
  • Step-size selection must be tuned to the reparametrized Lipschitz-like constant to preserve the guarantee.
  • The same outer-product view may extend directly to other low-rank adaptation schemes that preserve the same structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The slow logarithmic rate suggests that LoRA may require substantially more iterations than full-parameter fine-tuning to reach comparable stationarity, which could explain observed trade-offs in wall-clock time.
  • The outer-product reparametrization isolates the low-rank constraint as the source of the non-Lipschitz behavior, opening a route to analyze other rank-constrained optimization problems with similar techniques.
  • Because the analysis avoids strong boundedness, it may apply to fine-tuning regimes where adapter norms grow during training, a setting excluded by earlier results.

Load-bearing premise

The modified descent lemma holds for the Lipschitz-like reparametrized function obtained from the outer-product reformulation of the stacked adapter matrices.

What would settle it

A numerical run of standard LoRA gradient descent on a simple quadratic or linear model where the observed distance to stationarity fails to decay at least as fast as 1/log T after sufficiently many iterations.
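
A check of this kind might look like the following sketch, which runs plain LoRA gradient descent on a toy quadratic and tracks the running minimum of the gradient norm against 1/log T. Everything here is an illustrative assumption rather than the paper's setup: the loss 0.5·‖BA − M‖², the rank-r target M, the constant step size `eta`, the zero initialization of B, and the use of the gradient norm as the stationarity measure. The paper's guarantee is stated for its own step-size rule, so a run like this is a sanity check, not a test of the theorem.

```python
import numpy as np

def lora_gd_quadratic(m=8, n=6, r=2, eta=0.01, steps=2000, seed=0):
    """Gradient descent on L(B, A) = 0.5 * ||B A - M||_F^2 (toy problem)."""
    rng = np.random.default_rng(seed)
    # Hypothetical rank-r target update M (stands in for W* - W0).
    M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n)) / np.sqrt(r)
    B = np.zeros((m, r))              # LoRA-style init: B = 0
    A = rng.standard_normal((r, n))   # A random
    losses, grad_norms = [], []
    for _ in range(steps):
        G = B @ A - M                 # residual, gradient w.r.t. the product BA
        grad_B = G @ A.T              # chain rule through BA
        grad_A = B.T @ G
        losses.append(0.5 * np.sum(G ** 2))
        grad_norms.append(np.sqrt(np.sum(grad_B ** 2) + np.sum(grad_A ** 2)))
        B, A = B - eta * grad_B, A - eta * grad_A  # simultaneous update
    return np.array(losses), np.array(grad_norms)

losses, grad_norms = lora_gd_quadratic()

# Running minimum of the gradient norm, rescaled by log T (for T >= 2).
running_min = np.minimum.accumulate(grad_norms)
T = np.arange(1, len(grad_norms) + 1)
scaled = running_min[1:] * np.log(T[1:])
print(scaled[[0, 9, 99, 999, -1]])
```

If the rescaled sequence `scaled` stays bounded as T grows, the run is consistent with an O(1/log T) rate; a sequence that grows without bound would be the kind of evidence described above.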

Figures

Figures reproduced from arXiv: 2512.18248 by Diego Klabjan, Siqiao Mu.

Figure 1
Figure 1: Training logistic regression model on embeddings of the CIFAR-10 dataset with constant, adaptive (η_adapt and η_adapt2), and normalized (η_norm) learning rates. For clarity, the moving averages (window size 10 and 200 for a. and b. respectively) of the batch loss are plotted.
3.3. Discussion. Our theoretical results yield several interesting insights into the behavior of LoRA. First, the choice of extracto…
Figure 2
Figure 2: Training ResNet-18 model on the CIFAR-10 dataset with constant, adaptive (η_adapt2) and normalized (η_norm) learning rates. For clarity, the moving averages (window size 50) of the batch loss and gradient norm are plotted.
This inverse relationship between η_t and ∥∇L(V_t)∥^{1/2} is required for minimizing higher order terms in the descent lemma, as stated in (21) and (23) in Appendix A.3. In practice, we com…
Figure 3
Figure 3: Training logistic regression model on embeddings of the CIFAR-10 dataset with constant, adaptive (η_adapt and η_adapt2), and normalized (η_norm) learning rates. For clarity, the moving averages (window size 30) of the batch gradient are plotted.
Figures 3a and 2c display the progress of ∥V_t∥ for various learning rates, which in combination with Figures 3b and 2d, suggests that the model is initialized in …
Figure 4
Figure 4: Learning rate selection using a cyclical finder for both experimental settings.
Original abstract

The low-rank adaptation (LoRA) algorithm for fine-tuning large models has grown popular in recent years due to its remarkable performance and low computational requirements. LoRA trains two "adapter" matrices that form a low-rank representation of the model parameters, thereby massively reducing the number of parameters that need to be updated at every step. Although LoRA is simple, its convergence is poorly understood due to the lack of Lipschitz smoothness, a key condition for classic convergence analyses. As a result, current theoretical results only consider asymptotic behavior or assume strong boundedness conditions which artificially enforce Lipschitz smoothness. In this work, we provide for the first time a non-asymptotic convergence analysis of the original LoRA gradient descent algorithm, which reflects widespread practice, without such assumptions. Our work relies on three key steps: i) reformulating the problem in terms of the outer product of the stacked adapter matrices, ii) a modified descent lemma for the "Lipschitz-like" reparametrized function, and iii) controlling the step size. With this approach, we prove that LoRA gradient descent converges to a stationary point at rate O(1/log T), where T is the number of iterations. We conduct numerical experiments to validate our theoretical findings.
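
The lack of Lipschitz smoothness mentioned in the abstract can be seen in a scalar toy problem. The example below is an illustration chosen here, not taken from the paper: one-dimensional "adapters" a and b, with the target value 1 as an arbitrary choice.

```latex
% Scalar toy problem (illustrative assumption): adapters a, b in R, target 1.
f(a,b) = \tfrac{1}{2}(ab - 1)^2, \qquad
\nabla f(a,b) = (ab - 1)\begin{pmatrix} b \\ a \end{pmatrix}, \qquad
\nabla^2 f(a,b) = \begin{pmatrix} b^2 & 2ab - 1 \\ 2ab - 1 & a^2 \end{pmatrix}.
% The diagonal entries a^2 and b^2 are unbounded over R^2, so no constant L
% satisfies \|\nabla^2 f(a,b)\| \le L for all (a,b).
```

The outer loss x ↦ ½(x − 1)² is 1-smooth, yet its composition with the product ab is not globally Lipschitz smooth. This is why a classical descent lemma with a fixed constant L cannot be applied directly unless the adapters are assumed bounded.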

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to deliver the first non-asymptotic convergence guarantee for the original (unmodified) LoRA gradient-descent algorithm. By (i) stacking the adapter matrices into an outer-product reparametrization, (ii) establishing a modified descent lemma for the resulting Lipschitz-like loss, and (iii) suitably controlling the step size, the authors prove that the iterates converge to a stationary point at rate O(1/log T) without invoking global Lipschitz smoothness or a-priori boundedness assumptions on the adapters or gradients. Numerical experiments are reported to corroborate the rate.

Significance. A non-asymptotic rate for practical LoRA without artificial boundedness assumptions would be a notable theoretical advance for parameter-efficient fine-tuning. The three-step approach (outer-product reformulation, modified descent lemma, step-size control) is conceptually clean and, if the lemma can be shown to close without hidden regularity, would remove a common obstacle in existing analyses.

major comments (2)
  1. [Key steps / modified descent lemma] The modified descent lemma (key step ii) is load-bearing for the O(1/log T) claim. The abstract states that the lemma applies to the “Lipschitz-like” reparametrized function obtained from the outer-product map, yet it is not shown whether the derivation of this inequality still requires an a-priori bound on ||BA|| or on the Hessian of the loss to close the inequality. If such a bound is implicitly used, the result reduces to the very boundedness assumptions the paper claims to avoid. A complete, self-contained derivation of the lemma, including all regularity conditions, must be supplied.
  2. [Key steps / step-size control] The step-size control argument (key step iii) inherits the same gap. The abstract indicates that step sizes are chosen to make the modified descent lemma yield the 1/log T rate, but without an explicit statement of the admissible step-size schedule and the precise conditions under which it remains valid, it is impossible to verify that the schedule does not re-introduce boundedness or Lipschitz assumptions.
minor comments (1)
  1. [Numerical experiments] The numerical experiments section would benefit from an explicit statement of the step-size schedule used in practice and a direct comparison against the theoretically prescribed schedule.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and for identifying areas where the presentation of our key technical steps can be strengthened. We address each major comment below and will incorporate revisions to enhance clarity and completeness.

  1. Referee: [Key steps / modified descent lemma] The modified descent lemma (key step ii) is load-bearing for the O(1/log T) claim. The abstract states that the lemma applies to the “Lipschitz-like” reparametrized function obtained from the outer-product map, yet it is not shown whether the derivation of this inequality still requires an a-priori bound on ||BA|| or on the Hessian of the loss to close the inequality. If such a bound is implicitly used, the result reduces to the very boundedness assumptions the paper claims to avoid. A complete, self-contained derivation of the lemma, including all regularity conditions, must be supplied.

    Authors: We thank the referee for this important observation. The modified descent lemma is indeed central to our analysis. In the manuscript, we derive it in Section 3.2 by leveraging the outer-product reparametrization of the stacked adapter matrices, which allows us to express the update in a form where the loss exhibits a Lipschitz-like property along the optimization path without requiring global Lipschitz continuity or a-priori bounds on the adapters. The derivation relies on the chain rule applied to the reparametrized function and local properties of the gradient, without invoking bounds on ||BA|| or the Hessian. However, we acknowledge that the current write-up may omit some intermediate steps. In the revised version, we will provide a fully self-contained proof of the lemma, explicitly listing all regularity conditions used and confirming that no boundedness assumptions are employed. revision: yes

  2. Referee: [Key steps / step-size control] The step-size control argument (key step iii) inherits the same gap. The abstract indicates that step sizes are chosen to make the modified descent lemma yield the 1/log T rate, but without an explicit statement of the admissible step-size schedule and the precise conditions under which it remains valid, it is impossible to verify that the schedule does not re-introduce boundedness or Lipschitz assumptions.

    Authors: We agree that the step-size schedule needs to be stated more explicitly. Our choice of step sizes, specifically η_t = Θ(1 / log(t + 2)), is designed to ensure the modified descent lemma produces the desired O(1/log T) rate. The validity of this schedule follows directly from the conditions established in the modified descent lemma, which hold without boundedness assumptions. In the revision, we will include an explicit statement of the admissible step-size schedule along with a detailed proof that it remains valid under the lemma's conditions, without re-introducing any Lipschitz or boundedness assumptions. revision: yes
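
Taking the schedule quoted in the response above at face value, a decaying step size of this form can be sketched on a scalar toy problem. The constant `c`, the objective f(a, b) = ½(ab − 1)², the initialization, and the iteration count are all hypothetical choices for illustration; the admissible schedule and its conditions are whatever the paper itself states.

```python
import math

def step_size(t: int, c: float = 0.1) -> float:
    """Illustrative schedule eta_t = c / log(t + 2); c is a hypothetical constant."""
    return c / math.log(t + 2)

# Scalar LoRA-like toy problem: f(a, b) = 0.5 * (a * b - 1)^2.
a, b = 1.0, 0.0  # LoRA-style init: one factor nonzero, the other zero
for t in range(500):
    residual = a * b - 1.0
    grad_a, grad_b = residual * b, residual * a
    eta = step_size(t)
    a, b = a - eta * grad_a, b - eta * grad_b  # simultaneous update

final_grad_norm = math.hypot((a * b - 1.0) * b, (a * b - 1.0) * a)
print(a * b, final_grad_norm)
```

The schedule shrinks slowly, so early iterations take comparatively large steps while later ones are damped. Whether this particular form matches the learning rates used in the paper's experiments is not established by the text above.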

Circularity Check

0 steps flagged

No circularity: rate derived from reformulation and modified lemma

The paper's central claim of O(1/log T) convergence is obtained by (i) outer-product reformulation of stacked adapters, (ii) a modified descent lemma for the resulting Lipschitz-like function, and (iii) step-size control. None of these steps reduces by construction to the target rate, to a fitted parameter chosen to match the result, or to a self-citation chain. The modified lemma is introduced as a weaker replacement for classical Lipschitz smoothness and is not defined in terms of the convergence bound it is used to prove. No load-bearing self-citation or ansatz smuggling is visible in the provided derivation outline.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the modified descent lemma after the outer-product reparametrization; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: The reparametrized function obtained from the outer product of stacked adapter matrices satisfies a modified descent lemma.
    This is listed as one of the three key steps required for the proof.

pith-pipeline@v0.9.0 · 5522 in / 1129 out tokens · 22553 ms · 2026-05-16T20:23:24.583236+00:00 · methodology
