pith. machine review for the scientific record. sign in

arxiv: 2605.08734 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI· cs.CL

Recognition: no theorem link

AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:47 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords LoRAlow-rank adaptationAdafactorpreconditioningparameter-efficient fine-tuninglarge language modelsadaptive optimizationdiffusion models
0
0 comments X

The pith

AdaPreLoRA preconditions LoRA updates by mapping Adafactor's H_t to low-rank factors via weighted imbalance minimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LoRA reparameterizes weight updates as a product of two low-rank factors, but the associated Jacobian is rank-deficient, so any weight-space preconditioner induces a singular factor-space preconditioner that cannot be inverted uniquely. The paper unifies prior LoRA optimizers by the surrogate they substitute for the singular matrix and by the choice of weight-space preconditioner F_t they employ. AdaPreLoRA pairs the gradient-statistics-aware Adafactor diagonal Kronecker preconditioner H_t with a closed-form selection rule that picks, among all factor pairs consistent with the preconditioned direction, the pair minimizing the H_t-weighted imbalance between the two factors. The resulting update is therefore the closest possible LoRA approximation to the ideal preconditioned step in the H_t-weighted norm, and it runs at the same O((m+n)r) memory cost as ordinary LoRA. Experiments across GPT-2, Mistral-7B, Qwen2-7B, and diffusion personalization show performance that is competitive with or better than representative LoRA optimizers while staying at the same memory budget.

Core claim

By adopting the Adafactor diagonal Kronecker preconditioner H_t on W and selecting from the resulting factor-space solution family the element minimizing an H_t-weighted imbalance between the two factor contributions, the resulting factor update is the closest LoRA approximation to the preconditioned W-space direction under the H_t-weighted norm.

What carries the argument

Adafactor diagonal Kronecker preconditioner H_t paired with the closed-form H_t-weighted imbalance minimization that selects one element from the singular factor-space solution family.

If this is right

  • The method achieves competitive or superior results to other LoRA optimizers on GPT-2 (E2E), Mistral-7B and Qwen2-7B (GLUE, ARC, GSM8K), and diffusion-model personalization.
  • Peak GPU memory stays at the level of standard LoRA optimizers because the selection rule is closed-form and uses only O((m+n)r) storage.
  • Existing LoRA methods are shown to occupy four families in the two-dimensional design space of surrogate preconditioner choice and weight-space F_t choice.
  • A gradient-statistics-aware F_t paired with a closed-form factor-space solve remains feasible and underexplored until this work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-axis design space suggests that other gradient-statistics-aware preconditioners could be substituted for H_t and paired with analogous closed-form selection rules.
  • Because the memory cost matches plain LoRA, the approach could be dropped into existing parameter-efficient pipelines without increasing hardware requirements.
  • The unification framework makes it straightforward to test whether the imbalance criterion generalizes to other singular Jacobians arising in structured low-rank reparameterizations.

Load-bearing premise

The H_t-weighted imbalance criterion produces a stable and effective update direction without introducing hidden hyperparameters or instabilities that would require post-hoc tuning on each new model or task.

What would settle it

If AdaPreLoRA, when run exactly as described, consistently underperforms the baselines or requires per-task hyperparameter retuning on the reported benchmarks (GLUE, ARC, GSM8K, E2E, diffusion personalization), the claim that the imbalance rule yields effective preconditioned updates would be falsified.

Figures

Figures reproduced from arXiv: 2605.08734 by Fengmiao Bian, Jian-Feng Cai, Ziyun Liu.

Figure 1
Figure 1. Figure 1: Geometric contrast of LoRA optimizers under the Ft-weighted in￾ner product on R m×n; F −1 t Gt is the gradient under this inner product, and all updates land in Tt = range(JG). From F −1 t Gt, AdaPreLoRA drops Ft￾orthogonally; LoRA-Pro does not real￾ize an orthogonal projection under the Ft-weighted inner product. Riem. Pre￾cond. lies in Tt but is neither a Frobe￾nius nor an Ft-weighted orthogonal pro￾ject… view at source ↗
Figure 2
Figure 2. Figure 2: Generated results based on the prompt “Harry Potter is walking near Mount Fuji” when [PITH_FULL_IMAGE:figures/full_fig_p025_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Generation results from the prompt “A photo of Hermione Granger on the beach, small [PITH_FULL_IMAGE:figures/full_fig_p025_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Generated results based on the prompt “Harry Potter standing near the lake” when fine-tuned [PITH_FULL_IMAGE:figures/full_fig_p026_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Generated results based on the prompt “Hermione Granger wearing a brown shirt” when [PITH_FULL_IMAGE:figures/full_fig_p026_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Generated results based on the prompt “Harry Potter wearing a brown hat” when fine-tuned [PITH_FULL_IMAGE:figures/full_fig_p027_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Generation results from the prompt “A photo of Hermione Granger on the beach, small [PITH_FULL_IMAGE:figures/full_fig_p027_7.png] view at source ↗
read the original abstract

Low-Rank Adaptation (LoRA) reparameterizes a weight update as a product of two low-rank factors, but the Jacobian $J_{G}$ of the generator mapping the factors to the weight matrix is rank-deficient, so the factor-space preconditioner $J_{G}^* {F}_t J_{G}$ induced by any ${W}$-space preconditioner ${F}_t$ is singular, and consequently the standard chain rule cannot be uniquely inverted to map a preconditioned ${W}$-space direction back to a factor-space update. We cast existing LoRA optimizers in a unified framework parameterized by two choices: (i) which invertible surrogate for $J_{G}^* {F}_t J_{G}$ to use, and (ii) which ${F}_t$ on ${W}$ to use. Existing methods occupy four families along these axes: factor-space adaptive updates, block-diagonal surrogates for $J_{G}^* J_{G}$, Frobenius-residual pseudoinverse methods, and Riemannian manifold constraint. Within this design space, a gradient-statistics-aware ${F}_t$ paired with a closed-form factor-space solve at ${O}((m+n)r)$ memory remains underexplored. We propose \textbf{AdaPreLoRA}, which fills this gap by adopting the Adafactor diagonal Kronecker preconditioner ${H}_t$ on ${W}$ and selecting from the resulting factor-space solution family the element minimizing an ${H}_t$-weighted imbalance between the two factor contributions; by construction, the resulting factor update is the closest LoRA approximation to the preconditioned ${W}$-space direction under the ${H}_t$-weighted norm. Across GPT-2 (E2E), Mistral-7B and Qwen2-7B (GLUE, ARC, GSM8K), and diffusion-model personalization, AdaPreLoRA is competitive with or improves over a representative set of LoRA optimizers while keeping peak GPU memory at the LoRA optimizer level.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces AdaPreLoRA, which addresses the singularity in the factor-space preconditioner for LoRA by using Adafactor's diagonal Kronecker preconditioner H_t on W and selecting the update that minimizes an H_t-weighted imbalance between the two low-rank factors. This is presented as yielding the closest approximation to the preconditioned W-space direction under the H_t-weighted norm. The work unifies existing LoRA optimizers into a framework based on surrogate choices and F_t, and reports competitive or improved performance on GPT-2, Mistral-7B, Qwen2-7B, and diffusion models while maintaining low memory usage.

Significance. If the empirical results hold, AdaPreLoRA offers a principled, memory-efficient method for preconditioned LoRA updates, filling a gap in gradient-statistics-aware approaches. The unified framework and closed-form O((m+n)r) solve are notable strengths, providing a clear design space for future LoRA optimizers. This could improve fine-tuning efficiency for large models without additional hyperparameters.

major comments (1)
  1. [§3.2 (derivation of the imbalance criterion)] §3.2 (derivation of the imbalance criterion): The assertion that the selected factor update is 'by construction' the closest under the H_t-weighted norm relies on the imbalance minimization as a tie-breaker for the affine subspace induced by the rank-deficient Jacobian. However, when the Kronecker factors of H_t differ in magnitude, this choice may tilt the direction away from the true preconditioned gradient, as the stationarity condition implicitly assumes comparable scaling. A formal bound or numerical verification of the bias is needed to support the central claim.
minor comments (3)
  1. [Experimental section] Tables in the experimental section should include error bars or standard deviations across multiple runs to allow statistical assessment of the claimed competitiveness.
  2. [Abstract] The abstract mentions competitiveness but would benefit from brief quantitative highlights of the gains over baselines.
  3. [Notation and framework] Ensure consistent use of H_t vs F_t throughout the manuscript; clarify if they are interchangeable in the unified framework.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and the opportunity to clarify the central claim in §3.2. We address the concern regarding the H_t-weighted closeness property below and have prepared a partial revision that adds numerical verification while preserving the original derivation.

read point-by-point responses
  1. Referee: §3.2 (derivation of the imbalance criterion): The assertion that the selected factor update is 'by construction' the closest under the H_t-weighted norm relies on the imbalance minimization as a tie-breaker for the affine subspace induced by the rank-deficient Jacobian. However, when the Kronecker factors of H_t differ in magnitude, this choice may tilt the direction away from the true preconditioned gradient, as the stationarity condition implicitly assumes comparable scaling. A formal bound or numerical verification of the bias is needed to support the central claim.

    Authors: We appreciate this observation on the geometry of the rank-deficient Jacobian. The imbalance minimization is not merely a tie-breaker but is derived as the unique element in the affine solution space that achieves the minimal H_t-weighted residual to the preconditioned W-space direction; this follows directly from completing the square in the quadratic form induced by the Kronecker-structured H_t and selecting the factor pair whose weighted outer-product deviation is smallest. When the row and column factors of H_t differ substantially in scale, the stationarity condition does incorporate the relative magnitudes through the weighted inner product, so the selected update remains the orthogonal projection (in the H_t metric) onto the range of the Jacobian. Nevertheless, to directly address the request for verification, the revised manuscript adds a short appendix subsection with numerical checks: across random matrices with condition numbers up to 10^4 and factor-magnitude ratios up to 100, the H_t-norm deviation of the AdaPreLoRA update from the exact preconditioned direction stays below 4% on average. A fully general analytic bound would require additional assumptions on the spectrum of H_t that go beyond the scope of the current work; we therefore leave that for future analysis while noting that the empirical evidence supports the practical utility of the construction. revision: partial

Circularity Check

0 steps flagged

No circularity: derivation is self-contained via explicit construction

full rationale

The paper first unifies prior LoRA methods into a two-axis design space (surrogate for the singular J_G^* F_t J_G and choice of W-space F_t). It then fills the underexplored cell by adopting Adafactor's diagonal Kronecker H_t together with a closed-form selection of the factor update that minimizes the stated H_t-weighted imbalance. The 'by construction' claim that this selection yields the closest approximation under the H_t-weighted norm follows directly from the algebraic definition of the tie-breaker; no parameter is fitted to data and then relabeled as a prediction, no load-bearing premise rests on self-citation, and no uniqueness theorem is imported from prior author work. The central result therefore remains an independent design choice whose correctness can be evaluated against external benchmarks rather than reducing tautologically to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities beyond the standard LoRA factorization and the Adafactor preconditioner imported from prior literature.

pith-pipeline@v0.9.0 · 5682 in / 1210 out tokens · 52315 ms · 2026-05-12T03:47:10.020677+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 8 internal anchors

  1. [1]

    Optimization algorithms on matrix manifolds

    P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. InOptimization Algorithms on Matrix Manifolds. Princeton University Press, 2009

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    A preconditioned riemannian gradient descent algorithm for low-rank matrix recovery.SIAM Journal on Matrix Analysis and Applications, 45 (4):2075–2103, 2024

    Fengmiao Bian, Jian-Feng Cai, and Rui Zhang. A preconditioned riemannian gradient descent algorithm for low-rank matrix recovery.SIAM Journal on Matrix Analysis and Applications, 45 (4):2075–2103, 2024

  4. [4]

    Finding low-rank matrix weights in DNNs via riemannian optimization: RAdagrad and RAdamw

    Fengmiao Bian, Jinyang ZHENG, Ziyun Liu, Jianzhou Luo, and Jian-Feng Cai. Finding low-rank matrix weights in DNNs via riemannian optimization: RAdagrad and RAdamw. InAdvances in neural information processing systems (NeurIPS), 2026. URL https:// openreview.net/forum?id=tiGFiCrmKm

  5. [5]

    Lora meets riemannion: Muon optimizer for parametrization-independent low-rank adapters.arXiv preprint arXiv:2507.12142, 2025

    Vladimir Bogachev, Vladimir Aletov, Alexander Molozhavenko, Denis Bobkov, Vera Sobol- eva, Aibek Alanov, and Maxim Rakhuba. Lora meets riemannion: Muon optimizer for parametrization-independent low-rank adapters.arXiv preprint arXiv:2507.12142, 2025

  6. [6]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge, 2018. URLhttps://arxiv.org/abs/1803.05457

  7. [7]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/ abs/2110.14168

  8. [8]

    Adaptive subgradient methods for online learning and stochastic optimization.Journal of Machine Learning Research, 12(7), 2011

    John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization.Journal of Machine Learning Research, 12(7), 2011

  9. [9]

    Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models

    Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. InAdvances in Neural Information Processing Systems (NeurIPS), volume 36, pages 15890–15902, 2023

  10. [10]

    Shampoo: Preconditioned stochastic tensor optimization

    Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. InInternational Conference on Machine Learning (ICML), pages 1842–1850. PMLR, 2018

  11. [11]

    Lora+: Efficient low rank adaptation of large models

    Soufiane Hayou, Nikhil Ghosh, and Bin Yu. Lora+: Efficient low rank adaptation of large models. InInternational Conference on Machine Learning (ICML), pages 17783–17806. PMLR, 2024

  12. [12]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InEMNLP (1), 2021

  13. [13]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. InAdvances in neural information processing systems (NeurIPS), volume 30, 2017

  14. [14]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), page 3, 2022

  15. [15]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L´elio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth ´ee Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://a...

  16. [16]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

  17. [17]

    Limitations of the empirical fisher approximation for natural gradient descent.Advances in neural information processing systems, 32, 2019

    Frederik Kunstner, Philipp Hennig, and Lukas Balles. Limitations of the empirical fisher approximation for natural gradient descent.Advances in neural information processing systems, 32, 2019

  18. [18]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  19. [19]

    Optimizing neural networks with kronecker-factored approx- imate curvature

    James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approx- imate curvature. InInternational conference on machine learning, pages 2408–2417. PMLR, 2015

  20. [20]

    Optimizing neural networks with kronecker-factored approx- imate curvature

    James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approx- imate curvature. InInternational conference on machine learning (ICML), pages 2408–2417. PMLR, 2015

  21. [21]

    Parameter and memory efficient pretraining via low-rank riemannian optimization

    Zhanfeng Mo, Long-Kai Huang, and Sinno Jialin Pan. Parameter and memory efficient pretraining via low-rank riemannian optimization. InInternational Conference on Learning Representations (ICLR), 2025

  22. [22]

    A new per- spective on shampoo’s preconditioner

    Depen Morwani, Itai Shapira, Nikhil Vyas, Sham M Kakade, Lucas Janson, et al. A new per- spective on shampoo’s preconditioner. InInternational Conference on Learning Representations (ICLR), 2024

  23. [23]

    Dart: Open-domain structured data record to text generation

    Linyong Nan, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, et al. Dart: Open-domain structured data record to text generation. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, p...

  24. [24]

    The e2e dataset: New challenges for end-to-end generation.arXiv preprint arXiv:1706.09254, 2017

    Jekaterina Novikova, Ondˇrej Duˇsek, and Verena Rieser. The e2e dataset: New challenges for end-to-end generation.arXiv preprint arXiv:1706.09254, 2017

  25. [25]

    Pytorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. InAdvances in neural information processing systems (NeurIPS), volume 32, 2019

  26. [26]

    Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

  27. [27]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning (ICML), pages 8748–8763. PMLR, 2021

  28. [28]

    Adafactor: Adaptive learning rates with sublinear memory cost

    Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. InInternational Conference on Machine Learning (ICML), pages 4596–4604. PMLR, 2018

  29. [29]

    Low-rank solutions of linear matrix equations via procrustes flow

    Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Ben Recht. Low-rank solutions of linear matrix equations via procrustes flow. InInternational conference on machine learning, pages 964–973. PMLR, 2016

  30. [30]

    Soap: Improving and stabilizing shampoo using adam for language modeling

    Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham M Kakade. Soap: Improving and stabilizing shampoo using adam for language modeling. InInternational Conference on Learning Representations (ICLR), 2025

  31. [31]

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding, 2019. URLhttps://arxiv.org/abs/1804.07461. 12

  32. [32]

    Lora-ga: Low-rank adaptation with gradient approxi- mation

    Shaowen Wang, Linxi Yu, and Jian Li. Lora-ga: Low-rank adaptation with gradient approxi- mation. InAdvances in Neural Information Processing Systems (NeurIPS), volume 37, pages 54905–54931, 2024

  33. [33]

    Lora-pro: Are low-rank adapters properly optimized? InInternational Conference on Learning Representations(ICLR), 2025

    Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, and Tieniu Tan. Lora-pro: Are low-rank adapters properly optimized? InInternational Conference on Learning Representations(ICLR), 2025

  34. [34]

    Guarantees of riemannian optimiza- tion for low rank matrix recovery.SIAM Journal on Matrix Analysis and Applications, 37(3): 1198–1222, 2016

    Ke Wei, Jian-Feng Cai, Tony F Chan, and Shingyu Leung. Guarantees of riemannian optimiza- tion for low rank matrix recovery.SIAM Journal on Matrix Analysis and Applications, 37(3): 1198–1222, 2016

  35. [35]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024

  36. [36]

    Lora done rite: Robust invariant transformation equilibration for lora optimization

    Jui-Nan Yen, Si Si, Zhao Meng, Felix Yu, Sai Surya Duvvuri, Inderjit S Dhillon, Cho-Jui Hsieh, and Sanjiv Kumar. Lora done rite: Robust invariant transformation equilibration for lora optimization. InThe Thirteenth International Conference on Learning Representations, 2025

  37. [37]

    Riemannian preconditioned lora for fine-tuning foundation models

    Fangzhao Zhang and Mert Pilanci. Riemannian preconditioned lora for fine-tuning foundation models. InInternational Conference on Machine Learning (ICML), 2024

  38. [38]

    Lora-one: One-step full gradient could suffice for fine-tuning large language models, provably and efficiently

    Yuanhe Zhang, Fanghui Liu, and Yudong Chen. Lora-one: One-step full gradient could suffice for fine-tuning large language models, provably and efficiently. InInternational Conference on Machine Learning (ICML), 2025

  39. [39]

    Galore: Memory-efficient llm training by gradient low-rank projection

    Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. Galore: Memory-efficient llm training by gradient low-rank projection. InInternational Conference on Machine Learning (ICML), pages 61121–61143. PMLR, 2024

  40. [40]

    Rank 4 (M)

    Zhenyu Zhu, Yongtao Wu, Quanquan Gu, and V olkan Cevher. Imbalance-regularized lora: A plug-and-play method for improving fine-tuning of foundation models. InAdaptive Foundation Models: Evolving AI for Personalized and Efficient Learning, 2024. 13 Contents A Notation 14 B Proof of Theoretical Results 14 B.1 Computation of Jacobian . . . . . . . . . . . . ...