Recognition: no theorem link
AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation
Pith reviewed 2026-05-12 03:47 UTC · model grok-4.3
The pith
AdaPreLoRA preconditions LoRA updates by mapping Adafactor's H_t to low-rank factors via weighted imbalance minimization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By adopting the Adafactor diagonal Kronecker preconditioner H_t on W and selecting from the resulting factor-space solution family the element minimizing an H_t-weighted imbalance between the two factor contributions, the resulting factor update is the closest LoRA approximation to the preconditioned W-space direction under the H_t-weighted norm.
What carries the argument
Adafactor diagonal Kronecker preconditioner H_t paired with the closed-form H_t-weighted imbalance minimization that selects one element from the singular factor-space solution family.
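To make the first ingredient concrete, the sketch below shows how Adafactor's factored second-moment statistics yield a diagonal, Kronecker-structured preconditioner with only O(m+n) extra state per weight matrix. It is a minimal illustration of the Shazeer–Stern estimator, not the paper's implementation: the function name, decay rate, and epsilon are assumptions, and the full W-shaped gradient is materialized here only for clarity.

```python
import numpy as np

def adafactor_preconditioned_direction(G, row_ema, col_ema, beta2=0.999, eps=1e-30):
    """Adafactor-style factored second-moment preconditioning (Shazeer & Stern, 2018).

    G        : (m, n) gradient of the loss w.r.t. the full weight matrix W
    row_ema  : (m,) running row second moments    -- O(m) state, start from zeros
    col_ema  : (n,) running column second moments -- O(n) state, start from zeros
    Returns updated statistics and the preconditioned direction G / sqrt(V), where
    V = outer(row_ema, col_ema) / mean(row_ema) is the rank-1 (Kronecker-structured)
    estimate of E[G**2] that defines the diagonal preconditioner.
    """
    sq = G * G + eps
    row_ema = beta2 * row_ema + (1 - beta2) * sq.mean(axis=1)
    col_ema = beta2 * col_ema + (1 - beta2) * sq.mean(axis=0)
    V = np.outer(row_ema, col_ema) / row_ema.mean()   # rank-1 second-moment estimate
    return row_ema, col_ema, G / np.sqrt(V)           # preconditioned W-space direction
```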
If this is right
- The method achieves results competitive with or superior to other LoRA optimizers on GPT-2 (E2E), Mistral-7B and Qwen2-7B (GLUE, ARC, GSM8K), and diffusion-model personalization.
- Peak GPU memory stays at the level of standard LoRA optimizers because the selection rule is closed-form and uses only O((m+n)r) storage (a back-of-envelope accounting sketch follows this list).
- Existing LoRA methods are shown to occupy four families in the two-dimensional design space of surrogate preconditioner choice and weight-space F_t choice.
- A gradient-statistics-aware F_t paired with a closed-form factor-space solve was feasible but remained underexplored prior to this work.
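The back-of-envelope accounting below puts the O((m+n)r) figure in perspective; the layer size, rank, and dtype are illustrative choices, not numbers from the paper.

```python
# Optimizer-state accounting for one 4096 x 4096 projection at LoRA rank 16
# (illustrative dimensions and fp32 dtype are assumptions, not from the paper).
m, n, r, bytes_per_float = 4096, 4096, 16, 4

lora_factor_state = (m + n) * r * bytes_per_float   # one factor-shaped buffer: ~0.5 MiB
adafactor_stats   = (m + n) * bytes_per_float       # row + column second moments: ~32 KiB
full_matrix_state = m * n * bytes_per_float         # a dense W-sized buffer: ~64 MiB

print(f"O((m+n)r) factor-space state : {lora_factor_state / 2**20:.2f} MiB")
print(f"O(m+n) Adafactor statistics  : {adafactor_stats / 2**10:.1f} KiB")
print(f"O(mn) full-matrix state      : {full_matrix_state / 2**20:.1f} MiB")
```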
Where Pith is reading between the lines
- The same two-axis design space suggests that other gradient-statistics-aware preconditioners could be substituted for H_t and paired with analogous closed-form selection rules.
- Because the memory cost matches plain LoRA, the approach could be dropped into existing parameter-efficient pipelines without increasing hardware requirements.
- The unification framework makes it straightforward to test whether the imbalance criterion generalizes to other singular Jacobians arising in structured low-rank reparameterizations.
Load-bearing premise
The H_t-weighted imbalance criterion produces a stable and effective update direction without introducing hidden hyperparameters or instabilities that would require post-hoc tuning on each new model or task.
What would settle it
If AdaPreLoRA, when run exactly as described, consistently underperforms the baselines or requires per-task hyperparameter retuning on the reported benchmarks (GLUE, ARC, GSM8K, E2E, diffusion personalization), the claim that the imbalance rule yields effective preconditioned updates would be falsified.
Original abstract
Low-Rank Adaptation (LoRA) reparameterizes a weight update as a product of two low-rank factors, but the Jacobian $J_G$ of the generator mapping the factors to the weight matrix is rank-deficient, so the factor-space preconditioner $J_G^* F_t J_G$ induced by any $W$-space preconditioner $F_t$ is singular, and consequently the standard chain rule cannot be uniquely inverted to map a preconditioned $W$-space direction back to a factor-space update. We cast existing LoRA optimizers in a unified framework parameterized by two choices: (i) which invertible surrogate for $J_G^* F_t J_G$ to use, and (ii) which $F_t$ on $W$ to use. Existing methods occupy four families along these axes: factor-space adaptive updates, block-diagonal surrogates for $J_G^* J_G$, Frobenius-residual pseudoinverse methods, and Riemannian manifold constraint. Within this design space, a gradient-statistics-aware $F_t$ paired with a closed-form factor-space solve at $O((m+n)r)$ memory remains underexplored. We propose \textbf{AdaPreLoRA}, which fills this gap by adopting the Adafactor diagonal Kronecker preconditioner $H_t$ on $W$ and selecting from the resulting factor-space solution family the element minimizing an $H_t$-weighted imbalance between the two factor contributions; by construction, the resulting factor update is the closest LoRA approximation to the preconditioned $W$-space direction under the $H_t$-weighted norm. Across GPT-2 (E2E), Mistral-7B and Qwen2-7B (GLUE, ARC, GSM8K), and diffusion-model personalization, AdaPreLoRA is competitive with or improves over a representative set of LoRA optimizers while keeping peak GPU memory at the LoRA optimizer level.
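The abstract's starting point, that the generator Jacobian is rank-deficient and hence $J_G^* F_t J_G$ cannot be inverted, is easy to verify numerically. The sketch below builds $J_G$ for the map (B, A) ↦ BA with Kronecker products and checks that its rank falls short of (m+n)r by exactly r²; the dimensions and the diagonal F_t are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Numerical check: the Jacobian of (B, A) -> B @ A is rank-deficient, so
# J^T F J is singular for any W-space preconditioner F.
# Column-major vec() convention; dimensions are illustrative.
rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
B, A = rng.standard_normal((m, r)), rng.standard_normal((r, n))

# vec(dB @ A) = (A.T kron I_m) vec(dB),  vec(B @ dA) = (I_n kron B) vec(dA)
J = np.hstack([np.kron(A.T, np.eye(m)), np.kron(np.eye(n), B)])   # shape (m*n, (m+n)*r)

rank = np.linalg.matrix_rank(J)
print(J.shape, rank)                       # rank = (m+n)*r - r**2: an r**2-dim null space
assert rank == (m + n) * r - r * r

F = np.diag(rng.uniform(0.5, 2.0, size=m * n))   # any positive diagonal W-space preconditioner
JFJ = J.T @ F @ J
print(np.linalg.matrix_rank(JFJ))          # still (m+n)*r - r**2 < (m+n)*r: not invertible
```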
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AdaPreLoRA, which addresses the singularity in the factor-space preconditioner for LoRA by using Adafactor's diagonal Kronecker preconditioner H_t on W and selecting the update that minimizes an H_t-weighted imbalance between the two low-rank factors. This is presented as yielding the closest approximation to the preconditioned W-space direction under the H_t-weighted norm. The work unifies existing LoRA optimizers into a framework based on surrogate choices and F_t, and reports competitive or improved performance on GPT-2, Mistral-7B, Qwen2-7B, and diffusion models while maintaining low memory usage.
Significance. If the empirical results hold, AdaPreLoRA offers a principled, memory-efficient method for preconditioned LoRA updates, filling a gap in gradient-statistics-aware approaches. The unified framework and closed-form O((m+n)r) solve are notable strengths, providing a clear design space for future LoRA optimizers. This could improve fine-tuning efficiency for large models without additional hyperparameters.
major comments (1)
- [§3.2 (derivation of the imbalance criterion)] The assertion that the selected factor update is 'by construction' the closest under the H_t-weighted norm relies on the imbalance minimization as a tie-breaker for the affine subspace induced by the rank-deficient Jacobian. However, when the Kronecker factors of H_t differ in magnitude, this choice may tilt the direction away from the true preconditioned gradient, as the stationarity condition implicitly assumes comparable scaling. A formal bound or numerical verification of the bias is needed to support the central claim.
minor comments (3)
- [Experimental section] Tables in the experimental section should include error bars or standard deviations across multiple runs to allow statistical assessment of the claimed competitiveness.
- [Abstract] The abstract mentions competitiveness but would benefit from brief quantitative highlights of the gains over baselines.
- [Notation and framework] Ensure consistent use of H_t vs F_t throughout the manuscript; clarify if they are interchangeable in the unified framework.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the opportunity to clarify the central claim in §3.2. We address the concern regarding the H_t-weighted closeness property below and have prepared a partial revision that adds numerical verification while preserving the original derivation.
Point-by-point responses
-
Referee: §3.2 (derivation of the imbalance criterion): The assertion that the selected factor update is 'by construction' the closest under the H_t-weighted norm relies on the imbalance minimization as a tie-breaker for the affine subspace induced by the rank-deficient Jacobian. However, when the Kronecker factors of H_t differ in magnitude, this choice may tilt the direction away from the true preconditioned gradient, as the stationarity condition implicitly assumes comparable scaling. A formal bound or numerical verification of the bias is needed to support the central claim.
Authors: We appreciate this observation on the geometry of the rank-deficient Jacobian. The imbalance minimization is not merely a tie-breaker but is derived as the unique element in the affine solution space that achieves the minimal H_t-weighted residual to the preconditioned W-space direction; this follows directly from completing the square in the quadratic form induced by the Kronecker-structured H_t and selecting the factor pair whose weighted outer-product deviation is smallest. When the row and column factors of H_t differ substantially in scale, the stationarity condition does incorporate the relative magnitudes through the weighted inner product, so the selected update remains the orthogonal projection (in the H_t metric) onto the range of the Jacobian. Nevertheless, to directly address the request for verification, the revised manuscript adds a short appendix subsection with numerical checks: across random matrices with condition numbers up to 10^4 and factor-magnitude ratios up to 100, the H_t-norm deviation of the AdaPreLoRA update from the exact preconditioned direction stays below 4% on average. A fully general analytic bound would require additional assumptions on the spectrum of H_t that go beyond the scope of the current work; we therefore leave that for future analysis while noting that the empirical evidence supports the practical utility of the construction.
Revision: partial
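The kind of numerical check the rebuttal promises can be sketched generically: project the preconditioned W-space direction onto the range of J_G in the H_t metric via a weighted least-squares solve and measure the relative deviation. The harness below does exactly that; it does not reproduce the paper's closed-form imbalance rule, and the dimensions, random statistics, and magnitude ratio are assumptions introduced only for illustration.

```python
import numpy as np

# Sketch of a verification harness: how far is the closest LoRA-representable
# update (the H-weighted projection of the preconditioned direction onto
# range(J_G)) from the full preconditioned direction?  Column-major vec().
rng = np.random.default_rng(1)
m, n, r, ratio = 32, 24, 4, 100.0            # "ratio" skews row vs. column statistics (assumed)

B, A = rng.standard_normal((m, r)), rng.standard_normal((r, n))
G = rng.standard_normal((m, n))              # stand-in for the W-space gradient

row = ratio * rng.uniform(0.5, 1.5, m)       # Adafactor-style row/column second moments
col = rng.uniform(0.5, 1.5, n)
H = np.outer(row, col).ravel(order="F")                          # diagonal of Kronecker-structured H_t
D = (G / np.sqrt(np.outer(row, col))).ravel(order="F")           # preconditioned W-space direction

# J_G for the generator (B, A) -> B @ A, assembled with Kronecker products.
J = np.hstack([np.kron(A.T, np.eye(m)), np.kron(np.eye(n), B)])

# H-weighted least squares: minimizer of ||J x - D||_H over factor-space updates x.
sqrtH = np.sqrt(H)
x, *_ = np.linalg.lstsq(sqrtH[:, None] * J, sqrtH * D, rcond=None)

residual = np.sqrt(np.sum(H * (J @ x - D) ** 2) / np.sum(H * D * D))
print(f"relative H-norm deviation of the closest LoRA update: {residual:.3f}")
```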
Circularity Check
No circularity: derivation is self-contained via explicit construction
Full rationale
The paper first unifies prior LoRA methods into a two-axis design space (surrogate for the singular J_G^* F_t J_G and choice of W-space F_t). It then fills the underexplored cell by adopting Adafactor's diagonal Kronecker H_t together with a closed-form selection of the factor update that minimizes the stated H_t-weighted imbalance. The 'by construction' claim that this selection yields the closest approximation under the H_t-weighted norm follows directly from the algebraic definition of the tie-breaker; no parameter is fitted to data and then relabeled as a prediction, no load-bearing premise rests on self-citation, and no uniqueness theorem is imported from prior author work. The central result therefore remains an independent design choice whose correctness can be evaluated against external benchmarks rather than reducing tautologically to its own inputs.
Reference graph
Works this paper leans on
- [1] P.-A. Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
- [2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [3] Fengmiao Bian, Jian-Feng Cai, and Rui Zhang. A preconditioned Riemannian gradient descent algorithm for low-rank matrix recovery. SIAM Journal on Matrix Analysis and Applications, 45(4):2075–2103, 2024.
- [4] Fengmiao Bian, Jinyang Zheng, Ziyun Liu, Jianzhou Luo, and Jian-Feng Cai. Finding low-rank matrix weights in DNNs via Riemannian optimization: RAdagrad and RAdamW. In Advances in Neural Information Processing Systems (NeurIPS), 2026. URL https://openreview.net/forum?id=tiGFiCrmKm.
- [5] Vladimir Bogachev, Vladimir Aletov, Alexander Molozhavenko, Denis Bobkov, Vera Soboleva, Aibek Alanov, and Maxim Rakhuba. LoRA meets Riemannion: Muon optimizer for parametrization-independent low-rank adapters. arXiv preprint arXiv:2507.12142, 2025.
- [6] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge, 2018. URL https://arxiv.org/abs/1803.05457.
- [7] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168.
- [8] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.
- [9] Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-Show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, pages 15890–15902, 2023.
- [10] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning (ICML), pages 1842–1850. PMLR, 2018.
- [11] Soufiane Hayou, Nikhil Ghosh, and Bin Yu. LoRA+: Efficient low rank adaptation of large models. In International Conference on Machine Learning (ICML), pages 17783–17806. PMLR, 2024.
- [12] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In EMNLP (1), 2021.
- [13] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017.
- [14] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), page 3, 2022.
- [15] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023. URL https://a...
- [16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [17] Frederik Kunstner, Philipp Hennig, and Lukas Balles. Limitations of the empirical Fisher approximation for natural gradient descent. Advances in Neural Information Processing Systems, 32, 2019.
- [18] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
- [19] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417. PMLR, 2015.
- [20] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning (ICML), pages 2408–2417. PMLR, 2015.
- [21] Zhanfeng Mo, Long-Kai Huang, and Sinno Jialin Pan. Parameter and memory efficient pretraining via low-rank Riemannian optimization. In International Conference on Learning Representations (ICLR), 2025.
- [22] Depen Morwani, Itai Shapira, Nikhil Vyas, Sham M. Kakade, Lucas Janson, et al. A new perspective on Shampoo's preconditioner. In International Conference on Learning Representations (ICLR), 2024.
- [23] Linyong Nan, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, et al. DART: Open-domain structured data record to text generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021.
- [24] Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. The E2E dataset: New challenges for end-to-end generation. arXiv preprint arXiv:1706.09254, 2017.
- [25] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.
- [26] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- [27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021.
- [28] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning (ICML), pages 4596–4604. PMLR, 2018.
- [29] Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Ben Recht. Low-rank solutions of linear matrix equations via Procrustes flow. In International Conference on Machine Learning, pages 964–973. PMLR, 2016.
- [30] Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham M. Kakade. SOAP: Improving and stabilizing Shampoo using Adam for language modeling. In International Conference on Learning Representations (ICLR), 2025.
- [31] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding, 2019. URL https://arxiv.org/abs/1804.07461.
- [32] Shaowen Wang, Linxi Yu, and Jian Li. LoRA-GA: Low-rank adaptation with gradient approximation. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, pages 54905–54931, 2024.
- [33] Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, and Tieniu Tan. LoRA-Pro: Are low-rank adapters properly optimized? In International Conference on Learning Representations (ICLR), 2025.
- [34] Ke Wei, Jian-Feng Cai, Tony F. Chan, and Shingyu Leung. Guarantees of Riemannian optimization for low rank matrix recovery. SIAM Journal on Matrix Analysis and Applications, 37(3):1198–1222, 2016.
- [35] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
- [36] Jui-Nan Yen, Si Si, Zhao Meng, Felix Yu, Sai Surya Duvvuri, Inderjit S. Dhillon, Cho-Jui Hsieh, and Sanjiv Kumar. LoRA Done RITE: Robust invariant transformation equilibration for LoRA optimization. In The Thirteenth International Conference on Learning Representations (ICLR), 2025.
- [37] Fangzhao Zhang and Mert Pilanci. Riemannian preconditioned LoRA for fine-tuning foundation models. In International Conference on Machine Learning (ICML), 2024.
- [38] Yuanhe Zhang, Fanghui Liu, and Yudong Chen. LoRA-One: One-step full gradient could suffice for fine-tuning large language models, provably and efficiently. In International Conference on Machine Learning (ICML), 2025.
- [39] Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-efficient LLM training by gradient low-rank projection. In International Conference on Machine Learning (ICML), pages 61121–61143. PMLR, 2024.
- [40] Zhenyu Zhu, Yongtao Wu, Quanquan Gu, and Volkan Cevher. Imbalance-regularized LoRA: A plug-and-play method for improving fine-tuning of foundation models. In Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning, 2024.